
TL;DR: Meta started with a pre-trained language model and fine-tuned it on step-by-step reasoning examples, as you would if you wanted the model to become particularly good at chain-of-thought reasoning.

However, they also introduced a couple of new tokens. The <bot> token ("beginning of thought") tells the model to enter latent-space thought mode, and the <eot> token ("end of thought") ends it. While in this mode, the model iterates auto-regressively by feeding its final hidden state back in as its next input, still generating new tokens at the output with each inference step as it always does.

The idea is that by passing the final hidden state back through the model a few times, it can squeeze more insight out of the context. And that’s precisely what they found.
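A toy numpy sketch of that latent-mode loop (random weights standing in for a real transformer, not the paper's code): in latent mode, the final hidden state re-enters as the next input instead of a token embedding.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 8, 16

# Toy stand-ins for real model components; all weights here are random.
W = rng.normal(size=(d_model, d_model)) * 0.1   # "transformer" stub
E = rng.normal(size=(vocab, d_model))           # token embedding matrix

def forward(x):
    """One toy forward pass: input embedding -> final hidden state."""
    return np.tanh(x @ W)

def latent_rollout(question_emb, n_latent_steps):
    """After <bot>: feed the final hidden state back in as the next
    input, instead of embedding a sampled token."""
    h = forward(question_emb)
    thoughts = [h]
    for _ in range(n_latent_steps - 1):
        h = forward(h)        # hidden state re-enters as the input
        thoughts.append(h)
    return thoughts

thoughts = latent_rollout(E[3], n_latent_steps=4)

# After <eot>, decoding returns to normal: project the last hidden
# state onto the vocabulary (tied unembedding) and pick a token.
logits = thoughts[-1] @ E.T
next_token = int(np.argmax(logits))
```

The only structural change versus ordinary decoding is the `latent_rollout` loop: no argmax-and-re-embed step happens between latent iterations.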

Training involves progressively replacing language reasoning steps with latent-space auto-regression steps. For instance, you might have a math problem in the training data: at first the model is fed all of the problem’s steps in language form, but in later stages of training, step one is replaced with latent-space auto-regression, then step two as well, then step three, etc.
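That curriculum can be sketched in a few lines of Python (a simplified illustration of the schedule, not the paper's actual data format; the "<Thought>" marker here just stands for a latent slot, and the `c` multiplier is the paper's Thoughts-per-step hyperparameter):

```python
def training_sequence(question, steps, answer, stage, c=1):
    """Stage k of a COCONUT-style curriculum: the first k language
    reasoning steps are replaced by k*c latent Thought slots, wrapped
    in <bot>...<eot>. Stage 0 is plain language chain of thought."""
    k = min(stage, len(steps))
    if k == 0:
        return [question] + steps + [answer]
    thoughts = ["<Thought>"] * (k * c)     # c Thought slots per replaced step
    return [question, "<bot>"] + thoughts + ["<eot>"] + steps[k:] + [answer]

# Stage by stage, language steps disappear into latent Thoughts:
training_sequence("Q", ["step1", "step2", "step3"], "A", stage=0)
# -> ['Q', 'step1', 'step2', 'step3', 'A']
training_sequence("Q", ["step1", "step2", "step3"], "A", stage=2)
# -> ['Q', '<bot>', '<Thought>', '<Thought>', '<eot>', 'step3', 'A']
```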

Eventually, the model learns to enter latent-space thinking mode by itself by generating a <bot> token and to end it by generating an <eot> token.

Pretty ingenious!



Thank you for the summary; useful for me, as I only managed to skim through the first half.

But one correction, probably, regarding this bit:

> While in this [latent space thought] mode, the model iterates auto-regressively by feeding its final hidden state back in as its next input, still generating new tokens at the output with each inference step as it always does.

My impression is that output tokens are not generated while in latent thought mode.


Output tokens are still generated; otherwise the model wouldn’t know when to stop being in latent-space mode. The <eot> token emerges as the top token at the output layer when it’s time to switch back.


Explicit <eot> is only used in training.

At inference time, the paper says:

> A challenge lies in determining when to switch between latent and language modes. As we focus on the problem-solving setting, we insert a <bot> token immediately following the question tokens. For <eot>, we consider two potential strategies: a) train a binary classifier on latent thoughts to enable the model to autonomously decide when to terminate the latent reasoning, or b) always pad the latent thoughts to a constant length. We found that both approaches work comparably well. Therefore, we use the second option in our experiment for simplicity, unless specified otherwise.

(the bottom of page 4 in the paper PDF, which can be downloaded from https://arxiv.org/abs/2412.06769)
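The quoted strategy (b) can be sketched as an inference loop (a hedged illustration; `model_step`, `embed` and `unembed` are hypothetical callables standing in for the real model, not the paper's code):

```python
def generate(model_step, embed, unembed, prompt_state,
             n_latent=6, max_tokens=20, eos=0):
    """Inference with a constant-length latent thought: <bot> is
    inserted right after the question, a fixed number of latent steps
    run, then <eot> is placed and language decoding resumes."""
    h = model_step(prompt_state)        # process question + <bot>
    for _ in range(n_latent):           # pad latent thought to fixed length
        h = model_step(h)               # hidden state fed back as input
    # <eot> goes into the context here; back to language mode.
    out = []
    for _ in range(max_tokens):
        tok = unembed(h)                # project hidden state to a token
        out.append(tok)
        if tok == eos:
            break
        h = model_step(embed(tok))      # normal auto-regressive decoding
    return out
```

Note that no `unembed` call happens inside the latent loop, which is exactly why no output tokens appear during latent thought.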

The reason this point in your summary caught my eye is that the article specifically emphasises the non-verbal nature of reasoning. The internal representations used by a thinking human are largely not words, and the COCONUT approach tries to model that.

Also note that a whole reasoning step in the training data, easily a sentence or more of natural language, can be replaced by a single "Thought" element. (How many Thought elements replace a reasoning step is controlled by a hyperparameter ‘c’; the illustrations are made for ‘c=1’.)

BTW, one observation: the aipapersacademy.com article in the subject calls the Thought elements "thought tokens", but the original paper never calls them "tokens", just "Thoughts" or "latent thoughts". I suppose the paper carefully avoids that term to prevent confusion, since "token" mainly means a linguistic unit in LLMs.


Thanks for your extensive explanation!


I do it for myself: the desire to post a comment motivates me to read a little more.

A little correction:

> Explicit <eot> is only used in training.

Of course, an explicit <eot> is present in the context at inference time, because the LLM was trained to produce verbal tokens after <eot>. It's just that the <eot> is placed into the context in one of the two ways above.

BTW, I do not understand why the <eot> is not produced by LLM itself, as you describe. It seems reasonable and natural.

Is that to save computation on unembedding while in latent thought mode? But unembedding takes a small fraction of the computation, so it should not be an issue. Does something prevent reliable learning of how and when to produce the <eot>? But they managed to train a binary classifier. And why a separate classifier, why not rely on the LLM learning it?
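For concreteness, the separate binary classifier they mention could be as simple as a logistic probe over a latent thought (a sketch with invented dimensions and untrained placeholder weights, not the paper's classifier):

```python
import numpy as np

def should_emit_eot(latent_thought, w, b=0.0, threshold=0.5):
    """Read a latent thought (a hidden-state vector) and decide whether
    to terminate latent reasoning. Weights w, b would be trained; here
    they are just placeholders."""
    p = 1.0 / (1.0 + np.exp(-(latent_thought @ w + b)))   # sigmoid
    return bool(p > threshold)

w = np.zeros(8)                          # placeholder weights
stop = should_emit_eot(np.ones(8), w)    # p = 0.5 -> keep thinking
```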

Another thought is that maybe better names for the special tokens would be not "begin of thought" (<bot>) and "end of thought" (<eot>), but rather something like "pause speaking" and "begin speaking", because neither humans nor LLMs stop thinking when speaking.


Would that mean that, at some point in the future, we would need to exchange latent "embeddings" between various "reasoning" models to emulate thinking, and an LLM would just convert to/from human language when interfacing with mere humans?


No, this all happens inside the model. I suppose it’s possible that the hidden states of one model could be sent to another model, but the second model would need to be trained to understand the meaning of the first model’s hidden states. You could accomplish that through fine-tuning of the second model. It would be neat to see someone try this.



