
TL;DR: Meta started with a pre-trained language model and fine-tuned it on step-by-step reasoning examples, as you would if you wanted the model to become particularly good at chain-of-thought reasoning.

However, they also introduced a couple of new tokens. The <bot> token ("beginning of thought") tells the model to enter latent-space thought mode, and the <eot> token ("end of thought") ends it. While in this mode, the model iterates auto-regressively by feeding its final hidden state back in as its next input, still generating new tokens at the output with each inference step as it always does.

The idea is that by passing the final hidden state back through the model a few times, it can squeeze more insight out of the context. And that’s precisely what they found.
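A toy numpy sketch of that latent-mode loop (random weights standing in for a real transformer, not the paper's code): in latent mode, the final hidden state re-enters as the next input instead of a token embedding.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 8, 16

# Toy stand-ins for real model components; all weights here are random.
W = rng.normal(size=(d_model, d_model)) * 0.1   # "transformer" stub
E = rng.normal(size=(vocab, d_model))           # token embedding matrix

def forward(x):
    """One toy forward pass: input embedding -> final hidden state."""
    return np.tanh(x @ W)

def latent_rollout(question_emb, n_latent_steps):
    """After <bot>: feed the final hidden state back in as the next
    input, instead of embedding a sampled token."""
    h = forward(question_emb)
    thoughts = [h]
    for _ in range(n_latent_steps - 1):
        h = forward(h)        # hidden state re-enters as the input
        thoughts.append(h)
    return thoughts

thoughts = latent_rollout(E[3], n_latent_steps=4)

# After <eot>, decoding returns to normal: project the last hidden
# state onto the vocabulary (tied unembedding) and pick a token.
logits = thoughts[-1] @ E.T
next_token = int(np.argmax(logits))
```

The only structural change versus ordinary decoding is the `latent_rollout` loop: no argmax-and-re-embed step happens between latent iterations.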

Training involves progressively replacing language reasoning steps with latent-space auto-regression steps. For instance, you might have a math problem in the training data: at first the model is fed all of the problem’s steps in language form, but in later stages of training, step one is replaced with latent-space auto-regression, then step two as well, then step three, etc.
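That curriculum can be sketched in a few lines of Python (a simplified illustration of the schedule, not the paper's actual data format; the "<Thought>" marker here just stands for a latent slot, and the `c` multiplier is the paper's Thoughts-per-step hyperparameter):

```python
def training_sequence(question, steps, answer, stage, c=1):
    """Stage k of a COCONUT-style curriculum: the first k language
    reasoning steps are replaced by k*c latent Thought slots, wrapped
    in <bot>...<eot>. Stage 0 is plain language chain of thought."""
    k = min(stage, len(steps))
    if k == 0:
        return [question] + steps + [answer]
    thoughts = ["<Thought>"] * (k * c)     # c Thought slots per replaced step
    return [question, "<bot>"] + thoughts + ["<eot>"] + steps[k:] + [answer]

# Stage by stage, language steps disappear into latent Thoughts:
training_sequence("Q", ["step1", "step2", "step3"], "A", stage=0)
# -> ['Q', 'step1', 'step2', 'step3', 'A']
training_sequence("Q", ["step1", "step2", "step3"], "A", stage=2)
# -> ['Q', '<bot>', '<Thought>', '<Thought>', '<eot>', 'step3', 'A']
```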

Eventually, the model learns to enter latent-space thinking mode by itself by generating a <bot> token and to end it by generating an <eot> token.

Pretty ingenious!



Thank you for the summary; useful for me, as I only managed to skim through the first half.

But one correction, probably, regarding this bit:

> While in this [latent space thought] mode, the model iterates auto-regressively by feeding its final hidden state back in as its next input, still generating new tokens at the output with each inference step as it always does.

My impression is that output tokens are not generated while in latent thought mode.


Output tokens are still generated; otherwise the model wouldn’t know when to stop being in latent-space mode. The <eot> token emerges as the top token at the output layer when it’s time to switch back.


Explicit <eot> is only used in training.

At inference time, the paper says:

> A challenge lies in determining when to switch between latent and language modes. As we focus on the problem-solving setting, we insert a <bot> token immediately following the question tokens. For <eot>, we consider two potential strategies: a) train a binary classifier on latent thoughts to enable the model to autonomously decide when to terminate the latent reasoning, or b) always pad the latent thoughts to a constant length. We found that both approaches work comparably well. Therefore, we use the second option in our experiment for simplicity, unless specified otherwise.

(the bottom of page 4 in the paper PDF, which can be downloaded from https://arxiv.org/abs/2412.06769)
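The quoted strategy (b) can be sketched as an inference loop (a hedged illustration; `model_step`, `embed` and `unembed` are hypothetical callables standing in for the real model, not the paper's code):

```python
def generate(model_step, embed, unembed, prompt_state,
             n_latent=6, max_tokens=20, eos=0):
    """Inference with a constant-length latent thought: <bot> is
    inserted right after the question, a fixed number of latent steps
    run, then <eot> is placed and language decoding resumes."""
    h = model_step(prompt_state)        # process question + <bot>
    for _ in range(n_latent):           # pad latent thought to fixed length
        h = model_step(h)               # hidden state fed back as input
    # <eot> goes into the context here; back to language mode.
    out = []
    for _ in range(max_tokens):
        tok = unembed(h)                # project hidden state to a token
        out.append(tok)
        if tok == eos:
            break
        h = model_step(embed(tok))      # normal auto-regressive decoding
    return out
```

Note that no `unembed` call happens inside the latent loop, which is exactly why no output tokens appear during latent thought.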

The reason this point in your summary caught my eye is that the article specifically emphasises the non-verbal nature of reasoning. The internal representations used by a thinking human are largely not words, and the COCONUT approach tries to model that.

Also note that a whole reasoning step in the training data, easily a sentence or more of natural language, can be replaced by a single "Thought" element. (How many Thought elements replace a reasoning step is controlled by a hyperparameter ‘c’; the illustrations are made for ‘c=1’.)

BTW, one observation: the aipapersacademy.com article in the subject calls the Thought elements "thought tokens", but the original paper never calls them "tokens", just "Thoughts" or "latent thoughts". I suppose the paper carefully avoids that term to prevent confusion, since "token" mainly means a linguistic unit in LLMs.


Thanks for your extensive explanation!


I do it for myself: the desire to post a comment motivates me to read a little more.

A little correction:

> Explicit <eot> is only used in training.

Of course, an explicit <eot> is present in the context at inference time, because the LLM was trained to produce verbal tokens after <eot>. It's just that the <eot> is placed into the context in one of the two ways above.

BTW, I do not understand why the <eot> is not produced by LLM itself, as you describe. It seems reasonable and natural.

Is that to save computation on unembedding while in latent thought mode? But unembedding takes a small fraction of the computation, so it should not be an issue. Does something prevent reliable learning of how and when to produce the <eot>? But they managed to train a binary classifier. And why a separate classifier, why not rely on the LLM learning it?
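For concreteness, the separate binary classifier they mention could be as simple as a logistic probe over a latent thought (a sketch with invented dimensions and untrained placeholder weights, not the paper's classifier):

```python
import numpy as np

def should_emit_eot(latent_thought, w, b=0.0, threshold=0.5):
    """Read a latent thought (a hidden-state vector) and decide whether
    to terminate latent reasoning. Weights w, b would be trained; here
    they are just placeholders."""
    p = 1.0 / (1.0 + np.exp(-(latent_thought @ w + b)))   # sigmoid
    return bool(p > threshold)

w = np.zeros(8)                          # placeholder weights
stop = should_emit_eot(np.ones(8), w)    # p = 0.5 -> keep thinking
```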

Another thought is that maybe better names for the special tokens would be not "begin of thought" (<bot>) and "end of thought" (<eot>), but rather something like "pause speaking" and "begin speaking", because neither humans nor LLMs stop thinking when speaking.


Would that mean that, at some point in the future, we would need to exchange latent "embeddings" between various "reasoning" models to emulate thinking, and an LLM would just convert to/from human language when interfacing with mere humans?


No, this all happens inside the model. I suppose it’s possible that the hidden states of one model could be sent to another model, but the second model would need to be trained to understand the meaning of the first model’s hidden states. You could accomplish that through fine-tuning of the second model. It would be neat to see someone try this.



