
Is there any evidence that the reasoning tokens output by current models actually represent the computation happening in the hidden layers?

In both cases, the model is doing a ton of processing that you can't actually inspect; the difference is that here you at least get some efficiency gains.

Even more importantly, you're also less likely to convince yourself that you know what the model is thinking.



In the autoregressive decoding framework, the hidden-layer state used to compute token `t` is conditionally independent of the hidden states at steps `t-1`, `t-2`, and so on, given the observed tokens.

Put differently, the observed tokens are a bottleneck on the information that can be communicated across tokens. Any scheming performed by an LLM which requires more than one token to formulate must therefore pass through the visible tokens. With opaque vectors transferred across decoding steps, this is not the case.
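
To make the bottleneck concrete, here's a toy sketch of the two decoding regimes. The `ToyModel`, its methods, and the hash-based "hidden state" are all made up for illustration (not any real model's API); the point is only where information is allowed to flow between steps.

```python
import random

class ToyModel:
    """Stand-in for an LLM: hidden state is a deterministic function of its inputs."""
    vocab = list(range(100))

    def forward(self, tokens):
        # Hidden state depends only on the visible token prefix.
        return hash(tuple(tokens))

    def forward_latent(self, tokens, latent):
        # Hidden state depends on the prefix AND an opaque carried-over value.
        h = hash((tuple(tokens), latent))
        return h, h  # the new hidden state doubles as the next latent

    def sample(self, hidden):
        return random.Random(hidden).choice(self.vocab)


def decode_token_bottleneck(model, prompt, n_steps):
    """Standard autoregressive decoding: the only state carried across steps
    is the visible token sequence; per-step hidden states are recomputed
    (or cached) as a deterministic function of it."""
    tokens = list(prompt)
    for _ in range(n_steps):
        hidden = model.forward(tokens)       # function of visible tokens only
        tokens.append(model.sample(hidden))  # everything else is discarded
    return tokens                            # the trace is the whole channel


def decode_latent_recurrence(model, prompt, n_steps):
    """Hypothetical latent-reasoning variant: an opaque value is threaded
    across steps in addition to the tokens, so information can flow between
    steps without ever appearing in the visible output."""
    tokens, latent = list(prompt), 0
    for _ in range(n_steps):
        hidden, latent = model.forward_latent(tokens, latent)  # hidden channel
        tokens.append(model.sample(hidden))
    return tokens  # the visible trace no longer bounds what was communicated
```

In the first loop, conditioning on the printed tokens screens off everything the model computed at earlier steps; in the second, `latent` is a channel the transcript never shows.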

The computation in the hidden layers, as far as we can tell, is not sufficient for scheming in a single decoding step. It looks like it requires O(10^2) or O(10^3) steps instead, judging from anecdotal evidence such as the reports of scheming from o1 (https://cdn.openai.com/o1-system-card-20241205.pdf).

As far as your last point goes, I'd rather have a more transparent system, all other factors held constant.


No, and we've observed evidence to the contrary.


Do you have some reading material on this? How did they determine the difference between the stated CoT and the "actual processing"?




