Your AI Isn't Thinking. It's Remembering.
A practitioner's hypothesis on why the current paradigm has a ceiling nobody wants to talk about
The Claim
Every AI system built to date — from the expert systems of the 1980s to the frontier language models of 2026 — performs the same fundamental operation: it ingests human knowledge, recombines it, and outputs a rearrangement. The methods have evolved dramatically. The operation has not.
Chain-of-thought is not reasoning. It is a trajectory through token space that reshapes an input prompt into a region where the model can produce a statistically confident output. Scaling is not producing intelligence. It is producing denser coverage of a fixed distribution. And the curve from Claude 3.5 Sonnet to Opus 4.6, from GPT-4o to GPT-5.3, is not the hockey stick the industry needs it to be. It is an S-curve approaching its asymptote.
This is the reassembly ceiling, and the empirical evidence suggests we are close to it.
Why I Think This
I. The Expert Systems Were Us, Thirty Years Ago
In the 1980s, companies like Inference Corporation, IntelliCorp, and Teknowledge sold the automation of knowledge work. The pitch was nearly identical to today's: capture expert knowledge in a system, replace expensive professionals, scale expertise globally. They raised enormous capital. Two-thirds of the Fortune 500 adopted the technology. And then it collapsed.
The failure mode was instructive. Expert systems were brittle — they worked well inside narrow domains and failed catastrophically outside them. The core bottleneck was knowledge acquisition: you had to sit with human experts for months, extract their decision rules, codify them, test them, and iterate. And the hardest part was that experts could not articulate most of what they actually knew. Their expertise was intuitive, contextual, embodied — exactly the kind of knowledge that resists formalisation.
Today we do the same thing at industrial scale. Instead of interviewing one expert, we scrape the written output of billions. Instead of encoding explicit rules, we learn statistical patterns over tokens. Even RLHF — reinforcement learning from human feedback — is literally paying human experts to rate outputs. It is the knowledge engineering process of the 1980s, automated and scaled.
Different pipe. Same water. Same well.
The question nobody in the current cycle wants to confront: if expert systems could not transcend the limitations of the experts they interviewed, why should we expect language models to transcend the limitations of the text they trained on?
II. Chain-of-Thought Is a Trajectory, Not a Thought
When OpenAI released o1 and branded it a "reasoning model," they made a marketing claim that reshaped public perception. But what is chain-of-thought actually doing?
Consider the mechanics. A language model receives an input and must produce an output. For simple, common questions — ones densely represented in training data — the model's internal representation is already close to the answer. No chain-of-thought is needed. The input shape is already in the right neighbourhood.
For harder questions — those on the tail of the training distribution — the model needs to walk itself into a better region of its representation space. Each step in the chain of thought reshapes the context window, nudging the token distribution closer to patterns the model recognises. The "reasoning" is the path. The destination is an input configuration the model already knows how to answer, because it exists somewhere in the training distribution.
This is not logical deduction. It is probabilistic navigation. And it explains several observable phenomena that a genuine reasoning account struggles with:
Easy questions show no benefit from chain-of-thought. If the answer is already in a dense region of the distribution, no trajectory is needed. This is why Sonnet 4.6 scores nearly identically on common tasks with and without thinking mode.
Hard questions benefit from chain-of-thought but not always correctly. The model can walk a trajectory that looks rigorous — well-structured steps, plausible intermediate conclusions — and arrive somewhere wrong. A system constrained by logic cannot do this. A system constrained by probability can.
The trajectory takes the form of human reasoning specifically. Not because human reasoning is the only valid form, but because that is what the training data provides as examples of how to get from confusion to answers. A model trained on hypothetical alien logic that arrived at correct answers through entirely different intermediate steps would produce a completely different "reasoning" style that would work just as well.
Better training data and better chain-of-thought converge. Both do the same thing through different mechanisms. Better data makes more inputs directly answerable, shrinking the space where chain-of-thought is needed. Better chain-of-thought makes the trajectory more efficient for remaining cases. As data improves, the gap between reasoning and non-reasoning variants shrinks — because fewer questions require the walk at all.
The implication is uncomfortable: calling this "reasoning" is equivalent to calling gradient descent "understanding." It is an optimisation process moving through a space. The fact that the space happens to be shaped by human language, and the trajectory happens to resemble human thought, is an artefact of the training data, not evidence of cognition.
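The trajectory picture above can be made concrete with a toy sketch. Nothing here is a real model: "knowledge" is just a handful of hypothetical points in a plane standing in for dense regions of the training distribution, and a "thinking step" nudges the input toward the nearest one. Inputs that start inside a dense region need zero steps; inputs on the tail need a walk.

```python
# Toy illustration only, not a real model. KNOWN stands in for dense
# regions of the training distribution; a "thinking step" moves the
# input toward the nearest known region until the system is confident.
import math

KNOWN = [(0.0, 0.0), (5.0, 5.0), (9.0, 1.0)]  # hypothetical dense regions
CONFIDENT = 1.0  # distance below which the answer is "already known"

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def answerable(x):
    return min(dist(x, k) for k in KNOWN) < CONFIDENT

def chain_of_thought(x, step=0.5, max_steps=50):
    """Walk the input toward the nearest known region; return the path."""
    path = [x]
    while not answerable(x) and len(path) <= max_steps:
        target = min(KNOWN, key=lambda k: dist(x, k))
        d = dist(x, target)
        x = (x[0] + step * (target[0] - x[0]) / d,
             x[1] + step * (target[1] - x[1]) / d)
        path.append(x)
    return path

easy = (0.2, 0.1)  # already in a dense region
hard = (7.0, 8.0)  # tail of the distribution
print(len(chain_of_thought(easy)) - 1)  # 0 thinking steps needed
print(len(chain_of_thought(hard)) - 1)  # several steps needed
```

The point of the sketch: the walk terminates not when a proof is complete but when the input lands somewhere the system already recognises, and a longer walk is simply evidence that the starting point was further from known territory.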
III. The Curve Is Flattening and Both Labs Know It
SWE-bench Verified is the closest proxy we have for real software engineering capability. The progression tells the story plainly:
On the Anthropic side, Sonnet 3.5 scored roughly 49% in October 2024. Sonnet 3.7 reached 62–70%. Sonnet 4 hit 72%. Sonnet 4.5 reached 77%. Opus 4.5 hit 79%. Opus 4.6, the current flagship, reached 80.8%. Sonnet 4.6, the mid-tier model at one-fifth the price, scored 79.6%.
On the OpenAI side, GPT-4o sat at 38%. The o1 model reached 48%. o3 hit 69%. GPT-5 scored 75%. GPT-5.1 reached 77%. GPT-5.2 — despite being marketed as a transformative leap — scored 77%.
The early jumps were large. The jump from Sonnet 3.5 to 3.7 was something you could feel in practice. But from Sonnet 4.5 to Opus 4.6, the improvement was from 77% to 80.8% across five months and multiple model releases. GPT-5 to GPT-5.2 barely moved at all.
Each generation costs more to train, takes longer to ship, and delivers smaller marginal gains on the metrics that matter. The marketing grows louder precisely as the improvements grow quieter.
The convergence between model tiers is equally telling. When a mid-tier model at one-fifth the price trails the flagship by 1.2 percentage points, you are not witnessing exponential progress. You are witnessing a mature capability approaching its ceiling.
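The flattening is easiest to see when the scores quoted above are reduced to per-release deltas. A quick sketch, using the OpenAI-side numbers from this section:

```python
# SWE-bench Verified scores as quoted in this section (OpenAI series),
# reduced to the marginal gain each release delivered over the last.
scores = {"GPT-4o": 38.0, "o1": 48.0, "o3": 69.0,
          "GPT-5": 75.0, "GPT-5.1": 77.0, "GPT-5.2": 77.0}

names = list(scores)
for prev, cur in zip(names, names[1:]):
    print(f"{prev} -> {cur}: +{scores[cur] - scores[prev]:.1f} pts")
# deltas: +10.0, +21.0, +6.0, +2.0, +0.0
```

The sequence of deltas is the whole argument in miniature: each release after o3 buys less than the one before, and the last buys nothing.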
IV. What Actually Changed — A View From Practice
Having used these models extensively across the entire progression, from Sonnet 3.7 through Opus 4.6, with heavy use of Sonnet 4 and 4.5 in between, I find the improvements that are real and noticeable uniformly consistent with better training data, not better intelligence.
Syntax errors disappeared. The model has seen enough correct syntax that it almost never produces invalid code. This is the easiest thing to improve with more data. It is pattern matching, refined.
Bigger models handle edge cases better. Consistent with denser coverage of the training distribution. More examples of unusual patterns means the model's interpolation reaches further into uncommon territory.
Newer models one-shot modules that older ones struggled with. Better recall of existing knowledge. The module's pattern was underrepresented in earlier training data and better represented in later data. The model did not get smarter. The database got bigger and the retrieval got more accurate.
API guessing still fails. Because the correct API for a specific library version in a specific context may not be well-represented in training data. This is the model at the edge of its distribution.
Less brittleness overall. But this is a denser distribution producing better average outputs. The moment you push into genuinely novel territory — unusual library combinations, novel architectures, subtle design tradeoffs — you hit the same wall you always hit. It just moved further out.
The critical observation, across hundreds of hours of use spanning model generations: the models surprised with speed, with volume, with syntactic correctness, but never with a genuine insight: an idea that had not been considered, a design approach genuinely superior to what a human would have produced. The improvements are quantitative, not qualitative.
V. No Evidence of Out-of-Sample Self-Improvement
The narrative driving the most dramatic AI timelines is recursive self-improvement: AI becoming capable enough to design better AI, in an accelerating loop. The evidence for this is thin.
AlphaGo and AlphaZero exceeded all human play through self-play. But they operated within perfectly defined games with clear rules, perfect information, and unambiguous reward signals. Open-ended research — where you do not even know what the objective function should be — is a fundamentally different problem.
No AI system has autonomously identified a limitation in its own architecture and proposed a qualitatively better one. All major capability leaps — attention mechanisms, RLHF, scaling laws, chain-of-thought — came from human researchers. AI systems did not predict any of them.
Self-improvement requires self-evaluation, and self-evaluation of capabilities you do not yet possess is arguably paradoxical. You need an evaluation framework more capable than the system being evaluated. That is either a human or a more capable AI — which begs the question.
What the Evidence Says
The research literature provides substantial — though not unqualified — support for this hypothesis.
Supporting the Ceiling
Chain-of-thought unfaithfulness is well-documented. Turpin et al. (NeurIPS 2023) showed that biasing features in prompts cause models to rationalise biased answers without mentioning the bias, with accuracy drops of up to 36%. Lanham et al. (Anthropic, 2023) found that truncating chain-of-thought often did not change the answer and that faithfulness degrades as models get larger. A 2025 Anthropic study found Claude 3.7 Sonnet mentioned reasoning hints only 25% of the time. An Oxford position paper (Barez & Wu, 2025) concluded directly: "chain-of-thought is not explainability."
Compositional reasoning fails on novel structures. Dziri et al. (NeurIPS 2023, Spotlight) demonstrated that transformers solve compositional tasks via "linearised subgraph matching" — pattern-matching against training data fragments, not systematic reasoning. The Apple GSM-Symbolic study (ICLR 2025) showed that simply changing numbers in math problems degraded performance, and adding irrelevant information caused drops of up to 65%. The authors found "no evidence of formal reasoning."
ARC-AGI-2 exposes the gap. Pure language models score 0% on ARC-AGI-2. Even the best test-time methods reach only 54% at $30 per task, versus 60% for average humans. The benchmark's creators concluded: "AI reasoning performance remains fundamentally constrained by knowledge coverage."
Pre-training scaling is hitting diminishing returns. OpenAI's Orion produced improvements far smaller than GPT-3 to GPT-4, leading to its quiet release as GPT-4.5. Ilya Sutskever publicly stated "pre-training as we know it will end." Epoch AI estimates high-quality human text exhaustion between 2026 and 2032.
Model collapse prevents recursive self-training. Shumailov et al. (Nature, 2024) demonstrated that training on model-generated data causes irreversible defects — tail distributions vanish as models become "poisoned with their own projection of reality."
Enterprise adoption lags the narrative. MIT's 2025 NANDA report found 95% of enterprise AI pilots delivered zero measurable P&L impact. An NBER study (February 2026) found 90% of firms reported no impact on workplace productivity. A METR randomised controlled trial found AI tools actually slowed down experienced developers despite perceived speedups.
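The model-collapse dynamic cited above has a mechanism simple enough to simulate. This is a toy sketch in the spirit of Shumailov et al., with hypothetical numbers: each "generation" fits a Gaussian to a small sample drawn from the previous generation's fit. Because finite-sample estimation underestimates spread on average, the tails shrink generation after generation.

```python
# Toy sketch of model collapse: repeatedly refit a Gaussian to samples
# drawn from the previous fit. The MLE spread estimate is biased low,
# so the distribution's tails shrink with each generation.
import random
import statistics

random.seed(0)
mu, sigma = 0.0, 1.0  # generation 0: the "real" data distribution
for gen in range(1, 41):
    # tiny sample per generation, to make the effect visible quickly
    samples = [random.gauss(mu, sigma) for _ in range(5)]
    mu = statistics.mean(samples)
    sigma = statistics.pstdev(samples)  # population (MLE) estimate, biased low
    if gen % 10 == 0:
        print(f"generation {gen:2d}: sigma = {sigma:.4f}")
```

By the final generation the fitted spread is a fraction of the original: the tail distributions are gone, which is exactly why recursive self-training cannot conjure knowledge that was not in the original data.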
Challenging the Ceiling
Mechanistic interpretability reveals genuine internal structure. Anthropic's circuit tracing (March 2025) found multi-step reasoning circuits in Claude 3.5 Haiku — intermediate features activating in sequence, confirmed through causal intervention. Li et al. (ICLR 2023) showed Othello-GPT develops real board representations. Gurnee and Tegmark (ICLR 2024) found linear representations of space and time in LLM activations. These are not surface statistics. Something structured is happening internally.
LLM-driven systems have made verifiable mathematical discoveries. DeepMind's FunSearch (Nature, 2023) found a previously unknown cap set construction. AlphaEvolve (2025) discovered the first improvement over Strassen's 1969 matrix multiplication algorithm — a result absent from any training data. These are genuine, novel, verified contributions.
Test-time compute scaling produced dramatic jumps. o3 achieved 87.5% on ARC-AGI-1 and 96.7% on AIME 2024. The mechanism — giving models more computation at inference — broke through what appeared to be hard ceilings on pre-training alone.
Emergence is partly real. While Schaeffer et al. (NeurIPS 2023) showed most reported emergence was a metric artefact, Anthropic's induction head research identified genuine phase transitions during training, and grokking research demonstrates networks can reorganise from memorisation to generalisation through sharp structural transitions.
Compression implies structure. Huang et al. (COLM 2024) showed intelligence scores across 31 LLMs almost linearly correlate with compression efficiency — and effective compression theoretically requires identifying deep structure, not merely surface statistics.
Conclusion
The reassembly ceiling is real, but it is better understood as a series of ceilings rather than a single wall.
The first ceiling — pre-training on static text — is already reached. The industry knows this, which is why the conversation has shifted to test-time compute, reinforcement learning, and agentic architectures. The second ceiling — the limits of chain-of-thought as navigated pattern-matching rather than genuine reasoning — is becoming visible in benchmark saturation and the flattening SWE-bench curve. The third ceiling — the epistemological limit on what can be extracted from human-generated data by any method — remains theoretical, but every failed attempt at recursive self-improvement and every 0% score on ARC-AGI-2 suggests it is there.
What these systems do is genuinely valuable. The practical utility is real, the productivity gains in constrained domains are measurable, and the economic floor is substantial enough to prevent a full winter. But "genuinely valuable tool" and "artificial intelligence" are different claims, and the gap between them is where the current valuation bubble lives.
The historical pattern — cybernetics in the 1950s, expert systems in the 1980s, language models in the 2020s — suggests that every 20 to 30 years, a new technical paradigm emerges, gets mapped onto the dream of automating knowledge work, partially delivers, hits a wall, corrects, and the useful residue gets absorbed into infrastructure. Then the next generation, having not studied the history, does it again.
We are probably not heading for winter. But we may be heading for autumn — a cooling of expectations, a shakeout of companies, and a quiet acknowledgment that "intelligent" was always the wrong word for what these systems do. The memory got better. The retrieval got more accurate. The trajectory mechanics improved.
The intelligence stayed the same.
The hypothesis presented here was developed through extensive hands-on use of frontier models across their full progression, combined with analysis of publicly available benchmarks and peer-reviewed research. It represents a practitioner's perspective, not a formal academic position. The author welcomes rigorous disagreement.