Your AI Isn't Thinking. It's Remembering.
A practitioner's hypothesis on why the current paradigm has a ceiling nobody wants to talk about
The Claim
Every AI system built to date — from the expert systems of the 1980s to the frontier language models of 2026 — performs the same fundamental operation: it ingests human knowledge, recombines it, and outputs a rearrangement. The methods have evolved dramatically. The operation has not.
Chain-of-thought is not reasoning. It is a trajectory through token space that reshapes an input prompt into a region where the model can produce a statistically confident output. Scaling is not producing intelligence. It is producing denser coverage of a fixed distribution. And the curve from Claude 3.5 Sonnet to Opus 4.6, from GPT-4o to GPT-5.3, is not the hockey stick the industry needs it to be. It is an S-curve approaching its asymptote.
This is the reassembly ceiling, and the empirical evidence suggests we are close to it.
Why I Think This
I. The Expert Systems Were Us, Thirty Years Ago
In the 1980s, companies like Inference Corporation, IntelliCorp, and Teknowledge sold the automation of knowledge work. The pitch was nearly identical to today's: capture expert knowledge in a system, replace expensive professionals, scale expertise globally. They raised enormous capital. Two-thirds of the Fortune 500 adopted the technology. And then it collapsed.
The failure mode was instructive. Expert systems were brittle — they worked well inside narrow domains and failed catastrophically outside them. The core bottleneck was knowledge acquisition: you had to sit with human experts for months, extract their decision rules, codify them, test them, and iterate. And the hardest part was that experts could not articulate most of what they actually knew. Their expertise was intuitive, contextual, embodied — exactly the kind of knowledge that resists formalisation.
Today we do the same thing at industrial scale. Instead of interviewing one expert, we scrape the written output of billions. Instead of encoding explicit rules, we learn statistical patterns over tokens. Even RLHF — reinforcement learning from human feedback — is literally paying human experts to rate outputs. It is the knowledge engineering process of the 1980s, automated and scaled.
Different pipe. Same water. Same well.
The question nobody in the current cycle wants to confront: if expert systems could not transcend the limitations of the experts they interviewed, why should we expect language models to transcend the limitations of the text they trained on?
II. Chain-of-Thought Is a Trajectory, Not a Thought
When OpenAI released o1 and branded it a "reasoning model," they made a marketing claim that reshaped public perception. But what is chain-of-thought actually doing?
Consider the mechanics. A language model receives an input and must produce an output. For simple, common questions — ones densely represented in training data — the model's internal representation is already close to the answer. No chain-of-thought is needed. The input shape is already in the right neighbourhood.
For harder questions — those on the tail of the training distribution — the model needs to walk itself into a better region of its representation space. Each step in the chain of thought reshapes the context window, nudging the token distribution closer to patterns the model recognises. The "reasoning" is the path. The destination is an input configuration the model already knows how to answer, because it exists somewhere in the training distribution.
This is not logical deduction. It is probabilistic navigation. And it explains several observable phenomena that a genuine reasoning account struggles with:
Easy questions show no benefit from chain-of-thought. If the answer is already in a dense region of the distribution, no trajectory is needed. This is why Sonnet 4.6 scores nearly identically on common tasks with and without thinking mode.
Hard questions benefit from chain-of-thought but not always correctly. The model can walk a trajectory that looks rigorous — well-structured steps, plausible intermediate conclusions — and arrive somewhere wrong. A system constrained by logic cannot do this. A system constrained by probability can.
The trajectory takes the form of human reasoning specifically. Not because human reasoning is the only valid form, but because that is what the training data provides as examples of how to get from confusion to answers. A model trained on hypothetical alien logic that arrived at correct answers through entirely different intermediate steps would produce a completely different "reasoning" style that would work just as well.
Better training data and better chain-of-thought converge. Both do the same thing through different mechanisms. Better data makes more inputs directly answerable, shrinking the space where chain-of-thought is needed. Better chain-of-thought makes the trajectory more efficient for remaining cases. As data improves, the gap between reasoning and non-reasoning variants shrinks — because fewer questions require the walk at all.
The implication is uncomfortable: calling this "reasoning" is equivalent to calling gradient descent "understanding." It is an optimisation process moving through a space. The fact that the space happens to be shaped by human language, and the trajectory happens to resemble human thought, is an artefact of the training data, not evidence of cognition.
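The trajectory picture above can be made concrete with a toy sketch. Nothing here is a real model: "knowledge" is just a handful of hypothetical points in a plane standing in for dense regions of the training distribution, and a "thinking step" nudges the input toward the nearest one. Inputs that start inside a dense region need zero steps; inputs on the tail need a walk.

```python
# Toy illustration only, not a real model. KNOWN stands in for dense
# regions of the training distribution; a "thinking step" moves the
# input toward the nearest known region until the system is confident.
import math

KNOWN = [(0.0, 0.0), (5.0, 5.0), (9.0, 1.0)]  # hypothetical dense regions
CONFIDENT = 1.0  # distance below which the answer is "already known"

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def answerable(x):
    return min(dist(x, k) for k in KNOWN) < CONFIDENT

def chain_of_thought(x, step=0.5, max_steps=50):
    """Walk the input toward the nearest known region; return the path."""
    path = [x]
    while not answerable(x) and len(path) <= max_steps:
        target = min(KNOWN, key=lambda k: dist(x, k))
        d = dist(x, target)
        x = (x[0] + step * (target[0] - x[0]) / d,
             x[1] + step * (target[1] - x[1]) / d)
        path.append(x)
    return path

easy = (0.2, 0.1)  # already in a dense region
hard = (7.0, 8.0)  # tail of the distribution
print(len(chain_of_thought(easy)) - 1)  # 0 thinking steps needed
print(len(chain_of_thought(hard)) - 1)  # several steps needed
```

The point of the sketch: the walk terminates not when a proof is complete but when the input lands somewhere the system already recognises, and a longer walk is simply evidence that the starting point was further from known territory.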
III. The Curve Is Flattening and Both Labs Know It
SWE-bench Verified is the closest proxy we have for real software engineering capability. The progression tells the story plainly:
On the Anthropic side, Sonnet 3.5 scored roughly 49% in October 2024. Sonnet 3.7 reached 62–70%. Sonnet 4 hit 72%. Sonnet 4.5 reached 77%. Opus 4.5 hit 79%. Opus 4.6, the current flagship, reached 80.8%. Sonnet 4.6, the mid-tier model at one-fifth the price, scored 79.6%.
On the OpenAI side, GPT-4o sat at 38%. The o1 model reached 48%. o3 hit 69%. GPT-5 scored 75%. GPT-5.1 reached 77%. GPT-5.2 — despite being marketed as a transformative leap — scored 77%.
The early jumps were large. The jump from Sonnet 3.5 to 3.7 was something you could feel in practice. But from Sonnet 4.5 to Opus 4.6, the improvement was from 77% to 80.8% across five months and multiple model releases. GPT-5 to GPT-5.2 barely moved at all.
Each generation costs more to train, takes longer to ship, and delivers smaller marginal gains on the metrics that matter. The marketing grows louder precisely as the improvements grow quieter.
The convergence between model tiers is equally telling. When a mid-tier model at one-fifth the price trails the flagship by 1.2 percentage points, you are not witnessing exponential progress. You are witnessing a mature capability approaching its ceiling.
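The flattening is easiest to see when the scores quoted above are reduced to per-release deltas. A quick sketch, using the OpenAI-side numbers from this section:

```python
# SWE-bench Verified scores as quoted in this section (OpenAI series),
# reduced to the marginal gain each release delivered over the last.
scores = {"GPT-4o": 38.0, "o1": 48.0, "o3": 69.0,
          "GPT-5": 75.0, "GPT-5.1": 77.0, "GPT-5.2": 77.0}

names = list(scores)
for prev, cur in zip(names, names[1:]):
    print(f"{prev} -> {cur}: +{scores[cur] - scores[prev]:.1f} pts")
# deltas: +10.0, +21.0, +6.0, +2.0, +0.0
```

The sequence of deltas is the whole argument in miniature: each release after o3 buys less than the one before, and the last buys nothing.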
IV. What Actually Changed — A View From Practice
Having used these models extensively across the entire progression, from Sonnet 3.7 through Opus 4.6, with heavy use of Sonnet 4 and 4.5 in between, I find the improvements that are real and noticeable uniformly consistent with better training data, not better intelligence.
Syntax errors disappeared. The model has seen enough correct syntax that it almost never produces invalid code. This is the easiest thing to improve with more data. It is pattern matching, refined.
Bigger models handle edge cases better. Consistent with denser coverage of the training distribution. More examples of unusual patterns means the model's interpolation reaches further into uncommon territory.
Newer models one-shot modules that older ones struggled with. Better recall of existing knowledge. The module's pattern was underrepresented in earlier training data and better represented in later data. The model did not get smarter. The database got bigger and the retrieval got more accurate.
API guessing still fails. Because the correct API for a specific library version in a specific context may not be well-represented in training data. This is the model at the edge of its distribution.
Less brittleness overall. But this is a denser distribution producing better average outputs. The moment you push into genuinely novel territory — unusual library combinations, novel architectures, subtle design tradeoffs — you hit the same wall you always hit. It just moved further out.
The critical observation, across hundreds of hours of use spanning model generations: the models surprised with speed, with volume, with syntactic correctness, but never with a genuine insight: an idea that had not been considered, a design approach genuinely superior to what a human would have produced. The improvements are quantitative, not qualitative.
V. No Evidence of Out-of-Sample Self-Improvement
The narrative driving the most dramatic AI timelines is recursive self-improvement: AI becoming capable enough to design better AI, in an accelerating loop. The evidence for this is thin.
AlphaGo and AlphaZero exceeded all human play through self-play. But they operated within perfectly defined games with clear rules, perfect information, and unambiguous reward signals. Open-ended research — where you do not even know what the objective function should be — is a fundamentally different problem.
No AI system has autonomously identified a limitation in its own architecture and proposed a qualitatively better one. All major capability leaps — attention mechanisms, RLHF, scaling laws, chain-of-thought — came from human researchers. AI systems did not predict any of them.
Self-improvement requires self-evaluation, and self-evaluation of capabilities you do not yet possess is arguably paradoxical. You need an evaluation framework more capable than the system being evaluated. That is either a human or a more capable AI — which begs the question.
What the Evidence Says
The research literature provides substantial — though not unqualified — support for this hypothesis.
Supporting the Ceiling
Chain-of-thought unfaithfulness is well-documented. Turpin et al. (NeurIPS 2023) showed that biasing features in prompts cause models to rationalise biased answers without mentioning the bias, with accuracy drops of up to 36%. Lanham et al. (Anthropic, 2023) found that truncating chain-of-thought often did not change the answer and that faithfulness degrades as models get larger. A 2025 Anthropic study found Claude 3.7 Sonnet mentioned reasoning hints only 25% of the time. An Oxford position paper (Barez & Wu, 2025) concluded directly: "chain-of-thought is not explainability."
Compositional reasoning fails on novel structures. Dziri et al. (NeurIPS 2023, Spotlight) demonstrated that transformers solve compositional tasks via "linearised subgraph matching" — pattern-matching against training data fragments, not systematic reasoning. The Apple GSM-Symbolic study (ICLR 2025) showed that simply changing numbers in math problems degraded performance, and adding irrelevant information caused drops of up to 65%. The authors found "no evidence of formal reasoning."
ARC-AGI-2 exposes the gap. Pure language models score 0% on ARC-AGI-2. Even the best test-time methods reach only 54% at $30 per task, versus 60% for average humans. The benchmark's creators concluded: "AI reasoning performance remains fundamentally constrained by knowledge coverage."
Pre-training scaling is hitting diminishing returns. OpenAI's Orion produced improvements far smaller than GPT-3 to GPT-4, leading to its quiet release as GPT-4.5. Ilya Sutskever publicly stated "pre-training as we know it will end." Epoch AI estimates high-quality human text exhaustion between 2026 and 2032.
Model collapse prevents recursive self-training. Shumailov et al. (Nature, 2024) demonstrated that training on model-generated data causes irreversible defects — tail distributions vanish as models become "poisoned with their own projection of reality."
Enterprise adoption lags the narrative. MIT's 2025 NANDA report found 95% of enterprise AI pilots delivered zero measurable P&L impact. An NBER study (February 2026) found 90% of firms reported no impact on workplace productivity. A METR randomised controlled trial found AI tools actually slowed down experienced developers despite perceived speedups.
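The model-collapse dynamic cited above has a mechanism simple enough to simulate. This is a toy sketch in the spirit of Shumailov et al., with hypothetical numbers: each "generation" fits a Gaussian to a small sample drawn from the previous generation's fit. Because finite-sample estimation underestimates spread on average, the tails shrink generation after generation.

```python
# Toy sketch of model collapse: repeatedly refit a Gaussian to samples
# drawn from the previous fit. The MLE spread estimate is biased low,
# so the distribution's tails shrink with each generation.
import random
import statistics

random.seed(0)
mu, sigma = 0.0, 1.0  # generation 0: the "real" data distribution
for gen in range(1, 41):
    # tiny sample per generation, to make the effect visible quickly
    samples = [random.gauss(mu, sigma) for _ in range(5)]
    mu = statistics.mean(samples)
    sigma = statistics.pstdev(samples)  # population (MLE) estimate, biased low
    if gen % 10 == 0:
        print(f"generation {gen:2d}: sigma = {sigma:.4f}")
```

By the final generation the fitted spread is a fraction of the original: the tail distributions are gone, which is exactly why recursive self-training cannot conjure knowledge that was not in the original data.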
Challenging the Ceiling
Mechanistic interpretability reveals genuine internal structure. Anthropic's circuit tracing (March 2025) found multi-step reasoning circuits in Claude 3.5 Haiku — intermediate features activating in sequence, confirmed through causal intervention. Li et al. (ICLR 2023) showed Othello-GPT develops real board representations. Gurnee and Tegmark (ICLR 2024) found linear representations of space and time in LLM activations. These are not surface statistics. Something structured is happening internally.
LLM-driven systems have made verifiable mathematical discoveries. DeepMind's FunSearch (Nature, 2023) found a previously unknown cap set construction. AlphaEvolve (2025) discovered the first improvement over Strassen's 1969 matrix multiplication algorithm — a result absent from any training data. These are genuine, novel, verified contributions.
Test-time compute scaling produced dramatic jumps. o3 achieved 87.5% on ARC-AGI-1 and 96.7% on AIME 2024. The mechanism — giving models more computation at inference — broke through what appeared to be hard ceilings on pre-training alone.
Emergence is partly real. While Schaeffer et al. (NeurIPS 2023) showed most reported emergence was a metric artefact, Anthropic's induction head research identified genuine phase transitions during training, and grokking research demonstrates networks can reorganise from memorisation to generalisation through sharp structural transitions.
Compression implies structure. Huang et al. (COLM 2024) showed intelligence scores across 31 LLMs almost linearly correlate with compression efficiency — and effective compression theoretically requires identifying deep structure, not merely surface statistics.
Conclusion
The reassembly ceiling is real, but it is better understood as a series of ceilings rather than a single wall.
The first ceiling — pre-training on static text — is already reached. The industry knows this, which is why the conversation has shifted to test-time compute, reinforcement learning, and agentic architectures. The second ceiling — the limits of chain-of-thought as navigated pattern-matching rather than genuine reasoning — is becoming visible in benchmark saturation and the flattening SWE-bench curve. The third ceiling — the epistemological limit on what can be extracted from human-generated data by any method — remains theoretical, but every failed attempt at recursive self-improvement and every 0% score on ARC-AGI-2 suggests it is there.
What these systems do is genuinely valuable. The practical utility is real, the productivity gains in constrained domains are measurable, and the economic floor is substantial enough to prevent a full winter. But "genuinely valuable tool" and "artificial intelligence" are different claims, and the gap between them is where the current valuation bubble lives.
The historical pattern — cybernetics in the 1950s, expert systems in the 1980s, language models in the 2020s — suggests that every 20 to 30 years, a new technical paradigm emerges, gets mapped onto the dream of automating knowledge work, partially delivers, hits a wall, corrects, and the useful residue gets absorbed into infrastructure. Then the next generation, having not studied the history, does it again.
We are probably not heading for winter. But we may be heading for autumn — a cooling of expectations, a shakeout of companies, and a quiet acknowledgment that "intelligent" was always the wrong word for what these systems do. The memory got better. The retrieval got more accurate. The trajectory mechanics improved.
The intelligence stayed the same.
The hypothesis presented here was developed through extensive hands-on use of frontier models across their full progression, combined with analysis of publicly available benchmarks and peer-reviewed research. It represents a practitioner's perspective, not a formal academic position. The author welcomes rigorous disagreement.