Artificial Intelligence June 2026
The Bitter Lesson: Why Raw Scale Keeps Beating Clever AI
Twice now — first with search, then with scale — the simplest general method has beaten our most carefully crafted theories, and the win arrives with a bill we are only beginning to read.
Garry Kasparov, sitting across from a machine in May 1997, was beaten by a method that understood nothing. IBM’s Deep Blue held no theory of the Sicilian Defence, no feel for a poisoned pawn, no concept of chess at all. It searched — roughly two hundred million positions a second — and let raw enumeration stand in for everything a grandmaster calls judgement. Decades of painstaking work to encode positional wisdom into software had produced weaker programs than this one, which mostly just looked further. That asymmetry, repeated across domain after domain, is the seed of what Richard Sutton would later name, with deliberate sourness, the bitter lesson.
Sutton, a founder of reinforcement learning and, with Andrew Barto, a 2024 Turing Award laureate, set the argument down in a short 2019 essay that now reads like prophecy. Its first sentence is the whole creed compressed: across seventy years of artificial intelligence, the great advances have come from general methods that leverage computation, and by a large margin. Not cleverness. Not the careful sculpting of human insight into rules. Just learning and search, turned loose on more compute. The bitterness is not decoration. It names a real wound to a particular kind of researcher’s pride.
The pattern repeats
Consider where the field has been humbled. In speech recognition, the linguists who modelled phonemes, vocal tracts and formant transitions were overtaken by statistical methods that knew nothing of the mouth and merely fitted patterns to data. In computer vision, decades of hand-engineered feature detectors — edges, corners, the SIFT and HOG descriptors that careers were built on — were swept aside in 2012, when a convolutional network learned its own features straight from pixels. In Go, AlphaGo leaned less on human expertise than its predecessors, and AlphaGo Zero discarded human games entirely, reaching superhuman play by self-play alone. Each time, the knowledge-rich approach lost to the knowledge-poor one with more computation behind it.
The most general method keeps winning because generality is what compute rewards.
The mechanism beneath this, Sutton argues, is Moore’s law — or rather its generalisation, the steady exponential fall in the cost of a unit of computation. If compute will be far cheaper next decade than this one, then any method whose performance scales with compute must eventually overtake any method that does not, however ingenious the latter looks today. Built-in human knowledge feels good in the short run and resembles progress at every conference. But it plateaus. It does not scale. It is, in the long arithmetic of the field, a local optimum that the rising tide of computation drowns.
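A toy calculation makes that arithmetic concrete. The sketch below is illustrative only: it assumes compute per dollar doubles every two years, that a general method gains a fixed number of points per doubling, and that a knowledge-built rival starts ahead but stays flat. Every constant and function name here is hypothetical; none comes from Sutton's essay.

```python
import math

# All constants are hypothetical, chosen only to illustrate the argument.
DOUBLING_YEARS = 2.0      # assumed: compute per dollar doubles every two years
HANDCRAFTED_SCORE = 60.0  # assumed: knowledge-built method starts ahead, then plateaus
GENERAL_BASE = 40.0       # assumed: general method starts well behind
GAIN_PER_DOUBLING = 5.0   # assumed: points gained per doubling of compute


def relative_compute(years):
    """Compute available for the same budget, relative to year zero."""
    return 2.0 ** (years / DOUBLING_YEARS)


def general_score(years):
    """Score of the general method: rises with the logarithm of compute."""
    return GENERAL_BASE + GAIN_PER_DOUBLING * math.log2(relative_compute(years))


def crossover_year(max_years=50):
    """First whole year in which the general method strictly beats the plateau."""
    for year in range(max_years + 1):
        if general_score(year) > HANDCRAFTED_SCORE:
            return year
    return None  # no crossover within the horizon


if __name__ == "__main__":
    print(crossover_year())  # prints 9 under the assumptions above
```

Changing the constants moves the date of the crossover. As long as the gain per doubling is positive and the rival stays flat, it does not change the outcome.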
Why we keep losing
The lesson is bitter precisely because researchers cannot help themselves. We are pattern-finding animals; we look at a problem, perceive its structure, and want to teach the machine what we see. To embed our hard-won grasp of language, or vision, or strategy feels like the very content of intelligence — the part worth doing. The bitter lesson says that this instinct, the most satisfying move in the work, is usually the trap. The methods that win do not encode how we believe a mind should reason. They encode almost nothing, and discover the rest. Our self-portrait, painted into the algorithm, is the thing that holds it back.
“building in how we think we think does not work in the long run” (Richard Sutton, “The Bitter Lesson”)
Tasted twice
Here is the turn. The large language model is the bitter lesson’s second and most total course, and it has humbled a prouder generation of theories than the first. Through the 2010s, computational linguistics still held a vision of understanding built on syntax trees, semantic frames and grammars, the structured representations that Chomsky’s heirs had spent fifty years refining. Then the transformer, from Vaswani and colleagues in 2017, arrived with no grammar at all, only attention and the brute statistics of next-token prediction at scale. Scaling laws made the consequence quantitative: pour in more parameters, more data, more compute, and capability climbs a smooth curve, no linguistic theory required. Cleverness was beaten not once but twice, by the same blunt instrument.
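The smooth curve is usually written as a power law: loss falls as a fixed power of compute, so each tenfold increase in compute shrinks the loss by the same factor. A minimal sketch follows, with invented constants (the values of a and b are hypothetical, not fitted results from any published study) and a plain least-squares fit in log-log space to recover them.

```python
import math


def predicted_loss(compute, a=10.0, b=0.05):
    """Hypothetical power law: loss = a * compute^(-b)."""
    return a * compute ** (-b)


def fit_power_law(computes, losses):
    """Recover (a, b) by fitting a straight line to log(loss) against log(compute)."""
    xs = [math.log(c) for c in computes]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # a, b


if __name__ == "__main__":
    budgets = [1e2, 1e3, 1e4, 1e5]  # made-up compute budgets
    losses = [predicted_loss(c) for c in budgets]
    a, b = fit_power_law(budgets, losses)
    print(round(a, 6), round(b, 6))
```

On a log-log plot such a law is a straight line. That is why it can be extrapolated from small runs to large ones, and why no term for grammar appears anywhere in it.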
The strange part is that it works on us, the readers of the essay, exactly as it worked on the linguists. We knew the lesson. Sutton had written it down. We could recite it. Yet the architecture that now writes and reasons and codes was met, at first, with disbelief that something so theoretically empty could be so capable. The bitter lesson is not a fact you learn once and keep. It is a temptation you fall to again each time a new domain looks too rich, too human, too meaningful to surrender to mere computation. We taste it twice because we forget it once.
What the victory costs
So the most general method keeps winning. The question the essay leaves unspoken is what the victory takes from us. The first cost is intellectual: a method that scales by computation rather than comprehension hands us power without understanding. Deep Blue beat Kasparov and taught us nothing new about chess; a language model can write a proof and leave the mathematician no wiser about why it holds. We are building minds we do not understand, by a method that succeeds to exactly the degree that it refuses to be legible to us. Capability and explanation, long assumed to advance together, have come apart.
The second cost is concentration. If progress is governed by the falling price of computation, the frontier belongs to whoever can buy the most of it. The bitter lesson is also an economic verdict: it routes the future of intelligence through data centres and capital, away from the lone theorist with a good idea and toward the institutions that can afford the scale. The hand-crafted theory was, for all its failings, democratic — anyone with insight could contribute. The general method that wins is owned by the few who can run it. That, perhaps, is the bitterest taste of all, and the one Sutton’s essay does not name.
None of this makes the lesson false. The evidence has only hardened since 2019, and a researcher who bets against scale now bets against the whole recent record of the field. The honest response is not to wish the lesson away but to hold two things at once: that the general method is genuinely the most effective, and that its effectiveness costs us legibility and concentrates its rewards in few hands. We wanted machines that thought as we do. We got machines that work because they do not. To taste that twice and reach for cleverness a third time would be human and, Sutton would gently remind us, a mistake we already know how to name.