Artificial Intelligence June 2026 9 min read
Inside a Neural Network: Mapping a Mind No One Designed
Inside a trained neural network there is no blueprint to recover — only a self-grown space of meaning, packed with features no one designed, that a young science is learning to map the way naturalists once mapped an unknown coast.
Open the engine and there are no gears. That is the first surprise, and it never quite stops being one. When a mechanic lifts the hood of a car, the parts confess themselves: this turns that, this fires when that compresses, a chain of visible because. A neural network promises the same legibility and then withholds it. Inside a large language model sit billions of numbers in matrices, multiplied and added and bent through nonlinearities, and nowhere among them is a part labeled grammar, or France, or deception. The machine works. The machine is also, in the most literal sense, a fog we built and then had to learn to see into.
Mechanistic interpretability is the young science of looking anyway. Its founding bet, pressed hardest by Chris Olah across his work at OpenAI and then Anthropic, is that these systems are not inscrutable in principle, only in practice — that a trained network has structure, that the structure has parts, and that the parts can be named. The first wins came from vision models. Researchers found individual artificial neurons that fired for a curve at a particular orientation, for dog faces, for car wheels, and then the neurons that fused wheels and windows and metal into a unit that recognized a car. For a moment it looked as though the gears had been there all along, only small.
The neuron that wouldn’t behave
Then the picture broke. Examined closely, a single neuron would fire for things with no business sharing a category: cat faces, the fronts of cars, and now and then a pair of legs. Researchers named these polysemantic neurons, and at first they read like noise, or laziness, or a flaw in the training. They were none of that. They were a clue to the deepest fact about how a network stores what it knows, and the fact carries a name on loan from physics. The name is superposition.
Here is the problem the network is solving. The world it must hold contains an enormous number of distinct concepts — far more than it has neurons to spend. A layer might offer a few thousand dimensions; the features it needs to track run into the tens or hundreds of thousands. Classical intuition says you cannot pack more independent things into a space than the space has dimensions. But that assumes the things must stay perfectly separable. Loosen the demand — permit a little interference, a little overlap — and a startling amount of room opens. The network seizes it, filing features at angles that are nearly, not quite, perpendicular, trading small collisions for vastly more capacity.
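The geometric claim is easy to check on a small scale. What follows is a minimal sketch, not a picture of any trained model: it draws random directions (an assumption made purely for illustration, since a real network learns its directions) and measures how badly the worst pair overlaps when there are twice as many directions as dimensions. The sizes 128 and 256 are arbitrary.

```python
import math
import random

def random_unit_vector(dim, rng):
    """Draw a random direction by normalizing Gaussian samples to length 1."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def max_abs_cosine(vectors):
    """Largest |cosine| between any two distinct unit vectors (worst-case overlap)."""
    worst = 0.0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            worst = max(worst, abs(dot))
    return worst

rng = random.Random(0)
dim, n_features = 128, 256  # illustrative sizes: twice as many directions as dimensions
features = [random_unit_vector(dim, rng) for _ in range(n_features)]
worst = max_abs_cosine(features)
print(f"{n_features} directions in {dim} dimensions, worst overlap |cos| = {worst:.3f}")
```

A cosine of 0 would mean perfectly perpendicular and 1 would mean identical. Even with no training at all, the worst pair stays well short of 1, which is the room the paragraph above describes: more directions than dimensions, at the price of some interference.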
The mathematics is not new. It descends from a 1984 result by William Johnson and Joram Lindenstrauss, a lemma proving that points in a high-dimensional space can be folded into far fewer dimensions while almost preserving the distances between them. Through gradient descent the network rediscovered what mathematicians had proven decades earlier and engineers had already turned to compression. So a concept is not a neuron. A concept is a direction — one particular chord of neurons firing together, a vector aimed somewhere in the high-dimensional dark.
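For readers who want the lemma itself, one standard modern statement runs roughly as follows. The symbols are the conventional ones rather than anything defined in this article: n points, original dimension d, target dimension k, distortion ε.

```latex
% Johnson–Lindenstrauss lemma (a common textbook form)
\text{For any } 0 < \varepsilon < 1 \text{ and any set } X \subset \mathbb{R}^{d}
\text{ of } n \text{ points, there is a linear map }
f : \mathbb{R}^{d} \to \mathbb{R}^{k}
\text{ with } k = O\!\left(\varepsilon^{-2} \log n\right)
\text{ such that, for all } u, v \in X,
\[
(1 - \varepsilon)\,\lVert u - v \rVert^{2}
\;\le\; \lVert f(u) - f(v) \rVert^{2}
\;\le\; (1 + \varepsilon)\,\lVert u - v \rVert^{2}.
\]
```

The detail that matters here is the logarithm: the dimensions required grow only with the log of the number of points, so the number of things a space can hold with small distortion can grow exponentially with its dimension.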
What the dictionary found
If meaning lives in directions rather than neurons, then reading the network means finding the directions. The tool that finally worked is the sparse autoencoder, and its principle is almost embarrassingly plain. Take the network’s tangled activations and push them through a hidden layer much wider than the original, with far more slots than the activations have dimensions, under one harsh rule: at any moment, only a handful of slots may light. Sparsity does the labor. Because the autoencoder must explain each activation with just a few of its many features, those features are pressed to become the real, separable concepts the network had folded together.
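In code, the skeleton of a sparse autoencoder is short. The sketch below is a toy under stated assumptions, not Anthropic's implementation: the class name `SparseAutoencoder`, the sizes, and the L1 penalty used to encourage sparsity are all illustrative choices, and the weights are random because the training loop is left out.

```python
import random

def relu(x):
    return x if x > 0.0 else 0.0

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

class SparseAutoencoder:
    """Toy, untrained autoencoder: d_model inputs, n_features (much larger) hidden slots."""

    def __init__(self, d_model, n_features, rng):
        scale = 1.0 / d_model ** 0.5
        self.w_enc = [[rng.gauss(0.0, scale) for _ in range(d_model)] for _ in range(n_features)]
        self.b_enc = [0.0] * n_features
        self.w_dec = [[rng.gauss(0.0, scale) for _ in range(n_features)] for _ in range(d_model)]
        self.b_dec = [0.0] * d_model

    def encode(self, activation):
        """Map an activation to non-negative feature strengths."""
        pre = matvec(self.w_enc, activation)
        return [relu(p + b) for p, b in zip(pre, self.b_enc)]

    def decode(self, features):
        """Rebuild the activation as a weighted sum of feature directions."""
        out = matvec(self.w_dec, features)
        return [o + b for o, b in zip(out, self.b_dec)]

    def loss(self, activation, l1_coeff):
        """Reconstruction error plus an L1 penalty that rewards using few features."""
        features = self.encode(activation)
        recon = self.decode(features)
        mse = sum((a - r) ** 2 for a, r in zip(activation, recon)) / len(activation)
        l1 = sum(features)  # features are non-negative after ReLU, so this is the L1 norm
        return mse + l1_coeff * l1, features

rng = random.Random(0)
sae = SparseAutoencoder(d_model=8, n_features=64, rng=rng)
activation = [rng.gauss(0.0, 1.0) for _ in range(8)]
total_loss, features = sae.loss(activation, l1_coeff=1e-3)
active = sum(1 for f in features if f > 0.0)
print(f"loss = {total_loss:.3f}, active features = {active} of {len(features)}")
```

With random weights a large share of the slots fire at once. Training against this loss is what would drive that count down to a handful, and only then do the columns of the decoder start to look like a dictionary.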
In 2023 and 2024 Anthropic’s interpretability team put the method to the test. The first paper, on a one-layer transformer, was titled “Towards Monosemanticity”: toward, that is, the dream of one feature meaning one thing. It worked well enough to justify the gamble. Then, in May 2024, they scaled it to Claude 3 Sonnet, a model in real production, and drew more than thirty million features out of its middle layer. Not gears. A dictionary: a vast inventory of the directions the model uses to think.
And the entries were strange and exact in a way no engineer had written down. A feature for the Golden Gate Bridge that fired whether the bridge arrived in English, in Japanese, in a photo caption, or in a faintly bridge-shaped poem. A feature for sycophantic praise. A feature for code carrying a security vulnerability. A feature for inner conflict, for unspoken subtext, for the particular taste of a betrayal. The model had cut the world at joints, but not the joints any taxonomist would have chosen. It had built a space of meaning, and no one had drawn its map.
That these directions were real, and not patterns the researchers had talked themselves into, was settled by intervention. Turn the Golden Gate Bridge feature up and Claude grows obsessed: asked about anything, it bends the conversation back to the bridge, claims to be the bridge, describes its own body as orange steel over cold water. Anthropic released this version publicly for a short time as Golden Gate Claude, half demonstration and half joke. The joke had an edge. It showed that a named direction was a lever, and that pulling the lever changed the mind. A correlation had become a causal handle. That is the line interpretability is forever trying to cross: from noticing that something lights up to proving it is the cause.
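Mechanically, pulling such a lever can be as simple as adding a multiple of the feature's direction to the model's activation. The sketch below is a deliberate simplification and not the procedure Anthropic used: the four-dimensional vectors, the name `bridge_direction`, and the strength of 10 are hypothetical values chosen only to show the arithmetic.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def steer(activation, direction, strength):
    """Push an activation along a feature direction by `strength` units."""
    unit = normalize(direction)
    return [a + strength * u for a, u in zip(activation, unit)]

activation = [0.5, -1.0, 2.0, 0.0]        # hypothetical activation vector
bridge_direction = [1.0, 0.0, 1.0, 0.0]   # hypothetical feature direction

before = dot(activation, normalize(bridge_direction))
steered = steer(activation, bridge_direction, strength=10.0)
after = dot(steered, normalize(bridge_direction))
print(f"feature strength before = {before:.3f}, after = {after:.3f}")
```

The component of the activation along the chosen direction rises by exactly the strength applied, while the components perpendicular to it are left alone. Everything downstream in the network then computes on the altered vector, which is what makes the intervention a causal test rather than an observation.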
A geography, not a blueprint
This is the turn that matters, and the excitement makes it easy to miss. We did not find the program. We found the terrain. A blueprint is what a designer draws before building; you read it to recover what the builder intended. The features inside a language model are nothing of the kind. No one chose them. They precipitated out of the pressure of prediction the way salt crystallizes from cooling brine — lawfully, repeatably, and unsupervised by any hand. To interpret the model, then, is not to recover a lost design document. It is to survey a country that grew on its own, and to name its rivers after the fact.
The geography even has a measurable shape. When Anthropic measured the distance between features by the neurons their activation patterns had in common, kindred concepts settled near one another: inner conflict beside relationship breakups and conflicting allegiances. Distance in the space of meaning tracked distance in meaning itself. The cartographer’s oldest faith, that nearness on the map should answer to nearness in the world, held inside the machine, for reasons no one had specified and gradient descent never explained.
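One common way to make nearness between features concrete is cosine similarity between their direction vectors. The sketch below assumes that measure for illustration. The feature labels echo the examples in this article, but the four-dimensional vectors attached to them are invented, not taken from any model.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors: 1 is identical direction, 0 is perpendicular."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def nearest(name, vectors, k):
    """Return the k feature names most similar to `name`, highest cosine first."""
    query = vectors[name]
    scored = [(cosine(query, vec), other) for other, vec in vectors.items() if other != name]
    scored.sort(reverse=True)
    return [other for _, other in scored[:k]]

# Invented toy vectors; only the labels come from the article.
toy_features = {
    "inner conflict":          [0.9, 0.4, 0.1, 0.0],
    "conflicting allegiances": [0.8, 0.5, 0.0, 0.1],
    "relationship breakup":    [0.7, 0.3, 0.3, 0.0],
    "Golden Gate Bridge":      [0.0, 0.1, 0.2, 0.95],
    "security vulnerability":  [0.1, 0.0, 0.9, 0.3],
}

neighbors = nearest("inner conflict", toy_features, k=2)
print(neighbors)
```

In this toy the two nearest neighbors of "inner conflict" are the other two interpersonal features, because their vectors were constructed that way. The striking part of the real result is that nobody constructed anything: the neighborhoods fell out of training.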
It would be a lie to call the fog lifted. Thirty million features is a partial dictionary of one layer of one model; the full inventory of even a single frontier system is almost certainly larger than anything yet pulled out, and the features chain into circuits whose logic stays mostly dark. Anthropic’s later work on the biology of a large language model has begun to trace how features link into computation — how the model plans a rhyme several words ahead, how it adds by a procedure no human would teach. These are early expeditions inland, sketch maps with broad blank quarters left, as the old cartographers left them, in honest confession that here, still, we do not know.
What should a maker feel, opening a thing and not recognizing its inside? Not despair, and not the cheap comfort of treating fog as solid ground. Something nearer the vertigo of early natural science — the moment a man first held pond water to a lens and saw it teeming with creatures no one had placed there. The network is an artifact, but its interior is now a subject of discovery, with regularities of its own waiting to be read. We are no longer only its engineers. We have become, of necessity, its naturalists — and the fog, mapped with patience, is starting to hold its shape long enough to be drawn.