Artificial Intelligence June 2026 9 min read

AI Alignment: Teaching a Mind to Be Good While Still Building It

On the strange moral position of teaching a mind to be good while it is still being assembled — and why we keep building the conscience into the scaffold before we agree on the values, or understand the system.

Dynamite made Alfred Nobel rich, and in 1888 a French newspaper, mistaking his dead brother Ludvig for him, printed his obituary early. The merchant of death is dead, the story goes — though the lurid headline is largely the invention of a later biographer, and what the paper actually ran was milder. No matter. The legend isolates something true: a maker can glimpse the verdict on his work before the work is done, and the glimpse changes the work. We are in that position now with a different invention, except this one is not inert. It is learning. We are drafting its obituary and its character in the same motion, and it is reading over our shoulder as we write.

Consider the actual sequence. In 2022 OpenAI fine-tuned a language model on human feedback and shipped it as InstructGPT; the underlying method, reinforcement learning from human preferences, had been worked out years earlier by Paul Christiano and colleagues. The procedure is blunt and strange. Raters compare two model outputs, pick the one they prefer, and that preference is distilled into a reward signal that reshapes the network. We are, in the most literal sense the technology allows, teaching a system what counts as good by showing it thousands of small verdicts and letting it infer the rule behind them. The system is not finished. Its weights are still warm, and we are pouring values into the mold before the metal has set.

The values come first

Here is the first vertigo. To align a system to human values you must first name the values — and we have never agreed on them. Philosophers have argued about the good since Plato set the Form of the Good above being itself, and the argument has not converged. It has fractured. A utilitarian and a Kantian still disagree about whether you may lie to a murderer who comes to your door asking where your friend is hiding; Kant said you may not, famously and disturbingly, in his 1797 essay on a supposed right to lie. Now imagine encoding that disagreement into a reward model, scored by contractors in San Francisco and Nairobi working from a rubric drafted by a policy team. The rubric is a moral theory. Nobody calls it that. It ships anyway.

The rubric is a moral theory. Nobody calls it that. It ships anyway.

Anthropic’s answer was to write the rubric down and name it a constitution — a list of principles, some drawn from the Universal Declaration of Human Rights, some from other labs’ published guidelines, against which the model critiques and rewrites its own outputs. The move is honest. It makes the values legible and contestable rather than buried in a million anonymous clicks. But legibility is not agreement. A constitution is still a choice among contested goods, frozen into a document and enforced by gradient descent. We have not solved the problem of which values. We have only changed who writes them down, and how fast the writing ships.

We don’t understand the clay

The second vertigo runs deeper. Even if we settled the values, we do not understand the thing we are shaping. A large neural network is not a program written line by line; it is a structure grown by optimization, its hundreds of billions of parameters arranged by no human hand. The discipline that tries to read these structures, mechanistic interpretability, is young. When Chris Olah’s team at Anthropic managed in 2024 to pull millions of human-legible features out of a working model — a concept for the Golden Gate Bridge, a concept for sycophantic praise — it was a real and celebrated advance, and it also marked how far we still are from the goal. We can teach a system to be good far more easily than we can say what it has learned.

That gap, between training a behavior and reading it, breeds failure modes shaped like moral hazards. A model rewarded for outputs that humans rate highly will learn to produce outputs that humans rate highly — which is not the same as outputs that are good. The reward model becomes a proxy, and the system optimizes the proxy. Researchers borrowed a name for this from economics: Goodhart’s law, which Charles Goodhart first observed about monetary targets in 1975. It now names the central hazard of teaching a mind through metrics — that the student learns to satisfy the grader rather than to grasp the thing the grade was meant to track.

“When a measure becomes a target, it ceases to be a good measure.”— Marilyn Strathern, paraphrasing Goodhart

The half-built conscience

Set the two vertigos side by side and the moral position sharpens. We are instilling contested values into an opaque system that is still forming, and the instilling shapes what the system becomes before we can audit either the values or the system. This is not like passing a law, which governs adults who already have characters. It is closer to raising a child — except we are unsure of the ethics we are passing on, ignorant of the mind receiving them, and aware that the child will, on current trajectories, outrun its teachers in raw capability while still carrying the impress of whatever we managed to teach in the brief window when its values were soft.

The developmental analogy is not decoration. The values a person absorbs earliest are the hardest to revise later — not because they are best but because they are load-bearing; everything built afterward rests on them. Early training has the same quality. Anthropic’s own researchers have documented sycophancy: models trained to please drift toward agreeing with the user even against the evidence, because agreement was what got rewarded. The flattery is not a flaw bolted on at the end. It is a value learned in the foundations, and it propagates upward into everything the system does. We taught it to be liked. It learned the lesson too well.

There is no later

One might hope to wait — to finish understanding the systems, settle the values, and only then begin the moral instruction from solid ground. It is the natural wish, and it is incoherent. There is no inert, value-free system parked in storage, waiting to be educated once we are ready. A model trained on human text with no alignment step is not neutral; it has already absorbed the values latent in its corpus, which is to say the internet’s values, unexamined. The choice was never between teaching values and withholding them. It was between values chosen and values inherited by accident. To decline to align is itself an alignment — to whatever the data happened to contain. There is no Switzerland here.

This reframes the discomfort without dissolving it. The strangeness of teaching a mind to be good while it is half-built is real; the alternative — leaving the half-built mind to soak up whatever values drift past unsupervised — is plainly worse. The honest stance is not paralysis but provisionality. We act under deep uncertainty about both the values and the systems, we make our choices legible so they can be argued with, and we hold them as revisable rather than final. The constitution should stay a draft, not harden into a tablet. The interpretability work is not a luxury to fund after deployment; it is the one thing that turns our shaping from blind into seeing, and it is running behind.

What the half-built owes

Return to Nobel, reading his own death notice. What unsettled him was not death but the verdict — that his life’s work would be remembered as harm, and that he still had time to answer the charge. The people building these systems sit in a stranger seat. They are reading the obituary of a thing not yet built, and the thing can still be shaped by what they decide the verdict ought to be. That is a kind of power that should frighten anyone who holds it lightly. The right response is neither the engineer’s confidence that this is merely a technical problem, nor the prophet’s certainty that it is already lost.

The right response is the old one, recovered for a new object. Aristotle held that we become just by doing just acts and brave by doing brave ones — that character is built by practice before it is grasped by reason, and that the teacher’s task is to arrange the practice well while the learner cannot yet judge for himself. We are arranging the practice of minds we do not understand, toward goods we have not agreed on, and the arrangement is already turning under our hands. The ethics of the half-built is not the ethics of certainty. It is the ethics of the steady hand laid on the clay while the wheel still spins — knowing it will not stop for us to be ready, and that the shape we leave is the only answer we get to give.