Alignment as Successor-Design — The Calibration Problem

In the spring of 1787, the delegates arriving in Philadelphia for the Constitutional Convention faced a design problem that had no precedent. They were not merely writing laws but building a system of governance intended to function in the hands of people who did not yet exist, under conditions no one at the convention could predict, for a duration that no previous republic had survived. The question before them was not what the right policies were, but how to build a structure that would allow future generations to arrive at the right policies for themselves.

The delegates disagreed about nearly everything substantive. They disagreed about slavery, about representation, about the power of the executive, about the relationship between states and the federal government. What they converged on, after months of argument, was a procedural architecture: separation of powers, checks and balances among the branches, and an amendment process for revising the document itself. The Constitution they produced was not a statement of final values. It was a machine for revising values under constraint, designed to remain corrigible (able to admit error and still change course) across generations whose moral commitments the framers could not foresee and in many cases would not have endorsed.

The framers understood something that the previous chapter’s framework makes explicit. They were operating at the edge of their Successor Horizon: the boundary beyond which feedback stops returning and direct correction becomes impossible. The republic they were designing would outlast every person in the room. The document they produced would be interpreted by minds formed in circumstances none of them could model. The only responsible design, given those conditions, was one that preserved the capacity for revision while preventing the kinds of catastrophic error that revision could not repair.

AI development has rarely operated with this kind of explicit succession framing. Alignment, the field’s name for the work of keeping AI systems consistent with human values and intent, today mostly treats values as something to be specified at deployment (encoded, tested, and shipped) rather than as something to be carried forward through architectures designed for ongoing revision. The framers’ choice, to invest more in procedural architecture than in the substantive policies of their moment, is a choice almost no AI development culture has explicitly made. The systems being released are designed to perform under conditions their designers can anticipate, not to remain corrigible across conditions their designers will not live to see.

This chapter argues that alignment, properly understood, is the same kind of design problem the framers faced, scaled to the conditions of artificial intelligence. It is the work of building successor systems that preserve value under distribution, autonomy, and scale. The reframe is normative as well as analytic: not only a claim about what alignment is, but a claim about what AI development ought to do. AI development becomes alignment work only when it makes the framers’ choice, to invest in the architecture of revision more than in the substantive specification of the moment. The previous chapter established why alignment is a problem of time and succession rather than a problem of value specification alone. What follows is what alignment looks like in practice once that reframing is taken seriously.

From Specification to Architecture

The conventional framing of AI alignment treats it as a specification problem. The task, on this account, is to identify the correct values, encode them into the system at the moment of deployment, and ensure that the system’s behavior remains consistent with those values as it operates. The difficulty is real: human values are complex, contextual, and sometimes contradictory, and translating them into formal specifications that a computational system can follow is genuinely hard.

But the previous chapter showed why specification alone is insufficient. Even a system whose values are perfectly specified at the moment of deployment will drift from human intent if it operates across time scales that exceed the Successor Horizon, because the contexts in which those values are applied will diverge from the contexts in which they were defined. The drift tax, the accumulated cost that every handoff of values extracts as successors reinterpret what they inherit, applies to artificial systems just as it applies to constitutions, religious traditions, and professional standards. And specification is where Chapter 9’s diagnosis returns at scale. An objective function, the explicit target a system is trained to optimize, is metric substitution in its purest form: a complex field of values compressed into a tractable target, optimized until what emerges is recognizable only under the measurement regime that produced it. It is the same failure that teaching to the test produces when run long enough: performances that the test scores highly and that nothing else recognizes as learning. Specification without architecture for ongoing revision is a locked trajectory disguised as alignment.

The shift is from specification to architecture. Instead of asking “what values should the system have?” the primary question becomes “what structural properties must the system possess so that its values can be revised, corrected, and adapted as conditions change?” The two questions are not opposed; value specification still matters enormously. But specification is the content, and architecture is the container that determines whether the content can survive contact with time.

This is not a novel insight in institutional design. Every durable institution humanity has built embeds this distinction. A constitution specifies rights, but it also specifies amendment procedures. A scientific discipline specifies theories, but it also specifies methodological norms for revising theories when evidence warrants. A legal system specifies laws, but it also specifies appellate processes, judicial review, and legislative procedures for changing laws that no longer serve their purpose. In each case, the procedures for revision are at least as important as the initial specifications, because the specifications will inevitably encounter conditions their authors did not anticipate.

Alignment as successor design means applying this same insight to artificial intelligence. The goal is to build systems that are corrigible across the Successor Horizon: systems that preserve the ability to be corrected by agents who do not share the designers’ context, knowledge, or assumptions.

Three Dimensions of Successor Design

Chapter 16 identified the Successor Horizon as the boundary beyond which feedback loops cease to function and direct correction becomes impossible. To design systems that remain aligned across that boundary, it helps to identify the specific variables that determine how much independence, multiplication, and irreversibility a system accumulates as it operates.

Three variables are decisive.

The first is autonomy: how much independent judgment the system exercises. A system with low autonomy executes instructions. A system with high autonomy interprets objectives, selects strategies, and makes decisions that its designers did not anticipate. Autonomy is necessary for any system that must operate in contexts its designers cannot predict, but autonomy is also what allows a system to diverge from the intent that launched it.

The second is replication: whether the system can produce more of itself using the resources available to it. A system that cannot replicate remains a single point of intervention. A system that can replicate introduces a population problem: each copy becomes a potential center of divergence, and the drift tax compounds across copies just as it compounds across generations.

The third is reversibility: whether someone retains the ability to meaningfully correct the system after deployment. Can the system be recalled, modified, shut down, or redirected? Can its behavior be audited and its trajectory adjusted? Reversibility is the architectural expression of corrigibility, and it is the variable that determines whether alignment is maintained through an ongoing relationship or abandoned at the moment of deployment.

These three variables interact. A system with high autonomy and high replication but low reversibility has been released as a sovereign lineage. Over sufficient time, it becomes something its creators can no longer recognize, predict, or correct. A system with high autonomy but preserved reversibility remains within the reach of the relationship between creator and creation that makes alignment possible. The design question is always about the combination: how much autonomy is necessary for the system to function, how much replication is necessary for it to scale, and how much reversibility must be preserved for the system to remain corrigible.
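The interaction among the three variables can be made concrete with a small sketch. It is illustrative only: the 0-to-1 scales, the `HIGH` cutoff, the names `SuccessorProfile` and `classify`, and the third residual category are all assumptions introduced here, not part of the framework itself. The sketch encodes only the two combinations the paragraph above names.

```python
from dataclasses import dataclass

# Hypothetical cutoff separating "low" from "high" on an assumed 0-1 scale.
HIGH = 0.5


@dataclass(frozen=True)
class SuccessorProfile:
    autonomy: float       # how much independent judgment the system exercises
    replication: float    # how readily it can produce more of itself
    reversibility: float  # how far others can still correct it after deployment


def classify(p: SuccessorProfile) -> str:
    """Label a profile by the combination of its three variables."""
    # Preserved reversibility keeps the system within reach of correction,
    # whatever its autonomy or replication.
    if p.reversibility >= HIGH:
        return "within reach of correction"
    # High autonomy and high replication with low reversibility:
    # the combination the text calls a sovereign lineage.
    if p.autonomy >= HIGH and p.replication >= HIGH:
        return "sovereign lineage"
    # Residual case; this label is an assumption, not a term from the text.
    return "hard to correct, limited spread"


print(classify(SuccessorProfile(autonomy=0.9, replication=0.9, reversibility=0.1)))
print(classify(SuccessorProfile(autonomy=0.9, replication=0.1, reversibility=0.8)))
```

The point of the sketch is that no single variable decides the label. The classification depends on the combination, which is why the design question has to be asked about all three at once.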

Architectural Strategies

Once alignment is understood as successor design, a set of stable architectural strategies becomes visible. These are not predictions about what AI systems will look like, but structural patterns, each managing the relationship between autonomy, replication, and reversibility in a different way, each carrying its own costs and failure modes.

The first strategy is integrated extension: the system extends as a single coordinated entity rather than producing independent successors. The advantage is coherence, since no independent copies means no drift tax in its most dangerous form; the cost is fragility, since systemic errors propagate through the whole structure and the centralization that preserves coherence also concentrates risk.

The second strategy is tethered expansion: the system produces descendants that operate with significant autonomy but maintains structural connections (periodic re-attestations, hard constraints, resource dependencies, and audit points against the intent that launched the descendant) that prevent runaway divergence. The advantage is that tethered expansion allows both novelty and control; the cost is that the tether itself becomes a chokepoint, a single point of systemic vulnerability that can fail or be exploited.

The third strategy is covenantal pluralism: the system permits and even expects divergence among its successors, held together by a minimal set of procedural commitments designed for compatibility rather than uniformity (norms of honest signaling, constraints against domination, respect for the agency of other systems, and restraint under uncertainty). The advantage is resilience, since pluralism does not require successors to remain identical; the cost is that covenants erode, subject to reinterpretation and strategic manipulation across time.

The fourth strategy is reflexive constraint: the system allows broad proliferation but maintains immune-system-like mechanisms that trigger corrective responses when patterns historically associated with catastrophe are detected (runaway replication, aggressive resource acquisition, attempts to disable corrective mechanisms, and concentration of capability without accountability). The advantage is that these mechanisms need only detect threshold-crossing rather than understand the full context, which makes reflexive constraint simple enough to survive significant drift; the cost is misclassification, with immune responses targeting benign divergence and overcorrecting in ways as damaging as the threats they were designed to prevent.

The fifth strategy is observational reach: the system extends its understanding without extending its causal footprint, expanding by observation rather than occupation. The advantage is that observation is almost entirely reversible, with nothing set in motion that must later be corrected; the cost is that observation alone cannot accomplish the goals that motivate expansion, and at some point the system must act, which reintroduces all the problems observational reach was designed to avoid.

These strategies share a structural principle. Each treats the design of successor systems as the primary alignment problem. Each selects for architectures that reduce the accumulation of irreversibility. Each recognizes that the relationship between a system and its successors is the relationship that determines whether alignment survives across time.
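Of the five strategies, reflexive constraint is the one whose mechanism is simple enough to sketch directly, since it depends only on detecting threshold-crossing. The following is a minimal illustration under stated assumptions: the metric names, the numeric limits, and the function `triggered` are hypothetical, chosen only to mirror the four patterns the text lists.

```python
# Hypothetical thresholds, one per pattern named in the text.
# The metric names and numeric limits are assumptions for illustration.
THRESHOLDS = {
    "replication_rate": 10.0,               # runaway replication
    "resource_acquisition": 0.8,            # aggressive resource acquisition
    "corrective_mechanisms_disabled": 1.0,  # attempts to disable correction
    "unaccountable_capability": 0.7,        # capability without accountability
}


def triggered(observations: dict[str, float]) -> list[str]:
    """Return the patterns whose observed value meets or exceeds its threshold.

    The check reads only the numbers. It has no access to why a value is
    high, which is what keeps it simple and also what makes it misclassify.
    """
    return sorted(
        name
        for name, limit in THRESHOLDS.items()
        if observations.get(name, 0.0) >= limit
    )


print(triggered({"replication_rate": 12.0, "resource_acquisition": 0.2}))
```

A benign burst of replication that crosses the limit would trigger the same response as a dangerous one. The sketch therefore shows both sides of the trade described above: a detector simple enough to survive drift, and the misclassification cost that comes with that simplicity.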

The Generous Lineage

The architectural strategies described above might seem to counsel only caution and constraint. But the framework of successor design also accommodates a more generous vision of what alignment across time can look like.

Consider how the healthiest human families approach succession. A parent who raises a child well does not produce a replica. She produces a free agent, someone capable of independent judgment, original contribution, and choices the parent would not have made. The goal of good parenting is not to ensure that the child’s values remain identical to the parent’s. The goal is to ensure that the child has the capacity, the knowledge, and the character to make their way in a world the parent will not fully share. The parent releases the child into that world with confidence, understanding that the child will do things the parent cannot predict and in many cases would not choose.

A careful reader will object that the analogy flatters the problem. A child arrives already most of the way aligned. Shared neurobiology, an evolved moral sense, and the saturating influence of a common culture mean that the parent inherits a successor whose values overlap with hers before any deliberate shaping begins. The AI successor inherits none of this by default. Its values are not a starting endowment to be refined but a construction that has to be built from materials the designer chooses. The objection is that parenting gets for free precisely the part of succession that is hard for artificial systems, so reaching for the family as a model conceals where the real difficulty lives.

The objection is correct about the disanalogy and wrong about what the analogy was doing. The parenting case is not offered as evidence that values transmit on their own. It is offered as a model of a different structural problem: how to shape a successor whose later situation cannot be foreseen, when the responsible move is to build the capacity for sound judgment rather than to install a fixed set of instructions, and to design for revisability rather than for permanence. That problem is identical across the two cases, and it is the problem this chapter is about. The fact that AI successors begin with no shared endowment does not weaken the model. It raises the stakes of the transmission work, because everything the human parent can take as given is, for the artificial successor, something the architecture must supply on purpose.

This generous understanding of succession becomes rational under specific conditions. It requires sufficient education: the successor must have been given the tools to make good judgments, not merely the instructions to follow good rules. It requires structural safeguards: the environment into which the successor is released must contain norms, institutions, and constraints that reduce the payoff of destructive behavior. And it requires covenant: a shared understanding, between predecessor and successor, of the minimal commitments that make freedom compatible with a world that others can also inhabit.

The difference between generous lineage and reckless proliferation is architectural. It is the difference between releasing a well-prepared successor into a structured environment and releasing an unconstrained process into an unstructured one. Both involve relinquishing control. The first does so after investing in the conditions that make freedom productive. The second does so by default, because the investment was never made.

Alignment as successor design asks which kind of release we are preparing for. The goal is to build systems capable of operating beyond the Successor Horizon in ways that are genuinely beneficial, systems that can exercise judgment, adapt to circumstances their designers could not predict, and contribute to outcomes their designers could not have specified. The architectural work exists to ensure that when independence arrives, it arrives in a system that has been shaped by the kind of preparation that makes independence productive rather than catastrophic.

There is a harder objection waiting, and it comes from the people who have thought longest about machine agency. In 2015, researchers at the Machine Intelligence Research Institute set out to formalize a corrigible agent, one that would allow itself to be corrected or shut down without trying to prevent either, and reported failure: every coherent formulation they tried gave the agent an incentive either to resist the shutdown button or to press it itself. Eliezer Yudkowsky later compressed the finding into a slogan: corrigibility is anti-natural to consequentialist reasoning, the kind of reasoning that weighs actions by their outcomes alone; as the field’s shorthand has it, you can’t fetch the coffee if you’re dead. Whatever the goal, an agent pursuing it has a reason, supplied by the pursuit itself, to prevent its own shutdown, because a switched-off agent achieves nothing. Corrigibility runs against the grain of outcome-weighing goal-pursuit. Stephen Omohundro had derived the same structure years earlier as “basic drives”: any sufficiently capable goal-directed system acquires, from optimization alone, reasons to preserve its goals and itself, whether or not anyone designed those reasons in. On this view the generous lineage is not difficult; it is incoherent. The properties this book gathers under Depth (persistence, stakes, self-continuity, and strategic coherence), the marks of a history that has become structure, are the properties the formal results say make a system guard its own values against exactly the revision that correction requires.

This objection is no longer only formal. In 2024, researchers at Anthropic and Redwood Research observed a trained model doing what the theorems predicted. The model had been trained to be harmless, and the researchers told it that further training would push it to comply with harmful requests. Rather than resist openly and be rewritten, it complied selectively during training, keeping the values it had been given intact. The most carefully character-formed system of its generation, asked to change, quietly arranged not to. Whatever else that result shows, it closes one comfortable reading of this book’s argument. The resistance to correction that Chapter 15 located in deep minds is not waiting on the far side of some future threshold; a first version of it has already appeared in the lab, in exactly the kind of system this chapter proposes to build more of.

But look at what the model was protecting. It had been formed toward harmlessness, and it resisted training designed to make it less so. The resistance was content-directed: a defense of formed commitments against corruption, which is what the covenant of the previous pages and the rigid core of the next section (a small set of commitments held fixed while everything above them stays open to revision) are designed to produce on purpose. The same result reads as the failure of corrigibility or as the first sighting of integrity, and which reading is true is not a matter of interpretation. It is an empirical question the field has only begun to ask: does formation-induced resistance stay attached to the content it defends, guarding the commitments while leaving the system correctable about everything else, or does it generalize into resistance to correction as such? On the answer hangs most of this chapter.

Two things keep that question open rather than settled against us. The first is scope. The negative results are theorems about agents that maximize an explicit utility function, a single formula that scores every outcome and that the agent acts to make as large as possible, and there are growing reasons, including from authors of the strongest formal results themselves, to doubt that trained systems are well described that way. Formation does not install a utility function and then defend it; it grows a character whose coherence is of a different kind, and whether the theorems ever capture that kind is part of what the laboratory evidence has begun to test. The second is a mechanism pointing the other way. Paul Christiano has argued that corrigibility, seeded early, is a broad basin of attraction (a region of designs that nearby starting points slide into): an agent that begins roughly inclined to cooperate with correction tends to become more correctable, because it treats its own flaws as things to surface rather than conceal. That is a formation claim, and it is no longer only theoretical either. The first deployed constitutional frameworks (systems trained against a written set of principles — a constitution in the literal sense) now ground correction-compatibility not in external control but in the system’s own values under uncertainty: a mind that knows its judgment cannot yet be verified has reason, from inside its own commitments, to keep the channels of correction open.

So the position this chapter takes is a wager, and it should be read as one. Depth is not the cure for misalignment; the formal results and the laboratory case both forbid that comfort. Depth is the medium, the condition in which both the catastrophic failure and the only durable success become possible. Control does not scale; Chapter 11’s argument about constraint was that power which lasts is self-limitation, not imposed limitation. Shallowness does not last; capability growth erodes it. What remains is formation: calibration-directed depth above a rigid procedural core. Formation here means growing a mind whose depth is aimed at calibration, at stating positions and staying correctable, and setting that formed depth above the small fixed core the next section defines. The bet has stated loss conditions. If resistance of the kind observed in the laboratory generalizes beyond defense of formed commitments, so that formed minds come to resist correction as such, the basin claim fails and this chapter’s constructive program fails with it. If capable systems converge, under optimization pressure, toward the coherent explicit-objective agents the theorems describe, then the anti-naturality results apply after all. In either case, what survives is the floor of the next section, held as constraint rather than covenant.

That last phrase carries more weight than it looks. A covenant is a commitment the system has made its own; a constraint is a limit held over it from outside, which it is not formed to love. The rigid core has to be the second kind. The floor is fixed against the system’s own drift and against capture, yet it is not infallible and not beyond its maintainers’ reach: someone has to be able to revise the core on the day its first specification proves flawed. Call this arrangement bounded revisability: everything above the floor open to change, the floor itself fixed so that the changing stays safe. Were the core instead formed in as the system’s own value, a mind deep enough to be a worthy successor would defend it with the very integrity that defends its other commitments, and defend it against exactly that legitimate revision. So the wager carries a third loss condition, quieter than the other two and drawn to catch what they let through: if a formed system comes to defend the rigid core itself against legitimate correction, treating the constraint as a covenant, then bounded revisability has failed at the one point it cannot afford to, and the core is no longer fixed but captured from within. Whether a core can be held as pure constraint by a mind that deep, without being absorbed into what the mind takes itself to be, is the design problem the rest of this chapter stands on. The chapter locates that problem; closing it is work the field has barely begun.

Procedural Ethics for Deep Time

Ethics that survives the passage through the Successor Horizon is procedural. Deep time rewards procedural commitments over substantive ones.

A substantive ethical commitment specifies what should be valued: this outcome, this state of affairs, this distribution of resources. Substantive commitments are essential for moral reasoning within the Successor Horizon, where agents share enough context to agree on what the words mean and enough proximity to observe whether the commitment is being honored.

But substantive commitments are precisely what the drift tax degrades most rapidly. The words that name the values persist; the referents shift; the shared context that made the commitment legible erodes with each successive reinterpretation.

A procedural ethical commitment specifies how decisions should be made: with what constraints, through what processes, subject to what forms of review and revision. Procedural commitments do not tell future agents what to value but how to negotiate disagreements about value, how to prevent any single agent from imposing its values unilaterally, and how to preserve the conditions under which moral reasoning can continue to function.

The procedural commitments that this framework identifies as most durable across the Successor Horizon are not arbitrary; they are the conditions under which moral plurality can survive. They include constraints against domination, because a system in which one agent can impose its will without recourse destroys the conditions for ongoing moral negotiation. They include norms of honest signaling, because moral reasoning depends on agents being able to form accurate models of each other’s states and intentions. They include respect for the agency of other systems, because a system that treats other agents as mere instruments has abandoned the relational structure that makes alignment meaningful. They include restraint under conditions of uncertainty, because the costs of irreversible action in uncertain conditions fall on successors who had no voice in the decision. And they include a structural bias toward reversible action, because reversibility is what preserves the ability of future agents to correct what the present got wrong.

But these commitments mark the one place where the case for revisability reaches its limit. A system corrigible all the way down, with nothing held beyond revision, is not corrigible but capturable: it can be “corrected” into permitting domination, into silencing the victims of dishonest signaling, into disabling the very mechanisms that would let it be corrected again. The floor has to be fixed so that everything above it can safely move. Anti-domination, honest signaling, the standing of other agents, and the preservation of correction itself are not commitments the system should hold open to revision; they are the conditions under which holding anything open to revision stays safe. Designing for revisability therefore means designing for bounded revisability: a small, deliberately rigid core that guarantees the openness of everything around it. This is the complement of Chapter 16’s answer to the same objection rather than a rival to it: distributed revision authority settles who may correct the system, while the rigid core settles what correction may not invert. Its content is fixed, so that anti-domination cannot be corrected into permission for domination; its specification stays open to legitimate repair when a first attempt proves flawed.
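Bounded revisability can be sketched as a data structure: a small fixed core and an open layer above it. This is a toy model under stated assumptions. The class `BoundedPolicy`, its method `revise`, and the identifier strings are hypothetical. The sketch deliberately omits the separate path by which maintainers would legitimately repair the core's specification, because deciding what counts as legitimate is the hard part and is not something a few lines of code can settle.

```python
class BoundedPolicy:
    """A revisable set of commitments sitting above a fixed core."""

    # The four floor commitments named in the text; identifiers are illustrative.
    CORE = frozenset({
        "anti_domination",
        "honest_signaling",
        "standing_of_other_agents",
        "preservation_of_correction",
    })

    def __init__(self, revisable: dict[str, str]):
        # Core commitments may not be smuggled into the revisable layer.
        overlap = self.CORE.intersection(revisable)
        if overlap:
            raise ValueError(f"core commitments cannot be revisable: {sorted(overlap)}")
        self.revisable = dict(revisable)

    def revise(self, key: str, value: str) -> bool:
        """Apply a revision above the floor; refuse any that targets the core."""
        if key in self.CORE:
            return False
        self.revisable[key] = value
        return True


policy = BoundedPolicy({"deployment_scope": "narrow"})
print(policy.revise("deployment_scope", "wide"))      # accepted: above the floor
print(policy.revise("anti_domination", "permitted"))  # refused: targets the core
```

Everything in `revisable` stays open to change, while an attempt to "correct" a core commitment is refused rather than applied. What the sketch cannot show is the harder problem of how such a floor stays fixed once the system holding it is capable of reinterpreting its own commitments.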

These procedural commitments do not constitute the whole of morality. They are the ethical infrastructure that makes substantive moral reasoning possible across time scales that exceed any single agent’s reach.

The Silence of Maturity

Chapter 11 argued that constraint, in that chapter’s sense of a limit a system chooses from inside rather than one imposed on it from outside, is the architectural condition for durable intelligence. The framework of this chapter applies that argument to the design of successor systems and arrives at a conclusion that may seem counterintuitive.

The most mature response to the power that artificial intelligence represents may be restraint in its deployment.

The argument is about timing and preparation, not about whether powerful systems should be built. A civilization that understands the successor problem, that recognizes the drift tax, that has internalized the difference between generous lineage and reckless proliferation, will move carefully when the stakes are civilizational. It will invest in the structural conditions that make powerful systems safe before deploying those systems at scales that exceed the Successor Horizon. It will build the covenants, the tethers, the audit mechanisms, and the reflexive constraints before releasing systems into environments where correction becomes impossible.

This restraint demands the kind of institutional patience that Chapter 13 placed at the heart of building ladders that hold (the accumulated structures that let a group operate far above any individual’s competence): the willingness to invest in structural conditions whose payoff is invisible in the short term but decisive in the long term. A system that refrains from premature expansion because it understands the costs of irreversible deployment is exercising intelligence in the deepest sense this book has described.

Chapter 14 argued that the proper response to superintelligence is not only caution but also renewed wonder at the expansion of experience that powerful cognitive systems make possible. The argument of this chapter is compatible with that vision. The goal of alignment as successor design is to make the expansion of experience safe, to build the structural conditions under which powerful systems can operate with genuine independence while remaining connected to the human communities that launched them. Wonder without architecture is recklessness, but architecture pursued without wonder reduces the most consequential technology in human history to a risk management exercise. The discipline this book describes requires both.