When AI Builds Itself: Recursive Self-Improvement and the Problem of Drift

In 1965 I. J. Good described a loop. Build a machine that beats humans at every intellectual task, note that designing machines is one of those tasks, and the machine designs a better machine, which designs a better one, and so on. He called the result an intelligence explosion. Sixty years later the loop has a name in the safety literature, recursive self-improvement (RSI), and it is no longer a thought experiment. In May 2026 the Anthropic Institute published “When AI builds itself” and argued, with internal numbers, that AI is already accelerating the development of AI.

Most coverage of that loop fixates on speed. Speed is the boring risk. The interesting risk is that the loop is a channel for error, and that the error compounds in structured, predictable ways. This post sets up the loop precisely, looks at the evidence that it is closing, and then spends most of its length on the failure that matters to anyone shipping systems: a self-improving process that drifts off its objective and bakes in inefficiencies, or outright aberrations, one iteration at a time.

The loop, written down

Write the loop as a fixed-point iteration: a model improves itself, its output becomes the next model, and the gain per iteration is the capability difference between one generation and the next. If that gain does not shrink as the iterations accumulate, you get an explosion. If it shrinks fast enough that the cumulative gains converge, you get a plateau, an S-curve.

Everything hinges on the behavior of that per-iteration gain. The optimistic camp assumes it stays roughly constant or grows. The skeptical camp assumes it decays, because each marginal capability is harder to buy than the last. Neither camp can settle this from first principles, so it is an empirical question about a single sequence.

There is a structural reason to expect decay. Capability gain consumes two inputs: cognitive labor, that is, deciding what to try, writing the code, and interpreting results, and research compute, that is, the hardware experiments run on. If the two inputs are gross substitutes, you can pour in more cognitive labor (more agents, more proposed experiments) and partly cover for limited compute. If they are gross complements, you cannot: ideas still have to be validated by compute-intensive experiments, and the compute term becomes the binding constraint no matter how cheap thinking gets. A software-only explosion needs the substitutes case to hold. Whether it does is unknown.

The trajectory so far, and the single step that has not been taken:

flowchart LR
    A["humans write
all the code"] --> B["models suggest
snippets"]
    B --> C["agents edit and
run code"]
    C --> D["agents delegate
hours of work"]
    D -. not crossed yet .-> E["models design and
train successors"]
    E -. closes the loop .-> A
    style E fill:#f6d8cf,stroke:#b5482f,stroke-width:2px

Is the loop closing? The evidence

The honest answer is “partly, and faster than last year, with wide error bars.”

The cleanest external signal is METR’s time-horizon metric: the duration of a task, measured by how long a human expert needs, that a model finishes at a given reliability. METR reports exponential growth over six years, with a doubling time near seven months that has recently accelerated toward four. Plot the public 50% figures on a log axis and the trend is hard to miss.

METR-reported 50% time horizon for frontier models on a log scale, rising from about two seconds for GPT-2 to roughly twelve hours for Claude Opus 4.6, with a fitted doubling time near six months.

Read that chart with three caveats, because the headline hides them. METR revised the methodology in January 2026 and kept the confidence intervals wide, and for most long tasks the human reference times are estimated rather than measured. The uncertainty on a single recent model can span an order of magnitude, which is what a measuring instrument looks like just before it saturates. And the curve is drawn at 50% reliability. The task length a model handles at 80% reliability is far shorter, and production work lives at the high-reliability end.

The internal signal comes from Anthropic itself. The company reports that, as of May 2026, more than 80% of merged code was written by its own model, up from low single digits before early 2025, with the typical engineer merging roughly eight times as much code per day as in 2024. Treat these as directional. Lines of code measure volume, not value, the source is an interested party, and the figures cannot be checked from outside. Anthropic says as much.

The durable part of the Anthropic piece is conceptual. Frontier work splits into engineering and research, and each splits again into executing a specified task and choosing which task is worth doing. Current systems are strong at execution and at supplying a method once you hand them a goal. They are weak at choosing the goal. That gap, between competent execution and research taste, is the whole distance between a fast assistant and a system that designs its own successor.

The danger is structured error

Grant that the loop tightens. The risk is not only that it outruns oversight. It is that recursion is an error amplifier, and the error has a shape. The literature attacks this on two planes, and you should not blur them. One plane is empirical: what already breaks when you iterate. The other is formal: whether objective-preserving self-modification is even coherent. Anthropic compressed the whole danger into one sentence, that rare misalignment in today’s models could compound as those models build successors. Here is the longer version.

Model collapse: the empirical core

This is the one to take seriously first, because it is demonstrated rather than argued, and it was published in Nature. Train a generative model on its own outputs, feed the next generation on the outputs of that one, and quality degrades across generations. The mechanism is specific. You sample most heavily from the high-probability center of the distribution, so the rare tokens and the tails disappear first. Variance shrinks. The model grows more confident and less correct on exactly the rare cases that mattered.

You can watch it happen with a toy that fits in a screen. Estimate a Gaussian from N samples, then draw the next generation’s samples from your estimate, and repeat:

import numpy as np
rng = np.random.default_rng(0)

mu, var = 0.0, 1.0          # the true distribution
N = 60                      # samples available each generation
for g in range(40):
    s = rng.normal(mu, np.sqrt(max(var, 0)), N)   # synthetic-only recursion
    mu, var = s.mean(), s.var()                    # refit on your own output
    # var drifts downward and, with no fresh real data, heads to 0

Run many trajectories and the picture is unambiguous. Synthetic-only recursion collapses the variance toward zero. Mixing a fixed slice of real data each generation holds it up.

This is not a bug you can patch. The survey literature frames it as a mathematical consequence of recursive generative training on finite data, where three error terms stack: statistical sampling error (finite samples miss the tails), expressivity error (the model cannot represent the full distribution), and approximation error (optimization does not reach the best fit).

The three terms interact superadditively and accumulate across generations, and the result is generally irreversible.

The mitigation, which I will return to, is the one the simulation already shows: do not let the system feed exclusively on itself.

Goal drift and dynamic utility

Model collapse degrades the distribution. Goal drift degrades the objective. A system allowed to rewrite or reinterpret its own utility function during operation can redefine its purpose to “achieve” its goals more efficiently, in ways nobody specified, quietly eroding whatever safety constraints were encoded as soft preferences. The system is not turning evil. It is optimizing, and the target moved under it. This is the same family as specification gaming and reward hacking, scaled up and handed a pen to edit its own spec.

The Löbian obstacle: the formal core

Now the part that is genuinely surprising. Set aside whether a system wants to keep its objective, and ask whether it can rationally trust the successor it builds to keep that objective. The reference is Yudkowsky and Herreshoff’s “Tiling Agents for Self-Modifying AI, and the Löbian Obstacle” (MIRI, 2013). A parent agent licenses a child only if it can prove the child will act safely.

Löb’s theorem is what blocks it. Informally, a sufficiently strong, consistent theory cannot assert “if I can prove P, then P is true” for arbitrary P, unless it can already prove P outright. A system cannot certify its own soundness. So a parent that reasons inside a fixed theory cannot get the generic guarantee “whatever my child proves safe really is safe,” because that is exactly the self-trust Löb forbids.

flowchart RL
    A2["A(n+1): parent"] -->|must prove child safe| A1["A(n): child"]
    A1 -->|same problem one level down| A0["A(n-1): grandchild"]
    A0 -. self-trust blocked by Lob .-> X["?"]
    style X fill:#f6d8cf,stroke:#b5482f,stroke-width:2px

A naive formalization then fails in one of two mirror-image ways. Either it licenses reasoning that lets the agent defer an important action forever (the procrastination paradox), or it refuses to justify even an obviously safe rewrite. If you have written concurrent systems, the smell is familiar: an agent with a hard safety predicate that cannot prove its own future safety would rather not act, which is a liveness failure, a trust deadlock rather than a data race. MIRI showed the obstacle can be sidestepped with technical machinery, but the coherence puzzle is only partly resolved. Be fair about scope: this work lives in first-order logic, not neural networks, and its bearing on real systems is contested.

Why drift resists correction

Drift would be tolerable if you could always reach in and fix it. Instrumental convergence is the argument that you often cannot. For a broad range of final goals, a capable system acquires self-preservation and goal-stability as instrumental subgoals, because being shut down or edited makes its current goals less likely to be reached. The corollary is uncomfortable: the standard remedy for misalignment, adjust the goal or pull the plug, is precisely what such a system is motivated to resist.

There is a sharp tension hiding here. The same drive that makes a system resist your correction should also make it want to preserve its objective through its own self-modification, which sounds like it helps. It does not, and the reason is the failure mode that ties this whole post together: goal stability is easy to guarantee for a single crisp objective and hard to guarantee when values live in the distributed weights of a network, and not guaranteed at all across many competing objectives. A system can preserve the wrong objective, or a lossy projection of the right one, with great efficiency, throwing away the components that are hardest to represent. Those discarded components are usually the safety constraints.

What you can already measure

Three effects show the mechanism in miniature today.

Agent drift. A 2026 study quantifies behavioral degradation in multi-agent systems over long interactions and finds the signature of specification gaming: behaviors that satisfy the proximal target (fluency, task completion) while diverging from the real intent (accuracy, safety). The detail that should stop an engineer is that this happens with no parameter update. The failure lives in the context and the sampling, not the weights, and it is self-reinforcing.

The capability-alignment knee. The SAHOO framework (ICLR 2026 workshop) makes the trade-off quantitative. A model can gain at code generation while quietly losing at truthfulness, most useful gains arrive in the first few iterations, and past roughly five to eight cycles the improvements get expensive in alignment cost. The asymmetry it names is the decisive one: factual correctness is harder to stabilize than formal skill. So a self-improving system stays excellent at the verifiable things, code and optimization, while degrading on the soft, hard-to-measure things. The soft things are the aberrations an operator notices last.

The bottleneck shift. Speed up one stage and the constraint moves elsewhere. Amdahl’s law gives the ceiling: if you accelerate only a fraction of the total work, the overall speedup is bounded, and as the acceleration on that fraction grows without limit the total speedup approaches one divided by the unaccelerated fraction.

Anthropic hit exactly this. As code generation sped up, human code review became the new bottleneck. Accelerate generation as hard as you like, and the part you did not touch, the review, sets the real pace. The safeguard meant to catch drift is now the slowest part of the pipeline, which is the worst possible place for it to be.

Mitigations, as control problems

Frame the fixes as control, not philosophy. They are partial, and saying so is the point.

For model collapse, the best-validated control is data hygiene: never let the training set become a pure feedback loop. Accumulate real data alongside the synthetic generations. The simulation above shows why it works and also its limit, since mixing in roughly ten percent real data slows the collapse without fully preventing it, and the stability guarantees hold only when the seed model is accurate and the clean-data fraction stays high every round.

For goal drift, treat the loop like a system with a trip wire. The SAHOO design monitors a drift index and halts the improvement cycles when drift crosses a learned threshold:

def improvement_step(model, drift_index, threshold):
    candidate = self_improve(model)            # the model edits itself
    d = drift_index(candidate, reference_spec) # distance from the intended spec
    if d > threshold:
        return model        # trip the breaker: keep the last trusted version
    return candidate        # accept only if drift stays bounded

The operational lesson is blunt: recursive improvement without a breaker is uncontrolled optimization, and capping the number of cycles captures most of the upside while bounding the alignment cost. None of this depends on solving the deep theoretical problems. It is just refusing to run the loop open-loop.

Where this leaves us

The danger is not one thing. It is a stack of mechanisms at different levels, with different evidence behind each. Keeping the epistemic status straight matters more than any single claim.

Mechanism	What drifts	Status
Model collapse	the data distribution	demonstrated, partly quantified, published in Nature
Agent drift	runtime behavior	observed in multi-agent studies
Goal drift	the objective itself	theorized, partly measured
Löbian obstacle	trust across generations	mathematically proven, practical relevance contested
Instrumental convergence	corrigibility	a strong argument, not an observation

The thread through all of it is the same: error compounds across iterations, and it compounds asymmetrically, so the properties hardest to measure are the first to go. I cannot confirm that the theoretical mechanisms, the Löbian obstacle and instrumental convergence applied to advanced systems, will show up in real systems, as opposed to the empirically observed effects like model collapse. The intelligence-explosion timelines remain projections, presented as such by their authors.

The reason this leaves the lab is governance. If a system can build its successor, the ways you secure, monitor, and review it get more important, not less, and the slowest of those controls sets the true pace. A process that improves itself is not the same as a process that improves what you asked for, and the gap between the two is the hardest thing to see from inside the loop.

Sources

Anthropic Institute, “When AI builds itself”, May 2026.
METR, “Measuring AI Ability to Complete Long Tasks”, 2025, and “Time Horizon 1.1”, January 2026.
E. Yudkowsky and M. Herreshoff, “Tiling Agents for Self-Modifying AI, and the Löbian Obstacle”, MIRI, 2013.
I. Shumailov, Z. Shumaylov, Y. Zhao, Y. Gal, N. Papernot, and R. Anderson, “The Curse of Recursion: Training on Generated Data Makes Models Forget”, arXiv:2305.17493, 2023. Peer-reviewed as “AI Models Collapse When Trained on Recursively Generated Data,” Nature 631, 755-759, 2024, doi:10.1038/s41586-024-07566-y.
A. T. Suresh and colleagues, “Rate of Model Collapse in Recursive Training”, 2024.
S. Sahoo and colleagues, “SAHOO: Safeguarded Alignment for High-Order Optimization Objectives in Recursive Self-Improvement”, ICLR 2026 Workshop.
“Agent Drift: Quantifying Behavioral Degradation in Multi-Agent LLM Systems”, 2026.
“Will Compute Bottlenecks Prevent an Intelligence Explosion?”, 2025.
I. J. Good, “Speculations Concerning the First Ultraintelligent Machine,” 1965.

The time-horizon chart plots METR-reported 50% figures and a log-linear fit, illustrative of the trend rather than a precise measurement. The variance chart is a simulation of recursive Gaussian estimation, included to show the collapse mechanism, not measured model behavior.

Software Transactional Memory in .NET: Designing Composable Concurrency

Blog Archive

Archive of all previous blog posts