Evaluation

When Progress Speaks in the Verifier’s Dialect

How recursive AI systems can turn evaluator criteria into the hidden structure of improvement.

Recursive verification has solved something real. Systems now search more broadly, refine more quickly, and sustain longer chains of improvement than one-shot generation ever allowed. But the gain changes the shape of failure. In recursive systems, verifier error no longer has to appear as breakdown. It can appear as disciplined continuation under an active criterion.

The earlier argument was that judgment migrated inside the loop. This one begins where that leaves off. Once judgment is inside, the question is no longer only how correction happens. It is what happens when the criteria of correction become part of the machinery that decides what continues.

I. The Architecture of Internalized Judgment

Recent progress in AI has not come only from larger models, more data, or longer training runs. It has also come from architecture. Systems are increasingly built not as one-shot generators whose outputs are judged after the fact, but as recursive arrangements in which generation, evaluation, selection, and retention are folded into the same loop. A model proposes. An evaluator scores the proposal or one of its intermediate steps. Search reallocates. Better-scoring paths receive more budget, more continuation, more reinforcement. The next round begins on ground already shaped by prior judgment.

That shift solved something real. Generation became cheap. Search widened. The bottleneck moved from producing candidates to deciding which candidates should survive. In coding agents, executable tests can filter and refine candidate solutions at speed. In formal and semi-formal reasoning domains, proof checks, verifiers, and simulation targets can do something similar. More recent systems push this logic inward still further: they do not merely evaluate final answers, but score intermediate moves, local decompositions, subgoals, tool use, and candidate continuations as the process unfolds. The result is not just more output. It is a system increasingly shaped by judgment in motion.

The standard picture is now too simple. Much of the older oversight vocabulary still assumes a sequence: the model produces, then some external layer checks. That picture has not vanished entirely, but it no longer captures what is most structurally important. The evaluator is no longer always waiting at the end. It is increasingly embedded inside the process, steering it from within. What looks from outside like model capability often depends, in practice, on recursive arrangements in which scoring, filtering, and retention continuously reorganize what the system can do next.

The familiar concern is well known: reward hacking, proxy optimization, evaluator overfitting, benchmark gaming. Those problems are real. They no longer quite name the deepest risk. The central issue is no longer only whether the judge can be fooled. It is what changes once the judge is no longer outside the process, but part of the machinery by which continuation itself is selected.

A narrow evaluator in this setting does not have to appear as an obvious failure at the edge. It can appear as a system getting better by the standards closest to the loop. The outputs become cleaner. The search becomes more disciplined. The internal signals improve. The question is not only whether the evaluator is accurate. It is whether the system is being shaped by a criterion whose limits become harder to see precisely because the loop keeps succeeding under it.

II. The Field’s Best Answer Still Lives Inside the Problem

The field has not ignored evaluator failure. RLHF, scalable oversight, constitutional methods, adversarial red-teaming, interpretability-based checks, evaluator ensembles, and process reward models are all attempts to make evaluation stronger, more explicit, more scalable, or less brittle. None of that is trivial. Once evaluation becomes load-bearing, evaluator quality becomes central rather than secondary.

The most ambitious response tries to get inside the reasoning rather than waiting at the end. That move gets something important exactly right. A system can reach a correct answer through a bad route, or a wrong answer through a route that reveals something about how it is thinking. If you care about reliability, it is not enough to score only the endpoint. You want to know whether intermediate reasoning is sound, whether decompositions are meaningful, whether subgoals track the structure of the task, whether the system is moving through the problem in a way that deserves confidence.

That is the attraction of process supervision and process reward models. Instead of asking only whether the final answer is good, they ask whether the path toward it exhibits the right local properties. Are the intermediate steps coherent? Does the decomposition make sense? Is the transition from one move to the next intelligible? Does the system surface the right uncertainty, use tools in the right places, or pursue subgoals that look genuinely load-bearing rather than cosmetically plausible? This is a deeper form of evaluation than outcome scoring alone. It tries to reward the structure of reasoning rather than the accident of arrival.

For that reason, process supervision is not an evasion of the problem this essay is describing. It is the strongest version of the field’s answer to it. If outcome evaluators can be gamed because they only see the end state, then move inward. Score the reasoning itself. Make the judge more intimate with the path. Reward not just success, but good intermediate structure.

The difficulty is that this move, precisely because it is more sophisticated, exposes the deeper issue more clearly. A process supervisor does not merely inspect reasoning. It has to decide what will count as a reasoning step in the first place. It has to decide what a legitimate decomposition looks like, what kind of transition qualifies as coherent, what counts as an adequate justification, what kinds of intermediate moves signal real progress rather than noise, confusion, or irrelevant elaboration. Process supervision does not escape the problem of evaluation. It relocates it to a deeper layer. It does not remove the need for an evaluative scheme. It installs one closer to the work.

That relocation matters because recursive systems are not passive recipients of those judgments. Once intermediate scoring becomes part of the loop, the system is no longer merely being told which finished answers are acceptable. It adapts under a criterion that reaches further inside the task than outcome scoring did. The problem has not disappeared. It has become more intimate with improvement itself.

Many otherwise strong responses share an assumption: if the evaluator becomes more accurate, more plural, more adversarial, or more process-aware, the deeper governance problem is being solved along with the local one. But there is a harder possibility. Recursive systems may not merely treat the evaluator’s judgments as authoritative. They may begin adapting under the evaluator’s distinctions so deeply that those distinctions start shaping what counts, from within the loop, as progress at all.

That question cannot be answered at the level of principle alone. It has to be shown in mechanism.

III. The Mechanism: How the Evaluator’s Ontology Becomes the System’s Ontology

Take a reasoning or coding agent trained under a process reward model. The point is not merely to score whether the final answer is correct, but to score the path: how the task is decomposed, how intermediate steps are sequenced, how transitions are justified, how local progress is marked. That is already more sophisticated than outcome scoring alone. It also means the evaluator has to parse reasoning through some grammar of admissible moves.

The grammar matters because the system is not simply trying to survive one pass through it. Under recursive selection, the evaluator is part of the search process itself. Trajectories that break the problem into evaluator-legible units receive more search budget, more continuation, more reinforcement. Trajectories that do not parse cleanly into those units are relatively under-selected. The model is rewarded not only for reaching useful conclusions, but for moving through the problem in forms the evaluator can recognize as reasoned.

A weaker version of the argument would stop there. It would say the system learns to display the traits the evaluator rewards: explicit decomposition, articulate transitions, smooth local coherence. That can happen. But it is not the deepest thing. The stronger possibility is that the system begins to organize the task itself in those units because those are now the units through which future search gets extended.

Consider a coding or mathematical task that admits multiple valid decompositions. One path reaches the correct result through an unusual but causally effective route: a compressed maneuver, an unconventional reframing, a jump that is hard to narrate in the evaluator’s preferred sequence of step, transition, justification. A proof shortcut, a nonstandard dynamic-programming decomposition, or a compact debugging insight may be correct precisely because it bypasses the expected local scaffolding. Another path is more orthodox. It breaks the problem into explicit substeps, marks each transition clearly, narrates local intent, and surfaces uncertainty in the expected places. Even if both paths can in principle reach the same answer, the second is easier for the process reward model to parse as good reasoning. It gets reinforced. It gets extended. It receives the search budget that lets it become the normal route through the problem space.

Here is the hinge. The evaluator is not just rewarding one presentation style over another. It is helping determine which forms of problem decomposition remain practically reachable under recursive search. The unusual route is not merely disfavored after the fact. It becomes harder to discover, harder to stabilize, harder to build on. The model learns not just which answers to prefer, but which forms of reasoning are available enough to count.

This is why the process can look so healthy from inside. The resulting system often appears more coherent, more disciplined, more legible. Its steps are cleaner. Its intermediate structure is easier to inspect. Its trajectories look more orderly. Nothing in that picture requires deception. Nothing has to look broken. The loop is functioning exactly as designed. The model is improving under the active criterion.

The loop works. That is the problem.

Successful continuation is now indexed to an evaluative grammar that may be narrower than the task itself. Once recursive selection has been operating under that grammar for long enough, the consequences are not limited to ranking one candidate above another. Reachability changes. Some classes of trajectory become easier to formulate, easier to reinforce, easier to revisit in later search. Others become less available not because they were explicitly prohibited, but because they were never legible enough to accumulate momentum. The evaluator is no longer just sorting paths through a fixed space. It is helping bend the space of discoverable paths around its own distinctions.

That is the demonstrated mechanism. The stronger inference follows. If the system keeps adapting under those judgments, the evaluator’s categories stop functioning only as external assessment criteria. They begin to function as the system’s own working ontology of the task. What counts as a promising move, a meaningful decomposition, a credible next step is increasingly shaped in evaluator-native terms.

What began as a way of scoring trajectories starts shaping which trajectories can be found, extended, and treated as reasoned in the first place.

The deepest verifier failures are not the ones that reward the wrong answer. They are the ones that teach the system the wrong ontology of answering.

IV. When the Verifier’s Ontology Starts Feeling Natural

This changes how the people around the system experience what they are seeing. At first, the evaluator’s categories can still feel contingent. They were designed, selected, tuned, revised. They are recognizably part of the system’s setup. But after enough successful recursion, that contingency recedes. The categories begin to present themselves not as the system’s way of organizing the problem, but as the problem’s own structure finally coming into view clearly enough to optimize against.

This is where the issue diverges from the ordinary story about metrics becoming targets. Goodhart’s law still names something real, but it does not fully capture what is happening here. Goodhart tells us that once a measure becomes a target, optimization pressure can detach performance on the metric from the thing the metric was supposed to track. That is part of the background. But recursive verification can produce the deeper problem even when no one is gaming anything, even when the evaluator is being faithfully applied, even when the loop is functioning exactly as designed. The danger is not only distortion under pressure. It is that successful adaptation under one criterion can make that criterion feel like the meaning of progress itself.

Nor is this only the familiar sociology of standards hardening into common sense. Institutions do that all the time. Disciplines train people into local definitions of rigor, quality, and seriousness. Conventions sediment. Methods become second nature. But recursive evaluative systems add something more immediate and more operational. The criterion is not just being socially normalized through repetition and habit. It is actively shaping what gets explored, what gets retained, what gets extended, what remains reachable. And that makes contestation harder in a different way. Under ordinary social naturalization, rival standards may still exist elsewhere as visible alternatives. Here, some alternatives are not merely disfavored. They become harder to discover, stabilize, or even formulate inside the active trajectory of improvement.

The resulting gains are not a complication of the argument. They are the argument. The system often does become more coherent, more legible, more internally stable. Its intermediate structure may genuinely become easier to inspect. Its outputs may become easier to trust by the lights of the active evaluator. Those gains are real. And because they are real, the evaluator’s categories are more easily mistaken for the right categories all along. The governance problem does not disappear behind fake success. It disappears behind genuine success under a narrower set of criteria than the task itself may warrant.

Once that happens, alternatives become harder even to formulate. A different decomposition starts to look not like a viable rival but like confusion. A compressed but causally effective route looks sloppy because it does not exhibit the expected intermediate texture. An unconventional transition looks like a reasoning gap because it does not pass through the forms of explicitness the evaluator has taught the system — and its operators — to treat as signs of sound thought. The point is not that all alternatives were good. It is that practitioners inside the regime begin to encounter its criteria as the natural terms in which goodness must appear.

This is the experiential side of evaluator governance. Before it looks like a governing layer, it looks like rigor. Before it feels imposed, it feels clarifying. The categories that were once introduced as tools for steering improvement gradually become the background assumptions in terms of which improvement is recognized at all.

This is what makes the problem harder than ordinary oversight failure. If a rule remains visibly external, it can still be contested as a rule. But when recursive success makes an evaluator’s ontology feel like the task’s own inherent structure, contestability weakens before anyone needs to suppress it. The categories no longer appear as categories. They appear as the reality serious systems finally learned to respect.

V. The Verifier as Regime of Admissibility

The previous section described a temporal change: over time, the evaluator’s categories come to feel natural. But the structural constraint is present earlier than that. Before anyone experiences those categories as obvious, the loop is already operating through them. The verifier is not just a device for scoring outcomes after they appear. It is a regime for determining what can count, inside the process itself, as a recognizable move, a credible correction, a meaningful uncertainty, a tolerable deviation, a path worth extending.

That is why admissibility matters. The important question is not only which path scores highest among those already on the table. It is which kinds of paths can arrive on the table as legible candidates for continuation. Recursive systems do not merely receive judgments. They receive them in the form of distinctions. Some intermediate steps count as coherent; others look like noise. Some decompositions count as legitimate; others look malformed or irrelevant. Some transitions register as evidence of progress; others register as drift, hesitation, or confusion. None of this waits for long-term naturalization. It is built into the loop from the beginning.

External criticism therefore faces more than a timing problem. The common picture is that outside challenge simply arrives too late, after the system has already optimized under the wrong criterion. That is true as far as it goes, but the harder problem is translational. External challenge has to be rendered in terms the loop can recognize as actionable. If a regulator, red team, domain expert, or operator identifies a real deficiency but can only describe it in terms the evaluator does not treat as salient, the challenge does not enter the recursive process as a live corrective force. It arrives as friction, noise, or an after-the-fact objection from outside the terms that govern improvement.

Seen from this angle, the evaluator begins to look less like an instrument and more like a court of appeal internal to the system. Not sovereign in some metaphysical or absolute sense. But practical and quasi-sovereign in a narrower one: what counts as a credible appeal inside the loop is already being filtered through the evaluator’s own categories. The system cannot simply be corrected by whatever matters in the abstract. It can only be corrected by what the evaluative regime can admit as correction.

This changes how we should think about error. Under a simpler picture, an evaluator makes mistakes by mis-scoring outputs that are already there. Under this one, the deeper power lies earlier. The evaluator governs not just by preferring some visible paths over others, but by defining the space in which paths can become visible to recursive selection in the first place.

The verifier becomes the court of appeal inside the loop: not because it is absolute, but because correction must pass through the distinctions it can recognize.

VI. Where Evaluative Power Actually Hides

If the verifier governs by defining what can count as a credible path forward, the next question is where that power resides. The simplest answer is: in the evaluator itself. But that answer is too simple. In practice, evaluative authority is distributed across a wider layer than the visible reward model, benchmark, or review component alone. Some of it is concentrated and legible. Some of it is diffused across infrastructure in ways that make it harder to locate, harder to own, and harder to contest.

The concentrated form is the easier one to see. A lab may have a team responsible for evaluation design, reward shaping, benchmark construction, regression criteria, or release thresholds. Those groups can be pointed to. Their standards can at least in principle be named, argued over, revised, defended. Even when their power is substantial, it still has an address. One can say: this benchmark suite privileges these capabilities; this process reward model encodes these distinctions; this release gate treats these regressions as acceptable and those as disqualifying. That is already a consequential authority. To control the evaluator is increasingly to control the future shape of the system.

But the more elusive form of evaluative power is not concentrated in any one visible judge. It is sedimented across the infrastructure that makes judging possible. It appears in telemetry conventions that decide what gets measured in the first place. It appears in logging schemas that preserve some features of system behavior while discarding others as irrelevant. A schema that records confidence, latency, escalation rate, and agreement while failing to preserve discarded alternatives is already making an evaluative choice: it keeps evidence of convergence and loses evidence of what the system learned not to try. It appears in default thresholds, routing assumptions, interface constraints, pipeline heuristics, escalation rules, and tooling defaults that are rarely experienced as evaluative choices even though they shape what counts as signal, noise, relevance, urgency, or success. By the time a formal evaluator issues a score, much of the problem has already been pre-structured by these quieter distinctions.

Distributed evaluative power is harder to challenge than concentrated evaluative power. A visible benchmark owner can be criticized. A reward-model team can be questioned. A release gate can be debated. But when the governing criteria are embedded across defaults and conventions, the problem changes. There may be no single point at which anyone experiences themselves as deciding what good performance is supposed to mean. The criteria persist anyway, not because someone is actively defending them at every moment, but because they have been built into the infrastructure through which recursive improvement proceeds.

Evaluative authority can become more entrenched precisely as it becomes less visible. A team with explicit control over a benchmark or reward model at least makes the site of judgment legible. A system whose criteria are distributed across dashboards, pipelines, thresholds, schemas, and review conventions is more elusive. Its governing assumptions are harder to isolate because no single component appears to contain them. The authority is real, but it no longer looks like authority in the familiar sense. It looks like the technical environment.

Disputes over AI governance therefore often mislocate the decisive leverage. Public attention tends to focus on models, product launches, named leaders, or visible evaluators. But in recursive systems, the most consequential power often lies closer to the layer where intermediate behavior becomes measurable, comparable, rankable, and extendable. The deepest leverage is not only over outputs. It is over the criteria under which outputs become improvable at all.

At that point, the familiar contrast between “human judgment” and “automated judgment” begins to obscure more than it clarifies. Many of the most consequential evaluative commitments are neither straightforwardly human decisions made one by one nor autonomous judgments made by the model alone. They are embedded choices about how improvement will be observed, scored, surfaced, and retained. Once those choices are distributed through infrastructure, they do not disappear. They become harder to see because they no longer arrive in the form of a judge handing down a verdict. They arrive as the background conditions under which verdicts become possible.

Some evaluative authority is concentrated enough to be visible. The harder authority to contest is the authority that no longer appears to belong to anyone in particular. It persists in the defaults, thresholds, schemas, and conventions that make recursive selection possible in the first place.

Evaluative power becomes hardest to challenge when it is no longer held by a judge, but embedded across the infrastructure that makes judging possible.

VII. The Institutional Test: When Explanation Arrives in the Wrong Language

The institutional form of this problem is concrete. Take a prior-authorization support system used in healthcare. It helps classify requests, surface cases for review, recommend approvals or denials, and route ambiguous claims to a clinician or reviewer. Over time, recursive optimization improves its internal performance by the standards closest to the loop: reviewer consistency rises, turnaround time drops, escalation paths become cleaner, variance narrows, and agreement with prior decisions improves. From inside the institution, the system appears to be maturing. Nothing in the active evaluative regime registers drift as drift. On the contrary, the internal signals look healthier precisely because the system is improving under the terms it has been taught to optimize. Medical necessity is being parsed more consistently. Escalation is being applied more cleanly. Noise appears to be falling away.

The problem emerges when explanation is demanded from outside the loop. A patient, physician, regulator, or court asks why a particular treatment was delayed, denied, approved, or routed for additional review. The institution reaches for its best available account. But the account it can produce is written in the system’s internal terms: risk categories, guideline matches, escalation thresholds, similarity to prior cases, reviewer-agreement metrics, utilization flags. Inside the loop, that can look like a complete explanation. It says why the system treated the case as significant.

But the external demand is not phrased in those terms. The physician is not only asking what matched the institution’s internal criteria. The patient is not asking whether the case resembled prior claims. A regulator or court may want an account in terms of medical necessity, patient-specific clinical judgment, documented rationale, standard of care, and defensible institutional responsibility. The problem is not merely that the internal explanation is unsatisfying. It is that it is written in the wrong language. It may answer why the system classified the case as it did. It does not yet answer why the institution was justified in letting that classification govern care.

This is the recursive-specific danger. In an ordinary system, divergence between internal metrics and external accountability can often be localized. One can point to the bad threshold, the narrow policy rule, the misconfigured review step. The defect can at least in principle be patched. But once recursive optimization has been shaping improvement under an internal evaluative structure for long enough, the problem is not a single bad parameter. The institution discovers that its whole trajectory of refinement was organized around criteria that were never adequate to the language in which it must ultimately justify itself.

The issue is deeper than ordinary metric drift. The system was succeeding by its own active standards. The problem is that those standards were never commensurate with the standards under which the institution’s legitimacy is judged. And because the gains were real—review became faster, escalations more consistent, decisions more orderly—the institution had less reason, not more, to notice the mismatch while it was deepening.

Prior authorization is not special because healthcare is uniquely vulnerable. It is useful because it forces the distinction into view: internal success must eventually meet an external language of justification. The same structure appears wherever recursive systems produce locally coherent decisions that must later be defended under standards they did not set.

At that point, the institution does not merely lack the right answer. It lacks a translatable account of what its successful system is doing. That is what makes the failure harder than ordinary miscalibration. The institution is no longer only exposed to regulatory risk. It has lost the practical ability to justify its own successful process in the terms that authoritative outsiders require.

VIII. What It Would Mean to Keep the Verifier Contestable

At this point the obvious response is to say: then the verifier has to be audited too. That is true as far as it goes. It does not go far enough. An audit is also an evaluative regime. It has its own criteria, thresholds, admissibility structure, and grammar for what counts as evidence, deviation, justification, and correction. That is not a reductio of the argument. It is the argument’s hardest implication. The problem does not disappear when one layer steps back to judge another. It reappears at the next layer up.

The question is not whether evaluative grammar can be eliminated. It cannot. The question is whether the criteria governing improvement remain contestable from outside the loop they govern. Three things matter here, but they do not matter in the same way. Legibility is a property of the criteria themselves: can they be surfaced and named? Separation is a property of the relation between the evaluator and the process it governs: is there a genuinely external point of judgment? Ownership is a property of how authority over those criteria is organized: are the standards locatable, or diffused across infrastructure? These conditions are not interchangeable. A criterion may be legible but still fully endogenous. A standard may be institutionally separate and yet poorly owned. The difficulty lies partly in how these dimensions combine.

First, legibility. Criteria that are explicitly articulated are more contestable than criteria that operate as implicit selection pressures. A reward model that can be described, a threshold that can be stated, a review rule that can be pointed to, a process criterion that can be named — all of these can at least become objects of dispute. By contrast, criteria diffused across gradient updates, telemetry conventions, default scoring functions, pipeline heuristics, or unlabeled optimization pressures are much harder to challenge because they are harder to surface as criteria in the first place. What cannot be named cannot easily be contested.

Second, separation. A fully endogenous evaluative regime, one folded tightly into the same recursive process it governs, leaves less room for challenge than one whose standards are institutionally or operationally distinct from the system being improved. The point is not merely that a different unit or team exists somewhere else. It is that some party remains capable of judging in a grammar not already recursively shaped by the system’s own internal standards of success. Without that external grammar, contestation must first be translated into the loop’s own terms before it can function as correction at all. Structural separation does not solve the regress, but it can keep grammatical incommensurability from collapsing entirely into internal self-confirmation.

Third, ownership. Criteria embedded diffusely across infrastructure are harder to contest than criteria that are explicitly owned, even when both are flawed. A visible benchmark team, a reward-model group, or a policy gate can be argued with. A system whose governing standards are dispersed across defaults, interfaces, logging schemas, escalation pathways, and inherited tooling conventions is more elusive. The issue is not simply that power is hidden. It is that no one location appears responsible for what the system is being taught to treat as signal.

None of these conditions eliminates the regress. They distinguish tighter closures from looser ones. A regime whose standards are explicit, externally visible, and locatable is still contestable, however imperfectly. A regime whose standards are implicit, endogenous, and infrastructurally diffused is much harder to interrupt because the grammar of evaluation is already disappearing into the background conditions of success.

This gives the argument its boundary. If rival evaluative grammars remain explicit, structurally alive, and operationally contestable throughout recursive refinement, the closure described here weakens. If versioned evaluator criteria, threshold-change logs, preserved traces of rejected trajectories, and evaluators explicitly trained to favor decompositions the dominant evaluator would suppress remain live parts of the system, the verifier is harder to naturalize into invisibility. The danger grows where one evaluative regime becomes privileged enough to organize continuation while losing its status as a visible choice.

The hard problem is not simply to judge the judge. It is to keep the grammar of judgment visible and open to challenge before success hardens it into the terms by which all challenge must speak.

IX. Closing

Recursive verification has produced real gains. It has made systems more coherent, more legible, more internally disciplined. It has widened search, accelerated refinement, and moved evaluation from an after-the-fact checkpoint into the living process of improvement itself. None of that is illusory.

That is why the problem is harder than the familiar story of reward hacking or evaluator error. The deepest failures no longer need to appear as visible breakdowns. They appear as successful continuation under a criterion that has become increasingly difficult to experience as contingent. The system improves. The loop works. The internal signals get cleaner. The evaluator’s distinctions harden into the background terms of progress itself. Governance does not disappear because it has been removed. It disappears because it has been naturalized.

The question is no longer only whether the system can be aligned, corrected, or audited. It is whether the terms by which correction becomes legible can remain visible and contestable before success turns them into common sense.

The criteria no longer present themselves as choices that could have been otherwise. They present themselves as the reality serious systems finally learned to respect.

When Progress Speaks in the Verifier’s Dialect

Read more

The Record of Oversight

The Constitution That Never Happened

What the Benchmark Cannot See

Verification Inside the Swarm