Essay · 2026
You're Not Comparing Models. You're Comparing Contracts.
Agent benchmarks don't measure models. They measure contracts. Two teams running the same model can publish different scores, and both can be honest.
Two teams publish scores on the same agent benchmark.
One lands in the low sixties. The other clears seventy.
A procurement team reads the spread and makes a call.
What they do not see: both teams may be running the same model. They did not need to change the weights for the gap to appear. The spread can come from scaffold alone.
One team wrapped the model in a harness with better retries. Different tool defaults. A planner step the other team had skipped. None of that appears on the leaderboard.
The comparison that drove the decision was not between two agents.
It was between two contracts.
There Is No Benchmark
The mistake hiding behind this story is a category error.
People talk about agent benchmarks as if they measure a thing called “the model.” They do not. They measure a coupled system. The model is one component. The rest is a stack of protocol decisions that are almost never disclosed and almost always matter.
The score is the output of that stack. Change any layer and you change what the number means.
Recent research on agent evaluation has named those layers explicitly. There are at least seven. Deployment regime. Observation channel. Harness and scaffold. Metric and action. Configured evaluator. Grader protocol. Audit bundle. Each is a contract. Each is negotiable. And each can silently change the verdict while the headline looks the same.
That is what a benchmark actually is. Not a measurement of a model. A measurement of an entire testing contract, of which the model is one slot.
There is a structural reason the seven layers are the seven layers. They cluster into three corners that show up in almost every published agent-evaluation failure. What the model is rewarded for. How that reward is optimised. And how the test contract differs from production. Once you hold those three corners in view, the seven-layer stack stops feeling like a checklist and starts behaving like the actual shape of what is being measured.
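To make that concrete, here is a minimal sketch of the contract as a single object: the model slot plus the seven protocol layers. Everything in it is illustrative. The `EvalContract` type and its field names are inventions for this essay, not an API from any of the cited papers.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalContract:
    """One field per slot. The model fills exactly one of them."""
    model: str                # the only slot most comparisons ever name
    deployment_regime: str    # e.g. "one-shot", "k-retries", "human-in-loop"
    observation_channel: str  # what the agent sees: raw env, summarised logs, ...
    scaffold: str             # harness version: prompts, retries, planner, tools
    metric: str               # e.g. "pass@5" or "pass^5", and the action it licenses
    evaluator: str            # configured evaluator: judge model + decoding settings
    grader_protocol: str      # judge template, tie and abstention policy
    audit_bundle: str         # pointer to trajectories, rubric, judge card

# Two honest teams, same model, different contracts:
team_a = EvalContract("m-large-v2", "k-retries", "full-env", "harness-v3+planner",
                      "pass@5", "judge-x@t=0.0", "rubric-v2", "s3://bundles/a")
team_b = EvalContract("m-large-v2", "one-shot", "full-env", "harness-v1",
                      "pass@1", "judge-x@t=0.7", "rubric-v1", "s3://bundles/b")
```

Hold the `model` field fixed and every other slot can still move. That is the opening story.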
If you are comparing agent products without parity across those layers, you are not comparing agents.
You are comparing contracts and calling it science.
The Harness You Didn’t Name
The most visible layer, and the one that moves the most points, is the scaffold.
Anyone who has built an agent in the last year has felt this without naming it. You watch a coworker get 75% on a task your model just failed on. You check the weights. They are yours. They changed the prompt template and added a retry loop. The model did not get smarter. The scaffold got thicker.
The numbers say the same thing. RWE-bench reports that on its 162-task benchmark over MIMIC-IV, changing only the agent scaffold around a fixed model can shift performance by more than 30 percent[1]. Same weights. Different tools. Different retry policy. Different planner. Different headline. The best agent evaluated on that benchmark reaches only around 40 percent task success; the best open-source configuration is closer to 30. Once you know the contract can move 30 points on its own, neither of those numbers is really about a model.
If scaffold alone can move scores by double digits, then “we used the same model as them” is not a fair-comparison claim. It is a parameter-naming claim.
You have named one slot in the contract: the model.
The seven protocol layers around it are doing most of the work.
The Judge That Isn’t The Model
The second layer that silently moves scores is the evaluator itself.
When a benchmark uses an LLM judge, people write things like “graded by GPT-4o” as if that pins the measurement down. It does not. The judge is not GPT-4o. The judge is GPT-4o plus a prompt template. Plus a decoding configuration. Plus a tie and abstention policy. Plus whatever retrieval or tool access the judge has during grading. None of that ships with the score.
A recent systematic evaluation of LLM-as-judge setups showed that prompt-template choice alone materially changes both judge quality and internal consistency[2]. Two teams reporting “we used GPT-4o as judge” can be running substantively different graders. The grader that rewards epistemic hedging disagrees with the grader that penalises it. The grader with access to retrieval checks factuality. The grader without one does not, and cannot.
This is not a small print issue.
The evaluator is the measuring instrument. If two teams use different instruments and report the same number, they are not reporting the same thing.
And without a published judge card, no third party can reproduce the measurement. They can only rerun the model.
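A judge card does not have to be elaborate. The sketch below is one hypothetical shape for it; none of the field names come from the cited study. The point is only that every knob that can change the grader's verdict gets written down and fingerprinted.

```python
import hashlib
import json

# Hypothetical judge card: every knob that changes the grader's verdict.
judge_card = {
    "judge_model": "gpt-4o-2024-08-06",   # the model is one field, not the card
    "prompt_template": "pairwise-v4",     # template id, stored with the results
    "decoding": {"temperature": 0.0, "max_tokens": 512},
    "tie_policy": "abstain",              # what the judge does when unsure
    "abstention_counts_as": "excluded",   # excluded vs scored as failure
    "retrieval_access": False,            # can the judge check facts externally?
}

# A stable fingerprint: if any knob moves, the instrument has changed.
card_id = hashlib.sha256(
    json.dumps(judge_card, sort_keys=True).encode()
).hexdigest()[:12]
print(f"graded under judge card {card_id}")  # report this next to every score
```

Two scores are comparable only if they carry the same fingerprint.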
The Number That Lies About Consistency
The third layer is the quietest and most dangerous. It is the metric itself.
A standard agent metric is pass@k. You give the agent k attempts. If any one succeeds, it counts. This is perfectly reasonable if your production use allows k attempts. It is actively misleading if it does not.
There is a sibling metric, pass^k. Same k attempts. But it only counts if the agent succeeds on all of them. It measures consistency, not capability.
The gap between these two can be large, and it can open silently.
Recent work on trustworthy agent evaluation shows that controlled error injection into an agent can cut pass^k substantially while barely moving pass@k[3]. The model still has a ceiling you can hit with enough tries. It has lost the ability to hit that ceiling reliably. If your headline is pass@k and your production regime is one shot, the leaderboard says you are shipping. The bug tracker says otherwise.
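Both metrics fall out of the same attempt log, which means there is no excuse for reporting only one. A sketch, assuming each task's k attempts are recorded as booleans; the function names are mine:

```python
def pass_at_k(attempts: list[list[bool]]) -> float:
    """Fraction of tasks where at least one attempt succeeded: capability."""
    return sum(any(task) for task in attempts) / len(attempts)

def pass_hat_k(attempts: list[list[bool]]) -> float:
    """Fraction of tasks where every attempt succeeded: consistency."""
    return sum(all(task) for task in attempts) / len(attempts)

# Same log, two stories. Each inner list is one task, k = 3 attempts.
log = [
    [True, True, True],     # reliably solved
    [True, False, False],   # solvable, not reliable
    [False, True, False],   # same
    [False, False, False],  # unsolved
]
print(pass_at_k(log))   # 0.75 -- the leaderboard number
print(pass_hat_k(log))  # 0.25 -- the one-shot production number
```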
The same structural problem appears in calibration metrics. ECE asks whether stated probabilities match empirical frequencies on average. AURC asks whether the system can rank harder cases lower. Both can look nearly identical across two systems while a stricter, abstention-aware metric called BAS, the Behavioural Alignment Score, diverges sharply between them[4]. BAS asks a different question. Does the confidence surface protect you in exactly the regime where a person or product would actually choose to trust it? Two systems with “similar calibration” can answer that question completely differently once you attach a cost function.
The metric is not a measurement of the model. It is a statement about which errors the model’s operators will tolerate. If that statement does not match your operational contract, the score is not wrong. It is answering a question you did not ask.
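The calibration contrast is easiest to see in code. Below, a standard ECE computation sits next to a toy abstention-aware check. To be clear, the second function is not the BAS definition from the cited paper; it is only an illustration of how attaching a trust threshold changes the question being asked.

```python
import numpy as np

def ece(conf: np.ndarray, correct: np.ndarray, n_bins: int = 10) -> float:
    """Expected Calibration Error: weighted |confidence - accuracy| per bin."""
    bins = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    total = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            total += mask.mean() * abs(conf[mask].mean() - correct[mask].mean())
    return total

def selective_risk(conf: np.ndarray, correct: np.ndarray,
                   threshold: float = 0.8) -> float:
    """Toy abstention-aware metric (NOT the cited BAS): error rate on exactly
    the cases a product would act on, i.e. where confidence clears the bar."""
    acted = conf >= threshold
    return 1.0 - correct[acted].mean() if acted.any() else 0.0
```

Two systems can match on `ece` to the second decimal and still diverge badly on `selective_risk`, because the threshold encodes the regime where someone actually trusts the output.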
Why Rank Stability Is A Trap
Here is the part that makes all of this subtly worse.
Under scaffold shift, the rank order of agents on a benchmark is often relatively stable. A recent efficient-benchmarking study reports that rank preservation is easier to maintain than absolute calibration[5]. The number moves. The ordering does not.
If all you need is a relative decision, rank stability is comforting. Agent A beats Agent B here, and probably beats it in production.
If you need an absolute decision, it is a trap.
Procurement, safety arguments, SLA setting, cost modelling, and risk disclosure all depend on absolute numbers. A claim like “this agent ships 80% correct at 5 cents per request” binds to the calibrated level, not to the rank. Under scaffold shift, rank can hold while the 80% becomes 62%. Your spreadsheet is still using 80%. Your customers are experiencing 62%.
The protocol that produced 80% is part of the claim. The moment it diverges from production, the claim silently becomes false, even though nothing about the model moved.
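A toy version of that failure, with invented numbers:

```python
# Invented scores for two agents under the eval contract and in production.
eval_scores = {"agent_a": 0.80, "agent_b": 0.74}  # leaderboard scaffold
prod_scores = {"agent_a": 0.62, "agent_b": 0.55}  # your scaffold, one shot

def rank(scores: dict[str, float]) -> list[str]:
    return sorted(scores, key=scores.get, reverse=True)

print(rank(eval_scores) == rank(prod_scores))  # True: the relative call holds
print(prod_scores["agent_a"] >= 0.80)          # False: the absolute claim broke
```

The relative decision survives the scaffold shift. The SLA does not.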
This is why seasoned eval teams treat the contract, not the score, as the primary artefact.
You can rerun a score. You can only reproduce a contract if you wrote it down.
The Contract Is The Object
If the score is a function of the contract, the practical move is to treat the contract as the thing you own.
That means three changes to how most teams currently work.
Freeze the contract before you compare. If you cannot describe your deployment regime, observation channel, scaffold version, metric, judge configuration, and grader protocol in one page, you do not have a contract. You have assumptions pretending to be one. Write the page. Commit it. Make it a prerequisite for every comparison.
Version the contract the way you version weights. When the scaffold changes, the contract version changes. When the judge template changes, the contract version changes. When the metric changes, the contract version changes, period.
A benchmark result without a contract version is not a result. It is a rumour.
Publish a minimum audit bundle with every reported number. At minimum: the harness, the judge card, the rubric, a sample of trajectories, and the metric definitions used. This is not bureaucratic overhead. It is the only thing that lets a third party tell whether your score is comparable to anyone else’s. Without it, every comparison is a faith-based transaction.
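One way to make versioning mechanical rather than a matter of discipline: serialise the contract and hash it, so the version changes exactly when the contract does. A sketch, reusing the hypothetical `EvalContract` object from earlier:

```python
import hashlib
import json
from dataclasses import asdict

def contract_version(contract) -> str:
    """Deterministic version id: changing any slot changes the id."""
    canonical = json.dumps(asdict(contract), sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

# A reported result is a score plus a contract version, never a bare number.
result = {
    "score": 0.71,
    "contract_version": contract_version(team_a),  # from the earlier sketch
    "audit_bundle": team_a.audit_bundle,
}
```

Anyone who wants to compare against `result` now has to either match the version or explain the difference.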
Teams that do this do not ship agents faster. They ship agents that mean the same thing next quarter as they meant this quarter. That is a different product. It is also, increasingly, the only one that compounds.
The Real Question
Generation is getting cheaper every month. Models are getting better every month. Scaffolds are getting richer every month. All of that pushes the same direction. It makes raw agent capability more abundant and less differentiating.
The scarce resource is not the agent. It is whether you can say, with a straight face, what you measured.
Most of the public agent numbers flying around right now do not survive that question. The scaffold is implicit. The judge is underspecified. The metric does not match the action. The audit bundle is missing. Rank stability is quietly load-bearing in decisions that only absolute calibration can support.
The operators who will build durable advantage in the next two years are not the ones with the best agent.
They are the ones who own the contract under which “best” means anything at all.
Here is the test worth running this week. Pick the most recent agent score your team has cited in a decision. Try to describe the contract that produced it on a single page. Deployment regime. Observation channel. Scaffold version. Metric definition. Judge configuration. Grader protocol. Audit bundle. Seven slots.
If you can fill all seven, you have a result.
If you cannot, you have a rumour your spreadsheet is treating as a number. Every downstream decision is borrowing the rumour’s confidence.
Most teams cannot fill all seven the first time they try. That is the honest finding. Not that your model is wrong. Not that your benchmark is wrong. The thing you thought you measured has been sitting underneath the number all along, and the number is just the part you could see.
If you actually run that test this week: which of the seven slots was hardest to fill? I’d genuinely like to know which layer most teams cannot describe. Leave it in the comments.
1. Li et al., “RWE-bench: A Real-World Evidence Benchmark for LLM Agents on MIMIC-IV,” arXiv:2603.22767.
2. Wei et al., “Systematic Evaluation of LLM-as-a-Judge,” arXiv:2408.13006.
3. Ye et al., “Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents,” arXiv:2604.06132.
4. Wu et al., “A Decision-Theoretic Approach to Evaluating Large Language Model Confidence,” arXiv:2604.03216.
5. Ndzomga, “Efficient Benchmarking of AI Agents,” Semantic Scholar eeff7139 (arXiv:2603.23749).
If a single argument here changed what you were about to trust, the highest-leverage move is to subscribe on Substack. One piece a week, no filler.