Evals Are the Secret Weapon

03-03-26

Correct has to be defined outside the thing being measured, then checked mechanically, every time — a cheap hook, two auditors who can't agree by default, and a browser that actually gets opened.

Verification should be automatic. The user should not have to say "verify" all the time. That rule came out sideways, in the middle of an unrelated feature spec, which is exactly why I believe it. Nobody wrote that down as a rule first and then followed it. It fell out of how the work already happened, which means it was never a policy. It was already the operating assumption, and evals are what happen when you take that assumption and give it a mechanical, repeatable shape instead of leaving it as a vibe.

The definition is boring on purpose. Evals are automated tests for an agent, not for a feature. Build thirty to fifty test cases with an expected outcome attached to each one, run the agent against all of them, track the pass rate, and rerun the whole set every time a prompt changes. Most people skip this because it feels like overhead sitting between them and shipping. It is the opposite of overhead. Without it, changing a prompt is a coin flip you cannot even see land — you have no idea if the change made things better, worse, or just different, and you find out three weeks later from a support ticket instead of from a number. Eighty percent of the actual work in building an agent is prompt refinement, edge case handling, and eval creation. The code is the easy twenty percent. Behavior tuning is where the real time goes, and evals are the only thing that makes tuning behavior look like a discipline instead of vibes.
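A minimal sketch of that loop, assuming the agent can be called as a plain function from prompt to reply and that "expected outcome" means an exact string match. `EvalCase`, `run_evals`, and `toy_agent` are hypothetical names for illustration; a real suite would swap in its own agent call and its own comparison.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    name: str
    prompt: str
    expected: str  # the outcome is fixed up front, outside the agent


def run_evals(agent: Callable[[str], str], cases: list[EvalCase]) -> tuple[float, list[str]]:
    """Run every case and return (pass_rate, names_of_failed_cases)."""
    if not cases:
        raise ValueError("an empty eval set cannot produce a pass rate")
    failed = [c.name for c in cases if agent(c.prompt).strip() != c.expected]
    return 1 - len(failed) / len(cases), failed


# Hypothetical stand-in agent so the sketch runs without a model.
def toy_agent(prompt: str) -> str:
    return prompt.upper()


CASES = [
    EvalCase("shout-hello", "hello", "HELLO"),
    EvalCase("shout-mixed", "Done", "DONE"),
    EvalCase("keeps-version", "v2", "v2"),  # fails on purpose: the toy agent returns "V2"
]

pass_rate, failed = run_evals(toy_agent, CASES)
print(f"pass rate {pass_rate:.0%}, failed: {failed}")
```

Rerunning the same `CASES` after every prompt change is what turns the pass rate into a before-and-after comparison; the list of failed names is what tells you where to look.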

The cheapest, most literal version of a hook that verifies completed work runs on almost nothing: read the last user prompt, read the last agent message, read the list of files that were actually modified, and hand all three to a small, cheap model whose only job is to answer one question — is this actually done. If the answer is no, it does one of two things. It adds more tasks to the list, or it stops the agent from stopping and tells it to keep working. That is an automatic QA layer built specifically to catch agents trying to bail early, and it costs almost nothing to run because the model doing the checking is the cheapest one available, not the one that did the work. The eval here is not thirty test cases run in a batch. It is one check, run every single time, on the actual transcript instead of the agent's own account of the transcript.
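The shape of such a hook can be sketched in a few lines. Everything named here (`HookInput`, `HookDecision`, `completion_hook`, the YES/NO reply format, and the `stub_judge` standing in for the cheap model) is an assumption for illustration, not the hook's actual interface, and the sketch covers only the "keep working" branch, not the one that appends tasks.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class HookInput:
    last_user_prompt: str
    last_agent_message: str
    modified_files: tuple[str, ...]


@dataclass(frozen=True)
class HookDecision:
    allow_stop: bool
    reason: str


def completion_hook(inp: HookInput, judge: Callable[[str], str]) -> HookDecision:
    """Ask a cheap judge whether the work is done; anything but YES blocks the stop."""
    question = "\n".join([
        "Is this request actually done? Answer YES, or NO followed by what is missing.",
        f"User asked: {inp.last_user_prompt}",
        f"Agent said: {inp.last_agent_message}",
        "Files modified: " + (", ".join(inp.modified_files) or "(none)"),
    ])
    parts = judge(question).strip().split(maxsplit=1)
    first = parts[0].rstrip(":,.").upper() if parts else ""
    if first == "YES":
        return HookDecision(allow_stop=True, reason="judge confirmed completion")
    # Fail closed: a NO, an empty reply, or an off-script reply all keep the agent working.
    detail = parts[1] if first == "NO" and len(parts) == 2 else ""
    return HookDecision(allow_stop=False, reason=detail or "judge did not confirm completion")


# Hypothetical judge: a real one would be a call to a small model.
def stub_judge(question: str) -> str:
    files_line = question.splitlines()[-1]
    return "YES" if "tests/" in files_line else "NO: no test file was modified"


blocked = completion_hook(
    HookInput("add login with tests", "All done!", ("src/login.py",)), stub_judge)
allowed = completion_hook(
    HookInput("add login with tests", "All done!", ("src/login.py", "tests/test_login.py")),
    stub_judge)
```

The design choice worth keeping is the default: the hook only lets the agent stop on an explicit YES, so a confused or empty verdict counts as "not done" instead of quietly passing.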

That kind of hook does not get to celebrate every time it fires. On a completely different project, a session-scoped goal was blocked behind exactly this kind of check while the team tried to wire up Google sign-in alongside an already-working magic-link flow. The magic link was genuinely done: verified by a browser screenshot showing a real signed-in dashboard, not a claim about one. The hook still refused to close the session, because the condition had said Google or magic link and the hook read it strictly: Google was not done. And Google genuinely could not be finished from inside the session, because the OAuth client type in use does not support editing redirect URIs through any API, only through a console a headless agent cannot click into. The hook fired the same not-satisfied verdict five times in a row while the agent explained, correctly, that it was blocked on an action only a human could take. It did not get talked out of the verdict. It stayed blocked until a human created a new project and pasted in a fresh OAuth client, because the whole point of the check is that a good excuse is not the same thing as a met condition, even when the excuse is completely true.

Evals earn their keep hardest on the tools that measure everything else, because if those are wrong, nothing built on top of them can be trusted regardless of how carefully it was built. The tracker responsible for every model's pricing across every provider had its own correctness defined the same way you would define it for anyone else's code: not "does this look right," but "does this match the provider's current live documentation, exactly, right now." Getting it right took roughly a dozen rounds of two auditors working in parallel, each explicitly forbidden from relying on the other's answer, each required to return a flat yes or no on whether the pricing was a one-to-one match against current docs, not memory. The traps they kept catching were specific and unglamorous: one provider's compact model-listing endpoint was silently missing a competitor's real cache-read tier, a different model's aggregator mirror lacked a long-input pricing tier its direct API actually had, a third model had no long-context tier at all despite the tracker assuming one. Round after round ran the same two-auditor check after every fix until both sides came back clean. That is what "define correct outside the thing being measured" looks like when you actually do it — correct was never what the code believed. Correct was whatever the provider's own current documentation said, checked twice, by two auditors not allowed to agree with each other by default.
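One way to picture a single round of that two-auditor check, as a sketch under stated assumptions: each auditor is a function that re-reads its own copy of the live docs and returns a flat yes or no, and neither sees the other's verdict. The names (`make_auditor`, `audit_round`) and the prices are placeholders for illustration, not real provider data.

```python
from typing import Callable, Mapping

Prices = Mapping[str, str]  # tier name -> price, kept as the string printed in the docs
Auditor = Callable[[Prices], bool]


def make_auditor(fetch_live_docs: Callable[[], Prices]) -> Auditor:
    """Build an auditor that re-reads its source every time and answers yes or no."""
    def audit(tracked: Prices) -> bool:
        # Exact one-to-one match: a missing tier or an extra tier is a "no".
        return dict(tracked) == dict(fetch_live_docs())
    return audit


def audit_round(tracked: Prices, auditors: list[Auditor]) -> bool:
    """A round is clean only if every auditor independently says yes."""
    if len(auditors) < 2:
        raise ValueError("need at least two independent auditors")
    # Each auditor sees only the tracker, never another auditor's verdict.
    return all([audit(tracked) for audit in auditors])


# Placeholder documentation; in practice each auditor would fetch the docs itself.
DOCS = {"input": "1.00", "output": "2.00", "cache_read": "0.10"}
auditors = [make_auditor(lambda: dict(DOCS)), make_auditor(lambda: dict(DOCS))]

before_fix = audit_round({"input": "1.00", "output": "2.00"}, auditors)  # cache_read missing
after_fix = audit_round(dict(DOCS), auditors)
```

Comparing whole mappings is what catches the unglamorous traps: a tier that is absent from the tracker fails the round just as surely as a wrong number does.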

A check is also only as good as what it compares against, not just what it measures. One automated sync run diffed the current translation catalog against what it believed was the main branch and filed a real task claiming thousands of missing keys per locale — a backlog that did not exist. The branch it diffed against turned out to be a divergent reference with no shared history with the actual working tree at all, so every difference it found was noise dressed up as a finding. A later run defined correct the right way: recompute parity against the real working tree, prove the two branches shared no common ancestor with the actual merge-base command, and close the phantom task with the proof attached. Same category of check, two different definitions of what it was actually comparing against — one produced a false alarm, the other produced the truth. An eval is not just a pass/fail number. It is a claim about what the number was measured against, and that claim needs checking as much as the result does.
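The shared-ancestor proof can be run before any diff is trusted. This is a sketch, assuming `git` is on the path and relying on `git merge-base` exiting with status 0 when it finds a common ancestor and status 1 when there is none; `shares_history` and the injectable `run` parameter are hypothetical names added so the logic can be exercised without a repository.

```python
import subprocess
from typing import Callable, Sequence


def run_git(args: Sequence[str]) -> subprocess.CompletedProcess:
    """Run a git subcommand in the current directory and capture its output."""
    return subprocess.run(["git", *args], capture_output=True, text=True)


def shares_history(
    ref_a: str,
    ref_b: str,
    run: Callable[[Sequence[str]], subprocess.CompletedProcess] = run_git,
) -> bool:
    """Return True only if the two refs have a common ancestor."""
    result = run(["merge-base", ref_a, ref_b])
    if result.returncode == 0:
        return True
    if result.returncode == 1:
        return False  # no common ancestor: any diff between the refs is noise
    # Anything else (bad ref name, not a repository) is an error, not an answer.
    raise RuntimeError(f"git merge-base failed: {result.stderr.strip()}")
```

A sync job that called something like `shares_history("HEAD", reference_branch)` first, and refused to file tasks on a `False`, would have turned the phantom backlog into a one-line warning.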

I want to be honest about the failure mode too, because a system that only tells success stories about its own verification is not being verified, it is being marketed. A published internal package once shipped with four hundred and seven TypeScript type errors, an improvement over the prior release's four hundred and fifty-eight, which had also shipped broken. The publish gate checked that the build succeeded. It did not check that the type checker was clean. Shipping despite a failing typecheck had quietly become the accepted norm for that package instead of the anomaly it should have been, and it took a separate audit to even notice the gap between the stated rules and what the gate actually enforced. An eval is only as good as what it actually measures, and a gate that checks the wrong thing gives you the exact same false confidence a self-report does — it just arrives wearing a green checkmark instead of a sentence.
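A gate that enforces the stated rules can be sketched as a list of required checks where a check that never ran counts as a failure. The names here (`REQUIRED_CHECKS`, `publish_gate`) are hypothetical, and the exit codes would come from whatever commands the package actually uses; for a TypeScript package the typecheck might be `tsc --noEmit`, though that is an assumption about the setup.

```python
REQUIRED_CHECKS = ("build", "typecheck")  # assumption: the two checks contrasted above


def publish_gate(
    exit_codes: dict[str, int],
    required: tuple[str, ...] = REQUIRED_CHECKS,
) -> list[str]:
    """Return the reasons publishing is blocked; an empty list means the gate is open."""
    problems = []
    for name in required:
        if name not in exit_codes:
            # A check that was never run must not pass by omission.
            problems.append(f"{name}: never ran")
        elif exit_codes[name] != 0:
            problems.append(f"{name}: exit code {exit_codes[name]}")
    return problems


print(publish_gate({"build": 0}))                  # build alone is not enough
print(publish_gate({"build": 0, "typecheck": 0}))  # gate opens
```

Treating "never ran" as a failure is the part that closes the gap described above: the old gate was green because nothing ever asked the type checker.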

The same "prove it, don't assume it" standard applies to ordinary test failures, not just pricing tables and translation catalogs. After a large rename touched a codebase, seventeen tests came back red. The honest-sounding shortcut would have been to assume the rename broke them and start fixing blind. Instead each failure got reproduced on its own terms: a stack overflow was confirmed to happen even with the rename's changes stashed out, so it predated the rename; a batch of Python failures traced to a pinned dependency with no build for the current platform, proven by checking the actual package availability, not by guessing; a batch of sandbox failures traced to a missing kernel permission, proven by literally running the underlying system call by hand and watching it get rejected with the exact same error. Two of the seventeen were real regressions from the rename. Fifteen were pre-existing, and each of those fifteen has a specific, reproduced reason attached to it instead of a shrug. That is what defining correct outside the thing being measured looks like at the most unglamorous possible scale — not a framework, just a refusal to accept "probably the rename" as an explanation when the actual cause can be reproduced on demand.
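The first split, regression versus pre-existing, is plain set arithmetic once the suite has been run both with the change and with it stashed. The sketch below uses a hypothetical `triage` helper and placeholder test names; it only sorts failures into the two piles, and each pre-existing failure still needs its own reproduced cause.

```python
def triage(
    failing_with_change: set[str],
    failing_without_change: set[str],
) -> dict[str, list[str]]:
    """Split failures into those the change introduced and those that predate it."""
    return {
        # Red only when the change is applied: the change owns these.
        "regressions": sorted(failing_with_change - failing_without_change),
        # Red either way: the change cannot be the cause.
        "pre_existing": sorted(failing_with_change & failing_without_change),
    }


# Placeholder test names, not the real suite.
with_change = {"test_rename_import", "test_stack_depth", "test_sandbox_mount"}
without_change = {"test_stack_depth", "test_sandbox_mount", "test_flaky_network"}

report = triage(with_change, without_change)
```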

Even the verification technique itself is not immune to needing its own check. While A/B-testing whether a set of failing tests predated an unrelated change, the standard move (check out the old state, rerun, compare) reverted a conflict-marker fix that had just been made and let the broken version get committed on top of the good one. It was only caught because the same session insisted on re-verifying afterward with a direct, mechanical grep for leftover conflict markers, got a nonzero count, and forced a second fix. The tool built specifically to rule out a regression had just caused one. None of this means giving up on evals. It means the discipline does not get to stop at "we have a check." It has to survive the follow-up question: what checks the check?
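The mechanical re-check is small enough to write out. A sketch, assuming the standard seven-character markers git writes at the start of a line; `count_conflict_markers` is a hypothetical helper, and the `=======` separator is deliberately not counted because it also shows up legitimately in plain-text underlines.

```python
CONFLICT_PREFIXES = ("<<<<<<<", ">>>>>>>")


def count_conflict_markers(text: str) -> int:
    """Count lines that begin with an opening or closing conflict marker."""
    return sum(1 for line in text.splitlines() if line.startswith(CONFLICT_PREFIXES))


# Sample built on one line so this file itself stays free of marker lines.
broken = "a = 1\n<<<<<<< HEAD\nb = 2\n=======\nb = 3\n>>>>>>> feature\n"
clean = "a = 1\nb = 3\n"

print(count_conflict_markers(broken), count_conflict_markers(clean))
```

A nonzero count after a supposed fix is exactly the signal that forced the second fix in the story above.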

Some verification only means anything if it happens inside a real browser against a real running app, because the failure modes that matter — a spinner that never clears, a message order that silently scrambles, a reply that takes a full minute on a plain hello — do not show up in a unit test at all. That is why QA work in the fleet runs as eight separate agents, each locked to one section of the app, each working off the same shared plan, each required to actually drive the page instead of reading the source and guessing. At larger scale the same instinct runs mechanically: point a tool at a URL, have it inventory every link, button, form, and input as its own scenario, and fan the whole batch out across real sandboxes instead of trusting that a page compiling means a page working. A test suite that never opens a browser has already decided, by omission, that everything a user would actually click on does not need checking.
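The inventory step can be sketched with the standard library's `html.parser`. This is an illustration of turning a page into one scenario per interactive element, not the tool described above: it reads static HTML only, so anything rendered by scripts would need a real browser to discover, and the class name and label choices are assumptions.

```python
from html.parser import HTMLParser


class InteractiveInventory(HTMLParser):
    """Collect one (kind, label) scenario per link, form, input, and button."""

    def __init__(self) -> None:
        super().__init__()
        self.scenarios: list[tuple[str, str]] = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        label = a.get("id") or a.get("name") or "(unlabelled)"
        if tag == "a" and a.get("href"):
            self.scenarios.append(("link", a["href"]))  # anchors without href are skipped
        elif tag == "form":
            self.scenarios.append(("form", a.get("action") or "(no action)"))
        elif tag in ("input", "select", "textarea"):
            self.scenarios.append(("input", label))
        elif tag == "button":
            self.scenarios.append(("button", label))


SAMPLE_HTML = """
<a href="/pricing">Pricing</a>
<a name="anchor-only">not a link</a>
<form action="/login">
  <input name="email">
  <button id="go">Sign in</button>
</form>
"""

inventory = InteractiveInventory()
inventory.feed(SAMPLE_HTML)
for kind, label in inventory.scenarios:
    print(kind, label)
```

Each tuple is then one unit of work to hand to a browser-driving agent or sandbox, which is what makes the fan-out mechanical instead of a matter of remembering what to click.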

None of this is free, and I do not pretend the accounting works out to zero. But it is measurable, which is the entire point — a terminal-compression tool built for coding agents shipped with its own benchmark and the improvement is a number, not an adjective: fifty percent fewer tool calls, ten percent less token usage overall, up to seventy percent less on some batch actions. That is what an eval is supposed to produce. Not a feeling that the new version is probably better. A rerun of the same fixed test set, before and after, with a number attached to the difference, so the decision to ship is a comparison instead of a guess.
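The arithmetic behind a number like that is simple, but it only means something when both runs cover the same cases. A sketch with hypothetical helper names and placeholder counts (these are not the tool's benchmark figures):

```python
def percent_reduction(before: float, after: float) -> float:
    """Percent by which `after` is lower than `before`."""
    if before <= 0:
        raise ValueError("baseline must be positive")
    return (before - after) / before * 100


def compare_runs(before: dict[str, int], after: dict[str, int]) -> float:
    """Percent reduction in total cost, valid only when both runs used the same cases."""
    if before.keys() != after.keys():
        raise ValueError("runs cover different cases; the numbers are not comparable")
    return percent_reduction(sum(before.values()), sum(after.values()))


# Placeholder tool-call counts per case, before and after a change.
calls_before = {"case-a": 12, "case-b": 8}
calls_after = {"case-a": 6, "case-b": 4}

reduction = compare_runs(calls_before, calls_after)
print(f"{reduction:.0f}% fewer tool calls")
```

Refusing to compare runs with different case sets is the same point made earlier about translation branches: the number is only as honest as what it was measured against.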

Evals are called the secret weapon because almost everyone building an agent skips them, not because they are exotic. The version that matters is not a research team's academic benchmark suite. It is a fixed set of cases you rerun every time something changes, a cheap hook that checks a transcript instead of trusting a summary, two auditors who are not allowed to agree with each other, and a browser that actually gets opened instead of assumed. None of it is clever. All of it is the difference between shipping a change and finding out later whether it worked.
