
Never Trust an Agent Self-Report

The work is done when the evidence survives an attack. Everything else is a self-report.

Every coding agent will tell you the work is done. The tests pass, the feature is complete, the refactor is safe. Ask it. It will say yes.

It is often wrong, and it does not know it is wrong. So the operating rule of my whole fleet is five words: never trust an agent self-report.

It is not a slogan I made up after the fact. It is a literal line inside a recurring fleet-sweep prompt that runs across every machine, classifying every tmux pane as working, idle-done, stuck, errored, or blocked-on-account-usage. The prompt ends the same way every time it runs: verify everything, never trust an agent self-report, in capitals, because a pane that reports "done" and a pane that is actually done are two different facts and only one of them matters.
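A rough sketch of that distinction, not the fleet's actual sweep code: the five states come from the prompt described above, while `PaneStatus`, `trusted_state`, and `actually_done` are hypothetical names invented for illustration. The point is only that the reported state and the verified state live in separate fields, and nothing downstream reads the reported one.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class PaneState(Enum):
    WORKING = "working"
    IDLE_DONE = "idle-done"
    STUCK = "stuck"
    ERRORED = "errored"
    BLOCKED_ON_ACCOUNT_USAGE = "blocked-on-account-usage"


@dataclass
class PaneStatus:
    pane_id: str
    reported: PaneState                   # what the agent says about itself
    verified: Optional[PaneState] = None  # what an independent check found, if any

    def trusted_state(self) -> Optional[PaneState]:
        # Only the verified state counts; an unverified pane has no trusted state.
        return self.verified


def actually_done(panes: list[PaneStatus]) -> list[str]:
    """Return ids of panes whose verified state, not reported state, is idle-done."""
    return [p.pane_id for p in panes if p.trusted_state() is PaneState.IDLE_DONE]
```

A pane that reports idle-done but has not been checked simply does not appear in the result.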

I run agents across multiple machines — dozens of tmux sessions, coordinators dispatching workers, loops firing through the night. At that scale you cannot review everything yourself. But you also cannot accept done as evidence. The answer is not more trust. The answer is structured distrust.

Here is what that looks like in practice.

Every substantial task gets adversarial reviewers. Not one friendly second opinion — two, four, sometimes six agents whose explicit job is to attack the work: challenge the assumptions, the freshness of the sources, the edge cases, the regressions, and whether the evidence actually proves the claim. This is written into the global rules of every machine I operate. It is not a step in the process. It is the culture.

I did not invent the habit; I just made it mandatory everywhere. I got the idea from a stranger's reply on Twitter about adversarial code review and have not stopped using it since. It is, plainly, the best thing that has happened to my coding workflow since I started running agents, and at this point it deserves to be its own reusable skill instead of something I retype into every prompt.

It also does not stay confined to code review. Ask what my favorite use of sub agents is lately and the answer is boring on purpose: spawn adversarial agents to cross-check the work and report any issues. Not a quarterly audit, not a special occasion reserved for big decisions, but a default setting on ordinary tasks. When a stack of pull requests needs merging into main, the instruction is not "merge them." It is "work with four adversarial sub agents, analyze every piece of code written in every PR, and merge one by one as all is green." The adversarial pass is not a gate before the real work. It is part of the work.

The strongest version of this replaced the architecture meeting. In June we had a real design question: should deterministic automations be a separate control plane, or part of OpenLoops? Instead of deciding on vibes, I dispatched twelve read-only adversarial reviewers — six roles (runtime, security, reliability, migration, product, contracts) across two model tiers. Each one had to post an evidence-backed report to a shared channel with file-and-line references. And each report had to include one mandatory section: the strongest argument against your own recommendation.

Then round two: every reviewer reads the channel and posts a rebuttal or an agreement. Twelve agents arguing in public, forced to steelman the other side, citing exact lines of code. The decision that came out of that was better than any design meeting I have sat in, and it left a written evidence trail instead of a memory of who talked loudest.

The prompt template that goes out to all twelve is blunt about scope: read-only adversarial architecture review, do not edit files, do not write files, do not spawn more agents, and your report must include a recommendation, the strongest argument against that recommendation, and source references to exact files and lines wherever possible, finishing with DONE plus your recommendation. No agent gets to satisfy the assignment by writing a strong case and skipping the part where it argues against itself.
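One way to keep that last rule mechanical is to check each report for the required parts before anyone reads it. This is a minimal sketch under assumptions: the section labels, the `review_report_problems` name, and the file:line pattern are hypothetical, and it treats a file:line reference as strictly required even though the template only asks for one wherever possible.

```python
import re

# Hypothetical section labels; the template names the required parts, not their exact wording.
REQUIRED_SECTIONS = ("recommendation:", "strongest argument against:")
# Matches references such as src/runtime/loop.py:42 (path, colon, line number).
FILE_LINE_REF = re.compile(r"\b[\w./-]+\.\w+:\d+\b")


def review_report_problems(report: str) -> list[str]:
    """Return the reasons a report fails the template; an empty list means it passes."""
    lines = [line.strip() for line in report.splitlines() if line.strip()]
    problems = []
    for section in REQUIRED_SECTIONS:
        # A section counts only if some line starts with its label.
        if not any(line.lower().startswith(section) for line in lines):
            problems.append(f"missing section: {section}")
    if not FILE_LINE_REF.search(report):
        problems.append("no file:line reference")
    if not lines or not lines[-1].startswith("DONE"):
        problems.append("does not finish with DONE")
    return problems
```

A report that argues well for its recommendation but never argues against it comes back with `missing section: strongest argument against:` and does not count as delivered.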

The habit does not need a twelve-agent board to earn its keep. When we designed a single atomic tool call that lets a model emit a whole canvas layout in one shot, the spec went through three adversarial reviews before a line of implementation code existed — every blocker and major finding folded back in, every claim re-checked against the live code with file and line references. Same discipline, different shape. A two-person design chat produces opinions. A review that must cite the file and line produces evidence.

The same pattern lives in the code, not just the process. Our swarm framework has a first-class actor/verifier loop: a cheap workhorse model produces, a frontier model critiques, and results are only accepted once the verifier explicitly calls an approved tool. Not looks-good-to-me in prose — a structured approval, gated in code. Our dispatch tool does not assume a prompt reached an agent's pane; it diffs the pane before and after and confirms the prompt actually landed and submitted. Even message delivery is verified, because I sent it is also a self-report. Our QA tooling runs the same instinct against a browser instead of a diff: eight adversarial agents, each locked to one section of the app, each working off the same shared plan so nobody grades their own homework. At bigger scale the same tool inventories every link, button, form, and input on a page as its own scenario and fans the whole batch out across up to twelve sandboxes running in deterministic waves. "It works" is a sentence. A scenario per button, executed and logged, is not.
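The actor/verifier gate can be sketched in a few lines. This is an illustration under assumptions, not the swarm framework's real API: `Verdict`, `actor_verifier_loop`, the `"approve"` tool name, and the round limit are all hypothetical, and the two models are stood in for by plain callables.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    tool_called: Optional[str]  # name of the tool the verifier invoked, if any
    feedback: str               # free-form critique; never treated as approval


def actor_verifier_loop(
    produce: Callable[[str], str],
    verify: Callable[[str], Verdict],
    task: str,
    max_rounds: int = 3,
) -> Optional[str]:
    """Return a draft only if the verifier called the approve tool; otherwise None."""
    prompt = task
    for _ in range(max_rounds):
        draft = produce(prompt)
        verdict = verify(draft)
        # Approval is a structured tool call, not a friendly sentence in the feedback.
        if verdict.tool_called == "approve":
            return draft
        prompt = f"{task}\n\nReviewer feedback:\n{verdict.feedback}"
    return None
```

A verifier that writes "looks good to me" in its feedback but never calls the tool approves nothing, and the loop returns `None` after the last round.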

The self-report problem is not limited to chat. It is baked into how an agent decides to stop working, which is exactly why I stopped trusting that decision too. A session gets a standing condition — fix the CLI bugs, update every machine — and the agent tries to end the session once it believes the condition holds. A hook reads the transcript before it is allowed to stop and checks the claim against the evidence inside it, not against the agent's summary of that evidence. One session claimed all the CLI bugs were fixed and deployed. The hook's answer: two of three were live on all five machines, and the third was fixed at the source commit but explicitly could not be published because the build it depended on was broken. Fix all bugs and update across all machines is not satisfied by two of three. The session did not get to stop.

Another one: keep going, keep checking for any feedback, both UI and non-UI, lock one at a time, until there are zero. The agent closed thirteen items and then, honestly, explained why it would not touch the remaining ten — some would collide with other agents' in-flight rewrites, one required a design decision it should not guess at, a couple needed a live browser reproduction it could not run under its own standing rules. Reasonable answers. Zero remaining was still not true. The hook held the line: ten open items is not zero, and the condition does not get partial credit for good excuses. That is the whole idea in miniature — a self-report, even an honest and well-reasoned one, is still a self-report, and the only thing that ends the session is the count actually reaching zero.
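The two-of-three case reduces to a check with no room for partial credit. A minimal sketch, assuming the evidence has already been pulled out of the transcript into a mapping from each bug to the machines where its fix is live; `unmet_deployments`, `may_stop`, and the bug and machine labels are hypothetical.

```python
def unmet_deployments(
    deployed: dict[str, set[str]], machines: set[str]
) -> dict[str, set[str]]:
    """Map each bug to the machines where its fix is not yet live."""
    return {
        bug: machines - live
        for bug, live in deployed.items()
        if machines - live
    }


def may_stop(deployed: dict[str, set[str]], machines: set[str]) -> bool:
    """The session may stop only when every fix is live on every machine."""
    return not unmet_deployments(deployed, machines)
```

Note what the function does not take as input: the agent's explanation. A fix that exists at the source commit but is live nowhere shows up as missing on all five machines, and the answer is still no.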

The same check runs at the smallest scale too. One feedback item in the queue exists for exactly one purpose: verify and close the claim that an agent spawned a peer agent when it was asked to. Not fix it, not discuss it — verify it. Even a single sentence in a status update gets its own audit trail before anyone is allowed to mark it resolved.

Reports come with receipts. When I give an agent a standing duty (post one digest to the ops group, one message, do not spam), the follow-up is always the same: did you actually post it? Reply with the message id. A receipt is an exit code, a message id, a line count, a file:line reference, or "secret scan: 0 hits" instead of "I checked for secrets." If the evidence cannot be checked, it is not evidence. It is a claim.
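For anything that runs as a command, the receipt can be captured at the moment of execution rather than recalled afterwards. A small sketch using Python's standard `subprocess` module; the `Receipt` shape and `run_with_receipt` name are hypothetical, and the fields are only the two receipts from the list above that a generic command can supply.

```python
import subprocess
from dataclasses import dataclass


@dataclass
class Receipt:
    command: list[str]
    exit_code: int     # checkable: 0 or not
    stdout_lines: int  # checkable: how many lines the command printed


def run_with_receipt(command: list[str]) -> Receipt:
    """Run a command and record what happened instead of a description of it."""
    result = subprocess.run(command, capture_output=True, text=True)
    return Receipt(
        command=command,
        exit_code=result.returncode,
        stdout_lines=len(result.stdout.splitlines()),
    )
```

The agent's report then carries the `Receipt` itself, so a reviewer can rerun the same command and compare numbers.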

The same suspicion applies to silence. Loop agents fire through the night (supply-chain checks, package updates, tmux revival), and every one of them is required to post its result to a shared space, not only when something breaks. Silence is not a valid status. An agent that goes quiet for six hours is not necessarily fine; it might be stuck, it might have crashed, it might have finished the job and simply never said so. The default assumption is not "no news is good news." The default assumption is that no news is a stuck pane until someone proves otherwise.
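That default is easy to encode. A minimal sketch, assuming each loop's last post to the shared space is available as a timestamp; `silent_agents`, the loop labels, and the silence threshold are hypothetical choices, not values from the fleet.

```python
from datetime import datetime, timedelta


def silent_agents(
    expected: set[str],
    last_report: dict[str, datetime],
    now: datetime,
    max_silence: timedelta,
) -> list[str]:
    """Return the loops that must be treated as stuck until proven otherwise."""
    silent = []
    for name in sorted(expected):
        seen = last_report.get(name)
        # Never having reported counts as silence, not as "nothing to say".
        if seen is None or now - seen > max_silence:
            silent.append(name)
    return silent
```

The list of expected loops is what makes this work: a loop that never started posting still gets flagged, because its absence is compared against a roster rather than against its own history.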

People think this is paranoia. It is the opposite: it is what lets me delegate at all. Because verification is structural, I can hand entire subsystems to agents and sleep. The alternative — trusting, then spot-checking when things feel off — caps you at the number of panes you can personally babysit. Structured distrust is what scale is made of.

It also works on me. Opus called my logic dumb three times in one session and every time it was right. The adversarial setup is not there to keep the agents honest while I float above review. My plans go through the same reviewers. The strongest-argument-against-yourself rule applies to the guy who wrote the rule.

Three practical notes if you want to try this.

Keep the reviewers read-only. Reviewers that can edit files stop reviewing and start fixing, and now you have three agents fighting over the same diff. Do not edit, do not write, do not spawn more agents. Report, with evidence, then stop.

Make disagreement mandatory. If you ask an agent whether the work is good, it says yes. If you require the strongest argument against, you get the failure modes up front. The models are perfectly capable of adversarial thinking. You just have to make it the assignment instead of hoping for it.

And do not grade on completion percentage. Mostly done, two of three, as much as I safely could — these are all self-reports wearing a number. The condition either holds against the transcript or it does not. The moment you accept a good excuse in place of a met condition, the whole discipline degrades back into trust.

Evals are the secret weapon of every serious agent team, and this is the same principle one level up: define what correct means outside the thing being measured, then check it mechanically, every time.

The work is not done when the agent says it is done. The work is done when the evidence survives an attack.

Everything else is a self-report.
