Research

A dead system and a quiet one send the identical signal. The fix isn't a smarter status message — it's refusing to count anything as evidence that a failure could produce by doing nothing at all.
A twenty-hour outage across an entire fleet of maintenance loops once looked, from the outside, exactly like a quiet day. Docs-sync, i18n-sync, worktree-guard, security-sweep, task-groom, feedback-triage — every scheduled cron job in the maintenance layer failed before it did any work, roughly fifty invocations in a row, for three different reasons: the model was unavailable, the model was not found, or the connection was refused after a few minutes of stalling. A successful run posts a one-line digest to the loops channel. A failed run, in every one of those fifty cases, died with zero tool calls and posted nothing. Silence and health produced the identical signal. For twenty hours, the only evidence a human could have looked at was an empty channel, and an empty channel says both everything is fine and everything is on fire in exactly the same font.
That is the whole argument for treating evidence as a specific, checkable artifact instead of a status. A channel with nothing new in it is not proof of health. It is the absence of proof, and the absence of proof looks identical whether the underlying system is idle or dead. The fix is not a smarter status message. It is refusing to accept the presence or absence of a message as evidence of anything, and demanding something a failure cannot fake by simply not happening.
Someone once put the actual problem to me more precisely than I usually bother to: agent trust and reputation is going to be one of the hardest problems, how do you verify an agent actually did what it said without expensive oversight. That is the right question, and the answer I have landed on after running a fleet for a year is that you don't verify the claim — you refuse to let the claim be the thing you check at all. You check for an artifact that could not exist unless the claim were true.
A second incident from the same maintenance layer shows the failure mode from a different angle. A daily sync cron crashed one morning and left behind a log file containing only its own header — no body, no error, nothing past the first line. Nobody noticed until the next day's run happened to open its predecessor's log before starting and found it suspiciously short. That run fell back to a two-day-old baseline and processed a backlog spanning over a hundred commits in one pass, and it still shipped a correct, tested fix once it caught up. The near-miss is the actual lesson: it worked because that particular loop happened to check its predecessor's evidence before trusting that the predecessor had run at all. A loop that assumes yesterday succeeded because nothing said otherwise is one silent crash away from drifting for a week.
The most literal version of that discipline lives inside a tool built to run math proofs against a budget until either the goal or the budget runs out. Early in its design, the actual requirement got stated in one line: are we saving everything, we have to save all steps everything the AI does. An adversarial reviewer assigned to attack the design pushed it past a simple log file into something closer to an accounting ledger than a debug trace. Every event carries a hash of the previous event, a hash of its own payload, and hashes of anything it links to, plus a strictly increasing sequence number. The whole thing is designed to fail closed on any break in that chain: a deleted event, a reordered one, an edited payload, a swapped artifact, even a row mutated directly in the database, all have to be detectable after the fact. A second reviewer's brief was narrower and just as important — assume the system will be tempted to claim a proof that is not actually valid, and check specifically for false proofs and formalization gaps dressed up as results. Evidence here does not mean a log line that says done. It means a chain you can walk backward, link by link, and if one link doesn't match its hash, you know exactly where the story stopped being true.
Most evidence does not need to be that elaborate, and demanding cryptographic rigor everywhere would be its own kind of waste. But the underlying question is always the same one: can this claim be checked by something other than the thing making it. A worktree-guard loop that runs on a schedule to look for orphaned agent clones once found roughly six gigabytes of them — ten separate directories, each its own git repository, every one created within a three-hour window on a single day. The loop did not report "found some clutter." It reported which directories, their exact size, the exact window they were created in, and cross-referenced each one against a live goal plan before classifying the pile as active work in progress rather than rot worth deleting. The difference between "there's a mess in here somewhere" and a size, a path list, and a timestamp window is the difference between a feeling and a receipt.
The same standard applies when the claim being tested is not a bug but an accusation. A public reply implied that a release had shipped something malicious. The response was not a denial and it was not silence — it was a scoped, read-only audit against the exact commit range between the last known-good release and HEAD, checking specifically for the categories that would matter if the claim were true: obvious malicious behavior, secret exposure, network exfiltration, shell or process execution added where it did not belong, dependency and lockfile changes, deploy workflow changes, anything shaped like a backdoor. Treating an accusation as a hypothesis to falsify against a specific, named range of commits is a different act than getting defensive about it. One produces a feeling of confidence. The other produces something you could hand to the person who made the accusation and ask them to check it themselves.
Small-scale evidence works the same way and gets skipped just as often. When an unrelated test failed during an otherwise unrelated change, the honest-sounding move would have been to say the failure looks pre-existing and move on. Instead the actual check was mechanical: run a working-tree diff scoped to exactly the directories the failing test covers, confirm it comes back clean, and only then log the conclusion — pre-existing failure on the main branch, unrelated to this change — before running the change's own guards and committing. "Not my regression" is a claim. A clean, scoped diff against the failing test's own target paths is the evidence for it, and it costs one extra command to produce.
Most of this evidence has a boring, unglamorous shape day to day, and that is by design. A standing duty to post one digest to an ops channel gets followed up the same way every time: did you actually post it, reply with the message id. A claim that secrets were checked gets replaced with secret scan: zero hits. A claim that a bug was fixed gets replaced with the exact commit hash it landed in. Even a feature that exists purely to stop generated content from silently disappearing got its own dedicated regression suite named, fittingly, receipts. None of these are clever. All of them share the one property that matters: a second person, or a second agent, can go check them without asking the first one anything.
None of this is really about paperwork. It is about what "verify" is allowed to mean. Verify does not mean ask the agent if it's confident. It does not mean check that a channel has recent activity in it. It means find an artifact that could only exist if the claim is true, and that would be conspicuously absent or wrong if the claim is false — a hash chain that either walks or breaks, a timestamp window that either matches or doesn't, a commit range that either contains the bad code or doesn't, a diff scope that is either clean or isn't. The twenty-hour outage was invisible for exactly as long as the evidence being checked was "did a message arrive," which a dead system can satisfy just as easily as a live one by doing nothing at all.
The reason artifacts like these work as evidence and confident language does not is a cost asymmetry, not a moral one. A hash chain, a diff scoped to the failing test's own paths, a commit range, a byte count paired with a timestamp window — all of these are nearly free to produce when the underlying work actually happened, and expensive or outright impossible to fake convincingly when it did not. A sentence that says all fixed costs the same four words whether the fix landed or not. That asymmetry is the actual reason evidence or it didn't happen is worth enforcing as a rule instead of repeating as a slogan: the artifacts worth demanding are specifically the ones that stay cheap when the claim is true and get expensive the moment it isn't. Everything else is decoration wearing the shape of a status update.
The fix that came out of that outage was not a better dashboard. It was accepting that silence itself has to be logged as a fact and checked, not read as good news by default. A loop that goes quiet needs the same thing an agent that claims done needs: something external checking a specific, falsifiable artifact, on a schedule the quiet system does not control. The moment you let "nothing happened" stand in as a form of evidence, you have quietly reintroduced the exact self-report problem you built all this tooling to get rid of, just wearing the costume of an empty inbox instead of a confident sentence.
The receipts end the argument. The work stands or it doesn't. I don't ask for evidence because I don't trust the agents. I ask because a confident sentence and an empty channel cost the same to produce whether the underlying work happened or not, and only one of those two things has ever been true at the same time as a hash that checks out, a diff that comes back clean, or a commit range that either does or does not contain the thing you're afraid it contains. Build the system so that the only way to look done is to actually be done, and the self-reports stop mattering, because nobody has to trust them anymore. That is the whole difference between a fleet I can leave running overnight and one I have to sit and watch. Not more confidence in the agents. Just fewer places where a confident sentence and an empty channel get to count as proof.