Canary Loops

02-03-26

A loop that runs and does nothing looks identical to a loop that is dead. That single fact is the entire argument for canaries.

The simplest version of the idea was almost a joke: start a loop that runs every minute and says hi. No business logic, no state to check, nothing to fix. Its only job is to fire on schedule and produce a visible reply. If the reply shows up on time, the scheduler underneath every other loop is honest. If it does not, nothing above it can be trusted, no matter how sophisticated the loop's actual work is supposed to be.

That instinct got formalized inside the loop scheduler itself, where the regression tests use the cheapest verification primitive available: schedule a prompt that says "reply exactly LOOP_SMOKE_0150 when this scheduled loop fires", then check the transcript for that exact string. Not a summary of what happened. Not a status that could be technically true while quietly meaning nothing happened at all. One literal string, matched exactly, or the test fails.

Someone once asked, flatly, "so what's a canary?" The honest answer is this: it is the cheapest loop you can write, built only to prove the thing underneath it is still telling the truth. Cheapest is not a throwaway word here. A canary that costs nearly as much to run as the system it is checking is not a safeguard; it is a second unreliable thing to worry about. Replying with a fixed string on a fixed schedule costs almost nothing and can fail in only one way, by arriving late or not at all, which is exactly the property you want from the one component you are trusting to tell you when everything else has stopped telling the truth.
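The exact-string check described above can be sketched in a few lines. This is a minimal illustration, not the scheduler's real test: the transcript shape (pairs of timestamp and reply text), the function name, and the 30-second grace window are all assumptions, and "matched exactly" is read here as equality after trimming whitespace.

```python
from datetime import datetime, timedelta

CANARY_TOKEN = "LOOP_SMOKE_0150"  # the literal string named in the text


def canary_passed(transcript, fired_at, grace=timedelta(seconds=30), token=CANARY_TOKEN):
    """Return True only if a reply equal to the token arrived on time.

    transcript: iterable of (timestamp, reply_text) pairs (hypothetical shape).
    fired_at:   when the schedule was supposed to fire.
    grace:      how late a reply may be and still count (assumed value).
    """
    deadline = fired_at + grace
    for ts, text in transcript:
        # Equality after trimming whitespace: a summary that merely
        # mentions the token does not count.
        if text.strip() == token and fired_at <= ts <= deadline:
            return True
    return False
```

A reply such as "Replied with LOOP_SMOKE_0150 as requested" fails this check on purpose. The point of the primitive is that nothing softer than the literal string gets through.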

The reason that primitive exists is a real incident, not a hypothetical. LOOP_SMOKE_0149 scheduled cleanly inside a brand-new session with no prior user turn, and the pane looked completely alive: the loop was visible, the schedule was ticking on time, everything about the surface looked healthy. Underneath, every single fired run failed inside the state database with the same error: thread not found, because the rollout path recorded for that session did not exist yet. Direct confirmation from the database showed the same schedule failing seven times in a row over roughly half an hour, same error every time, and nothing about the pane's appearance gave that away. It is the platonic case of a system reporting success while nothing actually happened.

The same distrust of a clean-looking return code shows up whenever work gets fanned out across the fleet. One batch pushed the same piece of work into fourteen different repos at once, each running under its own separate account so one subscription's rate limit could not gate the other thirteen. The dispatch layer reported success for all fourteen sends. Nobody treated that as the end of the story. A read-only verifier was spawned immediately afterward with the exact evidence files from the dispatch run and told to check, over SSH, whether the panes were actually alive, whether the tasks had actually been created, and whether any of the fourteen sends had silently misfired behind a success code. Dispatch was not trusted just because the send command said it worked. That is a canary too, just a one-time one instead of a scheduled one: an independent check that refuses to accept a green result at face value.
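The verifier's core comparison is simple enough to sketch. The following is an illustration under assumptions rather than the real verifier: the `Observation` fields, the function name, and the idea that evidence arrives as a dictionary keyed by target are all hypothetical, and the SSH collection step is left out entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """What an independent, read-only check saw for one target (hypothetical fields)."""
    pane_alive: bool
    task_created: bool


def silent_misfires(dispatch_codes, observations):
    """Return targets whose send reported success but whose evidence disagrees.

    dispatch_codes: maps target name to the send command's exit code.
    observations:   maps target name to an Observation gathered separately.
    A target with no observation counts as a misfire: missing evidence
    is not treated as health.
    """
    bad = []
    for target, code in dispatch_codes.items():
        if code != 0:
            continue  # loud failures are already visible; look for quiet ones
        obs = observations.get(target)
        if obs is None or not (obs.pane_alive and obs.task_created):
            bad.append(target)
    return sorted(bad)
```

The design choice that matters is the `obs is None` branch. A send that reported success and then left no evidence is flagged, not waved through.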

After the LOOP_SMOKE_0149 failure, an adversarial reviewer was dispatched with a narrow brief: reproduce the failure, find the root cause, propose the smallest fix, and write hardened regression tests for two specific conditions. The first is a fresh session with no rollout file yet. The second is the more general case of a schedule firing in the database without ever injecting a visible turn a person could see. The fix was verified the only way that actually counts for a scheduler: with a fresh exact-string canary. Later runs targeted LOOP_SMOKE_0150 with literal reply matching, layered underneath the heavier adversarial review as the deterministic floor everything else stands on.

The scheduled-run system prompt carries its own guardrail against a second, sneakier failure mode. It requires exactly one visible final response per run, even when there is nothing to do, no action is needed, or the run is blocked, and it explicitly forbids treating a phrase like "every minute" as an instruction to implement that cadence inside a single turn. A scheduler that quietly reinterprets its own schedule is just as dangerous as one that silently fails, because both look fine from the outside until you check the actual timestamps.
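That guardrail can also be checked after the fact. The sketch below assumes a run is available as a list of event dictionaries with a "kind" key, where "final_response" marks a reply a person could see. That event shape is invented for illustration and is not taken from the scheduler.

```python
def check_single_visible_reply(run_events):
    """Return (ok, reason) for one scheduled run.

    run_events: list of dicts with a "kind" key (hypothetical event shape).
    A run passes only if it produced exactly one non-empty final response.
    """
    finals = [e for e in run_events if e.get("kind") == "final_response"]
    if len(finals) == 0:
        return False, "no visible final response"
    if len(finals) > 1:
        # More than one reply suggests the run tried to act out the
        # cadence itself inside a single turn.
        return False, f"{len(finals)} final responses in one run"
    if not finals[0].get("text", "").strip():
        return False, "final response is empty"
    return True, "ok"
```

A run that replies "Nothing to do." passes. A run that made tool calls and then said nothing does not.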

Checking the actual timestamps is exactly what exposed the worst version of this failure at fleet scale. For roughly twenty hours straight, every scheduled maintenance loop in one product's cron tier died before doing any work: docs-sync, i18n-sync, worktree-guard, security-sweep, task-groom, and feedback-triage, around fifty invocations in total. The only successes were a handful of runs from before the outage started. Three different failure modes were mixed into the raw logs: the model reporting itself currently unavailable, a model-not-found error on the exact model being called, and a flat connection-refused after a few minutes of stalling. None of that is the damning part. The damning part is what a successful run does versus what a failed one does: a healthy loop posts a digest to the shared channel, and a failed one dies with zero tool calls and posts nothing at all. For twenty hours, a total fleet-wide outage was byte-for-byte indistinguishable from a quiet day where nothing needed attention.

That is the exact failure a canary exists to rule out. Silence is not evidence of health. Silence is only evidence of silence, and a scheduler that cannot tell you the difference between calm and dead is not a monitoring system, it is a coin flip you have decided to trust.
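One way to make calm and dead distinguishable is to have every run record a heartbeat even when it has nothing to report, then judge the loop by heartbeat age, not by whether a digest appeared. The sketch below assumes that heartbeat exists. The function name and the `missed_allowed` tolerance are illustrative choices, not part of the system described here.

```python
from datetime import datetime, timedelta


def classify_loop(last_heartbeat, now, interval, missed_allowed=1):
    """Classify a loop as "alive" or "dead" from its most recent heartbeat.

    Assumes every run records a heartbeat even on a quiet day, so
    silence in the channel is never the only evidence available.
    last_heartbeat: datetime of the latest heartbeat, or None if never seen.
    interval:       the loop's schedule period as a timedelta.
    missed_allowed: how many missed firings to tolerate (assumed default).
    """
    if last_heartbeat is None:
        return "dead"
    limit = interval * (missed_allowed + 1)
    return "alive" if now - last_heartbeat <= limit else "dead"
```

With an hourly loop and the default tolerance, a heartbeat from thirty minutes ago reads as alive and one from twenty hours ago reads as dead, regardless of whether either run had anything to say.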

Even a fix for that kind of outage needs its own evidence trail, or it becomes one more thing running on faith. Four separate state databases behind the loop scheduler once turned up matching backup files stamped to the same instant, filesystem evidence of some automatic whole-store repair pass running across every one of them the same day, hours before the LOOP_SMOKE_0149 incident surfaced. No human-readable log line explained what triggered it. The repair itself may have been the right move — the timestamps suggest it worked — but a self-healing action with no recorded reason is only half a canary. It tells you something happened. It does not tell you why, which means the next person who finds those backup files is stuck inferring instead of reading, exactly the position a canary is supposed to prevent.
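The missing half is cheap to add. As a rough sketch only, a repair pass could write a small note next to each backup saying why it ran. The function name, the file naming scheme, and the JSON fields below are all assumptions and say nothing about how the real repair pass works.

```python
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path


def backup_with_reason(store_path, reason, trigger):
    """Copy a store to a timestamped backup and record why the copy was made.

    Returns (backup_path, note_path). The note is a JSON sidecar file so
    the next person reads the reason and does not have to infer it.
    """
    store = Path(store_path)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    backup = store.with_name(f"{store.name}.{stamp}.bak")
    shutil.copy2(store, backup)  # preserves file metadata along with contents
    note = backup.with_name(backup.name + ".reason.json")
    note.write_text(json.dumps({
        "store": store.name,
        "backup": backup.name,
        "reason": reason,    # what was wrong
        "trigger": trigger,  # what decided to act
        "at": stamp,
    }, indent=2))
    return backup, note
```

The sidecar costs one extra file per repair and turns matching timestamps into an explanation.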

A second, quieter version of the same lesson happened on an admin-sync cron that ran once a day. One run crashed after writing nothing but its own log header — no error message, no stack trace a person would stumble across, just an empty run pretending to be a normal one. Nobody caught it that day. It surfaced only because the next day's scheduled run happened to inspect its predecessor's log before starting, noticed the log had only a header, concluded the prior run must have died, and fell back two days to the last known-good baseline before processing the full backlog in one pass. That recovery worked, but only because someone had built the habit of checking a predecessor's evidence into the loop itself. A loop that assumes its predecessor ran because it was scheduled to run is one silent crash away from quietly falling behind for as long as nobody happens to look.
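The predecessor check is a small amount of logic. The sketch below assumes logs are available as text keyed by date and that a crashed run leaves only a one-line header. Both assumptions, along with the function names and the seven-day lookback, are illustrative and do not describe the actual admin-sync job.

```python
from datetime import date, timedelta


def predecessor_ran(log_text, header_lines=1):
    """Treat a log with nothing beyond its header as a crashed run."""
    lines = [ln for ln in log_text.splitlines() if ln.strip()]
    return len(lines) > header_lines


def choose_baseline(today, logs_by_day, max_lookback=7):
    """Walk back from yesterday to the latest day whose log shows real work.

    logs_by_day: maps a date to that day's log text (hypothetical storage).
    Returns the baseline date, or None if no day in the window is trustworthy.
    """
    for back in range(1, max_lookback + 1):
        day = today - timedelta(days=back)
        log = logs_by_day.get(day)
        # A missing log is treated the same as a header-only log.
        if log is not None and predecessor_ran(log):
            return day
    return None
```

If yesterday's log holds only a header and the day before shows real output, `choose_baseline` returns the day before, which matches the two-day fallback in the story.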

Smoke evidence does not have to live inside the scheduler to count. Building a filtering contract across three separate services once went through eight distinct adversarial reviewers in under forty-five minutes, and one of them had exactly one job: reproduce the feature in an isolated temp directory and confirm that replaying the same event twice does not duplicate a loop and that private events actually stay ignored. That is a smoke test for a behavior, not for a scheduler, but it is the same instinct in miniature — do not accept a design document's claim that replay is safe, go make replay actually happen twice in a throwaway directory and look at what comes out.
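The two properties that reviewer checked, replay safety and private events staying ignored, can be shown on a toy model. The `LoopRegistry` class and the event fields below are invented for illustration, and the sketch keeps state in memory where the real check used a temp directory.

```python
class LoopRegistry:
    """Toy event consumer that creates one loop per public event id."""

    def __init__(self):
        self.loops = {}  # event id -> loop name

    def handle(self, event):
        """Process one event; return True only if a new loop was created."""
        if event.get("private"):
            return False  # private events are ignored entirely
        event_id = event["id"]
        if event_id in self.loops:
            return False  # replayed event: already handled, do not duplicate
        self.loops[event_id] = event["name"]
        return True
```

The smoke test is then a matter of doing the thing the design document claims is safe: call `handle` twice with the same event and confirm `loops` still has one entry, then send a private event and confirm it never appears.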

The same discipline shows up outside the scheduler entirely, in how production deployments get verified. A deploy pipeline resolves a smoke target URL, runs post-deploy smoke checks against it, and rolls back automatically on smoke test failure. This is not a manual step someone remembers to run, but a gate wired directly into the pipeline with its own exit code. The real log line from one rollout reads exactly like the policy working as designed: "smoke tests failed, rolling back ECS service, process completed with exit code 1". That is a canary with teeth. It does not just report that something is wrong; it undoes the change that caused it, automatically, before a human notices anything from the outside. It also fails safely in the direction that matters: the default outcome of an unclear result is rollback, not proceed. A canary that has to be convinced something is wrong before it acts is backwards. The burden of proof belongs on the deploy, not on the check. The instruction wrapped around that same pipeline is just as blunt: loop every minute checking CI and deploy status until the deployment fully succeeds, and never use continue-on-error in CI, because a step that is allowed to fail quietly is a step that will eventually fail quietly at the worst possible time.
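The "rollback unless proven healthy" rule can be sketched as a single gate function. The `smoke_check` and `rollback` hooks are hypothetical callables standing in for whatever the pipeline actually runs, so this shows the decision logic only, not an ECS integration.

```python
def gate_deploy(smoke_check, rollback):
    """Run a smoke check and roll back unless it clearly passes.

    smoke_check: callable returning True on success (hypothetical hook).
    rollback:    callable that undoes the deploy (hypothetical hook).
    Returns a process-style exit code: 0 to keep the deploy, 1 after rollback.
    """
    try:
        # Anything other than an explicit True counts as a failure.
        passed = smoke_check() is True
    except Exception:
        passed = False  # an unclear result is treated as a failed one
    if passed:
        return 0
    rollback()
    return 1
```

A check that returns `None`, times out, or raises all lead to the same place: rollback and a nonzero exit code. The deploy has to prove itself, and the check never has to prove the deploy is broken.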

Put all of it together and the shape is the same at every layer. A one-line reply loop proves the scheduler is honest. An exact-string canary proves a specific fix actually holds instead of merely looking fixed. A predecessor-log check proves yesterday's run actually ran instead of assuming it did because it was supposed to. A smoke-gated rollback proves a deployment actually works before it gets to stay. None of these are clever. All of them exist because something upstream — the scheduler, the model, the deploy target, a stale credential — has already lied by omission at least once, silently, in a way that looked completely fine until someone went looking for the exact string that was supposed to show up and did not.

Trust nothing you have not tested, including the thing that runs the tests. A canary is not a nice-to-have layered on top of a loop you already believe in. It is the only reason you are allowed to believe in it at all.


What we shipped, what broke, and what we learned
