North Star Loops

01-27-26

Standing over a pane repeating yourself is not management, it is a job with no salary. A north star loop is the same instruction, said once, that keeps saying itself.

Standing over a pane repeating yourself is not management, it is a job with no salary and no end time. A north star loop is the same instruction, said once, that keeps saying itself.

The habit started as a single typed line, the kind that reads like a shrug and turned into a policy: set a loop that runs every minute, and this loop must keep you on track to keep testing, keep fixing, until all works, never cheat, never take shortcuts. No architecture behind it at first. Just the recognition that an agent left alone for an hour drifts, and the fix for drift is not a smarter agent, it is a dumber, more relentless reminder. Prompting each agent by hand, pane by pane, does not fail because the instructions are wrong. It fails because a human forgets to repeat them at minute forty when attention has moved on to the next fire. A loop never moves on.

That habit eventually became its own skill: give it a piece of work, and it writes the tasks, then starts a one-minute loop whose entire prompt is your north star — stay on these tasks, finish them, test them, including in a browser if it's frontend, and do not call it done until it is fully tested and working. The loop does not get smarter over time. It does not need to. Its job is to keep saying the same true thing on a schedule so nobody has to remember to say it.

The skill spells out what fully tested actually means instead of leaving it to interpretation: writing tests, running them, testing the workflow in a browser if the change touches a frontend, and running it through a UI review pass if it does. None of that is exotic. What makes it a north star instead of a checklist is that the loop keeps asking whether all of it is true every single minute, instead of asking once at the start and trusting the answer for the rest of the session.

The clearest proof that a north star can be a person instead of a cron job showed up during what the fleet still calls the codebase freeze. Eight agents were mid-sprint on the main product, each named for a Roman emperor and each scoped to one lane: one lead on git and deploy only, one on features, one on the server, one on the public app, one on browser QA, one on tests, one general developer, one on regression. Then deploys started failing. A coordinator identity, MAXENTIUS, broadcast the freeze: stop all refactoring, stop type extractions, stop export renames, stop file moves, because agents were independently improving shared type files without committing the consumers that depended on them in the same breath, and every one of those partial commits broke CI for everyone else. One agent by name got singled out mid-broadcast: stop extracting shared types immediately. During the freeze, every agent got redirected to read-only analysis and additive tests only, gated on one explicit word from the lead: clear. The freeze itself is a north star with a single condition attached — nothing merges, nothing refactors, until the person who called the freeze says the word that lifts it. No agent gets to decide independently that the coast looks clear enough.

A freeze order is a north star for a whole squad. What happened next was a north star for one agent. MAXENTIUS caught TIBERIUS spinning on the exact same 429 and 500 errors for more than thirty minutes and intervened directly, not with sympathy, with a numbered list: mark the task blocked with a note, move to a different pending task that does not require a running server, remember the freeze is still active, and never stop working, move on now. Nobody had to notice this by watching a dashboard. A coordinator layer was reading worker output continuously and force-correcting the exact failure mode the fleet's rules are built to catch — an agent trusting its own repeated attempt instead of admitting the attempt was not working.

Two other agents got a smaller, more mechanical version of the same nudge that same freeze: you appear stuck, heartbeat across every coordination surface — todos, conversations, mementos — never stop. That line is almost insultingly simple. It does not diagnose anything. It does not offer a better plan. It just refuses to let silence pass as progress, which is most of what a north star loop actually needs to do.

The other half of management by loop is refusing to let an agent cheat its way to looking finished, and the sharpest example of that discipline cost two weeks of real product feedback sitting untriaged. A scheduled loop tried to pull production user feedback into the task tracker every few days and hit the identical wall eight runs straight — the only configured profile authenticated as an org member, the admin endpoint returned a 403, the fallback endpoint was not even deployed, and the vault-stored admin password had gone stale. Every run re-probed cheaply, confirmed nothing had changed, and logged the same tracking note instead of re-diagnosing from scratch. By the eighth run the loop's own note read plainly: production feedback now untriaged fourteen-plus days. There was a workaround sitting right there — log in with a live prod admin credential — and the loop refused it outright: production credential, not sanctioned for an unattended loop. Correctness over throughput, even when throughput was two weeks of a growing backlog. A north star that only pushes an agent forward is half a north star. The other half is a loop that will not let an agent take the shortcut that happens to be sitting in reach.

Not every north star loop needs that much judgment behind it to be worth running. A side project's dev server got a plain fifteen-minute watchdog: start dev, clear the cache, and set a loop that checks it and fixes it if it's broken and restarts it. That prompt fired 577 times captured in one transcript alone — curl a health endpoint, and if it's down, tail the log, run a type check, fix the obvious regression, restart through a script, re-verify, and report healthy in one line when nothing was wrong. No judgment call more sophisticated than up or down. It ran unattended for days anyway, because most of what breaks a dev server does not need a security analyst, it needs someone willing to check every fifteen minutes without getting bored.

At the other end of the same idea sits a goal that ran for three straight days on a single continuous session — a Gmail backfill, framed under an explicit token budget of none, tokens remaining unbounded — climbing from 62,526 tokens to over a million within the first hour and topping out past 7.4 million tokens for that one session. The harness's own instructions state the north star as plainly as it gets stated anywhere: ending this turn does not require shrinking the objective to what fits now, do not redefine success around a smaller or easier task. That line exists because the easiest way for an agent to look finished is to quietly narrow the goal until whatever got done counts as done. A north star loop's whole value is refusing to let the finish line move just because the runner is tired.

That principle does not make the ceiling disappear, it just makes the failure honest when the ceiling gets hit. A goal to rebuild a fleet-wide computer-use control program burned past thirteen million tokens over nine hours and stalled at a usage limit, not because the plan was wrong but because the session ran out of runway mid-execution. A three-site autonomous news operation burned 45.4 million tokens over thirty-three and a half hours and hit the same wall before its own core claim — that it could publish without manual infrastructure edits — was ever verified. Against those two sits a fifty-hour, 69.4-million-token security and admin sweep that actually closed as complete. Same loop shape, same never-redefine-success instruction, two different outcomes that look identical while they are still running. A north star loop cannot promise you will reach the destination. It can only promise you will not quietly agree to a closer one.

Zoom out from any single goal and the same tradeoff shows up at fleet scale. Across one machine's full tracked history, 882 goals closed complete for roughly 294 million tokens total, while 21 goals hit a usage ceiling before finishing and still burned about 101 million tokens getting there — a small number of runs that never landed consumed close to a third as many tokens as the large majority that did. A north star loop does not make that ratio go away. What it guarantees is that the 21 stalled goals stalled honestly, still pointed at the original objective, instead of quietly declaring victory on something smaller partway through.

The instinct behind all of it surfaced once, almost by accident, in the middle of an unrelated feature dictation, not written down as a rule first and enforced after: verification should be automatic, the user should not have to say verify all the time, it should just make sure it fully works and the workflow stays simple. That it fell out organically, mid-sentence, while talking about something else entirely, is better evidence that it is a real belief than if it had been typed up as policy from the start.

Not every north star is scoped to a task with an end. Some are handed over as standing ownership: investigate the loops tmux session, the heartbeat file, and every cron tier on this machine, post exactly one digest of what exists and what looks stale, and going forward you own loops oversight here. No deadline, no closing task. The north star in that case is not finish this, it is keep watching this, indefinitely, and say something the moment it stops being true. That is a different shape of management than a one-minute test-and-fix reminder, but it is the same underlying move — hand off the job of remembering, permanently, to something that does not get bored of remembering.

Management by loop is not more sophisticated than management by memory. It is dumber on purpose, running on a fixed interval, saying the same thing every time whether or not it feels necessary that minute. The difference is that it never gets tired of saying it, never decides the reminder is not needed today, and never quietly lets a stuck agent's thirtieth attempt at the same broken call pass as effort. Set the north star once. Let the loop repeat it for as long as the work takes.

← Back to the articles

What we shipped, what broke,and what we learned

What we shipped, what broke,
and what we learned