Nested Loops: Machines That Maintain Themselves

01-20-26

A machine that needs you every morning is not automated, it is just slow to fail. Nested loops are the difference.

A machine that needs you every morning is not automated, it is just slow to fail. That was the actual insight behind nested loops, not the Silicon Valley name for it.

The idea arrived as one run-on paragraph, typed, not a spec: set up loops, and loops inside loops, so the machines maintain themselves and the AI does the boring work on a schedule — every morning check and update npm packages, every 30 minutes search for supply-chain attacks, every 30 minutes check for stale processes overloading the server and kill or fix them, every 15 minutes check tmux for dead panels and recreate them. Some ops, some security, some just cleanup. The whole point was a machine that keeps itself healthy without anyone opening a laptop to check on it.

Before naming anything, there was a smaller decision that mattered more: what is this thing, structurally. Not a machine named spark01 with scripts bolted onto it. A project — machine level first, repo level later — something a person could actually manage the same way they manage code. It almost went into a project registry with no repository attached, and when that turned out awkward, it went into the todos CLI instead as a real project with real tasks, because self-maintaining machines still need a plan, not just a crontab.

The scope question came before any of the building: machine level, company-division level, app-type level, or repo level. The honest answer was all four eventually, but machine level and repo level first, and machine level before that, because a machine that cannot keep itself alive is not ready to be trusted with pulling and reviewing code across every repo every morning. That second loop — check and pull from every repo on GitHub, first thing, before anyone is awake — got parked deliberately until the machine-level loops were solid. Sequencing the ambition mattered as much as the automation itself.

Four pillars came out of that first paragraph: package updates, supply-chain watch, process hygiene, and dead-pane revival. Each one sounds small. Each one, left undone, is a different flavor of the same problem — a machine slowly degrading while everyone assumes someone is watching it.

Building it out was deliberately sequenced, not dumped in one shot. First came the empty scaffolding: one tmux session named loops, four windows — loops-security, loops-infra, loops-guard, loops-cost — created with an explicit instruction not to append anything to them yet. Only after the skeleton existed on spark01 did the question turn to spark02: the loops session should exist per machine, so create one there too, before a single skill gets attached to it. Structure before content, every time, because a loop bolted onto an undefined space is harder to find later than one that was never scheduled.

The skills themselves were built under a constraint that sounds strange for automation: do not make these very deterministic, the agent has to do the actual work itself. That instruction was enforced the same way most of the serious work gets enforced — a session-scoped stop condition that would not let the session end until the skills existed and were synced to spark01. The schedule is deterministic. What the loop does when it fires is not a fixed script, it is a judgment call the model makes fresh every run. A supply-chain check that hard-codes what counts as a threat goes stale the week you write it. One that reasons about it each time does not.

Dead-pane revival took three attempts before it actually worked, which is the part that never makes it into the tidy version of this story. The first pass was a plain instruction: redo the dead panes properly. It ran, and it undercorrected — sessions stayed dead. The second pass built a named skill, skill-tmux-revive, but it was still a bag of ad hoc steps, not a program, so the complaint came back within days: it does not actually have a real script and a CLI, you have to write it fresh every time, and it still is not working for alumia, I still see the dead stuff. Only the third pass stuck: a bun script and an installed CLI at ~/.local/bin/tmux-revive, callable by name, doing the same restoration the same way every time.

The proof that it finally worked was not a demo, it was a recovery. Spark01 had 21 tmux sessions go down, and skill-tmux-revive brought all 21 back from a resurrect snapshot in one pass — alumia01 through alumia12, codewith01 through codewith04, every iapp-[name] window, open-codewith01 through 04. Same session names, same window names, same layout. The rule that made this possible was decided before any of the code: session name and window name must match. A pane you cannot identify by its own name is a pane you cannot revive by policy, only by memory, and memory does not scale across machines.

Somewhere in the middle of that iteration there was a sharper moment. An agent was told to redo a couple of specific sessions and killed more than it was supposed to. The answer that came back was blunt: you killed them all. No workshop on blast radius, no retro doc — just a tighter skill and a instruction added to the next revision: restore exactly what was asked for, not the neighborhood around it. Self-maintaining does not mean self-authorized to improvise.

Dust sweeps are the other half of self-maintenance, and they are duller by design, which is correct. Long-running fleets of agents generate mess — orphaned files, stale build output, tmp directories nobody remembers creating. A dust_sweep.sh script runs on a schedule and clears it before a full disk becomes the reason a deploy fails at 2am. The same sweep through spark01's health turned into a small cluster of fixes bundled under one task: heap caps, earlyoom, zswap, swappiness tuning, dust sweep, filed together because they are the same category of problem — a machine slowly filling up with agent exhaust until it chokes.

Memory got two dedicated loops out of that investigation: memory-guard and memory-watch, backed by their own scripts, memory_watch.sh and memory_guard_root_setup.sh. Not because memory pressure is exotic, but because it is exactly the kind of failure a fleet of coding agents produces constantly and nobody watches constantly. Agents spawn subagents, subagents hold context, context holds memory, and a machine running sixteen or seventeen authenticated profiles at once will find the ceiling eventually if nothing is watching for it.

The third pillar, stale-process hygiene, got its own skill for the same reason the others did — machine-guard, shipped in the same nine-skill batch as package-update and supply-chain-watch, sitting on the same schedule as the memory fixes: heap caps, earlyoom, zswap, swappiness, dust sweep, filed as one task because a process eating the server and a disk filling with tmp files are the same failure mode wearing different clothes. Killing a runaway process without also sweeping what it left behind is half a fix.

None of this counts unless it can be proven, and proof here means the same thing it means everywhere else in the operation: a task id and an artifact, not a status update. The 21-session recovery has a task id behind it, filed and closed the same day. The standing duty to watch a machine's loops does not end when the agent says it posted a digest — the follow-up question is always did you actually post it, and if yes, give me the message id, if not, post it now and confirm with the id. A self-maintaining machine that cannot produce evidence of its own maintenance is indistinguishable from a machine that is quietly not maintaining itself at all.

The sharpest lesson in this whole area did not come from something we scheduled — it came from something that was already running in a loop we did not intend. Testers-mcp, one of the MCP services, hit a crash-restart loop that had gone unnoticed past 85,000 restarts. Stdio was being handed /dev/null for stdin, failing immediately, and the process manager just kept trying again, forever, quietly. Nothing alerted, nothing surfaced, because a restart is not an error, it is just a restart, until you count them. The fix was two decisions, not one: make the service fail loudly instead of silently respawning, and move it off stdio onto HTTP on its own port, because stdio bloats the load on the machine anyway and was never the right transport for something this chatty. That distinction — HTTP over stdio for anything that runs constantly — went straight into global memory so every future MCP service inherits it instead of rediscovering it.

That is the real test for whether a machine is maintaining itself or just failing slowly in a way that looks like uptime. A crash-restart loop technically kept the service "up." It was still a bug wearing a uniform. Self-maintenance has to include failing loudly, not just staying alive.

The skeleton underneath all of this is a heartbeat file and a set of cron tiers: infra checks, supply-chain watch, machine health, MCP fleet health. One agent per machine gets a standing duty to know the shape of that skeleton — what loops exist, when each last ran, what looks stale — and to report it exactly once, not as a running commentary. The instruction for that duty is specific enough to quote almost verbatim: investigate the loops tmux session, the heartbeat checks, and the cron tiers, then post one concise digest, and going forward you own loops oversight for this machine. Ownership, not a checklist someone re-reads every morning.

None of it stays on one machine either. Every skill that came out of this — package-update, supply-chain-watch, keyleak-scan, tmux-revive, dust-sweep — gets synced to spark01, spark02, and apple03 the same day it ships on the primary box. A self-maintaining machine that only maintains itself alone is a nice trick. A fleet where every machine maintains itself the same way is the actual point, because the alternative is remembering which machine got which fix, and that is exactly the kind of manual bookkeeping loops exist to kill.

What changed in practice is unglamorous, which is how you know it worked. Packages update themselves before dawn. A crashing MCP now screams instead of quietly eating restarts. Dead tmux sessions come back with their own names intact instead of turning into a scavenger hunt. A dust sweep runs before disk pressure becomes an incident instead of after. None of it needed a dashboard. It needed a machine willing to notice its own mess and a loop with permission to clean it up.

A machine that waits for you to notice it is broken was never automated in the first place. It was just quiet about needing you.

← Back to the articles

Nested Loops: Machines That Maintain Themselves

What we shipped, what broke,and what we learned

What we shipped, what broke,
and what we learned