Dispatch Doesn't Assume Success

tmux send-keys is flaky: the Enter often doesn't submit and the text just sits in the composer. A dispatch tool that doesn't verify delivery is worse than no dispatch tool at all.

Driving coding agents that live in tmux windows by hand with `tmux send-keys` is flaky. The Enter often doesn't submit: the text just sits in the composer, visibly typed, silently never sent. Long prompts get mangled on the way in. There is no delivery confirmation. And none of it works across machines without something else stitched on top. That list is not a complaint. It is the problem statement that produced the tool, and every design decision after it follows from taking it seriously.

The fix starts with bracketed paste instead of raw keystrokes: long prompts go in through a tmux buffer, not typed character by character into a pane that might reflow, autocomplete, or eat a keystroke halfway through. Then comes a delay before hitting Enter, computed from the prompt itself rather than guessed. The formula uses word count and character count and is clamped between a floor and a ceiling, so a one-word prompt does not fire Enter before the terminal has caught up and a ten-thousand-word prompt does not wait an absurd amount of time either. If the Enter does not appear to land, the tool retries.

The numbers behind the delay are published, not hidden inside a magic constant: the floor is 150 milliseconds, the ceiling is 4,000, and in between the delay grows by roughly 9 milliseconds per word and 0.6 milliseconds per character. All of it can be overridden by an environment variable or a flag if a particular machine or terminal needs different timing. None of this is exotic. It is what happens when you stop assuming a keystroke arrived just because you sent it, and start treating terminal timing as a real variable instead of an implementation detail nobody has to think about.
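A minimal Python sketch of the paste-then-wait sequence described above, not the tool's actual code. It assumes the delay is the per-word and per-character terms summed and then clamped to the published floor and ceiling; the function names, the buffer name, and the `override_ms` parameter (standing in for the environment variable or flag) are hypothetical. The tmux subcommands themselves (`load-buffer`, `paste-buffer -p`, `send-keys`) are standard.

```python
import subprocess
import time
from typing import List, Optional

# Numbers quoted in the article; how they combine is an assumption.
FLOOR_MS = 150.0
CEILING_MS = 4000.0
PER_WORD_MS = 9.0
PER_CHAR_MS = 0.6


def enter_delay_ms(prompt: str, override_ms: Optional[float] = None) -> float:
    """Milliseconds to wait between pasting the prompt and sending Enter."""
    if override_ms is not None:
        # Stand-in for the environment variable or flag override.
        return float(override_ms)
    scaled = PER_WORD_MS * len(prompt.split()) + PER_CHAR_MS * len(prompt)
    return max(FLOOR_MS, min(CEILING_MS, scaled))


def paste_commands(target: str, buffer_name: str = "dispatch-prompt") -> List[List[str]]:
    """tmux invocations for a buffer-based bracketed paste followed by Enter."""
    return [
        # Read the prompt from stdin into a named tmux buffer.
        ["tmux", "load-buffer", "-b", buffer_name, "-"],
        # -p wraps the paste in bracketed-paste codes if the app asked for them;
        # -d deletes the buffer afterwards.
        ["tmux", "paste-buffer", "-p", "-d", "-b", buffer_name, "-t", target],
        ["tmux", "send-keys", "-t", target, "Enter"],
    ]


def send_prompt(target: str, prompt: str) -> None:
    """Paste the prompt, wait the computed delay, then press Enter."""
    load, paste, enter = paste_commands(target)
    subprocess.run(load, input=prompt.encode(), check=True)
    subprocess.run(paste, check=True)
    time.sleep(enter_delay_ms(prompt) / 1000.0)
    subprocess.run(enter, check=True)
```

Under these assumptions a two-character prompt waits the 150 ms floor, a 100-word prompt of 500 characters waits about 1.2 seconds, and a very long prompt is capped at 4 seconds. Retrying the Enter is left out because it depends on the verification step.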

The part that actually matters is what happens after the Enter: the tool diffs the pane before and after the send, checking for the two things that mean a prompt actually went through — a spinner or an interrupt indicator appearing, or the prompt text leaving the composer and showing up as submitted input further up the pane. If neither shows up, dispatch does not report success. It reports exactly what it saw, which is nothing changing, which is the accurate description of a prompt that silently failed to submit. dispatch doesn't assume success — it verifies.
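The before-and-after comparison can be sketched as a pure function over two pane captures (for example, two outputs of `tmux capture-pane -p`). This is an illustration under stated assumptions, not the tool's implementation: the busy markers are hypothetical strings, and the composer is assumed to occupy the last few non-empty lines of the pane.

```python
from typing import Dict

# Hypothetical indicators; a real agent UI has its own spinner and interrupt text.
BUSY_MARKERS = ("esc to interrupt", "Working")


def composer_region(pane: str, tail_lines: int = 5) -> str:
    """Assumption: the composer is within the last few non-empty lines."""
    lines = [ln for ln in pane.splitlines() if ln.strip()]
    return "\n".join(lines[-tail_lines:])


def verify_send(before: str, after: str, prompt: str) -> Dict[str, bool]:
    """Report what changed in the pane; never infer success from the send itself."""
    first_lines = prompt.strip().splitlines()
    snippet = first_lines[0][:40] if first_lines else ""

    # Signal 1: a spinner or interrupt hint that was not there before.
    busy_appeared = any(m in after and m not in before for m in BUSY_MARKERS)

    # Signal 2: the prompt is visible in the pane but no longer in the composer.
    left_composer = bool(snippet) and snippet in after and snippet not in composer_region(after)

    return {
        "pane_changed": before != after,
        "busy_appeared": busy_appeared,
        "left_composer": left_composer,
        "submitted": busy_appeared or left_composer,
    }
```

A prompt that is typed but stuck in the composer yields `pane_changed: True` with `submitted: False`, which is the case a bare exit code cannot distinguish from success.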

The reason this distinction is worth an entire tool, and not just a five-line wrapper script, showed up directly in a real dispatch run against fourteen repositories at once. A single delegation batch pushed work into fourteen different open-source repo tmux windows on one machine simultaneously, each addressed through the same send mechanism, each running under a separate subscription profile so that one account's rate limit could not gate all fourteen repos' worth of work. That part is the multi-profile thesis working as designed — spread the load, dodge the ceiling. The part that matters here is what happened immediately after the sends went out: a read-only verifier, with no editing permission at all, was handed the exact evidence files from the dispatch run and told to check over SSH whether the panes were actually alive, whether any of them were dead, and whether any of the fourteen sends had silently misfired. The dispatch tool reporting success on all fourteen sends was not treated as proof that fourteen agents were now working. It was treated as a claim that needed an independent check before anyone trusted it.

Fourteen was not a round number picked for effect. It was fourteen because that was the exact count of repositories needing the same class of work that week, each one wired to its own subscription so that a single account's hourly ceiling could not become the bottleneck for the entire batch. Spreading load across profiles only pays off if the sends actually land. Dodge one rate limit only to lose the work to a second, silent failure mode, and you have not saved anything; you have just moved the waste somewhere harder to see.

That instinct — verify the dispatch, do not just trust that it returned success — exists because the alternative failure mode is worse than it sounds. A send that silently fails looks, from the dispatcher's side, identical to a send that worked. The command returns the same exit code either way if you are not checking the pane. Which means every downstream assumption built on top of that dispatch — the agent is now working on repo fourteen, the task will show up as claimed in an hour, the fleet is running at full parallel capacity — is a guess wearing the clothes of a fact. Flaky delivery is worse than no delivery, because no delivery fails loudly and flaky delivery fails silently, three steps downstream, in a place nobody is looking yet.

A scheduled loop on the same fleet produced almost exactly that failure shape from a different angle. A recurring loop had been set up to fire a prompt into a session on a timer, and by every visible signal it was healthy: the pane looked alive, the schedule was ticking on time, nothing crashed. Underneath that healthy-looking surface, every single fired run was failing with the same database error, because the session's rollout path did not exist yet at the moment the schedule first fired. Seven runs in a row failed identically, and the loop kept reporting as scheduled and alive the whole time, because from its own point of view nothing had gone wrong — the schedule itself was working perfectly. Only a dedicated adversarial reviewer, handed the exact failure log and asked to reproduce the root cause, found that the thing the schedule was supposed to be triggering had never actually happened once. "Looks alive" and "is working" are two different claims, and the gap between them is exactly where dispatch's pane-diffing exists to look.

The same fleet's own loop scheduler eventually got the cheapest possible check bolted on underneath its heavier adversarial review, once that failure mode was understood: schedule a prompt that asks for one exact string back, then check the transcript for that exact string. Nothing clever about it. Either the literal text shows up where it is supposed to, or it does not, and there is no interpretation required either way. It is the same principle as pane-diffing applied one layer up the stack — do not ask whether the system reports success, ask whether the specific, checkable thing you expected to see actually happened. A canary that has to be argued about is not a canary.
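The exact-string canary is small enough to sketch in full. The function names and the token format here are hypothetical. One assumption matters: the check looks only at the agent's replies, because the prompt itself contains the expected string and would otherwise satisfy the check on its own.

```python
import secrets
from typing import Iterable, Tuple


def make_canary() -> Tuple[str, str]:
    """Return (token, prompt). The token is random so stale output cannot match."""
    token = "CANARY-" + secrets.token_hex(8)
    prompt = "Reply with exactly this string and nothing else: " + token
    return token, prompt


def canary_passed(agent_replies: Iterable[str], token: str) -> bool:
    """True only if the literal token appears in something the agent wrote."""
    return any(token in reply for reply in agent_replies)
```

A scheduled run that fires but never reaches the agent produces no reply at all, so `canary_passed([], token)` is `False` with nothing to interpret.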

Sending the prompt to the wrong machine entirely is the same failure at a different layer, and it is not hypothetical — a fleet's own tmux status bar once reported the wrong physical machine name in its own footer, meaning a human or a script reading that label to decide where a dispatch should go could have confidently targeted the wrong box. Cross-machine dispatch depends on a fleet's machine registry telling the truth about which pane lives where, over Tailscale, LAN, or SSH. If that identity layer lies, no amount of pane-diffing on the sending side saves you, because you verified that a prompt landed somewhere — just not necessarily where you meant to send it. Verification has to cover the whole path, not just the last hop.

The guardrails around dispatch exist for the same reason the verification does: a tool that moves text into a live terminal pane is a tool that could, with one wrong assumption, execute that text instead of just displaying it. So the send path refuses to target a shell pane at all — it will not paste a prompt into something that is not recognizably an agent's composer — and the execution path, the one that actually runs shell commands, refuses the reverse: it will not run a command against something that looks like an agent pane. Prompt text can be delivered to an agent. Shell commands can be delivered to a shell. The two paths are kept structurally incapable of trading places, because the failure mode of getting that wrong is a destructive command executing somewhere nobody intended it to run. On the execution side specifically, a fixed blocklist refuses filesystem formatting, fork bombs, curl piped into bash, and rewrites under ~/.ssh outright, no matter how the request is phrased — the same posture as the send/exec pane split, applied to content instead of destination.
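A rough sketch of the two guardrails, under loud assumptions: panes are classified by their foreground command (as tmux reports it in `#{pane_current_command}`), the agent and shell name sets are placeholders, and the regular expressions are illustrative approximations of the four blocked categories, not the tool's actual blocklist.

```python
import re
from typing import Callable

# Hypothetical classification sets; "agent-cli" is a placeholder name.
SHELLS = {"bash", "zsh", "fish", "sh", "dash"}
AGENTS = {"agent-cli"}

# Illustrative patterns for the four categories named in the article.
BLOCKLIST = [
    re.compile(r"\bmkfs(\.\w+)?\b"),                           # filesystem formatting
    re.compile(r":\(\)\s*\{\s*:\s*\|\s*:\s*&\s*\}\s*;\s*:"),   # classic fork bomb
    re.compile(r"\bcurl\b[^|;]*\|\s*(sudo\s+)?(ba)?sh\b"),     # curl piped into a shell
    re.compile(r"(>>?|\btee\b|\bcp\b|\bmv\b|\brm\b|\bsed\s+-i\b)[^;|&]*~/\.ssh"),  # writes under ~/.ssh
]


def check_send(pane_command: str) -> None:
    """Prompt text may only go to a pane recognized as an agent."""
    if pane_command not in AGENTS:
        raise PermissionError("refusing to send a prompt to non-agent pane: " + pane_command)


def check_exec(pane_command: str, command: str) -> None:
    """Shell commands may not target an agent pane or match the blocklist."""
    if pane_command in AGENTS:
        raise PermissionError("refusing to run a shell command in an agent pane")
    for pattern in BLOCKLIST:
        if pattern.search(command):
            raise PermissionError("command matches blocklist: " + pattern.pattern)


def allowed(check: Callable[..., None], *args: str) -> bool:
    """Convenience wrapper: True if the guard lets the request through."""
    try:
        check(*args)
        return True
    except PermissionError:
        return False
```

The design point is that the two checks are separate functions with opposite conditions on the same pane classification, so no single call path can both paste a prompt and run a command.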

Scheduled dispatch adds a durability requirement on top of the verification one: a single pidfile-backed daemon owns the schedule, backed by local SQLite state, so a dispatch that is supposed to fire at 3am does not silently duplicate itself if two copies of the daemon somehow start, and does not vanish if the machine reboots at 2:58. Cross-machine dispatch depends on the same daemon knowing the fleet's machine registry — which boxes exist, reachable over Tailscale, LAN, or plain SSH — so a scheduled prompt aimed at a specific repo's tmux window lands on the correct physical machine even when nobody is watching the send happen in real time.
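One way the single-owner and no-duplicate properties could be sketched, assuming a pidfile created atomically and a SQLite table keyed on schedule and fire time. The table layout, column names, and function names are hypothetical, and stale-pidfile recovery after a crash is deliberately left out.

```python
import os
import sqlite3


def acquire_pidfile(path: str) -> bool:
    """Create the pidfile atomically; return False if another daemon holds it."""
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(str(os.getpid()))
    return True


def open_state(path: str) -> sqlite3.Connection:
    """Open (or create) the on-disk schedule state so it survives a reboot."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS fires ("
        " schedule_id TEXT NOT NULL,"
        " fire_at TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'claimed',"
        " PRIMARY KEY (schedule_id, fire_at))"
    )
    db.commit()
    return db


def claim_fire(db: sqlite3.Connection, schedule_id: str, fire_at: str) -> bool:
    """Return True only for the first claim of a (schedule, time) slot."""
    cur = db.execute(
        "INSERT OR IGNORE INTO fires (schedule_id, fire_at) VALUES (?, ?)",
        (schedule_id, fire_at),
    )
    db.commit()
    # rowcount is 1 when the row was inserted and 0 when the insert was ignored.
    return cur.rowcount == 1
```

The primary key is a second line of defense: even if two daemons somehow got past the pidfile, only one of them could claim a given slot, and a restarted daemon can see which slots were already claimed.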

None of this is about distrust of the agents themselves. It is distrust of the delivery mechanism, which is a completely different thing and a much more tractable problem. An agent that receives a prompt clearly and confirmed can be held to a high bar for what it does next. An agent that never received the prompt at all cannot be blamed for doing nothing, and worse, nobody watching the dispatcher's success log has any way to know that is what happened. The bug is invisible exactly where it is most expensive: at the boundary between the system that assigns work and the system that does it, right where every accountability claim downstream depends on the assignment having actually landed.

The whole point of a dispatch tool is that a prompt actually lands and submits, every single time, and that the tool tells you the truth on the rare occasion it does not. Anything short of that is not a convenience layer over `tmux send-keys`. It is the same flaky foundation with a nicer command name on top, and a fleet built on a foundation that occasionally lies about whether the ground is there is not a fleet you can trust to be doing what your dashboard says it is doing. Everything you build on a fleet of agents (the task lists, the dashboards, the assumption that fourteen repos are moving forward right now) is only as true as the delivery layer underneath it. That layer earns trust by checking itself, not by asking to be believed. Verify the send. Then, and only then, trust the result.
