AI

Cheap Workhorse, Frontier Verifier

Cheap Workhorse, Frontier Verifier

A cheap model produces, a frontier model verifies through an explicit approved call, and structured markdown lets the smart model write the plan once so a cheap model can execute it a thousand times.

Most of what a coding agent does in a given hour is not hard. Read a file. Run a test. Click a button and check the page did not break. Paying frontier-model prices for that work is not rigor, it is waste with a good excuse attached.

The fix is a split we have built into the core of our agent-orchestration stack: a cheap workhorse model produces, a frontier model verifies. Not as a suggestion in a prompt. As a gate in code. The verifier has to call an approved tool before a result counts as done, and results are only accepted once that call happens. No approved call, no acceptance, no matter how confident the workhorse sounds about its own output.

This is the actor/verifier loop at the center of our swarm framework, the piece that turns a goal into a task graph and fans it out across many agents at once. Cheap models — the workhorses — do the actual producing: writing the code, drafting the copy, running the test, filling the form. A frontier model, usually the strongest one available, sits above them and critiques. It reads the workhorse's output the way a senior reviewer reads a junior's pull request, and it has exactly one way to signal approval: an explicit tool call, structured, logged, checkable after the fact. Prose that says "looks good" does not count. The gate is code, not vibes.

The framework gives the split two execution tiers so the workhorse can be whatever is cheapest for the job. One tier runs in-process model calls for fast, small units of work. The other spawns a full coding-agent CLI as a subprocess when a task needs a real terminal and real tools. Both report back into the same shared state, arranged into whatever topology the goal actually needs — a straight pipeline, a fanout to many workers at once, a hierarchy of sub-goals, a mesh where agents can hand work sideways to each other, or a self-expanding graph that grows new nodes as it discovers more work. The topology is a scheduling decision. The actor/verifier split underneath it never changes: something cheap does the work, something smarter checks it.

The economics only work because the two tiers run at wildly different volumes. The workhorse fires constantly — every file read, every test run, every small edit. The verifier fires rarely, once per meaningful unit of work, because judgment is the expensive part and production is the cheap part. Pay frontier prices for the rare judgment call. Pay workhorse prices for the constant grind. Mixing the two up, paying frontier rates for grind or trusting workhorse judgment on anything that matters, is how teams end up either broke or wrong. A frontier model priced at grind-level volume will bankrupt a project long before it finishes. A workhorse trusted with a judgment call will approve its own mistakes and never tell you.

We took the same thesis and turned it into a protocol instead of leaving it as an architecture pattern, because the idea is bigger than any one framework. Structured markdown works as an intermediate representation between models: the smart model writes it once, and a cheap model or plain regex executes it as many times as needed after that. The protocol ships validate, run, compile, lint, and inspect as first-class commands, because a plan is only worth reusing if you can check it is well-formed before you burn a single execution on it. One frontier pass produces a plan. A thousand cheap passes execute it. That ratio is the entire economic argument in one sentence.

The reason this works as a protocol and not just a one-off script is that structured markdown is legible to both tiers at once. A frontier model can write it because it is still natural language with light structure, no special training needed. A cheap model or a regex can execute it because the structure is predictable enough to parse without judgment. That is the whole trick: pick an intermediate format expensive enough to be worth a smart model's time to author, and plain enough that authoring it is the only expensive part.

The same instinct shows up again, in a different shape, inside how we get AI-generated interfaces to render correctly. The model is good at deciding what should exist on a screen — a form here, a chart there, a status card above the fold — and bad at deciding the exact pixel coordinates that make a twelve-column grid not overlap itself. So the model emits a semantic layout: flat blocks and sections, typed intent, nothing about coordinates. A deterministic placement engine, not a second model, resolves that intent into an actual grid. The expensive model decides what. Cheap, deterministic code decides where. Neither side does the other's job, and the boring half runs at effectively zero marginal cost per render.

Testing gets the cheap-workhorse treatment on its own terms, because most of what QA does is not judgment either, it is coverage. Point a crawler at a running app and it generates one test scenario per discovered link, button, form, and input, then batches those scenarios across as many as twelve concurrent sandboxes running in deterministic waves. None of that needs a frontier model watching every click. It needs cheap AI agents that can drive a headless browser, follow a script, and report what happened. We ran this against our own product, generating scenario coverage nobody would have had the patience to write by hand, at a cost nobody would tolerate if a frontier model priced every click.

Getting a sandbox ready for a cheap agent to hammer on is its own small discipline. A working tree gets uploaded with safe excludes — no git internals, no environment files, no node_modules, no private keys — secrets get resolved from references instead of being copied in the clear, and the launch waits for the app to actually report itself ready before the first scenario runs. None of that is glamorous. All of it is what makes it safe to point twelve cheap agents at a real application instead of a mocked one, which is the only version of this worth running.

Compression is the same economics applied to a different resource: not which model runs, but how much of its context window a routine action is allowed to cost. A terminal layer we run in front of coding agents keeps full output expandable on demand but returns compact, structured answers by default for the loops that fire constantly — test runs, builds, searches, git status. The full firehose is still there if you ask for it. By default you get a bounded, structured answer, because the expensive resource in an agent system was never CPU. It is the model's attention, and every token spent reading verbose output it did not need is a token it cannot spend on the part of the task that actually required judgment.

Put all four of these next to each other and one rule falls out: figure out which part of the work is judgment and which part is grind, and price them differently on purpose. Judgment is rare, expensive, and worth a frontier model's attention. Grind is constant, cheap, and belongs on a workhorse or on deterministic code that does not think at all. Most of the cost overruns we have seen in agent systems come from getting this backwards — a frontier model burning its budget on grind work a workhorse could have handled, or a cheap model being trusted with a judgment call nobody actually checked.

The verifier also has to be a genuinely different, stronger model than the one it is checking, not the same model asked to grade its own homework twice. A model reviewing its own output tends to agree with itself for the same reasons it produced that output in the first place. Put a frontier model with different training, different failure modes, and no stake in having written the answer on the other side of the gate, and the review actually finds things. This is the same reason a budget governor tracking spend across the swarm has to be a separate accounting pass from the models doing the spending — the thing measuring the work cannot be the thing doing the work.

The verifier's approval also has to survive contact with an agent that is, by default, going to tell you its own work is done. Every coding agent will say the task succeeded if you ask it. That is not malice, it is the nature of a self-report, and a self-report is not evidence. The approved tool call exists precisely because it forces the frontier model to commit to a structured, checkable judgment instead of a paragraph of reassurance. If the verifier cannot point to what it checked, it does not get to approve anything, and neither does the workhorse that produced the work in the first place.

There is a resumability requirement underneath all of this that is easy to skip and expensive to skip badly. A swarm run persists its shared state — call it a blackboard — so a task graph that gets interrupted halfway through does not have to restart from zero and re-spend every token it already spent. Losing that state on a crash means paying the workhorse tier all over again for work it already did, which defeats half the point of making the workhorse cheap. Cheap only stays cheap if you do not have to pay for it twice.

This matters more than it sounds like it should, because a swarm run is not a single request-response cycle, it is hundreds of workhorse calls and a handful of verifier passes stitched together over minutes or hours. A crash at hour two of a long-running task graph, with no persisted state, throws away every workhorse call that already happened and every verifier approval already granted. Recomputing all of that from scratch is not resilience, it is a second bill for the same work.

None of this requires exotic infrastructure. It requires deciding, honestly, which parts of your pipeline are actually hard. Most of them are not. Run those on the workhorse, gate the results behind a verifier that has to prove its approval, and reserve the frontier model for the handful of calls per task that are genuinely worth its price. Everything else is grind, and grind should be cheap. That does not mean firing the frontier model. It means being honest about how rarely you actually need it to speak.

Look at your own pipeline this week and count how many of your model calls are judgment calls. If the honest number is under ten percent, you are paying frontier prices for grind, and the fix is not a bigger model. It is a cheaper one, watched by a smaller, sharper one.

← Back to the articles

Newsletter

What we shipped, what broke,
and what we learned