Systems

Model Routing in Practice

Model Routing in Practice

A gateway underneath everything, subscription profiles run as fleet capacity, a usage heartbeat that auto-switches on rate limits, and Claude for speed, Codex for orchestration, Haiku for QA.

For six months I asked two vendors for the same feature and got nothing. So I built it myself. That is not a complaint, it is how model routing actually gets decided in practice: not by reading a comparison chart, but by hitting a wall with one vendor and routing around it with another.

The wall, in my case, was scheduled execution. I wanted loops — recurring, unattended agent runs — in the CLI, and I said so, repeatedly, in public, to the people who run Codex and to the people who run Claude Code. Neither shipped it on my timeline. So the gap became a product: a local daemon for persistent loops and workflows that survives restarts and records every run, wired to guarded adapters for whichever coding agent CLI is doing the actual work that day. Ask a vendor for six months, then build the missing piece yourself. That is the honest origin story of half of what we run.

That gap did not appear out of nowhere. Both vendors run their own feedback loops in public, fielding feature requests directly, and I used nearly every one of those openings to ask for something specific: queued messages so a prompt does not get lost mid-turn, background hooks that do not depend on an active session, visibility into what a sub-agent is actually doing while it works. Some of it shipped eventually. Loops never did, on either side, for long enough that building it stopped being optional. Vendor responsiveness is itself a routing signal — if a gap stays open for months after you have flagged it clearly, that is information about which vendor to lean on for what, not just a complaint to keep repeating.

Multi-provider is not an ideology, it is a hedge against exactly that kind of gap. My daily-driver CLI is a modified derivative of OpenAI's Codex that grew a gateway layer above the providers, multi-profile subscription auth, and a usage-heartbeat daemon that auto-switches profiles the moment a rate limit hits. None of that exists because one vendor is bad. It exists because every vendor eventually rate-limits you, ships a feature slower than you need it, or has an outage on the one day you cannot afford one, and a single-provider setup has no answer for that day.

The auth-profile mechanic underneath all of this is deliberately boring: a named profile per login, switchable from the command line, so a work identity and a personal identity and a rate-limited-for-the-day identity can all sit side by side without fighting over the same token. Boring is the right call here. The interesting part of routing should be which model handles which job, not whether the CLI remembers which account you meant.

Routing by model is a decision I make out loud, not a default I inherited. Claude for speed, because it is fast enough to be the executor doing the actual coding. Codex for orchestration, because it is the one I trust to plan and debug across a larger surface. Haiku for QA, because verification does not need a frontier model, it needs a cheap one that is disciplined about checking a narrow thing. The point is not that one model is smarter. It is that the future is not one agent doing everything, it is several specialized agents, each doing the one job it is actually good at and cheap at.

The Haiku layer is worth describing exactly, because it is the smallest, cheapest piece of this and it does real work. A hook feeds it three things after every turn: the last user prompt, the last agent message, and the files that changed. Haiku reads that and does one of two things — adds tasks the agent missed, or blocks the agent from declaring victory early. It is an automatic QA pass running on the cheapest model in the stack, checking the output of whichever model just did the expensive work. Same actor/verifier shape as everything else here, just running at Haiku prices because the check itself is narrow enough not to need more.

Vendor philosophy is also a routing input, not just vendor identity. OpenAI's approach makes you define everything yourself — tools, execution logic, state management — more control, more code to own. Claude's approach figures more of it out for you — built-in tools that work immediately, less control, less code. Neither is better in the abstract. Pick based on your tolerance for magic versus control, on that specific task, that day. A compliance-heavy, structured workflow wants the vendor that makes you spell everything out. A creative, exploratory task wants the one that gets out of your way.

Tool routing follows the same discipline as model routing: keep the list short and earn every entry on it. Asked in public which MCP servers actually matter, my answer was five: a web-search server, a browser-automation server, a dev-tools inspector, a UI-component server, and an optional docs server. Everything else should just be a CLI. That is not minimalism for its own sake. Every MCP server wired into an agent burns context describing itself before the agent does a single unit of real work, so the bar for adding one more server to the routing table has to be higher than "it might be useful once."

Treating consumer subscriptions as a compute fleet is the part of this that looks strange from outside and is completely ordinary once you are in it. People running serious multi-agent setups are not paying per-token API rates for everything, they are running a dozen, two dozen, sometimes over fifty parallel paid subscription accounts and routing work across whichever one has headroom, because subscription rate limits reset per account and metered API pricing does not scale the same way at that volume. One builder in my orbit runs roughly fifty Max-tier accounts in parallel and reported burning through 50 billion tokens in a single week, most of it on the most expensive model available, at a monthly spend in the tens of thousands of dollars. That is not an outlier flexing. It is what the current rate-limit regime forces anyone running agents at real scale to do. That number sounds absurd until you have personally hit a rate limit mid-task and watched a whole agent fleet sit idle waiting for a reset window. At that point fifty accounts stops sounding excessive and starts sounding like the only rational response to a limit you do not control.

A usage heartbeat is the only thing that makes a fleet like that operable instead of a full-time babysitting job. It watches consumption per profile, and the instant one profile gets close to its rate limit, it switches the next request to a profile that has room, automatically, without a human noticing the switch happened. Running sixteen or seventeen subscription profiles by hand, watching each one's usage in a separate tab, is a full-time job that produces nothing. Running the same sixteen or seventeen behind a heartbeat and an auto-switcher is a fleet.

None of this multiplies the actual complexity a human has to hold, because the gateway sits underneath as one stable interface. One gateway key for clients, many provider keys behind the gateway, routing OpenAI-compatible traffic across whichever providers are configured — DeepSeek, Qwen, GLM, whoever is cheapest or fastest that quarter — without every calling application needing to know which provider actually served the request. The routing complexity lives in one place, once, instead of being copy-pasted into every product that needs a model.

Explicit policy has to sit next to that routing layer, not as an afterthought bolted on later. Requests should never get silently routed to a region or a provider class the caller did not approve, and the defaults should stay local-first: no hosted calls at all unless someone deliberately configured them. A gateway that quietly reroutes your traffic somewhere you did not choose is not a convenience, it is a liability wearing a convenience's name.

Even fact-finding gets routed rather than fixed to one tool. Whichever AI is already present in the conversation — a research-focused model, a coding model, whatever has the context loaded — is the one that answers the side question, instead of switching apps to ask a different one the same thing. The routing decision here is not about which model is smartest in the abstract, it is about which one is already holding the context that makes the answer worth trusting.

Circuit breakers are the part of routing that only matters on the bad day, which is exactly why they are worth building before the bad day arrives. Our own product routes between bring-your-own-key providers and managed models with a circuit breaker layered on top of candidate selection, so a provider having a bad afternoon gets tripped out of rotation automatically instead of taking every request down with it. Routing without a circuit breaker is not routing, it is a single point of failure with extra steps.

Bring-your-own-key adds a routing dimension that pure multi-provider setups do not have to think about: the same logical model call might need to resolve to a user's own API key on one request and a managed, pooled key on the next, depending on what that specific account configured. The candidate-selection logic has to build the full list of viable routes first — which keys are available, which providers are healthy, which are cheaper — before the circuit breaker ever gets a vote on which one actually gets used. Skip that step and you are not routing, you are guessing which key happens to work today.

Put the pieces together and model routing in practice looks nothing like a static config file with model names in it. It is a live system: a gateway underneath everything, subscription profiles treated as fleet capacity, a heartbeat watching consumption and switching automatically, a circuit breaker tripping bad providers out of rotation, and a standing habit of assigning models to jobs by what they are actually good at instead of by brand loyalty.

Every routing decision above assumes you are actually measuring the thing you are routing on, not guessing. Swapping the bash tool for a thinner, JS-based execution path on one project measured out to roughly half as many tool calls and up to seventy percent fewer tokens on some batch actions, and that number is what justified the swap, not a hunch that it felt faster. Route on evals, not vibes. A model or a tool earns its place in the routing table by a measured number, the same way a candidate earns a spot on a team by a work sample, not a resume.

The vendors will keep shipping features on their own schedule, not yours, and every rate limit they set is a decision made for their infrastructure, not for your task queue. Route around the gap instead of waiting for it to close, and build the piece that is actually missing when waiting stops making sense. That is not a hack. At this point, it is the job.

← Back to the articles

Newsletter

What we shipped, what broke,
and what we learned