AI

The 44-Second Turn

The 44-Second Turn

How per-turn tool filtering shredded our prompt cache: 74K cache-write tokens every turn, zero reads, 44 seconds to first token.

Our agent platform had a bug no profiler could see. Every turn where the model used tools took 44 seconds before the first token arrived. Not the whole response. The first token.

A 44-second time-to-first-token is not a performance problem, it is a product death sentence. Nobody waits 44 seconds for an agent to start typing. The UI was fine, the streaming was fine, the worker was fine, the model was fine. And still: 44 seconds.

The product behind this bug is a queued-execution agent platform. A chat turn comes in, gets validated and saved, and gets handed to a worker that streams the reply back over server-sent events. Guardrails run twice on every turn, once on the way in and once right before the model call. None of that showed up as slow in isolation. Every layer we owned, measured on its own, was fast. The 44 seconds sat in the gap between the worker picking up the job and the first streamed token, and it only showed up on turns where the model used tools.

This particular platform runs its own fleet of agent sessions on top of the product it serves — a dozen parallel coding sessions during a normal development day, hundreds of live user sessions once shipped. At that volume a 44-second stall on every tool turn is not an edge case a few users hit. It is the default experience for anyone whose agent needs to look something up.

Here is what it turned out to be, because I suspect half the agent products being built right now have some version of this bug.

We ruled things out one at a time before we found it, and the order matters because it is the order most teams will walk too. First the UI: no, the spinner and stream renderer were both instant once bytes arrived. Second the queue: no, the job was picked up in single-digit milliseconds. Third the worker: no, history loads and guardrail checks were each a few milliseconds. Fourth the model: no, the same prompt against the same model outside our stack answered in under two seconds. That left exactly one place to look — the actual request we were assembling and sending.

Anthropic's prompt caching is a prefix cache. The serialized request — tools, then system, then messages — gets cached as a prefix, and the next turn only pays for what changed after that prefix. It is one of the best economic levers in the whole API. Unless you break it.

For scale on why that prefix matters: a single browser-automation MCP server can burn 13,700 tokens of tool description before an agent does anything useful, and a heavier one can burn 17,000. Wire a handful of tool servers into a real agent product and the system-plus-tools prefix alone runs into the tens of thousands of tokens before a user has typed a single word. Prompt caching exists specifically so you do not pay full price for that prefix on every turn of a conversation. Anything that changes those first bytes throws that saving away.

We were being clever. Each turn, we filtered the tools we sent to the model — an activeTools list computed per turn, so the model only saw what was relevant right then. Sounds elegant. Sounds debloated. It is neither.

The instinct was not wrong on its own. Smaller tool lists are supposed to be cheaper and safer: fewer tokens describing tools the model will never call this turn, fewer chances it reaches for the wrong one. That reasoning holds for a single, one-shot request. It falls apart for a cached multi-turn conversation, because it treats every turn as a fresh request instead of the next line in a conversation the API is trying to remember on your behalf.

Changing the tools array changes the first bytes of the serialized request. Which invalidates the tools-system-messages prefix. Which means the cache never hits. Our telemetry showed roughly 74,000 cache-write tokens per turn and zero cache reads. Every single turn, we paid to re-write the entire context into the cache and never read it back. The 44 seconds was the cache write. We had built a machine that shredded its own cache every turn and billed us for the privilege.

Cache writes are not free, and they are not flat-priced either. Anthropic charges a premium over the base input rate for a 5-minute cache write, and a larger premium still for a 1-hour write, because you are asking the API to hold that prefix ready for you. We were paying the 5-minute write premium on the full prefix, on every turn, for conversations that could run for hundreds of turns. Multiply 74,000 tokens by the write-tier rate, by every tool turn, by every active session, and the bug stops being a latency story and turns into a straight line on a billing dashboard.

The fix is boring, which is how you know it is right. Freeze the serialized tools array. Send the same stable tool set every turn and let the model ignore what it does not need. Deciding which tool to call is exactly what models are good at. The per-turn filtering was us doing the model's job, badly, at enormous cost.

There is a second-order fix underneath the obvious one. If certain tools genuinely should not be usable in certain states, gate them with the tool's own description and with a deterministic server-side rejection of calls that should not happen — not by rewriting the array the model sees every turn. Put the judgment call inside the model's job, or behind a check that runs after the call, never inside a prefix you regenerate by hand.

One more thing about how we fixed it, because the process matters as much as the bug. The fix plan was not verified against documentation or blog posts. It was verified against the installed SDK source in node_modules — the actual code serializing the requests, file and line. Docs describe intentions. The installed source describes reality. When latency and money are on the line, read the reality.

The agent that had originally written the per-turn filtering code was confident it was correct: cleaner prompts, tighter focus, exactly what the ticket asked for. It was wrong, and it did not know it was wrong. That is not a knock on the model, it is the reason source-level verification is not optional once money and latency are both on the line. Never trust a self-report, including the one your own code gives you when it looks fine on the surface.

Reading the installed source paid for itself twice. It confirmed the exact byte order the SDK serializes — tools, then system, then messages — which is what told us the tools array sat ahead of everything else and therefore poisoned everything downstream of it. And it turned up a second, smaller offender nearby: a debug timestamp we had left in the system prompt for support tickets. It changed on every request too. Small compared to the tools bug, but it was still shredding the cache on turns that made no tool calls at all, and we would not have found it by reading documentation.

The investigation started life as a broader push to heavily improve time-to-first-token while debloating the guardrail checks running on the same path, on the theory that guardrails were the likely villain. They were not. The guardrail passes were a few milliseconds each, both of them, every time. The lesson inside the lesson: profile before you guess which subsystem is guilty, even when it is the obvious suspect. The actual cost was sitting one layer up, in how the request itself was built.

Lessons, in order of how expensive they were.

Agent products live and die by the prompt cache. Your context is huge: tools, system prompt, history, retrieved files. If you are not getting cache reads on turn two, you are paying full price for your own architecture.

Anything that mutates the prefix pays for the whole prefix. Dynamic system prompts, per-turn tool filtering, timestamps in the system message, randomized tool ordering — all of it invalidates everything after it. Stability is a feature. Treat the serialized prefix like a public API: change it deliberately and rarely.

Cleverness per turn is bloat. We thought filtering tools was optimization. It was decoration with a five-figure invoice attached. The elegant version was the dumber one.

Measure the cache, not just the latency. TTFT told us something was wrong. Cache-write versus cache-read tokens told us what. If your dashboard does not show cache reads per turn, you do not have a dashboard, you have a mood board.

Verify against installed source. Not the docs, not the changelog, not what the agent that wrote the code says it does. Never trust a self-report — from agents or from your own assumptions.

We now run a cost tracker across every coding agent and every served model, and it does not report one number, it reports five kinds of the same dollar: API-equivalent cost, metered API cost, subscription-included cost, estimated cost where a provider does not itemize, and unknown cost where we genuinely cannot tell yet. It carries cache-write pricing tiers, 5-minute against 1-hour, as separate line items instead of an average. And on a schedule, it reconciles those estimates against the real invoices from Anthropic, OpenAI, and whichever other provider is in the mix that month. If the reconciliation drifts, something in the model is wrong, not the invoice.

We track every dollar of this now: cost per account, per agent, per machine, cache-write pricing tiers included, reconciled against the actual billing sources. It sounds like accounting because it is. Token economics is an operating discipline in the pre-AGI era, and teams that treat it that way will outrun teams that do not.

Cost per account and per agent is not a vanity cut of the dashboard. It is how you catch the next version of this bug before it costs five figures. One account burning ten times the cache-write tokens of every other account on the same plan is not a power user, it is a prefix getting shredded somewhere in that account's path. We would have caught the tools bug two weeks earlier if that view had existed when it started.

None of this stays a one-time fix either. Every new tool server, every new feature that touches the system prompt, every well-meaning agent that decides to personalize a message for a user, is a fresh chance to reintroduce the same bug in a new shape. So the prefix is now a reviewed surface: changes to the tools array, the system prompt builder, or the ordering of either one require the same scrutiny as a schema migration, because in cost terms that is exactly what they are.

After the fix, tool turns went from 44 seconds to normal. Same model, same product, same context. The only thing that changed is that we stopped fighting our own cache.

Check yours today. It takes one query against your usage data: cache reads per turn. If that number is zero, you already know what your users are feeling.

← Back to the articles

Newsletter

What we shipped, what broke,
and what we learned