Systems

Cost per account, per agent, per machine, cache-write pricing tiers tracked separately, and every estimate reconciled against the real invoice. Anything less is a bill you have not opened yet.
Most teams running coding agents at scale can tell you the total on last month's invoice. Ask them what one account, one agent, or one machine cost yesterday, and they go quiet. That gap is not a reporting nuisance. It is where the money disappears.
We found this out the expensive way. A platform we run had a bug where every tool-using turn took 44 seconds before the first token arrived. We root-caused it to prompt-cache thrash: roughly 74,000 cache-write tokens per turn and zero cache reads, on every single tool call. Nobody caught it from the invoice. The invoice was one number, a month behind. We caught it from latency, weeks after the meter had already been running.
That is the whole argument in one incident. If cost is a lagging monthly summary, you find out about a bug like that from users complaining, not from a graph. Token economics has to be instrumented, tracked the way you track error rates and latency, or it functions as accounting fiction that happens to be denominated in dollars.
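To make "instrumented" concrete, here is a minimal sketch of a check that would flag the pattern from that incident: turns that write to the prompt cache and almost never read from it. The function names, the ratio definition (reads divided by reads plus writes), and the thresholds are all assumptions for illustration, not the tracker's actual rules.

```python
def cache_read_ratio(cache_read_tokens: int, cache_write_tokens: int):
    """Fraction of cache traffic that was a read; None if the cache was unused."""
    total = cache_read_tokens + cache_write_tokens
    if total == 0:
        return None
    return cache_read_tokens / total


def looks_like_cache_thrash(turns, min_turns=5, max_ratio=0.05):
    """True when the most recent turns all write to the cache but barely read it.

    `turns` is a list of (cache_read_tokens, cache_write_tokens) pairs,
    oldest first. Both thresholds are hypothetical defaults.
    """
    recent = turns[-min_turns:]
    if len(recent) < min_turns:
        return False  # not enough evidence yet
    for reads, writes in recent:
        ratio = cache_read_ratio(reads, writes)
        if ratio is None or ratio > max_ratio:
            return False  # cache unused, or reads look healthy
    return True
```

Five consecutive turns of 74,000 cache-write tokens and zero reads would trip this check on the fifth turn, long before an invoice exists.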
So we built a cost tracker that treats this as a first-class systems problem instead of a finance afterthought. It watches every coding agent we run — Claude Code, Codewith, Gemini, OpenCode, Cursor, and whatever else shows up in the fleet next quarter — and it does not report one number per agent. It reports five kinds of the same dollar, because they are not the same dollar: API-equivalent cost, metered API cost, subscription-included cost, estimated cost where a provider does not itemize, and unknown cost where we genuinely cannot tell yet. Collapsing those five into one line is how teams lie to themselves without meaning to.
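One way to keep the five kinds from collapsing is to make the kind part of the key, so no code path can sum across them by accident. This is a sketch under assumed names (`CostKind`, `CostRecord`, `totals_by_kind` are hypothetical), not the tracker's real schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from decimal import Decimal
from enum import Enum


class CostKind(Enum):
    API_EQUIVALENT = "api_equivalent"
    METERED_API = "metered_api"
    SUBSCRIPTION_INCLUDED = "subscription_included"
    ESTIMATED = "estimated"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class CostRecord:
    agent: str
    kind: CostKind
    usd: Decimal


def totals_by_kind(records):
    """Sum dollars per (agent, kind); kinds are never blended into one number."""
    totals = defaultdict(Decimal)  # Decimal() starts each key at 0
    for record in records:
        totals[(record.agent, record.kind)] += record.usd
    return dict(totals)
```

A report built on this returns up to five numbers per agent, and a blended total has to be computed deliberately rather than by default.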
The subscription-included category exists because of a decision that complicated our own accounting on purpose. We run well over a dozen subscription auth profiles across our machines at once, sometimes sixteen or seventeen simultaneously, treating ordinary consumer subscriptions as a schedulable compute fleet instead of paying metered API rates for everything. A usage heartbeat watches each profile and auto-switches to the next one when a rate limit hits. It is a legitimately cheaper way to run a fleet of agents. It also means the same unit of work can cost zero marginal dollars on one profile and real metered dollars on the next, depending purely on which subscription happened to be live when the request went out. If your cost model cannot represent that, it will average two different economics into one meaningless number.
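The switching step can be sketched as a round-robin scan for the next profile that is not rate limited. How a rate limit is detected is out of scope here; the `Profile` type and `next_available` function are hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass
class Profile:
    name: str
    rate_limited: bool = False


def next_available(profiles, current_index):
    """Index of the next usable profile, scanning round-robin from the current one.

    Returns None when every profile, including the current one, is rate limited.
    """
    count = len(profiles)
    for step in range(1, count + 1):
        index = (current_index + step) % count
        if not profiles[index].rate_limited:
            return index
    return None
```

Whatever index this returns is also the point where the cost model has to record which profile served the request, since that decides whether the work was subscription-included or metered.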
Reconciliation gets harder the more providers you run, and we run several on purpose rather than betting everything on one vendor. Anthropic, OpenAI, and whichever OpenAI-compatible provider is cheapest that quarter — DeepSeek, Qwen, GLM, take your pick — each itemize usage differently, bill cache behavior differently, and expose billing data through a different shape of API or a different export entirely. A cost tracker that only understands one provider's invoice format is a cost tracker for one provider. Ours has to normalize several different billing dialects into one internal schema before it can even ask whether the estimate was close.
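As a sketch of what normalization involves, the two adapters below map per-response usage payloads onto one internal shape. The field names follow the providers' public API references as commonly documented (Anthropic reports cached tokens separately from `input_tokens`, while OpenAI's `prompt_tokens` includes cached tokens); treat them as assumptions and check them against the current docs. `NormalizedUsage` is a hypothetical schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NormalizedUsage:
    provider: str
    uncached_input_tokens: int
    cache_read_tokens: int
    cache_write_tokens: int
    output_tokens: int


def from_anthropic(usage: dict) -> NormalizedUsage:
    # Assumption: input_tokens already excludes cached reads and writes.
    return NormalizedUsage(
        provider="anthropic",
        uncached_input_tokens=usage.get("input_tokens") or 0,
        cache_read_tokens=usage.get("cache_read_input_tokens") or 0,
        cache_write_tokens=usage.get("cache_creation_input_tokens") or 0,
        output_tokens=usage.get("output_tokens") or 0,
    )


def from_openai(usage: dict) -> NormalizedUsage:
    # Assumption: prompt_tokens includes cached tokens, so subtract them.
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens") or 0
    prompt = usage.get("prompt_tokens") or 0
    return NormalizedUsage(
        provider="openai",
        uncached_input_tokens=prompt - cached,
        cache_read_tokens=cached,
        cache_write_tokens=0,  # not itemized in this payload shape
        output_tokens=usage.get("completion_tokens") or 0,
    )
```

The subtraction in `from_openai` is the kind of dialect difference that silently double-counts input tokens if the schema assumes every provider reports the way the first one did.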
Cache-write pricing adds another axis that most cost trackers flatten by accident. A cache write billed at the 5-minute tier and a cache write billed at the 1-hour tier are not the same expense with the same label. They are separate line items with separate rates, and averaging them hides exactly the kind of anomaly that turned into the 44-second bug in the first place. Track the tiers separately or do not bother tracking cache costs at all — an average of two different prices is not a metric, it is noise wearing a decimal point.
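A small worked example of keeping the tiers apart. The rates below are placeholders chosen for round arithmetic, not any provider's published prices, and the tier keys are hypothetical names.

```python
from decimal import Decimal

# Placeholder rates in USD per million tokens. NOT real prices.
RATES_PER_MTOK = {
    "cache_write_5m": Decimal("3.75"),
    "cache_write_1h": Decimal("6.00"),
}


def cache_write_cost(tokens_by_tier):
    """Cost per tier, returned as separate line items rather than one sum."""
    return {
        tier: Decimal(tokens) * RATES_PER_MTOK[tier] / Decimal(1_000_000)
        for tier, tokens in tokens_by_tier.items()
    }
```

With these placeholder rates, 74,000 write tokens cost 0.2775 dollars at the 5-minute tier and 0.444 dollars at the 1-hour tier. A blended figure could not tell you which tier a sudden jump in writes landed on.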
None of these estimates get to grade their own homework. On a schedule, the whole system reconciles its running estimates against the actual invoices from Anthropic, OpenAI, and whichever other provider is active that month. If the reconciliation drifts past a threshold, that is treated as a bug in the estimator, not a rounding error to shrug off. A pricing model nobody checks against the real bill is a guess wearing a spreadsheet.
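The reconciliation check itself is simple arithmetic. A minimal sketch, assuming a 3 percent threshold (the actual threshold is not stated here) and hypothetical function names:

```python
from decimal import Decimal


def reconciliation_drift(estimated_usd: Decimal, invoiced_usd: Decimal) -> Decimal:
    """Signed relative drift: negative means the estimator under-counted."""
    if invoiced_usd == 0:
        raise ValueError("cannot compute relative drift against a zero invoice")
    return (estimated_usd - invoiced_usd) / invoiced_usd


def estimator_is_suspect(estimated_usd, invoiced_usd, threshold=Decimal("0.03")):
    """True when the estimate misses the invoice by more than the threshold."""
    return abs(reconciliation_drift(estimated_usd, invoiced_usd)) > threshold
```

Keeping the sign matters: a consistently negative drift points at usage the estimator never saw, while a positive one points at a stale or wrong price table.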
Attribution has to go three levels deep to be useful: cost per account, cost per agent, cost per machine. Machines in our fleet have names (spark01, spark02, and the rest), and each one accumulates its own number. One account spending ten times what every comparable account spends is not a power user worth celebrating. It is a signal that something is misbehaving in that account's path, the same way one server using ten times the CPU of its peers is not a busy server; it is a leak. We would have caught the cache-thrash bug two weeks sooner if that per-account view had existed before the bug did.
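The ten-times rule can be sketched as a comparison against the median of the other accounts. The function name and the choice of median as the baseline are assumptions; the same check works unchanged when the keys are agents or machine names.

```python
from statistics import median


def outlier_accounts(spend_by_account, factor=10.0):
    """Accounts whose spend is at least `factor` times the median of their peers."""
    flagged = []
    for account, spend in spend_by_account.items():
        peers = [v for k, v in spend_by_account.items() if k != account]
        if not peers:
            continue  # nothing to compare against
        baseline = median(peers)
        if baseline > 0 and spend >= factor * baseline:
            flagged.append(account)
    return flagged
```

Excluding the account under test from its own baseline keeps one large spender from dragging the median up and hiding itself.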
None of this is useful locked inside a human-facing dashboard either. The same cost data ships through a CLI, an MCP server, a REST API, and a dashboard, because the agent burning the tokens should be able to check its own budget mid-task exactly the way a human checks a bank balance before making a purchase. An agent that can query how much it has spent this session before deciding whether to keep going is an agent that can govern itself. An agent that only finds out at the end, from a human reading a graph, has already spent the money.
The same discipline shows up one layer down, inside the agent-orchestration code itself, not just in the dashboard watching it from outside. Our swarm framework — the piece that turns a goal into a task graph and fans it out across many agents — carries a budget governor as a first-class component, not a bolt-on. It converts token usage to dollars through a real pricing table, aggregates spend across every agent in the swarm in real time, and halts the run outright once it crosses a ceiling you set in advance. When no provider keys are configured, it falls back to a deterministic offline model so the entire framework still runs and tests cleanly without live credentials. Nothing about the cost path is hardcoded. A governor that only exists in a dashboard, discovered after the run finished, is not a governor. It is a receipt.
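A minimal sketch of a governor in the execution path, assuming a price table keyed by model and token kind. The class, the exception, and the table shape are hypothetical, and the prices used with it would come from configuration rather than code.

```python
from decimal import Decimal


class BudgetExceeded(RuntimeError):
    """Raised when aggregate spend crosses the configured ceiling."""


class BudgetGovernor:
    def __init__(self, ceiling_usd: Decimal, price_per_mtok: dict):
        self.ceiling_usd = ceiling_usd
        self.price_per_mtok = price_per_mtok  # {(model, token_kind): USD per 1M tokens}
        self.spent_usd = Decimal("0")
        self.spent_by_agent = {}

    def record(self, agent: str, model: str, token_kind: str, tokens: int) -> Decimal:
        """Convert usage to dollars, add it to the swarm total, halt past the ceiling."""
        rate = self.price_per_mtok[(model, token_kind)]
        cost = Decimal(tokens) * rate / Decimal(1_000_000)
        # The tokens are already spent, so the cost is recorded before any halt.
        self.spent_usd += cost
        self.spent_by_agent[agent] = self.spent_by_agent.get(agent, Decimal("0")) + cost
        if self.spent_usd > self.ceiling_usd:
            raise BudgetExceeded(f"spent {self.spent_usd} of {self.ceiling_usd} USD")
        return cost

    def remaining(self) -> Decimal:
        """What an agent can query mid-task before deciding to continue."""
        return self.ceiling_usd - self.spent_usd
```

Because `record` raises inside the call that reports usage, the orchestrator cannot schedule another step without first handling the exception.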
Agents make this more urgent than it would be for a normal service, because an agent that gets stuck does not necessarily fail loudly. It retries a failing tool call, or re-reads the same file five times trying to understand an error, or spins up a sub-agent to fix a problem the sub-agent then also gets stuck on. A human doing the same task would get bored or give up. An agent will happily burn tokens in a loop until something external stops it. A ceiling that halts a run automatically is not a nice-to-have in that world, it is the only thing standing between a bug and a bill nobody notices until the end of the month.
The ledger side of this is just as unglamorous and just as load-bearing. Our own product's billing runs on credits, one credit equal to one cent, with a ledger that consumes from buckets in a fixed priority order, spending limits an org can set for itself, and promotional grants that expire on schedule. Every adjustment — grant, revoke, refund, expiry — runs through the same money-critical path: an advisory lock, an idempotency key, a transaction that either fully lands or fully rolls back. We do not let two concurrent requests double-spend the same bucket because the underlying job happened to retry. That is boring, deliberate, bank-ledger discipline applied to what looks, from the outside, like a chatbot subscription.
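The consume path can be sketched in memory to show the two properties that matter: fixed priority order and idempotent retries. This stands in for the real thing only loosely; a production version would hold the advisory lock and run inside a database transaction. The names and the rule that lower priority numbers drain first are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Bucket:
    name: str
    priority: int  # assumption: lower number is consumed first
    credits: int   # one credit equals one cent


@dataclass
class Ledger:
    buckets: list
    applied: dict = field(default_factory=dict)  # idempotency key -> prior result

    def consume(self, idempotency_key: str, amount: int) -> dict:
        """Deduct `amount` credits in priority order; a repeated key is a no-op."""
        if idempotency_key in self.applied:
            return self.applied[idempotency_key]
        if amount > sum(b.credits for b in self.buckets):
            # All-or-nothing: reject before touching any bucket.
            raise ValueError("insufficient credits; nothing was deducted")
        remaining = amount
        taken = {}
        for bucket in sorted(self.buckets, key=lambda b: b.priority):
            take = min(bucket.credits, remaining)
            if take:
                bucket.credits -= take
                taken[bucket.name] = take
                remaining -= take
        self.applied[idempotency_key] = taken
        return taken
```

A job that retries with the same key gets the original result back and the buckets do not move a second time.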
Even the wording of the ledger got a systems pass, not just a copy pass. Early on, a promotional grant was described to users as a flat "$10 free credit." It got rewritten to credits denominated in the same unit as everything else in the ledger, because a dollar figure that lives outside the accounting model your system actually enforces is a promise your ledger cannot keep. If the unit users see is not the unit your code moves, you will eventually owe someone an explanation you do not have.
Budget alerts follow the same standard as everything else we consider load-bearing: they get retried. A webhook that fires once and disappears into a 500 error is not an alert, it is a coin flip. If a budget ceiling matters enough to configure, it matters enough to guarantee the notification actually arrives, which means treating the delivery the same way we treat prompt delivery to an agent pane — verify it landed, do not assume it did.
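A sketch of retried delivery with exponential backoff. The `send` callable is a hypothetical stand-in for whatever posts the webhook and returns an HTTP status code; the attempt count and delays are illustrative defaults.

```python
import time


def deliver_with_retry(send, payload, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call send(payload) until it returns a 2xx status; True once delivered."""
    for attempt in range(attempts):
        try:
            status = send(payload)
            if 200 <= status < 300:
                return True  # verified: the receiver acknowledged it
        except OSError:
            pass  # network failure counts as a failed attempt
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    return False
```

A `False` return is itself worth alerting on through a second channel, since it means a budget ceiling was crossed and nobody was told.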
None of the numbers in any of this come from asking an agent how much it spent. An agent's own account of its token usage is a self-report, and the same rule that governs code review governs billing: never trust one. Every figure traces back to the usage field the provider actually returned on the response, or to the invoice line the provider actually billed, never to a summary the model wrote about itself. A cost system that takes the model's word for its own spend will drift the moment the model is wrong about anything, and models are wrong about plenty.
Put the pieces together and the shape of the discipline becomes obvious. Five cost categories instead of one blended number, because subscription-included and metered API are different economics wearing the same currency symbol. Cache-write tiers tracked separately, because averaging them erases the exact anomaly worth catching. Scheduled reconciliation against real invoices, because an estimate nobody checks is a guess. Attribution by account, agent, and machine, because aggregates hide the one outlier that is actually telling you something. A budget governor inside the execution path, not next to it, because a ceiling enforced after the money is already spent is not a ceiling. And a ledger built like a bank's, because at scale it is one. None of these six pieces is exotic. Each one is the kind of unglamorous plumbing a payments team would recognize immediately. The only unusual part is applying that plumbing to a product where the thing being metered is a model's attention instead of a customer's shopping cart.
We have spent over 170 billion tokens coding with agents this year. At that volume, the difference between tracking cost as an afterthought and tracking it as an operating discipline is not a rounding error. It is the difference between finding a five-figure bug from a latency complaint, weeks late, and finding it from a graph, same day.
Pull up your own usage data right now and ask it three questions: which account costs the most per unit of output, which agent has the worst cache-read ratio, and does your estimate match last month's actual invoice within a few percent. If you cannot answer all three, you do not have token economics. You have a bill you have not opened yet.