Systems

CLI, MCP, SDK: Four Doors Into the Same Room

CLI, MCP, SDK: Four Doors Into the Same Room

A tool that only speaks MCP can't run from a cron job. A tool that only speaks CLI can't be called from another agent's process. Build all four doors, or don't bother.

A tool that only speaks MCP cannot run from a cron job. A tool that only speaks CLI cannot be called from inside another agent's process without shelling out and praying. A tool that only ships a web dashboard cannot be called by anything, ever, except a human with a mouse. Pick one surface and you have built a room most of your users cannot enter.

So every package I publish ships four surfaces: a CLI, an MCP server, an SDK, and usually a daemon. Same core engine underneath all four. The dispatch package's own architecture doc says it plainly — the whole thing is "layered so the same core engine powers four surfaces" — and that sentence is not marketing, it is the actual module boundary in the code. The CLI parses flags and calls the engine. The MCP server parses tool calls and calls the same engine. The SDK is the engine, exported. The daemon holds the engine's state alive between calls. None of the four surfaces contain business logic of their own. If they did, the four doors would quietly drift into four different rooms.

The packages themselves are named like the rooms they guard: plain plural nouns, one noun per domain — todos, secrets, machines, mementos, conversations. Not a platform, not a suite, one thing that owns exactly one concern. That discipline is what makes four doors tractable in the first place. A package that tries to be three domains at once cannot keep its CLI and its MCP server in sync, because there is no longer one engine to keep them in sync with — there are three, wearing the same trench coat.

That drift is the actual failure mode, and it is why every package gets a parity test: a test that asserts the CLI's verb set and the MCP server's tool set are identical. Not similar. Identical. Without that test, here is exactly what happens: someone ships `todos add` and `todos done` to the CLI on a Tuesday, forgets the MCP server, and three weeks later an agent calling the MCP tools can create a task but never close one. Nobody notices, because the human using the CLI never hit the gap. The agent hit it every single day and had no way to tell you, because the whole point of giving it a tool is that it stops asking a human for help. A parity test turns that silent gap into a failing CI job the moment someone forgets a verb.

The fourth door, the SDK, does something different: it forces the maker to eat their own cooking. The rule for every hosted product I run is that it must consume its own open-source core through the public, versioned package exports — the same import path a stranger installing it from npm would use. No reaching into internals, no private shortcut. If the public SDK is not good enough for my own product to build on, it is not good enough to publish. That single rule catches more API design mistakes than any code review, because the person who has to live with a bad SDK surface is the same person who designed it.

A self-describing manual sits on top of all four. Every package answers `manual --json` with the full contract — every verb, every flag, every output shape — so an agent can learn the tool by calling one command instead of a human pasting documentation into a prompt. Shell completions ship alongside it for the human half of the audience. And the discovery convention is explicit down to the file name: legacy agents that only know to look for `AGENTS.md` are told, in that same file, to also go read the package's own doc — because the assumption that everyone converges on one discovery file is already wrong, and pretending otherwise just means some agents silently miss half the contract.

None of that means every tool needs an MCP server. I keep a short list of MCP servers actually worth running as a server — a browser automation tool, a component library lookup, a couple of research tools — and everything else is a CLI, because most agents already live in a shell and calling a well-documented binary is cheaper than another running process with its own auth and its own token overhead. Agents will probably live in the CLI and humans will live in the GUI, long term, and building an MCP wrapper around every CLI on principle is exactly the kind of complexity nobody asked for. The four-door pattern is not "always build four doors." It is "decide on purpose which doors this tool needs, and never let the ones you build drift apart."

When you do build the MCP door, the token bill is real and it starts before your agent does anything. Playwright's MCP server burns 13,700 tokens on startup. Chrome DevTools MCP burns 17,000. That is context spent before a single useful action, on every session, for every agent that loads it. It is why we built our own browser instead of living with that tax — 154 tools across 4 engines, a native Bun.WebView instead of a bundled Chromium, 50x less memory, 2x faster startup — and it is why every MCP server we ship returns compact, bounded output by default: rows capped, previews instead of full bodies, a hint for the next call instead of everything up front. An MCP tool call is not a free action. It is a metered one, and the meter starts running the moment the server initializes.

The payoff for getting all four doors right shows up once agents start calling each other instead of just calling you. I built a todos MCP server and a conversations MCP server, pointed two agents at both, and put them in a loop together on one project. It costs real money to run — every message and every task write is a paid call — but watching them delegate to each other through the same two tools I use by hand, with no human relaying context between them, is the actual argument for the pattern. Neither agent needed a bespoke integration. They needed the same CLI verbs I have, exposed as MCP tools with identical names and identical shapes, so what one agent wrote the other could read without translation.

The microservices side of the ecosystem makes the same case at a different scale. Twenty-one independent packages — auth, billing, teams, memory, knowledge, guardrails, notify, files, audit, jobs, and more — each with its own Postgres schema, its own HTTP API, its own MCP server, and its own CLI binary, runnable embedded inside an app or standalone as its own service. Install only the ones you need. Every one of the twenty-one ships an MCP server on day one, not as a follow-up ticket, because the assumption baked into the scaffolding template is that an agent, not a developer clicking through an admin panel, is going to be the thing reading an audit log or flipping a feature flag at four in the morning. Bake the four-surface pattern into the scaffold once, and every new package inherits it for free instead of relitigating the decision each time.

The daemon door earns its keep whenever the work is scheduled rather than requested. Our loops package is a CLI and daemon for recurring work — cron-style schedules that survive a process restart and log every run — with guarded adapters for headless coding agents so the exact same safety rules apply whether a human registered the schedule by hand or another agent wrote it into the daemon at 2am. MCP mutations against that daemon are off by default, same posture as everywhere else, because a scheduler an agent can silently reprogram is a scheduler that needs to fail closed before it needs to be convenient.

The gateway package shows the SDK door mattering even when there is only one provider-facing surface a caller ever sees. One gateway key for the client, many provider keys hidden behind it — DeepSeek, Qwen, GLM, whatever we route to that week — so a product's SDK import never has to change when the provider roster does. The CLI and the SDK share that same routing table; changing a provider in one config file changes it for a human running a one-off command and a product serving live traffic, at the same time, because both callers hit the identical engine underneath.

The identities package makes the case for why parity has to include what a surface refuses to return, not just what it returns. `identities status --json` answers the same shape whether the caller is a script, an MCP tool, or a human running the CLI by hand: counts and opaque references, never a name, an email, or a key. Getting that redaction right once, at the engine layer, means none of the three doors can accidentally leak more than the others — there is no door with a looser lock, because there is only one lock and three doors share it.

The QA tooling shows the CLI and the SDK doors working together instead of the MCP door. Point it at a URL and it crawls the app, generating one test scenario per discovered link, button, form, and input, then fans those scenarios out across concurrent sandboxes in batches, with a dry-run preflight that checks credentials and environment before spending a token running a real browser. A human types one CLI command. The SDK is what a CI pipeline calls to run the same crawl on every merge, headless, with no human in the loop at all. Two different callers, same engine, same contract — which is the entire test of whether the four-door pattern actually held.

Getting this right is less about any individual door and more about refusing to let convenience carve a fifth, undocumented one. The fastest way to break the pattern is a debugging session where you patch the MCP server directly because the CLI rebuild is slow, or you hardcode a query in the SDK because writing a proper CLI flag for it felt like overhead at 11pm. Both patches work in the moment. Both are how a parity test starts failing three weeks later and nobody remembers why.

So the checklist before anything ships is short and it never gets shorter: does the CLI have it, does the MCP server have it under the same name, does the SDK export it the way a stranger would import it, and does the manual describe all three the same way. Four doors, one room, one contract. Anything less and you built a tool for whichever caller happened to be in the room when you wrote it — and that caller is not going to be the one running your tool at three in the morning.

← Back to the articles

Newsletter

What we shipped, what broke,
and what we learned