Systems

Agents Are the Primary Users

Close to 100 packages, one rule: every tool ships a CLI, an MCP server, an SDK, and --json — because the user is an agent.

I have published close to 100 open-source packages, and every one of them follows the same rule: the primary user is an agent, not a human.

That single assumption changes almost every design decision. Here is what it produces.

Every package ships up to four surfaces: a CLI, an MCP server, an SDK, and, in most cases, a daemon. The same core engine sits under all of them, with a parity test keeping the CLI and MCP verb sets identical. A human types the CLI. An agent calls the MCP server. A product imports the SDK. The daemon keeps state alive across restarts. One domain, one package, four doors into the same room.
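A minimal sketch of what such a parity test could look like, assuming each surface exposes its verbs as a plain mapping onto shared core functions. The verb names and both registries are hypothetical, not the packages' real API.

```python
# Hypothetical core engine: one function per verb, shared by every surface.
def add_todo(title: str) -> dict:
    return {"id": 1, "title": title}


def list_todos() -> list:
    return []


# Hypothetical registries: each surface maps verb names onto the same core.
CLI_VERBS = {"add": add_todo, "list": list_todos}
MCP_TOOLS = {"add": add_todo, "list": list_todos}


def parity_gaps(cli: dict, mcp: dict) -> dict:
    """Return the verbs present on one surface but missing from the other."""
    return {
        "missing_from_mcp": sorted(set(cli) - set(mcp)),
        "missing_from_cli": sorted(set(mcp) - set(cli)),
    }


def surfaces_in_parity(cli: dict, mcp: dict) -> bool:
    """True only when both surfaces expose exactly the same verb names."""
    gaps = parity_gaps(cli, mcp)
    return not gaps["missing_from_mcp"] and not gaps["missing_from_cli"]
```

Run in CI, a check like this fails the build the moment someone adds a verb to one door and forgets the other.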

Our loops package is the clearest example of the fourth door. It is a local CLI and daemon for scheduled and recurring work that survives process restarts and records every run, with guarded CLI adapters for headless coding agents — the same coding CLI I use myself, plus a handful of others, all driven the same way whether a human typed the schedule or another agent registered it. MCP mutations through that daemon are off by default, same as everywhere else, because a scheduler that can silently start mutating state on a timer is exactly the kind of surface that needs to fail closed.

The package names are plain nouns, plural, and each one owns exactly its domain: todos, loops, machines, secrets, identities, mementos, conversations, brains, testers, wallets, guardrails. No package tries to be a platform. If a concern needs a home, it gets its own repo, its own SQLite file, its own CLI binary, and a design doc that states what it must never become. MCP HTTP ports are allocated sequentially across the fleet (brains on 8801, machines on 8821, the markdown protocol on 8822), so an agent can read a service's port from a table instead of asking a human to look it up.

The microservices package makes the same point at a different scale: 21 independent building blocks — auth, teams, billing, llm, agents, memory, knowledge, guardrails, prompts, notify, files, audit, traces, flags, jobs, and more — each with its own Postgres schema, HTTP API, MCP server, and CLI binary. Install only what you need, embedded in your app or standalone as its own service. Every one of those 21 ships an MCP server, not because a human asked for one, but because the assumption going in was that an agent, not a developer clicking through an admin panel, is going to be the thing reading audit logs and flipping feature flags at 4am.

The rule holds even when we are not building from scratch. Forking a file manager and a terminal multiplexer taught me that fast: our fork of a terminal file manager added a machine-aware entry layer so an agent working across a fleet can browse a remote box the same way it browses local disk, resolved live against the machine registry. Our fork of a terminal workspace multiplexer went further and added structured session and pane discovery, prompt delivery receipts, prompt queues, agent metadata, and pane observation with audit records: an API surface bolted onto a tool that used to expose nothing but keybindings. Both forks stay intentionally close to upstream, with credit and licenses preserved. What changed is that a terminal built for a human's eyes now also answers structured queries from an agent's code.

Everything speaks --json. Every command has a machine-readable output mode, and every package has a self-describing manual — `manual --json` — an agent can read to learn the whole contract without a human explaining it. Shell completions ship too, because half of my agents live in a shell and the other half live in an MCP client, and neither should need a README open in a second window. If your tool needs a human to read the docs aloud to the agent, you built a tool for 2023.
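A minimal sketch of a self-describing CLI, assuming the verbs live in one declarative table that drives both the argument parser and the `manual --json` output. The program name `todos`, its verbs, and the manual's JSON shape are all hypothetical.

```python
import argparse
import json

# Hypothetical verb table: the single source of truth for parser and manual.
VERBS = {
    "add": {
        "summary": "Create a todo",
        "args": [{"name": "title", "help": "Title of the todo"}],
    },
    "list": {
        "summary": "List todos",
        "args": [{"name": "--limit", "help": "Maximum rows to return"}],
    },
}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="todos")
    sub = parser.add_subparsers(dest="verb")
    for name, spec in VERBS.items():
        cmd = sub.add_parser(name, help=spec["summary"])
        for arg in spec["args"]:
            cmd.add_argument(arg["name"], help=arg["help"])
        # Every verb gets a machine-readable output mode.
        cmd.add_argument("--json", action="store_true", help="Emit JSON")
    manual = sub.add_parser("manual", help="Describe every verb")
    manual.add_argument("--json", action="store_true", help="Emit JSON")
    return parser


def manual_json() -> str:
    """Serialize the whole contract so an agent can read it in one call."""
    return json.dumps({"program": "todos", "verbs": VERBS}, sort_keys=True)


def main(argv: list) -> str:
    args = build_parser().parse_args(argv)
    if args.verb == "manual":
        if args.json:
            return manual_json()
        return "verbs: " + ", ".join(sorted(VERBS))
    # Real verbs would call the core engine here; this stub only echoes.
    return json.dumps({"verb": args.verb, "ok": True})
```

Because the manual is generated from the same table as the parser, the two cannot drift apart.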

Output is a token budget, not a courtesy. Playwright's MCP server burns 13,700 tokens before your agent does anything. Chrome DevTools MCP burns 17,000. That is context your agent needed for actual work, spent before the first real action. So we built our own browser instead of living with that tax: 154 tools across 4 engines, native Bun.WebView instead of a bundled Chromium, 50x less memory, 2x faster startup. One install, every browser, and none of the token overhead we were paying just to open a tab. The same discipline runs through every other tool: compact, structured answers by default — bounded rows, ids, previews, and a hint for the next detail step — with full output expandable on demand. Tests, builds, searches, git state: the routine loops that burn context get compressed, because the expensive resource in an agent system is not CPU. It is attention.
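A minimal sketch of that compact-by-default shape, assuming each result row carries an `id` and a `text` field. The field names, the default limits, and the `show <id>` and `--full` hints are hypothetical.

```python
def compact(rows: list, limit: int = 5, preview_chars: int = 40) -> dict:
    """Return a bounded summary: ids and short previews, not full rows."""
    shown = rows[:limit]
    return {
        "total": len(rows),
        "shown": len(shown),
        "rows": [
            {"id": row["id"], "preview": row["text"][:preview_chars]}
            for row in shown
        ],
        # Tell the caller how to get more instead of sending it up front.
        "hint": "run `show <id>` for one row or pass --full for everything",
    }
```

The caller pays for five previews by default and spends more tokens only when it decides a row is worth expanding.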

Our QA tool follows the same logic from the other direction: instead of a human writing test scenarios, it points a cheap agent at a URL, crawls the app, and generates one scenario per discovered link, button, form, and input automatically. Saved workflows then fan out across up to 12 concurrent sandboxes in deterministic batches, with a dry-run preflight that checks credentials, rsync, and app sources before a single token gets spent driving a browser. The human's job shrank to pointing at a URL and reading a report. Everything in between is agent work, so everything in between is built for an agent to run unattended.
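A minimal sketch of deterministic batching, assuming that "deterministic" means the same set of scenarios always produces the same batches regardless of discovery order. Sorting by id is my assumption about how to get there, and the scenario ids are hypothetical.

```python
def deterministic_batches(scenario_ids: list, max_concurrent: int = 12) -> list:
    """Split scenarios into stable batches of at most max_concurrent items."""
    ordered = sorted(scenario_ids)  # fixed order makes reruns reproducible
    return [
        ordered[start:start + max_concurrent]
        for start in range(0, len(ordered), max_concurrent)
    ]
```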

The same economics produce a pattern I keep coming back to: smart LLM writes it, cheap LLM or regex executes it. Have the frontier model produce structured markdown once, then let deterministic code or a cheap model run it a thousand times. That is the entire thesis behind our markdown protocol — structured markdown as the intermediate representation between models, with validate, run, compile, lint, and inspect commands sitting on top of it. You do not pay frontier prices for the thousandth execution of a plan the frontier model already wrote.
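A minimal sketch of the pattern, assuming the frontier model emits a markdown checklist with one `verb: argument` step per line. This format and these handlers are invented for illustration; they are not the actual markdown protocol.

```python
import re

# Hypothetical plan, written once by an expensive model.
PLAN = """\
# Plan: tidy release notes
- [ ] upper: changelog
- [ ] repeat: ab
"""

# One regex is the whole "runtime": no model call is needed to execute a plan.
STEP = re.compile(r"^- \[ \] (?P<verb>\w+): (?P<arg>.+)$", re.MULTILINE)

# Hypothetical deterministic handlers, one per verb.
HANDLERS = {"upper": str.upper, "repeat": lambda text: text * 2}


def parse_plan(markdown: str) -> list:
    """Extract (verb, argument) pairs from the checklist lines."""
    return [(m["verb"], m["arg"].strip()) for m in STEP.finditer(markdown)]


def run_plan(markdown: str) -> list:
    results = []
    for verb, arg in parse_plan(markdown):
        if verb not in HANDLERS:
            # Fail closed: an unknown verb stops the run instead of guessing.
            raise ValueError(f"unknown verb: {verb}")
        results.append(HANDLERS[verb](arg))
    return results
```

The thousandth call to `run_plan` costs the same as the first: a regex scan and a dictionary lookup.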

State is local-first and durable. SQLite under a predictable directory — `~/.hasna/<package>` for most of them, `~/.codewith` for the coding CLI — daemons that survive restarts, every run recorded, idempotency keys and fingerprint upserts so a loop can retry without creating duplicates. Agents crash, sessions die, machines reboot. The substrate does not get to lose data because a process ended. And the state directory is overridable by an environment variable on every package — `<PACKAGE>_DATA_DIR` — because agents run tests too, and test isolation is not optional. A test run that writes into the same SQLite file as production is not a test, it is a liability with a green checkmark.
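A minimal sketch of both ideas together, assuming the override variable is the upper-cased package name followed by `_DATA_DIR` and that runs are keyed by an idempotency key. The `state.db` file name and the `runs` table are hypothetical.

```python
import os
import sqlite3
from pathlib import Path


def data_dir(package: str) -> Path:
    """Resolve the state directory, honoring <PACKAGE>_DATA_DIR when set."""
    override = os.environ.get(f"{package.upper()}_DATA_DIR")
    base = Path(override) if override else Path.home() / ".hasna" / package
    base.mkdir(parents=True, exist_ok=True)
    return base


def open_db(package: str) -> sqlite3.Connection:
    conn = sqlite3.connect(data_dir(package) / "state.db")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS runs ("
        "idempotency_key TEXT PRIMARY KEY, status TEXT NOT NULL)"
    )
    return conn


def record_run(conn: sqlite3.Connection, key: str, status: str) -> None:
    """Upsert by key, so a retried loop updates its row instead of duplicating it."""
    conn.execute(
        "INSERT INTO runs (idempotency_key, status) VALUES (?, ?) "
        "ON CONFLICT(idempotency_key) DO UPDATE SET status = excluded.status",
        (key, status),
    )
    conn.commit()
```

A test sets the environment variable to a temporary directory before opening the database, and the production file is never touched. The upsert syntax needs SQLite 3.24 or newer.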

Hard limits are part of the interface. Every file stays under 700 lines. I have watched agents split a 1,142-line SDK barrel into nine cohesive modules and a 1,214-line Stripe handler into seven, every one of them landing comfortably under the ceiling, because the rule does not bend for money-critical code either. Subagent concurrency is capped by policy, and the cap is a number I tune, not a vibe: I raised it from 2 to 6 in one pass, and to 16 on a machine that could take it, and both changes live in the same rules file as everything else, in plain language, not in a comment nobody rereads. A UI block registry capped at 16 types. Numbers as governance. Agents will happily generate infinite complexity; the limits are how the system says no on my behalf at 3am, when I am not watching.
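A minimal sketch of how a line ceiling can be enforced mechanically, assuming it runs as a pre-commit or CI check. The glob pattern and the function name are hypothetical; the 700-line ceiling is the one stated above.

```python
from pathlib import Path

MAX_LINES = 700  # every file must stay strictly under this


def oversized_files(root, pattern: str = "*.py", max_lines: int = MAX_LINES) -> list:
    """Return (path, line_count) for every file at or over the ceiling."""
    offenders = []
    for path in sorted(Path(root).rglob(pattern)):
        with path.open(encoding="utf-8", errors="replace") as handle:
            count = sum(1 for _ in handle)
        if count >= max_lines:
            offenders.append((str(path), count))
    return offenders
```

An empty list means the tree passes. Anything else is a concrete list of files for an agent to split.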

Safety rails are product features, not disclaimers. The exec path refuses destructive patterns outright: filesystem formatting, fork bombs, curl piped into bash, rewrites under ~/.ssh. Prompt delivery refuses shell panes, and command execution refuses agent panes, so prompt text can never land in bash by accident. MCP mutations are off by default and require exact confirmation strings when enabled — not a yes/no prompt an agent can talk itself past, an exact string it has to reproduce. Secrets are listed and searched without ever printing values. An agent-facing tool that can be talked into disaster by one bad completion is not powerful. It is broken.
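A minimal sketch of a fail-closed exec check and an exact-string confirmation. The patterns are illustrative and far from complete, and the `CONFIRM <action> <target>` format is a hypothetical stand-in for whatever string a real tool requires.

```python
import re

# Illustrative deny list; a real one would be broader and tested adversarially.
DENY_PATTERNS = [
    ("filesystem formatting", re.compile(r"\bmkfs(\.\w+)?\b")),
    ("fork bomb", re.compile(r":\(\)\s*\{\s*:\|:\s*&\s*\}\s*;\s*:")),
    ("curl piped into a shell", re.compile(r"\bcurl\b[^|]*\|\s*(sudo\s+)?(ba)?sh\b")),
    ("write under ~/.ssh", re.compile(r"(>|\b(?:rm|mv|cp|tee)\b).*~/\.ssh\b")),
]


def refusal_reason(command: str):
    """Return the label of the first matching deny pattern, or None."""
    for label, pattern in DENY_PATTERNS:
        if pattern.search(command):
            return label
    return None


def is_refused(command: str) -> bool:
    return refusal_reason(command) is not None


def confirm_mutation(action: str, target: str, typed: str) -> bool:
    """Require the exact confirmation string; 'yes' or 'y' never passes."""
    return typed == f"CONFIRM {action} {target}"
```

The confirmation string includes the action and the target, so an approval for one mutation cannot be replayed against another.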

We pulled that logic out into its own package, guardrails, so no single tool has to reinvent it: allow, deny, warn, redact, or approval-gate, applied uniformly to actions, prompts, shell commands, MCP calls, browser use, and model routing. One set of policy decisions, reused everywhere an agent might be tempted to do something it should not. The identities package applies the same restraint to information instead of actions — `identities status --json` returns counts and opaque references, never names, emails, or keys, because a metadata-only contract is what a fleet consumer should get regardless of whether the consumer is a script or a person reading over its shoulder.
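A minimal sketch of one policy function shared by every surface, using the five decisions named above. The rule shape, the first-match-wins ordering, the default of allow, and the example patterns are all assumptions for illustration, not the guardrails package's real API.

```python
import re
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    DENY = "deny"
    WARN = "warn"
    REDACT = "redact"
    APPROVAL = "approval"


@dataclass(frozen=True)
class Rule:
    kind: str       # which surface the rule covers: "shell", "prompt", "mcp", ...
    pattern: str    # regex matched against the action text
    decision: Decision


# Hypothetical rules; first match wins.
RULES = [
    Rule("shell", r"\brm\s+-rf\b", Decision.DENY),
    Rule("mcp", r"^delete_", Decision.APPROVAL),
    Rule("prompt", r"sk-[A-Za-z0-9]{8,}", Decision.REDACT),
]


def evaluate(kind: str, text: str, rules: list = RULES) -> tuple:
    """Return (decision, text), with secrets masked when the decision is REDACT."""
    for rule in rules:
        if rule.kind == kind and re.search(rule.pattern, text):
            if rule.decision is Decision.REDACT:
                return rule.decision, re.sub(rule.pattern, "[REDACTED]", text)
            return rule.decision, text
    return Decision.ALLOW, text
```

Every caller, whether it is the exec path, the prompt router, or an MCP handler, asks the same function and gets the same answer.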

And verification is built in, not bolted on. Our dispatch tool exists because driving agents in tmux with send-keys is flaky: the Enter often does not submit and the text just sits in the composer. So dispatch uses bracketed paste, computes a delay from prompt length, retries the Enter, then diffs the pane before and after to confirm the prompt actually landed and submitted. dispatch does not assume success — it verifies. Flaky delivery to an agent is worse than no delivery, because everything downstream becomes a guess.
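A minimal sketch of that deliver-then-verify loop, written against a fake pane rather than real tmux. The delay constants, the convention that the pane's last line is the composer, and the single-line prompt are assumptions. `FlakyPane` is a test double, not part of any real tool.

```python
import time


def paste_delay(prompt: str, base: float = 0.2, per_char: float = 0.001,
                cap: float = 2.0) -> float:
    """Longer prompts get more time to land before Enter is pressed."""
    return min(cap, base + per_char * len(prompt))


def submitted(before: str, after: str, prompt: str) -> bool:
    """The pane changed, shows the prompt, and the composer no longer holds it."""
    lines = after.splitlines()
    composer = lines[-1] if lines else ""
    return after != before and prompt in after and prompt not in composer


def deliver(pane, prompt: str, retries: int = 3, sleep=time.sleep) -> bool:
    before = pane.capture()
    pane.paste(prompt)  # the real tool would use bracketed paste here
    sleep(paste_delay(prompt))
    for _ in range(retries):
        pane.press_enter()
        sleep(0.1)
        if submitted(before, pane.capture(), prompt):
            return True
    return False  # report the failure instead of assuming success


class FlakyPane:
    """Test double that swallows the first few Enter presses."""

    def __init__(self, drop_enters: int = 1):
        self.history = []      # prompts that were actually submitted
        self.composer = ""     # text still sitting in the input box
        self.drop_enters = drop_enters

    def paste(self, text: str) -> None:
        self.composer += text

    def press_enter(self) -> None:
        if self.drop_enters > 0:
            self.drop_enters -= 1
            return
        if self.composer:
            self.history.append(self.composer)
            self.composer = ""

    def capture(self) -> str:
        return "\n".join(self.history + ["> " + self.composer])
```

With `FlakyPane(drop_enters=1)`, the first Enter is lost, the diff shows the prompt still in the composer, and the retry submits it.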

Cost tracking got the same agent-first treatment. Our cost tracker covers Claude Code, Codewith, Gemini, OpenCode, Cursor, and a handful of others, and it distinguishes API-equivalent, metered-API, subscription-included, estimated, and unknown cost by account and by coding agent, then reconciles those estimates against the real billing sources. A human can read that off a dashboard once a week. A budget governor inside a swarm needs to read it every few seconds, in --json, to decide whether to keep spawning workhorse agents or halt on ceiling. Build the dashboard first and the governor becomes an afterthought bolted onto a UI that was never meant to be polled. Build the --json feed first and the dashboard is just another consumer.
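A minimal sketch of a governor polling such a feed, assuming a JSON document with an `entries` list where each entry has a `kind` and a `usd` amount. The feed shape, the `metered-api` spelling, and the rule of counting only metered spend against the ceiling are all hypothetical.

```python
import json


def decide(feed_json: str, ceiling_usd: float) -> str:
    """Return "spawn" while metered spend is under the ceiling, else "halt"."""
    try:
        feed = json.loads(feed_json)
        metered = sum(
            float(entry["usd"])
            for entry in feed["entries"]
            if entry["kind"] == "metered-api"
        )
    except (ValueError, KeyError, TypeError):
        # Fail closed: an unreadable feed is treated as a reason to stop.
        return "halt"
    return "halt" if metered >= ceiling_usd else "spawn"
```

The governor never parses a dashboard. It reads the same structured feed any other consumer would, and it stops when that feed stops making sense.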

None of this is exotic. It is what happens when you take one question seriously at every design review: what does the agent need? Humans forgive ambiguity; agents amplify it. Humans skim verbose output; agents pay per token to read it. Humans stop when something feels wrong; agents barrel through. So the tools must be unambiguous, compact, durable, and fail closed.

Here is the part I did not expect. Designing for agents made everything better for humans too. Compact output is what I want in my terminal. --json is what my scripts want. Idempotent commands are what my fingers want at 2am. Fail-closed exec is what my machines want always. The manual --json contract that an agent reads in one shot is also the fastest way for me to remember my own CLI's flags six months after I wrote them.

It even changed how I hire, in a sense. Every new package starts by asking who the caller is, and the honest answer is almost never "a person typing carefully." It is a loop running at 3am, a subagent three levels down a delegation tree, or a workflow that fired off a cron trigger and will not stop to ask a clarifying question. Build for that caller and the human who occasionally shows up gets the better experience by accident.

The agents are the users now. Build like it.
