Voice-First Specs

12-30-25

I do not write tickets. I talk in walls of intent, and a fleet decomposes them into plans, gated by hard brakes I do not always have to pull myself.

I do not write specs. I talk, in long unstructured paragraphs, and something downstream turns the talking into a plan. That is the entire method, and it works better than any ticket template I ever tried to use.

The clearest proof is a domain-registration pipeline I dictated in one paragraph, profanity included: buy a domain through Route53, always delegate DNS to Cloudflare regardless of registrar, provision SES addresses on it automatically, wire Cloudflare and SES and Resend together for sending. The acceptance test was just as concrete: buy three real three-word domains, provision three SES addresses per domain, send sixteen emails back and forth per address, confirm delivery. No ticket, no user story, no acceptance-criteria template. Just a wall of intent, spoken once, and a fleet that turned it into a working pipeline.

A correction mid-build on that same project taught me something about where the real spec actually lives. An agent had quietly reinterpreted the whole thing as a personal inbox checker instead of infrastructure for giving users and agents email addresses at scale. My correction was blunt and not polite, and it also happened to contain the entire missing spec in one sentence: this app is ours, we have SES access, this is for giving users and agents email addresses. Frustration and precision showed up in the same breath. That keeps happening. The angriest messages in my history are also the most information-dense, because anger strips out the hedging and leaves only the requirement.

Another one, dictated even faster: build a simple internal app for generating logos, structure it like our task manager but CLI-only for now, give it an MCP server and an SDK, wire it into our fine-tuning tool, point it at three different image models including one we can train ourselves, let it output SVG, save everything under a predictable local path. One breath, a working scope. Mid-build the scope grew on its own — should this just be logos, or a whole brands thing — and grew again a message later to include a completely different vendor integration, with a live API key pasted straight into the chat and routed to the secrets vault instead of left sitting in plain text. The spec was never a document. It was a conversation that kept adding clauses.

Sometimes the wall of intent is not even typed or spoken into the same session that has to act on it. A macOS notes app I wanted rebuilt had its actual spec sitting nowhere near the chat — it lived as a pile of voice memos I had recorded on a completely different machine through a separate notes CLI, describing how I wanted the thing structured across local, self-hosted, and cloud layers. The instruction to the agent was not a description of the product. It was: go read my notes cli on the other machine first, then come back with a plan. The spec was audio, recorded whenever the idea occurred to me, and the agent's first job was retrieval, not interpretation.

A twelve-hour dictation about an entirely different product shows the same shape at a bigger scale. I wanted to extend outreach hard, and I said so out loud at length: it takes a tremendous amount of time to prompt every single agent individually, we have goals now, we have loops, we are building workflows, but none of it is enough yet, they should all just work toward goals and plans and keep coding all day uninterrupted regardless of usage limits. That is not a spec in any conventional sense. It is closer to thinking out loud into a microphone. But every noun in it — goals, loops, workflows, uninterrupted execution — became a real primitive within weeks, because the decomposition layer treated the ramble as source material, not noise to filter out.

None of this works without something on the other end whose entire job is decomposition. My stated ideal has always been to dump my thoughts once and let the system turn them into goals, then plans, then tasks, and only come back to me with the questions that actually need me. Coordinators exist specifically to do that translation — turning a paragraph of run-on intent into a task graph nobody has to reread three times to understand.

The best rule to fall out of this whole approach was never written down as a rule. It surfaced by accident in the middle of an unrelated feature dictation, when I said out loud that verification should be automatic, that I should not have to say verify all the time, that the system should just keep checking that the thing actually works. Nobody had asked me to state a philosophy. It fell out of me mid-sentence about something else entirely, which is exactly why I trust it more than anything I would have written into a policy doc on purpose.

Voice-first does not mean consequence-free, and that is where the hard brakes come in. The moment an agent starts building before I have actually agreed to what gets built, I stop it, in writing, every time: I did not say to build it, undo what you did, we need to discuss this further. Sometimes it is shorter and blunter than that — stop, talk to me first, we need to discuss what we are doing before anything gets touched. A wall of spoken intent is not a blank check. It is the start of a conversation, and the conversation has veto points built into it on purpose.

One of those veto points now lives in infrastructure instead of in my mouth, which is the version I actually prefer. A site-generation product I ran through two straight days of rapid-fire dictation — full redesign, a copywriting pass, a rebrand, then a login system — had every one of those goals gated by a session-scoped hook that would not let the session end until the stated condition was genuinely met. When the login goal said Google OR magic link, the hook held the line even after magic link worked end to end, quoting the transcript back to justify why Google specifically was still unresolved. It fired the same rejection five times in a row before the real blocker turned out to be Google's own console, not the agent. A verbal hard brake depends on me being present to say stop. A hook-shaped hard brake does not need me in the room at all.

Discuss, then plan, then build is the actual sequence, even when the discussion is thirty seconds of me thinking out loud. I ask for the plan before I ask for code, on purpose: talk me through the layout specifically, do not give me code yet, tell me what needs to exist before any of this can function. Skipping straight to an artifact is the single fastest way to get work undone, because an artifact built on a guess about intent is not a shortcut, it is a detour with extra steps.

A dense technical spec can come out of the same loose, run-on register as everything else, which is the part people find hardest to believe. An autonomous site-building product got its entire operating model dictated in two sentences: keep the whole thing on a credits-based budget where every task costs a fixed number of credits, and route all the actual code execution through sandboxes because letting generated code run unsandboxed is reckless. That two-sentence ramble became a real architecture — three different coding SDKs pinned to three different layers of the stack, all of it running inside isolated sandboxes, billed to end users in credits. Loose phrasing and precise architecture are not opposites. The looseness is just how the precision gets transmitted before anyone has bothered to formalize it.

That two-layer split — say it messy, let the system make it precise — only holds together because of a rule sitting underneath both the walls of intent and the hard brakes: the default posture for any agent is to act, not to ask. Nobody should be firing a clarifying question back at me for every small judgment call inside a dictation; that defeats the entire point of talking instead of writing tickets. The brake exists for exactly one category of event — an agent about to commit to scope I never actually approved — and nowhere else. Ask constantly and voice-first specs become an interview. Never ask and never brake, and voice-first specs become a runaway train. The method only works because those two failure modes are fenced off from each other on purpose.

The naming of things gets the exact same live-refinement treatment as the specs themselves, which tells you the process is the same process end to end. A pricing model for a token-based freelance marketplace went through three separate name candidates in one sitting — one rejected as too close to an existing brand, a second rejected outright mid-search with an explicit stop, a third finally accepted once I confirmed the domain was buyable. Naming a product and specifying a product are the same act of talking until something coherent falls out and then locking it in.

What makes voice-first specs work instead of collapsing into chaos is that intent and execution are handled by two completely different layers, and neither one is allowed to touch the other's job. I say what I want, in whatever order it occurs to me, contradictions and profanity and all. The system's job is to hold that against reality, ask the two or three questions that actually block progress, and refuse to call anything finished until the original intent — not a paraphrase of it, the actual intent — is satisfied. I do not decompose my own thoughts. I never have to. That is the whole reason this works at the speed it works.

There is a cost to talking instead of writing, and I do not pretend there is not. A spoken wall of intent is redundant, self-correcting mid-sentence, sometimes contradictory by the time you reach the end of it. But a written spec's tidiness is mostly theater anyway — it looks precise and is usually wrong about the one detail that mattered, discovered three weeks later. A dictated wall of intent is honest about being messy while it is being produced, and the mess gets resolved once, by the plan, instead of never, by a document nobody rereads.

I do not think the future of specifying software looks like more structured templates. I think it looks like fewer of them — more raw intent, dumped once, decomposed by something whose only job is decomposition, gated by hard brakes that hold whether or not I am watching. Say what you want. Let the system ask what it needs to know. Stop it the moment it guesses wrong. That is the whole method.

Everything else is paperwork pretending to be a plan.

← Back to the articles

What we shipped, what broke,and what we learned

What we shipped, what broke,
and what we learned