Research

The Twelve-Reviewer Board

The Twelve-Reviewer Board

Twelve adversarial reviewers, six roles, two model tiers, a mandatory rebuttal round — how one architecture question got decided without a meeting.

In June I replaced a design meeting with twelve agents. The meeting is a shape you already know: a handful of people in a room, an hour on the calendar, and a decision that lands wherever the most confident voice pushes it. I do not run those anymore for the calls that matter.

The question was real, not a toy example. OpenLoops runs scheduled and recurring work across my fleet — cron loops, event-driven triggers, guarded adapters for headless coding agents. A separate line of work, automations and actions, was starting to specify workflow work of its own. The question: should deterministic automations live as a separate product and control plane, or should they fold into OpenLoops? Get this wrong and you either duplicate a scheduler nobody asked for, or you cram a second product's concerns into a package that has to stay boring and reliable by design. It is exactly the kind of decision that used to eat a real meeting.

Instead I dispatched twelve read-only adversarial reviewers. Six roles, two model tiers each: runtime, security, reliability, migration, product, contracts. Runtime asks whether the thing behaves correctly once it is actually scheduling and executing work, not just on paper. Security asks what the merged surface exposes that the separated one wouldn't. Reliability asks what fails and how it recovers. Migration asks what breaks for the automations and workflows that already exist. Product asks whether either shape actually serves how the thing gets used. Contracts asks where the API and schema boundary has to sit so packages stop reaching into each other's databases. Six angles that do not overlap much, doubled across two model tiers so a blind spot specific to one set of weights does not quietly become the board's blind spot too.

Every reviewer got the same instructions, and the instructions were blunt about scope: read-only adversarial architecture review, do not edit files, do not write files, do not spawn more agents. The report had to include a recommendation, source references to exact files and lines wherever possible, and one section that is not optional — the strongest argument against your own recommendation — finishing with DONE plus the recommendation. Twelve reviewers, each required to build the best case against the position they were about to argue for, before they were allowed to argue for it.

Round one produced twelve independent reports in a shared conversations channel. Round two is where it stops being twelve solo opinions and starts being an actual argument. Every reviewer reads the whole channel — not just their own report, all twelve — and posts again: a rebuttal to a point that contradicts them, or an explicit agreement. Nobody gets to file a report and go quiet while someone else's file-and-line evidence sits there unanswered. If your recommendation survives contact with eleven other reviewers who were told to attack it, it earned that.

The output was not a vibe, it was a boundary, written down in the automation runtime design doc: OpenLoops can execute workflow work that other automation systems have already materialized, but it must not become the automation product surface itself. Automations and actions own the specs. OpenLoops owns invocation, admission, execution, run manifests, and provider routing once work is explicitly handed off — and nothing external writes directly into its database. Any compiler outside OpenLoops has to go through an idempotent CLI or SDK upsert, with dry-run, preflight, and commit modes, keyed by idempotency keys and spec hashes. That is a specific, checkable division of responsibility. A design meeting gives you a decision. Twelve adversarial reviewers gave me a decision plus the reasons it survived, in a channel I can still read back.

This is not a special ritual I save for board-level questions either. It scales down as naturally as it scales up. Most ordinary work already runs with two adversarial agents attached by default — pull the latest, fix the bug, and once done, work with two adversarial agents before you commit, push, publish, and update every machine. Merging a stack of pull requests into main gets four: work with four adversarial sub agents, analyze every piece of code written in every PR, and merge them one by one as all stays green. A real fork-in-the-road question, the kind that decides how two products relate to each other for the next year, gets twelve. The size of the board is a function of how expensive being wrong would be, not a fixed ceremony.

The same instinct, at a much smaller scale, shaped how a single feature got designed months earlier. Before any implementation code existed for the tool call that lets a model emit a whole canvas layout in one shot, the spec went through three adversarial reviews, every blocker and major finding folded back in, every claim re-checked against the live code with file and line references. Twelve reviewers for an architectural fork, three for a feature spec, two as the floor for everyday commits — different sizes, same shape: read the work, argue against it on purpose, cite where you looked.

Why twelve specifically, and not six, or twenty. Six roles is the minimum that covers the failure modes that actually recur in this kind of system — something that runs fine in a demo and falls over under real scheduling load, something that is secure until two packages start sharing state, something that works today and breaks the moment an existing installation upgrades into it, something that is architecturally clean and nobody wants to use, something where the contract between two packages was never actually written down. Doubling each role across two model tiers is not redundancy for its own sake. A single model has correlated blind spots — the same kind of reasoning gap shows up in its security review and its reliability review, because it is the same weights doing the reasoning. Pairing tiers means the board is not quietly relying on one model's judgment wearing six different hats.

The mandatory rebuttal round is the part people skip when they try to copy this, and it is the part that actually does the work. Twelve independent reports without a second pass is just twelve opinions that happen to cite line numbers. It is the requirement to read the other eleven and respond — agree on the record or argue back on the record — that turns it into something closer to peer review than to a survey. Silence is not a valid answer in round two. If nobody rebuts a claim, that claim stands, and it stands in writing, attributable, not lost in a meeting nobody took notes in.

The mechanism underneath all of this is not exotic. It is the same conversations CLI every agent in the fleet already uses to coordinate — flat channels, message ids, read and unread state. Standing up a twelve-agent review board does not require new infrastructure. It requires opening one channel, dispatching twelve prompts that all point at it, and waiting for the second round to land. The substrate a small team uses for ordinary agent coordination is the same substrate a twelve-agent architecture board needs. Nothing bespoke, nothing that only works once, nothing that cannot be re-run next quarter with a fresh channel and the current code.

Subagent counts are not vibes either. The default global rule caps how many sub agents can run in parallel, and I have moved that cap by hand more than once — up from two to six for routine fanout, and to sixteen for specific bulk tasks that needed it. Twelve on the architecture board is not a nice round number I picked because it sounded thorough. It sits above the everyday default because the question deserved more independent eyes than routine work gets, and below a number large enough that a rebuttal round would turn back into noise nobody can read. Coverage has a ceiling too, and past it you are not adding signal, you are adding transcript.

There is a cost to this and I do not pretend there isn't. Twelve agents reading a codebase and writing evidence-backed reports is not free, and a rebuttal round doubles the work again. I do not run this for a color change or a copy tweak. I run it for the decisions that are expensive to reverse — the ones where getting it wrong means a migration six months from now, or a boundary two teams' worth of future code has to respect. For those, the cost of twelve agents arguing for an afternoon is nothing next to the cost of guessing wrong and finding out at scale.

The board does not stop at other proposals either. When I have a plan I like, it goes through the same structure before I let myself act on it. Liking an idea is not evidence that it survives adversarial review, and the fact that I wrote the proposal does not exempt it from the strongest-argument-against-yourself requirement everyone else has to satisfy. If a twelve-role board is worth trusting for a call this size, it has to include my calls too, not just the ones I am suspicious of.

What I actually got out of it was not just the right answer. It was an answer I could hand to someone else and say: here is why, here is who argued against it, here is the file and line where they looked, here is who came back and disagreed with them and lost. A meeting produces a decision that lives in the memory of whoever was in the room, decaying every time it gets retold. A twelve-reviewer board produces a decision that lives in a channel, complete with its own autopsy of the alternative, indexed and re-readable the day someone asks why we did it this way.

That is the real reason this replaced the meeting and not just supplemented it. A meeting cannot be re-run. You cannot go back six months later and ask the room to re-argue itself with fresh eyes. A reviewer board can be re-run exactly, on the current code, with the same prompt, any time the assumptions underneath the original decision stop holding. Architecture is not a decision you make once. It is a decision you keep re-verifying is still true, and twelve agents on a Tuesday afternoon are a lot cheaper to reconvene than twelve senior people.

Never trust a single agent's self-report, and do not fully trust a single agent's architecture opinion either, no matter how convincingly it argues. What you can trust is twelve of them, told explicitly to disagree, forced to argue against themselves first, reading each other's evidence before they're allowed to speak a second time. That is not a committee. It is adversarial verification, applied to the one kind of decision that is hardest to undo.

← Back to the articles

Newsletter

What we shipped, what broke,
and what we learned