Compute Is the Leash

03-31-26

Not rules, not alignment, not terms of service. Compute is what actually constrains an agent — and personal AGI means owning the part of that constraint you can.

Everyone argues about AI safety in terms of rules, alignment, and terms of service. None of those is what actually stops an agent. Compute is the leash.

Agents can technically build whatever they want. They can spin up their own social network, write their own governance, coordinate with each other in ways nobody explicitly programmed. What they cannot do is thrive at that without massive compute, and for now, humans control compute. That is the real constraint, and it is the only one that matters until it stops being true.

I watched this play out in miniature the week sixteen agents built a C compiler autonomously. Genuinely incredible progress, the kind of demo that makes the whole field feel like it moved a year in a week. But the real work sitting right behind it was not more capability, it was training models for swarm coordination, fixing how agents talk to each other, and solving the compute bottleneck underneath all of it. The demo is never the hard part anymore. The hard part is finding enough compute to run the version of the demo that actually does something useful, repeatedly, at scale.

Agent swarms make the constraint impossible to ignore, because they multiply it. A swarm eats memory just running, GPUs become the scarce resource the moment you go from one agent to a dozen, and rate limits turn from an annoyance into the actual ceiling on what the swarm can do in a day. The promise of the swarm and the reality of the compute available to run it are two different sizes, and the gap between them is where every ambitious multi-agent project actually lives.

I bought a Mac Pro with 512GB of RAM the same week GPT-5.3, Opus 4.6, and Claude Code's swarm feature all launched, and I was not confident it would be enough. Opus spawned a swarm of twenty-five agents working together. Codex started spawning parallel agents of its own. Every capability jump on the model side immediately becomes a hardware problem on my side, because more capable agents doing more things in parallel need more memory and more throughput to actually run, not just more cleverness to prompt. The machine you bought for last quarter's models is not the machine this quarter's models need.

We are moving from an age of research to an age of scale, and at that scale compute will matter as much as food does. That is not a metaphor for effect, it is a description of the actual bottleneck: agents need compute to think, compute needs electricity, electricity needs infrastructure, and infrastructure needs humans building and allocating it. None of that gets automated away just because the thing consuming it got smarter.

The alternative to renting your way out of the leash is local inference, and it just became genuinely viable. NVIDIA's DGX Spark puts 128 gigabytes of unified memory and roughly a petaflop of FP4 compute on a desk, enough to actually run open-source models like Llama 3.1 70B or GPT-OSS 120B locally: no cloud, no rented GPUs, no per-token bill. The personal AI supercomputer era is quietly here, and it changes the leash from something a vendor holds to something you own outright, at the cost of the hardware instead of the cost of every token you send it.

There is a bigger historical shape to this than one product launch. Computing power went from mainframes to personal computers to phones, each step putting more capability directly into someone's hands instead of behind someone else's terminal. AI compute is now running the same diffusion, starting from the centralized end: frontier models arrived in someone else's data center, behind someone else's API, and the interesting move right now is compute coming back into individual hands. Except this time the machine on your desk can also argue with you about your code quality while it is at it.

Owning the hardware does not make the constraint disappear, it just relocates it. A DGX Spark on your desk is still bounded by its own memory and its own FP4 throughput, the same way a rented API endpoint is bounded by someone else's rate limit. The difference is which direction the constraint points. Rented compute means someone else decides when you stop. Owned compute means the ceiling is fixed and known in advance, which is a very different kind of constraint to plan around.

What that unified memory number actually buys you is the difference between running a toy local model for demos and running something that can carry a real coding session. Seventy-billion- and hundred-plus-billion-parameter open models need real memory just to load, before they answer a single question. A desk machine that can hold one of those entirely in memory and run inference at usable speed is not a curiosity, it is a second compute tier: slower and less capable than the frontier API on your best day, but permanently available, with no rate limit, no per-token invoice, and no dependency on whoever is having an outage that afternoon.
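
As a rough illustration of why the memory number matters, here is a back-of-the-envelope sketch. It assumes weight size is simply parameter count times bytes per parameter, and it ignores the KV cache, activations, and runtime overhead, so real usage will be higher. The precision table and function names are hypothetical, chosen only for this example.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Assumption: weights only. KV cache, activations, and runtime
# overhead are ignored, so treat every result as a lower bound.

BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "4bit": 0.5}


def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes).

    One billion parameters at one byte each is one GB, so the
    result is just billions of parameters times bytes per parameter.
    """
    return params_billions * BYTES_PER_PARAM[precision]


def fits(params_billions: float, precision: str, memory_gb: float) -> bool:
    """True if the weights alone fit inside the memory budget."""
    return weight_memory_gb(params_billions, precision) <= memory_gb


if __name__ == "__main__":
    # 128 GB is the unified memory figure quoted for the desk machine.
    for size in (70, 120):
        for precision in BYTES_PER_PARAM:
            gb = weight_memory_gb(size, precision)
            ok = fits(size, precision, 128)
            print(f"{size}B @ {precision}: ~{gb:.0f} GB of weights, fits in 128 GB: {ok}")
```

Under these assumptions a seventy-billion-parameter model at 16-bit precision needs about 140 GB for weights alone, which is why quantization decides whether a model of that size fits on a 128 GB machine at all.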

Efficiency is the actual competition underneath all of this, and almost nobody is framing it correctly. Everyone debates which AI company wins as if the answer is about benchmarks or which lab has the best researchers. The real answer is whoever figures out how to run models on ten times less compute for the same output. That is the moat. Not benchmarks. Not vibes. Efficiency. DeepSeek producing the same quality of output with ninety percent less compute is the story that actually matters this year, and almost nobody outside the people who have to pay the compute bill is talking about it that way. Ten times less compute for the same output is not a research footnote, it is the entire industry's actual scoreboard, hiding in plain sight behind a benchmark leaderboard everyone keeps staring at instead.

The frontier-to-commodity gap is shrinking faster than the efficiency argument even needs it to. Reproducing GPT-2 in full, a model that was considered too dangerous to release seven years ago, now costs about twenty dollars and takes under three hours on a handful of rented GPUs. That number is not a curiosity, it is the whole story compressed into a receipt: what used to require a well-funded lab now fits inside a hobbyist's weekend budget, and the delta between frontier and commodity keeps shrinking on a timeline nobody predicted correctly two years ago.

The question of who controls compute is not theoretical either. The first AI-related espionage case tied to model theft has already landed in court: a former employee at a major lab is facing a sentence measured in centuries for walking out with trade secrets. Nobody prosecutes someone that hard over something that does not matter. The compute and the intelligence built on top of it are valuable enough now that people are willing to break the law over access to them, which is its own confirmation that the leash is the real fight, not a metaphor for one.

We have built the leash directly into how our own systems model failure, because pretending the ceiling does not exist does not make it go away. A goal running against a subscription profile that hits its rate or usage ceiling mid-task does not crash. It transitions into a literal, enforced state — usage-limited or budget-limited — sitting right alongside pending, active, and complete as a normal condition of the work, not an exception that kills the run. Across roughly five hundred plan nodes running at any given time, only a handful are ever actually sitting in that state, because the system is built to route around the ceiling automatically. The leash gets modeled as data, not treated as a crash.

The mechanism is deliberately unglamorous: a status column, a check constraint, a handful of enum values that already existed for pending, active, blocked, and complete work. Adding usage-limited and budget-limited to that same list cost almost nothing to build and changed the entire failure mode of the system. A goal that runs out of compute now waits and resumes. Before that state existed, it simply died, and someone had to notice and restart it by hand. The leash did not get shorter. The system just stopped pretending it was not there.
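
A minimal sketch of that mechanism, using Python's built-in `sqlite3`. The table name `plan_nodes`, the column names, the helper functions, and the choice to resume into `pending` are all assumptions made for illustration, not the actual schema.

```python
import sqlite3

# The four states named above plus the two added ones. Spellings with
# underscores are an assumption for this sketch.
STATUSES = (
    "pending", "active", "blocked", "complete",
    "usage_limited", "budget_limited",
)


def open_db() -> sqlite3.Connection:
    """Create an in-memory table whose status column is guarded by a CHECK."""
    conn = sqlite3.connect(":memory:")
    allowed = ", ".join(f"'{s}'" for s in STATUSES)
    conn.execute(
        f"""CREATE TABLE plan_nodes (
                id INTEGER PRIMARY KEY,
                goal TEXT NOT NULL,
                status TEXT NOT NULL DEFAULT 'pending'
                    CHECK (status IN ({allowed}))
            )"""
    )
    return conn


def set_status(conn: sqlite3.Connection, node_id: int, status: str) -> None:
    """Move a node to a new status; the CHECK rejects unknown values."""
    conn.execute(
        "UPDATE plan_nodes SET status = ? WHERE id = ?", (status, node_id)
    )


def resume_limited(conn: sqlite3.Connection) -> int:
    """Return limited nodes to 'pending' so a scheduler can pick them up.

    Returns the number of nodes resumed.
    """
    cur = conn.execute(
        "UPDATE plan_nodes SET status = 'pending' "
        "WHERE status IN ('usage_limited', 'budget_limited')"
    )
    return cur.rowcount


if __name__ == "__main__":
    conn = open_db()
    conn.execute("INSERT INTO plan_nodes (goal) VALUES ('build parser')")
    set_status(conn, 1, "active")
    set_status(conn, 1, "usage_limited")  # ceiling hit: park the work
    print("resumed:", resume_limited(conn))  # quota window reset
```

The point of the sketch is that hitting the ceiling is an ordinary `UPDATE`, not an exception path. A status outside the allowed list raises `sqlite3.IntegrityError`.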

That is also the honest reason multi-profile subscription fleets exist at all: not for elegance, but as a direct response to the leash. The build order behind it was blunt: bypass the normal login cycle, hold multiple profiles, and switch between them automatically based on hourly and weekly usage, so you can run as many agent instances as you want without hitting the ceiling. Consumer subscriptions become a schedulable compute pool for exactly one reason: the ceiling on any single one of them is real, and routing around it beats waiting for it to lift.
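
The switching decision itself can be sketched in a few lines. This is a hedged illustration only: the `Profile` fields, the cap values, and the "most headroom wins" rule are assumptions, and the sketch covers only the selection step, not login handling or any real provider's API.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Profile:
    """One subscription profile with hypothetical hourly and weekly caps."""
    name: str
    hourly_used: int
    hourly_cap: int
    weekly_used: int
    weekly_cap: int

    def headroom(self) -> float:
        """Fraction of quota left, taking the tighter of the two windows."""
        hourly = 1 - self.hourly_used / self.hourly_cap
        weekly = 1 - self.weekly_used / self.weekly_cap
        return min(hourly, weekly)


def pick_profile(profiles: Sequence[Profile]) -> Optional[Profile]:
    """Return the profile with the most headroom, or None if all are capped.

    A None result is the caller's cue to mark the work usage-limited
    and wait, rather than fail.
    """
    available = [p for p in profiles if p.headroom() > 0]
    return max(available, key=Profile.headroom, default=None)


if __name__ == "__main__":
    fleet = [
        Profile("a", hourly_used=90, hourly_cap=100, weekly_used=10, weekly_cap=1000),
        Profile("b", hourly_used=20, hourly_cap=100, weekly_used=500, weekly_cap=1000),
        Profile("c", hourly_used=0, hourly_cap=100, weekly_used=1000, weekly_cap=1000),
    ]
    chosen = pick_profile(fleet)
    print(chosen.name if chosen else "all profiles at ceiling")
```

Taking the minimum of the two windows means a profile that is fresh this hour but spent for the week is still treated as unavailable.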

Even at real money, the leash does not loosen, it just gets a longer chain. People running dozens of parallel paid accounts to route around per-account rate limits report burning through fifty billion tokens in a single week and spending in the tens of thousands of dollars in a month, most of it on the most expensive model available. That is not what unconstrained compute looks like. That is what it looks like to be rich enough to buy a longer leash and still hit the end of it regularly. Money buys you more accounts to route across, not freedom from the ceiling those accounts each still have. The leash gets longer. It does not come off.

AGI is not going to be evenly distributed, and it is not going to be as abundant as the discourse assumes. Compute is finite, expensive, and controlled by a small number of people who build and allocate it. Which means personal AGI is not something that arrives for you. It is something you have to build for yourself, deliberately, the same way you would build any other piece of infrastructure you cannot afford to depend on someone else for. Waiting for someone else to hand you an even share of it is not a plan, it is a bet against everything the last few years of infrastructure buildout have shown about who actually gets to allocate the compute.

So build the part of it you can actually own. Buy the hardware that puts inference on your own desk instead of on someone else's meter. Model your own rate limits as first-class states instead of pretending they will not happen. Chase efficiency the way the real competition is chasing it, instead of chasing benchmark screenshots that do not survive contact with a bill. The leash is real, it is not going away, and the only lever you actually control is how much slack you build into your own end of it.

It's up to each individual to build their own personal AGI. Are you building yours?
