a16z For Limited Partners
Issue 01 · June 2026

AI Inference
Portfolio Intelligence for Institutional Allocators
This Issue

AI Is Shedding Its Compute Stack

Every prior computing transition replaced the assumptions beneath the stack rather than upgrading the stack itself. Agents are forcing the same replacement now, at every layer at once.

a16z Portfolio Intelligence Team · June 2026

Software infrastructure doesn't age gracefully. It accumulates. Each new workload gets bolted onto the last architecture, each new demand gets answered with a workaround, until one day the whole thing is held together by assumptions that no longer describe the world it is running in. That is how transitions announce themselves, and one is announcing itself now.

The transition is being driven by agents. An agent doesn't submit a request and wait. It receives a goal and fans out, spawning sub-tasks, querying databases, calling APIs, spawning more sub-tasks from the results of those calls, recursively, in parallel, in milliseconds. The infrastructure underneath was never designed for that. Understanding why, and what it takes to fix it, is the clearest available view into where the next generation of foundational technology companies will come from.

Fig. 1 — One request vs. one goal
A human interaction produces one system response: one request, one response, the pattern the stack was built for. A single agent goal triggers a recursive fan-out of sub-tasks, database queries, and API calls, growing by the millisecond. The figure compares the two workloads the same infrastructure is asked to serve.

Every layer encounters the same problem differently

The stack assumes work arrives in an orderly way. Agents break that assumption everywhere at once.

All software runs on a tower of supporting layers: chips, data centers, and power at the bottom, and above them the systems that make raw hardware usable, including storage, networking, databases, and the scheduling and routing software that decides what runs where and when. The bottom of that tower has already been rebuilt for AI. GPUs displaced CPUs as the unit of compute, racks were redesigned around power densities several times what they once carried, and cooling went liquid. The layers above are what haven't caught up, and they share a single organizing assumption: work arrives in an orderly way. One request, one response, predictable volume, schedulable load. For decades that assumption accurately described how software was used, and every layer has it baked in.

Agents violate it everywhere, simultaneously. Today's enterprise backends are designed for a 1:1 ratio of human operations to system responses, whereas a single agent goal can trigger 5,000 sub-tasks, database queries, and internal API calls at millisecond scale. When an agent refactors a codebase or combs through security logs, traditional databases and rate limiters read the resulting burst as a DDoS attack.

That image is worth sitting with. Infrastructure built to serve users is interpreting legitimate work as an attack, because the pattern of that work no longer resembles anything it was designed to recognize. The failure runs through every layer.
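A minimal sketch in Python shows how a limiter tuned for people ends up treating an agent as an attacker. It is an illustration under stated assumptions, not a description of any particular product: the class name `FixedWindowLimiter`, the limit of 100 calls per one-second window, and the 0.1 ms spacing between agent calls are all hypothetical. Only the 5,000-call burst comes from the text above.

```python
from dataclasses import dataclass, field


@dataclass
class FixedWindowLimiter:
    """Per-client fixed-window rate limiter (illustrative values only)."""
    limit: int = 100        # hypothetical: calls allowed per window
    window_ms: int = 1000   # hypothetical: window length in milliseconds
    _counts: dict = field(default_factory=dict)

    def allow(self, client_id: str, now_ms: int) -> bool:
        # Bucket the call into its window and count it against the limit.
        key = (client_id, now_ms // self.window_ms)
        used = self._counts.get(key, 0)
        if used >= self.limit:
            return False
        self._counts[key] = used + 1
        return True


def simulate(calls: int, spacing_ms: float) -> int:
    """Send evenly spaced calls from one client; return how many are rejected."""
    limiter = FixedWindowLimiter()
    rejected = 0
    for i in range(calls):
        if not limiter.allow("client", int(i * spacing_ms)):
            rejected += 1
    return rejected


# A person clicking once per second for a minute.
human_rejected = simulate(calls=60, spacing_ms=1000)

# One agent goal fanning out into 5,000 calls inside half a second.
agent_rejected = simulate(calls=5000, spacing_ms=0.1)

print(human_rejected, agent_rejected)
```

Under these assumptions the person is never throttled, while the agent has 4,900 of its 5,000 legitimate calls refused. Nothing in the limiter is broken; it is simply enforcing a definition of normal traffic that the agent does not fit.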

Fig. 2 — What each layer was built for, and what agents demand
| Layer | Last rearchitected | The stack was built for | What agents demand |
| --- | --- | --- | --- |
| Compute | Cloud autoscaling, ~2006–2014 | Large, schedulable jobs consuming known capacity over predictable periods | Bursty, recursive work that arrives without warning and stops without notice |
| Networking | Software-defined networking, early 2010s | Moving large volumes in predictable, directional transfers | Coordinating thousands of stateful, interdependent calls arriving in cascades |
| Storage | Object storage, 2006 | Files and objects, read and written at human pace | Repeated, fast access to weights, activations, and embeddings across simultaneous processes |
| Orchestration | Containers and serverless, ~2014 | Discrete, stateless tasks enumerated in advance | Goals that recursively generate their own interdependent sub-tasks |
| Coordination | Microservices-era gateways, early 2010s | Routing, locking, and policy enforcement at human concurrency | State management and policy enforcement across massive parallel execution |
Five layers, one root cause. Each was designed around orderly work, and none has been fundamentally rearchitected since the cloud era. The prose that follows goes deep on two of them.

A pair of examples makes the failure concrete.

Start with how databases handle simultaneous access. When two processes try to modify the same record, the database resolves the conflict with a row-level lock: the first process takes exclusive control, and everything else waits. Locking works when conflicts are rare, which they are when the users are people. A thousand agents working the same dataset collide constantly, and a system built on waiting in line spends most of its time waiting: a bank branch staffed for the occasional customer, hit by a thousand at once. The replacement, already shipping in a new class of databases, is optimistic concurrency: let every process proceed, reconcile conflicts after the fact, and treat simultaneous demand as the normal condition.
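The difference between waiting for a lock and reconciling after the fact can be sketched in a few lines of Python. This is a toy, single-threaded, in-memory illustration of version-checked writes, one common way to implement optimistic concurrency. The names `Store`, `write_if_unchanged`, and `increment_with_retry` are hypothetical, and a real database would perform the version check and the write as one atomic step.

```python
from dataclasses import dataclass


@dataclass
class Record:
    value: int
    version: int = 0


class VersionConflict(Exception):
    """Raised when a record changed between a writer's read and its write."""


class Store:
    def __init__(self):
        self._rows = {}

    def put(self, key, value):
        self._rows[key] = Record(value)

    def read(self, key):
        row = self._rows[key]
        return row.value, row.version

    def write_if_unchanged(self, key, new_value, expected_version):
        # No lock is held between read and write; the version check
        # detects whether another writer got there first.
        row = self._rows[key]
        if row.version != expected_version:
            raise VersionConflict(key)
        row.value = new_value
        row.version += 1


def increment_with_retry(store, key, max_attempts=5):
    """Re-read and retry on conflict; return the number of attempts used."""
    for attempt in range(1, max_attempts + 1):
        value, version = store.read(key)
        try:
            store.write_if_unchanged(key, value + 1, version)
            return attempt
        except VersionConflict:
            continue
    raise RuntimeError("too many conflicts")


store = Store()
store.put("counter", 0)

# Two agents read the same record before either writes.
value_a, version_a = store.read("counter")
value_b, version_b = store.read("counter")

store.write_if_unchanged("counter", value_a + 1, version_a)  # agent A commits
try:
    store.write_if_unchanged("counter", value_b + 1, version_b)  # agent B is stale
    conflict_detected = False
except VersionConflict:
    conflict_detected = True

attempts = increment_with_retry(store, "counter")  # agent B reconciles
final_value, final_version = store.read("counter")
```

Neither agent ever waits on the other. The stale write is rejected, agent B re-reads and succeeds, and both increments land. The cost moves from time spent queued behind a lock to occasional retried work, which is the better trade when simultaneous access is the normal condition.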

Then the cold start. Serverless platforms save money by keeping capacity asleep until a request arrives, and waking it takes a moment. A person never notices the moment. An agent dispatching five thousand sub-tasks notices it five thousand times, and because each delayed sub-task delays everything downstream of its results, the waiting compounds through the cascade. It is a kitchen that lights the stove when each order arrives: fine at lunch-counter pace, finished by the dinner rush.
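The compounding is easier to see with numbers. The sketch below, in Python, computes when a small chain of dependent sub-tasks finishes, with and without a cold-start penalty on every task. The task graph, the 20 ms of work per task, and the 200 ms cold start are hypothetical figures chosen for illustration, and the function name `completion_ms` is an assumption of this example.

```python
from functools import lru_cache

# Hypothetical task graph: each task lists the tasks whose results it needs.
DEPENDS_ON = {
    "plan": [],
    "query_a": ["plan"],
    "query_b": ["plan"],
    "merge": ["query_a", "query_b"],
    "report": ["merge"],
}


def completion_ms(task_graph, work_ms, cold_start_ms):
    """Time at which the last task finishes, assuming unlimited parallelism."""

    @lru_cache(maxsize=None)
    def finish(task):
        # A task can start only once every dependency has finished,
        # then pays its own cold start before doing any work.
        ready = max((finish(dep) for dep in task_graph[task]), default=0)
        return ready + cold_start_ms + work_ms

    return max(finish(task) for task in task_graph)


warm = completion_ms(DEPENDS_ON, work_ms=20, cold_start_ms=0)
cold = completion_ms(DEPENDS_ON, work_ms=20, cold_start_ms=200)

print(warm, cold)
```

With these figures the warm run finishes in 80 ms and the cold run in 880 ms. A single call would have paid the 200 ms once; the four-step chain pays it four times, because each step cannot begin until the one before it has woken up and finished.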

Each failure has its own engineering fix, but the root cause is shared: an architecture designed for order, meeting a workload that is chaotic by design. The bottleneck becomes coordination — routing, locking, state management, policy enforcement across massive parallel execution — and coordination doesn't scale by adding hardware. The systems themselves have to change.

This has happened before

The cloud transition replaced assumptions, one layer at a time. This one is replacing every layer at once.

The shift from on-premises computing to the cloud followed the same shape. The old stack worked exactly as designed; the world it described stopped existing. File systems built for machines you owned gave way to object storage billed by the gigabyte. Hand-provisioned servers gave way to virtualization and APIs that could summon capacity in seconds. Capital expenditure gave way to metered consumption. Each replacement looked incremental on its own. Together they were a new architecture, and the companies that built around the new assumptions, AWS among them, became the most consequential infrastructure businesses of their era.

Fig. 3 — Three stacks, three broken assumptions
On-premises (1980s – 2000s)
The assumption that broke: Compute is something you own. Capacity is bought ahead of demand and lives in your building.

Cloud (2006 – present)
The assumption that broke: Traffic comes from people. Load is predictable, low-concurrency, and arrives at human speed.

Agent-native (emerging now)
The assumption being rebuilt: Work is orderly. The next stack treats recursive, massively parallel demand as the default state.
Each transition replaced the organizing assumption of the era before it. The companies that built around the new assumption defined the layer.

The same replacement is underway now, with one meaningful difference: prior transitions rebuilt one layer at a time, each on its own cycle, and this one is rebuilding every layer at once. The first response to the generative AI boom was correctly aimed at the constraint of its moment. Access to compute at the lowest total cost had become a determining factor for the success of AI companies, and capital moved accordingly, into GPU clusters, data centers, and the power to run them. That buildout solved the training problem. The agent problem lives somewhere else entirely: in the control plane, in whether routing, state, and policy can hold together when everything runs at once.

What gets built next

The replacements are specific, and the builders are already deep in production.

Building for agents means re-architecting that control plane. In practice that means databases that reconcile instead of lock, API gateways that recognize an agent pursuing a goal instead of rate-limiting it as an attacker, monitoring that counts completed tasks rather than requests per second, and schedulers that expect a goal to spawn its own sub-tasks. The systems that define this era will treat the thundering herd as the default mode: cold starts that don't penalize, latency variance that doesn't compound, concurrency limits that don't become ceilings.
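One way to picture a gateway that recognizes a goal rather than a flood is to meter calls against a budget attached to the goal instead of a rate per second. The Python sketch below is one possible shape for that idea, not a description of how any shipping gateway works. The class `GoalBudgetGate`, the goal identifiers, and the 10,000-call budget are all hypothetical.

```python
from collections import defaultdict


class GoalBudgetGate:
    """Admits calls against a per-goal budget, regardless of how fast they arrive."""

    def __init__(self, calls_per_goal: int):
        self.calls_per_goal = calls_per_goal  # hypothetical budget
        self.calls_used = defaultdict(int)

    def admit(self, goal_id: str) -> bool:
        # Burst speed is irrelevant here; only the goal's total spend matters.
        if self.calls_used[goal_id] >= self.calls_per_goal:
            return False
        self.calls_used[goal_id] += 1
        return True


gate = GoalBudgetGate(calls_per_goal=10_000)

# A legitimate goal that fans out into 5,000 calls all at once.
refactor_admitted = sum(gate.admit("refactor-codebase") for _ in range(5_000))

# A runaway goal that tries to make 12,000 calls.
runaway_admitted = sum(gate.admit("runaway-goal") for _ in range(12_000))

print(refactor_admitted, runaway_admitted)
```

Here the 5,000-call burst passes untouched, and the runaway goal is capped at its 10,000-call budget. Policy is still enforced, but the unit of enforcement is the work being done rather than the pace at which it arrives.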

Fig. 4 — Human-speed vs. agent-speed traffic
Legend: human-speed traffic (predictable, low concurrency); agent-speed traffic (recursive, bursty, massive). Horizontal axis: time. Reference line: designed capacity.
Human traffic stays beneath the capacity the system was provisioned for. Agent traffic exceeds it in recursive bursts that arrive without warning. The next generation of infrastructure treats the spikes as the default state, not the exception.

History suggests how this resolves, and how the returns distribute. Every prior full-stack transition produced a small number of companies that defined the new layer and compounded for decades, and in every case the ownership that mattered was taken early, while the architecture was still unsettled and the eventual winners were indistinguishable from the noise. By the time the cloud transition was obvious, AWS belonged to public-market shareholders. The venture returns had been captured years before, by investors close enough to the builders to see the new assumptions forming.

The same window is open now. The companies that will define the agent-native stack are being built today, by teams solving these failures in production, and they are visible mainly to those working alongside them.

These shifts tend to become obvious right around the time they are no longer early.