
The multi-agent architecture that actually ships

A practical breakdown of multi-agent systems that actually ship: a three-role architecture (orchestrator, workers, validators), validation contracts written before any code, serial execution over naive parallelism, and why the real bottleneck is human attention, not model intelligence.

Key takeaways

The bottleneck in software engineering is human attention, not model intelligence.
Production multi-agent systems need three roles: orchestrator, workers and validators. Write the validation contract before any code.
Serial execution beats naive parallelism for software work. The right model in each role is a compounding advantage.

The bottleneck is attention, not intelligence

Here is the claim that reframes the whole problem. The bottleneck in software engineering is no longer intelligence. It is human attention.

Even strong engineers can only drive a few tasks forward at once. Every commit needs review. Every feature needs supervision. Models are already smart enough to work through a backlog of fifty items. There is just not enough human bandwidth to watch them do it.

The shift is subtle but important. A human decides what to build. A system works out how. You approve the plan, walk away for hours or days, and come back to finished work.

That frames everything else here. Multi-agent is not about looking impressive in a demo. It is about running work asynchronously at a scale one agent cannot sustain.

Five building blocks for any multi-agent system

Multi-agent research is messy. Every framework has its own terminology. Strip it back and most production systems are really built from five strategies.

Delegation. One agent spawns another with a sub-task and waits for the result. Sub-agents in coding tools are the most common example. Simple, but limited on its own.

Creator-verifier. One agent builds, a separate agent checks. The verifier has fresh context and no bias toward the implementation. Human code review works for the same reason.

Direct communication. Agents talk to each other with no central coordinator, like sending DMs. Hard to get right, because state fragments and there is no single source of truth.

Negotiation. Agents coordinate over a shared resource, such as the same API or the same part of the codebase. It works best when the trade is win-win, not adversarial.

Broadcast. One agent sends context to many. Status updates, shared constraints, new requirements. Less flashy, but critical for coherence over long runs.

Next time someone describes their agent setup, map it to these five. It quickly shows what is actually happening under the hood.
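One way to make that mapping concrete is to tag each component of a setup with the blocks it uses. This is a minimal sketch: the `Block` enum mirrors the five strategies above, while the `setup` dictionary and both helper functions are hypothetical names invented for illustration.

```python
from enum import Enum


class Block(Enum):
    DELEGATION = "delegation"
    CREATOR_VERIFIER = "creator-verifier"
    DIRECT_COMMUNICATION = "direct communication"
    NEGOTIATION = "negotiation"
    BROADCAST = "broadcast"


# Hypothetical description of someone's setup: component -> blocks it uses.
setup = {
    "planner spawns coding sub-agents": {Block.DELEGATION},
    "reviewer agent checks each change": {Block.CREATOR_VERIFIER},
    "shared constraints pushed to all agents": {Block.BROADCAST},
}


def blocks_in_use(components):
    """Return the set of building blocks a setup relies on."""
    used = set()
    for blocks in components.values():
        used |= blocks
    return used


def blocks_missing(components):
    """Return the building blocks the setup does not use at all."""
    return set(Block) - blocks_in_use(components)
```

Here `blocks_missing(setup)` returns direct communication and negotiation, which tells you this setup routes everything through a coordinator.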

A mission combines four of the five

The architecture worth building combines four of these into one long-running workflow. Call it a mission.

You describe a goal. You scope it through conversation. You approve a plan. The system then executes for hours or days while you focus on something else.

A mission is not a single chat session. It is an ecosystem of agents communicating through structured handoffs and shared state.

Runs of this kind can last well over a week. That only works because of the structure, not because the model got smarter mid-run.

The three-role architecture

Every mission uses three distinct roles. Each starts with clean context. Each has one job.

Orchestrator handles planning. It is your sounding board. It asks the strategic questions, surfaces unclear requirements, and produces a plan of features, milestones and a validation contract.

Workers handle implementation. When a feature is assigned, the worker starts with no accumulated baggage. It reads the spec, implements, commits to Git, and leaves a clean slate for the next worker.

Validators handle verification. They do more than run lint and tests. They check that the feature actually works end to end. That is what lets a mission run for days without drifting.

If you are explaining multi-agent to a client or a colleague, this three-role split is the simplest model that still reflects production reality.
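The "clean context" rule can be sketched as data. Every role invocation gets a freshly built context containing only what that role needs. All names here (`Feature`, `Plan`, `RoleContext`, `fresh_context`) are assumptions made for illustration, not the types of any particular framework.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Feature:
    name: str
    spec: str
    assertion_ids: tuple  # contract assertions this feature must satisfy


@dataclass(frozen=True)
class Plan:
    features: tuple
    milestones: tuple  # each milestone is a tuple of feature names


@dataclass
class RoleContext:
    """Everything a role sees. Built fresh for each invocation."""
    role: str
    inputs: dict = field(default_factory=dict)


def fresh_context(role, **inputs):
    # A new object per call: nothing carries over from earlier agents.
    return RoleContext(role=role, inputs=dict(inputs))


login = Feature("login", "Users can sign in with email", ("AUTH-1", "AUTH-2"))
plan = Plan(features=(login,), milestones=(("login",),))

# The worker gets the spec; the validator gets the assertions to check.
worker_ctx = fresh_context("worker", feature=login)
validator_ctx = fresh_context("validator", assertion_ids=login.assertion_ids)
```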

Why tests written after the code lie to you

Most coding agents validate by running tests the agent wrote after building the feature. The tests pass. Coverage looks healthy. But the tests were shaped by the code, not by what the code was meant to do.

Tests written after implementation do not catch bugs. They confirm decisions.

A mission fixes this with a validation contract, written during planning, before any code. It defines correctness independently of the implementation. On a complex project this can be hundreds of assertions, and every feature is assigned the assertions it must satisfy.
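A contract of this kind can be as plain as a table of assertions plus a mapping from features to the assertions they own. The sketch below assumes that shape. The assertion IDs, the wording of each statement and the helper names are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assertion:
    id: str
    statement: str  # observable behaviour, phrased without reference to code


# Written during planning, before any implementation exists.
contract = {
    a.id: a
    for a in [
        Assertion("AUTH-1", "A user with valid credentials reaches the home screen"),
        Assertion("AUTH-2", "A wrong password shows an error and does not sign in"),
        Assertion("MSG-1", "A sent message appears in the channel for other members"),
    ]
}

# Each feature is assigned the assertions it must satisfy.
assignments = {
    "login": ["AUTH-1", "AUTH-2"],
    "messaging": ["MSG-1"],
}


def unassigned(contract, assignments):
    """Assertions that no feature is responsible for: a gap in the plan."""
    covered = {aid for ids in assignments.values() for aid in ids}
    return sorted(set(contract) - covered)


def unknown(contract, assignments):
    """Assigned IDs that do not exist in the contract."""
    assigned = {aid for ids in assignments.values() for aid in ids}
    return sorted(assigned - set(contract))
```

Checking `unassigned` and `unknown` at planning time is cheap, and it catches a broken plan before any worker starts.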

After each milestone, two validators run.

Scrutiny validator. The traditional checks. Test suite, type checking, lint, plus dedicated code review agents for each completed feature.

User testing validator. This one acts like a QA engineer. It spawns the application, interacts through computer use, fills forms, clicks buttons, and checks that the flows work.

Neither validator has seen the code before, so validation is adversarial by design. Much of the wall-clock time is spent here, waiting on real execution rather than generating tokens.
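As a rough sketch, the two validators can share one runner that executes every check and records a finding for each. The checks here are stub callables standing in for a real test suite or a computer-use session. `Finding`, `run_validator` and `validate_milestone` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    validator: str
    check: str
    passed: bool
    detail: str = ""


def run_validator(name, checks):
    """Run every check and record the outcome; do not stop at the first failure."""
    findings = []
    for check_name, check in checks.items():
        try:
            ok, detail = check()
        except Exception as exc:  # a crashing check counts as a failure
            ok, detail = False, f"{type(exc).__name__}: {exc}"
        findings.append(Finding(name, check_name, ok, detail))
    return findings


def validate_milestone(scrutiny_checks, user_testing_checks):
    """Run both validators and report whether the milestone passed."""
    findings = run_validator("scrutiny", scrutiny_checks)
    findings += run_validator("user_testing", user_testing_checks)
    return findings, all(f.passed for f in findings)


# Stub checks; in practice these would shell out or drive the running app.
scrutiny = {"test_suite": lambda: (True, ""), "lint": lambda: (True, "")}
user_testing = {"sign_in_flow": lambda: (False, "submit button did nothing")}

findings, passed = validate_milestone(scrutiny, user_testing)
```

In this example the static checks pass and the milestone still fails, because the simulated user flow does not work.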

Structured handoffs keep context from drifting

When a worker finishes, it does not just say it is done. It fills out a structured handoff.

The handoff records what was completed, what was left undone, which commands were run and their exit codes, what issues were found, and whether the worker followed the orchestrator's procedures.

Errors get caught at milestone boundaries. Corrective work gets scoped. The mission pulls itself back on track, not by hoping agents remember, but by forcing them to write it down and then act on it.
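A handoff like this maps naturally onto a small record type. The sketch below assumes one field per item listed above and adds a hypothetical `corrective_work` helper that turns a handoff into follow-up items at the milestone boundary. The field names and sample values are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommandRun:
    command: str
    exit_code: int


@dataclass(frozen=True)
class Handoff:
    feature: str
    completed: list
    left_undone: list
    commands: list  # CommandRun entries
    issues: list
    followed_procedures: bool


def corrective_work(handoff):
    """Turn a handoff into follow-up items to scope at the milestone boundary."""
    items = [f"finish: {task}" for task in handoff.left_undone]
    items += [f"fix: {issue}" for issue in handoff.issues]
    items += [
        f"investigate failed command: {run.command}"
        for run in handoff.commands
        if run.exit_code != 0
    ]
    if not handoff.followed_procedures:
        items.append("re-run feature following the orchestrator's procedures")
    return items


handoff = Handoff(
    feature="login",
    completed=["email sign-in form"],
    left_undone=["password reset"],
    commands=[CommandRun("run test suite", 1), CommandRun("run linter", 0)],
    issues=["session cookie not marked secure"],
    followed_procedures=True,
)
```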

This is the connective tissue most hobby setups miss. Without it, long runs degrade quietly until the output is wrong.

Why serial execution beats parallel

The obvious move is parallelism. Ten agents, ten times the throughput. For software work, it does not hold up.

Agents conflict. They step on each other's changes. They duplicate work. They make inconsistent architectural decisions. The coordination overhead eats the speed gain while you burn tokens.

So a mission runs features serially. Only one worker or validator runs at a time. Inside a feature, read-only work is parallelised, such as codebase search and API research. Inside validators, code review is parallelised.
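The shape of that schedule is a serial loop with a parallel fan-out only around read-only steps. This is a minimal sketch using a thread pool from the standard library. The task functions, `implement` and the worker count of four are placeholders, not a real agent API.

```python
from concurrent.futures import ThreadPoolExecutor


def research(feature, tasks):
    """Read-only work (search, API lookup) is safe to run in parallel."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        # map() returns results in the same order as the tasks.
        return list(pool.map(lambda task: task(feature), tasks))


def run_mission(features, read_only_tasks, implement):
    """Features run strictly one at a time; only research fans out."""
    log = []
    for feature in features:
        notes = research(feature, read_only_tasks)
        log.append(implement(feature, notes))  # the only step that writes
    return log


# Placeholder tasks standing in for real agent calls.
def search_codebase(feature):
    return f"files related to {feature}"


def read_api_docs(feature):
    return f"API notes for {feature}"


def implement(feature, notes):
    return f"{feature}: implemented using {len(notes)} notes"


log = run_mission(["login", "messaging"], [search_codebase, read_api_docs], implement)
```

Because only `implement` writes, two workers never touch the repository at the same time.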

It looks slower on paper. The error rate drops sharply, and over a multi-day run that correctness compounds.

Mission control and model whispering

A normal chat interface does not work for a job that lasts days. You need a mission control view: progress, budget burned, what the active worker is doing, the handoff summaries, and what the validators found.

You can sit over it like a project manager, or you can close the laptop and leave. The point is that it runs without you.
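The data behind such a view is small. A hedged sketch of one status snapshot and a plain-text rendering follows. The field names, the budget units and the sample values are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MissionStatus:
    features_done: int
    features_total: int
    budget_spent: float
    budget_limit: float
    active_worker: str
    latest_handoff: str
    validator_findings: tuple


def render(status):
    """One plain-text status panel, suitable for a terminal or a log."""
    pct = 100 * status.features_done // status.features_total
    lines = [
        f"progress: {status.features_done}/{status.features_total} ({pct}%)",
        f"budget:   {status.budget_spent:.2f} of {status.budget_limit:.2f}",
        f"active:   {status.active_worker}",
        f"handoff:  {status.latest_handoff}",
    ]
    lines += [f"finding:  {f}" for f in status.validator_findings]
    return "\n".join(lines)


status = MissionStatus(
    features_done=3,
    features_total=12,
    budget_spent=41.5,
    budget_limit=200.0,
    active_worker="worker on 'messaging'",
    latest_handoff="login done, password reset left undone",
    validator_findings=("sign-in flow fails on empty password",),
)
panel = render(status)
```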

The second discipline is what we call model whispering. You build a mental model of how different models behave, where they fail, and how those failures compound over a long run. Then you choose which model sits in which seat.

Planning rewards slow, careful reasoning. Implementation rewards fast code fluency. Validation rewards precise instruction following. No single model is best at all three.

Using a different provider for validation also reduces shared training-data bias. As models specialise, putting the right model in the right seat becomes a compounding advantage.
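In configuration terms, this is a mapping from role to model, plus a check that the validator does not share a provider with the worker. The provider and model names below are deliberate placeholders, not recommendations, and `Seat` and `validator_is_independent` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Seat:
    provider: str
    model: str
    reason: str


# Placeholder names only; substitute whatever models you have evaluated.
seats = {
    "orchestrator": Seat("provider-a", "slow-reasoning-model", "careful planning"),
    "worker": Seat("provider-a", "fast-coding-model", "code fluency"),
    "validator": Seat("provider-b", "instruction-following-model", "precise checks"),
}


def validator_is_independent(seats):
    """True when validation runs on a different provider from implementation."""
    return seats["validator"].provider != seats["worker"].provider
```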

What this looks like in production

Run a build like this and the shape is consistent. Take a chat app on the scale of a Slack clone.

Roughly sixty percent of the time and sixty percent of the tokens go on implementation. Validation almost never passes first time, so follow-up features are routine. That is the QA loop earning its keep. Around half of the final lines of code end up as tests, with most of the codebase covered.

Prompt caching does a lot of the heavy lifting to offset the cost of such long runs.

The strongest use cases are in the enterprise. Prototyping features overnight, building internal tools faster, running large refactors and migrations, research work, and modernising old codebases so agents work better in them later.

Designing systems that improve with each model

Every multi-agent builder has the same fear. The next model release makes their architecture obsolete overnight.

The answer is to keep almost all of the orchestration logic in prompts and skills, not in a hard-coded state machine. A few hundred lines of text can define how features are decomposed and how failures are handled. A handful of sentences can change the execution strategy completely.

Worker behaviour is driven by skills the orchestrator writes per mission. The only fixed logic is thin bookkeeping: running validation, and blocking progress when a handoff issue has not been addressed.
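To show how thin that fixed layer can be, here is a sketch of a runner that makes no decisions of its own. It runs features, runs validation, and stops when a handoff reports an open issue. The callables `run_feature` and `validate` stand in for prompt-driven agents. The `max_rounds` limit and the use of an exception to block are assumptions of this sketch.

```python
def run_milestones(milestones, run_feature, validate, max_rounds=3):
    """Fixed logic only: run features, run validation, block on open issues.

    run_feature(name) -> list of open handoff issues (empty when clean)
    validate(milestone) -> list of follow-up feature names (empty when passed)
    """
    for milestone in milestones:
        pending = list(milestone)
        for _ in range(max_rounds):
            for feature in pending:
                issues = run_feature(feature)
                if issues:
                    # Block progress until the issue is addressed.
                    raise RuntimeError(f"blocked on {feature}: {issues}")
            pending = validate(milestone)
            if not pending:
                break  # milestone passed validation
        else:
            raise RuntimeError(f"milestone {milestone} did not pass validation")
    return "mission complete"


# Stub agents: validation fails once, then passes after a follow-up feature.
attempts = {"count": 0}


def run_feature(name):
    return []


def validate(milestone):
    attempts["count"] += 1
    return ["fix-login-redirect"] if attempts["count"] == 1 else []


result = run_milestones([["login"]], run_feature, validate)
```

Everything about how features are decomposed or repaired lives behind the two callables, so changing strategy means changing prompts rather than this loop.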

The structure provides the discipline. The models provide the intelligence. That structure can even carry models below frontier level, because the validation contracts and milestone checkpoints keep them honest. It runs on open-weight models too.

The codebase ends up cleaner than when you started. The tests, the skills and the structure mean both agents and humans are more productive in it afterwards.

How to explain this to your team

If you need the one-minute version for a colleague:

"We are not limited by how clever the AI is. We are limited by how much we can supervise. A mission lets you approve a plan, then a structured team runs for days. An orchestrator plans, workers build one feature at a time, and validators check that it actually works before anything moves on."

If you need a checklist for your own setup:

Start with the workflow, not the tool. Define what done means before any code, in a validation contract. Keep the builder and the checker as separate agents. Force a written handoff at every boundary. Run serially unless the work is read-only. Put the right model in each role.

And the question to pressure-test any design: which of the five building blocks does each part use, where is the connective tissue, and what happens when validation fails on day three?

The starting move is simple. Describe a goal, argue with the orchestrator about the scope, approve the plan, then go and do something else. That is the operating model worth copying.
