When should I use multiple agents instead of one?

Use multiple agents when the task naturally splits into specialist roles (researcher, writer, critic), when one agent's context window is too small for the whole job, or when you need different system prompts / temperatures per stage. Don't multiply agents for a task one focused agent could do.

What's the simplest multi-agent pattern?

Supervisor + workers. A supervisor routes tasks to specialist workers and aggregates results. Two-level hierarchy. Easy to debug. Most production multi-agent systems are some variant of this.

How do I evaluate a multi-agent system?

End-to-end task success on a held-out eval set is the only metric that matters in the end. Per-agent metrics (precision of routing, quality of each step) help debug. The eval set is the source of truth — see the LLM evaluations post.

Multi-Agent Systems in 2026 — Production Patterns That Work

In 2026 the agentic AI conversation moved past “build a single agent” to “build a team of agents.” Gartner reports a 1,400% surge in multi-agent system inquiries. But most multi-agent systems that ship are over-engineered for their use case. This post covers the patterns that work and their honest tradeoffs.

If you haven’t built single-agent systems yet, start with AI Agents with LangGraph in 2026 — A Practical Tutorial.

When multi-agent earns the cost

Real reasons to use multi-agent:

Specialization. A researcher agent has tool access and prompts optimized for finding info. A writer agent has prompts optimized for tone and structure. Different roles, different system prompts, different temperatures.
Context budget. If the whole task doesn’t fit in one agent’s context, split it across agents that hand off summaries.
Trust boundary. A critic agent reviews a writer agent’s output without seeing the writer’s chain-of-thought, which reduces the “I just wrote it, it must be right” bias.
Parallelism. Three workers can search three sources simultaneously. One agent serializes.
Different models. Use Opus for hard reasoning, Haiku for cheap classification, Sonnet for general work. Multi-agent lets each role pick the right model.
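
To make the parallelism point concrete, here is a minimal sketch using only the standard library. The `search_source` worker and the source names are hypothetical stand-ins for real agent calls; the only thing the sketch demonstrates is that independent workers can run concurrently instead of one after another.

```python
import asyncio


async def search_source(source: str, query: str) -> str:
    """Hypothetical worker: stands in for an agent that searches one source."""
    await asyncio.sleep(0.01)  # placeholder for a real LLM or tool call
    return f"{source}: results for {query!r}"


async def fan_out(query: str, sources: list[str]) -> list[str]:
    # gather() runs the workers concurrently and returns results in input order
    return await asyncio.gather(*(search_source(s, query) for s in sources))


results = asyncio.run(fan_out("pgvector indexing", ["web", "docs", "tickets"]))
```

With three workers that each wait on I/O, total wall time is roughly that of the slowest worker rather than the sum of all three.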

Bad reasons:

“It sounds more sophisticated.”
“We saw a paper with 7 agents.”
“We want to fan out for fun.”

A single well-prompted agent often beats a poorly orchestrated swarm. Default to single. Add agents when there’s a concrete reason.

Pattern 1 — Supervisor / Worker

The most useful production pattern. A supervisor routes work to workers and aggregates their results.

            ┌──────────────┐
            │  Supervisor  │
            └─────┬────────┘
       ┌──────────┼──────────┐
       ▼          ▼          ▼
   ┌─────┐    ┌─────┐    ┌─────┐
   │ Web │    │ DB  │    │Code │
   │Worker│   │Worker│   │Worker│
   └─────┘    └─────┘    └─────┘

The supervisor’s prompt: “You receive a user request. Decide which workers to call. After workers respond, decide whether to call more or return.”

Workers have narrow toolsets. They do one thing well.

from langgraph.graph import StateGraph, END

# Sketch, not a full setup: AgentState, supervisor_node, the three worker
# functions, and route_to_worker are assumed to be defined elsewhere.
# See the LangGraph docs for the complete wiring.
graph = StateGraph(AgentState)

graph.add_node("supervisor", supervisor_node)        # decides routing
graph.add_node("web_worker", web_worker)             # tools: search, fetch
graph.add_node("db_worker", db_worker)               # tools: query, schema
graph.add_node("code_worker", code_worker)           # tools: read_file, run

graph.set_entry_point("supervisor")
# route_to_worker(state) returns a worker node name, or END to stop
graph.add_conditional_edges("supervisor", route_to_worker)
# every worker reports back to the supervisor
graph.add_edge("web_worker", "supervisor")
graph.add_edge("db_worker", "supervisor")
graph.add_edge("code_worker", "supervisor")

app = graph.compile()
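
The graph above references `route_to_worker` without defining it. One possible shape is sketched below as plain Python so it runs without LangGraph installed. The state keys (`next`, `steps`), the `MAX_STEPS` cap, and the string used for `END` are assumptions for illustration, not part of the original setup.

```python
END = "__end__"  # plain string standing in for LangGraph's END constant
MAX_STEPS = 8    # assumed cap on supervisor turns

WORKERS = {"web_worker", "db_worker", "code_worker"}


def route_to_worker(state: dict) -> str:
    """Return the next node name, or END when the run should stop."""
    if state.get("steps", 0) >= MAX_STEPS:
        return END  # hard stop, even if the supervisor wants to continue
    choice = state.get("next")  # assumed to be written by supervisor_node
    if choice in WORKERS:
        return choice
    return END  # "done", a missing value, or an unknown worker all terminate
```

Treating anything unrecognized as "stop" is a deliberate choice here: a supervisor that hallucinates a worker name ends the run instead of crashing the graph.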

Why this works in production:

Easy to debug. Trace shows: supervisor said X, worker did Y, supervisor said Z.
Easy to extend. New worker = one node + one routing rule.
Bounded context. Supervisor sees summaries; workers see details.

This is the shape I reach for first.

Pattern 2 — Writer / Reviewer

Two agents in a feedback loop:

Writer drafts → Reviewer critiques → Writer revises → Reviewer approves

Mechanically:

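from temporalio import workflow

# write, review, and revise are assumed async helpers defined elsewhere
# (in a real Temporal workflow they would wrap workflow.execute_activity calls).

@workflow.defn
class WriterReviewer:
    @workflow.run
    async def run(self, brief: str) -> str:
        draft = await write(brief)
        for _ in range(3):
            # Store the result as `verdict`: assigning to `review` would shadow
            # the review() helper and raise UnboundLocalError on this call.
            verdict = await review(brief, draft)
            if verdict.approved:
                return draft
            draft = await revise(brief, draft, verdict.notes)
        return draft   # ship after 3 rounds even if not approved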

The reviewer doesn’t see the writer’s chain-of-thought. It sees only the final draft. That’s the value: an independent critique with no anchor.
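
One way to enforce that isolation is to build the reviewer's input from a narrow function that never touches the writer's reasoning. This is a minimal sketch: the `WriterOutput` and `Review` shapes and the `reviewer_input` helper are hypothetical names chosen to match the `approved` and `notes` fields used in the loop above.

```python
from dataclasses import dataclass


@dataclass
class WriterOutput:
    draft: str
    reasoning: str  # writer's scratch work; never forwarded to the reviewer


@dataclass
class Review:
    approved: bool
    notes: str


def reviewer_input(brief: str, out: WriterOutput) -> str:
    # Only the brief and the final draft cross the boundary
    return f"Brief:\n{brief}\n\nDraft to review:\n{out.draft}"
```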

Used heavily for:

Code review (one agent writes, another reviews).
Long-form writing (drafts + edits).
Decision-making (one agent proposes, another challenges).

For coding specifically, see Claude Code Skills and Agentic Coding Patterns.

Pattern 3 — Hierarchical

Bigger problems get a tree:

                Root supervisor
                 │
        ┌────────┼────────┐
        ▼                 ▼
   Research lead    Writing lead
     │                  │
   ┌─┴─┐            ┌──┴──┐
   ▼   ▼            ▼     ▼
   Web  DB        Drafter  Editor

Each non-leaf node is itself a supervisor. The root delegates broad goals; intermediate supervisors break them down; leaf workers do.

Useful when:

Tasks have clear sub-domains.
You want bounded context at every level (root sees only the top-level summary).
You’re orchestrating across teams of agents that don’t share context.

Risk: complexity explodes. Three layers can be debuggable. Five usually aren’t.
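
The tree can be sketched as a small recursive structure. Everything here is illustrative: the `Node` class, the string "summaries", and the `depth` helper are assumptions, and a real leaf would call an agent rather than format a string. The point is that each supervisor passes only a short summary upward, and that depth is cheap to measure before it gets out of hand.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    children: list["Node"] = field(default_factory=list)

    def run(self, goal: str) -> str:
        if not self.children:
            # Leaf worker: placeholder for the real agent call
            return f"{self.name} did: {goal}"
        # Supervisor: delegate, then pass only a short summary upward
        reports = [child.run(f"{goal} / {child.name}") for child in self.children]
        return f"{self.name} summary of {len(reports)} reports"


def depth(node: Node) -> int:
    """Number of layers in the tree, counting the root."""
    return 1 + max((depth(c) for c in node.children), default=0)


tree = Node("root", [
    Node("research_lead", [Node("web"), Node("db")]),
    Node("writing_lead", [Node("drafter"), Node("editor")]),
])
```

A check such as `depth(tree) <= 3` at build time is one simple way to keep the hierarchy within the range that stays debuggable.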

Pattern 4 — Swarm / Network

Agents interact peer-to-peer, no fixed hierarchy:

   Agent A ←→ Agent B
        ↘ ↙
        Agent C

Each agent can pass control to any other based on the situation. OpenAI’s Swarm, CrewAI’s collaborative crew, and LangGraph’s Send API all support this.

Use sparingly. Swarms are exciting in research; in production they’re hard to debug, hard to evaluate, hard to bound.
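
If you do run a swarm, bounding the handoffs is the first guardrail. The sketch below is framework-free and hypothetical: each agent is a plain function that returns its updated result plus the name of the next agent (or `None` to finish), and `run_swarm` enforces a hop limit and rejects handoffs to unknown peers.

```python
from typing import Callable, Optional

# Each agent returns (updated_task, name_of_next_agent or None to stop).
Agent = Callable[[str], tuple[str, Optional[str]]]


def run_swarm(
    agents: dict[str, Agent], start: str, task: str, max_hops: int = 6
) -> tuple[str, list[str]]:
    current, trace = start, []
    for _ in range(max_hops):
        if current not in agents:
            raise KeyError(f"handoff to unknown agent: {current}")
        trace.append(current)
        task, nxt = agents[current](task)
        if nxt is None:
            return task, trace  # an agent declared the task finished
        current = nxt
    raise RuntimeError(f"no agent finished within {max_hops} hops: {trace}")


agents = {
    "a": lambda t: (t + " >a", "b"),
    "b": lambda t: (t + " >b", "c"),
    "c": lambda t: (t + " >c", None),
}
result, trace = run_swarm(agents, "a", "task")
```

The returned `trace` gives you at least the order of handoffs, which is the minimum you need to debug a peer-to-peer run.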

Pattern 5 — Pipeline / Chain

Agents in series, each handing off:

extract → validate → enrich → summarize → respond

This isn’t really “multi-agent” — it’s a deterministic pipeline of LLM calls with different prompts. But it’s how 70% of “multi-agent” systems actually behave in practice. Honest framing helps you reason about it.
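
Seen as code, a pipeline is just function composition. In this sketch each stage is a stub where a real system would make one LLM call with its own prompt; the stage names are a subset of the chain above and the dictionary keys are assumptions.

```python
from functools import reduce
from typing import Callable

Stage = Callable[[dict], dict]


def extract(d: dict) -> dict:
    return {**d, "words": d["text"].split()}


def validate(d: dict) -> dict:
    if not d["words"]:
        raise ValueError("empty input")
    return d


def summarize(d: dict) -> dict:
    return {**d, "summary": " ".join(d["words"][:3])}


PIPELINE: list[Stage] = [extract, validate, summarize]


def run_pipeline(d: dict) -> dict:
    # Fixed order, no routing decisions: each stage feeds the next
    return reduce(lambda acc, stage: stage(acc), PIPELINE, d)


out = run_pipeline({"text": "multi agent systems in production"})
```

There is no routing and no loop, which is why this shape is easy to test stage by stage.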

Communication models

How do agents talk to each other? Three options:

1. Shared state (LangGraph default)

A typed state object passes between nodes. Each agent reads what it needs, writes what it changes.

Pros: Type-safe, debuggable.
Cons: Tight coupling; all agents must agree on the shape.
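
A framework-free sketch of the shared-state idea, assuming a hypothetical `TeamState` shape: each agent reads the keys it needs and returns only the keys it changes, and a small `merge` helper folds the update into the next state. LangGraph does the merging for you; the manual version here only illustrates the contract.

```python
from typing import TypedDict


class TeamState(TypedDict):
    request: str
    findings: list[str]
    draft: str


def researcher(state: TeamState) -> dict:
    # Reads `request`, returns only the key it changes
    return {"findings": state["findings"] + [f"fact about {state['request']}"]}


def writer(state: TeamState) -> dict:
    return {"draft": "; ".join(state["findings"])}


def merge(state: TeamState, update: dict) -> TeamState:
    return {**state, **update}  # partial update folded into a new state


state: TeamState = {"request": "pgvector", "findings": [], "draft": ""}
for agent in (researcher, writer):
    state = merge(state, agent(state))
```

The coupling cost is visible here: rename `findings` and both agents have to change.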

2. Message passing

Each agent has an inbox; sends messages to others by name. Like Erlang actors.

Pros: Loose coupling; agents can be added/removed.
Cons: Easy to get into deadlock or infinite messaging loops.
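
The looping risk is easy to reproduce and to bound. This sketch simplifies per-agent inboxes to a single queue processed in order; the `Message` shape, the `MAX_HOPS` cap, and the two handlers are assumptions. The two agents would ping-pong forever, and the hop counter on each message is what stops them.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Message:
    to: str
    body: str
    hops: int = 0


MAX_HOPS = 5  # assumed cap; breaks ping-pong loops


def run(handlers: dict, first: Message) -> list[str]:
    queue, log = deque([first]), []
    while queue:
        msg = queue.popleft()
        if msg.hops >= MAX_HOPS:
            log.append(f"dropped message to {msg.to} after {msg.hops} hops")
            continue
        log.append(f"{msg.to} <- {msg.body}")
        # Replies inherit the hop count so chains cannot grow without bound
        for reply in handlers[msg.to](msg):
            queue.append(Message(reply.to, reply.body, msg.hops + 1))
    return log


handlers = {
    "ping": lambda m: [Message("pong", "ping!")],
    "pong": lambda m: [Message("ping", "pong!")],
}
log = run(handlers, Message("ping", "start"))
```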

3. Tool-based

Agents are exposed as tools to a parent. The parent calls them like any other tool.

Pros: Composes naturally with single-agent tool use.
Cons: Parent has to know about all children; centralized.
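
In the tool-based model the parent cannot tell a child agent from an ordinary tool, which is the whole appeal. A minimal sketch, with a hypothetical `research_agent` stub and a made-up `TOOLS` registry:

```python
from typing import Callable


def research_agent(question: str) -> str:
    # Placeholder for a child agent with its own prompt, tools, and loop
    return f"summary for: {question}"


TOOLS: dict[str, Callable[[str], str]] = {
    "research": research_agent,                          # a whole agent behind a tool interface
    "word_count": lambda text: str(len(text.split())),   # an ordinary tool
}


def call_tool(name: str, argument: str) -> str:
    if name not in TOOLS:
        return f"error: unknown tool {name!r}"  # returned to the parent, not raised
    return TOOLS[name](argument)
```

The centralization shows up in `TOOLS`: every child has to be registered with the parent before it can be used.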

For most production systems, shared state with a supervisor (Pattern 1) is the right default.

Memory and context

A multi-agent system that forgets is useless. Three layers:

Per-agent context. What this agent saw in this session.
Shared task state. What the team has agreed on / produced so far.
Long-term memory. What this team has learned across sessions (vector DB / KV store).

A LangGraph checkpointer (for example, the Postgres-backed one) handles the first two layers. Long-term memory is your responsibility: store summaries in pgvector and retrieve them when relevant. See Build a RAG App with pgvector and FastAPI for the retrieval patterns.

Reliability

Multi-agent systems amplify single-agent reliability problems. Each agent can:

Loop forever.
Call tools wrongly.
Hand off to a non-existent peer.
Get stuck waiting for a peer.

Defenses:

Hard step limits. Cap rounds at the supervisor level.
Timeouts on every agent invocation.
Tool-call validation before execution.
Termination conditions in routing — explicit “we’re done” check.
Durable workflows (see Temporal Durable Execution). For agents that run for minutes to hours, surviving worker crashes is non-negotiable.
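
Two of these defenses fit in a few lines of standard-library Python. The sketch assumes a hypothetical `ALLOWED_TOOLS` table mapping tool names to required arguments, and a deliberately slow stub agent; it shows validation before execution and a per-invocation timeout, nothing more.

```python
import asyncio

# Hypothetical registry: tool name -> required argument names
ALLOWED_TOOLS = {"search": {"query"}, "fetch": {"url"}}


def validate_tool_call(name: str, args: dict) -> None:
    """Reject unknown tools and missing arguments before anything runs."""
    if name not in ALLOWED_TOOLS:
        raise ValueError(f"unknown tool: {name}")
    missing = ALLOWED_TOOLS[name] - set(args)
    if missing:
        raise ValueError(f"{name} missing args: {sorted(missing)}")


async def call_with_timeout(agent, payload: dict, seconds: float) -> dict:
    # wait_for cancels the agent coroutine when the deadline passes
    try:
        return await asyncio.wait_for(agent(payload), timeout=seconds)
    except asyncio.TimeoutError:
        return {"status": "timeout"}


async def slow_agent(payload: dict) -> dict:
    await asyncio.sleep(1)  # stands in for a stuck agent
    return {"status": "ok"}


outcome = asyncio.run(call_with_timeout(slow_agent, {}, seconds=0.05))
```

Returning a `timeout` status instead of raising lets the supervisor decide whether to retry, reroute, or give up.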

Evaluation

How do you tell if a multi-agent system is working?

End-to-end task success. The metric. Did the team produce the right outcome?
Per-agent metrics. Routing precision (did the supervisor pick the right worker?), worker quality (did each worker do its job?). Useful for debugging.
Cost / latency budgets. Multi-agent is more expensive than single. Track tokens-per-task, time-per-task.
A/B against single agent. Often the most surprising experiment. Many “multi-agent wins” disappear when you give the single agent the same time and tools.
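
These metrics reduce to simple aggregation over run traces. The sketch below assumes a hypothetical `Trace` record and uses made-up example data; `routing_accuracy` is a simplified stand-in for routing precision (the share of tasks where the supervisor picked the expected worker).

```python
from dataclasses import dataclass


@dataclass
class Trace:
    task_id: str
    success: bool          # did the end-to-end task succeed?
    expected_worker: str   # label from the eval set
    routed_worker: str     # what the supervisor actually chose
    tokens: int


def summarize(traces: list[Trace]) -> dict:
    n = len(traces)
    return {
        "task_success": sum(t.success for t in traces) / n,
        "routing_accuracy": sum(t.routed_worker == t.expected_worker for t in traces) / n,
        "tokens_per_task": sum(t.tokens for t in traces) / n,
    }


# Made-up traces, for illustration only
multi_agent_runs = [
    Trace("t1", True, "web", "web", 9000),
    Trace("t2", True, "db", "db", 7000),
    Trace("t3", False, "code", "web", 12000),
    Trace("t4", True, "db", "db", 8000),
]
report = summarize(multi_agent_runs)
```

Running the same `summarize` over single-agent traces for the same task IDs gives you the A/B comparison in one table.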

For the eval mechanics, see LLM Evaluations — Test Prompts and Agents.

Cost discipline

A naive multi-agent system explodes token usage, because every agent re-reads its context on every round. Things to watch:

Pass summaries, not full transcripts between agents.
Cache stable system prompts (Anthropic prompt caching helps; see Anthropic Claude API Guide).
Use cheaper models for simple roles (Haiku for routing; Sonnet for work; Opus only when needed).
Hard cap on rounds. “Agent must finish within 5 rounds” forces convergence.
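
A hard cap is easiest to enforce when one object owns the budget. This is a minimal sketch with assumed limits; `TaskBudget` and `BudgetExceeded` are hypothetical names, and the token counts would come from your model provider's usage data.

```python
class BudgetExceeded(RuntimeError):
    pass


class TaskBudget:
    """Tracks rounds and tokens for one task and fails loudly past the cap."""

    def __init__(self, max_rounds: int = 5, max_tokens: int = 50_000):
        self.max_rounds = max_rounds
        self.max_tokens = max_tokens
        self.rounds = 0
        self.tokens = 0

    def charge(self, tokens: int) -> None:
        # Call once per agent round, with that round's token usage
        self.rounds += 1
        self.tokens += tokens
        if self.rounds > self.max_rounds:
            raise BudgetExceeded(f"round {self.rounds} exceeds cap of {self.max_rounds}")
        if self.tokens > self.max_tokens:
            raise BudgetExceeded(f"{self.tokens} tokens exceeds cap of {self.max_tokens}")
```

The supervisor calls `budget.charge(...)` after each round and treats `BudgetExceeded` as a signal to return the best result so far.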

Common mistakes

1. Adding agents instead of fixing prompts

If a single agent fails, the first instinct should be: better prompt, more examples, structured output. Multi-agent is often expensive complexity for what was a prompt-engineering problem.

2. No clear termination

A supervisor that never says “done” loops until it hits the step limit. Encode termination explicitly.

3. Workers that are too small

A worker whose entire job is “call this one tool” is just a tool. No need to wrap it as an agent.

4. Workers that are too big

A worker doing 10 different things is just another supervisor. Split.

5. Skipping evaluation

You don’t actually know if your multi-agent system is better than a single agent unless you measured. Build the eval. Compare. Decide based on data.

When to ship single-agent

Default to one well-prompted agent when:

The user’s intent fits in one prompt.
You need fewer than 10 tools.
The reasoning fits in a 200k-token context window.
You want predictable cost / latency.

Go multi-agent when:

The task has obvious specialist roles.
A single agent’s context overflows.
Different stages need different models.
Reviewer-style independence matters.
