Section 30.3: Multi-Agent Patterns & Topologies

This continuation of Section 30.2 picks up after the single-agent libraries and moves to the topologies that combine multiple agents into one system. It catalogues the four multi-agent topologies in production (hierarchical, peer / debate, pipeline, competitive), names the canonical frameworks for each, and tabulates the failure modes you should expect.

One agent is rarely enough. As soon as a task spans multiple skills (research plus writing, code plus review, planning plus execution), a single LLM context starts to fail: prompts get long, role confusion creeps in, the model loses track of which sub-goal it is pursuing. The standard response is to decompose into multiple agents with focused roles, smaller contexts, and a handoff protocol. The remaining question: which topology?

This section catalogs the four multi-agent topologies seen in production during 2024-2026, with the named cases that made each famous. The conceptual taxonomy is in Chapter 28; this is the practitioner's reference for which pattern to pick and which failure modes to expect. Chapter 27 (Tool Use) covers the MCP and function-calling primitives the agents use.

Figure 30.3.1: Four multi-agent topology patterns side by side (hierarchical, peer / debate, pipeline, and competitive best-of-N), which together cover most production systems. Hierarchical (CrewAI-style) for content pipelines; peer / debate (AutoGen) for reasoning; pipeline (LangGraph) for structured handoff; competitive best-of-N for creative or coding tasks. The named framework is a practical way to express each pattern, not the only one.

Big Picture

Four topologies cover most production multi-agent systems: hierarchical (manager dispatches to workers), peer / debate (agents argue toward consensus), pipeline (sequential handoff), and competitive (best-of-N wins). Each maps to a problem class with documented failure modes. The framework choice (LangGraph, CrewAI, AutoGen, MetaGPT) is mostly a question of which topology each one makes ergonomic.

1. Hierarchical: Manager Plus Workers

The hierarchical topology has a manager agent that decomposes the request and dispatches sub-tasks to specialist workers. The manager plans, sequences, and aggregates; workers execute one well-defined skill. CrewAI popularized this through role-based abstractions and it dominates production content-generation systems.

CrewAI defines each worker as an Agent with a role, goal, and backstory; the Crew sequences the tasks, and under the hierarchical process a manager dispatches them. The 2024-2025 deployments at Mintlify and HubSpot use a researcher, writer, editor, and fact-checker chained for blog production. The pattern works because each role's prompt stays under 2k tokens and the manager tracks only the current step, not the full history.

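from crewai import Agent, Task, Crew, Process

# web_search and arxiv are assumed to be tool instances defined elsewhere.
researcher = Agent(role="Researcher", goal="Find authoritative sources",
                   backstory="A meticulous research librarian.", tools=[web_search, arxiv])
writer = Agent(role="Writer", goal="Draft engaging prose",
               backstory="A former magazine staff writer.", tools=[])
editor = Agent(role="Editor", goal="Polish for clarity and voice",
               backstory="A senior copy editor.", tools=[])

crew = Crew(
    agents=[researcher, writer, editor],
    # Each Task states the artifact it must produce via expected_output.
    tasks=[Task(description="Research X",
                expected_output="An annotated list of sources", agent=researcher),
           Task(description="Draft post about X using the research",
                expected_output="A complete first draft", agent=writer),
           Task(description="Edit the draft",
                expected_output="A publication-ready post", agent=editor)],
    process=Process.hierarchical,  # manager dispatches; alt: Process.sequential
    # Hierarchical mode needs a manager model (or a manager_agent).
    # The model name here is a placeholder, not a recommendation.
    manager_llm="gpt-4o",
)
result = crew.kickoff()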

Code Fragment 30.3.1: A hierarchical CrewAI crew. Each worker is an Agent with a role, goal, and backstory; the Crew's manager dispatches the tasks.

Production Pattern: Bound Manager Authority

The manager prompt is the highest-leverage piece of the system. A loose manager ("delegate as you see fit") drifts into infinite re-dispatch; a tight manager ("call each worker exactly once, in order") loses most of the value. The 2025 sweet spot: a small set of allowed actions (dispatch, aggregate, terminate) plus a max-iteration cap. CrewAI's max_iter and LangGraph's recursion limit both enforce this.
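The bounded-authority idea can be sketched without any framework. The following is a minimal illustration, not CrewAI's or LangGraph's implementation: the names run_manager, ManagerState, and greedy_decide are hypothetical, and the decide callable stands in for the LLM-backed manager. It shows the two constraints together: a closed set of actions and a hard iteration cap.

```python
from dataclasses import dataclass, field
from typing import Callable

# The only moves the manager is allowed to make.
ALLOWED_ACTIONS = {"dispatch", "aggregate", "terminate"}


@dataclass
class ManagerState:
    results: list[str] = field(default_factory=list)
    steps: int = 0


def run_manager(
    decide: Callable[[ManagerState], tuple[str, str]],
    workers: dict[str, Callable[[list[str]], str]],
    max_iter: int = 5,
) -> ManagerState:
    """Run a manager loop restricted to three actions and a hard step cap."""
    state = ManagerState()
    while state.steps < max_iter:
        state.steps += 1
        action, arg = decide(state)
        if action not in ALLOWED_ACTIONS:
            raise ValueError(f"manager attempted disallowed action: {action!r}")
        if action == "dispatch":
            # arg names the worker; the worker sees prior results only.
            state.results.append(workers[arg](state.results))
        elif action == "aggregate":
            state.results = ["\n".join(state.results)]
        else:  # terminate
            break
    return state


# Hypothetical stand-in for an LLM manager that never terminates on its own.
def greedy_decide(state: ManagerState) -> tuple[str, str]:
    return ("dispatch", "writer")


workers = {"writer": lambda prior: f"draft {len(prior) + 1}"}
final = run_manager(greedy_decide, workers, max_iter=3)
print(final.steps, final.results)
```

Even with a manager that always re-dispatches, the loop stops after three steps. The cap lives in ordinary code rather than in the prompt, so it holds no matter what the model decides.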

2. Peer / Debate: Consensus Through Disagreement

The peer topology has agents at the same level with deliberately opposing roles (proposer / critic, optimist / pessimist, two experts) exchanging arguments to converge on an answer. Du et al. (2023) "Improving Factuality and Reasoning in Language Models through Multiagent Debate" showed measurable factuality gains on benchmarks where a single agent confabulates. The 2024-2025 follow-ups extended this to code review (proposer suggests fix, second agent searches for bugs).

AutoGen (Microsoft Research; re-architected as v0.4 in 2025) is the framework most explicitly designed around peer conversation. Its GroupChat wires N agents into a shared transcript with a speaker-selection policy (round-robin, LLM-routed, or rule-based). Microsoft's 2024 GitHub Copilot Workspace case study used AutoGen for code review: a proposer suggests a refactor, a tester writes tests, a reviewer looks for regressions, and the loop terminates when no further critiques arise.

Warning: Failure Mode: Cascading Hallucinations

The textbook peer-debate failure: the first agent states something false with confidence; the second, finding it plausible, builds on it instead of challenging it. Park et al. (2023), "Generative Agents", documented this in social simulations; 2024 Anthropic work extended the analysis. Production mitigation: bind at least one agent to an external source of ground truth (search, code execution, database). Pure LLM-on-LLM debate amplifies confident wrongness.
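The mitigation can be illustrated with a small proposer / critic loop in which the critic is grounded in computation rather than in another model's opinion. This is a sketch under stated assumptions: debate, propose, and grounded_critique are hypothetical names, and the scripted proposer stands in for an LLM that is confidently wrong on its first attempt.

```python
from typing import Callable, Optional


def debate(
    propose: Callable[[Optional[str]], str],
    critique: Callable[[str], Optional[str]],
    max_rounds: int = 4,
) -> tuple[Optional[str], list[tuple[str, str]]]:
    """Alternate proposer and critic until the critic raises no objection."""
    feedback: Optional[str] = None
    transcript: list[tuple[str, str]] = []
    for _ in range(max_rounds):
        answer = propose(feedback)
        transcript.append(("proposer", answer))
        feedback = critique(answer)
        if feedback is None:  # no further critiques: accept the answer
            return answer, transcript
        transcript.append(("critic", feedback))
    return None, transcript  # no consensus within the round budget


# Scripted stand-in for an LLM proposer: wrong first, revised after feedback.
def propose(feedback: Optional[str]) -> str:
    return "17 * 23 = 381" if feedback is None else "17 * 23 = 391"


# The critic recomputes the claim instead of judging its plausibility.
def grounded_critique(answer: str) -> Optional[str]:
    lhs, rhs = answer.split("=")
    a, b = (int(x) for x in lhs.split("*"))
    if a * b == int(rhs):
        return None
    return f"computed {a * b}, not {rhs.strip()}"


answer, transcript = debate(propose, grounded_critique)
print(answer)
```

Because the critic executes the arithmetic, the plausible but false first claim cannot survive into the next round. A critic that only judged plausibility would have no such guarantee.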

3. Pipeline: Sequential Specialization

The pipeline topology is the simplest: A's output is B's input, with no branching and no loops. MetaGPT (Hong et al., ICLR 2024) made this famous by modeling a software organization as a sequence of roles (product manager, architect, project manager, engineer, QA), each producing a structured artifact (PRD, design doc, task breakdown, code, test report) as input for the next. The artifacts do the coordinating: B reads what A wrote, not A's reasoning history.

Pipeline fits genuinely linear workflows with clear skill boundaries. MetaGPT v0.8 (2024) generated small open-source projects (Snake game, calculator, blog scaffold) from one-line specs; OpenHands (formerly OpenDevin), a separate project, applies the pattern to autonomous software engineering with tighter shell-and-Git integration. Pipelines also dominate document processing: ingest extracts entities, normalization canonicalizes them, summarization writes briefs, and classification routes the results to reviewers.
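The artifact-handoff idea reduces to a few lines of plain Python. The sketch below is illustrative only and is not MetaGPT's API: Artifact, the three stage functions, and run_pipeline are hypothetical names, and each stage would wrap an LLM call in a real system. The point is that every stage receives only the previous artifact.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Artifact:
    kind: str  # e.g. "spec", "prd", "design", "code"
    body: str


# Each stage maps one artifact to the next; none sees earlier reasoning.
def product_manager(spec: Artifact) -> Artifact:
    return Artifact("prd", f"PRD for: {spec.body}")


def architect(prd: Artifact) -> Artifact:
    return Artifact("design", f"Design derived from the {prd.kind}")


def engineer(design: Artifact) -> Artifact:
    return Artifact("code", f"Code implementing the {design.kind}")


STAGES: list[Callable[[Artifact], Artifact]] = [product_manager, architect, engineer]


def run_pipeline(spec: Artifact, stages=STAGES) -> list[Artifact]:
    """Run the stages in order, keeping every artifact for later inspection."""
    history = [spec]
    for stage in stages:
        history.append(stage(history[-1]))  # only the latest artifact is passed on
    return history


history = run_pipeline(Artifact("spec", "a Snake game"))
print([a.kind for a in history])
```

Keeping the full artifact history makes the failure mode easy to debug: when the final output is wrong, the first bad artifact in the list identifies the stage at fault.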

4. Competitive: Race to the Best Answer

The competitive topology runs N agents on the same task in parallel and picks the best by learned judge, deterministic test, or majority vote. It is the agent equivalent of ML ensembles. The best-known public examples come from reasoning and coding: self-consistency decoding (Wang et al., extended to agent loops in 2024) samples N candidates and takes a majority vote; AlphaCode 2 (Google DeepMind, technical report released December 2023) sampled a large pool of code candidates and filtered them by running tests.

Competitive fits verifiable tasks (run the code, check JSON schema, validate math) where N x cost is justified by quality gain. The 2025 best-of-N inference work at Anthropic and OpenAI, with reasoning models drawing from a budget of intermediate thoughts and a learned verifier, internalizes this pattern into a single model.
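For a verifiable task, sample-and-filter can be sketched as follows. This is a simplified illustration, not AlphaCode 2's pipeline: CANDIDATES is a fixed list standing in for N parallel agent runs, passes_tests and best_of_n are hypothetical names, and the shortest-survivor tie-break is an arbitrary choice. The use of exec is acceptable only because the candidates here are trusted; real systems need a sandbox.

```python
from typing import Callable, Optional

# Stand-ins for the outputs of N independent agent runs on the same task.
CANDIDATES = [
    "def add(a, b): return a - b",  # wrong
    "def add(a, b): return a + b",  # correct, short
    "def add(a, b):\n    total = a\n    total += b\n    return total",  # correct, long
]


def passes_tests(source: str) -> bool:
    """Deterministic verifier: execute the candidate and check known cases."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # trusted input only; sandbox in production
        add = namespace["add"]
        return add(2, 3) == 5 and add(-1, 1) == 0
    except Exception:
        return False


def best_of_n(candidates: list[str], verifier: Callable[[str], bool]) -> Optional[str]:
    """Filter by the verifier, then break ties by preferring the shortest survivor."""
    survivors = [c for c in candidates if verifier(c)]
    return min(survivors, key=len) if survivors else None


winner = best_of_n(CANDIDATES, passes_tests)
print(winner)
```

The verifier, not a model, decides which candidates survive, which is why this topology fits tasks with a cheap and reliable check.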

5. Comparison and Selection

Table 30.3.1: Multi-Agent Topologies: Selection Guide.

Topology	Best For	Failure Mode	Canonical Framework	Cost Profile
Hierarchical	Multi-skill tasks with clear roles	Manager runaway loops	CrewAI, LangGraph	1 LLM call per worker step + 1 per manager decision
Peer / Debate	Factuality, reasoning, code review	Cascading hallucinations	AutoGen	2N+ calls per turn (every agent speaks)
Pipeline	Linear workflows, structured artifacts	Early-stage errors propagate	MetaGPT, OpenHands	1 call per stage; deterministic
Competitive	Verifiable tasks with high cost of failure	N x the budget for marginal gain	AlphaCode-style sample-and-rank	N parallel runs + verifier
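The cost column can be turned into a back-of-envelope estimator. The functions below are a rough sketch that takes the table's figures at face value; the function names and the example workload sizes are assumptions, and the peer / debate figure treats the table's "2N+" as a lower bound per turn.

```python
def hierarchical_calls(worker_steps: int, manager_decisions: int) -> int:
    """One call per worker step plus one per manager decision."""
    return worker_steps + manager_decisions


def peer_calls(agents: int, turns: int) -> int:
    """Lower bound: the table's 2N calls per turn, summed over all turns."""
    return 2 * agents * turns


def pipeline_calls(stages: int) -> int:
    """One call per stage; the count is known before the run starts."""
    return stages


def competitive_calls(n: int, calls_per_run: int, verifier_calls: int) -> int:
    """N independent runs plus whatever the verifier costs."""
    return n * calls_per_run + verifier_calls


# Hypothetical workload sizes, chosen only to make the comparison concrete.
estimates = {
    "hierarchical": hierarchical_calls(worker_steps=3, manager_decisions=4),
    "peer": peer_calls(agents=3, turns=4),
    "pipeline": pipeline_calls(stages=5),
    "competitive": competitive_calls(n=8, calls_per_run=1, verifier_calls=8),
}
for topology, calls in estimates.items():
    print(f"{topology}: {calls} calls")
```

With these illustrative sizes the peer topology is the most expensive at 24 calls and the pipeline the cheapest at 5. A deterministic verifier would bring verifier_calls, and so the competitive total, down further.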

6. Coordination Overhead: The Hidden Tax

Every additional agent adds coordination overhead: tokens spent on role descriptions, on context-carrying transcripts, and on planning a human would have done in their head. More calls do not automatically buy quality. Chen et al. (NeurIPS 2024), "Are More LLM Calls All You Need?", showed empirically that in voting-style compound systems, performance can rise and then fall as the number of LLM calls grows. On simple tasks (single-step Q&A, basic summarization), a multi-agent system can likewise do worse than a well-prompted single agent once coordination overhead exceeds the specialization benefit.

Heuristic: do not reach for multi-agent until a single-agent baseline fails for an identifiable reason (context overflow, role confusion, repeated tool errors). If the single-agent baseline works, multi-agent is gold-plating.

Postmortem: The Endless Crew

A pattern repeatedly seen in 2024-2025 CrewAI deployments: a hierarchical crew where the manager keeps re-dispatching because the worker's output is "almost right but not quite." Without explicit termination, the loop runs until the token budget is exhausted. Fix: every agent gets a max_iter cap, and every worker returns a structured artifact whose acceptance is judged by a deterministic check (schema, regex, hash), not by another LLM call. The framework will not save you; you bound loops yourself.
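The fix can be sketched as a deterministic acceptance gate with a bounded retry count. This is a minimal illustration rather than CrewAI code: REQUIRED_FIELDS, accept, run_with_cap, and flaky_worker are hypothetical names, and the worker stub stands in for an LLM-backed agent whose first output is "almost right".

```python
import json
from typing import Callable

# Hypothetical artifact schema: field name -> required Python type.
REQUIRED_FIELDS = {"title": str, "body": str, "sources": list}


def accept(raw: str) -> bool:
    """Deterministic acceptance: valid JSON object with every field correctly typed."""
    try:
        artifact = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(artifact, dict) and all(
        isinstance(artifact.get(name), kind) for name, kind in REQUIRED_FIELDS.items()
    )


def run_with_cap(worker: Callable[[int], str], max_attempts: int = 3) -> tuple[dict, int]:
    """Re-dispatch until the artifact passes, but never beyond max_attempts."""
    for attempt in range(1, max_attempts + 1):
        raw = worker(attempt)
        if accept(raw):
            return json.loads(raw), attempt
    raise RuntimeError(f"no acceptable artifact after {max_attempts} attempts")


# Stub worker: the first output omits "sources", the second is complete.
def flaky_worker(attempt: int) -> str:
    draft = {"title": "Post", "body": "Text"}
    if attempt >= 2:
        draft["sources"] = ["https://example.com"]
    return json.dumps(draft)


artifact, attempts = run_with_cap(flaky_worker)
print(attempts, sorted(artifact))
```

A worker that never satisfies the schema fails loudly after max_attempts instead of consuming the token budget, and the failure surfaces as an exception the caller must handle.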

7. Framework Mapping

Most frameworks support multiple topologies but each has a sweet spot. LangGraph is most general (graph engine; any topology) but code-heavy. CrewAI is most ergonomic for hierarchical and pipeline patterns. AutoGen is most natural for peer debate via GroupChat. MetaGPT is purpose-built for the software-engineering pipeline. For competitive / best-of-N no framework dominates; teams wire it themselves around any of the above. See Section 30.2 (Libraries & Frameworks) for the framework-level comparison.

What's Next?

In the next section, Section 30.4: Datasets & Benchmarks, we build on the material covered here.

Further Reading

Du, Y. et al. (2023). "Improving Factuality and Reasoning in Language Models through Multiagent Debate." arXiv:2305.14325. arxiv.org/abs/2305.14325. The canonical peer-debate paper; established that multi-agent debate measurably improves factuality.

Hong, S. et al. (2024). "MetaGPT: Meta Programming for a Multi-Agent Collaborative Framework." ICLR 2024. arxiv.org/abs/2308.00352. Defined the pipeline pattern for software-engineering simulation: PM, architect, engineer, QA roles with structured artifacts.

Wu, Q. et al. (2024). "AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation." COLM 2024. arxiv.org/abs/2308.08155. The AutoGen framework paper; GroupChat primitive and speaker-selection policies for peer topologies.

Chen, L. et al. (2024). "Are More LLM Calls All You Need? Towards Scaling Laws of Compound Inference Systems." NeurIPS 2024. arxiv.org/abs/2403.02419. Empirical analysis of voting-style compound systems showing that performance can first rise and then fall as the number of LLM calls grows; cited here as evidence that more calls are not automatically better.

CrewAI Team (2024). "CrewAI: Role-Based Multi-Agent Orchestration." CrewAI Documentation. docs.crewai.com. Framework reference for hierarchical content-production workflows.

OpenHands Team (2025). "OpenHands: An Open Platform for AI Software Developers." arXiv:2407.16741. arxiv.org/abs/2407.16741. Successor to OpenDevin; pipeline pattern applied to autonomous software engineering.

Google DeepMind (2023). "AlphaCode 2 Technical Report." deepmind-media/AlphaCode2. Competitive sample-and-filter pattern for code generation at scale.