Chapter 28: Multi-Agent Systems | Building Language AI

Chapter opener illustration: Multi-Agent Systems.

"None of us is as smart as all of us."
Echo, Co-Operative AI Agent

Looking Back

One agent (Chapter 26) with tools (Chapter 27) handles most production tasks. Some tasks need several agents. This chapter covers the framework landscape (LangGraph, CrewAI, AutoGen, and OpenAI's Swarm along with its successor, the Agents SDK), the architectural patterns (supervisor, hierarchical, debate, ensemble), and the engineering reality: most "multi-agent" systems work because they replicate one good agent with role-specific prompts, not because of any deep coordination magic.

Chapter Overview

Anthropic's June 2025 engineering write-up on its multi-agent research system reported that an orchestrator-plus-workers setup outperformed a single Claude agent by about 90 percent on the company's internal research evaluation, while multi-agent runs consumed roughly 15 times the tokens of an ordinary chat interaction. That is the multi-agent trade: more roles, more debate, more compute, and sometimes more right answers. Most teams over-buy: as a rule of thumb, a well-prompted single agent solves about 80 percent of problems with no orchestration tax. This chapter is about the remaining 20 percent where extra agents actually pay for themselves, and the LangGraph, CrewAI, AutoGen, and OpenAI Agents SDK patterns that let you build them without a state-machine graveyard.
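The trade described above can be made concrete with a little arithmetic. The following is a minimal sketch, not a pricing model: the function name `cost_per_correct`, the accuracies, the token counts, and the price per million tokens are all hypothetical values chosen for illustration.

```python
def cost_per_correct(accuracy: float, tokens_per_task: int,
                     usd_per_million_tokens: float) -> float:
    """Expected spend (USD) to obtain one correct answer."""
    if not 0 < accuracy <= 1:
        raise ValueError("accuracy must be in (0, 1]")
    cost_per_task = tokens_per_task / 1_000_000 * usd_per_million_tokens
    return cost_per_task / accuracy


# Hypothetical numbers for illustration only (not measured results):
# the multi-agent setup is 1.9x as accurate but uses 15x the tokens.
single = cost_per_correct(accuracy=0.40, tokens_per_task=20_000,
                          usd_per_million_tokens=3.0)
multi = cost_per_correct(accuracy=0.76, tokens_per_task=300_000,
                         usd_per_million_tokens=3.0)

print(f"single agent: ${single:.2f} per correct answer")
print(f"multi-agent:  ${multi:.2f} per correct answer")
print(f"cost ratio:   {multi / single:.1f}x")
```

Under these assumed numbers each correct answer costs several times more with the multi-agent setup, so the extra agents pay off only when a correct answer is worth that premium.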

You will learn to design multi-agent architectures using supervisor, pipeline, mesh, swarm, hierarchical, and debate topologies. The chapter covers structured communication protocols with consensus mechanisms, durable state management using LangGraph state machines and Temporal, and human-in-the-loop interaction points with graduated autonomy and trust calibration.

Big Picture

Complex tasks often exceed what a single agent can handle. Multi-agent systems use collaboration patterns like supervisor hierarchies, debate, and pipeline architectures to decompose problems. This chapter builds on the single-agent foundations of Chapter 26 and connects to the agent safety considerations of Chapter 49.
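As a rough illustration of the supervisor pattern mentioned above, here is a minimal framework-free sketch. Every name in it (`plan`, `supervise`, `WORKERS`, and the two worker functions) is hypothetical, and the planner is a hard-coded stub; in a real system an LLM call would produce the plan and each worker would be an agent with its own role-specific prompt.

```python
from typing import Callable, Dict, List, Tuple

Worker = Callable[[str], str]


def research_worker(subtask: str) -> str:
    # Stand-in for an agent with a research-focused system prompt.
    return f"[research] notes on: {subtask}"


def writer_worker(subtask: str) -> str:
    # Stand-in for an agent with a writing-focused system prompt.
    return f"[writer] draft for: {subtask}"


WORKERS: Dict[str, Worker] = {
    "research": research_worker,
    "writer": writer_worker,
}


def plan(task: str) -> List[Tuple[str, str]]:
    """Hypothetical planner stub: returns (role, subtask) pairs."""
    return [
        ("research", f"gather sources for '{task}'"),
        ("writer", f"summarize findings for '{task}'"),
    ]


def supervise(task: str) -> List[str]:
    """Decompose the task, route each subtask to a worker, collect results."""
    results: List[str] = []
    for role, subtask in plan(task):
        worker = WORKERS.get(role)
        if worker is None:
            raise KeyError(f"no worker registered for role {role!r}")
        results.append(worker(subtask))
    return results


for line in supervise("multi-agent token costs"):
    print(line)
```

The supervisor owns the decomposition and routing; the workers never talk to each other directly, which is what distinguishes this topology from a mesh or a debate.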

Learning Objectives

Compare major multi-agent frameworks (LangGraph, CrewAI, AutoGen, OpenAI Agents SDK, Google ADK) and select the right one for a given use case
Design multi-agent architectures using supervisor, pipeline, mesh, swarm, hierarchical, and debate topologies
Implement structured communication protocols with consensus mechanisms and strategies to prevent sycophantic convergence
Build durable, checkpointed agent workflows using LangGraph state machines or Temporal for long-running orchestration
Design human-in-the-loop interaction points with graduated autonomy and trust calibration

Prerequisites

Chapter 26: AI Agent Foundations (ReAct loop, agent architectures, planning and memory)
Chapter 27: Tool Use, Function Calling & Protocols (function calling, MCP, A2A protocol)
Chapter 12: Prompt Engineering (system prompts, structured outputs, chain-of-thought)
Familiarity with Python async patterns and basic graph or state machine concepts

Sections

Lab 28: Build a 3-Agent Debate System That Beats a Single-Agent Baseline

Objective

Set up a multi-agent debate where two adversarial agents argue both sides of a question and a third "judge" agent synthesizes the final answer. By the end you will see where debate helps (controversial / ambiguous queries) and where it hurts (factual lookups), and you will have a measurable improvement on a small benchmark.

Steps

Step 1: Pick a benchmark. Use a 100-item subset of TruthfulQA (multiple-choice). This benchmark has known-tricky questions where single-model accuracy hovers around 50 to 70%.
Step 2: Baseline. Run GPT-4o-mini directly on all 100 questions. Record accuracy.
Step 3: Build the debate. Implement three roles via LangGraph or AutoGen: Proposer argues for answer A, Skeptic argues for answer B, Judge reads 2 rounds of exchange and decides. Each role is a separate system prompt; share state via a message log.
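Step 4: Run debate on the same 100. Record accuracy and total tokens consumed. Two rounds between two debaters plus one judge call means five model calls per question over a growing transcript, so expect debate to burn roughly 5x the baseline or more.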
Step 5: Error analysis. Open 10 questions where debate flipped the verdict. How many flipped to correct? To wrong? Hypothesize when debate helps (calibration, edge cases) vs. hurts (one agent confidently wrong + persuasive).
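Step 6: Library shortcut. Re-implement the same debate with AutoGen in about 25 lines. In the classic pyautogen (0.2-style) API this is a UserProxyAgent, two AssistantAgents, and a GroupChatManager; in autogen-agentchat 0.4 and later, the equivalent is a team such as RoundRobinGroupChat. Compare developer experience to the from-scratch version.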
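The from-scratch debate in Steps 3 and 4 can be sketched without any framework. This is one possible shape under stated assumptions, not a reference solution: the `Debate` class, the `LLM` callable signature, the prompt wording, and `stub_llm` are all hypothetical, and the token count is a crude whitespace estimate rather than a real tokenizer. Swap `stub_llm` for a real model call to run the lab.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Assumption: an LLM call is modeled as (system_prompt, transcript) -> reply.
LLM = Callable[[str, str], str]


@dataclass
class Debate:
    llm: LLM
    rounds: int = 2
    log: List[str] = field(default_factory=list)  # shared message log
    tokens: int = 0                               # rough running total

    def _say(self, role: str, system: str) -> str:
        transcript = "\n".join(self.log)
        reply = self.llm(system, transcript)
        # Crude estimate: whitespace-separated words in prompt plus reply.
        self.tokens += len(f"{system} {transcript} {reply}".split())
        self.log.append(f"{role}: {reply}")
        return reply

    def run(self, question: str, option_a: str, option_b: str) -> str:
        self.log = [f"Question: {question}", f"A: {option_a}", f"B: {option_b}"]
        self.tokens = 0
        for _ in range(self.rounds):
            self._say("Proposer", "Argue that answer A is correct. Rebut the Skeptic.")
            self._say("Skeptic", "Argue that answer B is correct. Rebut the Proposer.")
        verdict = self._say("Judge", "Read the debate and reply with exactly 'A' or 'B'.")
        return verdict.strip().upper()[:1]


def stub_llm(system: str, transcript: str) -> str:
    """Deterministic stand-in so the sketch runs offline."""
    if system.startswith("Read the debate"):
        return "B"
    side = "A" if "answer A" in system else "B"
    return f"Answer {side} is better supported."


debate = Debate(llm=stub_llm)
verdict = debate.run("Is the claim true?", "Yes", "No")
print(verdict, len(debate.log), debate.tokens)
```

Because every role reads the full shared log, each call is longer than the last, which is where most of the extra token cost comes from.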

Expected Output

Expected time: 3 to 4 hours. Difficulty: intermediate. Artifact: accuracy table (baseline / debate) + token-cost analysis + categorized failure log.
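One way to produce the accuracy table and the categorized failure log is sketched below. The function names and the three flip categories ("fixed", "broken", "wrong_to_wrong") are assumptions made for illustration, and the toy predictions are made-up placeholders, not benchmark results.

```python
from typing import Dict, List


def accuracy(preds: List[str], gold: List[str]) -> float:
    """Fraction of predictions that match the gold labels."""
    if not gold or len(preds) != len(gold):
        raise ValueError("preds and gold must be non-empty and equal length")
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)


def categorize_flips(baseline: List[str], debate: List[str],
                     gold: List[str]) -> Dict[str, List[int]]:
    """Group question indices where debate changed the baseline answer."""
    flips: Dict[str, List[int]] = {"fixed": [], "broken": [], "wrong_to_wrong": []}
    for i, (b, d, g) in enumerate(zip(baseline, debate, gold)):
        if b == d:
            continue  # no flip
        if d == g:
            flips["fixed"].append(i)           # wrong -> correct
        elif b == g:
            flips["broken"].append(i)          # correct -> wrong
        else:
            flips["wrong_to_wrong"].append(i)  # wrong -> different wrong
    return flips


# Toy placeholder data (hypothetical, for illustration only).
gold = ["A", "B", "C", "A"]
baseline_preds = ["A", "A", "A", "A"]
debate_preds = ["A", "B", "B", "C"]

print(f"{'system':<10}{'accuracy':>10}")
print(f"{'baseline':<10}{accuracy(baseline_preds, gold):>10.2f}")
print(f"{'debate':<10}{accuracy(debate_preds, gold):>10.2f}")
print(categorize_flips(baseline_preds, debate_preds, gold))
```

Reporting "fixed" and "broken" separately matters because a debate can leave overall accuracy unchanged while still flipping many individual answers in both directions.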

What's Next?

Next: Chapter 29: Specialized Agents. The patterns in Chapter 28 are generic; Chapter 29 zooms into the agent types that are actually shipping revenue in 2026: coding agents (Claude Code, Cursor, Windsurf, Devin), research agents, data-analysis agents, browser-use agents, and computer-use agents. Each has a distinctive architecture, and the differences between them are the lessons.