Testing Multi-Agent Systems

Section 28.4

"Testing one agent is hard. Testing five agents talking to each other is a combinatorial adventure."

Eval, Thoroughly Tested AI Agent
Big Picture

Testing multi-agent systems is a combinatorial challenge that standard unit testing cannot solve alone. Non-deterministic LLM outputs, emergent inter-agent behaviors, and environmental dependencies mean that a test suite passing today might fail tomorrow with no code changes. This section introduces a four-level testing pyramid for agent systems (unit, integration, scenario, chaos), outcome-based assertions that tolerate natural variation in LLM outputs, and chaos testing techniques that inject failures to verify error handling and graceful degradation. The evaluation frameworks from Chapter 42 address model quality; this section addresses system-level reliability.

Prerequisites

This section builds on all previous chapters in Part VI, especially tool use (Chapter 27) and multi-agent systems (Chapter 28).

28.4.1 The Testing Challenge

Fun Fact

The four-layer testing pyramid for agents (unit, integration, scenario, chaos) borrows the chaos-engineering term directly from Netflix's Chaos Monkey (2010), the tool that randomly killed production servers to verify resilience. Netflix open-sourced Chaos Monkey because they had grown tired of explaining to other engineering teams why their EC2 instances kept disappearing at 11 a.m. on Tuesdays.

Testing multi-agent systems breaks the assumptions traditional software testing leans on. Three problems compound: non-determinism (same input, different output), emergent behavior (agents interact in ways nobody designed), and environmental drift (tools, APIs, and data shift between runs). A test suite that passes today can fail tomorrow simply because the LLM took a slightly different reasoning path and called a different tool. Unit tests are still necessary, but they are not sufficient.

The testing pyramid for agent systems has four levels. Unit tests verify individual components: tool implementations, input validators, state management logic. These are deterministic and fast. Integration tests verify that components work together: the agent can call tools correctly, tools return results in the expected format, and state transitions work as designed. Scenario tests run the complete agent on predefined tasks and check for acceptable outcomes (not exact matches). Chaos tests inject failures into the system to verify that error handling, fallbacks, and graceful degradation work correctly.

Pyramid showing four testing layers from bottom to top: unit, integration, scenario, chaos, with cost increasing and test count decreasing toward the top
Figure 28.4.1: The four-layer testing pyramid for agent systems. Unit and integration tests are cheap and deterministic; scenario and chaos tests are expensive and non-deterministic. A healthy suite has thousands of unit tests at the base, hundreds of integration tests, dozens of scenario tests, and only a handful (5 to 20) of chaos drills at the peak.
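
As a minimal sketch of the bottom layer of the pyramid, the tests below exercise a deterministic input validator with no LLM in the loop. The `validate_search_args` function, its `query` and `top_k` fields, and the `[1, 50]` range are hypothetical names and limits chosen for illustration; they do not come from any particular framework.

```python
def validate_search_args(args: dict) -> dict:
    """Hypothetical validator for a search tool's arguments (illustrative only)."""
    query = args.get("query")
    if not isinstance(query, str) or not query.strip():
        raise ValueError("query must be a non-empty string")
    top_k = args.get("top_k", 5)
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(top_k, bool) or not isinstance(top_k, int) or not 1 <= top_k <= 50:
        raise ValueError("top_k must be an integer in [1, 50]")
    return {"query": query.strip(), "top_k": top_k}


def test_normalizes_valid_input():
    assert validate_search_args({"query": "  rag  "}) == {"query": "rag", "top_k": 5}


def test_rejects_bad_input():
    for bad in ({"query": "   "}, {"query": "rag", "top_k": 0}, {"top_k": 3}):
        try:
            validate_search_args(bad)
        except ValueError:
            continue
        raise AssertionError(f"accepted invalid input: {bad}")
```

Tests like these run in milliseconds and give the same result on every run, which is why they can form the wide base of the pyramid.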

For the non-deterministic layers (scenario and chaos tests), use outcome-based assertions rather than exact-match assertions. Instead of checking that the agent produced a specific string, check that the output contains the required information, that tool calls were made in a valid order, that the final answer is factually correct, and that the agent stayed within its budget. This makes tests robust to the natural variation in LLM outputs while still catching genuine failures.
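
One way to express such assertions is a small checker that returns a list of violations. This is a sketch under stated assumptions: the `AgentRun` record and its field names are hypothetical, and case-insensitive substring matching is only a crude stand-in for a real factual-correctness check.

```python
from dataclasses import dataclass, field


@dataclass
class AgentRun:
    """Hypothetical record of one agent run; field names are assumptions."""
    output: str
    tool_calls: list[str] = field(default_factory=list)
    tokens_used: int = 0


def check_outcome(run, required_facts, must_precede, token_budget):
    """Return a list of human-readable violations (an empty list means pass)."""
    problems = []
    text = run.output.lower()
    for fact in required_facts:
        if fact.lower() not in text:
            problems.append(f"missing required fact: {fact!r}")
    for earlier, later in must_precede:
        if later in run.tool_calls:
            first_later = run.tool_calls.index(later)
            if earlier not in run.tool_calls[:first_later]:
                problems.append(f"{later} called before {earlier}")
    if run.tokens_used > token_budget:
        problems.append(f"over budget: {run.tokens_used} > {token_budget}")
    return problems


# Two differently worded runs both pass; a run that skips the query step fails.
run_a = AgentRun("Revenue was $1.24M in March.", ["query_db", "summarize"], 900)
run_b = AgentRun("March total: $1.24M.", ["query_db", "query_db", "summarize"], 1400)
run_bad = AgentRun("Revenue was $1.24M in March.", ["summarize"], 300)

rules = dict(required_facts=["$1.24M", "March"],
             must_precede=[("query_db", "summarize")],
             token_budget=2000)
assert check_outcome(run_a, **rules) == []
assert check_outcome(run_b, **rules) == []
assert check_outcome(run_bad, **rules) == ["summarize called before query_db"]
```

Returning the full list of violations, rather than stopping at the first, makes failures easier to diagnose when a single run breaks several expectations at once.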

Key Insight

The most valuable agent tests are regression tests built from production failures. When an agent fails in production, capture the full trace (input, tool calls, responses, output) and add it to the test suite as a regression test. Over time, this builds a collection of real-world edge cases that the agent must handle correctly. This is far more effective than trying to anticipate failure modes in advance, because real failures reveal blind spots that manual test design misses.
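
The capture-and-replay idea can be sketched as follows. The trace format, the `ReplayTools` helper, and the `toy_agent` stand-in are all hypothetical; a real agent would be driven by an LLM, but the recorded tool responses keep the environment side of the test deterministic.

```python
import json

# Hypothetical trace captured from a production failure (the format is an assumption).
TRACE_JSON = """
{
  "input": "What was Q3 revenue?",
  "tool_calls": [
    {"tool": "query_db", "args": {"quarter": "Q3"}, "response": {"rows": []}}
  ],
  "bad_output": "Q3 revenue was $0."
}
"""


class ReplayTools:
    """Serve recorded tool responses in order, so the replay is deterministic."""

    def __init__(self, recorded_calls):
        self._calls = list(recorded_calls)
        self._next = 0

    def call(self, tool, args):
        recorded = self._calls[self._next]
        if recorded["tool"] != tool:
            raise AssertionError(f"expected {recorded['tool']}, got {tool}")
        self._next += 1
        return recorded["response"]


def toy_agent(question, tools):
    """Stand-in for the fixed agent: reports missing data instead of inventing $0."""
    result = tools.call("query_db", {"quarter": "Q3"})
    if not result["rows"]:
        return "No revenue records were found for Q3, so I cannot give a figure."
    return f"Q3 revenue was ${sum(r['amount'] for r in result['rows'])}."


def test_regression_empty_rows_not_reported_as_zero():
    trace = json.loads(TRACE_JSON)
    output = toy_agent(trace["input"], ReplayTools(trace["tool_calls"]))
    assert output != trace["bad_output"]
    assert "no revenue records" in output.lower()


test_regression_empty_rows_not_reported_as_zero()
```

Storing the original bad output alongside the trace lets the test assert both that the old failure is gone and that the new behavior matches the intended outcome.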

28.4.2 Contract Testing for Multi-Agent Systems

In a multi-agent system, each agent depends on the outputs of other agents. If Agent A changes its output format, Agent B (which consumes that output) may break. Contract testing verifies that each agent's inputs and outputs conform to agreed-upon schemas, catching integration issues before they reach production. The "contract" is a formal specification of what each agent expects to receive and what it promises to produce.

from pydantic import BaseModel
from typing import List

# Define the contract between the Research Agent and the Writing Agent
class ResearchOutput(BaseModel):
    """Contract: what the Research Agent must produce."""
    topic: str
    findings: List[dict]  # Each finding has 'source', 'content', 'relevance'
    gaps: List[str]      # Topics that need more research
    confidence: float  # 0.0 to 1.0

class WritingInput(BaseModel):
    """Contract: what the Writing Agent expects to receive."""
    topic: str
    findings: List[dict]
    tone: str       # "formal", "casual", "technical"
    max_length: int  # words

def test_research_output_matches_writing_input():
    """Research Agent output must satisfy Writing Agent input contract."""
    # Run the Research Agent on a test task
    research_result = research_agent.run("Summarize recent advances in RAG")
    # Validate against the contract
    output = ResearchOutput(**research_result)
    assert len(output.findings) >= 1, "Must produce at least one finding"
    assert 0 <= output.confidence <= 1, "Confidence must be in [0, 1]"
    # Verify it can be transformed into the Writing Agent's expected input
    writing_input = WritingInput(
        topic=output.topic,
        findings=output.findings,
        tone="technical",
        max_length=2000,
    )
    assert writing_input  # Pydantic validation passed
Code Fragment 28.4.1a: This snippet uses Pydantic BaseModel classes to define explicit contracts between the Research Agent's ResearchOutput (topic, findings, gaps, and a confidence float) and the Writing Agent's WritingInput. The test validates a Research Agent result against ResearchOutput and then confirms that it can be mapped into a valid WritingInput, so malformed research results are rejected at the boundary before the Writing Agent processes them.

Contract testing is especially important for multi-agent systems because agents evolve independently. If the Research Agent's developer changes the output format from a list of dictionaries to a flat string, the Writing Agent breaks silently because it receives valid text but not the structure it expects. Contract tests catch this at the boundary before it manifests as a subtle quality degradation in production. This is the same principle that drives API versioning in microservice architectures: the interface between components must be explicitly defined and tested, independent of each component's internal implementation.

28.4.3 Chaos Engineering for Agents

Chaos engineering deliberately introduces failures into the system to verify that it handles them correctly. For agent systems, chaos tests inject: LLM API failures (timeouts, rate limits, garbage responses), tool failures (services returning errors, slow responses, incorrect data), data corruption (tools returning malformed JSON, unexpected data types), and resource exhaustion (memory limits, token budget depletion). Each injected failure tests the system's resilience and reveals gaps in error handling.

The approach is systematic: define a steady state (the agent successfully completes a reference task), introduce a failure, and verify that the system either recovers to the steady state or degrades gracefully. Each chaos test should have a clear hypothesis: "If the database tool fails, the agent should fall back to cached data and note the limitation in its response." Running chaos tests regularly, especially before major deployments, builds confidence that the system is resilient to real-world failures.
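
The hypothesis quoted above can be written as a deterministic test in which the failure is switched on explicitly rather than sampled at random. This is a minimal sketch: `FlakyDatabase`, `CACHE`, and `answer_sales_question` are hypothetical stand-ins for a real tool, cache, and agent step, and the revenue figures are placeholders.

```python
class FlakyDatabase:
    """Hypothetical database tool whose failure can be switched on by the test."""

    def __init__(self, fail=False):
        self.fail = fail

    def query(self, key):
        if self.fail:
            raise TimeoutError("database timed out")
        return {"revenue": 1240000, "source": "live"}


# Placeholder cached data used when the live tool is unavailable.
CACHE = {"sales": {"revenue": 1200000, "source": "cache"}}


def answer_sales_question(db):
    """Toy agent step: prefer live data, fall back to cache, flag the limitation."""
    try:
        data = db.query("sales")
    except TimeoutError:
        data = CACHE["sales"]
    note = "" if data["source"] == "live" else " (live data unavailable; using cached figures)"
    return f"Revenue: ${data['revenue']:,}{note}"


def test_steady_state():
    # Steady state: with no injected failure the reference task succeeds.
    assert answer_sales_question(FlakyDatabase()) == "Revenue: $1,240,000"


def test_database_failure_falls_back_to_cache():
    # Hypothesis: on database failure, use cached data and note the limitation.
    out = answer_sales_question(FlakyDatabase(fail=True))
    assert "1,200,000" in out
    assert "unavailable" in out


test_steady_state()
test_database_failure_falls_back_to_cache()
```

Targeted tests like this pin down one hypothesis each; randomized injection is then useful for finding failure combinations nobody thought to write down.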

import random

# Placeholder exception types so the fragment is self-contained; in practice,
# substitute the error classes raised by your LLM or tool client library.
class RateLimitError(Exception):
    pass

class APIError(Exception):
    pass

class ChaosInjector:
    """Inject failures into agent tool calls for chaos testing."""
    def __init__(self, failure_rate: float = 0.3):
        self.failure_rate = failure_rate
        self.injected_failures = []  # (tool_name, failure_type) tuples

    def maybe_fail(self, tool_name: str, original_func):
        """Wrap a tool function with random failure injection."""
        async def chaos_wrapper(*args, **kwargs):
            if random.random() < self.failure_rate:
                failure_type = random.choice([
                    "timeout", "rate_limit", "server_error", "malformed_response"
                ])
                self.injected_failures.append((tool_name, failure_type))
                if failure_type == "timeout":
                    raise TimeoutError(f"{tool_name} timed out")
                elif failure_type == "rate_limit":
                    raise RateLimitError(f"{tool_name} rate limited")
                elif failure_type == "server_error":
                    raise APIError(f"{tool_name} returned 500")
                elif failure_type == "malformed_response":
                    return "{{invalid json"
            # No failure injected on this call: run the real tool
            return await original_func(*args, **kwargs)
        return chaos_wrapper

def test_agent_resilience():
    """Chaos test: agent should handle random tool failures gracefully."""
    chaos = ChaosInjector(failure_rate=0.3)
    # `agent` is assumed to be defined elsewhere, with async tool functions
    # that agent.run awaits. Wrap all tools with chaos injection.
    chaotic_tools = {
        name: chaos.maybe_fail(name, tool.execute)
        for name, tool in agent.tools.items()
    }
    # Run the agent on a reference task
    result = agent.run(
        "Analyze last month's sales data",
        tools=chaotic_tools,
    )
    # Verify graceful degradation
    assert result is not None, "Agent should produce some output even with failures"
    assert "error" not in result.lower() or "unavailable" in result.lower(), \
        "Error messages should be user-friendly"
    print(f"Injected {len(chaos.injected_failures)} failures: {chaos.injected_failures}")
    print(f"Agent output: {result[:200]}...")
Output (illustrative; the injected failures vary from run to run):
Injected 3 failures: [('query_db', 'timeout'), ('fetch_report', 'rate_limit'), ('parse_csv', 'malformed_response')]
Agent output: Sales analysis completed with partial data. Total revenue for last month was approximately $1.24M based on available records. Note: 2 data sources were temporarily unavailable...
Code Fragment 28.4.2: This snippet implements a ChaosInjector test harness that randomly injects failures (timeouts, rate limits, server errors, malformed responses) into agent tool calls at a configurable failure_rate. The maybe_fail method wraps real tool functions and probabilistically raises exceptions or returns malformed data, enabling systematic resilience testing of agent error handling.
# Lab starter: agent contract validation. Students fill in the TODOs.
from pydantic import BaseModel, Field
from typing import Literal

# 1) Define the contract the agent's tool calls must satisfy
class WeatherQuery(BaseModel):
    """TODO: extend with required and optional fields the agent must produce."""
    city: str = Field(..., description="City name; non-empty")
    units: Literal["c", "f"] = "c"

def call_agent(prompt: str) -> dict:
    """TODO: call your agent and return its parsed JSON tool-call payload."""
    raise NotImplementedError

def validate_tool_call(payload: dict) -> WeatherQuery:
    """TODO: parse `payload` into the WeatherQuery contract.
    Hint: use WeatherQuery.model_validate; let it raise on failure."""
    raise NotImplementedError

if __name__ == "__main__":
    prompt = "What's the weather in Tokyo in Fahrenheit?"
    payload = call_agent(prompt)
    contract = validate_tool_call(payload)
    print(f"Validated: {contract.model_dump()}")
Output (once the TODOs are completed): Validated: {'city': 'Tokyo', 'units': 'f'}
Code Fragment 28.4.3: A Pydantic-based starter contract. The WeatherQuery schema declares required (city) and optional (units) fields with allowed values; students fill in call_agent and validate_tool_call so that any non-conforming tool-call payload is rejected before reaching the executor. A complete, runnable reference implementation appears immediately below in Code Fragment 28.4.4.
# Full solution for the agent contract validation lab.
import json
from pydantic import BaseModel, Field, ValidationError
from typing import Literal
from openai import OpenAI

client = OpenAI()

class WeatherQuery(BaseModel):
    city: str = Field(..., min_length=1, description="City name; non-empty")
    country_code: str | None = Field(None, pattern=r"^[A-Z]{2}$",
                                      description="Optional ISO 3166-1 alpha-2 country code")
    units: Literal["c", "f"] = "c"

SYSTEM_PROMPT = (
    "You are a weather assistant. When the user asks about weather, respond with "
    "a JSON object {\"city\": ..., \"country_code\": ..., \"units\": \"c\" or \"f\"}. "
    "Nothing else."
)

def call_agent(prompt: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": prompt},
        ],
        response_format={"type": "json_object"},
        temperature=0.0,
    )
    return json.loads(resp.choices[0].message.content)

def validate_tool_call(payload: dict) -> WeatherQuery:
    try:
        return WeatherQuery.model_validate(payload)
    except ValidationError as e:
        # In production: log payload, return a structured error, ask agent to retry
        raise

def test_basic_call():
    payload = call_agent("What's the weather in Tokyo in Fahrenheit?")
    contract = validate_tool_call(payload)
    assert contract.city.lower() == "tokyo"
    assert contract.units == "f"

if __name__ == "__main__":
    test_basic_call()
    print("contract validated; tests passed")
Output: contract validated; tests passed
Code Fragment 28.4.4: The reference solution adds a country_code field with a regex pattern, uses response_format={"type": "json_object"} to request JSON mode so the model emits syntactically valid JSON, and catches and re-raises ValidationError inside validate_tool_call so a malformed payload surfaces as an explicit error rather than silent corruption. The except block marks where production code would log the payload or ask the agent to retry. The test_basic_call assertions are the contract test for the basic-case lab requirement.
Warning

Never run chaos tests against production systems without proper safeguards. Use isolated environments with synthetic data and mock external services. Chaos testing should validate that your error handling works correctly, not discover through production outages that it does not. Start with low failure rates (5 to 10%) and gradually increase to identify the breaking point.
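
The advice to ramp the failure rate gradually can be sketched as a seeded sweep. Everything here is a toy model under stated assumptions: `flaky_tool` and `run_task` stand in for a real pipeline, the retry count and the 90% success threshold are arbitrary choices, and fixing the seed keeps each sweep reproducible.

```python
import random


def flaky_tool(rng, failure_rate):
    """Toy tool that fails with probability `failure_rate`."""
    if rng.random() < failure_rate:
        raise TimeoutError("injected failure")
    return "ok"


def run_task(rng, failure_rate, steps=4, max_retries=2):
    """Toy pipeline: each of `steps` tool calls gets up to `max_retries` retries."""
    for _ in range(steps):
        for attempt in range(max_retries + 1):
            try:
                flaky_tool(rng, failure_rate)
                break
            except TimeoutError:
                if attempt == max_retries:
                    return False
    return True


def success_rate(failure_rate, runs=200, seed=0):
    rng = random.Random(seed)  # fixed seed makes the sweep reproducible
    return sum(run_task(rng, failure_rate) for _ in range(runs)) / runs


def sweep(rates, threshold=0.9):
    """Return (results, first rate whose success falls below threshold, or None)."""
    results = {r: success_rate(r) for r in rates}
    breaking = next((r for r in rates if results[r] < threshold), None)
    return results, breaking


results, breaking_point = sweep([0.0, 0.05, 0.1, 0.2, 0.3, 0.5, 1.0])
for rate, ok in results.items():
    print(f"failure_rate={rate:.2f}  success={ok:.2%}")
print("breaking point:", breaking_point)
```

In a real suite, `run_task` would be replaced by a call into the isolated test environment, and the threshold would come from the system's reliability target.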

Lab: Chaos Test a Multi-Agent Pipeline

Objective

In this lab, you will build a chaos testing framework for a multi-agent pipeline and use it to identify and fix resilience gaps.

Setup

Stand up the 3-agent pipeline in an isolated environment (Docker Compose works well) with mocked external services so chaos injection cannot escape to real APIs. Write a small chaos-injector module that wraps tool calls and message passing with random fault injection (failure, latency, malformed data) controlled by environment variables. Pin random seeds so each run is reproducible.

Steps

  1. Set up a 3-agent pipeline: Researcher, Analyst, Writer.
  2. Implement a chaos injector that randomly fails tools, introduces latency, and returns malformed data.
  3. Run the pipeline 20 times with a 30% failure rate and measure: success rate, graceful degradation rate, complete failure rate.
  4. Identify the weakest point in the pipeline and add error handling to improve resilience.
  5. Re-run the chaos tests and compare metrics before and after the improvement.

Expected Output

A before-and-after metrics table on the 20-run sample: baseline success rate, graceful-degradation rate, and complete-failure rate next to the post-fix numbers. The weakest-point analysis should identify a specific tool or handoff where retries, timeouts, or schema validation was missing, and the fix should measurably move complete failures into the graceful-degradation column.
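
The metrics table can be produced with a small tally like the one below. This is only a sketch: the `classify` rubric and the result shape (a dict with `answer` and `degraded`, or `None` on a crash) are assumptions to adapt to your pipeline, and the sample runs are made-up placeholders that exist only to exercise the code.

```python
from collections import Counter

OUTCOMES = ("success", "graceful_degradation", "complete_failure")


def classify(result):
    """Hypothetical rubric: `result` is None on a crash, else a dict with
    'answer' (str) and 'degraded' (bool)."""
    if result is None or not result.get("answer"):
        return "complete_failure"
    return "graceful_degradation" if result.get("degraded") else "success"


def summarize(results):
    """Return the fraction of runs that fall into each outcome category."""
    counts = Counter(classify(r) for r in results)
    total = len(results)
    return {name: counts[name] / total for name in OUTCOMES}


def format_table(before, after):
    lines = [f"{'outcome':<22}{'before':>8}{'after':>8}"]
    for name in OUTCOMES:
        lines.append(f"{name:<22}{before[name]:>8.0%}{after[name]:>8.0%}")
    return "\n".join(lines)


# Made-up placeholder results for 20 runs each, not measurements.
ok = {"answer": "report", "degraded": False}
partial = {"answer": "partial report", "degraded": True}
baseline = [ok] * 8 + [partial] * 4 + [None] * 8
after_fix = [ok] * 9 + [partial] * 9 + [None] * 2

print(format_table(summarize(baseline), summarize(after_fix)))
```

With only 20 runs per condition the rates are noisy, so treat small differences between the two columns with caution.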

Key Takeaways
Self-Check
Q1: Why is testing agent systems fundamentally harder than testing traditional software?
Answer

Agents have non-deterministic behavior (same input can produce different outputs), multi-step execution paths, dependencies on external LLM APIs, and emergent behaviors from tool interactions. Traditional unit testing cannot capture these properties; you need trajectory-level testing and chaos engineering.

Q2: What is chaos testing for multi-agent systems, and what does it reveal?
Answer

Chaos testing deliberately injects failures (tool timeouts, malformed responses, agent crashes, network partitions) into a running multi-agent system to verify that the system degrades gracefully rather than catastrophically. It reveals hidden dependencies, missing error handlers, and cascading failure paths.

Exercises

Exercise 24.5.1: Multi-Agent Testing Challenges (Conceptual)

Why is testing multi-agent systems harder than testing single agents? Identify three challenges specific to multi-agent interactions.

Answer Sketch

(1) Emergent behavior: the system's behavior is not simply the sum of individual agents; interactions produce unexpected outcomes. (2) Non-deterministic message ordering: agents may process messages in different orders across runs. (3) State explosion: with N agents and M possible states each, the state space grows as M^N. Traditional unit testing of individual agents misses interaction bugs.

Exercise 24.5.2: Contract Testing (Coding)

Implement a simple contract test for two agents: a 'requester' agent that sends tasks in a specific JSON format and a 'worker' agent that returns results in another format. Verify that both agents respect the contract.

Answer Sketch

Define JSON schemas for the request and response formats. Write tests that: (1) generate a request from the requester agent and validate it against the request schema, (2) send the request to the worker agent and validate its response against the response schema, (3) verify round-trip consistency (the response references the correct request ID). Use jsonschema.validate() for schema checking.

Exercise 24.5.3: Chaos Engineering for Agents (Coding)

Design a chaos testing framework that randomly injects failures into a multi-agent system: dropping messages between agents, adding latency, and corrupting tool outputs. Track how the system degrades.

Answer Sketch

Create a proxy layer between agents that randomly: (1) drops N% of messages, (2) adds random delays (100ms to 5s), (3) corrupts tool outputs by replacing content with garbage. Run the system on a set of test tasks and measure: task completion rate, average latency, error recovery success rate, and cost overhead. Compare against baseline (no chaos) to quantify resilience.

Exercise 24.5.4: Agent Interaction Traces (Conceptual)

How should multi-agent interaction traces be structured for debugging? What information should each trace entry contain, and how should traces be correlated across agents?

Answer Sketch

Each trace entry: timestamp, agent_id, action_type (send, receive, tool_call, decision), message content, parent_trace_id (for correlation). Use a shared trace_id across all agents working on the same task. Store traces in a time-series database. Visualization should show a timeline with swim lanes (one per agent) and arrows showing message flow. This makes it easy to spot communication failures and bottlenecks.
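
One possible shape for such a trace entry is sketched below. The fields follow the list in the answer sketch; `entry_id` and the two helper functions are hypothetical additions, and here `parent_trace_id` is assumed to hold the `entry_id` of the entry that caused this one.

```python
import itertools
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional

_ids = itertools.count(1)


@dataclass
class TraceEntry:
    trace_id: str        # shared by every agent working on the same task
    agent_id: str
    action_type: str     # "send", "receive", "tool_call", or "decision"
    content: str
    timestamp: float
    parent_trace_id: Optional[int] = None  # entry_id of the causing entry
    entry_id: int = field(default_factory=lambda: next(_ids))


def swim_lanes(entries, trace_id):
    """Group one task's entries by agent, each lane sorted by time."""
    lanes = defaultdict(list)
    for e in sorted(entries, key=lambda e: e.timestamp):
        if e.trace_id == trace_id:
            lanes[e.agent_id].append(e)
    return dict(lanes)


def unanswered_sends(entries):
    """Sends that no receive entry points back to: candidate dropped messages."""
    received = {e.parent_trace_id for e in entries if e.action_type == "receive"}
    return [e for e in entries
            if e.action_type == "send" and e.entry_id not in received]


s1 = TraceEntry("task-1", "researcher", "send", "findings v1", 1.0)
r1 = TraceEntry("task-1", "analyst", "receive", "findings v1", 1.2,
                parent_trace_id=s1.entry_id)
s2 = TraceEntry("task-1", "analyst", "send", "analysis", 2.0)  # never received
entries = [s2, r1, s1]

lanes = swim_lanes(entries, "task-1")
assert [e.content for e in lanes["analyst"]] == ["findings v1", "analysis"]
assert unanswered_sends(entries) == [s2]
```

The per-agent lanes map directly onto the swim-lane timeline described above, and the unanswered-send check is one cheap way to surface a communication failure.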

Exercise 24.5.5: Regression Testing Strategy (Conceptual)

Describe a regression testing strategy for a multi-agent system that is updated frequently. How do you balance test coverage with test execution time?

Answer Sketch

Maintain a golden test set of representative tasks with verified outputs. Run the full set on every major release. For frequent updates, use a smaller smoke test set (10% of tasks) that covers the most critical paths. Use LLM-as-judge to evaluate output quality for tasks without exact-match answers. Track metrics over time to detect gradual degradation. Cache expensive tool calls in tests to reduce cost and latency.

What Comes Next

Continue to Part VII: Multimodal and Applications. Having mastered agentic AI patterns, you will now extend LLMs beyond text: vision-language models, audio, document understanding, and production deployment of multimodal pipelines.

Further Reading
Kapoor, S., Stroebl, B., Siegel, Z.S., et al. (2024). "AI Agents That Matter." arXiv preprint. Identifies key pitfalls in agent evaluation including lack of statistical rigor and overfitting to benchmarks, informing testing methodology for multi-agent systems.
Ruan, Y., Dong, H., Wang, A., et al. (2024). "Identifying the Risks of LM Agents with an LM-Emulated Sandbox." ICLR 2024. Proposes emulated environments for testing agent safety, enabling systematic identification of failure modes before production deployment.
Rosenthal, C., Jones, N. (2020). "Principles of Chaos Engineering." principlesofchaos.org. Defines the principles of chaos engineering for distributed systems, applicable to testing agent resilience by injecting failures in tool calls, APIs, and inter-agent communication.
Chen, W., Su, Y., Zuo, J., et al. (2024). "AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors." ICLR 2024. Provides a testbed for studying multi-agent interaction dynamics and emergent behaviors, useful for validating multi-agent system behavior under various conditions.
Pact Foundation (2024). "Pact: Contract Testing." docs.pact.io. Documentation for the Pact contract testing framework, applicable to defining and verifying communication contracts between agents in multi-agent systems.