"Testing one agent is hard. Testing five agents talking to each other is a combinatorial adventure."
Eval, Thoroughly Tested AI Agent
Testing multi-agent systems is a combinatorial challenge that standard unit testing cannot solve alone. Non-deterministic LLM outputs, emergent inter-agent behaviors, and environmental dependencies mean that a test suite passing today might fail tomorrow with no code changes. This section introduces a four-level testing pyramid for agent systems (unit, integration, scenario, chaos), outcome-based assertions that tolerate natural variation in LLM outputs, and chaos testing techniques that inject failures to verify error handling and graceful degradation. The evaluation frameworks from Chapter 42 address model quality; this section addresses system-level reliability.
Prerequisites
This section builds on all previous chapters in Part VI, especially tool use (Chapter 27) and multi-agent systems (Chapter 28).
28.4.1 The Testing Challenge
The four-layer testing pyramid for agents (unit, integration, scenario, chaos) borrows the chaos-engineering term directly from Netflix's Chaos Monkey (2010), the tool that randomly killed production servers to verify resilience. Netflix open-sourced Chaos Monkey because they had grown tired of explaining to other engineering teams why their EC2 instances kept disappearing at 11 a.m. on Tuesdays.
Testing multi-agent systems breaks the assumptions traditional software testing leans on. Three problems compound: non-determinism (same input, different output), emergent behavior (agents interact in ways nobody designed), and environmental drift (tools, APIs, and data shift between runs). A test suite that passes today fails tomorrow because the LLM took a slightly different reasoning path and called a different tool. Unit tests are still necessary; they are not sufficient.
The testing pyramid for agent systems has four levels. Unit tests verify individual components: tool implementations, input validators, state management logic. These are deterministic and fast. Integration tests verify that components work together: the agent can call tools correctly, tools return results in the expected format, and state transitions work as designed. Scenario tests run the complete agent on predefined tasks and check for acceptable outcomes (not exact matches). Chaos tests inject failures into the system to verify that error handling, fallbacks, and graceful degradation work correctly.
For the non-deterministic layers (scenario tests), use outcome-based assertions rather than exact-match assertions. Instead of checking that the agent produced a specific string, check that the output contains the required information, that tool calls were made in a valid order, that the final answer is factually correct, and that the agent stayed within its budget. This makes tests robust to the natural variation in LLM outputs while still catching genuine failures.
The most valuable agent tests are regression tests built from production failures. When an agent fails in production, capture the full trace (input, tool calls, responses, output) and add it to the test suite as a regression test. Over time, this builds a collection of real-world edge cases that the agent must handle correctly. This is far more effective than trying to anticipate failure modes in advance, because real failures reveal blind spots that manual test design misses.
28.4.2 Contract Testing for Multi-Agent Systems
In a multi-agent system, each agent depends on the outputs of other agents. If Agent A changes its output format, Agent B (which consumes that output) may break. Contract testing verifies that each agent's inputs and outputs conform to agreed-upon schemas, catching integration issues before they reach production. The "contract" is a formal specification of what each agent expects to receive and what it promises to produce.
from pydantic import BaseModel
from typing import List
# Define the contract between the Research Agent and the Writing Agent
class ResearchOutput(BaseModel):
"""Contract: what the Research Agent must produce."""
topic: str
findings: List[dict] # Each finding has 'source', 'content', 'relevance'
gaps: List[str] # Topics that need more research
confidence: float # 0.0 to 1.0
class WritingInput(BaseModel):
"""Contract: what the Writing Agent expects to receive."""
topic: str
findings: List[dict]
tone: str # "formal", "casual", "technical"
max_length: int # words
def test_research_output_matches_writing_input():
"""Research Agent output must satisfy Writing Agent input contract."""
# Run the Research Agent on a test task
research_result = research_agent.run("Summarize recent advances in RAG")
# Validate against the contract
output = ResearchOutput(**research_result)
assert len(output.findings) >= 1, "Must produce at least one finding"
assert 0 <= output.confidence <= 1, "Confidence must be in [0, 1]"
# Verify it can be transformed into the Writing Agent's expected input
writing_input = WritingInput(
topic=output.topic,
findings=output.findings,
tone="technical",
max_length=2000,
)
assert writing_input # Pydantic validation passed
Contract testing is especially important for multi-agent systems because agents evolve independently. If the Research Agent's developer changes the output format from a list of dictionaries to a flat string, the Writing Agent breaks silently because it receives valid text but not the structure it expects. Contract tests catch this at the boundary before it manifests as a subtle quality degradation in production. This is the same principle that drives API versioning in microservice architectures: the interface between components must be explicitly defined and tested, independent of each component's internal implementation.
28.4.3 Chaos Engineering for Agents
Chaos engineering deliberately introduces failures into the system to verify that it handles them correctly. For agent systems, chaos tests inject: LLM API failures (timeouts, rate limits, garbage responses), tool failures (services returning errors, slow responses, incorrect data), data corruption (tools returning malformed JSON, unexpected data types), and resource exhaustion (memory limits, token budget depletion). Each injected failure tests the system's resilience and reveals gaps in error handling.
The approach is systematic: define a steady state (the agent successfully completes a reference task), introduce a failure, and verify that the system either recovers to the steady state or degrades gracefully. Each chaos test should have a clear hypothesis: "If the database tool fails, the agent should fall back to cached data and note the limitation in its response." Running chaos tests regularly, especially before major deployments, builds confidence that the system is resilient to real-world failures.
import random
from unittest.mock import patch
class ChaosInjector:
"""Inject failures into agent tool calls for chaos testing."""
def __init__(self, failure_rate: float = 0.3):
self.failure_rate = failure_rate
self.injected_failures = []
def maybe_fail(self, tool_name: str, original_func):
"""Wrap a tool function with random failure injection."""
async def chaos_wrapper(*args, **kwargs):
if random.random() < self.failure_rate:
failure_type = random.choice([
"timeout", "rate_limit", "server_error", "malformed_response"
])
self.injected_failures.append((tool_name, failure_type))
if failure_type == "timeout":
raise TimeoutError(f"{tool_name} timed out")
elif failure_type == "rate_limit":
raise RateLimitError(f"{tool_name} rate limited")
elif failure_type == "server_error":
raise APIError(f"{tool_name} returned 500")
elif failure_type == "malformed_response":
return "{{invalid json"
return await original_func(*args, **kwargs)
return chaos_wrapper
def test_agent_resilience():
"""Chaos test: agent should handle random tool failures gracefully."""
chaos = ChaosInjector(failure_rate=0.3)
# Wrap all tools with chaos injection
chaotic_tools = {
name: chaos.maybe_fail(name, tool.execute)
for name, tool in agent.tools.items()
}
# Run the agent on a reference task
result = agent.run(
"Analyze last month's sales data",
tools=chaotic_tools,
)
# Verify graceful degradation
assert result is not None, "Agent should produce some output even with failures"
assert "error" not in result.lower() or "unavailable" in result.lower(), \
"Error messages should be user-friendly"
print(f"Injected {len(chaos.injected_failures)} failures: {chaos.injected_failures}")
print(f"Agent output: {result[:200]}...")
# Lab starter: agent contract validation. Students fill in the TODOs.
from pydantic import BaseModel, Field
from typing import Literal
# 1) Define the contract the agent's tool calls must satisfy
class WeatherQuery(BaseModel):
"""TODO: extend with required and optional fields the agent must produce."""
city: str = Field(..., description="City name; non-empty")
units: Literal["c", "f"] = "c"
def call_agent(prompt: str) -> dict:
"""TODO: call your agent and return its parsed JSON tool-call payload."""
raise NotImplementedError
def validate_tool_call(payload: dict) -> WeatherQuery:
"""TODO: parse `payload` into the WeatherQuery contract.
Hint: use WeatherQuery.model_validate; let it raise on failure."""
raise NotImplementedError
if __name__ == "__main__":
prompt = "What's the weather in Tokyo in Fahrenheit?"
payload = call_agent(prompt)
contract = validate_tool_call(payload)
print(f"Validated: {contract.model_dump()}")
WeatherQuery schema declares required (city) and optional (units) fields with allowed values; students fill in call_agent and validate_tool_call so that any non-conforming tool-call payload is rejected before reaching the executor. A complete, runnable reference implementation appears immediately below in Code Fragment 28.4.4.# Full solution for the agent contract validation lab.
import json
from pydantic import BaseModel, Field, ValidationError
from typing import Literal
from openai import OpenAI
client = OpenAI()
class WeatherQuery(BaseModel):
city: str = Field(..., min_length=1, description="City name; non-empty")
country_code: str | None = Field(None, pattern=r"^[A-Z]{2}$",
description="Optional ISO 3166-1 alpha-2 country code")
units: Literal["c", "f"] = "c"
SYSTEM_PROMPT = (
"You are a weather assistant. When the user asks about weather, respond with "
"a JSON object {\"city\": ..., \"country_code\": ..., \"units\": \"c\" or \"f\"}. "
"Nothing else."
)
def call_agent(prompt: str) -> dict:
resp = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": SYSTEM_PROMPT},
{"role": "user", "content": prompt},
],
response_format={"type": "json_object"},
temperature=0.0,
)
return json.loads(resp.choices[0].message.content)
def validate_tool_call(payload: dict) -> WeatherQuery:
try:
return WeatherQuery.model_validate(payload)
except ValidationError as e:
# In production: log payload, return a structured error, ask agent to retry
raise
def test_basic_call():
payload = call_agent("What's the weather in Tokyo in Fahrenheit?")
contract = validate_tool_call(payload)
assert contract.city.lower() == "tokyo"
assert contract.units == "f"
if __name__ == "__main__":
test_basic_call()
print("contract validated; tests passed")
country_code field with a regex pattern, uses response_format={"type": "json_object"} to force the model to emit valid JSON, and wraps validate_tool_call in a try/raise so a malformed payload surfaces as a ValidationError rather than silent corruption. The test_basic_call assertion is the contract test for the basic-case lab requirement.Never run chaos tests against production systems without proper safeguards. Use isolated environments with synthetic data and mock external services. Chaos testing should validate that your error handling works correctly, not discover through production outages that it does not. Start with low failure rates (5 to 10%) and gradually increase to identify the breaking point.
Objective
In this lab, you will build a chaos testing framework for a multi-agent pipeline and use it to identify and fix resilience gaps.
Setup
Stand up the 3-agent pipeline in an isolated environment (Docker Compose works well) with mocked external services so chaos injection cannot escape to real APIs. Write a small chaos-injector module that wraps tool calls and message passing with random fault injection (failure, latency, malformed data) controlled by environment variables. Pin random seeds so each run is reproducible.
Steps
- Set up a 3-agent pipeline: Researcher, Analyst, Writer.
- Implement a chaos injector that randomly fails tools, introduces latency, and returns malformed data.
- Run the pipeline 20 times with a 30% failure rate and measure: success rate, graceful degradation rate, complete failure rate.
- Identify the weakest point in the pipeline and add error handling to improve resilience.
- Re-run the chaos tests and compare metrics before and after the improvement.
Expected Output
A before-and-after metrics table on the 20-run sample: baseline success rate, graceful-degradation rate, and complete-failure rate next to the post-fix numbers. The weakest-point analysis should identify a specific tool or handoff where retries, timeouts, or schema validation was missing, and the fix should measurably move complete failures into the graceful-degradation column.
- Agent testing requires trajectory-level validation, not just input-output unit tests.
- Chaos testing injects deliberate failures to verify graceful degradation and discover hidden failure modes.
- Test the full spectrum: unit tests for individual tools, integration tests for agent loops, and chaos tests for system resilience.
Show Answer
Agents have non-deterministic behavior (same input can produce different outputs), multi-step execution paths, dependencies on external LLM APIs, and emergent behaviors from tool interactions. Traditional unit testing cannot capture these properties; you need trajectory-level testing and chaos engineering.
Show Answer
Chaos testing deliberately injects failures (tool timeouts, malformed responses, agent crashes, network partitions) into a running multi-agent system to verify that the system degrades gracefully rather than catastrophically. It reveals hidden dependencies, missing error handlers, and cascading failure paths.
Exercises
Why is testing multi-agent systems harder than testing single agents? Identify three challenges specific to multi-agent interactions.
Answer Sketch
(1) Emergent behavior: the system's behavior is not simply the sum of individual agents; interactions produce unexpected outcomes. (2) Non-deterministic message ordering: agents may process messages in different orders across runs. (3) State explosion: with N agents and M possible states each, the state space grows as M^N. Traditional unit testing of individual agents misses interaction bugs.
Implement a simple contract test for two agents: a 'requester' agent that sends tasks in a specific JSON format and a 'worker' agent that returns results in another format. Verify that both agents respect the contract.
Answer Sketch
Define JSON schemas for the request and response formats. Write tests that: (1) generate a request from the requester agent and validate it against the request schema, (2) send the request to the worker agent and validate its response against the response schema, (3) verify round-trip consistency (the response references the correct request ID). Use jsonschema.validate() for schema checking.
Design a chaos testing framework that randomly injects failures into a multi-agent system: dropping messages between agents, adding latency, and corrupting tool outputs. Track how the system degrades.
Answer Sketch
Create a proxy layer between agents that randomly: (1) drops N% of messages, (2) adds random delays (100ms to 5s), (3) corrupts tool outputs by replacing content with garbage. Run the system on a set of test tasks and measure: task completion rate, average latency, error recovery success rate, and cost overhead. Compare against baseline (no chaos) to quantify resilience.
How should multi-agent interaction traces be structured for debugging? What information should each trace entry contain, and how should traces be correlated across agents?
Answer Sketch
Each trace entry: timestamp, agent_id, action_type (send, receive, tool_call, decision), message content, parent_trace_id (for correlation). Use a shared trace_id across all agents working on the same task. Store traces in a time-series database. Visualization should show a timeline with swim lanes (one per agent) and arrows showing message flow. This makes it easy to spot communication failures and bottlenecks.
Describe a regression testing strategy for a multi-agent system that is updated frequently. How do you balance test coverage with test execution time?
Answer Sketch
Maintain a golden test set of representative tasks with verified outputs. Run the full set on every major release. For frequent updates, use a smaller smoke test set (10% of tasks) that covers the most critical paths. Use LLM-as-judge to evaluate output quality for tasks without exact-match answers. Track metrics over time to detect gradual degradation. Cache expensive tool calls in tests to reduce cost and latency.
What Comes Next
Continue to Part VII: Multimodal and Applications. Having mastered agentic AI patterns, you will now extend LLMs beyond text: vision-language models, audio, document understanding, and production deployment of multimodal pipelines.