Part VI: Agentic AI
Chapter 26: Agent Safety, Production & Operations

Testing Multi-Agent Systems

"Testing one agent is hard. Testing five agents talking to each other is a combinatorial adventure."

Eval Eval, Thoroughly Tested AI Agent

Big Picture

Testing multi-agent systems is a combinatorial challenge that standard unit testing cannot solve alone. Non-deterministic LLM outputs, emergent inter-agent behaviors, and environmental dependencies mean that a test suite passing today might fail tomorrow with no code changes. This section introduces a four-level testing pyramid for agent systems (unit, integration, scenario, chaos), outcome-based assertions that tolerate natural variation in LLM outputs, and chaos testing techniques that inject failures to verify error handling and graceful degradation. The evaluation frameworks from Chapter 29 address model quality; this section addresses system-level reliability.

Prerequisites

This section builds on all previous chapters in Part VI, especially tool use (Chapter 23) and multi-agent systems (Chapter 24).

1. The Testing Challenge

Testing multi-agent systems is harder than testing traditional software because of non-determinism (the same input can produce different outputs), emergent behavior (agents interact in unexpected ways), and environmental dependencies (tools, APIs, and data sources can change between test runs). A test suite that passes today might fail tomorrow because the LLM produced a slightly different reasoning trace, triggering a different tool call sequence. Standard unit testing approaches are necessary but insufficient; testing agents requires additional strategies tailored to these challenges.

The testing pyramid for agent systems has four levels. Unit tests verify individual components: tool implementations, input validators, state management logic. These are deterministic and fast. Integration tests verify that components work together: the agent can call tools correctly, tools return results in the expected format, and state transitions work as designed. Scenario tests run the complete agent on predefined tasks and check for acceptable outcomes (not exact matches). Chaos tests inject failures into the system to verify that error handling, fallbacks, and graceful degradation work correctly.

For the non-deterministic levels of the pyramid, scenario tests in particular, use outcome-based assertions rather than exact-match assertions. Instead of checking that the agent produced a specific string, check that the output contains the required information, that tool calls were made in a valid order, that the final answer is factually correct, and that the agent stayed within its budget. This makes tests robust to the natural variation in LLM outputs while still catching genuine failures.
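The checks listed above can be sketched as a small helper. This is a minimal illustration, not a prescribed API: the `AgentRun` record, its field names, and the sample tool names are all assumptions, and the substring check is only a crude stand-in for verifying factual correctness.

```python
from dataclasses import dataclass, field


@dataclass
class AgentRun:
    """Hypothetical record of one agent run; the field names are assumptions."""
    output: str
    tool_calls: list[str] = field(default_factory=list)
    tokens_used: int = 0


def is_subsequence(required: list[str], actual: list[str]) -> bool:
    """True if `required` appears in `actual` in order, allowing extra calls in between."""
    remaining = iter(actual)
    return all(step in remaining for step in required)


def check_outcome(run: AgentRun, required_facts: list[str],
                  required_order: list[str], token_budget: int) -> list[str]:
    """Return the list of failed checks; an empty list means the run is acceptable."""
    failures = []
    text = run.output.lower()
    for fact in required_facts:
        if fact.lower() not in text:
            failures.append(f"missing fact: {fact}")
    if not is_subsequence(required_order, run.tool_calls):
        failures.append(f"tool order violated: {run.tool_calls}")
    if run.tokens_used > token_budget:
        failures.append(f"over budget: {run.tokens_used} > {token_budget}")
    return failures


# Two differently worded runs (made-up data) that should both be accepted.
run_a = AgentRun("Revenue rose 12% in March, driven by EMEA.",
                 ["query_sales", "summarize"], 1800)
run_b = AgentRun("EMEA drove a 12% increase in March revenue.",
                 ["query_sales", "lookup_region", "summarize"], 2100)
# A run that summarizes before fetching any data should be rejected.
run_bad = AgentRun("Revenue rose 12%.", ["summarize", "query_sales"], 900)

for run in (run_a, run_b):
    assert check_outcome(run, ["12%", "EMEA"], ["query_sales", "summarize"], 3000) == []
assert check_outcome(run_bad, ["12%", "EMEA"], ["query_sales", "summarize"], 3000) != []
```

Returning a list of failures rather than a single boolean keeps the test report informative when a run violates several expectations at once.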

Key Insight

The most valuable agent tests are regression tests built from production failures. When an agent fails in production, capture the full trace (input, tool calls, responses, output) and add it to the test suite as a regression test. Over time, this builds a collection of real-world edge cases that the agent must handle correctly. This is usually more effective than relying only on failure modes anticipated in advance, because real failures reveal blind spots that manual test design misses.
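One way to turn a captured trace into a regression test is to store it as a JSON fixture and replay it with the recorded tool responses. The sketch below assumes a hypothetical trace layout (`trace_id`, `input`, `tool_calls`, `output`, `must_contain`) and a stand-in agent function; the trace contents are made up for illustration.

```python
import json
import tempfile
from pathlib import Path


def save_trace(directory: Path, trace: dict) -> Path:
    """Persist one captured production trace as a JSON fixture."""
    path = directory / f"{trace['trace_id']}.json"
    path.write_text(json.dumps(trace, indent=2))
    return path


def load_traces(directory: Path) -> list[dict]:
    """Load every stored trace so that each one becomes a regression case."""
    return [json.loads(p.read_text()) for p in sorted(directory.glob("*.json"))]


def replay(trace: dict, agent_fn) -> str:
    """Re-run the agent on the recorded input, serving tool results from the trace."""
    recorded = {call["tool"]: call["response"] for call in trace["tool_calls"]}

    def replay_tool(name: str, **kwargs):
        return recorded[name]  # simplification: one recorded response per tool

    return agent_fn(trace["input"], replay_tool)


def stub_agent(user_input: str, call_tool) -> str:
    """Stand-in for the fixed agent: handles a missing order instead of crashing."""
    order = call_tool("order_lookup", order_id=17)
    if order is None:
        return "I could not find that order."
    return f"Order status: {order['status']}"


# Made-up trace of a failure: the lookup returned nothing and the original
# agent crashed. `must_contain` records the behavior that is now required.
failure_trace = {
    "trace_id": "example-0001",
    "input": "What is the status of order 17?",
    "tool_calls": [{"tool": "order_lookup", "args": {"order_id": 17}, "response": None}],
    "output": "KeyError: 'status'",
    "must_contain": "could not find",
}

with tempfile.TemporaryDirectory() as tmp:
    save_trace(Path(tmp), failure_trace)
    results = [(t, replay(t, stub_agent)) for t in load_traces(Path(tmp))]

for trace, output in results:
    assert trace["must_contain"] in output.lower(), trace["trace_id"]
```

Replaying recorded tool responses keeps the regression test independent of live services, so a failure points at the agent's handling of the case rather than at a changed environment.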

2. Contract Testing for Multi-Agent Systems

In a multi-agent system, each agent depends on the outputs of other agents. If Agent A changes its output format, Agent B (which consumes that output) may break. Contract testing verifies that each agent's inputs and outputs conform to agreed-upon schemas, catching integration issues before they reach production. The "contract" is a formal specification of what each agent expects to receive and what it promises to produce.

from pydantic import BaseModel
from typing import List

# Define the contract between the Research Agent and the Writing Agent
class ResearchOutput(BaseModel):
    """Contract: what the Research Agent must produce."""
    topic: str
    findings: List[dict]  # Each finding has 'source', 'content', 'relevance'
    gaps: List[str]  # Topics that need more research
    confidence: float  # 0.0 to 1.0

class WritingInput(BaseModel):
    """Contract: what the Writing Agent expects to receive."""
    topic: str
    findings: List[dict]
    tone: str  # "formal", "casual", "technical"
    max_length: int  # words

def test_research_output_matches_writing_input():
    """Verify that the Research Agent's output satisfies the Writing Agent's input contract."""
    # Run the Research Agent on a test task
    # (research_agent is assumed to be defined elsewhere and to return a dict)
    research_result = research_agent.run("Summarize recent advances in RAG")

    # Validate against the contract; Pydantic raises ValidationError on a mismatch
    output = ResearchOutput(**research_result)
    assert len(output.findings) >= 1, "Must produce at least one finding"
    assert 0 <= output.confidence <= 1, "Confidence must be between 0 and 1"

    # Verify it can be transformed into the Writing Agent's expected input
    writing_input = WritingInput(
        topic=output.topic,
        findings=output.findings,
        tone="technical",
        max_length=2000,
    )
    assert writing_input  # Pydantic validation passed
Code Fragment 26.5.1: This snippet uses Pydantic BaseModel classes to define an explicit contract between the Research Agent's ResearchOutput (topic, findings, gaps, and a confidence float) and the Writing Agent's WritingInput. The test validates the research result against the contract, so malformed output is rejected at the boundary before the Writing Agent processes it.

Contract testing is especially important for multi-agent systems because agents evolve independently. If the Research Agent's developer changes the output format from a list of dictionaries to a flat string, the Writing Agent breaks silently because it receives valid text but not the structure it expects. Contract tests catch this at the boundary before it manifests as a subtle quality degradation in production. This is the same principle that drives API versioning in microservice architectures: the interface between components must be explicitly defined and tested, independent of each component's internal implementation.
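The format-change scenario described above can be reproduced in a few lines. To stay dependency-free, this sketch mirrors the ResearchOutput fields from Code Fragment 26.5.1 with a standard-library dataclass and hand-written checks rather than Pydantic; `ContractViolation`, `ResearchOutputContract`, and `violates_contract` are hypothetical names, and the payloads are made up.

```python
from dataclasses import dataclass


class ContractViolation(ValueError):
    """Raised when a payload breaks the agreed contract (hypothetical name)."""


@dataclass
class ResearchOutputContract:
    """Standard-library mirror of the ResearchOutput fields, with explicit checks."""
    topic: str
    findings: list
    gaps: list
    confidence: float

    def __post_init__(self):
        if not isinstance(self.findings, list):
            raise ContractViolation("findings must be a list of dicts")
        for finding in self.findings:
            if not isinstance(finding, dict):
                raise ContractViolation("each finding must be a dict")
            missing = {"source", "content", "relevance"} - finding.keys()
            if missing:
                raise ContractViolation(f"finding missing keys: {sorted(missing)}")
        if not 0.0 <= self.confidence <= 1.0:
            raise ContractViolation("confidence must be between 0 and 1")


def violates_contract(payload: dict) -> bool:
    """Return True if the payload would be rejected at the agent boundary."""
    try:
        ResearchOutputContract(**payload)
    except (ContractViolation, TypeError):  # TypeError: missing or unexpected fields
        return True
    return False


old_format = {
    "topic": "RAG",
    "findings": [{"source": "paper-1", "content": "summary text", "relevance": 0.9}],
    "gaps": [],
    "confidence": 0.8,
}
# The Research Agent now returns a flat string instead of structured findings.
new_format = {**old_format, "findings": "paper-1: summary text"}

assert not violates_contract(old_format)
assert violates_contract(new_format)
```

The flat string is perfectly valid text, which is why the downstream agent would not crash on it; only an explicit check at the boundary turns the silent degradation into a failing test.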

3. Chaos Engineering for Agents

Chaos engineering deliberately introduces failures into the system to verify that it handles them correctly. For agent systems, chaos tests inject: LLM API failures (timeouts, rate limits, garbage responses), tool failures (services returning errors, slow responses, incorrect data), data corruption (tools returning malformed JSON, unexpected data types), and resource exhaustion (memory limits, token budget depletion). Each injected failure tests the system's resilience and reveals gaps in error handling.

The approach is systematic: define a steady state (the agent successfully completes a reference task), introduce a failure, and verify that the system either recovers to the steady state or degrades gracefully. Each chaos test should have a clear hypothesis: "If the database tool fails, the agent should fall back to cached data and note the limitation in its response." Running chaos tests regularly, especially before major deployments, builds confidence that the system is resilient to real-world failures.
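The steady-state, inject, verify cycle can be written down as a pair of tests around the example hypothesis. The sketch below uses a stand-in agent function rather than a real LLM-backed agent; `answer_sales_question`, `DatabaseDown`, and the numbers are illustrative assumptions.

```python
class DatabaseDown(Exception):
    """Injected failure for the database tool (hypothetical name)."""


def answer_sales_question(query_db, cache: dict) -> str:
    """Stand-in agent: query the database, fall back to cached data on failure."""
    try:
        total = query_db("sales_total")
        return f"Total sales: {total}"
    except DatabaseDown:
        if "sales_total" in cache:
            return (f"Total sales: {cache['sales_total']} "
                    "(from cached data; the live database is unavailable)")
        return "Sales data is currently unavailable."


def healthy_db(key: str) -> int:
    return 1200


def failing_db(key: str) -> int:
    raise DatabaseDown("injected failure")


def test_steady_state():
    """Step 1: the agent completes the reference task when nothing fails."""
    assert answer_sales_question(healthy_db, cache={}) == "Total sales: 1200"


def test_database_failure_falls_back_to_cache():
    """Hypothesis: if the database tool fails, use cached data and say so."""
    result = answer_sales_question(failing_db, cache={"sales_total": 1150})
    assert "1150" in result           # degraded, but still answers
    assert "cached" in result         # the limitation is noted in the response


def test_database_failure_without_cache_degrades_gracefully():
    """With no cache, the agent reports unavailability instead of raising."""
    result = answer_sales_question(failing_db, cache={})
    assert "unavailable" in result


test_steady_state()
test_database_failure_falls_back_to_cache()
test_database_failure_without_cache_degrades_gracefully()
```

Writing the hypothesis into the test name and docstring makes each chaos test readable as a statement about expected behavior, not just as a fault that was thrown at the system.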

import random


# Placeholder exception types for this sketch; in a real suite, raise the
# error classes defined by your LLM or tool client instead.
class RateLimitError(Exception):
    """Simulated rate-limit failure."""


class APIError(Exception):
    """Simulated server-side failure."""


class ChaosInjector:
    """Inject failures into agent tool calls for chaos testing."""

    def __init__(self, failure_rate: float = 0.3):
        self.failure_rate = failure_rate
        self.injected_failures = []

    def maybe_fail(self, tool_name: str, original_func):
        """Wrap a tool function with random failure injection."""
        async def chaos_wrapper(*args, **kwargs):
            if random.random() < self.failure_rate:
                failure_type = random.choice([
                    "timeout", "rate_limit", "server_error", "malformed_response"
                ])
                self.injected_failures.append((tool_name, failure_type))

                if failure_type == "timeout":
                    raise TimeoutError(f"{tool_name} timed out")
                elif failure_type == "rate_limit":
                    raise RateLimitError(f"{tool_name} rate limited")
                elif failure_type == "server_error":
                    raise APIError(f"{tool_name} returned 500")
                elif failure_type == "malformed_response":
                    return "{{invalid json"

            return await original_func(*args, **kwargs)
        return chaos_wrapper


def test_agent_resilience():
    """Chaos test: agent should handle random tool failures gracefully."""
    chaos = ChaosInjector(failure_rate=0.3)

    # Wrap all tools with chaos injection
    # (`agent` is assumed to be an existing fixture whose run() awaits its async tools)
    chaotic_tools = {
        name: chaos.maybe_fail(name, tool.execute)
        for name, tool in agent.tools.items()
    }

    # Run the agent on a reference task
    result = agent.run(
        "Analyze last month's sales data",
        tools=chaotic_tools,
    )

    # Verify graceful degradation
    assert result is not None, "Agent should produce some output even with failures"
    assert "error" not in result.lower() or "unavailable" in result.lower(), \
        "Error messages should be user-friendly"
    print(f"Injected {len(chaos.injected_failures)} failures: {chaos.injected_failures}")
    print(f"Agent output: {result[:200]}...")
Code Fragment 26.5.2: This snippet implements a ChaosInjector test harness that randomly injects failures (timeouts, rate limits, server errors, malformed responses) into agent tool calls at a configurable failure_rate. The maybe_fail method wraps a real tool function and, with that probability, raises an exception or returns malformed data, enabling systematic resilience testing of agent error handling.

Lab: Chaos Test a Multi-Agent Pipeline

In this lab, you will build a chaos testing framework for a multi-agent pipeline and use it to identify and fix resilience gaps.

Tasks:

  1. Set up a 3-agent pipeline: Researcher, Analyst, Writer
  2. Implement a chaos injector that randomly fails tools, introduces latency, and returns malformed data
  3. Run the pipeline 20 times with a 30% failure rate and measure: success rate, graceful degradation rate, complete failure rate
  4. Identify the weakest point in the pipeline and add error handling to improve resilience
  5. Re-run the chaos tests and compare metrics before and after the improvement

Warning

Never run chaos tests against production systems without proper safeguards. Use isolated environments with synthetic data and mock external services. Chaos testing should validate that your error handling works correctly, not discover through production outages that it does not. Start with low failure rates (5 to 10%) and gradually increase to identify the breaking point.
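For task 3 of the lab, the three rates can be tallied with a small harness like the one below. This is only a scaffold under stated assumptions: `run_pipeline_once` is a stand-in that simulates the Researcher, Analyst, Writer pipeline with a seeded random generator, and the rule that only a Writer failure is unrecoverable is invented for illustration. In the lab, replace it with a call to your real pipeline.

```python
import random
from collections import Counter

OUTCOMES = ("success", "graceful_degradation", "complete_failure")


def run_pipeline_once(rng: random.Random, failure_rate: float) -> str:
    """Stand-in for one pipeline run; returns one of the OUTCOMES labels."""
    degraded = False
    for stage in ("researcher", "analyst", "writer"):
        if rng.random() < failure_rate:
            if stage == "writer":
                return "complete_failure"  # assumption: the Writer has no fallback
            degraded = True                # assumption: earlier stages can degrade
    return "graceful_degradation" if degraded else "success"


def measure(runs: int, failure_rate: float, seed: int = 0) -> dict:
    """Run the pipeline `runs` times and return the rate of each outcome."""
    rng = random.Random(seed)  # seeded so that a chaos run can be reproduced
    counts = Counter(run_pipeline_once(rng, failure_rate) for _ in range(runs))
    return {outcome: counts[outcome] / runs for outcome in OUTCOMES}


before = measure(runs=20, failure_rate=0.3)
print(before)
```

Seeding the generator means the same sequence of injected failures can be replayed after a fix, which makes the before and after comparison in task 5 meaningful.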

Objective

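Build a chaos testing harness for a three-agent pipeline (Researcher, Analyst, Writer), measure how the pipeline behaves under injected failures, and harden its weakest point.

What You'll Practice

  • Implementing a chaos injector that fails tools, adds latency, and returns malformed data
  • Configuring failure rates and measuring success, graceful degradation, and complete failure rates
  • Comparing resilience metrics before and after adding error handling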

Setup

The following cell installs the required packages and configures the environment for this lab.

pip install torch transformers numpy
Code Fragment 26.5.3: This command installs torch, transformers, and numpy for the agent testing and contract-validation lab. These packages provide the foundation for the exercises below.

A free Colab GPU (T4) is sufficient for this lab.

Steps

Step 1: Setup and data preparation

Load the required libraries and prepare your data for this lab exercise.

# TODO: Implement setup code here
Code Fragment 26.5.4: Step 1 stub: load the required libraries and prepare data for the agent testing exercises.

Expected Output

  • A working implementation demonstrating this lab exercise
  • Console output showing key metrics and results

Stretch Goals

  • Experiment with different hyperparameters and compare outcomes
  • Extend the implementation to handle more complex scenarios
  • Benchmark performance and create visualizations of the results

Complete Solution
# Complete solution for this lab exercise
# TODO: Full implementation here
Code Fragment 26.5.5: Complete solution for the agent testing lab exercise. Students should implement the full contract validation and chaos testing pipeline.

Exercises

Exercise 26.5.1: Multi-Agent Testing Challenges Conceptual

Why is testing multi-agent systems harder than testing single agents? Identify three challenges specific to multi-agent interactions.

Answer Sketch

(1) Emergent behavior: the system's behavior is not simply the sum of individual agents; interactions produce unexpected outcomes. (2) Non-deterministic message ordering: agents may process messages in different orders across runs. (3) State explosion: with N agents and M possible states each, the state space grows as M^N. Traditional unit testing of individual agents misses interaction bugs.

Exercise 26.5.2: Contract Testing Coding

Implement a simple contract test for two agents: a 'requester' agent that sends tasks in a specific JSON format and a 'worker' agent that returns results in another format. Verify that both agents respect the contract.

Answer Sketch

Define JSON schemas for the request and response formats. Write tests that: (1) generate a request from the requester agent and validate it against the request schema, (2) send the request to the worker agent and validate its response against the response schema, (3) verify round-trip consistency (the response references the correct request ID). Use jsonschema.validate() for schema checking.

Exercise 26.5.3: Chaos Engineering for Agents Coding

Design a chaos testing framework that randomly injects failures into a multi-agent system: dropping messages between agents, adding latency, and corrupting tool outputs. Track how the system degrades.

Answer Sketch

Create a proxy layer between agents that randomly: (1) drops N% of messages, (2) adds random delays (100ms to 5s), (3) corrupts tool outputs by replacing content with garbage. Run the system on a set of test tasks and measure: task completion rate, average latency, error recovery success rate, and cost overhead. Compare against baseline (no chaos) to quantify resilience.
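A minimal sketch of such a proxy layer follows. The class name, fields, and statistics keys are assumptions made for illustration; delays are recorded rather than actually slept so that the example runs instantly, and the generator is seeded so that runs are reproducible.

```python
import random
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ChaosProxy:
    """Hypothetical proxy that sits between two agents and perturbs messages."""
    drop_rate: float = 0.1
    corrupt_rate: float = 0.1
    max_delay_s: float = 5.0
    rng: random.Random = field(default_factory=lambda: random.Random(0))
    stats: dict = field(default_factory=lambda: {
        "sent": 0, "dropped": 0, "corrupted": 0, "delay_s": 0.0})

    def deliver(self, message: str) -> Optional[str]:
        """Return the message as the receiver would see it, or None if dropped."""
        self.stats["sent"] += 1
        if self.rng.random() < self.drop_rate:
            self.stats["dropped"] += 1
            return None
        # Simulated latency between 100 ms and max_delay_s (recorded, not slept).
        self.stats["delay_s"] += self.rng.uniform(0.1, self.max_delay_s)
        if self.rng.random() < self.corrupt_rate:
            self.stats["corrupted"] += 1
            return "\x00garbage\x00"
        return message


proxy = ChaosProxy(drop_rate=0.2, corrupt_rate=0.2)
delivered = [proxy.deliver(f"task-{i}") for i in range(50)]
print(proxy.stats)
```

Comparing `stats` from a chaos run with task completion results from a baseline run (all rates set to zero) gives the degradation measurements the exercise asks for.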

Exercise 26.5.4: Agent Interaction Traces Conceptual

How should multi-agent interaction traces be structured for debugging? What information should each trace entry contain, and how should traces be correlated across agents?

Answer Sketch

Each trace entry: timestamp, agent_id, action_type (send, receive, tool_call, decision), message content, parent_trace_id (for correlation). Use a shared trace_id across all agents working on the same task. Store traces in a time-series database. Visualization should show a timeline with swim lanes (one per agent) and arrows showing message flow. This makes it easy to spot communication failures and bottlenecks.
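The trace structure can be sketched as a small dataclass plus two helpers. The field list follows the answer above; `entry_id` is an added assumption (a child entry needs something to point at), and the sample entries are made up.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TraceEntry:
    timestamp: float
    trace_id: str       # shared by all agents working on the same task
    entry_id: str       # assumption: unique id so that children can reference a parent
    agent_id: str
    action_type: str    # "send", "receive", "tool_call", or "decision"
    content: str
    parent_trace_id: Optional[str] = None  # entry_id of the entry that caused this one


def swim_lanes(entries: list, trace_id: str) -> dict:
    """Group one task's entries by agent, ordered by time (one lane per agent)."""
    lanes = defaultdict(list)
    for entry in sorted(entries, key=lambda e: e.timestamp):
        if entry.trace_id == trace_id:
            lanes[entry.agent_id].append(entry)
    return dict(lanes)


def unanswered_sends(entries: list, trace_id: str) -> list:
    """Sends that no receive points back to: candidate communication failures."""
    task = [e for e in entries if e.trace_id == trace_id]
    received = {e.parent_trace_id for e in task if e.action_type == "receive"}
    return [e for e in task if e.action_type == "send" and e.entry_id not in received]


entries = [
    TraceEntry(1.0, "task-1", "e1", "researcher", "send", "findings ready"),
    TraceEntry(1.2, "task-1", "e2", "analyst", "receive", "findings ready", "e1"),
    TraceEntry(2.0, "task-1", "e3", "analyst", "send", "analysis ready"),
    TraceEntry(1.5, "task-2", "e4", "researcher", "decision", "unrelated task"),
]

lanes = swim_lanes(entries, "task-1")
lost = unanswered_sends(entries, "task-1")
assert [e.entry_id for e in lost] == ["e3"]  # the writer never received the analysis
```

Filtering on the shared `trace_id` first keeps entries from concurrent tasks out of the timeline, which is what makes the per-agent lanes readable.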

Exercise 26.5.5: Regression Testing Strategy Conceptual

Describe a regression testing strategy for a multi-agent system that is updated frequently. How do you balance test coverage with test execution time?

Answer Sketch

Maintain a golden test set of representative tasks with verified outputs. Run the full set on every major release. For frequent updates, use a smaller smoke test set (10% of tasks) that covers the most critical paths. Use LLM-as-judge to evaluate output quality for tasks without exact-match answers. Track metrics over time to detect gradual degradation. Cache expensive tool calls in tests to reduce cost and latency.

Self-Check
Q1: Why is testing agent systems fundamentally harder than testing traditional software?

Agents have non-deterministic behavior (the same input can produce different outputs), multi-step execution paths, dependencies on external LLM APIs, and emergent behaviors from tool interactions. Traditional unit testing alone cannot capture these properties; it must be complemented by trajectory-level testing and chaos engineering.

Q2: What is chaos testing for multi-agent systems, and what does it reveal?

Chaos testing deliberately injects failures (tool timeouts, malformed responses, agent crashes, network partitions) into a running multi-agent system to verify that the system degrades gracefully rather than catastrophically. It reveals hidden dependencies, missing error handlers, and cascading failure paths.

What Comes Next

Continue to Part VII: Multimodal and Applications for the next major topic in the book.
