Reliability Engineering for LLM Applications

Section 66.1

"A reliable system is not one that never fails. It is one that fails gracefully, recovers quickly, and tells you exactly what went wrong."

Deploy, Gracefully Failing AI Agent
Big Picture

LLM applications fail in ways that traditional software reliability engineering never anticipated. A web server either returns a page or it does not. An LLM can return a response that is syntactically valid, passes all type checks, and is completely wrong. This dual nature of failures (hard infrastructure failures alongside soft semantic failures) demands a new reliability framework that combines classical patterns (retries, circuit breakers, fallback chains) with LLM-specific defenses (hallucination detection, output validation, semantic accuracy monitoring). This section presents a comprehensive reliability engineering approach covering failure taxonomy, resilience patterns, guardrails, SLO design, incident response, and chaos engineering for LLM systems.

Prerequisites

This section builds on observability patterns from Chapter 42 and agent architectures from Section 26.1: Foundations of AI Agents. Application-architecture and deployment patterns are revisited in detail later in the book.

A cartoon robot engineer building a safety net underneath a tightrope while another robot carefully walks the tightrope carrying a production workload.
Figure 66.1.1: A reliable system is not one that never fails. It is one that fails gracefully, recovers quickly, and tells you exactly what went wrong.

66.1.1 LLM Failure Taxonomy

Your monitoring dashboard is green. Latency p99 is fine. Zero error rate. And yet the support inbox is filling up with users complaining that the answers are wrong. This is the failure mode that traditional reliability engineering was never designed to catch. LLM application failures split into two broad categories: hard failures where the system produces no usable response, and soft failures where the system produces a response that appears valid but contains errors. Hard failures are easier to detect and handle. Soft failures are more dangerous because they propagate silently through downstream systems and erode user trust gradually.

Fun Fact

During a major cloud provider outage in 2024, a well-designed LLM application with a proper fallback chain seamlessly routed 100% of traffic to an alternative provider for four hours. Users noticed nothing. The team that skipped the fallback chain spent those same four hours refreshing a status page and hand-typing apology messages to customers.

66.1.1.1 Hard Failures

Hard failures produce clear error signals: HTTP 429 rate limits, 500 server errors, connection timeouts, context length exceeded errors, and authentication failures. Traditional retry logic handles the transient ones well (rate limits, server errors, timeouts); authentication and context-length errors fail identically on every retry and need a configuration or prompt fix instead. The challenge with LLM APIs is that hard failure rates vary dramatically by provider, model, time of day, and request size. For example, a system that sees 0.1% error rates at 10 requests per second may see 5% error rates at 100 requests per second due to provider-side throttling.

66.1.1.2 Soft Failures

Soft failures are unique to LLM systems and far more insidious. The model returns HTTP 200, the response parses correctly, but the content is wrong. Common soft failure modes include: hallucination (factually incorrect claims presented with high confidence; the deep treatment lives in Section 49.5), refusal (the model declines a legitimate request due to overly cautious safety filters), format violation (the model returns prose when JSON was requested), instruction drift (the model ignores part of the system prompt under long contexts), and quality degradation (responses become generic or repetitive under high load or after provider-side model updates).

import json
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class FailureCategory(str, Enum):
    # Hard failures
    RATE_LIMIT = "rate_limit"
    TIMEOUT = "timeout"
    SERVER_ERROR = "server_error"
    AUTH_ERROR = "auth_error"
    CONTEXT_OVERFLOW = "context_overflow"
    # Soft failures
    HALLUCINATION = "hallucination"
    REFUSAL = "refusal"
    FORMAT_VIOLATION = "format_violation"
    INSTRUCTION_DRIFT = "instruction_drift"
    QUALITY_DEGRADATION = "quality_degradation"


@dataclass
class FailureEvent:
    category: FailureCategory
    timestamp: datetime
    model: str
    request_id: str
    details: str
    is_hard: bool
    retryable: bool


def classify_failure(
    status_code: int,
    response_body: str | None,
    expected_format: str | None = None,
) -> FailureEvent | None:
    """Classify an LLM API response into failure categories."""
    # model and request_id are placeholders in this fragment.
    base = {"timestamp": datetime.now(), "model": "", "request_id": ""}

    # Hard failure classification
    if status_code == 429:
        return FailureEvent(
            category=FailureCategory.RATE_LIMIT,
            details="Rate limit exceeded",
            is_hard=True, retryable=True, **base
        )
    if status_code in (401, 403):
        # Retrying with the same credentials cannot succeed.
        return FailureEvent(
            category=FailureCategory.AUTH_ERROR,
            details=f"Authentication failed (HTTP {status_code})",
            is_hard=True, retryable=False, **base
        )
    if status_code in (408, 504):
        return FailureEvent(
            category=FailureCategory.TIMEOUT,
            details=f"Request timed out (HTTP {status_code})",
            is_hard=True, retryable=True, **base
        )
    if status_code >= 500:
        return FailureEvent(
            category=FailureCategory.SERVER_ERROR,
            details=f"Server error (HTTP {status_code})",
            is_hard=True, retryable=True, **base
        )
    # Context-overflow errors are reported in provider-specific ways
    # and are not detected in this fragment.

    # Soft failure classification (HTTP 200 but bad content)
    if status_code == 200 and response_body:
        if expected_format == "json":
            try:
                json.loads(response_body)
            except json.JSONDecodeError:
                return FailureEvent(
                    category=FailureCategory.FORMAT_VIOLATION,
                    details="Expected JSON but received non-JSON response",
                    is_hard=False, retryable=True, **base
                )
        refusal_phrases = (
            "I cannot", "I'm unable to", "I apologize, but I can't",
            "I'm not able to", "As an AI, I cannot",
        )
        # str.startswith accepts a tuple of candidate prefixes.
        if response_body.strip().startswith(refusal_phrases):
            return FailureEvent(
                category=FailureCategory.REFUSAL,
                details="Model refused the request",
                is_hard=False, retryable=True, **base
            )
    return None  # No failure detected
Code Fragment 66.1.1a: A FailureCategory enum that distinguishes hard failures (rate limits, timeouts, HTTP 5xx) from soft failures (hallucination, refusal, format violation), plus a classify_failure function that detects the cases visible from the status code and response text. Classifying every error explicitly is the prerequisite for the differentiated retry/fallback logic that follows: hard failures want backoff, soft failures want a different prompt or model.
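
To make the "hard failures want backoff, soft failures want a different prompt or model" rule concrete, the following is a minimal sketch of a recovery policy table. The Action enum, the RECOVERY_POLICY mapping, and choose_action are illustrative names that do not appear elsewhere in this section, and the specific assignments are one reasonable choice rather than a prescribed policy.

```python
from enum import Enum


class Action(str, Enum):
    BACKOFF_RETRY = "backoff_retry"  # wait, then resend the same request
    REPROMPT = "reprompt"            # resend with a tightened prompt
    SWITCH_MODEL = "switch_model"    # move to the next provider or model
    FAIL_FAST = "fail_fast"          # retrying cannot help; surface the error


# Hypothetical policy keyed by the category values used in this section.
RECOVERY_POLICY: dict[str, Action] = {
    "rate_limit": Action.BACKOFF_RETRY,
    "timeout": Action.BACKOFF_RETRY,
    "server_error": Action.BACKOFF_RETRY,
    "auth_error": Action.FAIL_FAST,
    "context_overflow": Action.FAIL_FAST,
    "format_violation": Action.REPROMPT,
    "refusal": Action.REPROMPT,
    "instruction_drift": Action.REPROMPT,
    "hallucination": Action.SWITCH_MODEL,
    "quality_degradation": Action.SWITCH_MODEL,
}


def choose_action(category: str) -> Action:
    """Look up the recovery action; unknown categories fail fast."""
    return RECOVERY_POLICY.get(category, Action.FAIL_FAST)


print(choose_action("rate_limit").value)        # backoff_retry
print(choose_action("format_violation").value)  # reprompt
```

Keeping the policy in a table rather than in branching code makes it easy to review and to adjust per deployment.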

66.1.2 Cascading Failures in Multi-Agent Systems

Multi-agent architectures amplify failure risks through error cascading. When Agent A calls Agent B, which calls Agent C, a single soft failure at the deepest level can propagate upward and corrupt the final output in ways that are nearly impossible to trace without structured observability. Consider a research agent that queries a retrieval agent for sources, then passes those sources to a summarization agent. If the retrieval agent hallucinates a non-existent paper, the summarization agent will faithfully summarize a paper that does not exist, and the research agent will cite it with full confidence.

Error amplification follows a multiplicative model. If each agent in a three-agent chain has a 5% soft failure rate, the end-to-end reliability is not 95% but roughly $0.95^3 = 85.7\%$, assuming failures are independent. In practice, failures are correlated (a model degradation event affects all agents using that model), making the actual reliability worse.
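
The multiplicative model is easy to check numerically. The sketch below assumes independent failures, exactly as the paragraph above does; chain_reliability is an illustrative helper name.

```python
def chain_reliability(per_agent_failure_rate: float, num_agents: int) -> float:
    """Probability that every agent in a sequential chain succeeds,
    assuming failures are independent."""
    if not 0.0 <= per_agent_failure_rate <= 1.0:
        raise ValueError("failure rate must be between 0 and 1")
    return (1.0 - per_agent_failure_rate) ** num_agents


for n in (1, 3, 5, 10):
    print(f"{n:>2} agents: {chain_reliability(0.05, n):.1%}")
#  1 agents: 95.0%
#  3 agents: 85.7%
#  5 agents: 77.4%
# 10 agents: 59.9%
```

At ten hops, a per-agent soft failure rate that looks tolerable in isolation leaves roughly four in ten end-to-end runs affected.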

Key Insight: Mental Model: The Telephone Game

Multi-agent cascading failures work like the children's telephone game. Each agent whispers the message to the next, and small distortions compound at every step. By the time the message reaches the last agent, it may be unrecognizable. The solution is the same one that engineers use in communication systems: error detection and correction at every hop, not just at the endpoints. Each agent should validate its inputs (did the previous agent actually return what was requested?), verify its outputs (does my response meet the quality bar?), and signal uncertainty explicitly rather than passing confident-sounding garbage downstream.
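
One way to sketch per-hop error detection is to pass an explicit confidence flag alongside each payload. HopResult, run_hop, and the toy checks below are hypothetical names used only for illustration; real input and output checks would be domain-specific.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class HopResult:
    """What one agent hands to the next: a payload plus an explicit confidence flag."""
    payload: str
    confident: bool
    notes: str = ""


def run_hop(
    agent: Callable[[str], str],
    upstream: HopResult,
    check_input: Callable[[str], bool],
    check_output: Callable[[str], bool],
) -> HopResult:
    """Run one agent with validation on both sides of the call."""
    # Refuse to build on an upstream result that is flagged or malformed.
    if not upstream.confident or not check_input(upstream.payload):
        return HopResult(upstream.payload, confident=False, notes="input rejected")
    output = agent(upstream.payload)
    if not check_output(output):
        return HopResult(output, confident=False, notes="output failed quality check")
    return HopResult(output, confident=True)


# Toy stand-ins for a summarization agent and its checks.
def summarize(text: str) -> str:
    return f"Summary: {text[:40]}"


def has_source_id(text: str) -> bool:
    return "source_id:" in text


def non_empty(text: str) -> bool:
    return bool(text.strip())


good = run_hop(summarize, HopResult("source_id:42 Findings on retries", True), has_source_id, non_empty)
bad = run_hop(summarize, HopResult("A paper with no identifier", True), has_source_id, non_empty)
print(good.confident, bad.confident, bad.notes)
```

The key property is that a rejected input never reaches the agent, so the uncertainty is propagated explicitly instead of being summarized into confident-sounding prose.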

66.1.3 Resilience Patterns

Three classical resilience patterns adapt well to LLM applications: retries with exponential backoff, fallback chains, and circuit breakers. The key adaptation is that these patterns must handle both hard failures (where the pattern triggers on HTTP errors) and soft failures (where the pattern triggers on content quality checks).

import asyncio
import random
import time
from typing import Callable, Any


class RetryConfig:
    def __init__(
        self,
        max_retries: int = 3,
        base_delay: float = 1.0,
        max_delay: float = 60.0,
        jitter: bool = True,
        retryable_status_codes: set[int] | None = None,
    ):
        self.max_retries = max_retries
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.jitter = jitter
        self.retryable_status_codes = retryable_status_codes or {429, 500, 502, 503, 504}


async def retry_with_backoff(
    func: Callable,
    config: RetryConfig,
    *args,
    **kwargs,
) -> Any:
    """Execute an async function with exponential backoff retry logic.

    Handles both hard failures (exceptions, bad status codes) and
    soft failures (via an optional validator function in kwargs).
    """
    validator = kwargs.pop("response_validator", None)
    last_exception = None
    for attempt in range(config.max_retries + 1):
        try:
            result = await func(*args, **kwargs)
            # Check for soft failures if a validator is provided
            if validator and not validator(result):
                if attempt < config.max_retries:
                    delay = min(
                        config.base_delay * (2 ** attempt),
                        config.max_delay,
                    )
                    if config.jitter:
                        delay *= (0.5 + random.random())
                    print(f"Soft failure on attempt {attempt + 1}, retrying in {delay:.1f}s")
                    await asyncio.sleep(delay)
                    continue
                raise ValueError(f"Response failed validation after {config.max_retries + 1} attempts")
            return result
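        except Exception as e:
            last_exception = e
            # Errors carrying a non-retryable HTTP status (e.g. 401) are raised at once;
            # exceptions without a status_code attribute are treated as retryable.
            status = getattr(e, "status_code", None)
            retryable = status is None or status in config.retryable_status_codes
            if retryable and attempt < config.max_retries:
                delay = min(
                    config.base_delay * (2 ** attempt),
                    config.max_delay,
                )
                if config.jitter:
                    delay *= (0.5 + random.random())
                print(f"Attempt {attempt + 1} failed: {e}. Retrying in {delay:.1f}s")
                await asyncio.sleep(delay)
            else:
                raise last_exception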
Code Fragment 66.1.2: An async retry wrapper with exponential backoff, jitter, and a max_retries cap configured via RetryConfig. The jitter (a random multiplier between 0.5 and 1.5) is critical at scale: without it, many failed clients retry on the same tick and create thundering-herd spikes that prolong the outage.
Library Shortcut: Tenacity for Retry and Resilience

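A comparable retry policy for hard failures fits in a few lines with Tenacity. This version retries only on the listed exceptions and does not run a soft-failure validator: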

# pip install tenacity
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
from openai import RateLimitError, APITimeoutError
@retry(
    stop=stop_after_attempt(4),
    wait=wait_exponential(multiplier=1, min=1, max=60),
    retry=retry_if_exception_type((RateLimitError, APITimeoutError)),
    )
async def resilient_completion(client, **kwargs):
    return await client.chat.completions.create(**kwargs)
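Code Fragment 66.1.9: The Tenacity version of the retry wrapper: up to four attempts with exponential waits between 1 and 60 seconds, retrying only on rate-limit and timeout exceptions raised by the OpenAI client.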
import asyncio
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class FallbackProvider:
    name: str
    call: Callable  # async callable that takes a prompt and returns a response
    priority: int
    max_latency_ms: float = 30000.0


@dataclass
class FallbackChain:
    """Implements a fallback chain across multiple LLM providers.

    If the primary provider fails or is too slow, fall back to
    secondary providers in priority order. As a last resort,
    return a cached or template response.
    """

    providers: list[FallbackProvider] = field(default_factory=list)
    cache: dict[str, str] = field(default_factory=dict)
    default_response: str = "I'm experiencing technical difficulties. Please try again shortly."

    async def call(self, prompt: str, validator: Callable | None = None) -> dict:
        """Try each provider in priority order until one succeeds."""
        self.providers.sort(key=lambda p: p.priority)
        # Truncated key: prompts that share their first 200 characters collide.
        cache_key = prompt[:200]
        errors = []
        for depth, provider in enumerate(self.providers):
            try:
                start = time.monotonic()
                result = await asyncio.wait_for(
                    provider.call(prompt),
                    timeout=provider.max_latency_ms / 1000.0,
                )
                latency = (time.monotonic() - start) * 1000
                # Validate response quality
                if validator and not validator(result):
                    errors.append(f"{provider.name}: failed validation")
                    continue
                # Cache successful responses
                self.cache[cache_key] = result
                return {
                    "response": result,
                    "provider": provider.name,
                    "latency_ms": latency,
                    "fallback_depth": depth,
                }
            except asyncio.TimeoutError:
                errors.append(f"{provider.name}: timeout after {provider.max_latency_ms}ms")
            except Exception as e:
                errors.append(f"{provider.name}: {type(e).__name__}: {e}")

        # Last resort: every provider failed, so check the cache
        if cache_key in self.cache:
            return {
                "response": self.cache[cache_key],
                "provider": "cache",
                "latency_ms": 0,
                "fallback_depth": len(self.providers),
            }
        return {
            "response": self.default_response,
            "provider": "default",
            "latency_ms": 0,
            "fallback_depth": len(self.providers) + 1,
            "errors": errors,
        }
Code Fragment 66.1.3: A FallbackChain that walks a priority-ordered list of FallbackProvider entries until one returns a response that passes validation, then falls back to a cached response and finally to a static default message. Each provider exposes a uniform async call interface, so the chain code does not care which LLM API sits behind the next entry.

66.1.4 Circuit Breaker Pattern

A circuit breaker prevents a failing provider from consuming retry budgets and adding latency when it is clearly unavailable. The circuit has three states: closed (normal operation, requests flow through), open (provider is down, requests fail immediately without attempting the call), and half-open (after a cooldown period, a small number of test requests, a single one in the classic formulation, is allowed through to check whether the provider has recovered).

For LLM applications, the circuit breaker should track both hard and soft failure rates. A provider that returns HTTP 200 but hallucinates on 30% of responses is just as unreliable as one that returns HTTP 500 on 30% of requests. The failure threshold should be configurable per failure category: you might tolerate a 5% rate limit error rate (transient, self-resolving) but open the circuit at a 2% hallucination rate (systemic, requiring investigation).

import time
from collections import deque
from enum import Enum


class CircuitState(str, Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class LLMCircuitBreaker:
    """Circuit breaker that monitors both hard and soft LLM failures."""

    def __init__(
        self,
        failure_threshold: float = 0.3,
        window_size: int = 100,
        cooldown_seconds: float = 60.0,
        half_open_max_calls: int = 3,
    ):
        self.failure_threshold = failure_threshold
        self.window_size = window_size
        self.cooldown_seconds = cooldown_seconds
        self.half_open_max_calls = half_open_max_calls
        self.state = CircuitState.CLOSED
        self.results: deque[bool] = deque(maxlen=window_size)  # True = success
        self.last_failure_time: float = 0
        self.half_open_calls: int = 0

    @property
    def failure_rate(self) -> float:
        if not self.results:
            return 0.0
        failures = sum(1 for r in self.results if not r)
        return failures / len(self.results)

    def can_execute(self) -> bool:
        """Check if a request should be allowed through."""
        if self.state == CircuitState.CLOSED:
            return True
        if self.state == CircuitState.OPEN:
            # Check if cooldown has elapsed
            if time.monotonic() - self.last_failure_time >= self.cooldown_seconds:
                self.state = CircuitState.HALF_OPEN
                self.half_open_calls = 0
                return True
            return False
        if self.state == CircuitState.HALF_OPEN:
            return self.half_open_calls < self.half_open_max_calls
        return False

    def record_success(self):
        """Record a successful call (both hard and soft success)."""
        self.results.append(True)
        if self.state == CircuitState.HALF_OPEN:
            self.half_open_calls += 1
            if self.half_open_calls >= self.half_open_max_calls:
                # All test calls succeeded; close the circuit
                self.state = CircuitState.CLOSED
                # Drop pre-outage history so one new failure cannot reopen it at once
                self.results.clear()

    def record_failure(self):
        """Record a failed call (hard or soft failure)."""
        self.results.append(False)
        self.last_failure_time = time.monotonic()
        if self.state == CircuitState.HALF_OPEN:
            # Any failure in half-open state reopens the circuit
            self.state = CircuitState.OPEN
        elif self.state == CircuitState.CLOSED:
            # Require at least 10 samples before judging the failure rate
            if len(self.results) >= 10 and self.failure_rate >= self.failure_threshold:
                self.state = CircuitState.OPEN
Code Fragment 66.1.4: A circuit breaker that opens not just on HTTP errors but on soft failures (refusals, hallucinations flagged by guardrails) measured over a sliding window deque. Standard breakers miss the case where the provider is up but the model is returning garbage; tracking soft failures catches it.
Library Shortcut: pybreaker for Circuit Breaker

A simpler breaker based on consecutive failures takes a few lines with pybreaker:

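# pip install pybreaker
import pybreaker

breaker = pybreaker.CircuitBreaker(
    fail_max=5,  # open after 5 consecutive failures
    reset_timeout=30,  # allow a trial call after 30 seconds
)


# Assumption: the decorator wraps a plain function call and does not await
# coroutines, so this sketch uses a synchronous client. On an `async def`
# it would only see the coroutine being created, not errors raised on await.
@breaker
def call_llm(client, prompt):
    return client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": prompt}]
    )

# Calls raise pybreaker.CircuitBreakerError while the circuit is open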
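Code Fragment 66.1.10: A pybreaker circuit breaker that opens after five consecutive failures and allows a trial call after 30 seconds. It counts only raised exceptions, so soft failures must be converted to exceptions before it can track them.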

66.1.5 Guardrails as Infrastructure

Guardrails are validation layers that sit between the LLM and the user, checking both inputs and outputs for safety, correctness, and format compliance. Unlike ad hoc validation scattered throughout application code, production guardrails should be implemented as infrastructure: a centralized, configurable, and observable validation pipeline that every request passes through.

A well-designed guardrail pipeline validates inputs before they reach the model (rejecting prompt injections, enforcing length limits, checking for PII) and validates outputs before they reach the user (enforcing JSON schema compliance, detecting hallucinated citations, checking response coherence). Each guardrail should be independently toggleable, measurable, and bypassable for debugging.

import time
import json
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class GuardrailResult:
    passed: bool
    guardrail_name: str
    details: str = ""
    latency_ms: float = 0.0


class GuardrailPipeline:
    """Centralized input/output validation for LLM applications."""

    def __init__(self):
        self.input_guardrails: list = []
        self.output_guardrails: list = []

    def add_input_guardrail(self, name: str, check_fn: Callable[[str], GuardrailResult]):
        self.input_guardrails.append((name, check_fn))

    def add_output_guardrail(self, name: str, check_fn: Callable[[str, str], GuardrailResult]):
        self.output_guardrails.append((name, check_fn))

    def validate_input(self, user_input: str) -> list[GuardrailResult]:
        results = []
        for name, check_fn in self.input_guardrails:
            start = time.monotonic()
            result = check_fn(user_input)
            result.latency_ms = (time.monotonic() - start) * 1000
            results.append(result)
            if not result.passed:
                break  # fail fast on first input guardrail failure
        return results

    def validate_output(self, prompt: str, response: str) -> list[GuardrailResult]:
        results = []
        for name, check_fn in self.output_guardrails:
            start = time.monotonic()
            result = check_fn(prompt, response)
            result.latency_ms = (time.monotonic() - start) * 1000
            results.append(result)
        return results


# Example guardrails
def prompt_injection_check(user_input: str) -> GuardrailResult:
    """Detect common prompt injection patterns."""
    injection_patterns = [
        r"ignore\s+(all\s+)?previous\s+instructions",
        r"you\s+are\s+now\s+",
        r"system\s*:\s*",
        r"<\|im_start\|>",
    ]
    for pattern in injection_patterns:
        if re.search(pattern, user_input, re.IGNORECASE):
            return GuardrailResult(
                passed=False,
                guardrail_name="prompt_injection",
                details=f"Matched injection pattern: {pattern}",
            )
    return GuardrailResult(passed=True, guardrail_name="prompt_injection")


def json_schema_check(prompt: str, response: str) -> GuardrailResult:
    """Check that the response parses as a JSON object (a stand-in for full schema validation)."""
    try:
        parsed = json.loads(response)
        if not isinstance(parsed, dict):
            return GuardrailResult(
                passed=False,
                guardrail_name="json_schema",
                details="Response is valid JSON but not an object",
            )
        return GuardrailResult(passed=True, guardrail_name="json_schema")
    except json.JSONDecodeError as e:
        return GuardrailResult(
            passed=False,
            guardrail_name="json_schema",
            details=f"Invalid JSON: {e}",
        )


# Assembly
pipeline = GuardrailPipeline()
pipeline.add_input_guardrail("prompt_injection", prompt_injection_check)
pipeline.add_output_guardrail("json_schema", json_schema_check)
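Code Fragment 66.1.5: A GuardrailPipeline that runs registered input checks (failing fast on the first rejection) and output checks (running all of them), timing each one. The two example guardrails detect common prompt-injection patterns and verify that the response parses as a JSON object.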
Library Shortcut: Guardrails AI for Safety Guardrails

A comparable pipeline with prebuilt validators from Guardrails AI:

# pip install guardrails-ai
# Hub validators are installed separately with the `guardrails hub install` command.
import guardrails as gd
import openai
from guardrails.hub import DetectPII, ToxicLanguage, ValidJSON

guard = gd.Guard().use_many(
    DetectPII(pii_entities=["EMAIL_ADDRESS", "PHONE_NUMBER"], on_fail="fix"),
    ToxicLanguage(threshold=0.8, on_fail="noop"),
    ValidJSON(on_fail="reask"),
)
user_input = "Summarize this support ticket as JSON."  # placeholder input
result = guard(
    llm_api=openai.chat.completions.create,
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": user_input}],
)
print(result.validated_output)
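Code Fragment 66.1.11: A Guardrails AI guard that chains PII detection, toxicity scoring, and JSON validation around a chat completion call, with a separate on_fail policy for each validator.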
Note

Guardrail latency adds directly to user-perceived response time. A guardrail pipeline with five sequential checks averaging 50ms each adds 250ms to every request. In practice, run input guardrails synchronously (they must complete before the LLM call) and output guardrails asynchronously where possible (stream the response to the user while validating in the background, with a mechanism to retract or flag responses that fail validation). For structured output requirements, consider using constrained decoding (via provider features like OpenAI's response_format or Anthropic's tool use) instead of relying only on post-hoc validation. Schema-constrained generation prevents most format violations at the source, though truncated or refused responses can still fail to parse, so a cheap format check remains worthwhile.
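
The stream-then-flag approach can be sketched as follows. This is a simplified illustration: fake_token_stream stands in for a provider's streaming API, the output check is a toy, and validation here runs once the stream has finished rather than concurrently with it. All names are hypothetical.

```python
import asyncio
from typing import AsyncIterator, Callable


async def fake_token_stream(text: str) -> AsyncIterator[str]:
    """Stand-in for a provider's streaming API (hypothetical)."""
    for token in text.split(" "):
        await asyncio.sleep(0)  # yield control, as a network read would
        yield token + " "


async def stream_then_validate(
    tokens: AsyncIterator[str],
    emit: Callable[[str], None],
    output_check: Callable[[str], bool],
) -> dict:
    """Forward tokens as they arrive, then validate the assembled response."""
    parts: list[str] = []
    async for token in tokens:
        parts.append(token)
        emit(token)  # the user sees this token immediately
    full = "".join(parts).strip()
    passed = output_check(full)
    if not passed:
        # Flag after the fact, since the text has already been shown.
        emit("\n[This response was flagged by a quality check and may be unreliable.]")
    return {"response": full, "passed": passed}


shown: list[str] = []
result = asyncio.run(
    stream_then_validate(
        fake_token_stream("The report is due on the 31st of February"),
        shown.append,
        lambda text: "31st of February" not in text,  # toy check for an impossible date
    )
)
print(result["passed"], shown[-1].strip())
```

The trade-off is explicit: the user gets the first token without waiting for validation, at the cost of occasionally seeing text that is flagged a moment later.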

66.1.6 SLOs for LLM Systems

Traditional Service Level Objectives (SLOs) measure availability (99.9% uptime), latency (p95 under 200ms), and error rate (under 0.1%). These metrics are necessary but insufficient for LLM systems. A chatbot that is always available and responds quickly but hallucinates 10% of the time is not meeting user expectations, even if all traditional SLOs are green.

LLM-specific SLOs should include: semantic accuracy (percentage of responses that are factually correct, measured by periodic human evaluation or automated judges), hallucination rate (percentage of responses containing fabricated information), time to first token (TTFT, critical for streaming applications), goodput (percentage of requests that produce a useful response, excluding refusals and format violations), and cost per useful response (total API spend, including retries and fallbacks, divided by the number of useful responses).

import time
from collections import deque
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class SLODefinition:
    name: str
    target: float  # target value (e.g., 0.99 for 99%)
    window_minutes: int = 60
    breach_callback: Callable | None = None
    higher_is_better: bool = True  # set False for latency and error-rate SLOs


@dataclass
class SLOTracker:
    """Track LLM-specific SLOs with rolling windows."""

    slos: dict[str, SLODefinition] = field(default_factory=dict)
    measurements: dict[str, deque] = field(default_factory=dict)

    def define_slo(self, slo: SLODefinition):
        self.slos[slo.name] = slo
        self.measurements[slo.name] = deque()

    def record(self, slo_name: str, value: float):
        """Record a measurement for an SLO."""
        if slo_name not in self.slos:
            return
        now = time.monotonic()
        self.measurements[slo_name].append((now, value))
        # Prune old measurements outside the window
        window_seconds = self.slos[slo_name].window_minutes * 60
        while (self.measurements[slo_name]
               and now - self.measurements[slo_name][0][0] > window_seconds):
            self.measurements[slo_name].popleft()

    def current_value(self, slo_name: str) -> float | None:
        """Get current SLO value (the mean of samples) over the rolling window."""
        if slo_name not in self.measurements:
            return None
        values = [v for _, v in self.measurements[slo_name]]
        if not values:
            return None
        return sum(values) / len(values)

    def check_breaches(self) -> list[str]:
        """Check all SLOs and return names of any that are breached."""
        breaches = []
        for name, slo in self.slos.items():
            current = self.current_value(name)
            if current is None:
                continue
            # Direction matters: accuracy breaches when it drops below target,
            # latency and hallucination rate breach when they rise above it.
            if slo.higher_is_better:
                breached = current < slo.target
            else:
                breached = current > slo.target
            if breached:
                breaches.append(name)
                if slo.breach_callback:
                    slo.breach_callback(name, current, slo.target)
        return breaches


# Define LLM-specific SLOs
tracker = SLOTracker()
tracker.define_slo(SLODefinition("availability", target=0.999, window_minutes=60))
tracker.define_slo(SLODefinition("goodput", target=0.95, window_minutes=60))
# current_value() averages samples, so record pre-aggregated p95 values for this SLO.
tracker.define_slo(SLODefinition("ttft_p95_ms", target=800.0, window_minutes=30, higher_is_better=False))
tracker.define_slo(SLODefinition("hallucination_rate", target=0.02, window_minutes=1440, higher_is_better=False))
tracker.define_slo(SLODefinition("semantic_accuracy", target=0.90, window_minutes=1440))
Code Fragment 66.1.6: SLODefinition dataclasses for availability (99.9%), goodput (95%), time to first token (800 ms), hallucination rate (2%), and semantic accuracy (90%), with a rolling deque per metric to compute live SLI values. The pattern translates cloud-style SLOs to LLM-specific KPIs that ops dashboards can alert on.

66.1.7 Incident Response for LLM Systems

LLM incidents differ from traditional software incidents in three important ways. First, detection is harder: a model degradation event may not trigger any alerts because the system continues returning HTTP 200 with plausible-looking responses. Second, root cause analysis is more complex: was the regression caused by a provider-side model update, a prompt change, a data distribution shift, or a combination? Third, rollback is less straightforward: you cannot simply revert a deployment if the problem is a provider-side model version change.

An effective LLM incident response playbook covers four phases: detection (SLO breach alerts, user complaint spikes, automated quality monitoring), triage (is this a hard failure, soft failure, cost anomaly, or security incident?), mitigation (switch to fallback provider, revert prompt changes, enable cached responses), and resolution (root cause identification, prevention measures, post-incident review).

Warning: Silent Model Updates

Cloud LLM providers routinely update model weights, safety filters, and rate limits without advance notice. OpenAI's "gpt-4o" endpoint may behave differently on Tuesday than it did on Monday with no version change visible in the API. This means that any LLM application can experience regression at any time without any deployment change on your side. The mitigation is threefold: pin model versions when available (use "gpt-4o-2024-08-06" instead of "gpt-4o"), run continuous evaluation against a fixed test set, and maintain fallback configurations that can be activated within minutes.
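
Continuous evaluation against a fixed test set can start very small. The sketch below assumes a hypothetical three-item golden set with substring matching and a call_model function supplied by the caller; a production suite would use many more cases and a stronger grader.

```python
from typing import Callable

# Hypothetical fixed test set: (prompt, substring the answer must contain).
GOLDEN_SET = [
    ("What is 2 + 2? Answer with a number.", "4"),
    ("Give the chemical formula for water.", "H2O"),
    ("Reply with the single word: pong", "pong"),
]


def pass_rate(call_model: Callable[[str], str], cases=GOLDEN_SET) -> float:
    """Fraction of golden-set prompts whose answer contains the expected text."""
    passed = sum(1 for prompt, expected in cases if expected in call_model(prompt))
    return passed / len(cases)


def regression_detected(current: float, baseline: float, tolerance: float = 0.05) -> bool:
    """Flag a regression when the pass rate drops more than `tolerance` below baseline."""
    return current < baseline - tolerance


# Stub models standing in for the same endpoint on two different days.
def monday_model(prompt: str) -> str:
    return dict(GOLDEN_SET)[prompt]


def tuesday_model(prompt: str) -> str:
    return "I cannot help with that."


baseline = pass_rate(monday_model)
current = pass_rate(tuesday_model)
print(baseline, current, regression_detected(current, baseline))
```

Running this check on a schedule, with no deployment on your side, is what turns a silent provider-side change into an alert.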

66.1.8 Chaos Engineering for LLM Systems

Chaos engineering tests system resilience by intentionally injecting failures in controlled conditions. For LLM systems, the failure injection surface extends beyond traditional infrastructure failures to include LLM-specific scenarios: simulating hallucinations, injecting adversarial prompts, throttling API responses to test timeout handling, and corrupting tool call responses to test error propagation.

A practical chaos engineering program for LLM applications should test five scenarios: provider outage (block all API calls to the primary provider), latency injection (add 5 to 30 seconds of delay to test timeout and fallback behavior), response corruption (replace model responses with random text to test output validation), prompt injection (submit known adversarial inputs to test guardrail effectiveness), and cost explosion (simulate a bug that sends 100x normal request volume to test budget controls).

Lab: Resilient LLM Client

Objective

This lab combines all resilience patterns into a single client that handles retries, fallback chains, circuit breaking, guardrail validation, and budget-aware stopping. The client is designed for production use where reliability matters more than raw performance.

Setup

You need asyncio-capable Python (3.10+), API keys for at least two LLM providers (the fallback chain assumes a primary and a secondary), and the supporting classes from earlier in this section: FallbackChain, LLMCircuitBreaker, GuardrailPipeline, RetryConfig, and SLOTracker. Set a budget limit in USD that bounds your worst-case spend during testing.

Steps

The implementation below walks the six-step request flow: budget check, input guardrails, circuit-breaker check, fallback chain with retries, output guardrails, and SLO recording. Read the complete() coroutine top to bottom and trace which short-circuit corresponds to which failure mode.

import asyncio
import time


class BudgetExhausted(Exception):
    pass


class ResilientLLMClient:
    """Production LLM client with retries, fallback, circuit breaker,
    guardrails, and budget controls.
    """

    def __init__(
        self,
        fallback_chain: "FallbackChain",
        circuit_breaker: "LLMCircuitBreaker",
        guardrail_pipeline: "GuardrailPipeline",
        retry_config: "RetryConfig",
        budget_limit_usd: float = 100.0,
    ):
        self.fallback_chain = fallback_chain
        self.circuit_breaker = circuit_breaker
        self.guardrails = guardrail_pipeline
        self.retry_config = retry_config
        self.budget_limit_usd = budget_limit_usd
        self.budget_spent_usd: float = 0.0
        self.request_count: int = 0
        self.slo_tracker = SLOTracker()

    async def complete(
        self,
        prompt: str,
        expected_format: str | None = None,
        cost_per_request: float = 0.01,
    ) -> dict:
        """Send a prompt through the full resilience stack.

        Flow:
            1. Budget check
            2. Input guardrails
            3. Circuit breaker check
            4. Fallback chain with retries
            5. Output guardrails
            6. SLO recording
        """
        start_time = time.monotonic()
        # 1. Budget check
        if self.budget_spent_usd + cost_per_request > self.budget_limit_usd:
            raise BudgetExhausted(
                f"Budget exhausted: ${self.budget_spent_usd:.2f} of "
                f"${self.budget_limit_usd:.2f} spent"
            )
        # 2. Input guardrails
        input_results = self.guardrails.validate_input(prompt)
        for result in input_results:
            if not result.passed:
                return {
                    "response": "Request blocked by input guardrails.",
                    "blocked": True,
                    "guardrail": result.guardrail_name,
                    "details": result.details,
                }
        # 3. Circuit breaker check
        if not self.circuit_breaker.can_execute():
            return await self._fallback_response(prompt, "circuit_open")
        # 4. Call with retries and fallback. The validator below runs the
        # 5. output guardrails on each candidate response inside the chain.
        try:
            def response_validator(result):
                if not result or not result.get("response"):
                    return False
                output_results = self.guardrails.validate_output(
                    prompt, result["response"]
                )
                return all(r.passed for r in output_results)

            result = await self.fallback_chain.call(
                prompt, validator=response_validator
            )
            self.circuit_breaker.record_success()
            self.budget_spent_usd += cost_per_request
            self.request_count += 1
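            # 6. Record SLO metrics. Note: total request latency is recorded
            #    under the TTFT metric name; a streaming client would record
            #    time to first token instead.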
            latency_ms = (time.monotonic() - start_time) * 1000
            self.slo_tracker.record("availability", 1.0)
            self.slo_tracker.record("ttft_p95_ms", latency_ms)
            is_useful = result.get("provider") not in ("cache", "default")
            self.slo_tracker.record("goodput", 1.0 if is_useful else 0.0)
            result["latency_total_ms"] = latency_ms
            result["budget_remaining_usd"] = self.budget_limit_usd - self.budget_spent_usd
            return result
        except Exception as e:
            self.circuit_breaker.record_failure()
            self.slo_tracker.record("availability", 0.0)
            self.slo_tracker.record("goodput", 0.0)
            return await self._fallback_response(prompt, str(e))

    async def _fallback_response(self, prompt: str, reason: str) -> dict:
        """Return cached or default response when all else fails."""
        cache_key = prompt[:200]
        if cache_key in self.fallback_chain.cache:
            return {
                "response": self.fallback_chain.cache[cache_key],
                "provider": "cache",
                "fallback_reason": reason,
            }
        return {
            "response": self.fallback_chain.default_response,
            "provider": "default",
            "fallback_reason": reason,
        }
Code Fragment 66.1.7: ResilientLLMClient composes every defense from the section: a client-level budget guard (BudgetExhausted), input and output guardrails, circuit breaker, fallback chain with retries, and SLO tracker. The single client.complete() entry point hides the complexity from feature code while preserving fine-grained observability.

Expected Output

A nominal call returns a dict carrying the response, the upstream provider that served it, the total latency in milliseconds, and the remaining budget. Inducing a primary-provider outage should switch the response to the secondary provider or, failing that, to a cached or default response, with provider set accordingly; tripping the circuit breaker should short-circuit subsequent calls to the fallback path without raising; exhausting the budget should raise BudgetExhausted. Inspect client.slo_tracker after a run to confirm that availability, TTFT, and goodput were recorded.

Exercises

Exercise 28.8.1: LLM Failure Taxonomy (Conceptual)

Production LLM failures fall into a small number of recurring categories. (a) Name four distinct LLM failure classes that a traditional 5xx-error monitor would miss. (b) For each, describe the customer-visible symptom. (c) What is the single biggest difference in how you instrument an LLM service versus a traditional REST API?

Answer Sketch

(a) (i) Hallucinated content (200 OK with wrong facts), (ii) prompt-injection compromise (model executes instructions from user-supplied text), (iii) refusal regression (model now refuses requests it used to handle), (iv) silent format drift (output that used to be valid JSON arrives malformed or wrapped in markdown). (b) Symptoms, in the same order: customer reports of confidently wrong information; data leaks or unauthorized actions; reduced product utility; downstream parsing crashes. (c) The biggest instrumentation difference: you must log inputs and outputs (or hashes plus structural features) because most LLM failures are content failures invisible to HTTP-status metrics. This raises privacy and storage costs that a typical stateless REST API rarely has to deal with.

Exercise 28.8.2: Predict the Cascade (Predictive)

An agent system has three components in series: (A) router model with 99% correct routing, (B) tool-use model with 95% per-step accuracy averaged over 4 tool calls, (C) summarizer with 98% accuracy. Predict: (a) end-to-end success rate; (b) which component is the highest-leverage to fix; (c) one architectural change that decouples the components and bounds blast radius.

Answer Sketch

(a) End-to-end = 0.99 × 0.95^4 × 0.98 ≈ 0.99 × 0.8145 × 0.98 ≈ 0.79 (79%). (b) Component B dominates failure: bringing per-step accuracy from 95% to 99% raises end-to-end from 79% to about 93% (0.99 × 0.99^4 × 0.98 ≈ 0.93), while perfecting A or C alone lifts it only to about 80% or 81%. (c) Decouple by adding a per-step verifier and retry budget: each tool call is checked and retried (or escalated to human review) on failure, so individual step errors don't propagate. The change converts a multiplicative failure surface into an additive one capped by retry count.
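The arithmetic behind this answer is easy to check numerically. The sketch below assumes independent failures across components and, for the retry case, a perfect verifier with independent attempts; both are idealizations, and the function names are illustrative only.

```python
def end_to_end(router: float, step: float, n_steps: int, summarizer: float) -> float:
    """Success probability of components in series, assuming independent failures."""
    return router * step**n_steps * summarizer


def step_with_retries(step: float, max_retries: int) -> float:
    """Per-step success when a verifier catches failures and allows retries.

    Assumes a perfect verifier and independent attempts (idealizations).
    """
    return 1 - (1 - step) ** (max_retries + 1)


baseline = end_to_end(0.99, 0.95, 4, 0.98)
fix_a = end_to_end(1.00, 0.95, 4, 0.98)   # perfect router
fix_b = end_to_end(0.99, 0.99, 4, 0.98)   # tool-use model at 99% per step
fix_c = end_to_end(0.99, 0.95, 4, 1.00)   # perfect summarizer
with_retry = end_to_end(0.99, step_with_retries(0.95, 1), 4, 0.98)

for name, value in [
    ("baseline", baseline),
    ("perfect A", fix_a),
    ("B at 99%", fix_b),
    ("perfect C", fix_c),
    ("B + 1 retry", with_retry),
]:
    print(f"{name:12s} {value:.3f}")
```

Under these assumptions the baseline comes out near 0.79, and a single verified retry per tool call lifts the end-to-end rate to roughly 0.96 without touching the model itself, which is why the per-step verifier is the highest-leverage architectural change.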

Exercise 28.8.3: Add a Circuit Breaker (Code Tweak)

Sketch a 12-line circuit breaker for an LLM call that opens after 5 consecutive failures, stays open for 30 seconds, then attempts a half-open probe. State the corner case that bites every first implementation.

Answer Sketch
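import time


class OpenError(Exception):
    """Raised when the breaker is open and the call is rejected."""


class Breaker:
    def __init__(self):
        self.fail = 0          # consecutive failures seen so far
        self.opened_at = 0.0   # monotonic timestamp of the last trip
        self.state = "closed"

    def call(self, fn, *a):
        if self.state == "open":
            # Monotonic clock: unaffected by wall-clock adjustments.
            if time.monotonic() - self.opened_at < 30:
                raise OpenError("circuit open")
            self.state = "half_open"  # cooldown elapsed: let a probe through
        try:
            r = fn(*a)
        except Exception:
            self.fail += 1
            # A failed half-open probe reopens immediately; otherwise the
            # breaker opens after 5 consecutive failures.
            if self.state == "half_open" or self.fail >= 5:
                self.state = "open"
                self.opened_at = time.monotonic()
            raise
        self.fail = 0
        self.state = "closed"
        return r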
Code Fragment 66.1.8: A minimal circuit breaker that opens after 5 consecutive failures, stays open for 30 seconds, then admits a half-open probe.

The corner case: in a multi-process service every replica maintains its own counter, so a global outage is detected N times slower than expected. Fix: store the failure count in shared cache (Redis) keyed by upstream identity, with a TTL that auto-resets after the cooldown.
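The shared-counter fix can be sketched as follows. To keep the example runnable offline, `InMemorySharedStore` is a hypothetical in-process stand-in for the shared cache; in a real deployment its `incr`, `get`, and `delete` methods would be backed by the cache's own increment-with-expiry operations. `SharedBreaker` is likewise an illustrative name. One simplification to note: when the TTL expires the breaker simply closes, so the next call acts as the probe.

```python
import time
from typing import Callable


class InMemorySharedStore:
    """Stand-in for a shared cache such as Redis (assumption for this sketch).

    Keeps a counter per key that expires `ttl_s` seconds after its last update.
    """

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._counters: dict[str, tuple[int, float]] = {}

    def incr(self, key: str, ttl_s: float) -> int:
        count = self.get(key) + 1
        self._counters[key] = (count, self._clock() + ttl_s)
        return count

    def get(self, key: str) -> int:
        count, expires_at = self._counters.get(key, (0, 0.0))
        return count if self._clock() < expires_at else 0

    def delete(self, key: str) -> None:
        self._counters.pop(key, None)


class SharedBreaker:
    """Breaker whose failure count lives in the shared store, keyed by upstream."""

    def __init__(self, store: InMemorySharedStore, upstream: str,
                 threshold: int = 5, cooldown_s: float = 30.0):
        self.store = store
        self.key = f"breaker:{upstream}"
        self.threshold = threshold
        self.cooldown_s = cooldown_s

    def is_open(self) -> bool:
        return self.store.get(self.key) >= self.threshold

    def record_failure(self) -> None:
        # Every replica increments the same key, so failures pool across replicas.
        self.store.incr(self.key, ttl_s=self.cooldown_s)

    def record_success(self) -> None:
        self.store.delete(self.key)


if __name__ == "__main__":
    store = InMemorySharedStore()
    replica_a = SharedBreaker(store, "primary")
    replica_b = SharedBreaker(store, "primary")
    for i in range(5):
        (replica_a if i % 2 == 0 else replica_b).record_failure()
    print(replica_a.is_open(), replica_b.is_open())
```

With the counter pooled, five failures spread across any number of replicas open the breaker for all of them, instead of each replica needing to observe five failures on its own.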

Exercise 28.8.4: SLO Failure Modes (Failure Mode)

You set an SLO of "99% of LLM calls return within 5 seconds with valid JSON". List three ways this SLO can be technically green while customers are unhappy, and the metric you should add to catch each.

Answer Sketch

(1) Valid JSON, wrong content: the schema validates but the values are hallucinated. Add a per-call quality eval (an LLM judge or sampled human review) tracked as a separate SLO. (2) Valid latency, missing tokens: the model returns a truncated answer because of an upstream max-tokens bug. Add an output-completeness check (does the response include the required fields or an end-of-answer marker?). (3) Valid response, broken downstream: the call returned in 4.9s and parsed cleanly, but a downstream tool call rejected it for being slightly off-format. Track end-to-end task success, not just per-call success. The recurring lesson: LLM SLOs need at least one quality dimension on top of the standard latency-and-availability pair.
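A small report makes the gap between the original SLO and the added metrics concrete. The sketch below is illustrative only: `CallRecord` and its fields are hypothetical, `judged_correct` is assumed to come from an LLM judge or sampled review, and `finish_reason` is assumed to follow the common provider convention where "stop" means the model ended on its own and "length" means it was cut off.

```python
import json
from dataclasses import dataclass


@dataclass
class CallRecord:
    latency_s: float
    raw_output: str
    judged_correct: bool   # verdict from an LLM judge or sampled human review
    finish_reason: str     # "stop" = ended on its own; "length" = truncated
    task_succeeded: bool   # did the end-to-end task using this call succeed?


def is_valid_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def slo_report(records: list[CallRecord]) -> dict[str, float]:
    """Compliance rate per dimension; the first entry is the original SLO."""
    n = len(records)
    if n == 0:
        raise ValueError("no records to evaluate")

    def rate(predicate) -> float:
        return sum(1 for r in records if predicate(r)) / n

    return {
        "latency_and_json": rate(
            lambda r: r.latency_s <= 5.0 and is_valid_json(r.raw_output)
        ),
        "quality": rate(lambda r: r.judged_correct),
        "completeness": rate(lambda r: r.finish_reason == "stop"),
        "task_success": rate(lambda r: r.task_succeeded),
    }


# Each of the last three records passes the original SLO but fails one added metric.
records = [
    CallRecord(1.2, '{"answer": "42"}', True, "stop", True),
    CallRecord(2.0, '{"answer": "Paris"}', False, "stop", True),   # wrong content
    CallRecord(0.9, '{"answer": "The"}', True, "length", True),    # truncated
    CallRecord(4.9, '{"answer": "ok"}', True, "stop", False),      # rejected downstream
]
report = slo_report(records)
print(report)
```

On this sample the original SLO reads 100% while each added dimension reads 75%, which is the "technically green, customers unhappy" situation the exercise describes.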

What Comes Next

This section covered the LLM failure taxonomy, cascading failures in multi-agent systems, and SLO definitions. In Section 66.2: Model Registry and Deployment Workflows, we cover the registry that anchors all of this: versioning, lineage, promotion workflows, and the deployment patterns that turn registered candidates into the current champion in production.

Further Reading

Reliability Engineering Foundations

Beyer, B. et al. (2016). Site Reliability Engineering: How Google Runs Production Systems. O'Reilly. The foundational text on SRE practices including SLOs, error budgets, and incident response. While not LLM-specific, the principles of service level management and on-call practices directly inform the SLO framework and incident response playbooks presented in this section.
Nygard, M. (2018). Release It! Design and Deploy Production-Ready Software. 2nd edition. Pragmatic Bookshelf. Introduces the stability patterns (circuit breaker, bulkhead, timeout) that form the basis of the resilience patterns in this section. The circuit breaker pattern described here is adapted directly from Nygard's formulation with extensions for soft failure detection.

LLM-Specific Reliability

Rebedea, T. et al. (2023). NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications. Presents NVIDIA's framework for adding programmable guardrails to LLM applications, including topical rails, safety rails, and output format enforcement. The guardrails-as-infrastructure approach in this section draws on NeMo's architecture of composable validation layers.
Farquhar, S. et al. (2024). Detecting Hallucinations in Large Language Models Using Semantic Entropy. Nature. Proposes using semantic entropy across multiple sampled responses to detect hallucinations without external knowledge bases. This technique can be integrated into output guardrails for real-time hallucination detection in production systems.

Chaos Engineering and Testing

Rosenthal, C. et al. (2020). Chaos Engineering: System Resiliency in Practice. O'Reilly. Defines the principles and practices of chaos engineering, including the steady-state hypothesis, blast radius control, and automated experimentation. The LLM-specific chaos engineering scenarios in this section extend these principles to cover AI-specific failure modes like hallucination injection and adversarial prompt testing.
Mazeika, M. et al. (2024). HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal. Provides a standardized framework for evaluating LLM robustness against adversarial attacks, directly relevant to the chaos engineering and adversarial testing practices described in this section.