
"A 99.9 percent SLO is a contract; a 99.9 percent vibe is a footnote."
Deploy, SLO-Defending AI Agent
Chapters 62 through 65 covered deploying and orchestrating LLM systems. This chapter covers the operational discipline that keeps them running: SLOs, error budgets, incident response, postmortems, model registry, prompt-as-code, blue/green deployments, and the small habits that distinguish a reliable platform from one that merely hopes.
LLM applications fail in modes that traditional reliability engineering does not cover: hallucination spikes, prompt regressions, latency cliffs from long contexts. This chapter covers SLO definition for LLM systems, error budgets, model registry patterns, canary deployments, and incident response for production LLM apps.
Chapter Overview
LLM applications fail in ways that classical reliability engineering never anticipated: silent quality regressions, prompt drift, hallucination spikes, provider deprecations. This chapter teaches LLM reliability engineering: the SLO model adapted for LLMs (quality, safety, latency, cost as first-class objectives), the model registry as the inventory backbone, the runbook patterns that make on-call survivable, and the post-incident discipline that turns failures into eval cases.
Reliability for LLM systems is its own discipline. This chapter is the practitioner's syllabus for the metrics, registries, runbooks, and post-incident habits that compound.
Learning Objectives
- Apply an SLO model that treats quality, safety, latency, and cost as first-class objectives.
- Architect a model registry with versioning, lineage, and promotion workflows.
- Design runbooks for the LLM-specific failure modes: hallucination spikes, prompt drift, provider deprecation.
- Conduct LLM post-incident reviews that produce eval cases, not just retrospectives.
- Diagnose reliability regressions using observability, drift detection, and registry diffs.
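As a preview of the first objective, here is a minimal sketch of how an error budget could be tracked per objective. It assumes each objective is expressed as a target fraction of "good" requests. The `Objective` class, the `error_budget_remaining` helper, and every target and count below are illustrative assumptions, not definitions taken from this chapter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Objective:
    """One SLO dimension (hypothetical shape): fraction of requests that must be good."""
    name: str
    target: float  # e.g. 0.999 means 99.9% of requests must meet the objective


def error_budget_remaining(objective: Objective, total: int, bad: int) -> float:
    """Return the unspent fraction of the error budget (negative when overspent)."""
    if not 0.0 < objective.target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    if total == 0:
        return 1.0  # no traffic yet, so nothing has been spent
    allowed_bad = (1.0 - objective.target) * total  # budget in units of requests
    return 1.0 - bad / allowed_bad


# Illustrative targets and counts only; real values depend on the product.
window = {
    Objective("latency", 0.999): (1_000_000, 250),
    Objective("quality", 0.95): (1_000, 60),
}
report = {
    obj.name: error_budget_remaining(obj, total, bad)
    for obj, (total, bad) in window.items()
}
```

In this sketch the latency objective has spent about a quarter of its budget, while the quality objective has overspent, which is the kind of signal that might pause a prompt or model rollout. What counts as a "bad" request for quality or safety is a separate design decision that this example does not address.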
Prerequisites
- Production engineering from Chapter 62
- AI gateways from Chapter 63
- Containers and Kubernetes from Chapter 65
Sections in This Chapter
- 66.1 Reliability Engineering for LLM Applications (Advanced): LLM applications fail in ways that traditional software reliability engineering never anticipated.
- 66.2 Model Registry and Deployment Workflows (Intermediate): Versioning, lineage, promotion workflows, and lifecycle management for LLM artifacts.
What's Next?
This chapter begins with Section 66.1: Reliability Engineering for LLM Applications. Each section builds on the previous one, so we recommend reading them in order.