Chapter 66: Reliability, SLOs & Model Registry

Chapter opener illustration: Reliability.

"A 99.9 percent SLO is a contract; a 99.9 percent vibe is a footnote."
Deploy, SLO-Defending AI Agent

Looking Back

Chapters 62 through 65 deployed and orchestrated LLM systems. This chapter covers the operational discipline that keeps them running: SLOs, error budgets, incident response, postmortems, the model registry, prompt-as-code, blue/green deployments, and the small habits that distinguish a reliable platform from one that merely hopes to be.

Big Picture

LLM applications fail in modes that traditional reliability engineering does not cover: hallucination spikes, prompt regressions, latency cliffs from long contexts. This chapter covers SLO definition for LLM systems, error budgets, model registry patterns, canary deployments, and incident response for production LLM apps.
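To make the error-budget idea concrete, here is a minimal sketch of the arithmetic. It assumes a request-based SLO in which a "bad" request is any request that errors, times out, or fails a quality check; that definition, and the names `BudgetReport` and `error_budget`, are illustrative assumptions rather than anything prescribed by this chapter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BudgetReport:
    allowed_bad: float        # bad requests the SLO tolerates in the window
    consumed_fraction: float  # share of the budget already spent
    exhausted: bool           # True once the whole budget is spent


def error_budget(slo_target: float, total_requests: int, bad_requests: int) -> BudgetReport:
    """Compute error-budget consumption for a request-based SLO.

    slo_target is the fraction of requests that must be good, e.g. 0.999.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0 or not 0 <= bad_requests <= total_requests:
        raise ValueError("request counts are inconsistent")

    # The budget is the complement of the SLO, scaled by traffic in the window.
    allowed_bad = (1.0 - slo_target) * total_requests
    consumed = bad_requests / allowed_bad
    return BudgetReport(allowed_bad, consumed, consumed >= 1.0)


if __name__ == "__main__":
    report = error_budget(slo_target=0.999, total_requests=1_000_000, bad_requests=250)
    print(f"budget: {report.allowed_bad:.0f} bad requests, "
          f"{report.consumed_fraction:.0%} consumed, exhausted={report.exhausted}")
```

With a 99.9 percent target and one million requests in the window, the budget is roughly 1,000 bad requests, so 250 bad requests consume about a quarter of it. A team might freeze risky prompt or model changes once `exhausted` becomes true.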

Chapter Overview

LLM applications fail in ways that classical reliability engineering never anticipated: silent quality regressions, prompt drift, hallucination spikes, provider deprecations. This chapter teaches LLM reliability engineering: the SLO model adapted for LLMs (quality, safety, latency, cost as first-class objectives), the model registry as the inventory backbone, the runbook patterns that make on-call survivable, and the post-incident discipline that turns failures into eval cases.

Reliability for LLM systems is its own discipline. This chapter is the practitioner's syllabus for the metrics, registries, runbooks, and post-incident habits whose benefits compound over time.
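As a rough illustration of what a registry with versioning, lineage, and promotion might track, the sketch below keeps everything in memory. The class names, the `prompt_hash` field, and the three-stage ladder are hypothetical choices made for this example; a production registry would persist this state and record far more metadata.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Hypothetical promotion ladder; real workflows may add gates or stages.
STAGES = ("dev", "staging", "production")


@dataclass
class ModelVersion:
    name: str
    version: int
    prompt_hash: str                      # ties the prompt to the model version
    parent_version: Optional[int] = None  # lineage pointer to the predecessor
    stage: str = "dev"


class InMemoryRegistry:
    def __init__(self) -> None:
        self._versions: Dict[Tuple[str, int], ModelVersion] = {}

    def register(self, name: str, prompt_hash: str,
                 parent_version: Optional[int] = None) -> ModelVersion:
        """Create the next version of `name`, optionally derived from a parent."""
        if parent_version is not None and (name, parent_version) not in self._versions:
            raise KeyError(f"unknown parent {name} v{parent_version}")
        latest = max((v for (n, v) in self._versions if n == name), default=0)
        entry = ModelVersion(name, latest + 1, prompt_hash, parent_version)
        self._versions[(name, entry.version)] = entry
        return entry

    def promote(self, name: str, version: int) -> ModelVersion:
        """Move a version one step up the ladder; never skip a stage."""
        entry = self._versions[(name, version)]
        index = STAGES.index(entry.stage)
        if index == len(STAGES) - 1:
            raise ValueError("already in production")
        entry.stage = STAGES[index + 1]
        return entry

    def lineage(self, name: str, version: int) -> List[int]:
        """Return the chain of versions from `version` back to its root."""
        chain: List[int] = []
        current: Optional[int] = version
        while current is not None:
            chain.append(current)
            current = self._versions[(name, current)].parent_version
        return chain
```

Registering a second version with `parent_version=1` and calling `lineage(name, 2)` returns `[2, 1]`, which is the kind of record a registry diff can use during an incident to identify what changed between the last good version and the current one.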

Learning Objectives

Apply an SLO model that treats quality, safety, latency, and cost as first-class objectives.
Architect a model registry with versioning, lineage, and promotion workflows.
Design runbooks for the LLM-specific failure modes: hallucination spikes, prompt drift, provider deprecation.
Conduct LLM post-incident reviews that produce eval cases, not just retrospectives.
Diagnose reliability regressions using observability, drift detection, and registry diffs.

Prerequisites

Production engineering from Chapter 62
AI gateways from Chapter 63
Containers and Kubernetes from Chapter 65

What's Next?

This chapter begins with Section 66.1: Reliability Engineering for LLM Applications. Each section builds on the previous one, so we recommend reading them in order.

Further Reading

SRE for LLM Services

Beyer, B., Murphy, N. R., Rensin, D. K., Kawahara, K., & Thorne, S. (2018). The Site Reliability Workbook: Practical Ways to Implement SRE. O'Reilly. The practitioner's reference for designing SLOs, error budgets, and incident response, and the methodology this chapter adapts for LLM services.

Hendrickson, S., Sturdevant, S., Harter, T., Venkataramani, V., Arpaci-Dusseau, A. C., & Arpaci-Dusseau, R. H. (2016). "Serverless Computation with OpenLambda." USENIX HotCloud '16. Establishes the latency-tail and cold-start patterns that LLM endpoints inherit and that SLO design must account for.

Model Registry & Lineage

Schelter, S., Boese, J.-H., Kirschnick, J., Klein, T., & Seufert, S. (2017). "Automatically Tracking Metadata and Provenance of Machine Learning Experiments." NeurIPS 2017 ML Systems Workshop. A canonical reference for what a model registry must track (data, code, hyperparameters, metrics), which LLMOps inherits and extends with prompts.

Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., et al. (2019). "Model Cards for Model Reporting." FAT* 2019. arXiv:1810.03993. Establishes the model-card pattern that production model registries now embed for every registered model version.