Chapter 62: Production Engineering for LLM Systems

Chapter opener illustration: Production Engineering for LLM Systems.

"Production is where the bug reports begin."
Deploy, Perpetually-Shipping AI Agent

Looking Back

Part XII supplied the iron and the framework. Part XIII covers the lifecycle of an LLM service. This chapter starts with the core: deployment patterns, API design, prompt versioning, model registry, and the small engineering habits that distinguish a long-lived service from a hack-week demo.

Big Picture

The big picture is infrastructure-heavy engineering for LLM systems: scaling, AI gateways, workflow orchestration, edge deployment, reliability, and Kubernetes-native operations.

Chapter Overview

Production engineering for LLM systems is what turns a notebook into a 24/7 service. This chapter covers the core: scaling, performance, and the production guardrails that keep responses safe and SLOs intact under unpredictable traffic, followed by the LLMOps practices that extend MLOps with an LLM-specific model lifecycle, eval ownership, prompt versioning, and the data-quality flywheel.

It is the discipline that separates demos from durable products, and this chapter is the foundation for the rest of Part XIII.

Learning Objectives

Architect scaling and performance patterns for unpredictable LLM traffic.
Layer production guardrails into the response path without breaking latency budgets.
Apply LLMOps practices (model lifecycle, eval ownership, prompt versioning, data-quality flywheel).
Design the deploy-monitor-learn loop that maintains LLM product quality over time.
Diagnose the LLM-specific failure modes that traditional MLOps does not anticipate.

Prerequisites

LLM APIs from Chapter 11
Evaluation foundations from Chapter 42
Web-service engineering basics (REST, async, caching)

Sections

What's Next?

This chapter begins with Section 62.1: Scaling, Performance & Production Guardrails. Each section builds on the previous one, so we recommend reading them in order.

Further Reading

LLM Serving Systems

Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., et al. (2023). "Efficient Memory Management for Large Language Model Serving with PagedAttention." SOSP. arXiv:2309.06180. The vLLM paper that consolidated continuous batching and KV-cache paging as the production-serving baseline for Section 62.1.

Yu, G.-I., Jeong, J. S., Kim, G.-W., Kim, S., & Chun, B.-G. (2022). "Orca: A Distributed Serving System for Transformer-Based Generative Models." USENIX OSDI '22. Introduced iteration-level scheduling, the idea later popularized as continuous batching and the algorithmic root of most modern LLM serving engines.

LLMOps Practice

Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., et al. (2015). "Hidden Technical Debt in Machine Learning Systems." NeurIPS 2015. The canonical paper on ML technical debt; the conceptual baseline that LLMOps practice extends in Section 62.2.

Huyen, C. (2022). Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications. O'Reilly. The reference textbook for ML systems in production; covers the data, monitoring, and deployment patterns that LLMOps inherits.

Reliability Engineering

Beyer, B., Jones, C., Petoff, J., & Murphy, N. R. (Eds.). (2016). Site Reliability Engineering: How Google Runs Production Systems. O'Reilly. Often called the Google SRE Book; the reference text for SLOs, error budgets, and incident review patterns that carry over to LLM production work.