Part VIII: Evaluation & Production

Chapter 30: Observability, Monitoring & MLOps

"You cannot improve what you cannot measure, and you cannot measure what you cannot observe."

Sentinel Sentinel, Observationally Obsessed AI Agent

Chapter Overview

With evaluation fundamentals established in Chapter 29, this chapter focuses on the operational side: making LLM applications observable, monitorable, and reproducible in production. You will learn to instrument applications with distributed tracing, detect drift before it degrades user experience, and implement experiment tracking that ensures every result can be reproduced months later.

The chapter covers production observability with tracing tools (LangSmith, Langfuse, Phoenix); monitoring for prompt drift, provider version drift, and embedding drift; reproducibility practices, including prompt versioning and configuration management; and arena-style evaluation methods that leverage crowdsourced human judgment at scale.
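To make the tracing idea concrete before the chapter dives in, here is a minimal sketch of span-based instrumentation using only the Python standard library. It is a stand-in for what tools like LangSmith, Langfuse, or Phoenix provide; the `traced` decorator, the `SPANS` buffer, and the `call_llm` stub are illustrative names, not part of any of those libraries, and real tracers also capture prompts, token counts, and nested spans.

```python
import functools
import json
import time
import uuid

# In-memory span buffer; a real tracer ships spans to a backend instead.
SPANS: list[dict] = []

def traced(fn):
    """Record a trace span (trace id, name, duration) around each call."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        span = {"trace_id": str(uuid.uuid4()), "name": fn.__name__}
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            span["duration_ms"] = round((time.perf_counter() - start) * 1000, 2)
            SPANS.append(span)
    return wrapper

@traced
def call_llm(prompt: str) -> str:
    # Placeholder for a real model call.
    return f"echo: {prompt}"

call_llm("hello")
print(json.dumps(SPANS[0]))
```

Even this toy version shows the core pattern the chapter builds on: every model call produces a structured record that can be queried, aggregated, and joined with cost and quality signals later.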

Big Picture

Production LLM systems require continuous monitoring of latency, cost, quality, and safety metrics. This chapter teaches you to instrument your systems with tracing, logging, and alerting, ensuring you can detect and respond to issues before they affect users. These practices complement the evaluation methods of Chapter 29.
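As a preview of the alerting pattern, the sketch below tracks latency over a rolling window and raises an alert when the recent mean crosses a threshold. The `LatencyMonitor` class, the window size, and the threshold value are assumptions for illustration; production systems would track percentiles (p95/p99) and route alerts to a pager or dashboard rather than a list.

```python
import statistics
from collections import deque

class LatencyMonitor:
    """Rolling-window latency monitor with a simple threshold alert."""

    def __init__(self, window: int = 100, threshold_ms: float = 2000.0):
        self.samples = deque(maxlen=window)  # keeps only the last `window` samples
        self.threshold_ms = threshold_ms
        self.alerts: list[str] = []

    def record(self, latency_ms: float) -> None:
        self.samples.append(latency_ms)
        mean = statistics.mean(self.samples)
        # Alert when the recent mean exceeds the threshold.
        if mean > self.threshold_ms:
            self.alerts.append(
                f"mean latency {mean:.0f} ms exceeds {self.threshold_ms:.0f} ms"
            )

mon = LatencyMonitor(window=5, threshold_ms=1000)
for ms in [400, 500, 2500, 2600, 2700]:
    mon.record(ms)
print(mon.alerts[-1])
```

The same structure generalizes to cost per request, refusal rates, or drift scores: record a sample per call, aggregate over a window, and compare against a baseline.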

Learning Objectives

Prerequisites

Sections

What's Next?

In the next chapter, Chapter 31: Production Engineering, we tackle deployment architectures, scaling patterns, and operational best practices for production LLM systems.