Chapter 44: Online Evaluation, Observability, and Production Monitoring

Chapter opener illustration: Online Evaluation.

"Offline evals tell you it works; online observability tells you it is still working."
Deploy, Perpetually-Shipping AI Agent

Looking Back

Chapters 42 and 43 evaluated models in the lab. This chapter watches them in production: traces, dashboards, drift detection, online A/B tests, and the observability stack (LangSmith, Phoenix, Helicone, Langfuse) that keeps an LLM product healthy after launch.

Big Picture

Evaluation of production traffic: distributed tracing, observability platforms, OpenTelemetry, online A/B testing, drift detection, and eval-as-product workflows.

Chapter Overview

Online evaluation is the discipline of measuring an LLM system after it leaves the lab. This chapter covers: the model registry and deployment workflows that bridge experimentation and production; the dashboards (W&B, MLflow) that track quality, safety, latency, and cost together; observability and the GenAI OpenTelemetry semantic conventions; the five flavors of LLM drift (input, model, context, performance, cost); the model-rotation strategy that keeps you portable across providers; and the eval-as-product category (Braintrust, Latitude, Laminar) that emerged between 2023 and 2026.

The silent provider updates of 2025 taught the field that drift detection is non-optional, and online evaluation is where that lesson applies. This chapter presents the production picture as of 2026.

Note: Learning Objectives

Architect a model registry with versioning, lineage, and promotion workflows for LLM artifacts.
Configure quality, safety, latency, and cost dashboards in W&B or MLflow.
Apply OpenLLMetry and the GenAI semantic conventions to instrument a production LLM system.
Detect the five flavors of LLM drift (input, model, context, performance, cost) in production traffic.
Design a model-rotation strategy that survives a provider deprecation or capability regression.
Compare Braintrust, Latitude, and Laminar as eval-as-product platforms.

Prerequisites

Offline evaluation from Chapter 42
Specialized evaluation from Chapter 43
Some prior exposure to production monitoring (Datadog, Prometheus) helps

Lab 44: Instrument a RAG App With Langfuse, Then Replay and Score Production Traffic

Objective

Add OpenTelemetry-based observability to the Lab 32 RAG bot, ship 100 simulated queries, then build a replay pipeline that re-scores them when prompts or models change. By the end you have the production-monitoring loop every LLM team eventually builds.

Steps

Step 1: Spin up Langfuse. Run docker compose up from a checkout of the official langfuse/langfuse repo. Create a project and copy its public and secret keys.
Step 2: Instrument the RAG bot. Wrap the retriever and the LLM call with @observe() decorators (or manual trace.span calls). Each query should produce a hierarchical trace: retrieval span, generation span, total cost.
Step 3: Simulate traffic. Run 100 queries from a query log (your Lab 32 evaluation set, plus 50 random variations). Inspect the Langfuse dashboard: latency P50/P95, token spend, error rate.
Step 4: Add scorers. Register two automated scorers in Langfuse: (a) a heuristic for "answer mentions a source" and (b) an LLM judge for faithfulness. Both run on every trace.
Step 5: Replay with a new prompt. Change the system prompt (e.g., add "Cite sources"). Re-run the same 100 queries and compare the aggregate scorer outputs side by side. Make a go/no-go decision based on data, not vibes.
Step 6: Alert. Configure a webhook that pings you on Slack when faithfulness drops below 0.8 over a 1-hour window. Test it by injecting 10 bad traces. This is the "5% dumber overnight" alarm the chapter promised.
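The scoring, replay, and alerting logic in Steps 4 to 6 can be prototyped without any platform before it is wired into Langfuse. The block below is a minimal sketch using only the Python standard library. Every name in it (mentions_source, compare_runs, RollingAlert, SOURCE_PATTERN) is hypothetical and none of it is part of the Langfuse API. It also makes assumptions the lab text does not state: that "mentions a source" can be approximated by a bracketed citation, a URL, or a "source:" label; that "drops below 0.8" means the mean score over the window; and that an alert should wait for a minimum number of samples.

```python
import re
from collections import deque
from statistics import mean

# Assumed heuristic for Step 4(a): a bracketed citation, a URL, or "source:".
SOURCE_PATTERN = re.compile(r"\[\d+\]|https?://|\bsource:", re.IGNORECASE)


def mentions_source(answer: str) -> float:
    """Return 1.0 if the answer appears to cite a source, else 0.0."""
    return 1.0 if SOURCE_PATTERN.search(answer) else 0.0


def compare_runs(baseline: list[float], candidate: list[float]) -> dict[str, float]:
    """Step 5: compare mean scores from two replays of the same queries."""
    if len(baseline) != len(candidate):
        raise ValueError("both replays must cover the same queries")
    base_mean, cand_mean = mean(baseline), mean(candidate)
    return {"baseline": base_mean, "candidate": cand_mean, "delta": cand_mean - base_mean}


class RollingAlert:
    """Step 6: signal when the mean score in a time window falls below a threshold."""

    def __init__(self, threshold: float = 0.8, window_seconds: float = 3600.0,
                 min_samples: int = 5) -> None:
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.min_samples = min_samples  # assumption: avoid alerting on tiny samples
        self._scores: deque[tuple[float, float]] = deque()  # (timestamp, score)

    def add(self, timestamp: float, score: float) -> bool:
        """Record one score; return True if the alert condition now holds."""
        self._scores.append((timestamp, score))
        # Evict samples that have aged out of the window.
        while self._scores and self._scores[0][0] <= timestamp - self.window_seconds:
            self._scores.popleft()
        if len(self._scores) < self.min_samples:
            return False
        return mean(s for _, s in self._scores) < self.threshold


if __name__ == "__main__":
    alert = RollingAlert()
    # Ten healthy traces, then ten injected bad traces, one per second.
    scores = [0.9] * 10 + [0.2] * 10
    fired = [alert.add(float(t), s) for t, s in enumerate(scores)]
    print("first alert at trace index:", fired.index(True))
```

In the lab itself, the return value of add would trigger the Slack webhook, and the per-trace scores would come from the scorers registered in Step 4 rather than from a hard-coded list.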

Expected Output

Expected time: 3 to 4 hours (builds on Lab 32). Difficulty: intermediate. Artifact: a Langfuse-instrumented RAG app, a replay harness, and alerting.

What's Next?

Next: Chapter 45: Tools of the Trade, Eval & Production Stack. Chapter 45 consolidates the eval-and-observability toolbox: OpenAI Evals, Inspect AI, Eleuther's lm-evaluation-harness, the eval-as-product platforms, OpenTelemetry instrumentation libraries, the open eval datasets (MMLU, BIG-Bench, HELM, HumanEval, GPQA), and the drift-detection rigs. Then Chapter 46 closes Part IX with the topic underneath all of it: how to use LLMs themselves as graders, and why doing it naively is a trap.