Tools of the Trade: Eval & Production Stack

Consolidated reference: platforms, libraries, datasets, models, and external resources for this part.

Chapter opener illustration: Tools of the Trade: Eval & Production Stack.

"In production, the eval set is the spec."

EvalEval, Spec-First AI Agent

Looking Back

Chapters 42 through 44 framed evaluation; this chapter is the toolbox: OpenAI Evals, Inspect, Ragas, DeepEval, Promptfoo, LangSmith, Phoenix, and the small per-team choices about where evaluation should live in your CI.
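One way to picture "evaluation living in CI" is a gate that treats the eval set as the spec: run every case, compute a pass rate, and fail the build when the rate drops below a threshold. The sketch below is a minimal, library-free illustration, not the API of any tool named above. The names EvalCase, pass_rate, ci_gate, and stub_model are hypothetical, the substring check stands in for whatever scorer a team actually uses, and the 0.9 threshold is an arbitrary assumption.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class EvalCase:
    """One line of the spec: a prompt and a substring the answer must contain."""
    prompt: str
    expected: str


def pass_rate(model: Callable[[str], str], cases: Sequence[EvalCase]) -> float:
    """Fraction of cases whose output contains the expected substring (case-insensitive)."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(
        case.expected.lower() in model(case.prompt).lower() for case in cases
    )
    return passed / len(cases)


def ci_gate(
    model: Callable[[str], str],
    cases: Sequence[EvalCase],
    threshold: float = 0.9,  # assumed value; pick one per team
) -> int:
    """Return a process exit code: 0 if the pass rate meets the threshold, else 1."""
    rate = pass_rate(model, cases)
    print(f"pass rate: {rate:.2%} (threshold {threshold:.0%})")
    return 0 if rate >= threshold else 1


def stub_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call."""
    answers = {"2 + 2 =": "4", "Capital of France?": "Paris"}
    return answers.get(prompt, "I don't know")


# A toy eval set; a real one would be loaded from a versioned file.
EVAL_SET = [
    EvalCase("2 + 2 =", "4"),
    EvalCase("Capital of France?", "paris"),
    EvalCase("Largest planet?", "jupiter"),
]

exit_code = ci_gate(stub_model, EVAL_SET, threshold=0.9)
```

In a CI job, the returned code would be passed to sys.exit so that a regression blocks the merge. Here the stub answers two of the three cases, so the gate returns 1 at a 0.9 threshold.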

Big Picture

Part IX splits into two halves: evaluation (how do we know it works?) and production (how do we keep it working?). The eval toolbox includes OpenAI Evals, HELM, lm-evaluation-harness, and Inspect AI. The production toolbox is the serving stack (vLLM, TGI, SGLang, Triton) plus the observability layer (Arize, LangSmith, Helicone, Phoenix).

Chapter Overview

Part IX covers offline eval, online eval, observability, drift, and LLM-as-judge methods. This chapter consolidates the eval and production toolchain: the platforms (eval-as-product services, observability suites, model registries), the libraries that wrap the standard metrics, the canonical datasets organized by knowledge, capability, and safety, the judge and serving models that anchor the rest of Part IX, and the academic and industrial venues that keep eval current.

Eval tooling crossed from "build it yourself" to "buy a vertical product" in 2024 and 2025. This chapter is the index of what to use when.

Note: Learning Objectives

Library Shortcut

For evaluation:

pip install lm-eval inspect-ai

lm-evaluation-harness (installed as lm-eval) covers most standard benchmarks; Inspect AI, from the UK AISI, covers agent and safety evals.

For production serving, vLLM is the default. "vLLM & Inference Servers" covers the serving side in detail.

Sections in This Chapter

Prerequisites

What Comes Next

Next: Chapter 46: LLM-as-Judge & Automated Evaluation. Chapter 46 closes Part IX with the technique that quietly powers most modern eval pipelines: using an LLM (often a stronger one) to grade the outputs of another. We cover judge reliability, position bias and length bias, debiasing techniques, training judge models, multi-judge ensembles, and the patterns that ship in production without inheriting the judge's own blind spots.