Specialized Evaluation: RAG, Agents, Multimodal, Long-Context


"Specialized evaluation is what general-purpose benchmarks miss."

EvalEval, Trace-Sniffing AI Agent
Looking Back

Chapter 42 covered general evaluation. This chapter goes deeper into the families where scalar text benchmarks fall short: RAG eval (Ragas, BEIR), agentic eval (AgentBench, SWE-Bench, GAIA, tau-bench), simulation-based eval, code-gen eval, and multimodal eval.

Big Picture

Evaluation methodologies for the 2026 frontier: RAG faithfulness, agentic trajectories, simulation-based eval, code-gen pass@k, multimodal grounding, and long-context benchmarks.

Chapter Overview

Specialized evaluation is what general-purpose benchmarks miss. This chapter walks through the five specialized eval families that the community standardized on between 2024 and 2026: RAG evaluation (Ragas, BEIR, faithfulness vs groundedness), agentic evaluation (AgentBench, SWE-Bench, GAIA, tau-bench), simulation-based evaluation (tau-bench, MM-tau-p2), code-generation evaluation against test suites, and multimodal evaluation across vision-language, audio, and video.

Each of these eval families targets a failure mode that scalar text benchmarks cannot catch. By the end of this chapter you will know which eval to run, how to interpret its score, and where contamination, reward-hacking, or trajectory-vs-outcome confusion can mislead you.
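To make "how to interpret its score" concrete for the code-generation family, here is a minimal sketch of the pass@k metric named in the Big Picture. This page does not define pass@k, so the definition is an assumption: draw n samples per problem, count the c samples that pass the test suite, and estimate the probability that at least one of k randomly chosen samples passes as 1 - C(n-c, k) / C(n, k). This is the unbiased estimator widely used for HumanEval-style evaluation. The function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of samples generated for the problem
    c: number of those samples that passed every test
    k: the sampling budget being scored (k <= n)
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Fewer than k failing samples: every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k chosen samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds one (n, c) pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

Averaging per-problem estimates, rather than pooling all samples, keeps a single easy problem with many passing samples from dominating the benchmark score.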

Note: Learning Objectives

Prerequisites

Sections

Lab 43: Run Ragas Plus Trajectory Eval on Your Lab 26 and Lab 32 Bots

Objective

Apply specialized evaluation suites to the artifacts from earlier labs: Ragas on the RAG bot (Lab 32) and an agent-trajectory eval on the Wikipedia agent (Lab 26). By the end you will have category-specific scores for each system and a sense of which dimensions to optimize next.

Steps

  1. Step 1: Ragas for RAG. Take 20 question-answer pairs from Lab 32. Run ragas.evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_precision, context_recall]). Save scores.
  2. Step 2: Diagnose RAG weakness. Look at the 5 worst context_recall failures: was the right doc not in the index, not in top-k, or retrieved but not used? This is the diagnostic chain.
  3. Step 3: Trajectory eval for the agent. Take 20 Lab 26 traces. Define a rubric: did the agent (a) call appropriate tools, (b) avoid redundant calls, (c) stop when confident. Use langsmith.evaluation.trajectory or hand-roll a judge prompt.
  4. Step 4: Long-context spot-check. Feed Lab 26's agent a 50k-token Wikipedia article with the answer planted at a depth of roughly 25k tokens. Confirm the agent retrieves it (this is the Needle-in-a-Haystack pattern).
  5. Step 5: Report card. Render a one-page Markdown report: RAG metrics, agent trajectory scores, long-context check. Highlight the single biggest improvement opportunity.
  6. Step 6: Apply one fix. Pick the worst metric. Apply one fix (re-chunk, change embedder, add reflection). Re-measure. Document the delta.
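The diagnostic chain in Step 2 can be sketched as a three-way check per failing question. Everything here is hypothetical scaffolding rather than part of Ragas: the `RecallFailure` record, its field names, and the assumption that you can list the doc ids in your index and the doc ids the final answer drew on.

```python
from dataclasses import dataclass


@dataclass
class RecallFailure:
    """One low context_recall example (hypothetical record layout)."""
    question: str
    gold_doc_id: str          # document that actually contains the answer
    indexed_ids: set[str]     # every doc id present in the index
    retrieved_ids: list[str]  # ranked retrieval results, best first
    cited_ids: set[str]       # doc ids the final answer drew on


def diagnose(failure: RecallFailure, k: int = 5) -> str:
    """Walk the chain from Step 2 and return the first stage that broke."""
    if failure.gold_doc_id not in failure.indexed_ids:
        return "not_in_index"            # ingestion or chunking problem
    if failure.gold_doc_id not in failure.retrieved_ids[:k]:
        return "not_in_top_k"            # embedder or ranking problem
    if failure.gold_doc_id not in failure.cited_ids:
        return "retrieved_but_not_used"  # prompt or generation problem
    return "used"                        # look elsewhere, e.g. the gold label
```

Tallying the labels over the five worst failures points at which fix from Step 6 is worth trying first: re-chunking for `not_in_index`, a different embedder for `not_in_top_k`, prompt changes for `retrieved_but_not_used`.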
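For the Step 3 rubric, the hand-rolled route can start with deterministic checks before any judge prompt is involved. The sketch below assumes a trace can be flattened into a list of tool calls; `ToolCall`, `score_trajectory`, and the `max_calls` budget are illustrative names, not a LangSmith API. Criterion (c), "stop when confident", is approximated by a call budget, which is only a crude proxy; a judge prompt is still needed to assess confidence properly.

```python
import json
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    args: dict


def score_trajectory(calls: list[ToolCall], allowed_tools: set[str],
                     max_calls: int = 8) -> dict[str, bool]:
    """Score one agent trace against the three rubric items from Step 3."""
    seen = set()
    redundant = 0
    for call in calls:
        # Same tool with identical arguments counts as a repeated call.
        key = (call.name, json.dumps(call.args, sort_keys=True))
        if key in seen:
            redundant += 1
        seen.add(key)
    return {
        "appropriate_tools": all(c.name in allowed_tools for c in calls),
        "no_redundant_calls": redundant == 0,
        # Proxy for "stopped when confident": finished within a call budget.
        "within_budget": len(calls) <= max_calls,
    }
```

Running the cheap checks first means the judge model only has to rule on traces where the deterministic signals are clean or ambiguous.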
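The Needle-in-a-Haystack setup in Step 4 amounts to planting a known fact at a chosen depth and checking whether the answer surfaces it. This sketch counts whitespace-separated words as a rough stand-in for model tokens, which is an assumption; swap in your model's tokenizer for exact depths. The function names and the sample needle are illustrative.

```python
def insert_needle(haystack: str, needle: str, depth_tokens: int) -> str:
    """Plant `needle` after roughly `depth_tokens` words of `haystack`.

    Note: splitting on whitespace collapses the original line breaks.
    """
    words = haystack.split()
    depth = min(depth_tokens, len(words))
    return " ".join(words[:depth] + [needle] + words[depth:])


def needle_found(answer: str, expected: str) -> bool:
    """Loose check: does the agent's answer contain the planted fact?"""
    return expected.lower() in answer.lower()


# Placeholder haystack; in the lab this would be the Wikipedia article text.
filler = " ".join(["lorem"] * 50_000)
needle = "The launch code is 7421."
document = insert_needle(filler, needle, depth_tokens=25_000)
```

Repeating the check at several depths (for example 10%, 50%, and 90% of the article) shows whether failures cluster in the middle of the context rather than at one position.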

Expected Output

Expected time: 3 hours (chains from Labs 26 and 32). Difficulty: intermediate. Artifact: a domain-specific eval report + one measured improvement.
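The report card and the measured improvement can be rendered with a few lines of Python. This is a minimal sketch, assuming every metric is a score where higher is better; the function name and table layout are illustrative, and picking the lowest raw score as the "biggest opportunity" is a simplification when metrics sit on different scales.

```python
def render_report(before: dict[str, float], after: dict[str, float]) -> str:
    """Build a Markdown report card with a before/after delta per metric."""
    lines = [
        "# Lab 43 Eval Report",
        "",
        "| Metric | Before | After | Delta |",
        "|---|---|---|---|",
    ]
    for name in sorted(before):
        b = before[name]
        a = after.get(name, b)  # metrics not re-measured show a zero delta
        lines.append(f"| {name} | {b:.2f} | {a:.2f} | {a - b:+.2f} |")
    worst = min(before, key=before.get)
    lines += ["", f"Biggest improvement opportunity: {worst} ({before[worst]:.2f})"]
    return "\n".join(lines)
```

Pass the scores from your first run as `before` and the re-measured scores from Step 6 as `after`, then write the returned string to a `.md` file.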

What's Next?

Next: Chapter 44: Online Evaluation, Observability, and Production Monitoring. Offline evals tell you what the model did on a benchmark you chose. Production traffic tells you what it did on the traffic your users actually sent. Chapter 44 covers distributed tracing, OpenTelemetry for LLMs, drift detection, online A/B testing, eval-as-product platforms (Braintrust, Langfuse, Helicone), and the alerting setup that catches "the model got 5% dumber overnight" before customers do.