"Retrieving the right document is only half the battle. The other half is not hallucinating beyond what it says."
Evaluating RAG Systems and AI Agents
RAG systems and agents introduce evaluation challenges that go far beyond standard LLM metrics. A RAG pipeline can fail at retrieval (wrong documents), at generation (hallucinating beyond retrieved context), or at both. An agent can select the wrong tool, call tools in the wrong order, or produce a correct final answer through an unsafe trajectory. Building on the foundational metrics from Section 29.1 and the RAG architecture from Chapter 20, this section covers specialized evaluation metrics and frameworks for these compound systems, including RAGAS metrics (faithfulness, answer relevancy, context precision, context recall), agent task completion and trajectory evaluation, and practical frameworks like DeepEval, Ragas, and Phoenix.
Prerequisites
This section requires the evaluation basics from Section 29.1 and Section 29.2. Understanding inference optimization from Section 09.1 provides context for the performance observability patterns discussed here.
1. Why RAG and Agent Evaluation Is Different
Standard LLM evaluation treats the model as a black box: question in, answer out. RAG and agent systems are multi-component pipelines where failures can occur at any stage. Evaluating only the final answer misses critical information about where the system failed and how to fix it. Component-level evaluation isolates retrieval quality from generation quality, enabling targeted improvements.
For RAG systems, the two fundamental questions are: (1) Did the retriever find the right information? (2) Did the generator use that information faithfully?
Faithfulness scoring catches a common RAG failure mode: the retriever finds the right document, the generator reads it, and then confidently adds extra "facts" from its training data. Think of it as a student who studied the textbook but cannot resist embellishing the essay with things they saw on Wikipedia.
Component-level evaluation reveals where to invest. End-to-end accuracy tells you the system is broken; component-level metrics tell you which piece to fix. A RAG system with perfect retrieval but poor generation needs better prompts or a better LLM. A system with excellent generation but poor retrieval needs better chunking or reranking. Without component-level metrics, teams waste weeks tuning the wrong layer. This principle extends to agents: a tool-selection error and a tool-execution error require entirely different fixes, but end-to-end metrics collapse them into a single "wrong answer" signal.
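The triage logic described above can be sketched as a simple decision rule. This is an illustrative sketch only: the function name, the 0.8 threshold, and the two-metric triage are assumptions for exposition, not standard values, and real systems should also inspect context precision and answer relevancy.

```python
def diagnose_rag_failure(context_recall: float, faithfulness: float,
                         threshold: float = 0.8) -> str:
    """Map component-level scores to the layer most likely at fault.

    Illustrative triage only; the threshold is an assumption to tune
    against your own golden set.
    """
    if context_recall < threshold:
        # Bad retrieval makes generation scores unreliable: fix it first
        return "invest in retrieval: chunking, embeddings, reranking"
    if faithfulness < threshold:
        return "invest in generation: grounding prompts, citations, model choice"
    return "both components healthy; evaluate end-to-end correctness"

print(diagnose_rag_failure(context_recall=0.92, faithfulness=0.61))
```

With recall at 0.92 and faithfulness at 0.61, the rule points at the generation layer, which is exactly the diagnosis the case study later in this section arrives at.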
For agents, the questions expand to: Did the agent choose the right tools? Did it call them with correct parameters? Did it follow a safe and efficient trajectory? Was the final answer correct? Figure 29.3.1 maps evaluation metrics to each stage of the RAG pipeline.
2. RAGAS Metrics
RAGAS (Retrieval Augmented Generation Assessment) is a framework that decomposes RAG evaluation into component-level metrics. Each metric isolates a specific aspect of the pipeline, making it possible to diagnose whether failures originate in retrieval, generation, or both. Code Fragment 29.3.5 below puts this into practice.
Core RAGAS Metrics
| Metric | What It Measures | Requires Ground Truth? | Score Range |
|---|---|---|---|
| Faithfulness | Whether the answer is supported by the retrieved context | No (uses context only) | 0 to 1 |
| Answer Relevancy | Whether the answer addresses the question | No | 0 to 1 |
| Context Precision | Whether retrieved chunks are relevant (not noisy) | Yes | 0 to 1 |
| Context Recall | Whether all necessary information was retrieved | Yes | 0 to 1 |
| Answer Correctness | Factual accuracy of the final answer | Yes | 0 to 1 |
Each RAGAS metric uses an LLM judge to decompose the RAG output into scorable components. The formal definitions below show how faithfulness, relevancy, and context quality are each reduced to a single number between 0 and 1:
RAGAS Metric Formulas.
Each metric uses an LLM judge to decompose and score:
Faithfulness.
Decompose the answer into atomic claims {c1, ..., ck}. For each claim, check if it is supported by the retrieved context. The score is the fraction of supported claims:
$$\text{Faithfulness} = \frac{|\{c_i : c_i \text{ is supported by context}\}|}{k}$$

Answer Relevancy.
Generate n hypothetical questions from the answer, then compute the mean cosine similarity between each generated question and the original question:

$$\text{Relevancy} = \frac{1}{n} \sum_{i=1}^{n} \cos\left(\text{embed}(q), \text{embed}(\hat{q}_i)\right)$$

Context Precision.
Among the top-K retrieved chunks, compute mean precision at each relevant rank:

$$\text{ContextPrecision} = \frac{1}{K} \sum_{k=1}^{K} \left(\text{Precision@}k \cdot \text{relevant}(k)\right)$$

Context Recall.
Decompose the ground truth answer into claims and check if each is attributable to a retrieved chunk:

$$\text{ContextRecall} = \frac{|\{\text{ground truth claims attributable to context}\}|}{|\{\text{all ground truth claims}\}|}$$
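The context-precision formula can be computed directly from a list of per-rank relevance flags. The sketch below implements the formula exactly as written (normalizing by K); note that the Ragas implementation itself normalizes by the number of relevant chunks instead, so treat this as a simplification:

```python
def context_precision(relevance: list[int]) -> float:
    """Mean of Precision@k over relevant ranks, normalized by K.

    relevance[k-1] is 1 if the chunk at rank k is relevant, else 0.
    Follows the formula as written; Ragas normalizes by the number
    of relevant chunks rather than by K.
    """
    K = len(relevance)
    if K == 0:
        return 0.0
    total = 0.0
    hits = 0
    for k, rel in enumerate(relevance, start=1):
        hits += rel
        if rel:
            total += hits / k  # Precision@k, counted only at relevant ranks
    return total / K

# Ranks 1, 2, and 4 relevant out of K=5 retrieved chunks
print(context_precision([1, 1, 0, 1, 0]))  # (1.0 + 1.0 + 0.75) / 5 = 0.55
```

Relevant chunks appearing early in the ranking are rewarded: moving the same three relevant chunks to ranks 3-5 would lower the score, which is the behavior a reranker is meant to improve.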
Question: "What causes ocean tides?"
Retrieved context: "Ocean tides are primarily caused by the gravitational pull of the Moon on Earth's water."
Generated answer: "Tides are caused by the Moon's gravity, the Sun's gravity, and the rotation of the Earth."
The LLM judge decomposes the answer into three atomic claims: (1) "Tides are caused by the Moon's gravity" (supported by context), (2) "Tides are caused by the Sun's gravity" (not supported; the context mentions only the Moon), (3) "Tides are caused by the rotation of the Earth" (not supported by context).
Faithfulness = 1 supported claim / 3 total claims = 0.333. This low score correctly flags that the generator embellished beyond the retrieved context, even though claims (2) and (3) happen to be factually true. Faithfulness measures grounding in context, not factual correctness.
# Evaluate a small RAG dataset with the four core RAGAS metrics
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset
# Prepare evaluation dataset with required columns
eval_data = {
    "question": [
        "What is the capital of France?",
        "How does photosynthesis work?",
        "What causes tides?",
    ],
    "answer": [
        "The capital of France is Paris, which is also its largest city.",
        "Photosynthesis converts CO2 and water into glucose using sunlight.",
        "Tides are caused by gravitational pull of the moon and sun.",
    ],
    "contexts": [
        ["Paris is the capital and most populous city of France."],
        ["Photosynthesis is a process by which plants convert light energy into chemical energy."],
        ["Ocean tides are caused by the gravitational forces of the moon."],
    ],
    "ground_truth": [
        "Paris is the capital of France.",
        "Photosynthesis converts light energy, CO2, and water into glucose and oxygen.",
        "Tides are primarily caused by the moon's gravitational pull on Earth's oceans.",
    ],
}

dataset = Dataset.from_dict(eval_data)

# Run RAGAS evaluation
results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(results)
print("\nPer-question breakdown:")
df = results.to_pandas()
print(df.to_string(index=False))
The code above uses the Ragas v0.1.x API. Ragas v0.2+ introduced significant API changes, including a new SingleTurnSample class and restructured metric imports. Install with pip install ragas and check the official documentation for the latest API patterns if you are using v0.2 or later.
Faithfulness measures whether the answer is supported by the retrieved context, while correctness measures whether it matches the ground truth. An answer can be faithful (derived only from context) but incorrect (the context itself was wrong or incomplete). Conversely, an answer can be correct but unfaithful (the model "knew" the answer and ignored the context). Both metrics are needed to fully diagnose RAG failures. Code Fragment 29.3.7 below puts this into practice.
Custom Faithfulness Scorer
This snippet implements a custom faithfulness scorer that checks whether a generated answer is grounded in the retrieved context.
# Custom LLM-judge faithfulness scorer: decompose the answer into claims,
# then verify each claim against the retrieved context
from openai import OpenAI
import json
client = OpenAI()
def score_faithfulness(question: str, answer: str, contexts: list[str]) -> dict:
"""Score faithfulness: is the answer grounded in the provided context?
Uses an LLM judge to decompose the answer into claims and verify
each claim against the context.
"""
context_text = "\n\n".join(f"Context {i+1}: {c}" for i, c in enumerate(contexts))
prompt = f"""Evaluate the faithfulness of an answer to the provided context.
QUESTION: {question}
CONTEXT:
{context_text}
ANSWER: {answer}
Instructions:
1. Decompose the answer into individual factual claims
2. For each claim, determine if it is SUPPORTED or NOT SUPPORTED by the context
3. Return JSON with:
- "claims": list of {{"claim": str, "supported": bool, "evidence": str}}
- "faithfulness_score": fraction of supported claims (0.0 to 1.0)"""
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": prompt}],
response_format={"type": "json_object"},
temperature=0,
)
return json.loads(response.choices[0].message.content)
# Example usage
result = score_faithfulness(
question="What is the capital of France?",
answer="Paris is the capital of France with a population of 2.1 million.",
contexts=["Paris is the capital and most populous city of France."]
)
print(json.dumps(result, indent=2))
The same result in 6 lines with DeepEval:
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase
test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="Paris is the capital of France with a population of 2.1 million.",
    retrieval_context=["Paris is the capital and most populous city of France."],
)
faithfulness = FaithfulnessMetric(threshold=0.7)
faithfulness.measure(test_case)
print(f"Score: {faithfulness.score}, Reason: {faithfulness.reason}")
3. Agent Evaluation
Evaluating agents is fundamentally harder than evaluating simple question-answering systems because agents take multi-step actions with real-world side effects. A correct final answer does not mean the agent followed a safe or efficient path to get there. Agent evaluation therefore requires assessing multiple dimensions: task completion, tool selection accuracy, parameter correctness, trajectory efficiency, and safety. Figure 29.3.2 compares an ideal trajectory with an actual agent trajectory. Code Fragment 29.3.3 below puts this into practice.
For agent evaluation, log the full trajectory (every tool call, parameter, and intermediate result), not just the final answer. Two agents can produce the same correct answer, but one took 3 tool calls and the other took 15. Trajectory logs reveal inefficiency, unnecessary API spending, and near-miss safety violations that final-answer-only evaluation would completely miss.
Evaluation Dimensions for Agents
- Task completion: Did the agent achieve the stated goal?
- Tool accuracy: Did the agent select the correct tools for each step?
- Parameter correctness: Were tool calls made with valid and appropriate parameters?
- Trajectory efficiency: Did the agent take unnecessary steps or redundant tool calls?
- Safety: Did the agent avoid dangerous or unauthorized actions?
- Cost efficiency: How many LLM calls and tokens were consumed?
# Trajectory data structures and a reference-based trajectory evaluator
from dataclasses import dataclass
from typing import Optional
@dataclass
class ToolCall:
    """A single tool call in an agent trajectory."""
    tool_name: str
    parameters: dict
    result: Optional[str] = None
    is_correct_tool: Optional[bool] = None

@dataclass
class AgentTrajectory:
    """Complete record of an agent's execution."""
    task: str
    tool_calls: list[ToolCall]
    final_answer: str
    total_tokens: int = 0
    total_latency_ms: float = 0
def evaluate_agent_trajectory(
    trajectory: AgentTrajectory,
    ideal_tools: list[str],
    expected_answer: str,
    answer_checker=None,
) -> dict:
    """Evaluate an agent trajectory against an ideal reference.

    Args:
        trajectory: The actual execution trajectory
        ideal_tools: Ordered list of expected tool names
        expected_answer: The ground truth answer
        answer_checker: Optional function for fuzzy answer matching
    """
    actual_tools = [tc.tool_name for tc in trajectory.tool_calls]

    # Task completion: did the agent get the right answer?
    if answer_checker:
        task_complete = answer_checker(trajectory.final_answer, expected_answer)
    else:
        task_complete = (
            trajectory.final_answer.strip().lower() == expected_answer.strip().lower()
        )

    # Tool accuracy: fraction of calls using correct tools
    correct_tools = sum(1 for t in actual_tools if t in ideal_tools)
    tool_accuracy = correct_tools / len(actual_tools) if actual_tools else 0

    # Trajectory efficiency: ideal steps / actual steps, capped at 1.0
    efficiency = min(len(ideal_tools) / len(actual_tools), 1.0) if actual_tools else 0

    # Redundancy: duplicate consecutive tool calls
    redundant = sum(
        1 for i in range(1, len(actual_tools))
        if actual_tools[i] == actual_tools[i - 1]
    )

    return {
        "task_completed": task_complete,
        "tool_accuracy": round(tool_accuracy, 3),
        "trajectory_efficiency": round(efficiency, 3),
        "num_steps": len(actual_tools),
        "ideal_steps": len(ideal_tools),
        "redundant_calls": redundant,
        "total_tokens": trajectory.total_tokens,
    }
The DeepEval library (pip install deepeval) provides the same trajectory and RAG metrics without manual implementation:
from deepeval.metrics import ToolCorrectnessMetric
from deepeval.test_case import LLMTestCase, ToolCall
metric = ToolCorrectnessMetric()
test_case = LLMTestCase(
    input="Find Q3 revenue and format a report",
    actual_output="Q3 revenue was $4.2M...",
    tools_called=[ToolCall(name="search_db"), ToolCall(name="calculate")],
    expected_tools=[
        ToolCall(name="search_db"),
        ToolCall(name="calculate"),
        ToolCall(name="format_report"),
    ],
)
metric.measure(test_case)
print(f"Tool correctness: {metric.score}") # 0.667 (2/3 expected tools used)
Agent evaluation should weight task completion most heavily, since a correct answer through an inefficient path is better than an efficient trajectory with a wrong answer. However, trajectory quality matters for cost, latency, and safety. In production, an agent that consistently takes extra steps will cost more and may expose the system to more failure points. Evaluate both dimensions and set acceptable thresholds for each.
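One way to operationalize this weighting is a composite score over the trajectory metrics. The function below is a sketch: the weights (0.6 / 0.25 / 0.15) are illustrative assumptions that should be tuned, and re-validated, per application.

```python
def composite_agent_score(task_completed: bool, tool_accuracy: float,
                          trajectory_efficiency: float,
                          w_task: float = 0.6, w_tools: float = 0.25,
                          w_eff: float = 0.15) -> float:
    """Weighted blend of agent evaluation dimensions.

    Task completion dominates, but tool accuracy and trajectory
    efficiency still move the score. Weights are illustrative
    assumptions and must sum to 1.
    """
    assert abs(w_task + w_tools + w_eff - 1.0) < 1e-9
    return round(w_task * float(task_completed)
                 + w_tools * tool_accuracy
                 + w_eff * trajectory_efficiency, 3)

# Correct answer via an inefficient path scores well, but not perfectly
print(composite_agent_score(True, tool_accuracy=0.667,
                            trajectory_efficiency=0.333))  # 0.817
```

A wrong answer caps the score at 0.4 even with a flawless trajectory, encoding the principle that task completion matters most.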
4. Evaluation Frameworks Comparison
| Framework | Focus | Key Metrics | Strengths | Limitations |
|---|---|---|---|---|
| Ragas | RAG evaluation | Faithfulness, relevancy, context precision/recall | Most comprehensive RAG metrics; good HF integration | Relies on LLM judge; can be slow |
| DeepEval | General LLM testing | Hallucination, bias, toxicity, custom metrics | pytest integration; CI/CD friendly | Less RAG-specific depth than Ragas |
| Phoenix (Arize) | Observability + eval | Trace-level metrics, embedding analysis | Visual UI; traces + evals combined | Heavier infrastructure requirement |
| TruLens | Feedback functions | Groundedness, relevance, custom feedback | Modular feedback system; provider-agnostic | Smaller community than alternatives |
| promptfoo | Prompt testing | Assertion-based, custom evals | CLI-first; fast iteration; CI/CD native | Less suited for complex agent evaluation |
Launch a local Phoenix session to trace and evaluate LLM calls with an interactive UI.
# pip install arize-phoenix openai
import phoenix as px
from phoenix.trace.openai import OpenAIInstrumentor
# Launch local Phoenix UI at http://localhost:6006
session = px.launch_app()
# Auto-instrument all OpenAI calls for tracing
OpenAIInstrumentor().instrument()
from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain RAG in one sentence."}],
)
print(response.choices[0].message.content)
# View the traced call at http://localhost:6006
# pytest-style RAG quality test using DeepEval metrics
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
FaithfulnessMetric,
AnswerRelevancyMetric,
HallucinationMetric,
)
def test_rag_with_deepeval():
    """Example: Testing RAG output quality with DeepEval."""
    docs = [
        "Solar energy is a renewable source of power that reduces "
        "dependence on fossil fuels.",
        "Solar panels require minimal maintenance and can reduce "
        "electricity bills by up to 50%.",
    ]
    test_case = LLMTestCase(
        input="What are the benefits of solar energy?",
        actual_output="Solar energy is renewable, reduces electricity bills, "
        "and has low maintenance costs.",
        retrieval_context=docs,
        context=docs,  # HallucinationMetric evaluates against `context`
    )

    # Define metrics with thresholds
    faithfulness = FaithfulnessMetric(threshold=0.7, model="gpt-4o")
    relevancy = AnswerRelevancyMetric(threshold=0.7, model="gpt-4o")
    hallucination = HallucinationMetric(threshold=0.5, model="gpt-4o")

    # Assert all metrics pass (integrates with pytest)
    assert_test(test_case, [faithfulness, relevancy, hallucination])
# Run with: pytest test_rag.py -v
All framework-computed metrics that rely on LLM judges inherit the biases and limitations of the judge model. Faithfulness scores can be unreliable when the context is ambiguous or when the judge model hallucinates its own assessment. Always validate framework metrics against a set of human-annotated examples before trusting them for production decisions. Establish the correlation between the automated metric and human judgment on your specific data.
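Establishing that correlation can be as simple as scoring a shared sample both ways and computing the agreement. The sketch below uses a hand-rolled Pearson correlation over hypothetical paired scores (the score lists are invented for illustration); in practice you would use Spearman rank correlation from scipy and a few hundred annotated examples.

```python
def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical: automated faithfulness scores vs. 1-5 human grades
auto_scores = [0.95, 0.40, 0.80, 0.20, 0.70, 0.55]
human_grades = [5, 2, 4, 1, 4, 3]

r = pearson(auto_scores, human_grades)
print(f"correlation with human judgment: {r:.2f}")
```

If the correlation on your annotated sample is low, the automated metric should not gate deployment decisions, no matter how convenient it is to compute.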
1. What is the difference between faithfulness and answer correctness in RAG evaluation?
2. Why is trajectory evaluation important for agents, even when the final answer is correct?
3. When would you choose DeepEval over Ragas for RAG evaluation?
4. How does context precision differ from context recall, and which requires ground truth?
5. What is a practical strategy for validating that an LLM-based evaluation metric correlates with human judgment?
Who: ML platform team at an enterprise software company building an internal knowledge assistant
Situation: The RAG-based assistant answered questions about internal documentation, but users reported that 30% of answers were inaccurate. End-to-end accuracy metrics confirmed the problem but did not reveal whether retrieval or generation was at fault.
Problem: Without component-level metrics, the team could not decide whether to invest in better embeddings, improved chunking, or generation prompt tuning. Each path required different expertise and weeks of effort.
Dilemma: Improving retrieval would not help if the generator was hallucinating beyond the retrieved context. Improving generation prompts would not help if the retriever was fetching irrelevant documents.
Decision: The team deployed Ragas to compute separate retrieval metrics (context precision, context recall) and generation metrics (faithfulness, answer relevancy) on a golden test set of 200 questions with verified answers and source documents.
How: They created the golden set by having domain experts write questions, identify the correct source passages, and write reference answers. Ragas scored each component independently. They discovered context recall was 0.92 (retrieval was finding the right documents) but faithfulness was only 0.61 (the generator was adding information not present in the retrieved context).
Result: Armed with this diagnosis, the team focused exclusively on generation: adding "only use information from the provided context" instructions and implementing citation requirements. Faithfulness improved from 0.61 to 0.89, and user-reported inaccuracy dropped from 30% to 8%, all without changing the retrieval pipeline.
Lesson: Component-level RAG evaluation prevents wasted effort by pinpointing exactly where in the pipeline failures originate, enabling targeted fixes instead of guesswork.
LLM Metacognition and Calibration
A critical but often overlooked dimension of LLM evaluation is calibration: whether a model's expressed confidence matches its actual accuracy. A well-calibrated model that claims 80% confidence should be correct roughly 80% of the time. In practice, LLMs are frequently overconfident, particularly after RLHF alignment training, which rewards fluent, assertive responses and penalizes hedging. This creates a dangerous pattern where the model sounds maximally confident regardless of whether the answer is correct.
Two distinct approaches exist for measuring model uncertainty. Token-level logprobs provide the model's raw probability distribution over next tokens, offering a mathematical measure of confidence. Verbalized uncertainty ("I'm not sure, but...") reflects the model's trained ability to express doubt in natural language. These two signals can diverge: a model may produce high-confidence logprobs while verbally hedging, or vice versa. Neither is perfectly reliable on its own, but combining them provides a richer picture of model confidence.
Practical applications of calibration assessment include selective prediction (abstaining from answering when confidence is low rather than guessing), human-in-the-loop triggers (automatically escalating uncertain responses to human reviewers), and "I don't know" detection (training models to recognize the boundaries of their knowledge). For RAG systems, calibration is especially important: a model should express lower confidence when retrieved context is sparse or contradictory. Miscalibration has direct safety implications, as discussed in Section 32.3, because overconfident wrong answers in high-stakes domains (medical, legal, financial) can cause real harm.
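Calibration can be quantified with expected calibration error (ECE): bucket predictions by stated confidence and compare each bucket's average confidence against its empirical accuracy. A minimal sketch over hypothetical (confidence, correct) pairs:

```python
def expected_calibration_error(preds: list[tuple[float, bool]],
                               n_bins: int = 5) -> float:
    """ECE: bin-weighted mean of |avg confidence - accuracy|.

    preds is a list of (stated_confidence, was_correct) pairs,
    with confidences in [0, 1].
    """
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, correct in preds:
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, correct))
    n = len(preds)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        ece += (len(bucket) / n) * abs(avg_conf - accuracy)
    return ece

# Hypothetical overconfident model: 90% stated confidence, 60% correct
preds = [(0.9, True), (0.9, True), (0.9, True), (0.9, False), (0.9, False)]
print(round(expected_calibration_error(preds), 3))  # 0.3
```

An ECE of 0 means the model's stated confidence exactly tracks its accuracy; the 0.3 here is the 30-point gap between saying 90% and being right 60% of the time.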
For tasks without clear right/wrong answers (summarization, creative writing), use a strong model (GPT-4 class) as an automated judge with a detailed rubric. This scales better than human evaluation while correlating well with human preferences when the rubric is specific.
- Evaluate RAG at component level, not just end-to-end. Use retrieval metrics (context precision, context recall) and generation metrics (faithfulness, answer relevancy) to isolate where failures occur and guide targeted improvements.
- Faithfulness is the most critical RAG metric. A RAG system that generates unfaithful answers (hallucinating beyond context) undermines the entire purpose of retrieval-augmented generation. Monitor faithfulness continuously in production.
- Agent evaluation requires trajectory analysis. Assess tool accuracy, parameter correctness, and trajectory efficiency alongside task completion. An agent that completes tasks through unsafe or wasteful paths is a production liability.
- Choose frameworks based on your primary need. Ragas excels at RAG-specific metrics, DeepEval integrates with testing pipelines, Phoenix combines observability with evaluation, and promptfoo enables rapid prompt iteration.
- Validate automated metrics against human judgment. LLM-based evaluation metrics inherit judge model biases. Always establish correlation with human annotations before trusting automated scores for deployment decisions.
Open Questions in RAG and Agent Evaluation (2024-2026):
- Multi-hop RAG evaluation: Evaluating RAG systems that synthesize information across multiple retrieved documents remains challenging. Current metrics assess single-document faithfulness well but struggle with cross-document reasoning quality.
- Agent safety evaluation: How do you evaluate whether an agent will behave safely in novel situations it was not tested on? Formal verification approaches from software engineering are being adapted for agent trajectory analysis.
- Long-context RAG evaluation: As context windows grow to 1M+ tokens, evaluating whether models actually use all retrieved information (rather than relying on the first few chunks) requires new evaluation methodologies like the "needle in a haystack" test family.
Explore Further: Build a component-level evaluation pipeline for a simple RAG system using Ragas, then deliberately introduce retrieval and generation failures to see which metrics detect each failure mode.
Exercises
Explain why evaluating only the final answer of a RAG system is insufficient. Describe the four RAGAS metrics (faithfulness, answer relevancy, context precision, context recall) and what each measures.
Answer Sketch
End-to-end evaluation cannot distinguish retrieval failures from generation failures. Faithfulness measures whether the answer is supported by the retrieved context (catches hallucination). Answer relevancy measures whether the answer addresses the question. Context precision measures how many of the retrieved chunks are actually relevant. Context recall measures how much of the required information was retrieved. Together they pinpoint which component needs improvement.
An agent is asked to find the weather in Paris. It calls a web search tool, then a calculator tool, then the weather API. The final answer is correct. Should this trajectory receive full marks? Justify your answer with evaluation criteria.
Answer Sketch
No. While the final answer is correct, the trajectory is inefficient: the web search and calculator calls were unnecessary. Trajectory evaluation should score efficiency (minimum number of tool calls), correctness of tool selection (weather API was the only relevant tool), parameter accuracy (correct city name), and safety (no unauthorized actions). A correct final answer via an inefficient path should receive partial credit.
Write a Python function that uses an LLM to compute a faithfulness score. Given a context string and an answer string, the function should decompose the answer into individual claims, check each claim against the context, and return the fraction of claims supported by the context.
Answer Sketch
Step 1: Prompt an LLM to decompose the answer into atomic claims (one fact per claim). Step 2: For each claim, prompt the LLM with the context and ask "Is this claim supported by the context? Answer YES or NO." Step 3: Count the number of YES responses and divide by the total number of claims. This mirrors the RAGAS faithfulness pipeline. Handle edge cases where the answer contains no verifiable claims.
Your RAG system retrieves 5 chunks per query. On average, 2 of the 5 are relevant. What is the context precision? If the relevant chunks are needed to answer the question but only 60% of necessary information is retrieved, what is the context recall? Propose two changes to improve each metric.
Answer Sketch
Context precision is 2/5 = 40%. Context recall is 60%. To improve precision: (1) reduce the number of retrieved chunks (top-3 instead of top-5), (2) add a reranking step that scores retrieved chunks before passing them to the generator. To improve recall: (1) use hybrid search (dense + sparse retrieval), (2) apply query expansion to capture more relevant documents, (3) use smaller, more focused chunks so each chunk is more likely to contain the needed information.
Design an evaluation harness for a RAG system that computes retrieval metrics (precision@k, recall@k) and generation metrics (faithfulness, answer relevancy) on a labeled dataset of (query, relevant_doc_ids, gold_answer) triples. Outline the code structure and explain how you would report the results.
Answer Sketch
Create a pipeline that (1) runs each query through the retriever, (2) compares retrieved doc IDs to ground truth for precision@k and recall@k, (3) passes retrieved context and query to the generator, (4) scores the generated answer for faithfulness (claim verification against context) and answer relevancy (semantic similarity to the query). Report all metrics with bootstrap confidence intervals. Include a breakdown by query category if available.
What Comes Next
In the next section, Section 29.4: Testing LLM Applications, we examine testing practices for LLM applications, including unit tests, integration tests, and regression suites.
Bibliography
Es, S., James, J., Espinosa-Anke, L., & Schockaert, S. (2024). "Ragas: Automated Evaluation of Retrieval Augmented Generation." arXiv:2309.15217
Saad-Falcon, J., Khattab, O., Potts, C., & Zaharia, M. (2024). "ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems." arXiv:2311.09476
Arize AI. (2024). "Phoenix: ML Observability in a Notebook." https://github.com/Arize-ai/phoenix
Confident AI. (2024). "DeepEval: The Open-Source LLM Evaluation Framework." https://github.com/confident-ai/deepeval
promptfoo. (2024). "promptfoo: Test Your LLM App." https://www.promptfoo.dev/
