"Retrieving the right document is only half the battle. The other half is not hallucinating beyond what it says."
Evaluating RAG Systems and AI Agents
RAG systems and agents introduce evaluation challenges that go far beyond standard LLM metrics. A RAG pipeline can fail at retrieval (wrong documents), at generation (hallucinating beyond retrieved context), or at both. An agent can select the wrong tool, call tools in the wrong order, or produce a correct final answer through an unsafe trajectory. Building on the foundational metrics from Section 29.1 and the RAG architecture from Chapter 20, this section covers specialized evaluation metrics and frameworks for these compound systems, including RAGAS metrics (faithfulness, answer relevancy, context precision, context recall), agent task completion and trajectory evaluation, and practical frameworks like DeepEval, Ragas, and Phoenix.
Prerequisites
This section requires the evaluation basics from Section 29.1 and Section 29.2. Understanding inference optimization from Section 09.1 provides context for the performance observability patterns discussed here.
1. Why RAG and Agent Evaluation Is Different
Standard LLM evaluation treats the model as a black box: question in, answer out. RAG and agent systems are multi-component pipelines where failures can occur at any stage. Evaluating only the final answer misses critical information about where the system failed and how to fix it. Component-level evaluation isolates retrieval quality from generation quality, enabling targeted improvements.
For RAG systems, the two fundamental questions are: (1) Did the retriever find the right information? (2) Did the generator use that information faithfully?
Faithfulness scoring catches a common RAG failure mode: the retriever finds the right document, the generator reads it, and then confidently adds extra "facts" from its training data. Think of it as a student who studied the textbook but cannot resist embellishing the essay with things they saw on Wikipedia.
Component-level evaluation reveals where to invest. End-to-end accuracy tells you the system is broken; component-level metrics tell you which piece to fix. A RAG system with perfect retrieval but poor generation needs better prompts or a better LLM. A system with excellent generation but poor retrieval needs better chunking or reranking. Without component-level metrics, teams waste weeks tuning the wrong layer. This principle extends to agents: a tool-selection error and a tool-execution error require entirely different fixes, but end-to-end metrics collapse them into a single "wrong answer" signal.
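The triage logic described above can be sketched as a simple decision rule. This is an illustrative sketch only: the function name, the 0.8 threshold, and the two-metric triage are assumptions for exposition, not standard values, and real systems should also inspect context precision and answer relevancy.

```python
def diagnose_rag_failure(context_recall: float, faithfulness: float,
                         threshold: float = 0.8) -> str:
    """Map component-level scores to the layer most likely at fault.

    Illustrative triage only; the threshold is an assumption to tune
    against your own golden set.
    """
    if context_recall < threshold:
        # Bad retrieval makes generation scores unreliable: fix it first
        return "invest in retrieval: chunking, embeddings, reranking"
    if faithfulness < threshold:
        return "invest in generation: grounding prompts, citations, model choice"
    return "both components healthy; evaluate end-to-end correctness"

print(diagnose_rag_failure(context_recall=0.92, faithfulness=0.61))
```

With recall at 0.92 and faithfulness at 0.61, the rule points at the generation layer, which is exactly the diagnosis the case study later in this section arrives at.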
For agents, the questions expand to: Did the agent choose the right tools? Did it call them with correct parameters? Did it follow a safe and efficient trajectory? Was the final answer correct? Figure 29.3.1 maps evaluation metrics to each stage of the RAG pipeline.
2. RAGAS Metrics
RAGAS (Retrieval Augmented Generation Assessment) is a framework that decomposes RAG evaluation into component-level metrics. Each metric isolates a specific aspect of the pipeline, making it possible to diagnose whether failures originate in retrieval, generation, or both. Code Fragment 29.3.5 below puts this into practice.
Core RAGAS Metrics
| Metric | What It Measures | Requires Ground Truth? | Score Range |
|---|---|---|---|
| Faithfulness | Whether the answer is supported by the retrieved context | No (uses context only) | 0 to 1 |
| Answer Relevancy | Whether the answer addresses the question | No | 0 to 1 |
| Context Precision | Whether retrieved chunks are relevant (not noisy) | Yes | 0 to 1 |
| Context Recall | Whether all necessary information was retrieved | Yes | 0 to 1 |
| Answer Correctness | Factual accuracy of the final answer | Yes | 0 to 1 |
Each RAGAS metric uses an LLM judge to decompose the RAG output into scorable components. The formal definitions below show how faithfulness, relevancy, and context quality are each reduced to a single number between 0 and 1:
RAGAS Metric Formulas.
Each metric uses an LLM judge to decompose and score:
Faithfulness.
Decompose the answer into atomic claims {c1, ..., ck}. For each claim, check if it is supported by the retrieved context. The score is the fraction of supported claims:
$$\text{Faithfulness} = \frac{|\{c_i : c_i \text{ is supported by context}\}|}{k}$$

Answer Relevancy.
Generate n hypothetical questions from the answer, then compute the mean cosine similarity between each generated question and the original question:

$$\text{Relevancy} = \frac{1}{n} \sum_{i=1}^{n} \cos\left(\text{embed}(q), \text{embed}(\hat{q}_i)\right)$$

Context Precision.
Among the top-K retrieved chunks, compute mean precision at each relevant rank:

$$\text{ContextPrecision} = \frac{1}{K} \sum_{k=1}^{K} \left(\text{Precision@}k \cdot \text{relevant}(k)\right)$$

Context Recall.
Decompose the ground truth answer into claims and check if each is attributable to a retrieved chunk:

$$\text{ContextRecall} = \frac{|\{\text{ground truth claims attributable to context}\}|}{|\{\text{all ground truth claims}\}|}$$
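The context-precision formula can be computed directly from a list of per-rank relevance flags. The sketch below implements the formula exactly as written (normalizing by K); note that the Ragas implementation itself normalizes by the number of relevant chunks instead, so treat this as a simplification:

```python
def context_precision(relevance: list[int]) -> float:
    """Mean of Precision@k over relevant ranks, normalized by K.

    relevance[k-1] is 1 if the chunk at rank k is relevant, else 0.
    Follows the formula as written; Ragas normalizes by the number
    of relevant chunks rather than by K.
    """
    K = len(relevance)
    if K == 0:
        return 0.0
    total = 0.0
    hits = 0
    for k, rel in enumerate(relevance, start=1):
        hits += rel
        if rel:
            total += hits / k  # Precision@k, counted only at relevant ranks
    return total / K

# Ranks 1, 2, and 4 relevant out of K=5 retrieved chunks
print(context_precision([1, 1, 0, 1, 0]))  # (1.0 + 1.0 + 0.75) / 5 = 0.55
```

Relevant chunks appearing early in the ranking are rewarded: moving the same three relevant chunks to ranks 3-5 would lower the score, which is the behavior a reranker is meant to improve.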
Question: "What causes ocean tides?"
Retrieved context: "Ocean tides are primarily caused by the gravitational pull of the Moon on Earth's water."
Generated answer: "Tides are caused by the Moon's gravity, the Sun's gravity, and the rotation of the Earth."
The LLM judge decomposes the answer into three atomic claims: (1) "Tides are caused by the Moon's gravity" (supported by context), (2) "Tides are caused by the Sun's gravity" (not supported; the context mentions only the Moon), (3) "Tides are caused by the rotation of the Earth" (not supported by context).
Faithfulness = 1 supported claim / 3 total claims = 0.333. This low score correctly flags that the generator embellished beyond the retrieved context, even though claims (2) and (3) happen to be factually true. Faithfulness measures grounding in context, not factual correctness.
# Evaluate a small RAG dataset with the four core RAGAS metrics
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset
# Prepare evaluation dataset with required columns
eval_data = {
    "question": [
        "What is the capital of France?",
        "How does photosynthesis work?",
        "What causes tides?",
    ],
    "answer": [
        "The capital of France is Paris, which is also its largest city.",
        "Photosynthesis converts CO2 and water into glucose using sunlight.",
        "Tides are caused by gravitational pull of the moon and sun.",
    ],
    "contexts": [
        ["Paris is the capital and most populous city of France."],
        ["Photosynthesis is a process by which plants convert light energy into chemical energy."],
        ["Ocean tides are caused by the gravitational forces of the moon."],
    ],
    "ground_truth": [
        "Paris is the capital of France.",
        "Photosynthesis converts light energy, CO2, and water into glucose and oxygen.",
        "Tides are primarily caused by the moon's gravitational pull on Earth's oceans.",
    ],
}

dataset = Dataset.from_dict(eval_data)

# Run RAGAS evaluation
results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(results)
print("\nPer-question breakdown:")
df = results.to_pandas()
print(df.to_string(index=False))
The code above uses the Ragas v0.1.x API. Ragas v0.2+ introduced significant API changes, including a new SingleTurnSample class and restructured metric imports. Install with pip install ragas and check the official documentation for the latest API patterns if you are using v0.2 or later.
Faithfulness measures whether the answer is supported by the retrieved context, while correctness measures whether it matches the ground truth. An answer can be faithful (derived only from context) but incorrect (the context itself was wrong or incomplete). Conversely, an answer can be correct but unfaithful (the model "knew" the answer and ignored the context). Both metrics are needed to fully diagnose RAG failures. Code Fragment 29.3.7 below puts this into practice.
Custom Faithfulness Scorer
This snippet implements a custom faithfulness scorer that checks whether a generated answer is grounded in the retrieved context.
# Custom LLM-judge faithfulness scorer: decompose the answer into claims,
# then verify each claim against the retrieved context
from openai import OpenAI
import json
client = OpenAI()
def score_faithfulness(question: str, answer: str, contexts: list[str]) -> dict:
"""Score faithfulness: is the answer grounded in the provided context?
Uses an LLM judge to decompose the answer into claims and verify
each claim against the context.
"""
context_text = "\n\n".join(f"Context {i+1}: {c}" for i, c in enumerate(contexts))
prompt = f"""Evaluate the faithfulness of an answer to the provided context.
QUESTION: {question}
CONTEXT:
{context_text}
ANSWER: {answer}
Instructions:
1. Decompose the answer into individual factual claims
2. For each claim, determine if it is SUPPORTED or NOT SUPPORTED by the context
3. Return JSON with:
- "claims": list of {{"claim": str, "supported": bool, "evidence": str}}
- "faithfulness_score": fraction of supported claims (0.0 to 1.0)"""
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": prompt}],
response_format={"type": "json_object"},
temperature=0,
)
return json.loads(response.choices[0].message.content)
# Example usage
result = score_faithfulness(
question="What is the capital of France?",
answer="Paris is the capital of France with a population of 2.1 million.",
contexts=["Paris is the capital and most populous city of France."]
)
print(json.dumps(result, indent=2))
The same result in 6 lines with DeepEval:
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase
test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="Paris is the capital of France with a population of 2.1 million.",
    retrieval_context=["Paris is the capital and most populous city of France."],
)
faithfulness = FaithfulnessMetric(threshold=0.7)
faithfulness.measure(test_case)
print(f"Score: {faithfulness.score}, Reason: {faithfulness.reason}")
3. Agent Evaluation
Evaluating agents is fundamentally harder than evaluating simple question-answering systems because agents take multi-step actions with real-world side effects. A correct final answer does not mean the agent followed a safe or efficient path to get there. Agent evaluation therefore requires assessing multiple dimensions: task completion, tool selection accuracy, parameter correctness, trajectory efficiency, and safety. Figure 29.3.2 compares an ideal trajectory with an actual agent trajectory. Code Fragment 29.3.3 below puts this into practice.
For agent evaluation, log the full trajectory (every tool call, parameter, and intermediate result), not just the final answer. Two agents can produce the same correct answer, but one took 3 tool calls and the other took 15. Trajectory logs reveal inefficiency, unnecessary API spending, and near-miss safety violations that final-answer-only evaluation would completely miss.
Evaluation Dimensions for Agents
- Task completion: Did the agent achieve the stated goal?
- Tool accuracy: Did the agent select the correct tools for each step?
- Parameter correctness: Were tool calls made with valid and appropriate parameters?
- Trajectory efficiency: Did the agent take unnecessary steps or redundant tool calls?
- Safety: Did the agent avoid dangerous or unauthorized actions?
- Cost efficiency: How many LLM calls and tokens were consumed?
# Trajectory data structures and a reference-based trajectory evaluator
from dataclasses import dataclass
from typing import Optional
@dataclass
class ToolCall:
    """A single tool call in an agent trajectory."""
    tool_name: str
    parameters: dict
    result: Optional[str] = None
    is_correct_tool: Optional[bool] = None

@dataclass
class AgentTrajectory:
    """Complete record of an agent's execution."""
    task: str
    tool_calls: list[ToolCall]
    final_answer: str
    total_tokens: int = 0
    total_latency_ms: float = 0
def evaluate_agent_trajectory(
    trajectory: AgentTrajectory,
    ideal_tools: list[str],
    expected_answer: str,
    answer_checker=None,
) -> dict:
    """Evaluate an agent trajectory against an ideal reference.

    Args:
        trajectory: The actual execution trajectory
        ideal_tools: Ordered list of expected tool names
        expected_answer: The ground truth answer
        answer_checker: Optional function for fuzzy answer matching
    """
    actual_tools = [tc.tool_name for tc in trajectory.tool_calls]

    # Task completion: did the agent get the right answer?
    if answer_checker:
        task_complete = answer_checker(trajectory.final_answer, expected_answer)
    else:
        task_complete = (
            trajectory.final_answer.strip().lower() == expected_answer.strip().lower()
        )

    # Tool accuracy: fraction of calls using correct tools
    correct_tools = sum(1 for t in actual_tools if t in ideal_tools)
    tool_accuracy = correct_tools / len(actual_tools) if actual_tools else 0

    # Trajectory efficiency: ideal steps / actual steps, capped at 1.0
    efficiency = min(len(ideal_tools) / len(actual_tools), 1.0) if actual_tools else 0

    # Redundancy: duplicate consecutive tool calls
    redundant = sum(
        1 for i in range(1, len(actual_tools))
        if actual_tools[i] == actual_tools[i - 1]
    )

    return {
        "task_completed": task_complete,
        "tool_accuracy": round(tool_accuracy, 3),
        "trajectory_efficiency": round(efficiency, 3),
        "num_steps": len(actual_tools),
        "ideal_steps": len(ideal_tools),
        "redundant_calls": redundant,
        "total_tokens": trajectory.total_tokens,
    }
The DeepEval library (pip install deepeval) provides the same trajectory and RAG metrics without manual implementation:
from deepeval.metrics import ToolCorrectnessMetric
from deepeval.test_case import LLMTestCase, ToolCall
metric = ToolCorrectnessMetric()
test_case = LLMTestCase(
    input="Find Q3 revenue and format a report",
    actual_output="Q3 revenue was $4.2M...",
    tools_called=[ToolCall(name="search_db"), ToolCall(name="calculate")],
    expected_tools=[
        ToolCall(name="search_db"),
        ToolCall(name="calculate"),
        ToolCall(name="format_report"),
    ],
)
metric.measure(test_case)
print(f"Tool correctness: {metric.score}") # 0.667 (2/3 expected tools used)
Agent evaluation should weight task completion most heavily, since a correct answer through an inefficient path is better than an efficient trajectory with a wrong answer. However, trajectory quality matters for cost, latency, and safety. In production, an agent that consistently takes extra steps will cost more and may expose the system to more failure points. Evaluate both dimensions and set acceptable thresholds for each.
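One way to operationalize this weighting is a composite score over the trajectory metrics. The function below is a sketch: the weights (0.6 / 0.25 / 0.15) are illustrative assumptions that should be tuned, and re-validated, per application.

```python
def composite_agent_score(task_completed: bool, tool_accuracy: float,
                          trajectory_efficiency: float,
                          w_task: float = 0.6, w_tools: float = 0.25,
                          w_eff: float = 0.15) -> float:
    """Weighted blend of agent evaluation dimensions.

    Task completion dominates, but tool accuracy and trajectory
    efficiency still move the score. Weights are illustrative
    assumptions and must sum to 1.
    """
    assert abs(w_task + w_tools + w_eff - 1.0) < 1e-9
    return round(w_task * float(task_completed)
                 + w_tools * tool_accuracy
                 + w_eff * trajectory_efficiency, 3)

# Correct answer via an inefficient path scores well, but not perfectly
print(composite_agent_score(True, tool_accuracy=0.667,
                            trajectory_efficiency=0.333))  # 0.817
```

A wrong answer caps the score at 0.4 even with a flawless trajectory, encoding the principle that task completion matters most.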
4. Evaluation Frameworks Comparison
| Framework | Focus | Key Metrics | Strengths | Limitations |
|---|---|---|---|---|
| Ragas | RAG evaluation | Faithfulness, relevancy, context precision/recall | Most comprehensive RAG metrics; good HF integration | Relies on LLM judge; can be slow |
| DeepEval | General LLM testing | Hallucination, bias, toxicity, custom metrics | pytest integration; CI/CD friendly | Less RAG-specific depth than Ragas |
| Phoenix (Arize) | Observability + eval | Trace-level metrics, embedding analysis | Visual UI; traces + evals combined | Heavier infrastructure requirement |
| TruLens | Feedback functions | Groundedness, relevance, custom feedback | Modular feedback system; provider-agnostic | Smaller community than alternatives |
| promptfoo | Prompt testing | Assertion-based, custom evals | CLI-first; fast iteration; CI/CD native | Less suited for complex agent evaluation |
Launch a local Phoenix session to trace and evaluate LLM calls with an interactive UI.
# pip install arize-phoenix openai
import phoenix as px
from phoenix.trace.openai import OpenAIInstrumentor
# Launch local Phoenix UI at http://localhost:6006
session = px.launch_app()
# Auto-instrument all OpenAI calls for tracing
OpenAIInstrumentor().instrument()
from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain RAG in one sentence."}],
)
print(response.choices[0].message.content)
# View the traced call at http://localhost:6006
# pytest-style RAG quality test using DeepEval metrics
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
FaithfulnessMetric,
AnswerRelevancyMetric,
HallucinationMetric,
)
def test_rag_with_deepeval():
    """Example: Testing RAG output quality with DeepEval."""
    docs = [
        "Solar energy is a renewable source of power that reduces "
        "dependence on fossil fuels.",
        "Solar panels require minimal maintenance and can reduce "
        "electricity bills by up to 50%.",
    ]
    test_case = LLMTestCase(
        input="What are the benefits of solar energy?",
        actual_output="Solar energy is renewable, reduces electricity bills, "
        "and has low maintenance costs.",
        retrieval_context=docs,
        context=docs,  # HallucinationMetric evaluates against `context`
    )

    # Define metrics with thresholds
    faithfulness = FaithfulnessMetric(threshold=0.7, model="gpt-4o")
    relevancy = AnswerRelevancyMetric(threshold=0.7, model="gpt-4o")
    hallucination = HallucinationMetric(threshold=0.5, model="gpt-4o")

    # Assert all metrics pass (integrates with pytest)
    assert_test(test_case, [faithfulness, relevancy, hallucination])
# Run with: pytest test_rag.py -v
All framework-computed metrics that rely on LLM judges inherit the biases and limitations of the judge model. Faithfulness scores can be unreliable when the context is ambiguous or when the judge model hallucinates its own assessment. Always validate framework metrics against a set of human-annotated examples before trusting them for production decisions. Establish the correlation between the automated metric and human judgment on your specific data.
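Establishing that correlation can be as simple as scoring a shared sample both ways and computing the agreement. The sketch below uses a hand-rolled Pearson correlation over hypothetical paired scores (the score lists are invented for illustration); in practice you would use Spearman rank correlation from scipy and a few hundred annotated examples.

```python
def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical: automated faithfulness scores vs. 1-5 human grades
auto_scores = [0.95, 0.40, 0.80, 0.20, 0.70, 0.55]
human_grades = [5, 2, 4, 1, 4, 3]

r = pearson(auto_scores, human_grades)
print(f"correlation with human judgment: {r:.2f}")
```

If the correlation on your annotated sample is low, the automated metric should not gate deployment decisions, no matter how convenient it is to compute.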
1. What is the difference between faithfulness and answer correctness in RAG evaluation?
2. Why is trajectory evaluation important for agents, even when the final answer is correct?
3. When would you choose DeepEval over Ragas for RAG evaluation?
4. How does context precision differ from context recall, and which requires ground truth?
5. What is a practical strategy for validating that an LLM-based evaluation metric correlates with human judgment?
Who: ML platform team at an enterprise software company building an internal knowledge assistant
Situation: The RAG-based assistant answered questions about internal documentation, but users reported that 30% of answers were inaccurate. End-to-end accuracy metrics confirmed the problem but did not reveal whether retrieval or generation was at fault.
Problem: Without component-level metrics, the team could not decide whether to invest in better embeddings, improved chunking, or generation prompt tuning. Each path required different expertise and weeks of effort.
Dilemma: Improving retrieval would not help if the generator was hallucinating beyond the retrieved context. Improving generation prompts would not help if the retriever was fetching irrelevant documents.
Decision: The team deployed Ragas to compute separate retrieval metrics (context precision, context recall) and generation metrics (faithfulness, answer relevancy) on a golden test set of 200 questions with verified answers and source documents.
How: They created the golden set by having domain experts write questions, identify the correct source passages, and write reference answers. Ragas scored each component independently. They discovered context recall was 0.92 (retrieval was finding the right documents) but faithfulness was only 0.61 (the generator was adding information not present in the retrieved context).
Result: Armed with this diagnosis, the team focused exclusively on generation: adding "only use information from the provided context" instructions and implementing citation requirements. Faithfulness improved from 0.61 to 0.89, and user-reported inaccuracy dropped from 30% to 8%, all without changing the retrieval pipeline.
Lesson: Component-level RAG evaluation prevents wasted effort by pinpointing exactly where in the pipeline failures originate, enabling targeted fixes instead of guesswork.
LLM Metacognition and Calibration
A critical but often overlooked dimension of LLM evaluation is calibration: whether a model's expressed confidence matches its actual accuracy. A well-calibrated model that claims 80% confidence should be correct roughly 80% of the time. In practice, LLMs are frequently overconfident, particularly after RLHF alignment training, which rewards fluent, assertive responses and penalizes hedging. This creates a dangerous pattern where the model sounds maximally confident regardless of whether the answer is correct.
Two distinct approaches exist for measuring model uncertainty. Token-level logprobs provide the model's raw probability distribution over next tokens, offering a mathematical measure of confidence. Verbalized uncertainty ("I'm not sure, but...") reflects the model's trained ability to express doubt in natural language. These two signals can diverge: a model may produce high-confidence logprobs while verbally hedging, or vice versa. Neither is perfectly reliable on its own, but combining them provides a richer picture of model confidence.
Practical applications of calibration assessment include selective prediction (abstaining from answering when confidence is low rather than guessing), human-in-the-loop triggers (automatically escalating uncertain responses to human reviewers), and "I don't know" detection (training models to recognize the boundaries of their knowledge). For RAG systems, calibration is especially important: a model should express lower confidence when retrieved context is sparse or contradictory. Miscalibration has direct safety implications, as discussed in Section 32.3, because overconfident wrong answers in high-stakes domains (medical, legal, financial) can cause real harm.
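Calibration can be quantified with expected calibration error (ECE): bucket predictions by stated confidence and compare each bucket's average confidence against its empirical accuracy. A minimal sketch over hypothetical (confidence, correct) pairs:

```python
def expected_calibration_error(preds: list[tuple[float, bool]],
                               n_bins: int = 5) -> float:
    """ECE: bin-weighted mean of |avg confidence - accuracy|.

    preds is a list of (stated_confidence, was_correct) pairs,
    with confidences in [0, 1].
    """
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, correct in preds:
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, correct))
    n = len(preds)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        ece += (len(bucket) / n) * abs(avg_conf - accuracy)
    return ece

# Hypothetical overconfident model: 90% stated confidence, 60% correct
preds = [(0.9, True), (0.9, True), (0.9, True), (0.9, False), (0.9, False)]
print(round(expected_calibration_error(preds), 3))  # 0.3
```

An ECE of 0 means the model's stated confidence exactly tracks its accuracy; the 0.3 here is the 30-point gap between saying 90% and being right 60% of the time.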
For tasks without clear right/wrong answers (summarization, creative writing), use a strong model (GPT-4 class) as an automated judge with a detailed rubric. This scales better than human evaluation while correlating well with human preferences when the rubric is specific.
- Evaluate RAG at component level, not just end-to-end. Use retrieval metrics (context precision, context recall) and generation metrics (faithfulness, answer relevancy) to isolate where failures occur and guide targeted improvements.
- Faithfulness is the most critical RAG metric. A RAG system that generates unfaithful answers (hallucinating beyond context) undermines the entire purpose of retrieval-augmented generation. Monitor faithfulness continuously in production.
- Agent evaluation requires trajectory analysis. Assess tool accuracy, parameter correctness, and trajectory efficiency alongside task completion. An agent that completes tasks through unsafe or wasteful paths is a production liability.
- Choose frameworks based on your primary need. Ragas excels at RAG-specific metrics, DeepEval integrates with testing pipelines, Phoenix combines observability with evaluation, and promptfoo enables rapid prompt iteration.
- Validate automated metrics against human judgment. LLM-based evaluation metrics inherit judge model biases. Always establish correlation with human annotations before trusting automated scores for deployment decisions.
Open Questions in RAG and Agent Evaluation (2024-2026):
- Multi-hop RAG evaluation: Evaluating RAG systems that synthesize information across multiple retrieved documents remains challenging. Current metrics assess single-document faithfulness well but struggle with cross-document reasoning quality.
- Agent safety evaluation: How do you evaluate whether an agent will behave safely in novel situations it was not tested on? Formal verification approaches from software engineering are being adapted for agent trajectory analysis.
- Long-context RAG evaluation: As context windows grow to 1M+ tokens, evaluating whether models actually use all retrieved information (rather than relying on the first few chunks) requires new evaluation methodologies like the "needle in a haystack" test family.
Explore Further: Build a component-level evaluation pipeline for a simple RAG system using Ragas, then deliberately introduce retrieval and generation failures to see which metrics detect each failure mode.
Exercises
Explain why evaluating only the final answer of a RAG system is insufficient. Describe the four RAGAS metrics (faithfulness, answer relevancy, context precision, context recall) and what each measures.
Answer Sketch
End-to-end evaluation cannot distinguish retrieval failures from generation failures. Faithfulness measures whether the answer is supported by the retrieved context (catches hallucination). Answer relevancy measures whether the answer addresses the question. Context precision measures how many of the retrieved chunks are actually relevant. Context recall measures how much of the required information was retrieved. Together they pinpoint which component needs improvement.
An agent is asked to find the weather in Paris. It calls a web search tool, then a calculator tool, then the weather API. The final answer is correct. Should this trajectory receive full marks? Justify your answer with evaluation criteria.
Answer Sketch
No. While the final answer is correct, the trajectory is inefficient: the web search and calculator calls were unnecessary. Trajectory evaluation should score efficiency (minimum number of tool calls), correctness of tool selection (weather API was the only relevant tool), parameter accuracy (correct city name), and safety (no unauthorized actions). A correct final answer via an inefficient path should receive partial credit.
Write a Python function that uses an LLM to compute a faithfulness score. Given a context string and an answer string, the function should decompose the answer into individual claims, check each claim against the context, and return the fraction of claims supported by the context.
Answer Sketch
Step 1: Prompt an LLM to decompose the answer into atomic claims (one fact per claim). Step 2: For each claim, prompt the LLM with the context and ask "Is this claim supported by the context? Answer YES or NO." Step 3: Count the number of YES responses and divide by the total number of claims. This mirrors the RAGAS faithfulness pipeline. Handle edge cases where the answer contains no verifiable claims.
Your RAG system retrieves 5 chunks per query. On average, 2 of the 5 are relevant. What is the context precision? If the relevant chunks are needed to answer the question but only 60% of necessary information is retrieved, what is the context recall? Propose two changes to improve each metric.
Answer Sketch
Context precision is 2/5 = 40%. Context recall is 60%. To improve precision: (1) reduce the number of retrieved chunks (top-3 instead of top-5), (2) add a reranking step that scores retrieved chunks before passing them to the generator. To improve recall: (1) use hybrid search (dense + sparse retrieval), (2) apply query expansion to capture more relevant documents, (3) use smaller, more focused chunks so each chunk is more likely to contain the needed information.
Design an evaluation harness for a RAG system that computes retrieval metrics (precision@k, recall@k) and generation metrics (faithfulness, answer relevancy) on a labeled dataset of (query, relevant_doc_ids, gold_answer) triples. Outline the code structure and explain how you would report the results.
Answer Sketch
Create a pipeline that (1) runs each query through the retriever, (2) compares retrieved doc IDs to ground truth for precision@k and recall@k, (3) passes retrieved context and query to the generator, (4) scores the generated answer for faithfulness (claim verification against context) and answer relevancy (semantic similarity to the query). Report all metrics with bootstrap confidence intervals. Include a breakdown by query category if available.
What Comes Next
In the next section, Section 29.4: Testing LLM Applications, we examine testing practices for LLM applications, including unit tests, integration tests, and regression suites.
Bibliography
Es, S., James, J., Espinosa-Anke, L., & Schockaert, S. (2024). "Ragas: Automated Evaluation of Retrieval Augmented Generation." arXiv:2309.15217
Saad-Falcon, J., Khattab, O., Potts, C., & Zaharia, M. (2024). "ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems." arXiv:2311.09476
Arize AI. (2024). "Phoenix: ML Observability in a Notebook." https://github.com/Arize-ai/phoenix
Confident AI. (2024). "DeepEval: The Open-Source LLM Evaluation Framework." https://github.com/confident-ai/deepeval
promptfoo. (2024). "promptfoo: Test Your LLM App." https://www.promptfoo.dev/
