Building Conversational AI with LLMs and Agents
Appendix Q: DSPy: Programmatic Prompt Optimization

Evaluation and Metrics

Big Picture

Optimization without evaluation is guesswork. DSPy provides a built-in evaluation framework that lets you measure module performance against labeled datasets using custom metrics. Good metrics drive good optimization: they tell the optimizer what "better" means for your specific task. This section covers the dspy.Evaluate class, writing custom metrics, assertion-based evaluation, and strategies for bootstrapping validation sets when labeled data is scarce.

1. The dspy.Evaluate Class

The Evaluate class runs your compiled (or uncompiled) module against a development set and reports aggregate metrics. It handles parallelism, error handling, and result aggregation.

import dspy
from dspy.evaluate import Evaluate

# Prepare a development set
devset = [
    dspy.Example(
        question="What is the capital of Japan?",
        answer="Tokyo",
    ).with_inputs("question"),
    dspy.Example(
        question="Who wrote Hamlet?",
        answer="William Shakespeare",
    ).with_inputs("question"),
    # ... more examples
]

# Define a metric
def exact_match(example, prediction, trace=None):
    return prediction.answer.strip().lower() == example.answer.strip().lower()

# Run evaluation
evaluator = Evaluate(
    devset=devset,
    metric=exact_match,
    num_threads=4,
    display_progress=True,
    display_table=5,  # Show first 5 results in a table
)

# Evaluate a module
qa = dspy.ChainOfThought("question -> answer")
score = evaluator(qa)  # Evaluate reports the mean metric score as a percentage, e.g. 84.0
print(f"Accuracy: {score:.1f}%")

The display_table parameter outputs a formatted table showing individual predictions alongside ground truth, making it easy to identify failure patterns.
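Conceptually, what Evaluate does reduces to a loop that averages a metric over the dev set. A minimal dependency-free sketch (plain dicts and callables stand in for dspy.Example, the module, and the metric; Evaluate itself also adds threading and error handling):

```python
def simple_evaluate(module, devset, metric):
    """Average a metric over a development set (sequential, no threading)."""
    scores = [float(metric(ex, module(ex["question"]))) for ex in devset]
    return sum(scores) / len(scores) if scores else 0.0

# Toy stand-ins for a real module and metric
def toy_module(question):
    return {"answer": "Tokyo"} if "Japan" in question else {"answer": "unknown"}

def toy_metric(example, prediction):
    return example["answer"].lower() == prediction["answer"].lower()

devset = [
    {"question": "What is the capital of Japan?", "answer": "Tokyo"},
    {"question": "Who wrote Hamlet?", "answer": "William Shakespeare"},
]
print(simple_evaluate(toy_module, devset, toy_metric))  # 0.5
```

The real class parallelizes this loop across num_threads workers, which matters once each module call involves one or more LLM requests.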

2. Writing Custom Metrics

A metric function receives an example (with ground truth) and a prediction (from your module), and returns a score. The score can be boolean (pass/fail) or numeric (0.0 to 1.0 for partial credit).

# Boolean metric: exact match
def exact_match(example, prediction, trace=None):
    return prediction.answer.strip().lower() == example.answer.strip().lower()

# Numeric metric: token-level F1 score
def token_f1(example, prediction, trace=None):
    pred_tokens = set(prediction.answer.lower().split())
    gold_tokens = set(example.answer.lower().split())

    if not pred_tokens or not gold_tokens:
        return 0.0

    common = pred_tokens & gold_tokens
    precision = len(common) / len(pred_tokens)
    recall = len(common) / len(gold_tokens)

    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Composite metric: combine multiple criteria
def quality_metric(example, prediction, trace=None):
    accuracy = exact_match(example, prediction)
    has_reasoning = len(getattr(prediction, "rationale", "")) > 20
    is_concise = len(prediction.answer.split()) < 50

    # Weighted combination
    return 0.6 * accuracy + 0.2 * has_reasoning + 0.2 * is_concise
Key Insight

The trace parameter in metric functions is used by optimizers during bootstrapping. When trace is not None, the optimizer is collecting successful demonstrations. You can use this to enforce stricter criteria during optimization than during evaluation. For example, require both correct answers and good reasoning during bootstrapping, but accept correct answers alone during evaluation.
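That dual-criteria pattern can be sketched as follows (plain dicts stand in for the example and prediction objects; the 20-character rationale threshold is an illustrative choice, not a DSPy convention):

```python
def bootstrap_aware_metric(example, prediction, trace=None):
    """Stricter during bootstrapping (trace is not None) than during evaluation."""
    correct = prediction["answer"].strip().lower() == example["answer"].strip().lower()
    if trace is not None:
        # Bootstrapping: only admit demonstrations that also show their work
        return correct and len(prediction.get("rationale", "")) > 20
    # Plain evaluation: a correct answer is enough
    return correct

ex = {"answer": "Tokyo"}
pred = {"answer": "Tokyo", "rationale": "Short"}

print(bootstrap_aware_metric(ex, pred))            # True  (evaluation mode)
print(bootstrap_aware_metric(ex, pred, trace=[]))  # False (rationale too short)
```

The effect is that only high-quality demonstrations enter the optimizer's few-shot pool, while evaluation still credits every correct answer.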

3. LLM-as-Judge Metrics

For open-ended tasks (summarization, creative writing, explanations), exact match is too strict. You can use an LLM as a judge to evaluate quality.

class AssessQuality(dspy.Signature):
    """Assess the quality of an answer on a scale of 1-5."""
    question: str = dspy.InputField()
    gold_answer: str = dspy.InputField(desc="Reference answer")
    predicted_answer: str = dspy.InputField(desc="Generated answer")
    score: int = dspy.OutputField(desc="Quality score from 1 (poor) to 5 (excellent)")
    justification: str = dspy.OutputField(desc="Brief explanation of the score")

judge = dspy.ChainOfThought(AssessQuality)

def llm_judge_metric(example, prediction, trace=None):
    result = judge(
        question=example.question,
        gold_answer=example.answer,
        predicted_answer=prediction.answer,
    )
    # Normalize to 0-1 range
    return (int(result.score) - 1) / 4.0

# Use in evaluation
evaluator = Evaluate(devset=devset, metric=llm_judge_metric, num_threads=4)
score = evaluator(qa)  # mean judge score, reported as a percentage
print(f"Average quality: {score:.1f}%")
Warning

LLM judges add latency and cost to evaluation. Each example requires an additional LLM call for judging. Use LLM judges for final evaluation and selection, not for rapid iteration during development. During development, prefer cheaper heuristic metrics and switch to LLM judging for final quality gates.

4. Assertion-Based Evaluation

For structured outputs, you can write assertion-based metrics that check specific properties of the prediction. This is particularly useful when the output must conform to a schema or satisfy business rules.

import json

def json_validity_metric(example, prediction, trace=None):
    """Check that the output is valid JSON with required fields."""
    try:
        data = json.loads(prediction.output)
    except json.JSONDecodeError:
        return 0.0

    required_fields = {"title", "summary", "category"}
    present_fields = set(data.keys()) & required_fields
    field_score = len(present_fields) / len(required_fields)

    # Check field value constraints
    valid_categories = {"tech", "science", "business", "sports"}
    category_valid = data.get("category", "") in valid_categories

    return 0.7 * field_score + 0.3 * float(category_valid)

Assertion-based metrics are deterministic and fast to compute. They complement LLM-judge metrics: use assertions for structural correctness and LLM judges for semantic quality.
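A common way to combine them is a gate: check structure first, and spend the (expensive) semantic check only on structurally valid outputs. A sketch, where semantic_fn stands in for any semantic scorer such as an LLM judge (the gate-then-score composition is an illustrative pattern, not a DSPy API):

```python
import json

def gated_metric(example, prediction, semantic_fn, trace=None):
    """Fail fast on invalid structure; score semantics only for valid JSON."""
    try:
        json.loads(prediction["output"])
    except json.JSONDecodeError:
        return 0.0  # structural gate: invalid JSON fails outright
    return semantic_fn(example, prediction)

always_good = lambda ex, pred: 1.0  # toy semantic scorer for illustration

ex = {}
print(gated_metric(ex, {"output": '{"title": "A"}'}, always_good))  # 1.0
print(gated_metric(ex, {"output": "not json"}, always_good))        # 0.0
```

Besides saving judge calls, the gate keeps the optimizer from bootstrapping demonstrations that look semantically plausible but would break downstream parsers.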

5. Bootstrapping Validation Sets

When labeled data is scarce, you can bootstrap a validation set using a strong model to generate reference answers, then have a human review and correct them.

# Generate candidate answers using a strong model
strong_lm = dspy.LM("openai/gpt-4o")

def bootstrap_valset(questions: list[str], k: int = 50) -> list[dspy.Example]:
    """Generate a validation set using a strong model."""
    with dspy.context(lm=strong_lm):
        qa = dspy.ChainOfThought("question -> answer")
        examples = []

        for q in questions[:k]:
            result = qa(question=q)
            examples.append(
                dspy.Example(
                    question=q,
                    answer=result.answer,
                ).with_inputs("question")
            )

    return examples

# Generate, then manually review
valset = bootstrap_valset(my_questions)

# Save for human review
import json
with open("valset_for_review.jsonl", "w") as f:
    for ex in valset:
        f.write(json.dumps({"question": ex.question, "answer": ex.answer}) + "\n")

# After human review, load the corrected set
# and use it for evaluation and optimization
Tip

A small, high-quality validation set (30 to 50 carefully reviewed examples) is more valuable than a large, noisy one. Focus your human review effort on ambiguous or borderline cases. The optimizer learns more from hard examples than from easy ones.
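One cheap way to surface those hard cases before review is to flag questions on which candidate models disagree. A dependency-free sketch (the models here are any callables returning an answer string; in practice they would be DSPy modules run under different LMs):

```python
def disagreement_cases(questions, models):
    """Return questions where the candidate models give different answers."""
    flagged = []
    for q in questions:
        answers = {model(q).strip().lower() for model in models}
        if len(answers) > 1:  # normalized answers differ -> likely a hard case
            flagged.append(q)
    return flagged

# Toy models that agree on one question and disagree on the other
model_a = lambda q: "Tokyo" if "Japan" in q else "Shakespeare"
model_b = lambda q: "Tokyo" if "Japan" in q else "Marlowe"

qs = ["Capital of Japan?", "Who wrote Hamlet?"]
print(disagreement_cases(qs, [model_a, model_b]))  # ['Who wrote Hamlet?']
```

Routing reviewer attention to the flagged subset concentrates human effort exactly where the Tip above says it matters most.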

6. Comparing Modules and Configurations

Evaluation is most useful when comparing alternatives. Run the same evaluation across different module architectures, optimizers, or models to make data-driven decisions.

# Compare different module architectures
modules = {
    "Predict": dspy.Predict("question -> answer"),
    "CoT": dspy.ChainOfThought("question -> answer"),
    "PoT": dspy.ProgramOfThought("question -> answer"),
}

evaluator = Evaluate(devset=devset, metric=exact_match, num_threads=4)

results = {}
for name, module in modules.items():
    score = evaluator(module)
    results[name] = score
    print(f"{name}: {score:.1%}")

# Output:
# Predict: 62.0%
# CoT: 78.0%
# PoT: 85.0%

This systematic comparison replaces gut feelings with data. When a stakeholder asks "why did you choose ChainOfThought?", you can point to evaluation results showing a 16-point accuracy improvement over basic prediction.
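Aggregate scores say which module wins; a per-example diff says where. A small helper for that comparison (plain dicts and callables again stand in for DSPy objects):

```python
def diff_modules(devset, metric, module_a, module_b):
    """Return examples where the two modules score differently under the metric."""
    diffs = []
    for ex in devset:
        score_a = float(metric(ex, module_a(ex["question"])))
        score_b = float(metric(ex, module_b(ex["question"])))
        if score_a != score_b:
            diffs.append({"question": ex["question"], "a": score_a, "b": score_b})
    return diffs

# Toy setup: module_a answers correctly, module_b does not
metric = lambda ex, pred: ex["answer"] == pred["answer"]
mod_a = lambda q: {"answer": "Tokyo"}
mod_b = lambda q: {"answer": "Kyoto"}

devset = [{"question": "Capital of Japan?", "answer": "Tokyo"}]
print(diff_modules(devset, metric, mod_a, mod_b))
# [{'question': 'Capital of Japan?', 'a': 1.0, 'b': 0.0}]
```

Inspecting the diff set often reveals whether a cheaper module fails only on a narrow slice of inputs that could be routed to the stronger one.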

7. Continuous Evaluation in Production

Evaluation should not stop at deployment. Monitor your module's performance on production traffic to detect drift, model degradation, or changes in user behavior.

import logging
from datetime import datetime

logger = logging.getLogger("dspy_monitor")

def monitored_predict(module, **kwargs):
    """Wrap a module call with production monitoring."""
    start = datetime.now()
    result = module(**kwargs)
    latency = (datetime.now() - start).total_seconds()

    # Log for monitoring
    logger.info(
        "prediction",
        extra={
            "module": type(module).__name__,
            "inputs": kwargs,
            "output": str(result),
            "latency_seconds": latency,
            "timestamp": start.isoformat(),
        },
    )

    return result

# Periodically sample predictions and evaluate
# against human annotations
Sample log output:

INFO:dspy_monitor:prediction module=ChainOfThought latency_seconds=1.23 timestamp=2025-06-15T14:32:01 inputs={'question': 'What is RAG?'} output=Retrieval-augmented generation combines retrieval with LLM...
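For the periodic sampling step, reservoir sampling gives a uniform random sample of production traffic in one pass with constant memory, without knowing the stream length in advance. A small sketch:

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)  # fill the reservoir with the first k items
        else:
            j = rng.randint(0, i)  # replace an existing slot with probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample

# e.g. sample 25 logged predictions per day for human annotation
predictions = (f"prediction-{i}" for i in range(1000))
batch = reservoir_sample(predictions, k=25)
print(len(batch))  # 25
```

The annotated sample then feeds the same Evaluate workflow used during development, closing the loop between production and evaluation.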

Production monitoring creates a feedback loop: flagged predictions become new training examples, which drive re-optimization, which improves future predictions. This continuous improvement cycle is one of DSPy's strongest value propositions for production systems.