Optimization without evaluation is guesswork. DSPy provides a built-in evaluation framework that lets you measure module performance against labeled datasets using custom metrics. Good metrics drive good optimization: they tell the optimizer what "better" means for your specific task. This section covers the dspy.Evaluate class, writing custom metrics, assertion-based evaluation, and strategies for bootstrapping validation sets when labeled data is scarce.
1. The dspy.Evaluate Class
The Evaluate class runs your compiled (or uncompiled) module against a development set and reports aggregate metrics. It handles parallelism, error handling, and result aggregation.
import dspy
from dspy.evaluate import Evaluate

# Prepare a development set
devset = [
    dspy.Example(
        question="What is the capital of Japan?",
        answer="Tokyo",
    ).with_inputs("question"),
    dspy.Example(
        question="Who wrote Hamlet?",
        answer="William Shakespeare",
    ).with_inputs("question"),
    # ... more examples
]

# Define a metric
def exact_match(example, prediction, trace=None):
    return prediction.answer.strip().lower() == example.answer.strip().lower()

# Run evaluation
evaluator = Evaluate(
    devset=devset,
    metric=exact_match,
    num_threads=4,
    display_progress=True,
    display_table=5,  # Show first 5 results in a table
)

# Evaluate a module
qa = dspy.ChainOfThought("question -> answer")
score = evaluator(qa)  # Evaluate reports a percentage, e.g. 78.0
print(f"Accuracy: {score:.1f}%")
The display_table parameter outputs a formatted table showing individual predictions alongside ground truth, making it easy to identify failure patterns.
2. Writing Custom Metrics
A metric function receives an example (with ground truth) and a prediction (from your module), and returns a score. The score can be boolean (pass/fail) or numeric (0.0 to 1.0 for partial credit).
# Boolean metric: exact match
def exact_match(example, prediction, trace=None):
    return prediction.answer.strip().lower() == example.answer.strip().lower()

# Numeric metric: token-level F1 score
def token_f1(example, prediction, trace=None):
    pred_tokens = set(prediction.answer.lower().split())
    gold_tokens = set(example.answer.lower().split())
    if not pred_tokens or not gold_tokens:
        return 0.0
    common = pred_tokens & gold_tokens
    precision = len(common) / len(pred_tokens)
    recall = len(common) / len(gold_tokens)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Composite metric: combine multiple criteria
def quality_metric(example, prediction, trace=None):
    accuracy = exact_match(example, prediction)
    has_reasoning = len(getattr(prediction, "rationale", "")) > 20
    is_concise = len(prediction.answer.split()) < 50
    # Weighted combination
    return 0.6 * accuracy + 0.2 * has_reasoning + 0.2 * is_concise
The trace parameter in metric functions is used by optimizers during bootstrapping. When trace is not None, the optimizer is collecting successful demonstrations. You can use this to enforce stricter criteria during optimization than during evaluation. For example, require both correct answers and good reasoning during bootstrapping, but accept correct answers alone during evaluation.
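As a sketch of that pattern (strict_when_bootstrapping is a hypothetical metric name, and the 20-character rationale threshold is illustrative), a single metric can branch on trace:

```python
def strict_when_bootstrapping(example, prediction, trace=None):
    """Stricter pass/fail during optimization, lenient during evaluation."""
    correct = prediction.answer.strip().lower() == example.answer.strip().lower()
    if trace is not None:
        # The optimizer is collecting demonstrations: also require a
        # substantive rationale before accepting this trace.
        rationale = getattr(prediction, "rationale", "") or ""
        return correct and len(rationale) > 20
    # Plain evaluation: correctness alone is enough.
    return correct
```

With this shape, the demonstrations the optimizer bootstraps all carry good reasoning, while evaluation scores remain comparable to a plain exact-match metric.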
3. LLM-as-Judge Metrics
For open-ended tasks (summarization, creative writing, explanations), exact match is too strict. You can use an LLM as a judge to evaluate quality.
class AssessQuality(dspy.Signature):
    """Assess the quality of an answer on a scale of 1-5."""

    question: str = dspy.InputField()
    gold_answer: str = dspy.InputField(desc="Reference answer")
    predicted_answer: str = dspy.InputField(desc="Generated answer")
    score: int = dspy.OutputField(desc="Quality score from 1 (poor) to 5 (excellent)")
    justification: str = dspy.OutputField(desc="Brief explanation of the score")

judge = dspy.ChainOfThought(AssessQuality)

def llm_judge_metric(example, prediction, trace=None):
    result = judge(
        question=example.question,
        gold_answer=example.answer,
        predicted_answer=prediction.answer,
    )
    # Normalize to 0-1 range
    return (int(result.score) - 1) / 4.0

# Use in evaluation
evaluator = Evaluate(devset=devset, metric=llm_judge_metric, num_threads=4)
score = evaluator(qa)  # Average of the 0-1 scores, reported as a percentage
print(f"Average quality: {score:.1f}%")
LLM judges add latency and cost to evaluation. Each example requires an additional LLM call for judging. Use LLM judges for final evaluation and selection, not for rapid iteration during development. During development, prefer cheaper heuristic metrics and switch to LLM judging for final quality gates.
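When the same development set is evaluated repeatedly during iteration, one way to blunt the judge cost is to memoize judge calls per (question, answer) pair. The wrapper below is a sketch, not part of DSPy; cached_judge is a hypothetical helper name:

```python
def cached_judge(judge_fn):
    """Memoize LLM-judge calls so re-evaluating an unchanged prediction
    for the same question costs only one judge call."""
    cache = {}

    def metric(example, prediction, trace=None):
        key = (example.question, prediction.answer)
        if key not in cache:
            cache[key] = judge_fn(example, prediction, trace)
        return cache[key]

    return metric
```

Wrapping llm_judge_metric this way (`metric=cached_judge(llm_judge_metric)`) is safe because the judge's score for an identical input pair should not change between runs; the cache simply skips redundant calls.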
4. Assertion-Based Evaluation
For structured outputs, you can write assertion-based metrics that check specific properties of the prediction. This is particularly useful when the output must conform to a schema or satisfy business rules.
import json

def json_validity_metric(example, prediction, trace=None):
    """Check that the output is valid JSON with required fields."""
    try:
        data = json.loads(prediction.output)
    except json.JSONDecodeError:
        return 0.0
    required_fields = {"title", "summary", "category"}
    present_fields = set(data.keys()) & required_fields
    field_score = len(present_fields) / len(required_fields)
    # Check field value constraints
    valid_categories = {"tech", "science", "business", "sports"}
    category_valid = data.get("category", "") in valid_categories
    return 0.7 * field_score + 0.3 * float(category_valid)
Assertion-based metrics are deterministic and fast to compute. They complement LLM-judge metrics: use assertions for structural correctness and LLM judges for semantic quality.
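One way to combine the two (a sketch; gated_metric is a hypothetical combinator, and llm_judge_metric above would fit the semantic slot) is to use the assertion score as a gate, so a structurally broken output never spends a judge call:

```python
def gated_metric(structural, semantic):
    """Combine a cheap structural metric with an expensive semantic one.
    A malformed output scores 0 outright; well-formed outputs get the
    structural score scaled by the semantic score."""
    def metric(example, prediction, trace=None):
        structure_score = structural(example, prediction, trace)
        if structure_score == 0.0:
            return 0.0  # skip the judge for broken output
        return structure_score * semantic(example, prediction, trace)
    return metric
```

Multiplying the two scores is one design choice among several; a weighted sum also works, but multiplication guarantees that neither structural nor semantic failure can be fully compensated by the other dimension.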
5. Bootstrapping Validation Sets
When labeled data is scarce, you can bootstrap a validation set using a strong model to generate reference answers, then have a human review and correct them.
# Generate candidate answers using a strong model
strong_lm = dspy.LM("openai/gpt-4o")

def bootstrap_valset(questions: list[str], k: int = 50) -> list[dspy.Example]:
    """Generate a validation set using a strong model."""
    with dspy.context(lm=strong_lm):
        qa = dspy.ChainOfThought("question -> answer")
        examples = []
        for q in questions[:k]:
            result = qa(question=q)
            examples.append(
                dspy.Example(
                    question=q,
                    answer=result.answer,
                ).with_inputs("question")
            )
    return examples

# Generate, then manually review
valset = bootstrap_valset(my_questions)

# Save for human review
import json

with open("valset_for_review.jsonl", "w") as f:
    for ex in valset:
        f.write(json.dumps({"question": ex.question, "answer": ex.answer}) + "\n")

# After human review, load the corrected set
# and use it for evaluation and optimization
A small, high-quality validation set (30 to 50 carefully reviewed examples) is more valuable than a large, noisy one. Focus your human review effort on ambiguous or borderline cases. The optimizer learns more from hard examples than from easy ones.
6. Comparing Modules and Configurations
Evaluation is most useful when comparing alternatives. Run the same evaluation across different module architectures, optimizers, or models to make data-driven decisions.
# Compare different module architectures
modules = {
    "Predict": dspy.Predict("question -> answer"),
    "CoT": dspy.ChainOfThought("question -> answer"),
    "PoT": dspy.ProgramOfThought("question -> answer"),
}

evaluator = Evaluate(devset=devset, metric=exact_match, num_threads=4)

results = {}
for name, module in modules.items():
    score = evaluator(module)
    results[name] = score
    print(f"{name}: {score:.1f}%")

# Output:
# Predict: 62.0%
# CoT: 78.0%
# PoT: 85.0%
This systematic comparison replaces gut feelings with data. When a stakeholder asks "why did you choose ChainOfThought?", you can point to evaluation results showing a 16-point accuracy improvement over basic prediction.
7. Continuous Evaluation in Production
Evaluation should not stop at deployment. Monitor your module's performance on production traffic to detect drift, model degradation, or changes in user behavior.
import logging
from datetime import datetime

logger = logging.getLogger("dspy_monitor")

def monitored_predict(module, **kwargs):
    """Wrap a module call with production monitoring."""
    start = datetime.now()
    result = module(**kwargs)
    latency = (datetime.now() - start).total_seconds()
    # Log for monitoring. Note: a "module" key in extra would collide
    # with the built-in LogRecord attribute, so use "module_name".
    logger.info(
        "prediction",
        extra={
            "module_name": type(module).__name__,
            "inputs": kwargs,
            "output": str(result),
            "latency_seconds": latency,
            "timestamp": start.isoformat(),
        },
    )
    return result

# Periodically sample predictions and evaluate
# against human annotations
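That periodic sampling step can be sketched as follows. This assumes the monitoring logs land in a JSONL file (one JSON object per line); sample_for_review is a hypothetical helper, not a DSPy API:

```python
import json
import random

def sample_for_review(log_path: str, k: int = 20, seed: int = 0) -> list[dict]:
    """Draw a reproducible random sample of logged predictions
    for human annotation."""
    with open(log_path) as f:
        records = [json.loads(line) for line in f if line.strip()]
    # A fixed seed makes the sample reproducible across runs,
    # so annotators and engineers look at the same examples.
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))
```

The annotated samples can then be turned back into dspy.Example objects and scored with the same metric used during optimization, closing the loop between production traffic and evaluation.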
Production monitoring creates a feedback loop: flagged predictions become new training examples, which drive re-optimization, which improves future predictions. This continuous improvement cycle is one of DSPy's strongest value propositions for production systems.