Part VIII: Evaluation & Production
Chapter 29: Evaluation & Experiment Design

Evaluation Harness Ecosystems for Reproducible Research

"A benchmark score without a harness specification is just a number with an opinion."

Eval Eval, Harness-Demanding AI Agent
Big Picture

Evaluation harnesses are the instruments of LLM science. Just as chemistry depends on calibrated instruments for reproducible measurements, LLM research depends on evaluation harnesses to produce benchmark scores that others can verify and compare. Three major open-source harnesses dominate the ecosystem: EleutherAI's lm-evaluation-harness (the original), HuggingFace's lighteval (tightly integrated with the HF Hub), and the UK AI Safety Institute's Inspect AI (designed for structured, auditable evaluations). Each harness makes different design choices about prompt formatting, tokenization, truncation, and scoring, and these choices can shift benchmark results by several percentage points. Understanding these harnesses, their architectures, and their divergence points is essential for anyone publishing or consuming evaluation results.

Prerequisites

Before starting, make sure you are familiar with evaluation metrics and benchmarks from Section 29.1: LLM Evaluation Fundamentals and experiment reproducibility practices from Section 29.7: LLM Experiment Reproducibility.

Evaluation harnesses provide the standardized laboratory equipment that makes benchmark results reproducible and comparable across teams, models, and time.

1. Harness Design Patterns

Every evaluation harness, regardless of its specific implementation, must solve the same core problems: defining tasks, adapting model interfaces, scoring outputs, and presenting results. The architecture of a harness reflects how it decomposes these responsibilities into modular components. Understanding these patterns will help you choose the right harness for your use case, extend existing harnesses with custom tasks, and diagnose discrepancies when the same benchmark produces different scores across harnesses.

Key Insight

The four pillars of a harness. Every evaluation harness is built on four core abstractions: (1) a Task that defines the dataset, prompt template, and expected output format; (2) a Model Adapter that normalizes API differences across providers and local models; (3) a Scorer that computes metrics from predictions and references; and (4) a Log Viewer that enables inspection and debugging of individual examples. When comparing harness results, the differences almost always trace back to divergent implementations in one of these four components.

The task component is responsible for loading dataset splits, constructing prompts (including few-shot examples), and parsing model outputs into a format the scorer can evaluate. Model adapters handle the specifics of different inference backends: a local HuggingFace model requires tokenization and logit extraction, while an API model returns text completions. Scorers range from simple exact-match comparisons to complex rubric-based evaluation with chain-of-thought reasoning. Log viewers vary from plain JSON dumps to interactive web interfaces. Code Fragment 29.9.1 illustrates the abstract pattern that all harnesses share.


# Abstract evaluation harness pattern shared by all major frameworks
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class EvalSample:
    """A single evaluation instance with input, reference, and metadata."""
    input_text: str
    reference: str
    metadata: dict

@dataclass
class EvalResult:
    """Result for a single sample: prediction, score, and trace."""
    sample: EvalSample
    prediction: str
    score: float
    trace: dict  # Debug info: tokens, logprobs, latency

class Task(ABC):
    """Defines dataset loading, prompt construction, and output parsing."""

    @abstractmethod
    def load_samples(self, split: str = "test") -> list[EvalSample]:
        ...

    @abstractmethod
    def format_prompt(self, sample: EvalSample, num_fewshot: int = 0) -> str:
        ...

    @abstractmethod
    def parse_output(self, raw_output: str) -> str:
        ...

class ModelAdapter(ABC):
    """Normalizes inference across local and API models."""

    @abstractmethod
    def generate(self, prompt: str, max_tokens: int = 256) -> str:
        ...

    @abstractmethod
    def loglikelihood(self, context: str, continuation: str) -> float:
        ...

class Scorer(ABC):
    """Computes metrics from predictions and references."""

    @abstractmethod
    def score(self, prediction: str, reference: str) -> float:
        ...

class Harness:
    """Orchestrates task, model, and scorer into an evaluation run."""

    def __init__(self, task: Task, model: ModelAdapter, scorer: Scorer):
        self.task = task
        self.model = model
        self.scorer = scorer

    def evaluate(self, split: str = "test", num_fewshot: int = 0) -> list[EvalResult]:
        samples = self.task.load_samples(split)
        results = []
        for sample in samples:
            prompt = self.task.format_prompt(sample, num_fewshot)
            prediction = self.model.generate(prompt)
            parsed = self.task.parse_output(prediction)
            score = self.scorer.score(parsed, sample.reference)
            results.append(EvalResult(sample, parsed, score, {}))
        return results
Code Fragment 29.9.1: Abstract evaluation harness pattern shared by all major frameworks
Fun Fact

In 2023, researchers discovered that simply changing the number of spaces before multiple-choice answer letters shifted a model's MMLU score by up to 4 percentage points. The model had not become smarter or dumber; the evaluation harness had just formatted the prompt differently. This is why harness interoperability matters: your benchmark score is partly measuring your model and partly measuring your whitespace choices.

This abstract pattern is realized differently in each harness. EleutherAI's lm-evaluation-harness uses YAML-based task definitions and a ConfigurableTask class. Inspect AI uses Python decorators and a typed task specification. lighteval uses a task class hierarchy with HuggingFace datasets integration. The next three subsections examine each harness in detail.

2. Inspect AI (UK AISI)

Inspect AI is the evaluation framework developed by the UK AI Safety Institute (AISI) for structured, auditable LLM evaluation. Its design prioritizes composability, determinism, and traceability. Every evaluation run produces a structured log that records the full chain of task specification, model inputs, model outputs, and scorer decisions. This makes Inspect AI particularly well suited for safety evaluations, compliance audits, and any context where reproducibility and auditability are mandatory requirements.

The framework uses a decorator-based API where tasks are Python functions annotated with @task. Each task returns a Task object composed of a dataset, a solver pipeline (which can include chain-of-thought prompting, tool use, or multi-turn interaction), and one or more scorers. The solver pipeline is a key differentiator: rather than treating evaluation as a single prompt-response cycle, Inspect allows multi-step interactions where the model can reason, use tools, and refine its answer before final scoring.


# Inspect AI: defining and running a custom evaluation task
# Install: pip install inspect-ai
from inspect_ai import Task, task, eval
from inspect_ai.dataset import csv_dataset
from inspect_ai.scorer import model_graded_fact, match
from inspect_ai.solver import chain_of_thought, generate, self_critique

# Define a custom task with a composable solver pipeline
@task
def factual_qa() -> Task:
    """Evaluate factual question answering with chain-of-thought."""
    return Task(
        dataset=csv_dataset("eval_data/factual_qa.csv"),
        solver=[
            chain_of_thought(),  # Step 1: Encourage reasoning
            generate(),          # Step 2: Produce answer
            self_critique(),     # Step 3: Self-review before final answer
        ],
        scorer=model_graded_fact(),  # LLM-based factual accuracy scorer
    )

@task
def multiple_choice_mmlu() -> Task:
    """MMLU-style multiple choice with exact matching."""
    return Task(
        dataset=csv_dataset("eval_data/mmlu_subset.csv"),
        solver=[generate()],
        scorer=match(),  # Deterministic exact-match scorer
    )

# Run evaluation from Python (also available via CLI: inspect eval task.py)
results = eval(
    factual_qa(),
    model="openai/gpt-4o",
    log_dir="./eval_logs",
    max_samples=100,
    temperature=0.0,  # Deterministic generation for reproducibility
)

# Access structured results
for log in results:
    print(f"Task: {log.eval.task}")
    print(f"Model: {log.eval.model}")
    print(f"Accuracy: {log.results.metrics['accuracy'].value:.3f}")
    print(f"Samples evaluated: {log.eval.config.max_samples}")
Code Fragment 29.9.2: Inspect AI: defining and running a custom evaluation task
Tip

Use Inspect's built-in eval catalog for quick baselines. Inspect ships with pre-built implementations of popular benchmarks including MMLU, GSM8K, ARC, HellaSwag, and TruthfulQA. You can run these directly from the command line with inspect eval inspect_evals/mmlu --model openai/gpt-4o. The catalog evaluations use validated prompt templates and scorers, so they serve as reliable baselines against which you can compare custom task implementations. See the full catalog at inspect.ai-safety-institute.org.uk.

Inspect's log viewer is a browser-based interface that displays each evaluation sample with full detail: the constructed prompt, the model's raw output, the scorer's decision with rationale, and metadata such as token counts and latency. This granular traceability makes it straightforward to audit individual failures and understand why a model received a particular score. The logs are stored as JSON files, enabling programmatic analysis and comparison across evaluation runs.
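Because the logs are plain JSON, cross-run analysis needs nothing beyond the standard library. The sketch below pulls task, model, and accuracy out of parsed log files; the key paths (`eval.task`, `results.metrics.accuracy.value`) mirror the fields printed in Code Fragment 29.9.2 but should be treated as an assumed, simplified schema rather than Inspect's exact log format.

```python
import json
from pathlib import Path

def summarize_log(log: dict) -> dict:
    """Extract task, model, and accuracy from one parsed eval log.

    Assumes a simplified, Inspect-like layout; adjust the key paths
    to match the log schema of your Inspect version.
    """
    return {
        "task": log["eval"]["task"],
        "model": log["eval"]["model"],
        "accuracy": log["results"]["metrics"]["accuracy"]["value"],
    }

def summarize_logs(log_dir: str) -> list[dict]:
    """Summarize every JSON log in a directory, in filename order."""
    return [summarize_log(json.loads(p.read_text()))
            for p in sorted(Path(log_dir).glob("*.json"))]
```

A table built this way makes it easy to spot regressions across evaluation runs without opening the web viewer.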

3. lighteval (HuggingFace)

lighteval is HuggingFace's evaluation library, designed for tight integration with the HuggingFace Hub ecosystem. Its primary strengths are native support for models hosted on the Hub, seamless dataset loading via the datasets library, and built-in support for multi-GPU evaluation using accelerate or nanotron. If your workflow centers on HuggingFace models and you need to evaluate models at scale across multiple GPUs, lighteval is the natural choice.

Tasks in lighteval are defined using a LightevalTaskConfig that specifies the prompt function, dataset, metrics, and evaluation parameters. The prompt function receives a dataset row and returns a formatted prompt with gold references. lighteval distinguishes between "generative" tasks (where the model produces free-form text) and "loglikelihood" tasks (where the model scores candidate completions). This distinction affects both the inference path and the available metrics.


# lighteval: defining a custom task and running evaluation
# Install: pip install lighteval[accelerate]
from lighteval.tasks.lighteval_task import LightevalTaskConfig
from lighteval.tasks.requests import Doc
from lighteval.metrics.metrics import Metrics

# Define prompt formatting function
def mmlu_prompt(line: dict, task_name: str = "") -> Doc:
    """Format an MMLU-style multiple choice question."""
    choices = [line["A"], line["B"], line["C"], line["D"]]
    query = f"Question: {line['question']}\n"
    query += "".join(
        f"{letter}. {choice}\n"
        for letter, choice in zip("ABCD", choices)
    )
    query += "Answer:"

    return Doc(
        task_name=task_name,
        query=query,
        choices=[" A", " B", " C", " D"],  # Note: leading space matters
        gold_index=["A", "B", "C", "D"].index(line["answer"]),
    )

# Register task configuration
custom_mmlu = LightevalTaskConfig(
    name="custom_mmlu",
    prompt_function=mmlu_prompt,
    suite=["custom"],
    hf_repo="cais/mmlu",
    hf_subset="abstract_algebra",
    metric=[Metrics.loglikelihood_acc, Metrics.loglikelihood_acc_norm],
    trust_dataset=True,
    stop_sequence=["\n"],
)

# Run from CLI with multi-GPU support:
# lighteval accelerate \
#     --model_args "pretrained=meta-llama/Llama-3.1-8B" \
#     --tasks "custom|custom_mmlu|0|0" \
#     --output_dir ./results \
#     --num_fewshot 5
Code Fragment 29.9.3: lighteval: defining a custom task and running evaluation

The task string format "suite|task_name|num_fewshot|truncate_fewshot" is a convention specific to lighteval. The multi-GPU pipeline distributes samples across available devices using HuggingFace Accelerate, which is particularly valuable when evaluating large models on comprehensive benchmark suites. lighteval also integrates with the HuggingFace Open LLM Leaderboard, meaning you can reproduce leaderboard scores locally using the same harness configuration that produces the published results.
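Because the task string is a plain pipe-delimited convention, it is easy to build and validate programmatically. The helpers below are a convenience sketch for scripting evaluation runs, not part of lighteval's API:

```python
def make_task_string(suite: str, task: str, num_fewshot: int = 0,
                     truncate_fewshot: int = 0) -> str:
    """Build a lighteval task string: suite|task_name|num_fewshot|truncate_fewshot."""
    return f"{suite}|{task}|{num_fewshot}|{truncate_fewshot}"

def parse_task_string(spec: str) -> dict:
    """Split a task string back into its four fields."""
    suite, name, fewshot, truncate = spec.split("|")
    return {"suite": suite, "task": name,
            "num_fewshot": int(fewshot), "truncate_fewshot": int(truncate)}

print(make_task_string("custom", "custom_mmlu", 5))  # custom|custom_mmlu|5|0
```

Generating task strings this way avoids the silent typos that an ad-hoc string literal invites when sweeping over few-shot settings.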

4. lm-evaluation-harness (EleutherAI)

The lm-evaluation-harness from EleutherAI is the most widely adopted evaluation framework in the open-source LLM community. Originally developed to evaluate the GPT-Neo and GPT-J model families, it has grown into a comprehensive evaluation platform with hundreds of pre-defined tasks spanning knowledge, reasoning, code generation, and multilingual capabilities. Many published model evaluations, including those on the HuggingFace Open LLM Leaderboard (prior to its migration to lighteval), use this harness.

Tasks are primarily defined using YAML configuration files that specify the dataset source, prompt template (via Jinja2 templates), few-shot configuration, output processing, and metrics. This declarative approach makes it easy to define new tasks without writing Python code. For more complex tasks, a Python-based ConfigurableTask class provides full programmatic control. Code Fragment 29.9.4 shows a typical YAML task definition and the corresponding CLI invocation.


# lm-evaluation-harness: YAML task definition and CLI usage
# Install: pip install lm-eval

# File: tasks/custom_qa.yaml
# ─────────────────────────────────
# task: custom_factual_qa
# dataset_path: json
# dataset_kwargs:
#   data_files: eval_data/factual_qa.jsonl
# output_type: generate_until
# doc_to_text: "Question: {{question}}\nAnswer:"
# doc_to_target: "{{answer}}"
# generation_kwargs:
#   until: ["\n\n"]
#   max_gen_toks: 128
#   temperature: 0.0
# metric_list:
#   - metric: exact_match
#     aggregation: mean
#   - metric: f1
#     aggregation: mean
# num_fewshot: 5

# Python API usage (equivalent to CLI)
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-3.1-8B,dtype=bfloat16",
    tasks=["mmlu", "hellaswag", "arc_challenge"],
    num_fewshot=5,
    batch_size="auto",
    device="cuda",
    log_samples=True,
)

# Access aggregated results (guard against tasks that report other metrics)
for task_name, task_result in results["results"].items():
    acc = task_result.get("acc,none", task_result.get("acc_norm,none"))
    stderr = task_result.get("acc_stderr,none")
    if acc is not None and stderr is not None:
        print(f"{task_name}: accuracy={acc:.4f} (+/- {stderr:.4f})")
    else:
        print(f"{task_name}: accuracy={acc}")
Code Fragment 29.9.4: lm-evaluation-harness: YAML task definition and CLI usage

The harness supports task groups, which bundle related tasks into a single evaluation run. For example, the mmlu group encompasses all 57 MMLU subtasks, and the leaderboard group reproduces the full Open LLM Leaderboard evaluation suite. Few-shot configuration is handled at the task level, with support for both random and fixed few-shot example selection. The batch_size="auto" option dynamically adjusts batch size based on available GPU memory, which simplifies evaluation across different hardware configurations.

Warning

YAML template whitespace is semantically significant. In lm-evaluation-harness task definitions, the Jinja2 templates in doc_to_text and doc_to_target are sensitive to whitespace. A trailing space or missing newline in the prompt template can change benchmark scores by 1 to 3 percentage points, especially for few-shot evaluation where the formatting compounds across examples. Always verify your prompt template by inspecting the fully rendered prompt for at least a few samples using --log_samples before running a full evaluation.
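A toy demonstration of the problem, using plain Python string formatting rather than Jinja2: the two templates below differ only by one trailing space, and `repr` makes the invisible difference explicit. The templates are illustrative, not taken from any shipped task.

```python
def render(template: str, question: str) -> str:
    """Render a minimal prompt template (stand-in for a Jinja2 render)."""
    return template.format(question=question)

strict = "Question: {question}\nAnswer:"
loose = "Question: {question}\nAnswer: "  # trailing space after the colon

a = render(strict, "What is 2 + 2?")
b = render(loose, "What is 2 + 2?")
print(repr(a))  # 'Question: What is 2 + 2?\nAnswer:'
print(repr(b))  # 'Question: What is 2 + 2?\nAnswer: '
assert a != b  # visually identical, but they tokenize differently
```

In a 5-shot prompt the discrepancy repeats six times, which is why whitespace bugs tend to hit few-shot scores hardest.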

5. Cross-Harness Divergence

One of the most persistent and underappreciated problems in LLM evaluation is that the same benchmark can produce meaningfully different scores when run through different harnesses. A model that scores 72% on MMLU in one harness might score 68% or 76% in another, even with the same model weights, the same dataset split, and nominally the same metric. These discrepancies arise from implementation differences that are individually small but collectively significant.

The primary sources of divergence fall into four categories. First, prompt template differences: the exact wording, formatting, and few-shot example selection vary across harnesses. Second, tokenization and truncation: how the harness handles context-length limits when the prompt exceeds the model's maximum context differs. Third, scoring methodology: the way loglikelihood scores are normalized (per-token vs. per-character vs. unnormalized) and how ties are broken in multiple-choice scoring varies. Fourth, generation parameters: default temperature, top-p, and stop sequences can differ.
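Of these, the scoring category is the easiest to isolate. The sketch below selects a multiple-choice answer under unnormalized, per-token, and per-character loglikelihood normalization; the logprob numbers are invented purely to show that the normalization choice alone can flip the selected answer.

```python
def pick_answer(candidates: list[tuple], norm: str = "none") -> str:
    """Select the best candidate under a given loglikelihood normalization.

    candidates: (label, total_logprob, n_tokens, n_chars) tuples.
    """
    def score(c):
        _, logprob, n_tokens, n_chars = c
        if norm == "per_token":
            return logprob / n_tokens
        if norm == "per_char":
            return logprob / n_chars
        return logprob  # unnormalized total loglikelihood

    return max(candidates, key=score)[0]

# Illustrative numbers: a short answer with the higher total logprob vs.
# a longer answer that scores better per token and per character
candidates = [
    ("A", -4.0, 2, 8),
    ("B", -6.0, 6, 30),
]
print(pick_answer(candidates, "none"))       # A
print(pick_answer(candidates, "per_token"))  # B
print(pick_answer(candidates, "per_char"))   # B
```

Two harnesses can therefore disagree on an example even when the model returns byte-identical logprobs to both.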


# Demonstrating cross-harness divergence on the same MMLU subset
# This script loads results produced by multiple harness runs
# and quantifies the variance

import json
import numpy as np
from pathlib import Path

def compare_harness_results(results_dir: str) -> dict:
    """Load and compare results from multiple harness runs."""
    results_dir = Path(results_dir)
    harness_scores = {}

    for result_file in results_dir.glob("*.json"):
        with open(result_file) as f:
            data = json.load(f)
        harness_name = data.get("harness", result_file.stem)
        scores = data.get("task_scores", {})
        harness_scores[harness_name] = scores

    # Compute cross-harness statistics for each task
    comparison = {}
    all_tasks = set()
    for scores in harness_scores.values():
        all_tasks.update(scores.keys())

    for task in sorted(all_tasks):
        task_scores = []
        for harness, scores in harness_scores.items():
            if task in scores:
                task_scores.append((harness, scores[task]))

        if len(task_scores) >= 2:
            values = [s for _, s in task_scores]
            comparison[task] = {
                "scores": dict(task_scores),
                "mean": np.mean(values),
                "std": np.std(values),
                "range": max(values) - min(values),
                "max_divergence_pct": (max(values) - min(values)) / np.mean(values) * 100,
            }

    return comparison

# Example output (illustrative, based on published divergence studies):
# MMLU (abstract_algebra):
#   lm-eval-harness: 0.720 (5-shot, space-separated choices)
#   lighteval:       0.695 (5-shot, newline-separated choices)
#   inspect:         0.710 (5-shot, lettered choices with periods)
# Range: 0.025, Divergence: 3.5%
Code Fragment 29.9.5: Demonstrating cross-harness divergence on the same MMLU subset
Key Insight

Always report the harness, version, and configuration. When publishing or comparing benchmark scores, the harness name and version, the exact task configuration (including prompt template), the number of few-shot examples, generation parameters (temperature, top-p, max tokens), and any truncation settings are all necessary for reproducibility. A benchmark score without this metadata is essentially unverifiable. The best practice is to include the full harness configuration file (YAML or Python) as supplementary material alongside any published results. Cross-reference the reproducibility practices from Section 29.7 for config management with Hydra and experiment tracking with MLflow.

6. Lab: Reproducing a Published Benchmark Score

This lab walks through the process of reproducing a published MMLU score, quantifying the variance introduced by harness choice and configuration, and documenting the results for a reproducibility report. The exercise uses all three harnesses to evaluate the same model on the same MMLU subset, then analyzes the sources of divergence.


# Lab: Reproduce MMLU score and quantify cross-harness variance
# Step 1: Run evaluation with lm-evaluation-harness
import json
import subprocess
from pathlib import Path

model_name = "meta-llama/Llama-3.1-8B"
mmlu_subset = "mmlu_abstract_algebra"

# Run lm-eval-harness (via subprocess; also works via Python API)
subprocess.run([
    "lm_eval",
    "--model", "hf",
    "--model_args", f"pretrained={model_name},dtype=bfloat16",
    "--tasks", mmlu_subset,
    "--num_fewshot", "5",
    "--batch_size", "auto",
    "--output_path", "./results/lm_eval",
    "--log_samples",
], check=True)

# Step 2: Run with Inspect AI
subprocess.run([
    "inspect", "eval",
    "inspect_evals/mmlu",
    "--model", f"hf/{model_name}",
    "-T", "subject=abstract_algebra",
    "--log-dir", "./results/inspect",
], check=True)

# Step 3: Compare results across harnesses
def load_lm_eval_results(path: str) -> float:
    with open(path) as f:
        data = json.load(f)
    return data["results"][mmlu_subset]["acc,none"]

def load_inspect_results(log_dir: str) -> float:
    from inspect_ai.log import read_eval_log
    logs = sorted(Path(log_dir).glob("*.json"))  # sort so [-1] is the latest log
    log = read_eval_log(str(logs[-1]))
    return log.results.metrics["accuracy"].value

# Step 4: Generate reproducibility report
report = {
    "model": model_name,
    "task": "MMLU (abstract_algebra)",
    "fewshot": 5,
    "results": {
        "lm_eval_harness": {"version": "0.4.x", "accuracy": None},
        "inspect_ai": {"version": "0.3.x", "accuracy": None},
    },
    "divergence_analysis": {
        "prompt_template_diff": "See attached templates",
        "tokenization_notes": "Both use HF tokenizer; truncation strategy differs",
        "scoring_notes": "lm-eval uses unnormalized loglikelihood; Inspect uses normalized",
    },
}
print(json.dumps(report, indent=2))
Code Fragment 29.9.6: Lab: Reproduce MMLU score and quantify cross-harness variance
Tip

Log the fully rendered prompts, not just the templates. When debugging cross-harness divergence, the most productive first step is to extract the fully rendered prompt (with few-shot examples, formatting, and any special tokens) from each harness for the same sample. Side-by-side comparison of the rendered prompts will immediately reveal formatting differences that cause score divergence. In lm-evaluation-harness, use --log_samples. In Inspect, check the evaluation log's sample records. In lighteval, use the --save_details flag.
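Once the rendered prompts are extracted, a unified diff pinpoints exactly where two harnesses diverge. A minimal stdlib sketch (the prompt strings here are placeholders for whatever your harness logs contain):

```python
import difflib

def diff_prompts(prompt_a: str, prompt_b: str,
                 name_a: str = "harness_a", name_b: str = "harness_b") -> str:
    """Return a unified diff of two rendered prompts, line by line."""
    return "\n".join(difflib.unified_diff(
        prompt_a.splitlines(), prompt_b.splitlines(),
        fromfile=name_a, tofile=name_b, lineterm="",
    ))

print(diff_prompts(
    "Question: What is 2 + 2?\nAnswer:",
    "Question: What is 2 + 2?\nAnswer: ",
))
```

For whitespace-only differences like the one above, piping the diff through `repr` on each line (or diffing character by character) makes the change unmissable.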

7. Choosing a Harness

The choice of harness depends on your specific requirements. The following comparison summarizes the key trade-offs across the three major harnesses.

Comparison: Evaluation Harnesses

Feature               | lm-evaluation-harness      | lighteval                | Inspect AI
Organization          | EleutherAI                 | HuggingFace              | UK AISI
Task definition       | YAML + Jinja2              | Python classes           | Python decorators
Pre-built tasks       | 400+                       | 100+                     | 50+ (safety focus)
Multi-GPU             | Via FSDP                   | Via Accelerate/Nanotron  | Via model provider
API model support     | OpenAI, Anthropic, etc.    | Inference Endpoints      | OpenAI, Anthropic, HF, etc.
Multi-turn evaluation | Limited                    | Limited                  | Native solver pipelines
Log viewer            | JSON files                 | JSON + Leaderboard       | Interactive web UI
Safety evaluations    | Community contributed      | Some                     | First-class support
Best for              | Comprehensive benchmarking | HuggingFace ecosystem    | Auditable safety evals

For comprehensive model benchmarking with maximum task coverage, lm-evaluation-harness remains the standard. For workflows embedded in the HuggingFace ecosystem with multi-GPU requirements, lighteval is the most natural fit. For safety evaluations, compliance audits, and scenarios requiring full auditability, Inspect AI provides the strongest guarantees. Many organizations use multiple harnesses, running lm-evaluation-harness for broad benchmarking and Inspect for targeted safety evaluations. The testing strategies from Section 29.4 complement harness-based evaluation by covering application-level behaviors that benchmarks do not capture.

Research Frontier

Toward harness-agnostic evaluation specifications. The cross-harness divergence problem has motivated efforts toward standardized evaluation specifications that can be executed by any harness and produce comparable results.

The key open problems include: (1) defining a portable prompt specification format that eliminates template-induced variance; (2) standardizing tokenization and truncation behavior for fair comparison; (3) creating reference implementations with certified outputs that harnesses can validate against; and (4) building meta-evaluation suites that quantify harness-induced measurement error. The Holistic Evaluation of Language Models (HELM) project at Stanford has made progress on specification standardization, but a truly portable evaluation format remains an open challenge. See also the evaluation integrity concerns discussed in Section 29.2 regarding benchmark contamination.

Bibliography

Evaluation Harnesses

UK AI Safety Institute. (2024). "Inspect: A Framework for Large Language Model Evaluations." inspect.ai-safety-institute.org.uk

The official documentation and source for Inspect AI, the UK AISI's evaluation framework. Covers the task specification API, solver pipelines, scorer implementations, and the built-in evaluation catalog. The design prioritizes auditability and determinism for safety evaluations.
Inspect AISafety Evaluation

Fourrier, C., Habib, N., Launay, J., et al. (2024). "lighteval: A Lightweight Framework for LLM Evaluation." GitHub: huggingface/lighteval

HuggingFace's evaluation library, designed for seamless integration with the HuggingFace Hub, Accelerate, and Nanotron. Supports both generative and loglikelihood evaluation modes with multi-GPU scaling. Powers the HuggingFace Open LLM Leaderboard evaluation pipeline.
lightevalHuggingFace

Gao, L., Tow, J., Abbasi, B., et al. (2024). "A Framework for Few-Shot Language Model Evaluation." GitHub: EleutherAI/lm-evaluation-harness

The original and most widely adopted open-source LLM evaluation framework. Provides 400+ pre-defined tasks, YAML-based task specification, and support for local and API-based models. The de facto standard for published model evaluations in the open-source community.
lm-eval-harnessEleutherAI
Evaluation Methodology

Liang, P., Bommasani, R., Lee, T., et al. (2023). "Holistic Evaluation of Language Models." Transactions on Machine Learning Research. arXiv:2211.09110

The HELM benchmark suite from Stanford, which evaluates models across a wide range of scenarios with standardized prompting and metrics. Provides a framework for understanding how evaluation design choices affect reported model performance.
HELMComprehensive Evaluation

Biderman, S., Schoelkopf, H., et al. (2024). "Lessons from the Trenches on Reproducible Evaluation of Language Models." arXiv:2405.14782

A detailed analysis of reproducibility challenges in LLM evaluation, documenting specific sources of cross-harness divergence and proposing best practices for reporting evaluation results. Essential reading for understanding why the same benchmark can yield different scores across frameworks.
ReproducibilityBest Practices

Hendrycks, D., Burns, C., Basart, S., et al. (2021). "Measuring Massive Multitask Language Understanding." ICLR 2021. arXiv:2009.03300

The original MMLU benchmark paper, used throughout this section as the primary example for cross-harness comparison. Defines the 57-subject evaluation covering STEM, humanities, social sciences, and professional domains.
MMLUBenchmark
Self-Check Questions
  1. Why can the same benchmark produce different scores across different evaluation harnesses? Name at least three sources of cross-harness divergence.
  2. Inspect AI focuses on safety evaluations. What architectural features (composable solver pipelines, structured and auditable logs) make it particularly suited for safety and compliance evaluations?
  3. When reproducing a published benchmark score, what steps should you verify to ensure your setup matches the original evaluation (prompt template, few-shot selection, tokenization)?
  4. Under what circumstances would you choose lm-evaluation-harness over lighteval, and when would lighteval be preferable?
Key Takeaways

What Comes Next

In this section we covered harness design patterns, the three major harnesses (Inspect AI, lighteval, and lm-evaluation-harness), cross-harness divergence, and how to choose among them. In Section 29.10: LLM-as-Judge: Reliability, Debiasing, and Training Judge Models, we continue, starting with a taxonomy of judge biases.