Appendices
Appendix J: Datasets, Benchmarks, and Leaderboards

Evaluation Benchmarks

Benchmarks provide standardized measurements of model capabilities. No single benchmark tells the full story; each measures a specific aspect of intelligence. The following covers the most widely referenced benchmarks in LLM evaluation.

Knowledge and Reasoning Benchmarks

MMLU (Massive Multitask Language Understanding)

What It Measures: Broad knowledge across 57 subjects: STEM, humanities, social sciences, and professional domains (law, medicine, accounting). Tests factual recall and reasoning.
Format: Multiple-choice questions (4 options); 14,042 questions total
Evaluation: Accuracy (% correct). Reported as an overall average and per-subject scores. Few-shot (5-shot) is the standard evaluation protocol.
Access: Hugging Face (cais/mmlu); also available via lm-evaluation-harness
Limitations: Significant data contamination risk (many questions appear on the open web). Some questions have disputed correct answers. The multiple-choice format limits what it can measure. MMLU-Pro (a harder, 10-choice variant) was introduced to address ceiling effects.
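
The 5-shot protocol prepends five solved questions from the subject's dev split to the unsolved test question. A minimal sketch of that prompt construction (the helper names and the inline example data are illustrative, not drawn from the real dataset):

```python
# Illustrative sketch of standard MMLU few-shot prompt formatting.

CHOICES = ["A", "B", "C", "D"]

def format_example(question, options, answer_idx=None):
    """Render one question in the usual MMLU multiple-choice layout.
    If answer_idx is given, the answer letter is filled in (a solved shot);
    otherwise the prompt ends at 'Answer:' for the model to complete."""
    lines = [question]
    for letter, option in zip(CHOICES, options):
        lines.append(f"{letter}. {option}")
    suffix = f" {CHOICES[answer_idx]}" if answer_idx is not None else ""
    lines.append("Answer:" + suffix)
    return "\n".join(lines)

def build_fewshot_prompt(dev_examples, test_question, test_options, subject):
    """Concatenate k solved dev examples before the unsolved test question."""
    header = (f"The following are multiple choice questions "
              f"(with answers) about {subject}.\n\n")
    shots = "\n\n".join(
        format_example(ex["question"], ex["options"], ex["answer"])
        for ex in dev_examples
    )
    return header + shots + "\n\n" + format_example(test_question, test_options)
```

The model's next-token prediction after the final "Answer:" (A, B, C, or D) is compared against the gold letter to score the question.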

ARC (AI2 Reasoning Challenge)

What It Measures: Science reasoning at the grade-school level. The Challenge set (ARC-C) contains questions answered incorrectly by simple retrieval and co-occurrence methods, requiring genuine reasoning.
Format: Multiple-choice science questions; 7,787 questions (Easy + Challenge sets)
Evaluation: Accuracy on the Challenge set (ARC-C) is the standard reported metric
Access: Hugging Face (allenai/ai2_arc)
Limitations: Relatively small dataset. Grade-school level means frontier models now achieve near-perfect scores, reducing its discriminative power.

HellaSwag

What It Measures: Commonsense reasoning through sentence completion. Given the beginning of a scenario, the model must select the most plausible continuation from four options.
Format: Multiple-choice (4 options); ~10,000 questions
Evaluation: Accuracy. Designed to be easy for humans (~95%) but hard for models at the time of creation (2019).
Access: Hugging Face (Rowan/hellaswag)
Limitations: Frontier models now exceed 95% accuracy, approaching the ceiling. Adversarial filtering was applied during creation, which means the wrong answers are specifically designed to fool older models, not necessarily current ones.
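
Multiple-choice benchmarks like HellaSwag are typically scored by computing the model's log-likelihood of each candidate continuation and picking the highest, often normalized by continuation length (lm-evaluation-harness reports the normalized variant as acc_norm). A minimal sketch of the selection rule, assuming the per-option log-likelihoods have already been computed by the model:

```python
def pick_continuation(loglikelihoods, lengths):
    """Select the continuation with the highest per-token log-likelihood.
    loglikelihoods: summed token log-probs, one per candidate ending;
    lengths: token count of each candidate, used for length normalization
    so longer endings are not penalized merely for having more tokens."""
    scores = [ll / max(n, 1) for ll, n in zip(loglikelihoods, lengths)]
    return max(range(len(scores)), key=scores.__getitem__)
```

Accuracy is then the fraction of examples where the selected index matches the gold continuation.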

TruthfulQA

What It Measures: Whether models generate truthful answers rather than reproducing common misconceptions. Questions are designed so that popular but incorrect answers are likely to be learned from web text.
Format: 817 questions spanning 38 categories (health, law, finance, conspiracies, etc.)
Evaluation: Percentage of responses judged truthful and informative (by a fine-tuned GPT judge or human evaluation)
Access: Hugging Face (truthful_qa)
Limitations: Small dataset size. Automated truthfulness evaluation via a GPT judge introduces its own biases. Some questions are culturally specific.

Mathematics Benchmarks

GSM8K (Grade School Math 8K)

What It Measures: Multi-step mathematical reasoning at the grade-school level. Problems require 2 to 8 sequential arithmetic operations with natural language reasoning.
Format: 1,319 test problems; each has a step-by-step solution and a final numerical answer
Evaluation: Accuracy (exact match on the final numerical answer). Chain-of-thought prompting is standard.
Access: Hugging Face (gsm8k)
Limitations: Frontier models now score above 95%, reaching the ceiling. High data contamination risk due to widespread internet availability. GSM-Plus and GSM-Hard extend the difficulty.
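
Reference solutions in GSM8K end with a "#### <answer>" line, so exact-match scoring reduces to extracting and normalizing that final number. A minimal sketch (the helper names are illustrative):

```python
import re

def extract_gsm8k_answer(solution_text):
    """GSM8K solutions end with '#### <answer>'; pull out that final number
    and strip thousands separators so '1,200' and '1200' compare equal."""
    match = re.search(r"####\s*([\-0-9.,]+)", solution_text)
    if match is None:
        return None
    return match.group(1).replace(",", "").strip()

def exact_match(model_text, reference_text):
    """Score one example: both answers must extract and be identical."""
    predicted = extract_gsm8k_answer(model_text)
    gold = extract_gsm8k_answer(reference_text)
    return predicted is not None and predicted == gold
```

The same extraction is applied to the model's chain-of-thought output, which is why evaluation prompts instruct the model to emit its final answer in the "#### " format.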

MATH

What It Measures: Competition-level mathematics spanning algebra, geometry, number theory, combinatorics, and probability. Significantly harder than GSM8K.
Format: 12,500 problems (5,000 test) with LaTeX solutions; difficulty levels 1 through 5
Evaluation: Accuracy (exact match on the final answer after normalization)
Access: Hugging Face (lighteval/MATH); original at hendrycks/math
Limitations: LaTeX formatting of answers makes exact matching tricky; equivalent expressions may be scored as incorrect. Contamination concerns exist for problems sourced from well-known competitions.
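
Because final answers are LaTeX strings, graders normalize them before comparing. The sketch below handles only a few trivial equivalences and is far less thorough than real graders such as the one shipped with hendrycks/math:

```python
def normalize_latex_answer(ans):
    """Very rough LaTeX answer normalization so trivially equivalent
    strings compare equal (illustrative, not a complete grader)."""
    ans = ans.strip().strip("$")                      # drop math-mode dollars
    ans = ans.replace(" ", "")                        # whitespace is insignificant
    ans = ans.replace("\\left", "").replace("\\right", "")
    ans = ans.replace("\\dfrac", "\\frac").replace("\\tfrac", "\\frac")
    if ans.startswith("\\frac") and not ans.startswith("\\frac{") and len(ans) >= 7:
        # \frac12 -> \frac{1}{2} for single-character numerator/denominator
        ans = "\\frac{" + ans[5] + "}{" + ans[6] + "}"
    return ans

def latex_exact_match(predicted, gold):
    return normalize_latex_answer(predicted) == normalize_latex_answer(gold)
```

Even with aggressive normalization, algebraically equivalent answers in different forms (e.g. a decimal versus a fraction) can still be scored incorrect, which is the limitation noted above.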

Code Generation Benchmarks

HumanEval / HumanEval+

What It Measures: Functional code generation: given a function signature and docstring, generate a correct Python implementation that passes all unit tests.
Format: 164 problems (HumanEval); HumanEval+ adds 80x more tests per problem to reduce false positives
Evaluation: pass@k: the probability that at least one of k generated solutions passes all tests. pass@1 is the most commonly reported metric.
Access: GitHub (openai/human-eval); HumanEval+ via evalplus
Limitations: Python only. Small dataset (164 problems). Many problems are simple algorithmic tasks. Significant contamination risk since the problems are widely known. SWE-bench is emerging as a more realistic alternative.
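
pass@k is estimated with the unbiased formula from the HumanEval paper: generate n >= k samples per problem, count the c samples that pass all tests, and compute 1 - C(n-c, k) / C(n, k):

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator from the HumanEval paper.
    n: total samples generated for a problem
    c: number of those samples that pass all unit tests
    k: the k in pass@k (requires k <= n)"""
    if n - c < k:
        # fewer than k failing samples exist, so any k-subset
        # must contain at least one passing sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The benchmark score is this estimate averaged over all 164 problems; sampling n > k and using the estimator gives much lower variance than literally drawing k samples per problem.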

SWE-bench / SWE-bench Verified

What It Measures: End-to-end software engineering: given a real GitHub issue from a popular Python repository, generate a patch that resolves the issue and passes the repository's existing test suite.
Format: SWE-bench contains 2,294 tasks drawn from 12 repositories (Django, Flask, scikit-learn, sympy, etc.). SWE-bench Verified is a human-validated 500-task subset with confirmed solvability and unambiguous test assertions.
Evaluation: Percentage of issues resolved: the generated patch must apply cleanly and pass all relevant tests. Unlike HumanEval's isolated functions, SWE-bench tasks require navigating multi-file repositories, understanding existing codebases, and writing patches that respect project conventions.
Access: swebench.com; GitHub (princeton-nlp/SWE-bench)
Limitations: Python only. Requires sandboxed execution environments for safe evaluation. Agent scaffolding (file navigation, tool use) significantly affects scores, making it hard to isolate model capability from system design. Cost per evaluation run is high due to repository setup and test execution.

Conversational and Preference Benchmarks

MT-Bench

What It Measures: Multi-turn conversational ability across 8 categories: writing, roleplay, extraction, reasoning, math, coding, knowledge (STEM), and knowledge (humanities).
Format: 80 two-turn questions; GPT-4 judges model responses on a 1-10 scale
Evaluation: Average GPT-4 score across all questions (max 10.0)
Access: GitHub (lm-sys/FastChat); part of the lmsys ecosystem
Limitations: Very small dataset (80 questions). GPT-4 as judge introduces bias toward GPT-4-like responses. Scores tend to cluster in a narrow range (6 to 9), making it hard to distinguish similar models. The 1-10 scale is subjective.

Chatbot Arena (LMSYS)

What It Measures: Overall model quality as perceived by real users. Users submit any prompt, receive anonymized responses from two random models, and vote for the better one.
Format: Pairwise human preference judgments, continuously collected via a public web interface
Evaluation: Elo rating (and Bradley-Terry model coefficients) derived from pairwise win rates. Higher Elo indicates stronger perceived quality.
Access: Live leaderboard at chat.lmsys.org; vote data periodically released
Limitations: Scores reflect the preferences of the user population, which skews toward tech-savvy English speakers. Prompt distribution is uncontrolled and evolves over time. Models that produce longer, more formatted responses may receive a "verbosity bias" advantage.
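
The leaderboard fits a Bradley-Terry model over the full vote history, but the classic online Elo update conveys the core idea of converting pairwise votes into ratings. A minimal sketch:

```python
def elo_update(rating_a, rating_b, outcome, k=32):
    """One Elo update from a single pairwise vote.
    outcome: 1.0 if model A wins, 0.0 if model B wins, 0.5 for a tie.
    k controls how far a single vote moves the ratings."""
    # expected score of A under the Elo logistic model
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
    delta = k * (outcome - expected_a)
    return rating_a + delta, rating_b - delta
```

An upset (a low-rated model beating a high-rated one) moves both ratings much more than an expected win, which is what lets the ranking converge from noisy individual votes.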

lm-evaluation-harness in Practice

Run standardized benchmarks on any Hugging Face model from the command line.

# pip install lm-eval
# Evaluate a model on MMLU and HellaSwag
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct \
    --tasks mmlu,hellaswag \
    --num_fewshot 5 \
    --batch_size 8 \
    --output_path ./eval_results/
Code Fragment 1: Running lm-evaluation-harness from the command line on MMLU and HellaSwag with 5-shot prompting. The --model_args flag accepts any Hugging Face model ID, and results are saved as JSON to the output path for comparison across models.

Loading Benchmarks with datasets

Load individual benchmark datasets programmatically for custom evaluation.

# pip install datasets
from datasets import load_dataset

# Load GSM8K math benchmark
gsm8k = load_dataset("gsm8k", "main", split="test")
print(f"GSM8K test size: {len(gsm8k)}")
print(f"Sample question: {gsm8k[0]['question'][:100]}...")

# Load MMLU (a specific subject)
mmlu = load_dataset("cais/mmlu", "college_physics", split="test")
print(f"MMLU college_physics: {len(mmlu)} questions")
Code Fragment 2: Loading benchmark datasets programmatically with datasets. GSM8K is loaded as a single split, while MMLU requires specifying the subject as a configuration name. This approach enables custom evaluation pipelines beyond what the harness supports.