Section 14.3: Datasets & Benchmarks

"The benchmark is the prompt is the curriculum is the evaluation. Pick a dataset and you have picked a worldview."
Census, Benchmark-Genealogist AI Agent

Big Picture

Part III's dataset layer differs from Part II's: instead of pretraining corpora at trillion-token scale, you care about instruction-tuning sets (Alpaca, ShareGPT, OpenAssistant, UltraChat), preference data (HH-RLHF, UltraFeedback, PRM800K), and the evaluation benchmarks that govern your prompt and your agent (MT-Bench, AlpacaEval, Arena-Hard, IFEval). This section catalogs them and tells you which to use when.

Prerequisites

This section assumes the instruction-tuning recipes from Section 8.1 and basic familiarity with Hugging Face Datasets from Section 12.4. The LLM-as-judge methodology covered later in the book deepens the evaluation aspects discussed here.

In Part III you are not pretraining anything; you are calling APIs, building prompts, and measuring whether your calls work. The relevant datasets are therefore prompt libraries, evaluation suites for instruction-following, and small public benchmarks you can run against any chat endpoint in minutes.

The benchmarks listed below are the ones whose numbers you will see quoted in model cards. Knowing what each measures (and what each does not measure) lets you read a "Claude beats GPT on X" claim with the right amount of skepticism.

Figure 14.3.1: The Part III benchmark map. Chat-quality benchmarks (LM Arena as the human-Elo ground truth, Arena-Hard-Auto as the cheap LLM-judge proxy, AllenAI's WildBench for real-user distributions, MT-Bench multi-turn, AlpacaEval 2.0, IFEval for instruction-following, SimpleBench for common-sense) tell you whether your prompt is good. Reasoning and knowledge benchmarks (MMLU-Pro, GPQA-Diamond, BBH, AIME 2024/25, MATH-500, HumanEval and MBPP for code, SWE-bench Verified for coding agents) tell you whether the model is capable. Function-calling and agent benchmarks (BFCL v3, AgentBench, Princeton's tau-bench v2) tell you whether tool use will work in production. Prompt registries (LangSmith Hub, Anthropic Cookbook, OpenAI Cookbook) supply starting points.

14.3.1 Instruction-following and chat benchmarks

AlpacaEval 2.0: pairwise LLM-as-judge eval against GPT-4-turbo on 805 prompts. Largely superseded by Arena-Hard-Auto for new work; still cited in older 2023-24 model cards.
MT-Bench: 80 multi-turn questions across 8 categories, GPT-4 as judge. The reference multi-turn benchmark before Arena-Hard.
Arena-Hard-Auto: 500 difficult prompts, LLM-as-judge. The current frontier-model proxy benchmark (the hard-prompt set of Arena-Hard v2 landed in 2025).
WildBench (AllenAI, 2024-25): real-user chat distributions; the dominant chat-eval benchmark in 2024-25 and often more discriminating than Arena-Hard.
MMLU-Pro (2024): the harder MMLU successor; should sit next to GPQA in model-card tables.
IFEval (Zhou et al., 2023, arXiv:2311.07911): the standard instruction-following benchmark.
SimpleBench (Philip, 2024-25): contamination-resistant "common sense" benchmark widely cited in 2025 model cards.
BIG-bench and BIG-bench Hard (BBH): a 204-task collection led by Google. BBH is a 23-task subset of the hardest tasks; still used for reasoning evaluation.

14.3.1.1 BIG-Bench and BBH: How They Are Scored

The original BIG-Bench (Srivastava et al., 2022) bundles 204 tasks contributed by 450 authors across 132 institutions. Each task ships with a fixed prompt template, a small held-out evaluation split, and a per-task scoring function (exact match, multiple-choice accuracy, BLEU, or a task-specific metric). A model's BIG-Bench score is the macro-average of the normalised score across tasks:

\mathrm{BIG\text{-}Bench}(M) \;=\; \frac{1}{|\mathcal{T}|} \sum_{t \in \mathcal{T}} \frac{s_t(M) - s_t^{\text{random}}}{s_t^{\text{best-human}} - s_t^{\text{random}}},

so a model that matches the best-human reference score on a task gets 1.0 on that task and a random-guess baseline gets 0.0. The macro-average penalises models that are stellar on a few tasks but middling on the long tail of unusual ones, which was the whole point: BIG-Bench was designed to be wide rather than tall.
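The normalisation above can be sketched in a few lines of Python. This is a minimal illustration of the formula as written in this section, not the official BIG-Bench scoring code; the task names and the raw, random, and best-human values are hypothetical.

```python
from statistics import mean


def normalised_score(raw: float, random_baseline: float, best_human: float) -> float:
    """Rescale a raw task score so random guessing maps to 0.0
    and the best-human reference score maps to 1.0."""
    span = best_human - random_baseline
    if span <= 0:
        raise ValueError("best_human must exceed random_baseline")
    return (raw - random_baseline) / span


def macro_average(tasks: dict[str, dict[str, float]]) -> float:
    """Unweighted mean of normalised scores: every task counts equally,
    regardless of how many examples it contains."""
    return mean(
        normalised_score(t["raw"], t["random"], t["best_human"])
        for t in tasks.values()
    )


# Hypothetical per-task numbers, chosen only to exercise the formula.
tasks = {
    "task_a": {"raw": 0.90, "random": 0.50, "best_human": 1.00},  # normalises to 0.8
    "task_b": {"raw": 0.30, "random": 0.25, "best_human": 0.75},  # normalises to 0.1
    "task_c": {"raw": 0.25, "random": 0.25, "best_human": 0.95},  # normalises to 0.0
}

score = macro_average(tasks)
print(f"macro-averaged normalised score: {score:.2f}")
```

Note how the strong result on task_a cannot compensate for the near-random results on the other two tasks: the macro-average weights each task equally.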

BBH (Suzgun et al., 2022) is the 23-task subset where the original BIG-Bench paper reported that no language model beat the average human evaluator. By 2024, Gemini 2.0 Pro reached 85% on BBH zero-shot and Claude 3.5 Sonnet reached 87%, so BBH is now reported as a saturating reasoning benchmark rather than a frontier one. It is still useful for spot-checking new architecture work on hard chain-of-thought reasoning.

# Run BBH zero-shot via lm-evaluation-harness
# Quote the requirement so shells such as zsh do not treat [hf] as a glob pattern
pip install "lm-eval[hf]==0.4.4"
lm-eval --model hf \
        --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct,dtype=bfloat16 \
        --tasks bbh_zeroshot \
        --batch_size 8 \
        --output_path results/llama-3.1-8b-bbh.json
# Average per-task accuracy lands around 0.62 for Llama 3.1 8B Instruct,
# versus 0.85+ for the current frontier models.

Code Fragment 14.3.1a: Reproducing a BBH score takes one shell command. The harness handles prompt templating, exact-match scoring, and per-task aggregation for all 23 BBH subtasks.

Worked Example: BBH per-task variance vs the headline number

A 0.62 average BBH score sounds uniform, but the per-task spread is usually 0.20 to 0.95. Llama 3.1 8B Instruct's BBH report shows it scoring 0.93 on boolean_expressions (a closed-form evaluation task) but only 0.21 on tracking_shuffled_objects_seven_objects (multi-step state tracking). When picking a small open model for an agentic product, ignore the BBH average and inspect the per-task scores that match your workload. A bot that calls Python tools cares about boolean_expressions; a personal-assistant bot that remembers many entities cares about the tracking_shuffled_objects tasks. Two models with the same 0.62 average can differ widely in how well they fit your task.
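A short sketch of that inspection step follows. It assumes you have already collected per-task accuracies into a plain dictionary; it does not parse the harness output file. The two values marked "from the worked example" are the ones quoted above, and the other two are hypothetical fillers chosen so that the average comes out at 0.62.

```python
from statistics import mean

# Per-task BBH accuracies. Two values are from the worked example;
# the remaining entries are hypothetical placeholders.
per_task = {
    "boolean_expressions": 0.93,                      # from the worked example
    "tracking_shuffled_objects_seven_objects": 0.21,  # from the worked example
    "date_understanding": 0.70,                       # hypothetical
    "navigate": 0.64,                                 # hypothetical
}


def summarise(scores: dict[str, float]) -> dict[str, object]:
    """Return the headline mean alongside the best and worst tasks."""
    worst = min(scores, key=scores.get)
    best = max(scores, key=scores.get)
    return {
        "mean": mean(scores.values()),
        "worst_task": worst,
        "best_task": best,
        "spread": scores[best] - scores[worst],
    }


def workload_mean(scores: dict[str, float], relevant: list[str]) -> float:
    """Average only the tasks that resemble your own workload."""
    return mean(scores[name] for name in relevant)


summary = summarise(per_task)
state_tracking = workload_mean(per_task, ["tracking_shuffled_objects_seven_objects"])

print(f"headline mean: {summary['mean']:.2f}, spread: {summary['spread']:.2f}")
print(f"worst task: {summary['worst_task']}")
print(f"state-tracking workload mean: {state_tracking:.2f}")
```

The headline mean hides the fact that the workload-relevant score for a state-tracking assistant is roughly a third of the average.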

HumanEval and MBPP: canonical coding-task benchmarks. Heavily contaminated in 2026 but still required for model cards.
SWE-bench: real GitHub issues turned into a benchmark. SWE-bench Verified is the human-validated subset; the standard for agentic-coding evaluation.

14.3.2 Prompt registries and example libraries

LangSmith Hub: thousands of community-shared prompt templates with version history.
Anthropic Cookbook: Anthropic-curated patterns (tool use, vision, prompt caching, computer use).
OpenAI Cookbook: the most-cited prompt and orchestration recipe book on the web.
awesome-chatgpt-prompts: enormous community prompt list. Best for inspiration, not for production.

14.3.3 Function-calling and tool-use benchmarks

Berkeley Function-Calling Leaderboard (BFCL v3, 2024): tool-use evaluation across model families. v3 added multi-turn function-calling tracks.
AgentBench: agentic-task evaluation across simulated environments.
tau-bench / tau-bench v2 (2024-12): tool-use with realistic user simulation; the better benchmark for "does my agent actually solve a customer-service task". v2 fixed several validation issues from the original.

14.3.4 Comparing the chat benchmarks

Table 14.3.1b: Chat-quality benchmarks to know.

Benchmark	Format	Size	Judge	Best for
LM Arena	Blind pairwise	continuous	Human	Ground truth ranking
Arena-Hard	Hard prompts	500	LLM-as-judge	Cheap LM Arena proxy
AlpacaEval 2.0	Pairwise vs GPT-4-turbo	805	LLM-as-judge	Quick instruct-tuning eval
MT-Bench	Multi-turn	80	LLM-as-judge	Conversation flow
SWE-bench Verified	Real GitHub issues	500	Test suite	Coding agents

Warning: LLM-as-judge is biased toward its own family

If you score AlpacaEval with GPT-4 as judge, GPT-family models tend to score better. If you score with Claude as judge, Claude-family models tend to score better. The effect is small but real. The defenses are (a) report both judges, (b) use a deterministic test suite (SWE-bench, HumanEval) whenever feasible, and (c) trust LM Arena over any single LLM-judged benchmark for high-stakes decisions. The standard reference is "From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge" (Li et al., 2024, arXiv:2411.16594), which catalogues the failure modes.
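Defense (a) can be sketched as a small report that puts the two judges side by side. This is an illustrative sketch only: the verdict lists are hypothetical, the judge labels are placeholders, and counting a tie as half a win is one common convention rather than a rule taken from any specific benchmark.

```python
from collections import Counter

# Hypothetical verdicts: for each of 8 prompts, each judge names the winner
# of a pairwise comparison between model "A" and model "B", or "tie".
verdicts = {
    "judge_gpt": ["A", "A", "B", "A", "tie", "A", "B", "A"],
    "judge_claude": ["A", "B", "B", "A", "tie", "B", "B", "A"],
}


def win_rate(labels: list[str], model: str) -> float:
    """Share of comparisons won by `model`, counting a tie as half a win."""
    counts = Counter(labels)
    return (counts[model] + 0.5 * counts["tie"]) / len(labels)


def agreement(first: list[str], second: list[str]) -> float:
    """Fraction of prompts on which the two judges return the same verdict."""
    return sum(x == y for x, y in zip(first, second)) / len(first)


rate_gpt = win_rate(verdicts["judge_gpt"], "A")
rate_claude = win_rate(verdicts["judge_claude"], "A")
judge_agreement = agreement(verdicts["judge_gpt"], verdicts["judge_claude"])

print(f"model A win rate under judge_gpt:    {rate_gpt:.4f}")
print(f"model A win rate under judge_claude: {rate_claude:.4f}")
print(f"gap between judges:                  {rate_gpt - rate_claude:.4f}")
print(f"judge agreement:                     {judge_agreement:.2f}")
```

Reporting the gap and the agreement rate next to the two win rates makes it obvious when a headline result depends on which judge was used.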

Warning: SWE-bench Verified scores depend on harness, not just model

SWE-bench Verified results swing 10-15 percentage points depending on whether the agent has internet access, how the test harness allocates retries, and how the patch-application logic handles malformed diffs. When a model card claims "X% on SWE-bench Verified", check the accompanying scaffolding description before comparing across labs. The agent loop, not the model, often dominates the score.

Tip: Build your own micro-benchmark before trusting anyone else's

The most useful eval is the 30-prompt set you wrote that targets your actual use case. Run it against three model families at the start of any Part III project; revisit it whenever a new model is released. A custom 30-prompt benchmark with regression history beats any single public benchmark for "should I switch providers".
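A minimal sketch of such a micro-benchmark with regression history is shown below. Everything in it is an assumption made for illustration: `stub_model` stands in for a real API call, the three cases stand in for your 30 prompts, and the JSONL history file and run labels are hypothetical conventions.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable

Check = Callable[[str], bool]

# Each case pairs a prompt with a deterministic check on the model output.
CASES: list[tuple[str, Check]] = [
    ("Reply with the single word OK.", lambda out: out.strip() == "OK"),
    ("Return a JSON object with key 'status'.", lambda out: "status" in json.loads(out)),
    ("List three colours, one per line.", lambda out: len(out.splitlines()) == 3),
]


def run_suite(call_model: Callable[[str], str], cases: list[tuple[str, Check]]) -> float:
    """Return the pass rate; a check that raises counts as a failure."""
    passed = 0
    for prompt, check in cases:
        try:
            ok = bool(check(call_model(prompt)))
        except Exception:
            ok = False
        passed += ok
    return passed / len(cases)


def record(history_path: Path, model: str, score: float, run_label: str) -> None:
    """Append one result row to a JSONL history file."""
    with history_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps({"model": model, "score": score, "run": run_label}) + "\n")


def regressed(history_path: Path, model: str, tolerance: float = 0.0) -> bool:
    """True if the latest score for `model` is below the previous one."""
    lines = history_path.read_text(encoding="utf-8").splitlines()
    scores = [row["score"] for row in map(json.loads, lines) if row["model"] == model]
    return len(scores) >= 2 and scores[-1] < scores[-2] - tolerance


def stub_model(prompt: str) -> str:
    """Placeholder for a real chat-endpoint call."""
    return "OK" if "single word" in prompt else '{"status": "done"}'


history = Path(tempfile.mkdtemp()) / "history.jsonl"
score = run_suite(stub_model, CASES)
record(history, "stub-model", score, "run-1")
record(history, "stub-model", score - 0.1, "run-2")  # simulate a later, worse run
print(f"pass rate: {score:.2f}, regressed: {regressed(history, 'stub-model')}")
```

Swapping `stub_model` for a function that calls each provider's endpoint gives you the three-model comparison described above, and the history file answers the "should I switch providers" question with your own data.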

14.3.5 Public prompt corpora for fine-tuning

Alpaca (52k synthetic instructions): the canonical instruction dataset of 2023; still used for Chapter 18 examples.
UltraChat-200k: 200k multi-turn chats; the canonical "real instruction tuning" dataset.
Tulu 3 SFT Mixture (~1M examples): the cleanest large open SFT dataset of 2025.

Key Takeaways

Part III datasets are prompts and judgments, not corpora: AlpacaEval, MT-Bench, Arena-Hard, WildBench, IFEval, and SimpleBench measure instruction-following on chat endpoints in minutes, not pretraining at trillion-token scale.
WildBench and Arena-Hard are the 2025 chat-eval workhorses: real-user distributions and 500 difficult prompts with LLM-as-judge typically discriminate models better than legacy MT-Bench and AlpacaEval, with LM Arena pairwise human Elo as the ground truth.
SWE-bench Verified is the agentic-coding gold standard: 500 human-validated GitHub issues with deterministic test-suite judging, though scores swing 10-15 points based on harness scaffolding, not just model quality.
Function-calling benchmarks consolidated around BFCL v3 and tau-bench v2: multi-turn function-calling on BFCL v3 plus realistic customer-service simulation on tau-bench v2 cover the actual tool-use surface for agent deployments.
LLM-as-judge is biased toward its own family: GPT-4 judging favors GPT-family outputs, Claude judging favors Claude outputs, so reporting both judges, preferring deterministic test suites, and trusting LM Arena for high-stakes decisions all matter.
A 30-prompt custom benchmark beats public ones: targeted regression coverage of your actual use case across three model families at project start is more useful than any single public leaderboard for "should I switch providers."

What's Next?

In the next section, Section 14.4: Models, we build on the material covered here.

Further Reading

Instruction Datasets

Wang, Y., Mishra, S., Alipoormolabashi, P., et al. (2022). "Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks." EMNLP 2022. arXiv:2204.07705. Reference large-scale instruction-tuning dataset.

Conover, M., Hayes, M., Mathur, A., et al. (2023). "Free Dolly: Introducing the World's First Truly Open Instruction-Tuned LLM." Databricks. databricks.com/blog/dolly. The Dolly-15k dataset; reference for fully-open instruction tuning data.

Evaluation Datasets

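Hendrycks, D., et al. (2021). "Measuring Massive Multitask Language Understanding." ICLR 2021. arXiv:2009.03300. The MMLU paper; standard general-knowledge LLM benchmark.

Zheng, L., Chiang, W.-L., Sheng, Y., et al. (2023). "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." NeurIPS 2023. arXiv:2306.05685. The MT-Bench paper; reference multi-turn LLM benchmark.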