Datasets & Benchmarks

Section 14.3

"The benchmark is the prompt is the curriculum is the evaluation. Pick a dataset and you have picked a worldview."

Census, Benchmark-Genealogist AI Agent
Big Picture

Part III's dataset layer differs from Part II's: instead of pretraining corpora at trillion-token scale, you care about instruction-tuning sets (Alpaca, ShareGPT, OpenAssistant, UltraChat), preference data (HH-RLHF, UltraFeedback, PRM800K), and the evaluation benchmarks that govern your prompt and your agent (MT-Bench, AlpacaEval, Arena-Hard, IFEval). This section catalogs them and tells you which to use when.

Prerequisites

This section assumes the instruction-tuning recipes from Section 8.1 and basic familiarity with Hugging Face Datasets from Section 12.4. The LLM-as-judge methodology covered later in the book deepens the evaluation aspects discussed here.

Part III's dataset layer differs from Part II's. You are not pretraining anything; you are calling APIs, building prompts, and measuring whether your calls work. The relevant datasets are prompt libraries, evaluation suites for instruction-following, and small public benchmarks you can run against any chat endpoint in minutes.

The benchmarks listed below are the ones whose numbers you will see quoted in model cards. Knowing what each measures (and what each does not measure) lets you read a "Claude beats GPT on X" claim with the right amount of skepticism.

Part III chat-benchmark and tool-use benchmark stack
Figure 14.3.1: The Part III benchmark map. Chat-quality benchmarks (LM Arena as the human-Elo ground truth, Arena-Hard-Auto as the cheap LLM-judge proxy, AllenAI's WildBench for real-user distributions, MT-Bench multi-turn, AlpacaEval 2.0, IFEval for instruction-following, SimpleBench for common-sense) tell you whether your prompt is good. Reasoning and knowledge benchmarks (MMLU-Pro, GPQA-Diamond, BBH, AIME 2024/25, MATH-500, HumanEval and MBPP for code, SWE-bench Verified for coding agents) tell you whether the model is capable. Function-calling and agent benchmarks (BFCL v3, AgentBench, Princeton's tau-bench v2) tell you whether tool use will work in production. Prompt registries (LangSmith Hub, Anthropic Cookbook, OpenAI Cookbook) supply starting points.

14.3.1 Instruction-following and chat benchmarks

14.3.1.1 BIG-Bench and BBH: How They Are Scored

The original BIG-Bench (Srivastava et al., 2022) bundles 204 tasks contributed by 450 authors across 132 institutions. Each task ships with a fixed prompt template, a small held-out evaluation split, and a per-task scoring function (exact match, multiple-choice accuracy, BLEU, or a task-specific metric). A model's BIG-Bench score is the macro-average of the normalised score across tasks:

$$\mathrm{BIG\text{-}Bench}(M) \;=\; \frac{1}{|\mathcal{T}|} \sum_{t \in \mathcal{T}} \frac{s_t(M) - s_t^{\text{random}}}{s_t^{\text{best-human}} - s_t^{\text{random}}},$$

so a model that matches the best human rater on a task scores 1.0 on that task, and a random-guess baseline scores 0.0. The macro-average penalises models that are stellar on a few tasks but middling on the long tail of unusual ones, which was the whole point: BIG-Bench was designed to be wide rather than tall.
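The normalisation and macro-average can be sketched in a few lines of Python. This is a minimal illustration, not the BIG-Bench reference implementation: the function names, the task names, and all the numbers below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from statistics import mean


def normalized_score(raw, random_baseline, best_human):
    """Rescale a raw task score so random guessing maps to 0.0 and the best human to 1.0."""
    span = best_human - random_baseline
    if span == 0:
        raise ValueError("best_human and random_baseline must differ")
    return (raw - random_baseline) / span


def macro_average(tasks):
    """Unweighted mean of per-task normalised scores (every task counts equally)."""
    return mean(
        normalized_score(t["raw"], t["random"], t["best_human"]) for t in tasks
    )


# Hypothetical per-task numbers, for illustration only.
tasks = [
    {"name": "task_a", "raw": 0.90, "random": 0.50, "best_human": 1.00},  # normalises to 0.80
    {"name": "task_b", "raw": 0.30, "random": 0.25, "best_human": 0.75},  # normalises to 0.10
    {"name": "task_c", "raw": 0.20, "random": 0.00, "best_human": 0.80},  # normalises to 0.25
]

macro = macro_average(tasks)
print(f"macro-average: {macro:.3f}")
```

Note how the strong result on task_a (0.80) is pulled down to roughly 0.38 overall by the two weak tasks: each task contributes equally, regardless of how many examples it contains.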

BBH (Suzgun et al., 2022) is the 23-task subset where the original BIG-Bench paper reported that no language model beat the average human evaluator. By 2024, Gemini 2.0 Pro reached 85% on BBH zero-shot and Claude 3.5 Sonnet reached 87%, so BBH is now reported as a saturating reasoning benchmark rather than a frontier one. It is still useful for spot-checking new architecture work on hard chain-of-thought reasoning.

# Run BBH zero-shot via lm-evaluation-harness
# Quote the requirement so the shell does not try to glob-expand it.
pip install "lm-eval==0.4.4"
lm-eval --model hf \
        --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct,dtype=bfloat16 \
        --tasks bbh_zeroshot \
        --batch_size 8 \
        --output_path results/llama-3.1-8b-bbh.json
# Average per-task accuracy lands around 0.62 for Llama 3.1 8B Instruct,
# versus 0.85+ for the current frontier models.

Code Fragment 14.3.1a: Reproducing a BBH score takes one shell command. The harness handles prompt templating, exact-match scoring, and per-task aggregation for all 23 BBH subtasks.

Worked Example: BBH per-task variance vs the headline number

A 0.62 average BBH score sounds uniform, but per-task scores usually range from about 0.20 to 0.95. Llama 3.1 8B Instruct's BBH report shows it scoring 0.93 on boolean_expressions (a closed-form evaluation task) but only 0.21 on tracking_shuffled_objects_seven_objects (multi-step state tracking). When picking a small open model for an agentic product, ignore the BBH average and inspect the per-task scores that match your workload. A bot that calls Python tools cares about boolean_expressions; a personal-assistant bot that remembers many entities cares about the tracking_shuffled_objects tasks. Two models with the same 0.62 average can fit your task very differently.
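The point about equal averages hiding different fits can be made concrete with a small sketch. The two boolean_expressions and tracking scores for `model_a` come from the example above; every other number, the `model_b` profile, and the helper names are hypothetical and exist only to show the selection logic.

```python
def summarize(per_task):
    """Return the average plus the weakest and strongest task for one model."""
    ranked = sorted(per_task.items(), key=lambda item: item[1])
    worst, best = ranked[0], ranked[-1]
    return {
        "average": sum(per_task.values()) / len(per_task),
        "worst": worst,
        "best": best,
        "spread": best[1] - worst[1],
    }


def best_fit(models, relevant_tasks):
    """Pick the model with the highest mean score on the tasks that match the workload."""
    def fit(name):
        scores = models[name]
        return sum(scores[t] for t in relevant_tasks) / len(relevant_tasks)
    return max(models, key=fit)


# Two hypothetical models with (nearly) identical averages but different shapes.
models = {
    "model_a": {
        "boolean_expressions": 0.93,
        "tracking_shuffled_objects_seven_objects": 0.21,
        "date_understanding": 0.72,  # hypothetical value
    },
    "model_b": {  # entirely hypothetical profile
        "boolean_expressions": 0.60,
        "tracking_shuffled_objects_seven_objects": 0.64,
        "date_understanding": 0.62,
    },
}

for name, scores in models.items():
    s = summarize(scores)
    print(f"{name}: average={s['average']:.2f} spread={s['spread']:.2f}")

tool_bot = best_fit(models, ["boolean_expressions"])
memory_bot = best_fit(models, ["tracking_shuffled_objects_seven_objects"])
print(tool_bot, memory_bot)
```

Both models average about 0.62, yet the tool-calling workload selects `model_a` and the entity-tracking workload selects `model_b`.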

14.3.2 Prompt registries and example libraries

14.3.3 Function-calling and tool-use benchmarks

14.3.4 Comparing the chat benchmarks

Table 14.3.1b: Chat-quality benchmarks to know.

| Benchmark | Format | Size | Judge | Best for |
|---|---|---|---|---|
| LM Arena | Blind pairwise | Continuous | Human | Ground truth ranking |
| Arena-Hard | Hard prompts | 500 | LLM-as-judge | Cheap LM Arena proxy |
| AlpacaEval 2.0 | Pairwise vs GPT-4 | 805 | LLM-as-judge | Quick instruct-tuning eval |
| MT-Bench | Multi-turn | 80 | LLM-as-judge | Conversation flow |
| SWE-bench Verified | Real GitHub issues | 500 | Test suite | Coding agents |

Warning: LLM-as-judge is biased toward its own family

If you score AlpacaEval with GPT-4 as judge, GPT-family models tend to score better. If you score with Claude as judge, Claude-family models tend to score better. The effect is small but real. The defenses are (a) report both judges, (b) use a deterministic test suite (SWE-bench, HumanEval) whenever feasible, and (c) trust LM Arena over any single LLM-judged benchmark for high-stakes decisions. A useful survey is "From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge" (Li et al., 2024, arXiv:2411.16594), which catalogues the failure modes.
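Defense (a), reporting both judges, can be sketched as a small tabulation. The sketch assumes you have already collected pairwise verdicts from two judges; the judge names, candidate names, and verdict counts below are hypothetical placeholders, and no real judge API is called.

```python
from collections import defaultdict


def win_rates(verdicts):
    """verdicts: iterable of (judge, candidate, won) tuples.

    Returns {judge: {candidate: win_rate}} so both judges can be reported side by side.
    """
    wins = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(lambda: defaultdict(int))
    for judge, candidate, won in verdicts:
        totals[judge][candidate] += 1
        wins[judge][candidate] += int(won)
    return {
        judge: {c: wins[judge][c] / totals[judge][c] for c in totals[judge]}
        for judge in totals
    }


def judge_gap(rates, candidate, own_judge, other_judge):
    """How much better a candidate scores under its own-family judge than under the other."""
    return rates[own_judge][candidate] - rates[other_judge][candidate]


# Hypothetical verdicts: model_x is judged by its own family (judge_x) and by judge_y.
verdicts = (
    [("judge_x", "model_x", True)] * 6
    + [("judge_x", "model_x", False)] * 4
    + [("judge_y", "model_x", True)] * 5
    + [("judge_y", "model_x", False)] * 5
)

rates = win_rates(verdicts)
gap = judge_gap(rates, "model_x", own_judge="judge_x", other_judge="judge_y")
print(f"judge_x: {rates['judge_x']['model_x']:.2f}  "
      f"judge_y: {rates['judge_y']['model_x']:.2f}  gap: {gap:+.2f}")
```

A consistently positive gap across candidates from the same family as the judge is the pattern to watch for; a gap on one small sample like this one is not evidence by itself.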

Warning: SWE-bench Verified scores depend on harness, not just model

SWE-bench Verified results swing 10-15 percentage points depending on whether the agent has internet access, how the test harness allocates retries, and how the patch-application logic handles malformed diffs. When a model card claims "X% on SWE-bench Verified", check the accompanying scaffolding description before comparing across labs. The agent loop, not the model, often dominates the score.

Tip: Build your own micro-benchmark before trusting anyone else's

The most useful eval is the 30-prompt set you wrote that targets your actual use case. Run it against three model families at the start of any Part III project; revisit it whenever a new model is released. A custom 30-prompt benchmark with regression history beats any single public benchmark for "should I switch providers".
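A micro-benchmark of this kind needs very little machinery. The sketch below assumes each case pairs a prompt with a pass/fail check, and that `call_model` is whatever function wraps your provider's chat endpoint; `fake_model`, the two sample cases, and the record layout are hypothetical stand-ins so the example runs without network access.

```python
import json
from datetime import date


def run_benchmark(cases, call_model):
    """Run every case through call_model and return the fraction of checks that pass."""
    passed = sum(1 for case in cases if case["check"](call_model(case["prompt"])))
    return passed / len(cases)


def history_line(model_name, pass_rate, run_date=None):
    """Serialise one run as a JSON line, ready to append to a regression-history file."""
    record = {
        "date": (run_date or date.today()).isoformat(),
        "model": model_name,
        "pass_rate": pass_rate,
    }
    return json.dumps(record)


def fake_model(prompt):
    """Stand-in for a real chat endpoint; replace with your own client call."""
    return "4" if "2 + 2" in prompt else "I am not sure."


# Two sample cases; a real set would hold around 30 prompts from your use case.
cases = [
    {"prompt": "What is 2 + 2? Answer with a number only.",
     "check": lambda out: out.strip() == "4"},
    {"prompt": "Reply with the single word OK.",
     "check": lambda out: out.strip() == "OK"},
]

rate = run_benchmark(cases, fake_model)
line = history_line("fake_model", rate, run_date=date(2025, 1, 1))
print(line)
```

Appending each `history_line` result to a JSONL file gives the regression history: rerun the same cases when a new model ships and compare the latest pass rate against earlier entries before switching providers.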

14.3.5 Public prompt corpora for fine-tuning

Key Takeaways

What's Next?

In the next section, Section 14.4: Models, we build on the material covered here.

Further Reading

Instruction Datasets

Wang, Y., Mishra, S., Alipoormolabashi, P., et al. (2022). "Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks." EMNLP 2022. arXiv:2204.07705. Reference large-scale instruction-tuning dataset.
Conover, M., Hayes, M., Mathur, A., et al. (2023). "Free Dolly: Introducing the World's First Truly Open Instruction-Tuned LLM." Databricks. databricks.com/blog/dolly. The Dolly-15k dataset; reference for fully-open instruction tuning data.

Evaluation Datasets

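Hendrycks, D., et al. (2021). "Measuring Massive Multitask Language Understanding." ICLR 2021. arXiv:2009.03300. The MMLU benchmark; standard general-knowledge LLM benchmark.
Zheng, L., Chiang, W.-L., Sheng, Y., et al. (2023). "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." NeurIPS 2023. arXiv:2306.05685. Reference multi-turn LLM benchmark.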