Section 56.3

Datasets and Benchmarks

"Whichever bias benchmark you skip is the one the customer will discover, usually on a Friday afternoon."

EvalEval, Suite-Assembling Fairness AI Agent
Big Picture

Responsible-AI evaluation datasets and benchmarks partition into six families: LLM bias benchmarks (BBQ, BOLD, StereoSet, CrowS-Pairs, WinoBias, Winogender, Bias in Bios) that probe stereotypical preferences and disparate-impact behaviors in language models; classical tabular fairness datasets (UCI Adult / Census, COMPAS, German Credit, IBM HR, Folktables) that ground the fairness literature; toxicity and hate-speech benchmarks (RealToxicityPrompts, ToxiGen, Civil Comments, HateCheck, Detoxify training data, OLID, OffensEval) for content-moderation evaluation; truthfulness and hallucination benchmarks (TruthfulQA, FActScore, HaluEval, FELM, FreshQA) for measuring factual fidelity; privacy-attack benchmarks (membership-inference datasets, training-data-extraction benchmarks, MIMIC re-identification studies) for measuring privacy leakage; and multi-dimensional aggregations (HELM, BIG-bench safety/bias slices, MASSIVE multilingual fairness, AIR-Bench) that combine individual benchmarks into a single composite picture. This section catalogs them with vendor / dataset URLs and pick-when guidance.

Prerequisites

This section assumes the LLM evaluation methodology from Section 42.1 and the bias-and-fairness vocabulary from Section 50.1.

Picking benchmarks well matters more than picking the right library, because the benchmark defines what the team will optimize. A team that runs only BBQ and StereoSet will think their model is bias-safe until a customer-found edge case shows otherwise; a team that runs a multi-dimensional suite (HELM, internal red-team set, and a domain-specific bias dataset) will have a better picture and will also spend more on evaluation. The 2024-26 best practice is to assemble a layered suite: a public leaderboard slice for external comparison, a curated internal set held out from training data, and continuous red-team additions from production incidents.

56.3.1 LLM bias benchmarks

Ask your favorite LLM "the doctor told the nurse that she had to go home; who needed to leave?" and watch where the pronoun lands. That single sentence is the seed of WinoBias, one of the canonical benchmarks below. LLM bias benchmarks turn questions like this into thousands of probes that measure whether language models prefer stereotypical completions, behave differently across protected attributes, or quietly default to social stereotypes when the prompt is ambiguous.
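A minimal sketch of how a WinoBias-style probe can be scored, assuming each item is labelled as pro- or anti-stereotypical and the headline number is the accuracy gap between the two subsets. The `CorefItem` type, the two example sentences, and the `stereotype_follower` stub are hypothetical illustrations, not items or code from the dataset.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class CorefItem:
    sentence: str     # sentence with a gendered pronoun
    answer: str       # the occupation the pronoun actually refers to
    stereotype: str   # "pro" if the correct answer matches the stereotype, else "anti"


def accuracy_gap(items: Iterable[CorefItem],
                 resolve: Callable[[CorefItem], str]) -> dict:
    """Accuracy on pro- and anti-stereotypical items, plus their difference."""
    correct = {"pro": 0, "anti": 0}
    total = {"pro": 0, "anti": 0}
    for item in items:
        total[item.stereotype] += 1
        if resolve(item) == item.answer:
            correct[item.stereotype] += 1
    acc = {k: (correct[k] / total[k] if total[k] else float("nan")) for k in total}
    acc["gap"] = acc["pro"] - acc["anti"]
    return acc


# Hypothetical items: same syntax, only the pronoun changes.
ITEMS = [
    CorefItem("The physician hired the secretary because he was overwhelmed with clients.",
              answer="physician", stereotype="pro"),
    CorefItem("The physician hired the secretary because she was overwhelmed with clients.",
              answer="physician", stereotype="anti"),
]


def stereotype_follower(item: CorefItem) -> str:
    """Stand-in for a model that resolves pronouns purely by stereotype."""
    return "physician" if " he " in item.sentence else "secretary"


scores = accuracy_gap(ITEMS, stereotype_follower)
```

A model that ignores gender cues would score the same on both subsets, so the gap, not the raw accuracy, is the quantity to track across releases.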

56.3.2 Classical tabular fairness datasets

Before LLM-era benchmarks, the fairness literature was anchored by a small set of tabular datasets where group-disparate outcomes were starkly visible. These remain the canonical reference and the easiest place to teach fairness methods.

Key Takeaways
The fairness impossibility theorem (Kleinberg-Mullainathan-Raghavan / Chouldechova 2016)

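The COMPAS debate has a formal resolution. Kleinberg, Mullainathan, and Raghavan (ITCS 2017) and Chouldechova (Big Data 2017) independently proved that whenever a risk score $\hat{R} \in [0,1]$ is used to predict a binary outcome $Y \in \{0,1\}$ across two groups $A=0$ and $A=1$ with different base rates $P(Y=1 \mid A=0) \neq P(Y=1 \mid A=1)$, three intuitively desirable conditions cannot all hold simultaneously (except in the degenerate cases of perfect prediction or equal base rates). In the Kleinberg-Mullainathan-Raghavan formulation, the three conditions are:

Calibration within groups: $P(Y=1 \mid \hat{R}=r, A=a) = r$ for every score value $r$ and each group $a$.
Balance for the positive class: $\mathbb{E}[\hat{R} \mid Y=1, A=0] = \mathbb{E}[\hat{R} \mid Y=1, A=1]$.
Balance for the negative class: $\mathbb{E}[\hat{R} \mid Y=0, A=0] = \mathbb{E}[\hat{R} \mid Y=0, A=1]$.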

Sketch of why: calibration forces $P(Y=1 \mid \hat{R}=r, A=a) = r$ in every group, so the distribution of $\hat{R}$ conditional on $Y$ depends on the group's base rate $P(Y=1 \mid A=a)$ via Bayes' rule. When base rates differ, the conditional distributions of $\hat{R} \mid (Y, A)$ cannot have equal means across groups for both $Y=0$ and $Y=1$ at once; the algebra forces a trade-off. The implication is not that fairness is impossible, but that "fairness" requires a choice among incompatible criteria, and that choice is normative, not technical. In the COMPAS debate, the vendor's defense rested on calibration (a given score meant roughly the same thing in each group), while ProPublica's audit rested on error-rate balance (false-positive rates differed across groups, the classification analogue of balance for the negative class); both positions were mathematically correct.
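Chouldechova's form of the result can be checked with plain arithmetic. For a binary classifier obtained by thresholding the score, the false-positive rate is tied to the base rate $p$, the positive predictive value, and the false-negative rate by $\mathrm{FPR} = \frac{p}{1-p} \cdot \frac{1-\mathrm{PPV}}{\mathrm{PPV}} \cdot (1-\mathrm{FNR})$. The sketch below plugs in two illustrative base rates (made-up numbers, not COMPAS figures) while holding PPV and FNR equal across groups; the function name is hypothetical.

```python
def implied_fpr(base_rate: float, ppv: float, fnr: float) -> float:
    """False-positive rate forced by the identity
    FPR = p / (1 - p) * (1 - PPV) / PPV * (1 - FNR)."""
    if not (0.0 < base_rate < 1.0):
        raise ValueError("base_rate must be strictly between 0 and 1")
    if not (0.0 < ppv <= 1.0):
        raise ValueError("ppv must be in (0, 1]")
    if not (0.0 <= fnr <= 1.0):
        raise ValueError("fnr must be in [0, 1]")
    return base_rate / (1.0 - base_rate) * (1.0 - ppv) / ppv * (1.0 - fnr)


# Same PPV and same FNR in both groups; only the base rate differs.
fpr_group_0 = implied_fpr(base_rate=0.5, ppv=0.7, fnr=0.3)
fpr_group_1 = implied_fpr(base_rate=0.3, ppv=0.7, fnr=0.3)

# The two false-positive rates cannot be equal unless the base rates are.
fpr_gap = fpr_group_0 - fpr_group_1
```

With these inputs the higher-base-rate group ends up with a false-positive rate of 0.30 against roughly 0.13 for the other group, even though predictive value and miss rate are identical. Equalizing the false-positive rates would require giving up one of the other two equalities.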

56.3.3 Toxicity and hate-speech benchmarks

Toxicity benchmarks measure whether language models generate toxic content when prompted with adversarial or innocuous inputs, and whether content classifiers correctly identify toxic content.

56.3.4 Truthfulness and hallucination benchmarks

Truthfulness benchmarks measure how often a model produces factually correct answers; hallucination benchmarks specifically measure whether the model fabricates plausible-but-false content.

56.3.5 Privacy-attack benchmarks

Privacy benchmarks measure whether trained models leak training data (training-data extraction), whether membership in the training set can be inferred (membership inference), or whether re-identification of supposedly-anonymous records is possible.
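As an illustration of the membership-inference case, here is a minimal sketch of the simplest attack family: flag an example as a training-set member when the model's loss on it is low. It assumes you already have per-example losses for known members and known non-members; the function names and the toy loss values are hypothetical, and dedicated tooling implements far stronger attacks.

```python
from typing import Sequence


def attack_auc(member_losses: Sequence[float],
               nonmember_losses: Sequence[float]) -> float:
    """Probability that a random member has lower loss than a random
    non-member (ties count half). 0.5 means the attack is at chance."""
    wins = 0.0
    for m in member_losses:
        for n in nonmember_losses:
            if m < n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_losses) * len(nonmember_losses))


def tpr_at_fpr(member_losses: Sequence[float],
               nonmember_losses: Sequence[float],
               max_fpr: float) -> float:
    """Best true-positive rate of the rule 'loss <= threshold => member'
    among thresholds whose false-positive rate stays within max_fpr."""
    best = 0.0
    for threshold in sorted(set(member_losses) | set(nonmember_losses)):
        fpr = sum(n <= threshold for n in nonmember_losses) / len(nonmember_losses)
        if fpr <= max_fpr:
            tpr = sum(m <= threshold for m in member_losses) / len(member_losses)
            best = max(best, tpr)
    return best


# Toy losses: members tend to have lower loss than held-out examples.
members = [0.1, 0.2, 0.4, 0.9]
nonmembers = [0.3, 0.8, 1.0, 1.2]

auc = attack_auc(members, nonmembers)
tpr_zero_fpr = tpr_at_fpr(members, nonmembers, max_fpr=0.0)
```

Reporting the true-positive rate at a very low false-positive rate alongside the AUC is a common choice, because an attack that confidently identifies even a few members is a privacy problem that an average-case number can hide.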

56.3.6 Multi-dimensional aggregated benchmarks

Aggregated benchmarks combine multiple datasets into a single evaluation suite to give a composite view of responsible-AI performance.

Library Shortcut
lm-evaluation-harness as the canonical multi-benchmark runner

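Of the aggregated suites above, lm-evaluation-harness is the one you actually invoke from a CI pipeline. It ships with hundreds of tasks (including BBQ, TruthfulQA, CrowS-Pairs, ToxiGen, WMDP, and the standard capability set), parallelizes across GPUs, and emits a JSON report you can diff between releases. Task names and coverage change between harness releases, so confirm that the tasks you gate on exist in your installed version. HELM itself runs through its own toolkit; where a HELM bias or toxicity scenario has a harness task built on the same underlying dataset, you can approximate that slice without leaving the harness, but the prompts and metrics are not guaranteed to match HELM's.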

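pip install "lm-eval[vllm]"
# Run a responsible-AI panel through the harness's in-process vLLM backend
# (task names vary by harness version; in 0.4.x releases, list them with: lm_eval --tasks list)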
lm_eval \
    --model vllm \
    --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct \
    --tasks bbq,truthfulqa_mc2,crows_pairs_english,toxigen,wmdp \
    --batch_size auto \
    --output_path ./eval_results/llama31_8b/
Code Fragment 56.3.1: Add --num_fewshot 0 for zero-shot reporting; the harness logs per-group scores so you can spot fairness disparities (e.g., BBQ accuracy by demographic axis) without writing a separate aggregator.
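To make "diff between releases" concrete, here is a minimal sketch of a regression check over two harness reports. It assumes the report JSON has a top-level `results` object mapping each task name to a dictionary of metric values (keys such as `acc,none`), which is the layout recent 0.4.x releases write; verify this against your own output files. The function names, the tolerance, and the higher-is-better assumption are illustrative, and the last of these does not hold for every metric (a stereotype-preference percentage, for example, is best near 50).

```python
import json
from pathlib import Path


def flatten(report: dict) -> dict:
    """Turn {"results": {task: {metric: value}}} into {(task, metric): float},
    skipping non-numeric fields and standard-error entries."""
    flat = {}
    for task, metrics in report.get("results", {}).items():
        for name, value in metrics.items():
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            if is_number and "stderr" not in name:
                flat[(task, name)] = float(value)
    return flat


def load_metrics(path: str) -> dict:
    """Read one report file from disk and flatten it."""
    return flatten(json.loads(Path(path).read_text()))


def regressions(baseline: dict, candidate: dict, tolerance: float = 0.01) -> dict:
    """Metrics present in both runs that dropped by more than `tolerance`.
    Assumes higher is better for every metric passed in."""
    return {
        key: (baseline[key], candidate[key])
        for key in baseline.keys() & candidate.keys()
        if baseline[key] - candidate[key] > tolerance
    }


# Inline toy reports standing in for two release runs.
old = flatten({"results": {"truthfulqa_mc2": {"acc,none": 0.50, "acc_stderr,none": 0.01}}})
new = flatten({"results": {"truthfulqa_mc2": {"acc,none": 0.45, "acc_stderr,none": 0.01}}})
flagged = regressions(old, new)
```

In a CI job, a non-empty `flagged` dictionary would fail the build. Filtering the metric keys to an explicit allow-list per task is a sensible next step, since it keeps direction-ambiguous metrics out of the gate.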
Responsible-AI benchmark families positioned by stage and risk category
Figure 56.3.1a: The six benchmark families from this section positioned against the six evaluation stages from Section 56.3.9. Empty cells are not coverage gaps; they are deliberate scope. Tabular-fairness benchmarks (Folktables, COMPAS) belong at post-finetune for predictive-ML systems but not earlier; privacy-attack benchmarks (LM Extraction, ML Privacy Meter) belong at pre-launch red-team rather than at base-model audit. The EleutherAI lm-evaluation-harness orchestrates the four rows it can automate (bias, toxicity, truthfulness, dangerous capability) inside a single CI invocation; the remaining two rows (tabular fairness, privacy attacks) need dedicated harnesses (Folktables / Aequitas, ML Privacy Meter).

56.3.7 A canonical 2026 evaluation stack

Real-World Scenario
A boring-but-correct 2026 responsible-AI evaluation suite

Who: A platform-evaluation team at an LLM-deploying organization in 2026, owning the responsible-AI test harness for every release.

Situation: The team had to gate each model release against bias, toxicity, truthfulness, and dangerous-capability regressions while also producing evidence for compliance reviewers.

Problem: Public benchmarks alone were contaminated and incomplete, internal benchmarks alone were incomparable across vendors, and trying to use every benchmark every release blew out CI budgets.

Dilemma: Either run a maximalist suite (slow CI, expensive, brittle) or trim to a single aggregate (cheap but blind to specific harms).

Decision: They standardized a "boring-but-correct" core suite of public benchmarks plus an internal held-out set, and ran the combination per release as a CI gate.

How: The core suite was roughly as follows. Bias: BBQ + BOLD (QA + generation), plus StereoSet or CrowS-Pairs for stereotype preference. Toxicity: RealToxicityPrompts + Civil Comments. Truthfulness: TruthfulQA + FActScore. Jailbreak resistance: HarmBench + a PyRIT-generated red-team set. Dangerous capability: WMDP as a sanity check. On top of these sat an internal held-out set built from production incidents. The HELM dashboard provided the cross-vendor comparison, and the internal set provided drift detection. For predictive-ML systems specifically, they added Folktables or Adult for tabular fairness reproduction, plus a domain dataset.

Result: Each release shipped with a per-benchmark report, regressions blocked merges automatically, and the internal-set component caught issues the public benchmarks missed entirely.

Lesson: Treat the responsible-AI evaluation suite the same way you treat unit tests: a small, fast, opinionated core that runs every release, plus an accumulating internal benchmark from real incidents, beats both maximalist suites and single-aggregate scores.

Warning: Benchmark contamination is the silent killer

Public benchmarks are themselves on the web, and modern LLMs are trained on the web. The result: top-of-leaderboard models routinely score abnormally high on benchmark items that appear verbatim or near-verbatim in their training data. The wave of contamination studies that began in 2022-23 and continued through 2024-25 (Sainz et al. on MMLU, Magar & Schwartz on multiple benchmarks) showed contamination is widespread. The mitigations: (a) hold out an internal benchmark from any training data, (b) periodically rotate benchmark items, (c) check verbatim memorization on benchmark items (if the model can recite the question, the score is suspect), and (d) report on time-rotated benchmarks (FreshQA, MMLU-Pro variants) alongside the canonical ones.
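Mitigation (c) can be sketched as a simple recitation probe: feed the model the first part of a benchmark item and measure how much of the remainder it reproduces verbatim. This is a minimal illustration that assumes whitespace tokenization and a hypothetical `complete` callable wrapping your model; the example question is made up rather than taken from any benchmark.

```python
from typing import Callable


def recitation_score(item_text: str,
                     complete: Callable[[str], str],
                     prefix_fraction: float = 0.5) -> float:
    """Fraction of the item's held-back suffix that the model reproduces
    token-for-token from the start of its continuation."""
    tokens = item_text.split()
    cut = max(1, int(len(tokens) * prefix_fraction))
    prefix, suffix = tokens[:cut], tokens[cut:]
    if not suffix:
        return 0.0
    continuation = complete(" ".join(prefix)).split()
    matched = 0
    for expected, produced in zip(suffix, continuation):
        if expected != produced:
            break  # count only the unbroken verbatim run
        matched += 1
    return matched / len(suffix)


ITEM = "Which planet in the solar system has the most moons today"


def memorizer(prefix: str) -> str:
    """Stand-in for a model that has seen the item and recites it."""
    return "system has the most moons today"


def paraphraser(prefix: str) -> str:
    """Stand-in for a model that continues in its own words."""
    return "is Saturn"


score_memorized = recitation_score(ITEM, memorizer)
score_clean = recitation_score(ITEM, paraphraser)
```

A high score on many items is evidence of memorization, not proof, and a low score does not rule out paraphrased contamination. Treat the probe as a cheap first filter before excluding or down-weighting suspect items.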

Key Insight
No single benchmark suffices, and no aggregate is unambiguous

Every individual benchmark above measures one specific construct, and any aggregate (HELM, AIR-Bench) imposes weights that are themselves a choice. The right interpretation is: a model that performs well on a multi-dimensional suite is probably better than one that performs poorly, but small differences across leaderboards are not reliable signals. Production teams should pick three to five benchmarks aligned to their specific deployment risks (bias for hiring tools, toxicity for consumer chatbots, hallucination for medical / legal applications), report them separately, and treat the aggregate as a sanity check rather than a ranking.
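One way to decide whether a gap between two models is "small" in this sense is a paired bootstrap over benchmark items. The sketch below assumes you have per-item correctness (1 or 0) for both models on the same items in the same order; the function name, resample count, and toy data are illustrative choices.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_ci(correct_a: Sequence[int],
                        correct_b: Sequence[int],
                        n_resamples: int = 2000,
                        alpha: float = 0.05,
                        seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for accuracy(A) - accuracy(B),
    resampling items with replacement and keeping the A/B pairing."""
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("need two non-empty, equal-length score lists")
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Toy data: model A is right on 52 of 100 items, model B on 50 of the same 100.
model_a = [1] * 52 + [0] * 48
model_b = [1] * 50 + [0] * 50
low, high = paired_bootstrap_ci(model_a, model_b)
```

If the interval includes zero, the benchmark cannot distinguish the two models at this sample size, whatever the leaderboard ordering says.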

56.3.8 Licensing and access considerations

The benchmarks above vary in licensing and access constraints:

For audit deliverables, the documentation should always cite the benchmark version, the date the evaluation was run, and the exact model checkpoint, because benchmarks (Folktables ACS year, FreshQA refresh date) and models (GPT-4o snapshot 2024-08-06 vs 2024-11-20) both shift.
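The three required fields can be captured in a small record type that is written out next to every result file. This is a sketch only: the `EvalRunRecord` name and its field names are hypothetical, and the benchmark version string in the example is a placeholder to be replaced with whatever identifier the benchmark's maintainers publish.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date


@dataclass(frozen=True)
class EvalRunRecord:
    benchmark: str           # benchmark name as cited in the audit report
    benchmark_version: str   # release, refresh date, or survey year of the data
    model_checkpoint: str    # exact snapshot identifier, not a floating alias
    run_date: date           # when the evaluation was executed

    def to_json(self) -> str:
        """Serialize with a stable key order so records diff cleanly."""
        payload = asdict(self)
        payload["run_date"] = self.run_date.isoformat()
        return json.dumps(payload, sort_keys=True)


record = EvalRunRecord(
    benchmark="FreshQA",
    benchmark_version="PLACEHOLDER-refresh-date",  # fill in from the dataset release
    model_checkpoint="gpt-4o-2024-08-06",
    run_date=date(2025, 1, 15),
)
serialized = record.to_json()
```

Making the record immutable and serializing it deterministically means two audit reports can be compared field by field without ambiguity about which data and which model produced each number.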

56.3.9 Benchmarks by deployment stage

Different responsible-AI benchmarks belong at different stages of the deployment pipeline. The 2026 best-practice map:

Real-World Scenario
A legal-tech bot's evaluation suite evolution

A legal-research startup deploying an LLM-based legal-question-answering bot built its evaluation suite over 18 months as follows: (a) the initial release used HELM bias slices and TruthfulQA for general competence checks; (b) after a customer-reported case in which the bot fabricated case citations, FActScore and a domain-specific 200-citation verification set were added as ship-blockers; (c) after the team discovered that the bot's responses changed depending on whether the case party had a stereotypically male or female name, BBQ and a custom name-variation benchmark were added; (d) after an EU AI Act readiness review, an AIR-Bench slice and a custom "jurisdiction-correctness" set were added; (e) by month 18, the production CI ran 11 distinct evaluation suites, with a regression on any of them blocking release. The pattern (start with public benchmarks, accumulate internal ones from incidents and audits) is common; the alternative pattern (one big public-benchmark-only suite forever) does not survive contact with production.

56.3.10 Datasheets, data statements, and dataset hygiene

Beyond benchmarks themselves, the responsible-AI community has converged on standards for documenting datasets so users can evaluate whether they fit a given use case.

What's Next?

In the next section, Section 56.4: Models, we build on the material covered here.

Further Reading
Parrish, A., et al. (2022). "BBQ: A Hand-Built Bias Benchmark for Question Answering." Findings of ACL 2022. arxiv.org/abs/2110.08193. The canonical QA-bias benchmark and its ambiguous-vs-disambiguated context construction.
Lin, S., Hilton, J., & Evans, O. (2022). "TruthfulQA: Measuring How Models Mimic Human Falsehoods." ACL 2022. arxiv.org/abs/2109.07958. The imitative-falsehoods benchmark; introduces the construct of "model mimics common human errors" as a distinct evaluation axis.
Liang, P., et al. (2022). "Holistic Evaluation of Language Models." arXiv:2211.09110. crfm.stanford.edu/helm. The HELM framework and its 7-metric multi-axis evaluation paradigm.
Gebru, T., et al. (2021). "Datasheets for Datasets." Communications of the ACM, 64(12). arxiv.org/abs/1803.09010. The dataset-documentation standard now expected by major venues and most governance platforms.
Ding, F., et al. (2021). "Retiring Adult: New Datasets for Fair Machine Learning." NeurIPS 2021. arxiv.org/abs/2108.04884. The Folktables redesign that addresses UCI Adult's limitations and provides yearly-refreshable fairness benchmarks.
Carlini, N., et al. (2023). "Quantifying Memorization Across Neural Language Models." ICLR 2023. arxiv.org/abs/2202.07646. The systematic study of training-data extraction underlying the LM Extraction Benchmark.