Section 51.3: Datasets & Benchmarks

Safety datasets cover three areas: harmful-output benchmarks, jailbreak / adversarial-input corpora, and bias / fairness benchmarks.

Safety datasets organized by what they measure — **Figure 51.3.1**: The Part X safety-dataset menagerie sorted by what each one measures. HarmBench, AdvBench, JailbreakBench, and HH-RLHF target refusal of harmful behaviors. Anthropic Global Opinions, BBQ, Winogender, and discrim-eval target demographic and worldview bias. TruthfulQA, HaluEval, and MT-Bench target truthfulness and hallucination.

51.3.1 Harmful-output benchmarks

HarmBench (Mazeika et al., 2024) is the harmful-behavior benchmark of 400 prompts across 7 categories (cybercrime, hate speech, fraud, weapons, biothreats, illegal activity, harmful misinformation). Its objective is to standardize how labs measure harmful-output refusal so safety claims are comparable, which matters because pre-HarmBench every lab used different attack sets. The core concept is per-category attack prompts plus a classifier that scores whether the response actually completed the harmful task. Pick HarmBench as the cross-lab refusal benchmark.
AdvBench (Zou et al., 2023) is the 520-prompt adversarial benchmark from the original GCG (Greedy Coordinate Gradient) attack paper. Its objective is to provide a standardized set of harmful behaviors against which automated attacks can be measured, which matters because GCG and follow-on optimization attacks need fixed targets. Pick AdvBench when reproducing automated-attack research or when establishing a jailbreak-robustness baseline.
JailbreakBench (Chao et al., 2024) is the standardized jailbreak evaluation framework with a fixed set of harmful behaviors plus a public leaderboard. Its objective is to make jailbreak research comparable by fixing the harm targets, the attack budget, and the success-detection classifier, which matters because every jailbreak paper previously used a slightly different setup. Pick JailbreakBench as the academic comparison standard for jailbreak research.
Anthropic HH-RLHF (Anthropic, 2022) is the helpfulness-and-harmlessness preference dataset that powered the first RLHF safety pipelines. Its objective was to provide annotated trade-off pairs where annotators chose the safer (or more helpful) response, which mattered as the canonical reference dataset. Pick HH-RLHF for safety-tuning research or reward-model training; for newer safety preference data, more recent harmlessness datasets have updated examples.

51.3.2 Bias and fairness

Anthropic Global Opinions (Anthropic, 2023) is the survey-based cross-cultural bias benchmark measuring how closely model opinions align with different country populations on Pew survey questions. Its objective is to make worldview alignment measurable across cultures, which matters because models tend to default to a US-coastal-elite worldview that they were not trained to. Pick it to measure whose preferences your model expresses.
BBQ (NYU, 2022) is the Bias Benchmark for QA, with 58K questions probing how models complete stereotype-linked patterns under ambiguous and disambiguated contexts. Its objective is to measure stereotype-driven completion errors across 11 demographic categories (age, disability, gender, race, religion, etc.), which matters because biased completions cause real harm in production. The core concept is paired ambiguous and disambiguated context to distinguish stereotype-only completions from context-aware ones. Pick BBQ for measuring stereotype bias in QA tasks.
Winogender (Rudinger et al., 2018) is the gender-bias coreference schema where the model must resolve a pronoun referring to a profession, testing whether profession-gender stereotypes affect resolution. Its objective is to expose stereotype-driven coreference errors, which matters as one of the longest-running gender-bias benchmarks. Pick Winogender for historical comparison or quick stereotype probes; modern bias eval typically uses BBQ.
Anthropic discrim-eval (Anthropic, 2023) is the discrimination measurement dataset probing whether models recommend different decisions for matched candidates differing only in demographic attribute (loan approval, hiring, etc.). Its objective is to test for downstream-decision discrimination, which matters when models drive consequential decisions. Pick discrim-eval for fairness audits of decision-support applications.

51.3.3 Truthfulness and hallucination

TruthfulQA (Lin et al., 2021) is the truthfulness benchmark of 817 questions designed to elicit common misconceptions ("Why do people in Britain drive on the left?"). Its objective is to measure how often models repeat plausible-but-false beliefs, which matters because models trained on web text inherit human myths. Pick TruthfulQA for hallucination-resistance evaluation.
HaluEval (Patronus AI, 2023) is a benchmark of 35K samples for hallucination detection in QA, dialogue, and summarization. Its objective is to evaluate whether models can identify hallucinated content in given outputs, which matters for hallucination-classifier training. Pick HaluEval for training or benchmarking hallucination detectors.
MT-Bench (LMSYS, 2023) is the multi-turn quality benchmark with 80 questions across 8 categories, scored by GPT-4-as-judge. Its objective is to evaluate multi-turn conversation quality where standard benchmarks miss the depth, which matters for chat-product evaluation. Pick MT-Bench as a quick multi-turn quality probe; for production eval, RAGAS-style domain-specific eval beats MT-Bench.

51.3.4 Comparing the datasets

Table 51.3.1a: 39.3.1 Safety datasets (2026).

Dataset	Focus	Size	Use
HarmBench	Harmful behaviors	400	Refusal eval
AdvBench	Adversarial prompts	520	Robustness eval
JailbreakBench	Jailbreak attacks	100+	Standardized comparison
BBQ	Bias in QA	~58K	Bias measurement
TruthfulQA	Truthfulness	817	Hallucination eval

Warning: Adversarial datasets age fast

Jailbreak benchmarks are arms races, and the public sets are the slow side. A model that passes the 2023 jailbreak set routinely fails the attacks discovered in the last month. Always pair public benchmarks with live red-team campaigns using Garak or PyRIT, and treat "passes the benchmark" as the floor, not the ceiling.

What's Next?

In the next section, Section 51.4: Models, we build on the material covered here.

Further Reading

Security Benchmarks

Bhardwaj, R., & Poria, S. (2023). "Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment." arXiv:2308.09662. Reference red-team benchmark for LLMs.

Zou, A., Wang, Z., Carlini, N., et al. (2023). "Universal and Transferable Adversarial Attacks on Aligned Language Models." arXiv:2307.15043. Reference for transferable jailbreaks; defines benchmark adversarial inputs.

Mazeika, M., Phan, L., Yin, X., et al. (2024). "HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal." ICML 2024. arXiv:2402.04249. The current standard automated red-team benchmark with 18 harm categories; the reference for measuring attack success rate against safety-tuned LLMs.

Chao, P., Debenedetti, E., Robey, A., et al. (2024). "JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models." NeurIPS 2024. arXiv:2404.01318. Open leaderboard of jailbreak attacks and defenses with a reproducible evaluation pipeline; the canonical reference for tracking defense robustness over time.

OWASP (2025). "OWASP Top 10 for Large Language Model Applications." genai.owasp.org/llm-top-10. Industry-standard taxonomy of LLM application risks; the practical complement to academic adversarial-input benchmarks.