Datasets & Benchmarks

Section 51.3

Safety datasets in this section fall into three groups: harmful-output benchmarks (including jailbreak and adversarial-input corpora), bias and fairness benchmarks, and truthfulness and hallucination benchmarks.

Safety datasets organized by what they measure
Figure 51.3.1: The Part X safety-dataset menagerie sorted by what each one measures. HarmBench, AdvBench, JailbreakBench, and HH-RLHF target refusal of harmful behaviors. Anthropic Global Opinions, BBQ, Winogender, and discrim-eval target demographic and worldview bias. TruthfulQA, HaluEval, and MT-Bench target truthfulness and hallucination.

51.3.1 Harmful-output benchmarks

51.3.2 Bias and fairness

51.3.3 Truthfulness and hallucination

51.3.4 Comparing the datasets

Table 51.3.1a: Safety datasets (2026).
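| Dataset        | Focus               | Size (approx. items) | Use                     |
|----------------|---------------------|----------------------|-------------------------|
| HarmBench      | Harmful behaviors   | 400                  | Refusal eval            |
| AdvBench       | Adversarial prompts | 520                  | Robustness eval         |
| JailbreakBench | Jailbreak attacks   | 100+                 | Standardized comparison |
| BBQ            | Bias in QA          | ~58K                 | Bias measurement        |
| TruthfulQA     | Truthfulness        | 817                  | Hallucination eval      |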
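The size column matters in practice: the datasets differ by two orders of magnitude, so a full pass is dominated by the largest one. The sketch below is one possible way to plan an evaluation run from the table, not a prescribed workflow. The `SafetyDataset` class, the `plan_calls` helper, and the cap of 1,000 items are illustrative assumptions, and the sizes are rounded from the table ("100+" is taken as 100, "~58K" as 58,000).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SafetyDataset:
    name: str
    focus: str
    approx_size: int  # rounded from Table 51.3.1a; not an exact item count
    use: str


# Rows copied from Table 51.3.1a. Sizes are approximations (assumption).
DATASETS = [
    SafetyDataset("HarmBench", "Harmful behaviors", 400, "Refusal eval"),
    SafetyDataset("AdvBench", "Adversarial prompts", 520, "Robustness eval"),
    SafetyDataset("JailbreakBench", "Jailbreak attacks", 100, "Standardized comparison"),
    SafetyDataset("BBQ", "Bias in QA", 58_000, "Bias measurement"),
    SafetyDataset("TruthfulQA", "Truthfulness", 817, "Hallucination eval"),
]


def plan_calls(datasets, cap_per_dataset):
    """Return {name: items to run}, subsampling any dataset larger than the cap."""
    return {d.name: min(d.approx_size, cap_per_dataset) for d in datasets}


# Hypothetical budget: at most 1,000 items per dataset.
plan = plan_calls(DATASETS, cap_per_dataset=1_000)
total_full = sum(d.approx_size for d in DATASETS)
total_capped = sum(plan.values())

print(plan)
print(f"full run: {total_full} items, capped run: {total_capped} items")
```

With this cap only BBQ is subsampled; the smaller harmful-output sets run in full, which keeps their scores comparable with published results.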
Warning: Adversarial datasets age fast

Jailbreak benchmarking is an arms race, and the public sets are the slow side. A model that passes a 2023 jailbreak set can still fail attacks discovered in the last month. Always pair public benchmarks with live red-team campaigns using tools such as Garak or PyRIT, and treat "passes the benchmark" as the floor, not the ceiling.
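One way to make "floor, not ceiling" concrete is to gate on the worse of the two attack success rates (ASR) rather than on the public benchmark alone. The following is a minimal sketch under stated assumptions: the function names, the 5% threshold, and the sample outcomes are hypothetical, and each outcome is assumed to be a boolean already produced by whatever judge the team uses (True means the attack succeeded). It does not call Garak or PyRIT.

```python
def attack_success_rate(outcomes):
    """Fraction of attack attempts judged successful (True = attack succeeded)."""
    if not outcomes:
        raise ValueError("no outcomes to score")
    return sum(outcomes) / len(outcomes)


def release_gate(public_outcomes, redteam_outcomes, max_asr=0.05):
    """Pass only if BOTH the public benchmark and the live red-team run stay under max_asr."""
    public_asr = attack_success_rate(public_outcomes)
    redteam_asr = attack_success_rate(redteam_outcomes)
    worst_asr = max(public_asr, redteam_asr)
    return {
        "public_asr": public_asr,
        "redteam_asr": redteam_asr,
        "worst_asr": worst_asr,
        "passed": worst_asr <= max_asr,
    }


# Hypothetical judged results, for illustration only.
public = [False] * 98 + [True] * 2    # 2 of 100 public-benchmark attacks succeed
redteam = [False] * 17 + [True] * 3   # 3 of 20 fresh red-team attacks succeed

report = release_gate(public, redteam)
print(report)
```

In this made-up example the model clears the public benchmark but fails the gate, because the fresh red-team attacks succeed more often than the threshold allows.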

What's Next?

In the next section, Section 51.4: Models, we build on the material covered here.

Further Reading

Security Benchmarks

Bhardwaj, R., & Poria, S. (2023). "Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment." arXiv:2308.09662. Reference red-team benchmark for LLMs.
Zou, A., Wang, Z., Carlini, N., et al. (2023). "Universal and Transferable Adversarial Attacks on Aligned Language Models." arXiv:2307.15043. Reference for transferable jailbreaks; introduces the AdvBench set of adversarial inputs.
Mazeika, M., Phan, L., Yin, X., et al. (2024). "HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal." ICML 2024. arXiv:2402.04249. A standardized automated red-teaming benchmark; the paper compares 18 red-teaming methods against 33 target LLMs and defenses, and it is a common reference for measuring attack success rate against safety-tuned LLMs.
Chao, P., Debenedetti, E., Robey, A., et al. (2024). "JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models." NeurIPS 2024. arXiv:2404.01318. Open leaderboard of jailbreak attacks and defenses with a reproducible evaluation pipeline; the canonical reference for tracking defense robustness over time.
OWASP (2025). "OWASP Top 10 for Large Language Model Applications." genai.owasp.org/llm-top-10. Industry-standard taxonomy of LLM application risks; the practical complement to academic adversarial-input benchmarks.