Section 74.3: Datasets & Benchmarks

Each vertical has domain-specific benchmarks. The ones below appear in published industry comparisons.

74.3.1 Domain benchmarks

Legal: LegalBench (Stanford), LegalBench-RAG (2024) for retrieval-augmented legal QA, plus the widely cited 2023 result of GPT-4 passing the bar exam, which is a one-off exam result rather than a reusable benchmark but appears in many legal-AI deployment papers.
Finance: FinanceBench, FLUE, FinQA, ConvFinQA, and BizBench (2024). Together these form the benchmark set used in 2024-25 finance comparisons, covering numerical reasoning over filings, multi-turn financial dialog, and business-domain tasks.
Medical: MedQA (USMLE), MedMCQA, PubMedQA, MedHELM (Stanford HELM Med, 2024), the NEJM AI Challenge datasets, and HealthBench (OpenAI, 2025).
Code: SWE-bench, HumanEval.
Customer service: tau-bench.
Cybersecurity: CTIBench, CyberSecEval (Meta, 2024).
Educational: K-12 reading-level eval datasets (Lexile-based) plus the NAEP-aligned tutoring benchmarks emerging in 2024-25.
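As a rough illustration, the list above can be held in a small lookup table so that an evaluation script can select benchmarks by vertical. This is a minimal sketch: the names VERTICAL_BENCHMARKS and benchmarks_for are hypothetical, the dictionary layout is an assumption, and only benchmarks named individually in the list are included (the NEJM AI Challenge datasets and the educational entries are left out because the text does not name a single dataset for them).

```python
# Hypothetical registry: benchmark names are copied from the list above;
# the structure and identifiers are illustrative assumptions.
VERTICAL_BENCHMARKS: dict[str, list[str]] = {
    "legal": ["LegalBench", "LegalBench-RAG"],
    "finance": ["FinanceBench", "FLUE", "FinQA", "ConvFinQA", "BizBench"],
    "medical": ["MedQA", "MedMCQA", "PubMedQA", "MedHELM", "HealthBench"],
    "code": ["SWE-bench", "HumanEval"],
    "customer service": ["tau-bench"],
    "cybersecurity": ["CTIBench", "CyberSecEval"],
}


def benchmarks_for(vertical: str) -> list[str]:
    """Return the benchmark names registered for a vertical (case-insensitive)."""
    key = vertical.strip().lower()
    if key not in VERTICAL_BENCHMARKS:
        known = ", ".join(sorted(VERTICAL_BENCHMARKS))
        raise KeyError(f"unknown vertical {vertical!r}; known verticals: {known}")
    # Return a copy so callers cannot mutate the registry by accident.
    return list(VERTICAL_BENCHMARKS[key])


if __name__ == "__main__":
    for name in benchmarks_for("Finance"):
        print(name)
```

Keeping the mapping in one place makes it easy to see which verticals have only a single benchmark (customer service, for example) and may need additional in-house evaluation.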

Table 74.3.1a: Vertical benchmarks (2026).

In the next section, Section 74.4: Models, we build on the material covered here.

Further Reading

Guha, N., et al. (2023). "LegalBench." NeurIPS 2023. arXiv:2308.11462. Reference legal-LLM benchmark.

Singhal, K., et al. (2023). "Large Language Models Encode Clinical Knowledge" (Med-PaLM). arXiv:2212.13138. Reference clinical-LLM benchmark methodology.

Xie, Q., et al. (2023). "PIXIU." NeurIPS 2023. arXiv:2306.05443. Reference finance-LLM benchmark.