Datasets & Benchmarks

Section 74.3

Each vertical has domain-specific benchmarks. The ones below appear in published industry comparisons.

74.3.1 Domain benchmarks

Figure 74.3.1: Frontier-model scores on the canonical 2026 vertical benchmarks. Medical reasoning (MedQA / USMLE) crosses the 90% production-ready floor with Med-PaLM 2; legal reasoning (LegalBench, Stanford) sits at ~81%. Finance (FinanceBench / BizBench family, ~71%), code (SWE-bench Verified, ~71%), customer service (tau-bench, ~70%), and cybersecurity (CyberSecEval, ~67%) remain in the "human-in-loop required" or "advisory only" bands. The right read for procurement: only medical and legal are at or near the production-ready threshold; the rest still demand the layered architecture from Chapters 73 and 79.

74.3.2 Industry data sources

74.3.3 Comparing the benchmarks

Table 74.3.1a: Vertical benchmarks (2026).

| Industry | Benchmark | Frontier-model score |
|---|---|---|
| Legal | LegalBench | ~80% |
| Medical | MedQA (USMLE) | ~90% (Med-PaLM 2) |
| Finance | FinanceBench | ~70% |
| Code | SWE-bench Verified | ~70% |
| Customer service | tau-bench | ~70% |
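
To make the comparison concrete, the following is a minimal sketch that measures each row of Table 74.3.1a against the 90% production-ready floor named in the Figure 74.3.1 caption. It rests on several assumptions: the approximate table values are read as point estimates, the same floor is applied to every vertical, and the names `BenchmarkRow`, `gap_to_floor`, and `meets_floor` are hypothetical helpers introduced only for illustration.

```python
from dataclasses import dataclass

# Floor taken from the Figure 74.3.1 caption. Applying one cutoff to
# every vertical is an assumption of this sketch, not a stated rule.
PRODUCTION_READY_FLOOR = 90.0


@dataclass(frozen=True)
class BenchmarkRow:
    industry: str
    benchmark: str
    score: float  # approximate percent from Table 74.3.1a, read as a point estimate


ROWS = [
    BenchmarkRow("Legal", "LegalBench", 80.0),
    BenchmarkRow("Medical", "MedQA (USMLE)", 90.0),
    BenchmarkRow("Finance", "FinanceBench", 70.0),
    BenchmarkRow("Code", "SWE-bench Verified", 70.0),
    BenchmarkRow("Customer service", "tau-bench", 70.0),
]


def gap_to_floor(row: BenchmarkRow, floor: float = PRODUCTION_READY_FLOOR) -> float:
    """Percentage points still needed to reach the floor (0 if already there)."""
    return max(0.0, floor - row.score)


def meets_floor(row: BenchmarkRow, floor: float = PRODUCTION_READY_FLOOR) -> bool:
    """True when the row's score is at or above the floor."""
    return row.score >= floor


if __name__ == "__main__":
    # List verticals from closest to the floor to furthest away.
    for row in sorted(ROWS, key=gap_to_floor):
        if meets_floor(row):
            status = "meets floor"
        else:
            status = f"{gap_to_floor(row):.0f} pts below floor"
        print(f"{row.industry:<18}{row.benchmark:<20}{row.score:>4.0f}%  {status}")
```

Reporting the gap in percentage points, rather than a bare pass/fail flag, keeps the distinction between a vertical that is close to the floor and one that is far from it visible. That distinction matters when deciding how much human review a deployment still needs.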

What's Next?

In the next section, Section 74.4: Models, we build on the material covered here.

Further Reading

Industry Benchmarks

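Guha, N., et al. (2023). "LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models." NeurIPS 2023. arXiv:2308.11462. Reference legal-LLM benchmark.
Singhal, K., et al. (2023). "Large Language Models Encode Clinical Knowledge." Nature. arXiv:2212.13138. Reference clinical-LLM benchmark methodology (Med-PaLM evaluation).
Xie, Q., et al. (2023). "PIXIU: A Large Language Model, Instruction Data and Evaluation Benchmark for Finance." NeurIPS 2023. arXiv:2306.05443. Reference finance-LLM benchmark.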