Appendices
Appendix J: Datasets, Benchmarks, and Leaderboards

Benchmark Summary Table

| Benchmark | Category | Size | Metric | Saturated? |
|---|---|---|---|---|
| MMLU | Knowledge | 14K questions | Accuracy | Nearly (90%+) |
| MMLU-Pro | Knowledge | 12K questions | Accuracy | No (~70-80%) |
| ARC-Challenge | Science reasoning | 2.6K questions | Accuracy | Yes (97%+) |
| HellaSwag | Commonsense | 10K questions | Accuracy | Yes (95%+) |
| TruthfulQA | Truthfulness | 817 questions | % Truthful | Partially |
| GSM8K | Grade-school math | 1.3K test problems | Accuracy | Yes (95%+) |
| MATH | Competition math | 5K test problems | Accuracy | Partially (~85%) |
| HumanEval | Code generation | 164 problems | pass@1 | Nearly (90%+) |
| SWE-bench Verified | Real-world coding | 500 tasks | % Resolved | No (~50%) |
| MT-Bench | Conversation | 80 questions | GPT-4 score (1-10) | Yes (9.0+) |
| Chatbot Arena | Overall quality | Ongoing | Elo rating | No |
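HumanEval's pass@1 metric generalizes to pass@k: the probability that at least one of k sampled completions passes the tests. The HumanEval paper gives an unbiased estimator from n samples per problem of which c pass. A minimal sketch (the function name is ours):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n = completions sampled per problem,
    c = completions that passed the tests, k = sampling budget.
    Computes 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer failures than the budget: some sample in any k-subset passes.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 5 of 10 samples pass: pass@1 estimate is 0.5
print(pass_at_k(10, 5, 1))
```

Averaging this quantity over all problems in the benchmark yields the reported pass@k score.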
Benchmark Saturation

Many older benchmarks (such as SuperGLUE) are now saturated, with frontier models scoring near ceiling. When evaluating a model, prefer newer benchmarks that still discriminate between models, and always include task-specific evaluations relevant to your use case.
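A task-specific evaluation need not be elaborate. A minimal sketch, assuming a `model_fn` callable that maps a prompt string to an answer string (a hypothetical interface; swap in your API client of choice):

```python
def evaluate(model_fn, dataset):
    """Score a model on a custom task by exact-match accuracy.
    model_fn: callable mapping a prompt string to an answer string.
    dataset: list of (prompt, expected_answer) pairs."""
    correct = sum(
        model_fn(prompt).strip().lower() == expected.strip().lower()
        for prompt, expected in dataset
    )
    return correct / len(dataset)

# Stub "model" for illustration: a dict lookup standing in for an API call.
data = [("2+2=", "4"), ("Capital of France?", "Paris")]
stub = {"2+2=": "4", "Capital of France?": "Rome"}.get
print(evaluate(stub, data))  # 0.5: one of two answers correct
```

Exact match is the simplest scoring rule; free-form tasks typically need a more forgiving comparison (normalization, containment, or an LLM judge).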