Appendices
Appendix J: Datasets, Benchmarks, and Leaderboards

Benchmark Summary Table

| Benchmark | Category | Size | Metric | Saturated? |
|---|---|---|---|---|
| MMLU | Knowledge | 14K questions | Accuracy | Nearly (90%+) |
| MMLU-Pro | Knowledge | 12K questions | Accuracy | No (~70-80%) |
| ARC-Challenge | Science reasoning | 2.6K questions | Accuracy | Yes (97%+) |
| HellaSwag | Commonsense | 10K questions | Accuracy | Yes (95%+) |
| TruthfulQA | Truthfulness | 817 questions | % Truthful | Partially |
| GSM8K | Grade-school math | 1.3K test problems | Accuracy | Yes (95%+) |
| MATH | Competition math | 5K test problems | Accuracy | Partially (~85%) |
| HumanEval | Code generation | 164 problems | pass@1 | Nearly (90%+) |
| SWE-bench Verified | Real-world coding | 500 tasks | % Resolved | No (~50%) |
| MT-Bench | Conversation | 80 questions | GPT-4 score (1-10) | Yes (9.0+) |
| Chatbot Arena | Overall quality | Ongoing | Elo rating | No |
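HumanEval's pass@1 metric generalizes to pass@k: the probability that at least one of k sampled completions passes the tests. The HumanEval paper gives an unbiased estimator from n samples per problem of which c pass. A minimal sketch (the function name is ours):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n = completions sampled per problem,
    c = completions that passed the tests, k = sampling budget.
    Computes 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer failures than the budget: some sample in any k-subset passes.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 5 of 10 samples pass: pass@1 estimate is 0.5
print(pass_at_k(10, 5, 1))
```

Averaging this quantity over all problems in the benchmark yields the reported pass@k score.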
Benchmark Saturation

Many older benchmarks (such as SuperGLUE) are now saturated, with frontier models scoring near ceiling. When evaluating a model, prefer newer benchmarks that still discriminate between models, and always include task-specific evaluations relevant to your use case.
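A task-specific evaluation need not be elaborate. A minimal sketch, assuming a `model_fn` callable that maps a prompt string to an answer string (a hypothetical interface; swap in your API client of choice):

```python
def evaluate(model_fn, dataset):
    """Score a model on a custom task by exact-match accuracy.
    model_fn: callable mapping a prompt string to an answer string.
    dataset: list of (prompt, expected_answer) pairs."""
    correct = sum(
        model_fn(prompt).strip().lower() == expected.strip().lower()
        for prompt, expected in dataset
    )
    return correct / len(dataset)

# Stub "model" for illustration: a dict lookup standing in for an API call.
data = [("2+2=", "4"), ("Capital of France?", "Paris")]
stub = {"2+2=": "4", "Capital of France?": "Rome"}.get
print(evaluate(stub, data))  # 0.5: one of two answers correct
```

Exact match is the simplest scoring rule; free-form tasks typically need a more forgiving comparison (normalization, containment, or an LLM judge).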