## Benchmark Comparison
| Benchmark | Category | Size | Metric | Saturated? |
|---|---|---|---|---|
| MMLU | Knowledge | 14K questions | Accuracy | Nearly (90%+) |
| MMLU-Pro | Knowledge | 12K questions | Accuracy | No (~70-80%) |
| ARC-Challenge | Science Reasoning | 2.6K questions | Accuracy | Yes (97%+) |
| HellaSwag | Commonsense | 10K questions | Accuracy | Yes (95%+) |
| TruthfulQA | Truthfulness | 817 questions | % Truthful | Partially |
| GSM8K | Grade-school Math | 1.3K problems | Accuracy | Yes (95%+) |
| MATH | Competition Math | 5K problems | Accuracy | Partially (~85%) |
| HumanEval | Code Generation | 164 problems | pass@1 | Nearly (90%+) |
| SWE-bench Verified | Real-world Coding | 500 tasks | % Resolved | No (~50%) |
| MT-Bench | Conversation | 80 questions | GPT-4 judge score (1-10) | Yes (9.0+) |
| Chatbot Arena | Overall Quality | Ongoing | Elo rating | No |
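
The pass@1 numbers reported for HumanEval are usually computed with the unbiased pass@k estimator from the Codex paper (Chen et al., 2021): generate n samples per problem, count the c that pass the unit tests, and estimate the probability that a random size-k subset contains at least one pass. A minimal Python sketch (the function name is illustrative):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: samples generated per problem
    c: samples that pass the unit tests
    k: the k in pass@k
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples for one problem, 37 pass -> estimated pass@1
print(pass_at_k(n=200, c=37, k=1))  # 0.185
```

The benchmark score is then the mean of this estimate over all 164 problems.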
## Benchmark Saturation
Many older benchmarks (such as SuperGLUE) are now saturated: frontier models score at or near the ceiling, so the benchmark can no longer distinguish between them. When evaluating a model, prefer newer benchmarks that still discriminate between models (e.g., MMLU-Pro or SWE-bench Verified in the table above), and always include task-specific evaluations relevant to your use case.
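
Chatbot Arena sidesteps saturation by ranking models from crowdsourced pairwise battles. The published leaderboard has since moved to a Bradley-Terry fit, but the underlying idea is the classic online Elo update, sketched below in Python (the starting ratings and K-factor are illustrative choices):

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, score_a: float,
               k: float = 32.0) -> tuple[float, float]:
    """Update both ratings after one battle (zero-sum exchange).

    score_a: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    """
    delta = k * (score_a - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta

# Example: two models start at 1000; model A wins one battle.
print(elo_update(1000.0, 1000.0, score_a=1.0))  # (1016.0, 984.0)
```

Because the ratings are relative, this kind of leaderboard never saturates: as stronger models enter, they simply climb above the existing field.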