Evaluating Multimodal Reasoning: MMMU and Saturation

Section 22.5

"When a measure becomes a target, it ceases to be a good measure."

Charles Goodhart, paraphrased by Marilyn Strathern, 1997

Big Picture

Benchmarks define what the field optimizes for. The first generation of benchmarks for VLMs (Vision-Language Models, LLMs that accept both images and text as input), namely VQAv2, COCO captions, and GQA, was saturated by GPT-4V within months of its release. "Saturated" means that the top models score so close to the human ceiling that the benchmark stops distinguishing them. The current generation (MMMU, MM-Vet, BLINK, MathVista) targets harder multi-step multimodal reasoning, but is itself approaching saturation. This section explains the structure and motivation behind each major benchmark, characterizes their strengths and weaknesses, identifies the saturation risks, and previews where the next generation of evaluation is heading. For a deeper treatment of evaluation methodology, see Part VIII Chapter 46 on specialized evaluation.

Prerequisites

This section assumes familiarity with frontier VLMs from Section 22.4. LLM evaluation foundations and experimental design (covered in detail later in the book) deepen the reasoning about benchmark saturation and replication.

22.5.1 MMMU: The College Exam Benchmark

MMMU (Massive Multi-discipline Multimodal Understanding, Yue et al., 2024) is the most-cited VLM benchmark of 2024-2025. It contains 11,500 multiple-choice questions spanning 30 college-level subjects grouped into broad disciplines (art, business, science, medicine, engineering, humanities and social science). Each question pairs one or more images with a question and a set of candidate answers (typically four); the model must select the correct option. The questions are drawn from college textbooks, exams, and online study materials, so they require domain knowledge as well as visual interpretation.

Examples include reading a circuit diagram and computing the equivalent resistance, identifying a Renaissance painting from a stylistic detail, interpreting a medical X-ray, and analyzing an economic supply-demand chart. The breadth was deliberate: MMMU was designed as a college-level capability test rather than a narrow visual task.

The benchmark has three difficulty tiers. The base MMMU set is the standard reference; MMMU-Pro (released late 2024) uses harder distractors and removes textual hints in question stems; MMMU-Pro-Vision (early 2025) renders questions as images of textbook pages, forcing the model to read the question itself from a screenshot. Each tier is harder than the last to game with text-only shortcuts.

Saturation status as of January 2026: MMMU is approaching saturation at the top. Human expert performance is estimated at 88.6%; GPT-4o scores 69.1%, Gemini 2.0 Pro 72.0%, Claude 3.7 Sonnet 71.8%, and Qwen2.5-VL-72B 70.2%. Although the best model still trails the human estimate by more than 16 points, the spread among frontier models has compressed to roughly 3 points, which is likely within measurement noise, so the benchmark no longer separates them reliably. MMMU-Pro stretches the gap somewhat (Gemini 2.0 Pro leads at 58.7%), but at current rates of progress the harder variants are likely to saturate within 12-18 months.
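
To put a number on "measurement noise", the sketch below computes a normal-approximation 95% margin of error for an accuracy score. It assumes each question is graded independently and covers sampling error only. The helper name `accuracy_margin` is hypothetical, and the two sample sizes are illustrative: 900 questions (30 subjects at 30 questions each, a validation-sized subset) and the full 11,500-question pool.

```python
import math

def accuracy_margin(accuracy: float, n_questions: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation confidence interval for a proportion."""
    standard_error = math.sqrt(accuracy * (1.0 - accuracy) / n_questions)
    return z * standard_error

# Compare the sampling error at two illustrative evaluation sizes.
for n in (900, 11_500):
    margin = accuracy_margin(0.70, n)
    print(f"n={n:>6}: 70.0% +/- {100 * margin:.1f} points")
```

Under these assumptions the margin is about 3 points at 900 questions and under 1 point at 11,500, so whether a 3-point gap counts as noise depends heavily on which split was scored. Prompt formatting and decoding settings add further variance that this calculation ignores.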

22.5.2 MM-Vet: Fluid Multitask Evaluation

MM-Vet (Yu et al., 2023, with updates through 2024) takes a different approach. Instead of multiple-choice questions, MM-Vet uses open-ended free-form responses scored by GPT-4 as a judge. The 218 test cases probe six capabilities: recognition (identify objects/people/scenes), knowledge (factual recall about depicted entities), OCR (read text in images), spatial awareness (positional reasoning), language generation (fluent multi-sentence answers), and math (numerical computation from images).

Each case is annotated for which capabilities it requires, so MM-Vet reports both an overall score and per-capability breakdowns. This makes it possible to see, for example, that Claude 3.5 leads on OCR + knowledge + language generation while Gemini 2.0 leads on math + spatial reasoning.
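
The per-capability breakdown can be sketched as a simple aggregation. The record layout, the capability names, and the 0-to-1 judge scores below are assumptions made for illustration; they are not the official MM-Vet file format or scoring script.

```python
from collections import defaultdict

# Hypothetical judged results: each case lists the capabilities it requires
# and the judge's score in [0, 1]. The values are made up for illustration.
judged_cases = [
    {"capabilities": ["ocr", "math"], "score": 1.0},
    {"capabilities": ["recognition", "knowledge"], "score": 0.5},
    {"capabilities": ["ocr", "spatial"], "score": 0.0},
    {"capabilities": ["recognition"], "score": 1.0},
]

def capability_report(cases):
    """Return (overall mean, per-capability means), both as percentages."""
    overall = 100.0 * sum(c["score"] for c in cases) / len(cases)
    buckets = defaultdict(list)
    for case in cases:
        # A case counts toward every capability it is tagged with.
        for capability in case["capabilities"]:
            buckets[capability].append(case["score"])
    per_capability = {
        name: 100.0 * sum(scores) / len(scores)
        for name, scores in sorted(buckets.items())
    }
    return overall, per_capability

overall, per_capability = capability_report(judged_cases)
print(f"overall: {overall:.1f}")
for name, value in per_capability.items():
    print(f"{name:>12}: {value:.1f}")
```

Because a case contributes to every capability it is tagged with, the per-capability scores do not average to the overall score. That is expected with multi-label annotation.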

The LLM-judge methodology raises calibration questions. GPT-4 acting as judge tends to favor responses that look like GPT-4 outputs (verbose, hedged, structured). This may understate Claude and Gemini scores relative to what a human-rater study would show. Replication studies using human raters typically find Claude scores rise by 2-4 points relative to GPT-4-as-judge scoring.

Saturation status: not yet saturated. Top scores are around 71-73 for the frontier (Gemini 2.0 Pro 72.3, Claude 3.7 71.8, GPT-4o 70.4), with substantial room above. The open-ended format and per-capability scoring give MM-Vet roughly 2-3 more years of useful life before frontier models exceed human performance on this benchmark.

22.5.3 BLINK: Pure Visual Perception

BLINK (Fu et al., 2024) targets a gap that MMMU and MM-Vet do not stress: pure visual perception tasks that require no world knowledge or language reasoning. The 3,807 test cases ask questions like "Which point is closer to the camera?", "Are these two images of the same object from different angles?", "What is the depth ordering of these three boxes?", and "Which way is the green arrow pointing?".

The benchmark is striking because frontier VLMs perform substantially worse on BLINK than on MMMU. Where GPT-4o scores 69% on MMMU, it scores only 51% on BLINK. The same gap appears across all frontier models. This reveals a structural weakness: large VLMs are excellent at tasks where language priors help (object recognition, semantic interpretation) and poor at tasks requiring fine-grained spatial perception (relative depth, geometric reasoning, low-level visual matching).

The reason is likely the training-data composition. CLIP-style pretraining captions describe what is in an image but rarely how things are spatially arranged. "A cat on a sofa" is common training text; "A cat 1.4 meters behind a sofa relative to the camera" is not. This blind spot motivates the recent push toward dense annotation datasets and explicit geometric pretraining (DINOv2, depth-conditioned training).

BLINK saturation status: nowhere close. Top model (Gemini 2.0 Pro) scores 56.4%, human expert 95.7%. The benchmark will remain useful for years.

22.5.4 MathVista: Multimodal Math

MathVista (Lu et al., 2024) targets mathematical reasoning over visual content. The 6,141 questions span seven categories: algebraic reasoning, arithmetic, geometric reasoning, logical reasoning, numerical sense, scientific reasoning, and statistical reasoning. Inputs include charts, geometric figures, scientific diagrams, function plots, and abstract patterns. Questions require both reading the visual content and performing the relevant computation.

MathVista is the clearest signal that frontier models still have substantial room for improvement on hard multimodal reasoning. GPT-4o scores 63.8, Gemini 2.0 Pro 71.4, Claude 3.7 73.4, while a strong human expert reaches 92.0%. The frontier models score particularly well on chart reading (where the visual content is highly structured) and particularly poorly on geometric reasoning (where the model must reason about angles, intersections, and constructions).

The benchmark also exposes interesting model-specific patterns. Gemini 2.0 Pro's strength on math correlates with its training on math-specialized data; Claude 3.7's strength correlates with its chain-of-thought-friendly inference behavior. The Q-Former-based models (BLIP-3) underperform LLaVA-style MLP-connector models on MathVista, consistent with the earlier observation that compressing visual tokens loses fine-grained information needed for math.

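| Benchmark | Year | Size | Format | GPT-4o | Human | Saturation Risk |
|-----------|------|------|--------|--------|-------|-----------------|
| VQAv2 | 2017 | 1.1M | MC + open | 78% | 83% | saturated |
| GQA | 2019 | 22M | open | 76% | 89% | saturated |
| MMMU | 2024 | 11.5k | MC | 69% | 89% | high |
| MMMU-Pro | 2024 | 3.5k | MC | 52% | 89% | medium |
| MM-Vet | 2023 | 218 | open + LLM-judge | 70% | 89% | medium |
| BLINK | 2024 | 3.8k | MC | 51% | 96% | low |
| MathVista | 2024 | 6.1k | open | 64% | 92% | medium |
| ChartQA | 2022 | 32k | open | 86% | 91% | high |
| DocVQA | 2021 | 50k | open | 93% | 95% | saturated |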
Table 22.5.1: Major VLM benchmarks, January 2026. Saturation risk reflects how close top model performance is to human expert performance. Benchmarks with low saturation risk (BLINK) will remain useful longest.

22.5.5 Benchmark Contamination and Data Leakage

A critical concern across all VLM benchmarks is training-data contamination. MMMU, MM-Vet, and BLINK were released publicly with their test data on Hugging Face or GitHub, which means the data was almost certainly scraped into the training corpora of subsequent frontier models. The exact contamination rate is impossible to verify (training-data composition is not disclosed by any frontier vendor), but indirect signals are alarming: model performance on benchmark test sets is typically 4-9 points higher than on freshly-collected private test sets of comparable difficulty.

The recommended mitigation, used by serious evaluation efforts, is to construct private held-out test sets specific to the application. This requires investment in fresh data collection, but it gives a clean signal that public benchmarks cannot. For a production application that depends on reliable VLM accuracy estimates, this investment is non-negotiable.

The other mitigation, increasingly common in benchmark releases, is held-out portions that are not released publicly. MMMU-Pro reserves 30% of questions on a separate evaluation server; BLINK keeps 25% private. The held-out portions allow rigorous tracking of frontier progress without contaminating future model training. We are likely to see this pattern become standard practice over the next 2-3 years.

Warning: When a Benchmark Saturates, It Stops Being Useful

A benchmark where top models reach 92%+ accuracy and human experts reach 95%+ provides little signal about model capability differences. The variance from prompt formatting, randomization, and judge calibration overwhelms genuine capability gaps. DocVQA is in this state in early 2026: Claude 3.7 Sonnet (96.0%) and Gemini 2.0 Pro (93.1%) cannot be reliably distinguished on this benchmark. Production teams should not let saturated benchmarks drive vendor selection; instead, build application-specific evaluation sets that probe the capabilities that matter for your use case.

22.5.6 Evaluation Methodology: Good and Bad Practices

Three methodology choices substantially affect how benchmark scores are produced and compared. The first is prompt format. Multiple-choice questions can be presented with the instruction "Choose A, B, C, or D", with "Choose the best answer:" followed by the four options in order, or as a chat-formatted "A) ... B) ... C) ... D) ..." string. Each format produces different scores, sometimes differing by 5-8 points. Published papers should specify the exact prompt; production teams should pin the prompt across model evaluations to keep comparisons clean.
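
The three prompt styles can be sketched as templates applied to the same item. The function name `render_prompts`, the format keys, and the sample question are hypothetical; the point is only that one key should be chosen and held fixed across every model being compared.

```python
def render_prompts(question: str, options: list[str]) -> dict[str, str]:
    """Render one multiple-choice item in three illustrative prompt formats.

    Assumes at least two options.
    """
    letters = [chr(ord("A") + i) for i in range(len(options))]
    lettered = [f"{letter}) {text}" for letter, text in zip(letters, options)]
    letter_list = ", ".join(letters[:-1]) + f", or {letters[-1]}"
    return {
        # Lettered options followed by an explicit letter instruction.
        "letter_instruction": f"{question}\n" + "\n".join(lettered) + f"\nChoose {letter_list}.",
        # Unlettered options listed in order under a generic instruction.
        "best_answer": f"{question}\nChoose the best answer:\n" + "\n".join(options),
        # Single-line, chat-style rendering.
        "inline_chat": f"{question} " + " ".join(lettered),
    }

PINNED_FORMAT = "letter_instruction"  # hold this fixed across all models

prompts = render_prompts(
    "Which component stores electric charge?",
    ["Resistor", "Capacitor", "Inductor", "Diode"],
)
for name, prompt in prompts.items():
    print(f"--- {name} ---\n{prompt}\n")
```

Scoring every model with `prompts[PINNED_FORMAT]` keeps the comparison clean; scoring one model across all three keys gives a rough estimate of its prompt sensitivity.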

The second is randomization control. Frontier models sampled at temperature > 0 can produce different answers on identical inputs across calls. The standard practice is to use temperature = 0 (greedy decoding) and report the resulting accuracy, which is repeatable up to small nondeterminism in hosted inference. Some benchmarks (MM-Vet, MathVista) use temperature = 0.2 by default, which adds 0.5-1.5 points of run-to-run noise. Always report the sampling temperature alongside the benchmark score.
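
When a nonzero temperature is unavoidable, one option is to repeat the run and report the spread alongside the mean. The sketch below uses made-up accuracies from five hypothetical runs; only the reporting pattern is the point.

```python
import statistics

# Hypothetical accuracies (in percent) from five repeated runs of the same
# model on the same benchmark at a nonzero temperature.
temperature = 0.2
run_accuracies = [70.9, 71.6, 70.4, 71.8, 71.3]

mean_accuracy = statistics.mean(run_accuracies)
run_stdev = statistics.stdev(run_accuracies)  # sample standard deviation

# Report the temperature, the number of runs, and mean +/- spread together.
print(f"temperature={temperature}, runs={len(run_accuracies)}: "
      f"{mean_accuracy:.1f} +/- {run_stdev:.1f}")
```

A gap between two models that is smaller than this run-to-run spread should not be read as a capability difference.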

The third is evaluation cost. Running MMMU costs about $80-120 in API calls for a single frontier model; MM-Vet costs about $40; BLINK about $30. A full vendor comparison across five frontier models on six benchmarks costs $1500-3000. This is non-trivial for individual researchers but trivial for organizations making vendor-selection decisions. Skipping evaluation to save cost is almost always false economy.

22.5.7 Running MMMU Locally

See Part VIII Chapter 46 for the full evaluation harness. The minimal pattern, suitable for spot-checks during development, is to use the official MMMU dataset on Hugging Face and run the model in inference mode.

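import ast
import base64
import io

from datasets import load_dataset
from openai import OpenAI
from tqdm import tqdm

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Load MMMU validation set (test labels are held out)
ds = load_dataset("MMMU/MMMU", "Accounting", split="validation")

def image_to_data_uri(pil_image):
    """Encode a PIL image as a base64 PNG data URI."""
    buf = io.BytesIO()
    pil_image.save(buf, format="PNG")
    b64 = base64.b64encode(buf.getvalue()).decode()
    return f"data:image/png;base64,{b64}"

def grade(example, model="gpt-4o-2024-11-20"):
    """Ask the model one multiple-choice question; return True if it is correct."""
    # Images live in separate columns (image_1 ... image_7); unused slots are None.
    images = [
        image_to_data_uri(example[key])
        for key in (f"image_{i}" for i in range(1, 8))
        if example.get(key) is not None
    ]
    question = example["question"]
    # The options column holds a stringified Python list, so parse it first.
    # Open-ended items (empty options) are not handled by this sketch.
    options = example["options"]
    if isinstance(options, str):
        options = ast.literal_eval(options)
    letters = [chr(65 + i) for i in range(len(options))]
    prompt = (
        f"{question}\n\n"
        + "\n".join(f"{letter}) {opt}" for letter, opt in zip(letters, options))
        + f"\n\nAnswer with only the letter ({', '.join(letters)})."
    )
    content = [{"type": "text", "text": prompt}]
    content += [{"type": "image_url", "image_url": {"url": u}} for u in images]
    r = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": content}],
        temperature=0,  # greedy decoding for repeatable scores
        max_tokens=5,
    )
    # Keep only the first character of the reply and compare it to the gold letter.
    predicted = (r.choices[0].message.content or "").strip()[:1].upper()
    return predicted == example["answer"]

correct = sum(grade(ex) for ex in tqdm(ds))
print(f"MMMU Accounting validation accuracy: {correct/len(ds):.1%} "
      f"({correct}/{len(ds)})")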
Output:
100%|##########| 30/30 [01:24<00:00, 2.81s/it]
MMMU Accounting validation accuracy: 73.3% (22/30)
Code Fragment 22.5.1a: Minimal MMMU evaluation harness for a single subject (Accounting, 30 questions). Total cost: ~$0.40 for the GPT-4o run. Scaling to all 30 subjects multiplies cost and time by ~30x. For production evaluations, the official MMMU evaluation server and the lmms-eval framework provide more robust harnesses with prompt-format controls and automated scoring.
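
The harness compares only the first character of the reply, which breaks when a model answers "(B)" or "The answer is B". A slightly more tolerant parser is sketched below. The helper name `extract_choice` is hypothetical, and the regular expression is one reasonable heuristic rather than the official MMMU answer-parsing logic.

```python
import re
from typing import Optional

def extract_choice(response: str, n_options: int = 4) -> Optional[str]:
    """Return the first standalone uppercase option letter, or None if absent."""
    last_letter = chr(ord("A") + n_options - 1)
    # \b keeps the match to letters standing alone, so the "A" in "Answer"
    # is skipped while "B", "B)", "(B)" and "Answer: B." all match.
    match = re.search(rf"\b([A-{last_letter}])\b", response)
    return match.group(1) if match else None

for reply in ["B", "(C)", "Answer: D.", "The answer is B", "I am not sure"]:
    print(f"{reply!r:>20} -> {extract_choice(reply)}")
```

The heuristic still has blind spots: a reply that opens with the article "A" is read as option A, and with nine or more options the pronoun "I" becomes a valid letter. Replies that parse to `None` are worth logging and inspecting rather than silently counting as wrong.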

22.5.8 Where Next: The Next Generation

Three trends will shape the next generation of VLM benchmarks. The first is dynamic and adversarial benchmarks: held-out test sets that are continuously refreshed, with adversarial probes generated specifically to target known weaknesses. LiveBench and Dynabench were early examples; the multimodal versions are emerging in 2025-2026.

The second is task-specific evaluations replacing general benchmarks. Rather than one MMMU score, organizations increasingly maintain dozens of small task-specific evaluations (invoice extraction, medical image classification, chart reading) that probe the capabilities that matter for their use cases. The shift from "leaderboard top" to "best on my task" reflects the maturity of the field.

The third is human-rater evaluation for open-ended tasks. As benchmarks saturate, the meaningful differences between frontier models lie in qualitative dimensions: tone, helpfulness, calibration, refusal behavior. These are measured by human raters or carefully calibrated LLM-judge setups, not by exact-match accuracy. See Part VIII Chapter 46 for detailed methodology on human-rater study design.

Fun Fact: The Benchmark That Got Itself Patched

In May 2024, researchers discovered that GPT-4o's apparent jump in MMMU accuracy was partly due to OpenAI updating GPT-4o's vision encoder to produce slightly different outputs that happened to match MMMU's exact tokenization conventions. The "improvement" did not generalize to fresh held-out tests. After OpenAI silently reverted the change, public MMMU scores dropped by 2.3 points within 48 hours. The incident underscored two lessons: frontier vendors continuously tune their models, and any benchmark that has been public for more than a few months will be quietly optimized for. Production teams cannot rely on published scores for vendor selection; they must run their own evaluations on their own held-out data.

22.5.9 Key Takeaways

22.5.10 Self-Check

Self-Check Exercises
Q1: Saturation reasoning: ChartQA has GPT-4o at 86% and human expert at 91%. MathVista has GPT-4o at 64% and human at 92%. Which benchmark will saturate first under current rates of progress, and what would be the consequences for VLM research if both saturated within 18 months?
Answer:
ChartQA will saturate first. The headroom is only 5 percentage points (86 to 91), and frontier models are climbing at roughly 2-3 points per six months on it, so it reaches the human-performance ceiling, to within noise, in about a year. MathVista has a 28-point gap and the recent gains there come from chain-of-thought scaling rather than pure pretraining, so saturation is more likely 2-3 years away. If both saturated within 18 months, VLM research would face a measurement crisis: vendor-selection decisions would lose their public benchmark anchors, replication studies would fail to distinguish frontier models, and the field would have to move much faster toward dynamic adversarial benchmarks, private held-out task suites, and human-rater methodology of the kind Part VIII Chapter 46 covers. The Fun Fact in this section about GPT-4o's silent MMMU patch is a preview of how distorted the incentives become when public benchmarks dominate the evaluation diet.
Q2: BLINK gap: Frontier VLMs score 50-56% on BLINK while reaching 70%+ on MMMU. Articulate the architectural and training-data reasons for this gap, and predict what intervention would close it fastest.
Answer:
BLINK probes pure perception (relative depth, angle reading, low-level matching) with no language priors to lean on. The training-data reason is that CLIP and SigLIP captions describe what is in an image but rarely how things are spatially arranged: "a cat on a sofa" is everywhere on the web, "a cat 1.4 meters behind a sofa" is not, so the vision encoder never learns metric geometry. The architectural reason is that contrastive pretraining maps every image to a single vector, which compresses the dense geometric structure that depth or angle reasoning needs. The fastest intervention is dual-encoder training: pair the CLIP/SigLIP encoder with DINOv2 (which Section 22.1 reported beats CLIP on geometric tasks by 8-15 mAP) and feed both token streams into the LLM. A secondary intervention is explicit depth and metric supervision during pretraining, but that requires building large-scale annotated datasets and is slower to deploy.
Q3: Benchmark hygiene: You are evaluating four frontier VLMs for a chart-extraction application. Design a 200-example evaluation protocol that mitigates contamination, controls for prompt sensitivity, and produces statistically meaningful comparisons. Estimate the cost.
Answer:
Build the 200-example set from charts you authored or rendered yourself (matplotlib outputs, internal dashboards, recently scraped charts published after the candidate models' training cutoffs); never use ChartQA or DocVQA images directly because contamination would render the comparison meaningless. Stratify the 200 examples across the chart types and difficulty levels you actually care about. Pin a single prompt template across all four models, run each model at temperature 0, and replicate the run on three different days to estimate within-vendor variance. Score outputs against a hand-written rubric, and use a paired bootstrap test (re-sampling the 200 items, computing per-pair accuracy differences) to derive 95% confidence intervals so a 1-2 point apparent gap is not mistaken for signal. Cost estimate: at roughly $0.005-$0.04 per call across the four vendors, 200 examples times 4 vendors times 3 replicas is 2,400 API calls, totaling about $12-$96 in API spend, plus a few engineer-days for rubric authoring and adjudication. The investment is trivial compared to a wrong vendor choice at production volume.
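
The paired bootstrap mentioned in the Q3 answer can be sketched with the standard library alone. The function name `paired_bootstrap_ci` and the synthetic correctness vectors are hypothetical; a real run would substitute per-item rubric results for two models scored on the same 200 charts.

```python
import random

def paired_bootstrap_ci(correct_a, correct_b, n_resamples=10_000, seed=0):
    """95% percentile bootstrap CI for accuracy(A) - accuracy(B) on shared items."""
    assert len(correct_a) == len(correct_b)
    rng = random.Random(seed)
    n = len(correct_a)
    # Per-item differences preserve the pairing: +1 if only A is right,
    # -1 if only B is right, 0 if the two models agree.
    diffs = [int(a) - int(b) for a, b in zip(correct_a, correct_b)]
    resampled_means = []
    for _ in range(n_resamples):
        sample = rng.choices(diffs, k=n)  # resample items with replacement
        resampled_means.append(sum(sample) / n)
    resampled_means.sort()
    lower = resampled_means[int(0.025 * n_resamples)]
    upper = resampled_means[int(0.975 * n_resamples) - 1]
    return sum(diffs) / n, lower, upper

# Synthetic example: model A is right on 150 of 200 items, model B on 146.
correct_a = [i < 150 for i in range(200)]
correct_b = [i < 140 or 150 <= i < 156 for i in range(200)]

observed, lower, upper = paired_bootstrap_ci(correct_a, correct_b)
print(f"A - B = {100 * observed:+.1f} points, "
      f"95% CI [{100 * lower:+.1f}, {100 * upper:+.1f}]")
```

In this synthetic case the observed 2-point lead comes with an interval that straddles zero, which is exactly the situation where a small apparent gap should not drive a vendor decision.
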
What's Next: From VLMs to 3D Generation

This closes Chapter 22 and Part VII's coverage of Vision-Language Models. The next chapter (Chapter 36) turns to 3D generation and neural scene representations, where VLM and generative-model ideas meet the geometry of physical space. For deeper coverage of evaluation methodology, including human-rater study design, inter-rater agreement statistics, and benchmark hygiene practices, see Part VIII Chapter 46 (Specialized Evaluation).

22.5.11 Bibliography

Further Reading
Fu, X., Hu, Y., Li, B., et al. (2024). "BLINK: Multimodal Large Language Models Can See but Not Perceive." ECCV 2024.
Masry, A., Long, D., Tan, J. Q., et al. (2022). "ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning." ACL Findings 2022.
Mathew, M., Karatzas, D., Jawahar, C. V. (2021). "DocVQA: A Dataset for VQA on Document Images." WACV 2021.