"When a measure becomes a target, it ceases to be a good measure."
Charles Goodhart, paraphrased by Marilyn Strathern, 1997
Benchmarks define what the field optimizes for. The first generation of benchmarks for VLMs (Vision-Language Models, LLMs that accept both images and text as input), namely VQAv2, COCO captions, and GQA, was saturated by GPT-4V within months of its release. A benchmark is "saturated" when the top models score so close to the human ceiling that it stops distinguishing them. The current generation (MMMU, MM-Vet, BLINK, MathVista) targets harder multi-step multimodal reasoning, but parts of it are already approaching saturation. This section explains the structure and motivation behind each major benchmark, characterizes their strengths and weaknesses, identifies the saturation risks, and previews where the next generation of evaluation is heading. For a deeper treatment of evaluation methodology, see Part VIII Chapter 46 on specialized evaluation.
Prerequisites
This section assumes familiarity with frontier VLMs from Section 22.4. LLM evaluation foundations and experimental design (covered in detail later in the book) deepen the reasoning about benchmark saturation and replication.
22.5.1 MMMU: The College Exam Benchmark
MMMU (Massive Multi-discipline Multimodal Understanding, Yue et al., 2024) is the most-cited VLM benchmark of 2024-2025. It contains 11,500 multiple-choice questions spanning 30 college-level subjects across six disciplines (art and design, business, science, health and medicine, humanities and social science, and tech and engineering). Each question pairs one or more images with a question and four candidate answers; the model must select the correct option. The questions are drawn from college textbooks, exams, and online study materials, so they require domain knowledge plus visual interpretation.
Examples include reading a circuit diagram and computing the equivalent resistance, identifying a Renaissance painting from a stylistic detail, interpreting a medical X-ray, and analyzing an economic supply-demand chart. The breadth was deliberate: MMMU was designed as a college-level capability test rather than a narrow visual task.
The benchmark comes in three difficulty tiers. The base MMMU set is the standard reference. MMMU-Pro (released late 2024) uses harder distractors and removes textual hints in question stems. MMMU-Pro-Vision (early 2025) renders questions as images of textbook pages, forcing the model to read the question itself from a screenshot. Each tier closes off more text-only shortcuts, which makes the benchmark harder to game.
Saturation status as of January 2026: MMMU is approaching saturation at the top, in the sense that it no longer separates frontier models, even though all of them remain well below the estimated human expert score of 88.6%. GPT-4o scores 69.1%, Gemini 2.0 Pro 72.0%, Claude 3.7 Sonnet 71.8%, and Qwen2.5-VL-72B 70.2%. The gap between frontier models has compressed to roughly 3 points, which is likely within measurement noise. MMMU-Pro stretches the gap somewhat (Gemini 2.0 Pro leads at 58.7%), but at current rates of progress the harder variants will likely saturate within 12-18 months.
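The claim that a gap of roughly 3 points may sit within measurement noise can be checked with a back-of-the-envelope binomial calculation. The sketch below assumes each question is an independent pass/fail trial and uses a normal approximation; the evaluation-set size of 900 questions is an illustrative assumption, not a figure from this section.

```python
import math


def accuracy_margin(accuracy: float, n_questions: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation confidence interval for an accuracy."""
    standard_error = math.sqrt(accuracy * (1.0 - accuracy) / n_questions)
    return z * standard_error


# Scores quoted in the paragraph above, expressed as fractions.
scores = {
    "GPT-4o": 0.691,
    "Gemini 2.0 Pro": 0.720,
    "Claude 3.7 Sonnet": 0.718,
    "Qwen2.5-VL-72B": 0.702,
}

N_QUESTIONS = 900  # assumed evaluation-set size, for illustration only

spread = max(scores.values()) - min(scores.values())
margin = accuracy_margin(0.70, N_QUESTIONS)

print(f"frontier spread: {spread * 100:.1f} points")
print(f"95% margin at n={N_QUESTIONS}: +/-{margin * 100:.1f} points")
```

Under these assumptions the uncertainty on a single score is about as large as the whole frontier spread. A larger evaluation set shrinks the margin with the square root of the number of questions.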
22.5.2 MM-Vet: Fluid Multitask Evaluation
MM-Vet (Yu et al., 2023, with updates through 2024) takes a different approach. Instead of multiple-choice questions, MM-Vet uses open-ended free-form responses scored by GPT-4 as a judge. The 218 test cases probe six capabilities: recognition (identify objects/people/scenes), knowledge (factual recall about depicted entities), OCR (read text in images), spatial awareness (positional reasoning), language generation (fluent multi-sentence answers), and math (numerical computation from images).
Each case is annotated for which capabilities it requires, so MM-Vet reports both an overall score and per-capability breakdowns. This makes it possible to see, for example, that Claude 3.5 leads on OCR + knowledge + language generation while Gemini 2.0 leads on math + spatial reasoning.
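The per-capability breakdown can be pictured as a simple group-by over capability tags. The following is a minimal sketch with made-up cases and scores; the tag names and the `capability_breakdown` helper are hypothetical and are not taken from the official MM-Vet scoring code.

```python
from collections import defaultdict

# Hypothetical judged results: (capabilities the case requires, judge score in [0, 1]).
results = [
    ({"ocr", "math"}, 1.0),
    ({"recognition", "knowledge"}, 0.5),
    ({"ocr", "spatial"}, 0.0),
    ({"recognition", "language_generation"}, 1.0),
]


def capability_breakdown(results):
    """Average the scores of all cases that require each capability."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for capabilities, score in results:
        for capability in capabilities:
            totals[capability] += score
            counts[capability] += 1
    return {capability: totals[capability] / counts[capability] for capability in totals}


overall = sum(score for _, score in results) / len(results)
per_capability = capability_breakdown(results)

print(f"overall: {overall:.2f}")
for capability, score in sorted(per_capability.items()):
    print(f"{capability}: {score:.2f}")
```

A case counts toward every capability it is tagged with, so the per-capability scores overlap and do not average back to the overall score.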
The LLM-judge methodology raises calibration questions. GPT-4 acting as judge tends to favor responses that look like GPT-4 outputs (verbose, hedged, structured). This may understate Claude and Gemini scores relative to what a human-rater study would show. Replication studies using human raters typically find Claude scores rise by 2-4 points relative to GPT-4-as-judge scoring.
Saturation status: not yet saturated. Top scores are around 70-72 for the frontier (Gemini 2.0 Pro 72.3, Claude 3.7 71.8, GPT-4o 70.4), with substantial room above. The open-ended format and per-capability scoring give MM-Vet roughly 2-3 more years of useful life before frontier models exceed human performance on this benchmark.
22.5.3 BLINK: Perception Without Recognition
BLINK (Fu et al., 2024) targets a gap that MMMU and MM-Vet do not stress: pure visual perception tasks that require no world knowledge or language reasoning. The 3,807 test cases ask questions like "Which point is closer to the camera?", "Are these two images of the same object from different angles?", "What is the depth ordering of these three boxes?", and "Which way is the green arrow pointing?".
The benchmark is striking because frontier VLMs perform substantially worse on BLINK than on MMMU. Where GPT-4o scores 69% on MMMU, it scores only 51% on BLINK. The same gap appears across all frontier models. This reveals a structural weakness: large VLMs are excellent at tasks where language priors help (object recognition, semantic interpretation) and poor at tasks requiring fine-grained spatial perception (relative depth, geometric reasoning, low-level visual matching).
The reason is likely the training-data composition. CLIP-style pretraining captions describe what is in an image but rarely how things are spatially arranged. "A cat on a sofa" is common training text; "A cat 1.4 meters behind a sofa relative to the camera" is not. This blind spot motivates the recent push toward dense annotation datasets and explicit geometric pretraining (DINOv2, depth-conditioned training).
BLINK saturation status: nowhere close. Top model (Gemini 2.0 Pro) scores 56.4%, human expert 95.7%. The benchmark will remain useful for years.
22.5.4 MathVista: Multimodal Math
MathVista (Lu et al., 2024) targets mathematical reasoning over visual content. The 6,141 questions span seven categories: algebraic reasoning, arithmetic, geometric reasoning, logical reasoning, numerical sense, scientific reasoning, and statistical reasoning. Inputs include charts, geometric figures, scientific diagrams, function plots, and abstract patterns. Questions require both reading the visual content and performing the relevant computation.
MathVista is a clear signal that frontier models still have substantial room for improvement on hard multimodal reasoning. GPT-4o scores 63.8%, Gemini 2.0 Pro 71.4%, and Claude 3.7 73.4%, while a strong human expert reaches 92.0%. The frontier models score particularly well on chart reading (where the visual content is highly structured) and particularly poorly on geometric reasoning (where the model must reason about angles, intersections, and constructions).
The benchmark also exposes interesting model-specific patterns. Gemini 2.0 Pro's strength on math correlates with its training on math-specialized data; Claude 3.7's strength correlates with its chain-of-thought-friendly inference behavior. The Q-Former-based models (BLIP-3) underperform LLaVA-style MLP-connector models on MathVista, consistent with the earlier observation that compressing visual tokens loses fine-grained information needed for math.
| Benchmark | Year | Size | Format | GPT-4o | Human | Saturation Risk |
|---|---|---|---|---|---|---|
| VQAv2 | 2017 | 1.1M | MC + open | 78% | 83% | saturated |
| GQA | 2019 | 22M | open | 76% | 89% | saturated |
| MMMU | 2024 | 11.5k | MC | 69% | 89% | high |
| MMMU-Pro | 2024 | 3.5k | MC | 52% | 89% | medium |
| MM-Vet | 2023 | 218 | open + LLM-judge | 70% | 89% | medium |
| BLINK | 2024 | 3.8k | MC | 51% | 96% | low |
| MathVista | 2024 | 6.1k | open | 64% | 92% | medium |
| ChartQA | 2022 | 32k | open | 86% | 91% | high |
| DocVQA | 2021 | 50k | open | 93% | 95% | saturated |
22.5.5 Benchmark Contamination and Data Leakage
A critical concern across all VLM benchmarks is training-data contamination. MMMU, MM-Vet, and BLINK were released publicly with their test data on Hugging Face or GitHub, which means the data was almost certainly scraped into the training corpora of subsequent frontier models. The exact contamination rate is impossible to verify (training-data composition is not disclosed by any frontier vendor), but indirect signals are alarming: model performance on benchmark test sets is typically 4-9 points higher than on freshly-collected private test sets of comparable difficulty.
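Whether a public-versus-private gap of this size is more than sampling noise depends on the test-set sizes. A minimal sketch of a two-proportion z-test follows; the counts are invented for illustration, and the test assumes independent questions and a normal approximation.

```python
import math


def two_proportion_z(correct_a: int, n_a: int, correct_b: int, n_b: int) -> float:
    """z statistic for the difference between two accuracies (pooled variance)."""
    p_a, p_b = correct_a / n_a, correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    return (p_a - p_b) / standard_error


def two_sided_p_value(z: float) -> float:
    """Two-sided p-value under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical counts: 70.0% on a public test set, 64.0% on a fresh private set.
z = two_proportion_z(correct_a=700, n_a=1000, correct_b=320, n_b=500)
p = two_sided_p_value(z)

print(f"gap: 6.0 points, z = {z:.2f}, p = {p:.3f}")
```

A significant gap is only indirect evidence of contamination. It is also consistent with the private set being harder, which is why the comparison sets need to be matched for difficulty.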
The recommended mitigation, used by serious evaluation efforts, is to construct private held-out test sets specific to the application. This requires investment in fresh data collection, but it gives a clean signal that public benchmarks cannot. For a production application that depends on reliable VLM accuracy estimates, this investment is non-negotiable.
The other mitigation, increasingly common in benchmark releases, is held-out portions that are not released publicly. MMMU-Pro reserves 30% of questions on a separate evaluation server; BLINK keeps 25% private. The held-out portions allow rigorous tracking of frontier progress without contaminating future model training. We are likely to see this pattern become standard practice over the next 2-3 years.
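One simple way to carve out a private portion is a deterministic hash-based split, sketched below. This is an illustrative assumption about how such a split could be implemented, not a description of how MMMU-Pro or BLINK actually partition their data; the `is_private` helper, the salt, and the question ids are hypothetical.

```python
import hashlib


def is_private(question_id: str, private_fraction: float = 0.30, salt: str = "v1") -> bool:
    """Assign a question to the private split based on a stable hash of its id."""
    digest = hashlib.sha256(f"{salt}:{question_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the digest to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < private_fraction


question_ids = [f"q{i:05d}" for i in range(10_000)]  # hypothetical ids
private_ids = [q for q in question_ids if is_private(q)]
public_ids = [q for q in question_ids if not is_private(q)]

print(f"private: {len(private_ids)}, public: {len(public_ids)}")
```

Because the assignment depends only on the id and the salt, the split is reproducible without storing a list, and questions added later fall into a split without reshuffling the existing ones.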
A benchmark where top models reach 92%+ accuracy and human experts reach 95%+ provides little signal about model capability differences. The variance from prompt formatting, randomization, and judge calibration overwhelms genuine capability gaps. DocVQA is in this state in early 2026: Claude 3.7 Sonnet (96.0%) and Gemini 2.0 Pro (93.1%) cannot be reliably distinguished on this benchmark. Production teams should not let saturated benchmarks drive vendor selection; instead, build application-specific evaluation sets that probe the capabilities that matter for your use case.
22.5.6 Evaluation Methodology: Good and Bad Practices
Three methodology choices substantially affect reported benchmark scores. The first is prompt format. Multiple-choice questions can be presented as "Choose A, B, C, or D", as "Choose the best answer:" followed by the four options in order, or as a chat-formatted "A) ... B) ... C) ... D) ...". Scores can differ between formats, sometimes by 5-8 points. Published papers should specify the exact prompt; production teams should pin the prompt across model evaluations to keep comparisons clean.
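To make the prompt-format point concrete, the sketch below renders one question in three ways and pins a single format for all runs. The exact wording of each template is an assumption for illustration; the renderer names and the sample question are hypothetical, and the first renderer assumes at least three options.

```python
from string import ascii_uppercase


def letters_for(options):
    """Option letters A, B, C, ... for the given options."""
    return list(ascii_uppercase[: len(options)])


def render_letter_instruction(question, options):
    letters = letters_for(options)
    lines = [f"{letter}. {option}" for letter, option in zip(letters, options)]
    choice = ", ".join(letters[:-1]) + f", or {letters[-1]}"
    return f"{question}\n" + "\n".join(lines) + f"\nChoose {choice}."


def render_best_answer(question, options):
    lines = [f"- {option}" for option in options]
    return f"{question}\nChoose the best answer:\n" + "\n".join(lines)


def render_inline_chat(question, options):
    letters = letters_for(options)
    inline = " ".join(f"{letter}) {option}" for letter, option in zip(letters, options))
    return f"{question} {inline}"


PROMPT_FORMATS = {
    "letter_instruction": render_letter_instruction,
    "best_answer": render_best_answer,
    "inline_chat": render_inline_chat,
}
PINNED_FORMAT = "letter_instruction"  # use the same key for every model under comparison

question = "What is the equivalent resistance of the circuit shown?"
options = ["2 ohms", "4 ohms", "6 ohms", "8 ohms"]
pinned_prompt = PROMPT_FORMATS[PINNED_FORMAT](question, options)
print(pinned_prompt)
```

Recording the format key next to each reported score makes it possible to tell later whether two numbers are comparable.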
The second is randomization control. Frontier models sampled at temperature > 0 can produce different answers on identical inputs across calls. The standard practice is to use temperature = 0 (greedy decoding) and report a single accuracy figure, keeping in mind that hosted APIs do not guarantee bit-identical outputs even at temperature 0. Some benchmarks (MM-Vet, MathVista) use temperature = 0.2 by default, which adds 0.5-1.5 percentage points of noise. Always report the sampling temperature alongside the benchmark score.
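When sampling is not greedy, one way to make the noise visible is to repeat the run and report the spread together with the temperature. The sketch below uses invented per-run accuracies; `summarize_runs` is a hypothetical helper, not part of any benchmark toolkit.

```python
import statistics


def summarize_runs(accuracies, temperature):
    """Format mean accuracy, run-to-run spread, and the sampling temperature."""
    mean = statistics.mean(accuracies)
    # Sample standard deviation needs at least two runs.
    spread = statistics.stdev(accuracies) if len(accuracies) > 1 else 0.0
    return (
        f"accuracy {mean * 100:.1f}% +/- {spread * 100:.1f} points "
        f"(n={len(accuracies)} runs, temperature={temperature})"
    )


# Hypothetical accuracies from five repeated runs of the same model and prompt.
run_accuracies = [0.712, 0.704, 0.718, 0.709, 0.715]
summary = summarize_runs(run_accuracies, temperature=0.2)
print(summary)
```

If the spread across repeated runs is comparable to the gap between two models, the ranking between them should not be treated as settled.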
The third is evaluation cost. Running MMMU costs about $80-120 in API calls for a single frontier model; MM-Vet costs about $40; BLINK about $30. A full vendor comparison across five frontier models on six benchmarks costs $1500-3000. This is non-trivial for individual researchers but trivial for organizations making vendor-selection decisions. Skipping evaluation to save cost is almost always false economy.
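A rough cost estimate can be derived from question counts and token prices. In the sketch below, every token count and price is a hypothetical placeholder chosen for illustration; only the question count and the "five models on six benchmarks" grid come from the text.

```python
def run_cost_usd(
    n_questions,
    input_tokens_per_question,
    output_tokens_per_question,
    usd_per_million_input,
    usd_per_million_output,
):
    """Estimated API cost of one benchmark run for one model."""
    input_tokens = n_questions * input_tokens_per_question
    output_tokens = n_questions * output_tokens_per_question
    return (
        input_tokens * usd_per_million_input
        + output_tokens * usd_per_million_output
    ) / 1_000_000


mmmu_cost = run_cost_usd(
    n_questions=11_500,
    input_tokens_per_question=2_500,  # image tokens plus question text (assumed)
    output_tokens_per_question=10,    # short letter answer (assumed)
    usd_per_million_input=3.00,       # hypothetical price
    usd_per_million_output=12.00,     # hypothetical price
)

n_runs = 5 * 6  # five models on six benchmarks
implied_low, implied_high = 1500 / n_runs, 3000 / n_runs

print(f"estimated MMMU run: ${mmmu_cost:.2f}")
print(f"implied average per run: ${implied_low:.0f}-${implied_high:.0f}")
```

Input tokens dominate the estimate because image inputs are large and answers are short, so image resolution settings are likely the main cost lever.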
22.5.7 Running MMMU Locally
See Part VIII Chapter 46 for the full evaluation harness. The minimal pattern, suitable for spot-checks during development, is to load the official MMMU dataset from Hugging Face and query the model one question at a time. The script below does this for a single subject through the OpenAI API.
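import ast
import base64
import io

from datasets import load_dataset
from openai import OpenAI
from tqdm import tqdm

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Load the MMMU validation split for one subject (test labels are held out).
ds = load_dataset("MMMU/MMMU", "Accounting", split="validation")
# Keep only multiple-choice items; open-ended items need a different grader.
ds = ds.filter(lambda ex: ex["question_type"] == "multiple-choice")


def image_to_data_uri(pil_image):
    """Encode a PIL image as a base64 PNG data URI for the chat API."""
    buf = io.BytesIO()
    pil_image.save(buf, format="PNG")
    b64 = base64.b64encode(buf.getvalue()).decode()
    return f"data:image/png;base64,{b64}"


def grade(example, model="gpt-4o-2024-11-20"):
    """Return True if the model's letter matches the gold answer."""
    # Images live in separate columns image_1 ... image_7; unused slots are None.
    images = [
        image_to_data_uri(example[key])
        for key in (f"image_{i}" for i in range(1, 8))
        if example.get(key) is not None
    ]
    # The options column may be stored as a stringified Python list.
    options = example["options"]
    if isinstance(options, str):
        options = ast.literal_eval(options)
    letters = [chr(65 + i) for i in range(len(options))]
    prompt = (
        f"{example['question']}\n\n"
        + "\n".join(f"{letter}) {opt}" for letter, opt in zip(letters, options))
        + f"\n\nAnswer with only the letter ({', '.join(letters)})."
    )
    content = [{"type": "text", "text": prompt}]
    content += [{"type": "image_url", "image_url": {"url": u}} for u in images]
    r = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": content}],
        temperature=0,
        max_tokens=5,
    )
    # Take the first character of the reply as the predicted letter.
    predicted = (r.choices[0].message.content or "").strip()[:1].upper()
    return predicted == example["answer"]


correct = sum(grade(ex) for ex in tqdm(ds))
print(f"MMMU Accounting validation accuracy: {correct/len(ds):.1%} "
      f"({correct}/{len(ds)})")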
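Taking the first character of the reply works when the model answers with a bare letter, but it misreads replies such as "The answer is D." A slightly more tolerant parser is sketched below. `extract_choice` is a hypothetical helper based on two simple heuristics, not part of the official MMMU evaluation code, and it deliberately returns `None` when it cannot find a letter.

```python
import re


def extract_choice(response, n_options=4):
    """Return the chosen option letter, or None if no clear choice is found."""
    valid = "ABCDEFGHIJ"[:n_options]
    text = (response or "").strip()
    # Case 1: the reply starts with a letter, e.g. "B", "(C)", "B) 4 ohms", "B. ...".
    leading = re.match(rf"\(?([{valid}])(?=[.):]|\s*$)", text)
    if leading:
        return leading.group(1)
    # Case 2: the reply contains a phrase such as "answer is D" or "Answer: A".
    phrase = re.search(
        rf"[Aa]nswer(?:\s+is)?\s*:?\s*\(?([{valid}])(?![A-Za-z])", text
    )
    return phrase.group(1) if phrase else None


for reply in ["B", "(C)", "B) 4 ohms", "The answer is D.", "A circuit has two loops."]:
    print(repr(reply), "->", extract_choice(reply))
```

Counting unparseable replies separately from wrong answers helps distinguish formatting failures from genuine errors. A parser like this needs a larger `max_tokens` budget than a bare-letter reply does.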
22.5.8 Where Next: The Next Generation
Three trends will shape the next generation of VLM benchmarks. The first is dynamic and adversarial benchmarks: held-out test sets that are continuously refreshed, with adversarial probes generated specifically to target known weaknesses. LiveBench and Dynabench were early examples; the multimodal versions are emerging in 2025-2026.
The second is task-specific evaluations replacing general benchmarks. Rather than one MMMU score, organizations increasingly maintain dozens of small task-specific evaluations (invoice extraction, medical image classification, chart reading) that probe the capabilities that matter for their use cases. The shift from "leaderboard top" to "best on my task" reflects the maturity of the field.
The third is human-rater evaluation for open-ended tasks. As benchmarks saturate, the meaningful differences between frontier models lie in qualitative dimensions: tone, helpfulness, calibration, refusal behavior. These are measured by human raters or carefully calibrated LLM-judge setups, not by exact-match accuracy. See Part VIII Chapter 46 for detailed methodology on human-rater study design.
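One common check when calibrating an LLM judge against a human rater is chance-corrected agreement. The sketch below computes Cohen's kappa for two raters over the same items; the labels are invented, and kappa is offered here as one reasonable choice of statistic rather than the method Chapter 46 prescribes.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labeled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected if both raters labeled independently at their own base rates.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for eight model responses.
human = ["good", "good", "bad", "good", "bad", "bad", "good", "bad"]
judge = ["good", "good", "bad", "bad", "bad", "good", "good", "bad"]

kappa = cohens_kappa(human, judge)
print(f"raw agreement: 0.75, kappa: {kappa:.2f}")
```

Raw agreement of 75% looks comfortable, but half of that would be expected by chance with these label frequencies, which is why the chance-corrected figure is lower.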
In May 2024, researchers discovered that GPT-4o's apparent jump in MMMU accuracy was partly due to OpenAI updating GPT-4o's vision encoder to produce slightly different outputs that happened to match MMMU's exact tokenization conventions. The "improvement" did not generalize to fresh held-out tests. After OpenAI silently reverted the change, public MMMU scores dropped by 2.3 points within 48 hours. The incident underscored two lessons: frontier vendors continuously tune their models, and any benchmark that has been public for more than a few months will be quietly optimized for. Production teams cannot rely on published scores for vendor selection; they must run their own evaluations on their own held-out data.
22.5.9 Key Takeaways
- MMMU is the most-cited 2024-2025 VLM benchmark, with 11.5k college-level multiple-choice questions across 30 subjects.
- MMMU-Pro and MMMU-Pro-Vision are harder variants designed to delay saturation; both will likely saturate within 12-18 months.
- MM-Vet uses open-ended responses scored by LLM-judge across 6 capability axes; it provides finer-grained diagnostic signal but with calibration questions.
- BLINK targets pure visual perception (depth, geometry, matching), where frontier VLMs are surprisingly weak (GPT-4o ~51%, best model 56.4%, vs. 96% human).
- MathVista probes multimodal mathematical reasoning; Claude 3.7 leads at 73.4 vs. 92% human expert.
- Benchmark contamination is a serious concern; private held-out test sets are essential for high-stakes evaluation.
- The next generation will emphasize dynamic adversarial benchmarks, task-specific evaluation suites, and human-rater methodology (Part VIII Chapter 46).
This closes Chapter 22 and Part VII's coverage of Vision-Language Models. The next chapter (Chapter 36) turns to 3D generation and neural scene representations, where VLM and generative-model ideas meet the geometry of physical space. For deeper coverage of evaluation methodology, including human-rater study design, inter-rater agreement statistics, and benchmark hygiene practices, see Part VIII Chapter 46 (Specialized Evaluation).