"The frontier moves faster than the literature. When you read this, the leaderboard has already changed."
Percy Liang, HELM project notes, 2024
Closed-source frontier VLMs (OpenAI GPT-4V/4o, Google Gemini 1.5/2.0, Anthropic Claude 3.5/3.7 Vision) sit at the top of every public multimodal benchmark. As far as public information shows, they are not architecturally different from the open-source models of the previous section: vision encoder + connector + LLM. They are bigger, trained on more proprietary data, and accessed through APIs rather than self-hosted weights. This section maps the API surfaces, compares benchmark performance across the three frontier vendors, surveys the prompt-engineering patterns that transfer between text and vision, and lays out the cost matrix that determines when a frontier API is the right answer.
Prerequisites
This section assumes familiarity with open-source generative VLMs from Section 22.3 and with LLM APIs from Section 11.1. Familiarity with prompt engineering from Section 12.1 helps when reading the vision-prompt patterns.
22.4.1 The Three Frontier Vendors
The three frontier VLMs each have a distinctive failure signature. GPT-4o confidently invents text inside small images; Gemini 1.5 gets the text right but mislocates objects in dense scenes; Claude Vision refuses to identify people in photos, even cartoons. Production teams typically pick one not by benchmark score but by which failure mode their users tolerate.
As of early 2026, three labs ship frontier VLMs that consistently top the public leaderboards. OpenAI's GPT-4V (September 2023) was first; its successor GPT-4o (May 2024) added native multimodal training and reduced latency by roughly 4x. Google's Gemini 1.5 Pro (February 2024) and Gemini 2.0 Flash/Pro (December 2024) brought million-token context windows. Anthropic's Claude 3 Opus (March 2024), Claude 3.5 Sonnet (June 2024), and Claude 3.7 Sonnet (early 2025) emphasized accuracy and conservative behavior on edge cases.
Architecturally, all three are believed to follow the same pattern as open-source VLMs: a vision encoder (likely a large CLIP/SigLIP variant), a connector, and a frontier LLM. The differences lie in undisclosed details: encoder size and pretraining corpus, connector design, training-data composition, and post-training (RLHF, constitutional AI, etc.). Public information is limited, but model cards and system reports give enough signal to characterize behavior.
22.4.2 GPT-4V and GPT-4o: Capabilities
GPT-4V was the first frontier VLM and set the benchmark for the field. GPT-4o (omni) replaced it as the default in 2024 and is the variant most production systems target as of 2026. GPT-4o's published MMMU score is 69.1, MathVista is 63.8, ChartQA is 85.7, and DocVQA is 92.8.
The API surface is straightforward. Images are passed inline as base64-encoded data URIs or as URLs (signed S3 URLs, public images on the web). The "detail" parameter selects between "low" (85 tokens, 512x512 max effective resolution) and "high" (multiple 512x512 tiles, up to 16 tiles per image). High-detail processing increases token cost by 4x but is essential for dense documents and small-text content. The model supports interleaved multi-image inputs within a single user turn, which enables tasks like "compare these two charts" or "describe the differences in this before-and-after pair".
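As a minimal sketch of the request shape described above, the helper below builds one user turn that carries several images plus a text prompt, with an explicit `detail` setting. The function name `build_vision_message` and the example URLs are hypothetical; the `image_url` content-part layout follows OpenAI's documented Chat Completions format, but the current API reference should be checked before relying on it.

```python
def build_vision_message(prompt: str, image_urls: list[str], detail: str = "low") -> dict:
    """Build one user turn: all images first, then the text prompt.

    `image_urls` may hold http(s) URLs or base64 data URIs.
    """
    if detail not in {"low", "high", "auto"}:
        raise ValueError(f"unsupported detail level: {detail!r}")
    content = [
        {"type": "image_url", "image_url": {"url": url, "detail": detail}}
        for url in image_urls
    ]
    content.append({"type": "text", "text": prompt})
    return {"role": "user", "content": content}


message = build_vision_message(
    "Compare these two charts.",
    ["https://example.com/a.png", "https://example.com/b.png"],  # placeholder URLs
    detail="high",
)
# With the official SDK installed, the message would be sent roughly as:
#   client.chat.completions.create(model="gpt-4o", messages=[message])
```

Keeping the payload builder separate from the network call makes it easy to log or unit-test exactly what is sent, and to switch `detail` per document type.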
GPT-4o's distinguishing strengths are arithmetic and consistency checks (catching that a receipt's subtotal does not equal sum-of-line-items), tabular data extraction with implicit type inference, and detection-style spatial reasoning ("how many people are wearing red shirts?"). Its known weaknesses are very small text (font sizes below 8pt drop accuracy by 15-25%), low-light or low-contrast images, and any task requiring counting beyond 10-15 items (where the model tends to round to suspiciously clean numbers).
22.4.3 Gemini: The Context Window Advantage
Google's Gemini 1.5 Pro and Gemini 2.0 Flash/Pro carve out a unique frontier with their 1-2 million token context windows. The practical implications for vision tasks are dramatic: Gemini can process up to 3600 images in a single API call (each image taking about 258 tokens at the default resolution), making it the only frontier VLM capable of full-document or full-video analysis in a single pass.
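The arithmetic behind the 3600-image figure can be checked with a short sketch. It uses only the numbers quoted above (about 258 tokens per image, a context window at the 1M lower end of the range); the function name `image_token_budget` is illustrative.

```python
TOKENS_PER_IMAGE = 258        # default-resolution figure quoted above
CONTEXT_WINDOW = 1_000_000    # lower end of the 1-2M range quoted above
MAX_IMAGES_PER_CALL = 3_600   # per-call image limit quoted above


def image_token_budget(n_images: int, context_window: int = CONTEXT_WINDOW) -> dict:
    """Return how many tokens the images consume and how many remain."""
    if n_images > MAX_IMAGES_PER_CALL:
        raise ValueError(f"{n_images} images exceeds the per-call limit")
    image_tokens = n_images * TOKENS_PER_IMAGE
    return {
        "image_tokens": image_tokens,
        "remaining_tokens": context_window - image_tokens,
        "fraction_used": image_tokens / context_window,
    }


budget = image_token_budget(3_600)
print(budget["image_tokens"])      # 928800
print(budget["remaining_tokens"])  # 71200
```

Under these assumptions a maximal 3600-image call uses about 93% of a 1M-token window, leaving roughly 71k tokens for the instructions and the response.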
On standard single-image benchmarks, Gemini 2.0 Pro scores 72.0 on MMMU, 71.4 on MathVista, and 87.8 on ChartQA, ahead of GPT-4o by 2.9, 7.6, and 2.1 points respectively. On long-context multi-image tasks, Gemini extends its lead substantially: on a benchmark of cross-image reasoning over 100-image inputs (Google DeepMind internal), Gemini 1.5 Pro reaches 78.4% accuracy versus 31.2% for GPT-4o with chunked retrieval.
Gemini 2.0 Flash is the cost leader by a substantial margin. At $0.075 per 1M input tokens and $0.30 per 1M output tokens (December 2025 pricing), processing a single image with a 500-token response costs about $0.0002. For high-volume document workloads, Gemini Flash is the default choice when the marginal accuracy difference versus GPT-4o or Claude 3.5 (typically 1-3 points) does not justify a 30-40x cost premium ($6.40-$8.50 versus $0.20 per 1k requests in the comparison table below).
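The per-request figure follows directly from the quoted prices. The sketch below reproduces it, assuming the 258-tokens-per-image figure from earlier in this subsection and no additional prompt tokens; `request_cost` is an illustrative helper, not part of any SDK.

```python
INPUT_PRICE_PER_M = 0.075   # USD per 1M input tokens (pricing quoted above)
OUTPUT_PRICE_PER_M = 0.30   # USD per 1M output tokens (pricing quoted above)
TOKENS_PER_IMAGE = 258      # default-resolution image cost quoted earlier


def request_cost(n_images: int, output_tokens: int, prompt_tokens: int = 0) -> float:
    """Estimated USD cost of one request at the quoted token prices."""
    input_tokens = n_images * TOKENS_PER_IMAGE + prompt_tokens
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000


cost = request_cost(n_images=1, output_tokens=500)
print(f"${cost:.6f} per request")        # $0.000169 per request
print(f"${cost * 1000:.2f} per 1k")      # $0.17 per 1k
```

Note that the output tokens dominate: the 500-token response accounts for about 89% of the cost, so trimming response length matters more than trimming image count at this price point.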
22.4.4 Claude Vision: Precision and Tone
Anthropic's Claude 3.5 Sonnet and Claude 3.7 Sonnet have a different profile. On raw benchmarks, Claude 3.5 Sonnet sits at MMMU 68.3, MathVista 67.7, ChartQA 90.8, and DocVQA 95.2. The 3.7 release pushed MMMU to 71.8, MathVista to 73.4, ChartQA to 91.3, and DocVQA to 96.0, the highest DocVQA score among the models compared in this section as of January 2026.
Claude's distinguishing behavior is conservatism. On ambiguous or partially occluded content, Claude is more likely to refuse to commit to a single interpretation, instead asking clarifying questions or returning a hedged answer with explicit caveats. For legal contract analysis, medical records, and regulatory documents, this conservatism is preferred even though it lowers raw accuracy numbers. For consumer-facing applications where a confident wrong answer is better than a hedged correct one, GPT-4o is usually preferred instead.
Claude's API supports up to 100 images per request (a recent expansion from 20), making it competitive with Gemini for multi-image tasks at typical workload sizes. The maximum effective single-image resolution is approximately 1568x1568, after which the model downsamples.
| Model | MMMU | MathVista | ChartQA | DocVQA | $/1k requests |
|---|---|---|---|---|---|
| GPT-4o | 69.1 | 63.8 | 85.7 | 92.8 | $8.50 |
| GPT-4o-mini | 59.4 | 56.7 | 82.4 | 89.3 | $0.50 |
| Gemini 2.0 Pro | 72.0 | 71.4 | 87.8 | 93.1 | $2.90 |
| Gemini 2.0 Flash | 67.7 | 68.5 | 85.4 | 91.2 | $0.20 |
| Claude 3.5 Sonnet | 68.3 | 67.7 | 90.8 | 95.2 | $6.40 |
| Claude 3.7 Sonnet | 71.8 | 73.4 | 91.3 | 96.0 | $6.40 |
The 1-3 point differences between frontier VLMs on public benchmarks are noisy. Test-set contamination (the training data of these models almost certainly includes many of the benchmark examples), reporting variance across runs, and prompt sensitivity all contribute uncertainty larger than the apparent gaps. The right way to choose between frontier VLMs for a specific application is to run a representative held-out sample (200-1000 examples) and compare on the actual task. Public-benchmark differences rarely survive contact with production data distributions.
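Sampling error alone shows why small gaps are hard to trust, even before contamination and prompt sensitivity are considered. The sketch below uses a normal-approximation confidence interval and treats the two accuracy estimates as independent; both are simplifying assumptions (a paired comparison on the same examples is usually tighter), and the function names are illustrative.

```python
import math


def accuracy_margin(accuracy: float, n_examples: int, z: float = 1.96) -> float:
    """Half-width of an approximate 95% confidence interval for an accuracy."""
    return z * math.sqrt(accuracy * (1 - accuracy) / n_examples)


def gap_is_resolvable(acc_a: float, acc_b: float, n_examples: int, z: float = 1.96) -> bool:
    """True if the observed gap exceeds the sampling noise of the difference."""
    se = math.sqrt(
        acc_a * (1 - acc_a) / n_examples + acc_b * (1 - acc_b) / n_examples
    )
    return abs(acc_a - acc_b) > z * se


print(round(accuracy_margin(0.70, 500), 3))   # 0.04
print(gap_is_resolvable(0.72, 0.69, 1000))    # False
print(gap_is_resolvable(0.80, 0.69, 1000))    # True
```

At 500 examples and 70% accuracy the interval is about plus or minus 4 points, so a 3-point gap cannot be distinguished from noise even at 1000 examples under these assumptions; larger gaps on the actual task are what a held-out evaluation can reliably detect.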
22.4.5 Prompt Engineering for VLMs
VLM prompting borrows from text-only LLM prompting but adds vision-specific patterns. Six techniques transfer directly and provide measurable gains.
The first is explicit structure. Asking "What is in this image?" produces unstructured prose. Asking "List the objects visible in this image as a JSON array with fields {name, count, confidence}" produces parseable output. The structural cue helps every frontier VLM.
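A structured prompt pays off only if the reply is validated before use. The following sketch pairs the prompt above with a small parser; the 0-to-1 confidence range and the names `STRUCTURED_PROMPT` and `parse_object_list` are assumptions made for illustration.

```python
import json

STRUCTURED_PROMPT = (
    "List the objects visible in this image as a JSON array. "
    'Each element must have the fields "name" (string), "count" (integer), '
    'and "confidence" (number between 0 and 1). Return only the JSON array.'
)

REQUIRED_FIELDS = {"name": str, "count": int, "confidence": (int, float)}


def parse_object_list(reply: str) -> list[dict]:
    """Extract and validate the JSON array from a model reply.

    Raises ValueError if no array is found or a field is missing or mistyped.
    """
    start, end = reply.find("["), reply.rfind("]")
    if start == -1 or end <= start:
        raise ValueError("no JSON array found in reply")
    items = json.loads(reply[start:end + 1])  # JSONDecodeError is a ValueError
    for item in items:
        if not isinstance(item, dict):
            raise ValueError(f"expected an object, got {item!r}")
        for field, expected in REQUIRED_FIELDS.items():
            if not isinstance(item.get(field), expected):
                raise ValueError(f"bad or missing field {field!r} in {item!r}")
    return items


reply = 'Here is the list:\n[{"name": "dog", "count": 2, "confidence": 0.9}]'
print(parse_object_list(reply))
```

Slicing from the first `[` to the last `]` tolerates models that wrap the array in a sentence of prose, which still happens occasionally even with an explicit "return only JSON" instruction.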
The second is chain-of-thought (CoT) for visual reasoning. Adding "Think step by step before answering" or "First describe what you see, then answer the question" raises accuracy on MathVista by 4-8 points on most frontier models. The mechanism mirrors text CoT: forcing the model to externalize intermediate reasoning catches mistakes that would otherwise be lost in a single forward pass.
The third is multi-view prompting. For visual reasoning tasks, asking the model to describe the image from multiple perspectives ("What do you see if you focus on the foreground? On the background? On the colors?") adds 2-4 points on MMMU and is particularly effective on Claude 3.5 Sonnet.
The fourth is reference images (few-shot). Providing 1-3 example (image, answer) pairs before the query image adds 3-7 points on most tasks. The examples can be hard-coded or retrieved from a small library; the key is that they show the desired output format and reasoning style.
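One way to lay out reference images, sketched below, is as alternating user and assistant turns in which each example image is followed by the desired answer. The message shape mirrors an Anthropic-style Messages payload; the function name `few_shot_messages` and the placeholder image blocks are hypothetical, and other vendors use different field names.

```python
def few_shot_messages(examples: list[tuple[dict, str]], query_image: dict, question: str) -> list[dict]:
    """Build alternating turns: worked (image, answer) examples, then the query.

    Each image is a content block of the form
    {"type": "image", "source": {"type": "base64", ...}}.
    """
    messages = []
    for image_block, answer in examples:
        messages.append({
            "role": "user",
            "content": [image_block, {"type": "text", "text": question}],
        })
        # The assistant turn demonstrates the desired format and reasoning style.
        messages.append({"role": "assistant", "content": answer})
    messages.append({
        "role": "user",
        "content": [query_image, {"type": "text", "text": question}],
    })
    return messages


def placeholder_image(tag: str) -> dict:
    """Stand-in for a real base64 image block (data is not a real image)."""
    return {"type": "image",
            "source": {"type": "base64", "media_type": "image/png", "data": tag}}


msgs = few_shot_messages(
    [(placeholder_image("ex1"), '{"chart_type": "bar"}'),
     (placeholder_image("ex2"), '{"chart_type": "line"}')],
    placeholder_image("query"),
    "Return the chart type as JSON.",
)
print([m["role"] for m in msgs])
# ['user', 'assistant', 'user', 'assistant', 'user']
```

Because the examples are ordinary data, they can be hard-coded or swapped for retrieved ones without changing the builder.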
The fifth is negative space prompting: explicitly telling the model what NOT to focus on. "Ignore any text in the corners. Focus only on the central chart." can reduce hallucinations on cluttered images.
The sixth is the verification pass: after extracting structured data, run a second call that asks the model to verify specific fields against the original image. This catches the 1-3% of cases where the first pass produced a plausible but wrong answer.
import base64
from pathlib import Path

from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def encode_image(path: str) -> dict:
    """Return a base64 image content block for the Messages API."""
    data = base64.b64encode(Path(path).read_bytes()).decode()
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": "image/png",  # must match the actual file format
            "data": data,
        },
    }


# Chain-of-thought + structure + verification pattern
response = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                # Image first, then the instructions that refer to it
                encode_image("chart.png"),
                {
                    "type": "text",
                    "text": (
                        "You will extract data from a chart in three steps.\n\n"
                        "Step 1: Describe the chart type, axes, and series.\n"
                        "Step 2: Output the extracted data as JSON with shape\n"
                        '  {"chart_type": str, "x_axis_label": str,\n'
                        '   "series": [{"name": str,\n'
                        '               "data": [{"x": ..., "y": float}]}]}.\n\n'
                        "Step 3: Verify that all numeric values match the visible "
                        "chart markers. If any value is uncertain, mark it null."
                    ),
                },
            ],
        }
    ],
)

# The reply is a list of content blocks; the first holds the text answer
print(response.content[0].text)
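The listing above folds verification into a single call. The separate verification pass described as the sixth technique can be sketched as two small helpers: one that builds the second-call prompt from already extracted fields, and one that blanks any field the verifier rejects. Both function names and the true/false reply convention are assumptions for illustration; the second API call itself is omitted.

```python
import json


def build_verification_prompt(extracted: dict, fields: list[str]) -> str:
    """Build the text for a second call that re-checks specific fields."""
    lines = [f"- {name}: {json.dumps(extracted.get(name))}" for name in fields]
    return (
        "Check each extracted value below against the attached image. "
        "Reply with a JSON object mapping each field name to true (matches "
        "the image) or false (does not match or cannot be confirmed).\n\n"
        + "\n".join(lines)
    )


def apply_verdicts(extracted: dict, verdicts: dict) -> dict:
    """Return a copy in which every rejected field is set to None.

    Fields the verifier did not mention are left unchanged.
    """
    return {
        key: (value if verdicts.get(key, True) else None)
        for key, value in extracted.items()
    }


extracted = {"chart_type": "bar", "x_axis_label": "Year"}
print(build_verification_prompt(extracted, ["chart_type", "x_axis_label"]))

# Suppose the verifier's parsed reply rejected one field:
verdicts = {"chart_type": True, "x_axis_label": False}
print(apply_verdicts(extracted, verdicts))
# {'chart_type': 'bar', 'x_axis_label': None}
```

Setting rejected fields to `None` rather than dropping them keeps the output schema stable, so downstream code can route nulls to human review.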
22.4.6 Multi-Image and Interleaved Inputs
Frontier VLMs handle multi-image inputs through interleaved text-and-image messaging. The canonical pattern is to alternate image content blocks and text content blocks within a single user turn: "Image 1: [img]. Image 2: [img]. Compare the two charts and identify three differences." All three frontier vendors support this pattern, though with different limits: GPT-4o supports up to 50 images per call, Claude 3.7 up to 100, and Gemini 2.0 up to 3,600.
For tasks that compare or aggregate across many images (slide-deck summarization, batch document classification, video keyframe analysis), the multi-image API is essential: it replaces a series of per-image API calls that would otherwise have to carry state through the conversation context. The cost savings are substantial: a single 50-image call is typically 3-5x cheaper than 50 single-image calls because the per-call overhead is amortized.
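The interleaving pattern can be sketched as a builder that labels each image with a short text block and enforces the per-call limits quoted above. The dictionary keys are informal labels rather than API model identifiers, the text-block shape follows the Anthropic-style listing earlier in this section, and `interleave` is a hypothetical helper.

```python
# Per-call image limits as quoted in this subsection (labels, not model IDs)
IMAGE_LIMITS = {"gpt-4o": 50, "claude-3.7": 100, "gemini-2.0": 3600}


def interleave(image_blocks: list[dict], instruction: str, vendor: str) -> list[dict]:
    """Return content blocks: 'Image 1:', img, 'Image 2:', img, ..., instruction."""
    limit = IMAGE_LIMITS[vendor]
    if len(image_blocks) > limit:
        raise ValueError(
            f"{len(image_blocks)} images exceeds the {vendor} limit of {limit}"
        )
    content = []
    for index, block in enumerate(image_blocks, start=1):
        content.append({"type": "text", "text": f"Image {index}:"})
        content.append(block)
    content.append({"type": "text", "text": instruction})
    return content


blocks = [{"type": "image", "source": {"placeholder": n}} for n in ("a", "b")]
content = interleave(
    blocks, "Compare the two charts and identify three differences.", "claude-3.7"
)
print([c["type"] for c in content])
# ['text', 'image', 'text', 'image', 'text']
```

Explicit "Image N:" labels give the instruction something unambiguous to refer to, which matters more as the image count grows.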
22.4.7 When Frontier APIs Are the Right Answer
The 2026 decision matrix for choosing between frontier VLM APIs and self-hosted open-source VLMs depends on four factors. The first is task complexity: for tasks requiring fluent multi-step reasoning, frontier APIs maintain a 3-8 point edge that often matters operationally. The second is throughput: below 10k requests/day, frontier APIs are cheaper than running a dedicated GPU; above 1M requests/day, self-hosted is essentially always cheaper. Between these limits, the answer depends on accuracy and operational tolerance.
The third factor is data sensitivity. Self-hosted models keep data on-premise, which matters for regulated workloads. All three frontier vendors offer enterprise tiers with data residency commitments and zero data retention; for the most sensitive workloads (healthcare PHI, government classified data), self-hosting is still the only acceptable option.
The fourth factor is iteration speed. APIs make rapid prototyping trivial, and new capabilities (new model versions, new modality support) ship continuously without any action from the customer. Self-hosted deployments require explicit upgrade decisions. For research and exploration, APIs almost always win; for production-critical, high-volume workloads, self-hosted often wins.
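The throughput factor reduces to a break-even calculation. The sketch below uses the per-1k-request prices from the comparison table and a purely hypothetical self-hosting cost of $60 per day for one dedicated GPU (hardware, power, and operations combined); that figure is an assumption for illustration, not a quoted price, and the sketch ignores whether a single GPU can actually serve the resulting volume.

```python
API_COST_PER_1K = {          # USD per 1k requests, from the comparison table
    "GPT-4o": 8.50,
    "Gemini 2.0 Pro": 2.90,
    "Gemini 2.0 Flash": 0.20,
}
SELF_HOSTED_COST_PER_DAY = 60.0  # HYPOTHETICAL all-in daily cost of one GPU


def break_even_requests_per_day(api_cost_per_1k: float,
                                self_hosted_cost_per_day: float) -> float:
    """Daily request volume at which API spend equals the self-hosting cost."""
    cost_per_request = api_cost_per_1k / 1000
    return self_hosted_cost_per_day / cost_per_request


for model, price in API_COST_PER_1K.items():
    volume = break_even_requests_per_day(price, SELF_HOSTED_COST_PER_DAY)
    print(f"{model}: ~{volume:,.0f} requests/day")
```

Under this assumed daily cost the break-even points range from roughly 7k requests/day against GPT-4o to 300k against Gemini 2.0 Flash, which illustrates why the answer in the middle band depends so heavily on which API is the alternative.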
Frontier VLM pricing has dropped 5-15x over 18 months as compute efficiency improvements compound. The numbers in Figure 22.4.1c are accurate as of January 2026 but will likely be obsolete within six months. Long-term contracts and reserved-capacity pricing can lock in costs for high-volume customers, but the spot pricing volatility means cost-driven decisions should be revisited quarterly. Engineering teams should design their architectures to allow swapping VLMs at the configuration layer rather than baking specific vendor choices into application code.
22.4.8 Key Takeaways
- Three vendors define the closed-source VLM frontier: OpenAI (GPT-4V/4o), Google (Gemini 1.5/2.0), Anthropic (Claude 3.5/3.7).
- Architectures are believed identical in pattern (vision encoder + connector + LLM); differences are in scale, training data, and post-training.
- Claude 3.7 Sonnet leads on math (MathVista 73.4) and documents (DocVQA 96.0); Gemini 2.0 Pro leads on MMMU (72.0); Gemini 2.0 Flash dominates cost.
- Public-benchmark gaps of 1-3 points are noisy; real choices should be validated on representative held-out data.
- Effective VLM prompting layers six techniques: structure, CoT, multi-view, few-shot, negative space, verification. Compound effects often add 8-15 points on hard tasks.
- Multi-image interleaved inputs are supported by all three vendors (up to 3600 images on Gemini); single-call batching is 3-5x cheaper than per-image calls.
- Frontier APIs win for low/medium volume, rapid iteration, and complex reasoning; self-hosted open-source wins for high volume, data sensitivity, and long-lived production deployments.
Section 22.5 closes the chapter by examining how we measure VLM capability. MMMU, MM-Vet, BLINK, and MathVista are the benchmarks that define the field today, and understanding their structure, strengths, and saturation risks is essential for both reading the literature and evaluating models on your own data. The section also cross-links to Part VIII Chapter 46 (specialized evaluation) for deeper coverage of the evaluation methodology.