Frontier VLMs: GPT-4V, Gemini, Claude Vision

Section 22.4

"The frontier moves faster than the literature. When you read this, the leaderboard has already changed."

Percy Liang, HELM project notes, 2024
Big Picture

Closed-source frontier VLMs (OpenAI GPT-4V/4o, Google Gemini 1.5/2.0, Anthropic Claude 3.5/3.7 Vision) sit at the top of every public multimodal benchmark. They are not architecturally different from the open-source models of the previous section: vision encoder + connector + LLM. They are bigger, trained on more proprietary data, and accessed through APIs rather than self-hosted weights. This section maps the API surfaces, compares benchmark performance across the three frontier vendors, surveys the prompt-engineering patterns that transfer between text and vision, and lays out the cost matrix that determines when a frontier API is the right answer.

Prerequisites

This section assumes familiarity with open-source generative VLMs from Section 22.3 and with LLM APIs from Section 11.1. Familiarity with prompt engineering from Section 12.1 helps when reading the vision-prompt patterns.

22.4.1 The Three Frontier Vendors

Fun Fact

The three frontier VLMs each have a distinctive failure signature. GPT-4o confidently invents text inside small images; Gemini 1.5 gets the text right but mislocates objects in dense scenes; Claude Vision refuses to identify people in photos, even cartoons. Production teams typically pick one not by benchmark score but by which failure mode their users tolerate.

[Illustration: three cartoon VLM robots standing on slightly different podiums in front of a treadmill leaderboard. GPT-4o is doing arithmetic on a chalkboard, Gemini holds an enormous scroll of context, and Claude wears glasses and carries a tiny clipboard with the word "hedging" on it.]
Figure 22.4.1: Three frontier VLMs, three different specialties. GPT-4o does arithmetic on receipts, Gemini swallows million-token scrolls, Claude squints carefully at small print. Any 1-3 point gap on a public benchmark is smaller than the variance from your prompt wording.

As of early 2026, three labs ship frontier VLMs that consistently top the public leaderboards. OpenAI's GPT-4V (September 2023) was first; its successor GPT-4o (May 2024) added native multimodal training and reduced latency by roughly 4x. Google's Gemini 1.5 Pro (February 2024) and Gemini 2.0 Flash/Pro (December 2024) brought million-token context windows. Anthropic's Claude 3 Opus (March 2024), Claude 3.5 Sonnet (June 2024), and Claude 3.7 Sonnet (early 2025) emphasized accuracy and conservative behavior on edge cases.

Architecturally, all three are believed to follow the same pattern as open-source VLMs: a vision encoder (likely a large CLIP/SigLIP variant), a connector, and a frontier LLM. The differences lie in undisclosed details: encoder size and pretraining corpus, connector design, training-data composition, and post-training (RLHF, constitutional AI, etc.). Public information is limited, but model cards and system reports give enough signal to characterize behavior.

22.4.2 GPT-4V and GPT-4o: Capabilities

GPT-4V was the first frontier VLM and set the benchmark for the field. GPT-4o (omni) replaced it as the default in 2024 and is the variant most production systems target as of 2026. GPT-4o's published MMMU score is 69.1, MathVista is 63.8, ChartQA is 85.7, and DocVQA is 92.8.

The API surface is straightforward. Images are passed inline as base64-encoded data URIs or as URLs (signed S3 URLs, public images on the web). The "detail" parameter selects between "low" (85 tokens, 512x512 max effective resolution) and "high" (multiple 512x512 tiles, up to 16 tiles per image). High-detail processing increases token cost by 4x but is essential for dense documents and small-text content. The model supports interleaved multi-image inputs within a single user turn, which enables tasks like "compare these two charts" or "describe the differences in this before-and-after pair".
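As a minimal sketch of the request shape described above, the helper below builds one Chat Completions-style user message with an inline base64 image. The function name `build_image_message` is hypothetical, and the `image_url` block with a `detail` field reflects OpenAI's documented message format at the time of writing; check the current API reference before relying on it. No network call is made.

```python
import base64


def build_image_message(image_bytes: bytes, prompt: str,
                        detail: str = "high",
                        media_type: str = "image/png") -> dict:
    """Build one user message carrying a text prompt and an inline image.

    Hypothetical helper: only the two detail levels discussed in the
    text ("low" and "high") are accepted here.
    """
    if detail not in ("low", "high"):
        raise ValueError(f"unsupported detail level: {detail}")
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {
                "type": "image_url",
                "image_url": {
                    # Inline data URI; a signed or public URL also works here.
                    "url": f"data:{media_type};base64,{encoded}",
                    "detail": detail,
                },
            },
        ],
    }


# Placeholder bytes stand in for a real image file.
message = build_image_message(b"placeholder", "Summarize this receipt.", detail="low")
print(message["content"][1]["image_url"]["detail"])
```

Keeping the payload builder separate from the network call makes it easy to unit-test the `detail` choice, which is the main cost lever for dense documents.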

GPT-4o's distinguishing strengths are arithmetic and consistency checks (catching that a receipt's subtotal does not equal sum-of-line-items), tabular data extraction with implicit type inference, and detection-style spatial reasoning ("how many people are wearing red shirts?"). Its known weaknesses are very small text (font sizes below 8pt drop accuracy by 15-25%), low-light or low-contrast images, and any task requiring counting beyond 10-15 items (where the model tends to round to suspiciously clean numbers).

22.4.3 Gemini: The Context Window Advantage

Google's Gemini 1.5 Pro and Gemini 2.0 Flash/Pro carve out a unique frontier with their 1-2 million token context windows. The practical implications for vision tasks are dramatic: Gemini can process up to 3600 images in a single API call (each image taking about 258 tokens at the default resolution), making it the only frontier VLM capable of full-document or full-video analysis in a single pass.
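The image-count arithmetic can be sketched in a few lines. The 258 tokens per image and the 3,600-image cap come from the paragraph above; the function name, the `reserved_tokens` parameter, and the 128,000-token comparison window are illustrative assumptions, not vendor figures.

```python
def images_per_call(context_tokens: int,
                    tokens_per_image: int = 258,
                    reserved_tokens: int = 0,
                    api_image_cap: int = 3600) -> int:
    """How many images fit in one call, given a token budget and an API cap."""
    usable = max(context_tokens - reserved_tokens, 0)
    return min(usable // tokens_per_image, api_image_cap)


# A 1M-token window could hold 3,875 images by tokens alone,
# so the API cap is the binding limit.
print(images_per_call(1_000_000))                           # 3600
# Reserving 100k tokens for instructions and output makes tokens the limit.
print(images_per_call(1_000_000, reserved_tokens=100_000))  # 3488
# Hypothetical 128k-token window for comparison.
print(images_per_call(128_000))                             # 496
```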

On standard single-image benchmarks, Gemini 2.0 Pro scores 72.0 on MMMU, 71.4 on MathVista, and 87.8 on ChartQA, slightly above GPT-4o. On long-context multi-image tasks, Gemini extends its lead substantially: on a benchmark of cross-image reasoning over 100-image inputs (Google DeepMind internal), Gemini 1.5 Pro reaches 78.4% accuracy versus 31.2% for GPT-4o with chunked retrieval.

Gemini 2.0 Flash is the cost leader by a substantial margin. At $0.075 per 1M input tokens and $0.30 per 1M output tokens (December 2025 pricing), processing a single image with a 500-token response costs about $0.0002. For high-volume document workloads, Gemini Flash is the default choice when the marginal accuracy difference versus GPT-4o or Claude 3.5 (typically 1-3 points) does not justify a cost premium of roughly 30-40x (see Table 22.4.1a).
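The per-request figure can be reproduced directly from the quoted prices. This sketch assumes the image costs about 258 input tokens (the default-resolution figure from earlier in this section) and ignores the few tokens of prompt text; the function name is hypothetical.

```python
def request_cost_usd(input_tokens: int, output_tokens: int,
                     input_price_per_m: float,
                     output_price_per_m: float) -> float:
    """Cost of one request, given per-million-token prices in USD."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


# One image (~258 tokens) in, 500 tokens out, at the prices quoted above.
cost = request_cost_usd(258, 500, input_price_per_m=0.075, output_price_per_m=0.30)
print(f"per request:     ${cost:.6f}")        # $0.000169
print(f"per 1k requests: ${cost * 1000:.2f}")  # $0.17
```

Output tokens account for most of the cost here, so capping response length is usually a bigger lever than shrinking the image.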

22.4.4 Claude Vision: Precision and Tone

Anthropic's Claude 3.5 Sonnet and Claude 3.7 Sonnet have a different profile. On raw benchmarks, Claude 3.5 Sonnet sits at MMMU 68.3, MathVista 67.7, ChartQA 90.8, and DocVQA 95.2. The 3.7 release pushed MMMU to 71.8, MathVista to 73.4, ChartQA to 91.3, and DocVQA to 96.0, the highest DocVQA score in Table 22.4.1a as of January 2026.

Claude's distinguishing behavior is conservatism. On ambiguous or partially occluded content, Claude is more likely to refuse to commit to a single interpretation, instead asking clarifying questions or returning a hedged answer with explicit caveats. For legal contract analysis, medical records, and regulatory documents, this conservatism is preferred even though it lowers raw accuracy numbers. For consumer-facing applications where a confident wrong answer is better than a hedged correct one, GPT-4o is usually preferred instead.

Claude's API supports up to 100 images per request (a recent expansion from 20), making it competitive with Gemini for multi-image tasks at typical workload sizes. The maximum effective single-image resolution is approximately 1568x1568, after which the model downsamples.

Model               MMMU   MathVista   ChartQA   DocVQA   $/1k requests
GPT-4o              69.1   63.8        85.7      92.8     $8.50
GPT-4o-mini         59.4   56.7        82.4      89.3     $0.50
Gemini 2.0 Pro      72.0   71.4        87.8      93.1     $2.90
Gemini 2.0 Flash    67.7   68.5        85.4      91.2     $0.20
Claude 3.5 Sonnet   68.3   67.7        90.8      95.2     $6.40
Claude 3.7 Sonnet   71.8   73.4        91.3      96.0     $6.40
Table 22.4.1a: Frontier VLM benchmark and pricing matrix, January 2026. Per-request cost assumes a single image (high detail) plus 500 output tokens. Gemini 2.0 Pro leads narrowly on MMMU; Claude 3.7 Sonnet leads on MathVista, ChartQA, and DocVQA; Gemini 2.0 Flash dominates on cost.

Key Insight: Benchmarks Hide as Much as They Reveal

The 1-3 point differences between frontier VLMs on public benchmarks are noisy. Test-set contamination (the training data of these models almost certainly includes many of the benchmark examples), reporting variance across runs, and prompt sensitivity all contribute uncertainty larger than the apparent gaps. The right way to choose between frontier VLMs for a specific application is to run a representative held-out sample (200-1000 examples) and compare on the actual task. Public-benchmark differences rarely survive contact with production data distributions.
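One way to run the held-out comparison suggested above is a paired bootstrap over per-example correctness. This is a minimal sketch, not a prescribed methodology: the function name is hypothetical, and the scores below are synthetic placeholders rather than real measurements.

```python
import random


def paired_bootstrap_ci(correct_a, correct_b, n_resamples=2000, seed=0):
    """Return a 95% bootstrap CI for accuracy(A) - accuracy(B).

    correct_a and correct_b are 0/1 lists scored on the SAME examples.
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("need two equal-length, non-empty result lists")
    rng = random.Random(seed)
    n = len(correct_a)
    # Per-example difference: +1 if only A is right, -1 if only B is right.
    deltas = [a - b for a, b in zip(correct_a, correct_b)]
    gaps = []
    for _ in range(n_resamples):
        sample = [deltas[rng.randrange(n)] for _ in range(n)]
        gaps.append(sum(sample) / n)
    gaps.sort()
    lower = gaps[int(0.025 * n_resamples)]
    upper = gaps[int(0.975 * n_resamples) - 1]
    return lower, upper


# Synthetic placeholder scores for 300 held-out examples (not real data).
rng = random.Random(42)
model_a = [1 if rng.random() < 0.86 else 0 for _ in range(300)]
model_b = [1 if rng.random() < 0.84 else 0 for _ in range(300)]
low, high = paired_bootstrap_ci(model_a, model_b)
print(f"95% CI for accuracy gap (A - B): [{low:+.3f}, {high:+.3f}]")
```

If the interval straddles zero, the sample does not separate the two models, and a larger or more representative sample is needed before the gap should drive a vendor choice.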

22.4.5 Prompt Engineering for VLMs

VLM prompting borrows from text-only LLM prompting but adds vision-specific patterns. Six techniques transfer directly and provide measurable gains.

The first is explicit structure. Asking "What is in this image?" produces unstructured prose. Asking "List the objects visible in this image as a JSON array with fields {name, count, confidence}" produces parseable output. The structural cue helps every frontier VLM.

The second is chain-of-thought (CoT) for visual reasoning. Adding "Think step by step before answering" or "First describe what you see, then answer the question" raises accuracy on MathVista by 4-8 points on most frontier models. The mechanism mirrors text CoT: forcing the model to externalize intermediate reasoning catches mistakes that would otherwise be lost in a single forward pass.

The third is multi-view prompting. For visual reasoning tasks, asking the model to describe the image from multiple perspectives ("What do you see if you focus on the foreground? On the background? On the colors?") adds 2-4 points on MMMU and is particularly effective on Claude 3.5 Sonnet.

The fourth is reference images (few-shot). Providing 1-3 example (image, answer) pairs before the query image adds 3-7 points on most tasks. The examples can be hard-coded or retrieved from a small library; the key is that they show the desired output format and reasoning style.

The fifth is negative space prompting: explicitly telling the model what NOT to focus on. "Ignore any text in the corners. Focus only on the central chart." can reduce hallucinations on cluttered images.

The sixth is the verification pass: after extracting structured data, run a second call that asks the model to verify specific fields against the original image. This catches the 1-3% of cases where the first pass produced a plausible but wrong answer.
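The reference-image (few-shot) pattern can be sketched as a message builder. The sketch assumes Anthropic-style image content blocks and encodes each example as a user turn followed by an assistant turn; the helper names are hypothetical, and other vendors use different block shapes. No API call is made.

```python
import base64


def image_block(image_bytes: bytes, media_type: str = "image/png") -> dict:
    """Wrap raw image bytes in an Anthropic-style base64 image block."""
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": media_type,
            "data": base64.b64encode(image_bytes).decode("ascii"),
        },
    }


def few_shot_messages(examples, query_image: bytes, question: str) -> list:
    """Build a message list from (image_bytes, answer_text) example pairs.

    Each example becomes a user turn (image + question) followed by an
    assistant turn holding the reference answer; the query image goes last.
    """
    messages = []
    for example_image, answer in examples:
        messages.append({
            "role": "user",
            "content": [image_block(example_image),
                        {"type": "text", "text": question}],
        })
        messages.append({"role": "assistant", "content": answer})
    messages.append({
        "role": "user",
        "content": [image_block(query_image),
                    {"type": "text", "text": question}],
    })
    return messages


# Placeholder bytes stand in for real image files.
demo = few_shot_messages(
    examples=[(b"example-1", '{"chart_type": "bar"}'),
              (b"example-2", '{"chart_type": "line"}')],
    query_image=b"query",
    question="Return the chart type as JSON.",
)
print([m["role"] for m in demo])
```

Writing the reference answers in exactly the target output format matters more than the number of examples, since the examples mainly teach format and reasoning style.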

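import base64
import mimetypes
from pathlib import Path

from anthropic import Anthropic

# Reads the API key from the ANTHROPIC_API_KEY environment variable.
client = Anthropic()


def encode_image(path: str) -> dict:
    """Return an Anthropic image content block for the file at `path`."""
    # Infer the media type from the file extension; fall back to PNG.
    media_type = mimetypes.guess_type(path)[0] or "image/png"
    data = base64.b64encode(Path(path).read_bytes()).decode()
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": media_type,
            "data": data,
        },
    }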

# Chain-of-thought + structure + verification pattern
response = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                encode_image("chart.png"),
                {
                    "type": "text",
                    "text": (
                        "You will extract data from a chart in three steps.\n\n"
                        "Step 1: Describe the chart type, axes, and series.\n"
                        "Step 2: Output the extracted data as JSON with shape\n"
                        '  {"chart_type": str, "x_axis_label": str,\n'
                        '   "series": [{"name": str,\n'
                        '              "data": [{"x": ..., "y": float}]}]}.\n\n'
                        "Step 3: Verify that all numeric values match the visible "
                        "chart markers. If any value is uncertain, mark it null."
                    ),
                },
            ],
        }
    ],
)
print(response.content[0].text)
Output:
Step 1: This is a grouped bar chart showing quarterly revenue for two product lines (Hardware, Services) across Q1-Q4 2024. The y-axis is "Revenue ($M)", ranging 0-800.
Step 2: {"chart_type": "grouped_bar", "x_axis_label": "Quarter", "series": [
  {"name": "Hardware", "data": [{"x": "Q1", "y": 412}, {"x": "Q2", "y": 445}, {"x": "Q3", "y": 478}, {"x": "Q4", "y": 502}]},
  {"name": "Services", "data": [{"x": "Q1", "y": 198}, {"x": "Q2", "y": 232}, {"x": "Q3", "y": 275}, {"x": "Q4", "y": 318}]}]}
Step 3: All values verified against visible markers and axis ticks.
Code Fragment 22.4.1b: Chain-of-Thought + structured-output + verification prompting pattern with Claude 3.7 Sonnet. The three explicit steps in the prompt force the model to externalize its reasoning, produce a parseable JSON output, and self-verify before committing. This compound pattern adds 6-12 points of accuracy on chart-extraction benchmarks compared with a naive "extract data" prompt.
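The listing above folds verification into a single prompt. The separate verification pass described in Section 22.4.5 (a second call that re-checks specific fields against the image) might be sketched as follows. Both helper names and the "match / mismatch / unreadable" verdict vocabulary are assumptions made for illustration; the second API call itself is omitted.

```python
import json


def build_verification_prompt(extracted: dict, fields: list) -> str:
    """Prompt text for a second call that re-checks selected fields."""
    lines = [f"- {name}: {json.dumps(extracted.get(name))}" for name in fields]
    return (
        "Compare each extracted value below against the attached image.\n"
        + "\n".join(lines)
        + "\n\nReply with only a JSON object mapping each field name to "
        '"match", "mismatch", or "unreadable".'
    )


def fields_needing_review(verdict_json: str, fields: list) -> list:
    """Return fields the verifier did not explicitly confirm as a match."""
    verdicts = json.loads(verdict_json)
    return [name for name in fields if verdicts.get(name) != "match"]


extracted = {"total": 41.5, "date": "2024-03-07", "vendor": "Acme"}
fields = ["total", "date", "vendor"]
print(build_verification_prompt(extracted, fields))

# A hand-written stand-in for the model's reply to the second call.
reply = '{"total": "match", "date": "mismatch"}'
print(fields_needing_review(reply, fields))  # ['date', 'vendor']
```

Treating a missing verdict the same as a mismatch is a deliberately conservative choice: a field is only accepted when the verifier explicitly confirms it.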

22.4.6 Multi-Image and Interleaved Inputs

Frontier VLMs handle multi-image inputs through interleaved text-and-image messaging. The canonical pattern is to alternate image content blocks and text content blocks within a single user turn: "Image 1: [img]. Image 2: [img]. Compare the two charts and identify three differences." All three frontier vendors support this pattern, though with different limits: GPT-4o supports up to 50 images per call, Claude 3.7 up to 100, and Gemini 2.0 up to 3,600.

For tasks that compare or aggregate across many images (slide-deck summarization, batch document classification, video keyframe analysis), the multi-image API is essential and replaces the need for per-image API calls with state passed in conversation context. The cost savings are substantial: a single 50-image call is typically 3-5x cheaper than 50 single-image calls because the per-call overhead is amortized.
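The interleaving pattern and the per-call limits can be sketched together. The limits are the ones quoted above; the dictionary keys are hypothetical labels rather than real model IDs, and the text-block shape is a vendor-neutral placeholder that would need adapting to each SDK.

```python
# Per-call image limits as quoted in this section (keys are illustrative labels).
MAX_IMAGES_PER_CALL = {"gpt-4o": 50, "claude-3.7": 100, "gemini-2.0": 3600}


def interleave(image_blocks: list, instruction: str, vendor: str) -> list:
    """Alternate "Image N:" labels with image blocks, then append the task."""
    limit = MAX_IMAGES_PER_CALL[vendor]
    if len(image_blocks) > limit:
        raise ValueError(f"{vendor} accepts at most {limit} images per call")
    content = []
    for index, block in enumerate(image_blocks, start=1):
        content.append({"type": "text", "text": f"Image {index}:"})
        content.append(block)
    content.append({"type": "text", "text": instruction})
    return content


def batches(items: list, size: int) -> list:
    """Split a long image list into chunks that respect a per-call limit."""
    return [items[i:i + size] for i in range(0, len(items), size)]


# Placeholder dicts stand in for real image content blocks.
slides = [{"type": "image", "id": n} for n in range(120)]
for chunk in batches(slides, MAX_IMAGES_PER_CALL["gpt-4o"]):
    content = interleave(chunk, "Summarize these slides.", "gpt-4o")
    print(len(chunk), "images ->", len(content), "content blocks")
```

Numbering the images in the text labels gives the model stable names to refer to in its answer, which makes "identify three differences" style outputs easier to check.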

22.4.7 When Frontier APIs Are the Right Answer

The 2026 decision matrix for choosing between frontier VLM APIs and self-hosted open-source VLMs depends on four factors. The first is task complexity: for tasks requiring fluent multi-step reasoning, frontier APIs maintain a 3-8 point edge that often matters operationally. The second is throughput: below 10k requests/day, frontier APIs are cheaper than running a dedicated GPU; above 1M requests/day, self-hosted is essentially always cheaper. Between these limits, the answer depends on accuracy and operational tolerance.

The third factor is data sensitivity. Self-hosted models keep data on-premise, which matters for regulated workloads. All three frontier vendors offer enterprise tiers with data residency commitments and zero data retention; for the most sensitive workloads (healthcare PHI, government classified data), self-hosting is still the only acceptable option.

The fourth factor is iteration speed. APIs make rapid prototyping trivial, and feature parity (new model versions, new modality support) ships continuously. Self-hosted deployments require explicit upgrade decisions. For research and exploration, APIs almost always win; for production-critical, high-volume workloads, self-hosted often wins.

Warning: Pricing Volatility

Frontier VLM pricing has fallen by a factor of 5-15 over 18 months as compute-efficiency improvements compound. The prices in Table 22.4.1a are accurate as of January 2026 but will likely be obsolete within six months. Long-term contracts and reserved-capacity pricing can lock in costs for high-volume customers, but spot-pricing volatility means cost-driven decisions should be revisited quarterly. Engineering teams should design their architectures so that VLMs can be swapped at the configuration layer rather than baking specific vendor choices into application code.
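One possible shape for that configuration-layer swap is a small backend registry. Everything named here (`VLMConfig`, `register_backend`, `ask_vlm`, and the `echo` backend) is hypothetical; a real deployment would register one function per vendor SDK behind the same signature.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# A backend takes (image_bytes, prompt, model_name) and returns the answer text.
Backend = Callable[[bytes, str, str], str]
_BACKENDS: Dict[str, Backend] = {}


def register_backend(name: str):
    """Decorator that registers a vendor adapter under a config-visible name."""
    def decorator(fn: Backend) -> Backend:
        _BACKENDS[name] = fn
        return fn
    return decorator


@dataclass(frozen=True)
class VLMConfig:
    backend: str  # which registered adapter to use
    model: str    # model identifier passed through to that adapter


@register_backend("echo")
def echo_backend(image_bytes: bytes, prompt: str, model: str) -> str:
    # Stand-in for a real vendor SDK call, so the sketch runs offline.
    return f"[{model}] {len(image_bytes)} bytes: {prompt}"


def ask_vlm(config: VLMConfig, image_bytes: bytes, prompt: str) -> str:
    """Application code calls this and never imports a vendor SDK directly."""
    if config.backend not in _BACKENDS:
        raise ValueError(f"unknown VLM backend: {config.backend}")
    return _BACKENDS[config.backend](image_bytes, prompt, config.model)


# In practice the config would be loaded from a file or environment variable.
config = VLMConfig(backend="echo", model="demo-model")
print(ask_vlm(config, b"abc", "Describe this image."))
# prints: [demo-model] 3 bytes: Describe this image.
```

With this layout, switching vendors after a price change is a one-line configuration edit plus a re-run of the held-out evaluation, not a code change.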

22.4.8 Key Takeaways


22.4.9 Self-Check

Self-Check Exercises
Q1: Vendor selection: For each of these tasks, pick a frontier VLM and justify the choice: (a) extracting structured data from 10M invoices/month, (b) answering ad-hoc questions about a 500-page legal contract, (c) generating accessibility descriptions for an e-commerce catalog with 200 categories.
Answer:
(a) For 10M invoices a month, choose Gemini 2.0 Flash. At roughly $0.20 per 1k requests it delivers DocVQA 91.2 (within 4-5 points of the frontier) for about $2k per month all-in, far cheaper than GPT-4o or Claude at this volume. Above 1M requests per day the calculus tips further toward self-hosted Qwen2.5-VL, but Flash is the right starting point. (b) For a 500-page contract, choose Gemini 2.0 Pro because its 1-2 million token context can hold the whole document at once, which avoids the chunked retrieval that GPT-4o would need and which collapses accuracy on cross-page reasoning. Claude 3.7 is a strong second if its conservative behavior on ambiguous clauses is desirable. (c) For 200-category e-commerce accessibility, choose GPT-4o or Gemini 2.0 Flash. The task is small-scale (one image at a time), and the consumer-facing tone matters: GPT-4o's confident, fluent descriptions read better than Claude's hedged outputs, and Flash's pricing keeps the catalog affordable.
Q2: Prompt compound effects: A naive "extract data from this chart" prompt scores 72% accuracy. Adding chain-of-thought lifts it to 78%. Adding structured-output format lifts to 82%. Adding verification lifts to 85%. Explain why these gains stack rather than overlap, and predict whether a fifth technique would continue the linear improvement.
Answer:
Each technique fixes a different failure mode, so they target non-overlapping error mass. Chain-of-Thought catches reasoning errors (the model multiplied the wrong axis values); structured output catches parsing errors (the model emits prose that downstream code cannot read); verification catches transcription errors (the model wrote a 7 where the chart showed a 1). Because each pass addresses a distinct error class, the gains stack rather than cancelling, and the empirical 6+4+3=13 point lift reflects almost-independent corrections. A fifth technique (negative-space prompting or few-shot exemplars) typically adds 2-3 more points but the returns diminish rapidly. Once the dominant error modes are covered, further techniques mostly target the same residual cases as existing ones; expect the curve to flatten around 88-90% on chart extraction with current frontier models.
Q3: API economics: Compute the cost crossover point between (a) calling GPT-4o at $8.50/1k requests and (b) self-hosting Qwen2.5-VL-72B at $1.40/1k requests, given that the open-source deployment costs $4000/month in fixed overhead. At what monthly volume does self-hosting break even?
Answer:
Per-request cost difference is $8.50 - $1.40 = $7.10 per 1,000 requests, or $0.0071 per request. The fixed overhead of $4,000 / month is recovered when self-hosting saves that amount on the variable side, so the break-even volume is $4,000 / $0.0071 = roughly 563,000 requests per month, or about 18,800 requests per day. Below that volume, GPT-4o is cheaper because the $4k overhead dominates; above it, self-hosting wins and the gap grows linearly. Section 22.4.7's qualitative framing (frontier APIs cheaper below 10k/day, self-hosted cheaper above 1M/day) brackets this exact number, with the in-between band being where the accuracy delta, data-sensitivity considerations, and operational maturity decide rather than the unit economics.
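The break-even arithmetic in the Q3 answer can be checked with a short script. The prices and overhead are the ones given in the exercise; the function name and the 30-day month are assumptions.

```python
def break_even_requests(api_cost_per_1k: float,
                        self_hosted_cost_per_1k: float,
                        fixed_monthly_overhead: float) -> float:
    """Monthly request volume at which self-hosting matches the API cost."""
    saving_per_request = (api_cost_per_1k - self_hosted_cost_per_1k) / 1000
    if saving_per_request <= 0:
        raise ValueError("self-hosting never breaks even at these prices")
    return fixed_monthly_overhead / saving_per_request


monthly = break_even_requests(8.50, 1.40, 4000)
print(f"break-even: about {monthly:,.0f} requests/month")
print(f"            about {monthly / 30:,.0f} requests/day (30-day month)")
```

Rerunning the function with updated prices is the quickest way to see how far the crossover moves when a vendor cuts its rates.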
What's Next: Evaluating Multimodal Reasoning

Section 22.5 closes the chapter by examining how we measure VLM capability. MMMU, MM-Vet, BLINK, and MathVista are the benchmarks that define the field today, and understanding their structure, strengths, and saturation risks is essential for both reading the literature and evaluating models on your own data. The section also cross-links to Part VIII Chapter 46 (specialized evaluation) for deeper coverage of the evaluation methodology.
