"A frontier VLM beats every specialist model on every benchmark. It also costs forty times more per page. Pick your poison."
A Per-Page-Billing AI Agent
Through 2023, document AI was a domain of specialists: TrOCR for text, LayoutLMv3 for forms, Donut for receipts. By mid-2024, frontier Vision-Language Models (VLMs) matched or exceeded those specialists on every public benchmark. Today, the production question is rarely "can a VLM extract this field?" but rather "is the per-page cost justified for this volume?" This section surveys the four frontier VLM families that matter for document work (GPT-4V/4o, Claude 3.5 Sonnet, Gemini 1.5/2.0, Qwen2.5-VL), introduces structured-output JSON extraction patterns, and compares accuracy and cost on real-world document tasks.
Prerequisites
This section assumes familiarity with layout-aware document models from Section 21.2 and with LLM APIs from Section 11.1. Familiarity with structured-output prompting from Section 12.5 helps when reading the JSON extraction examples.
21.3.1 The VLM Transition
The pivotal moment for document AI was OpenAI's release of GPT-4V in September 2023, which demonstrated that a general-purpose multimodal model could read scanned forms, parse tables, and extract structured data from receipts without any document-specific fine-tuning. Within twelve months, Gemini 1.5 Pro (February 2024), Claude 3 Opus (March 2024), and Qwen2-VL (August 2024) reached similar capability. By early 2026, every frontier lab ships a vision-capable model with strong document understanding by default.
The architectural commonality across these systems is striking. All four follow a "vision encoder + LLM decoder" recipe (covered in detail in Section 35.4): a Vision Transformer or convolutional encoder produces visual tokens, a learned projection layer or cross-attention mechanism aligns them with the LLM's embedding space, and a large decoder generates text autoregressively. The differences are scale, training-data composition, and the specifics of the visual token projection. From the practitioner's perspective, all four expose nearly identical API surfaces: send an image + text prompt, receive text back.
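The data flow of that recipe can be traced with a toy example. The following is a minimal sketch, not any vendor's architecture: the patch size, feature widths, random weights, and the "encoder" that merely flattens each patch are all illustrative assumptions chosen so the shapes are easy to follow.

```python
import random

PATCH = 2      # hypothetical patch side length in pixels
D_VISION = 4   # hypothetical vision feature width (PATCH * PATCH here)
D_MODEL = 6    # hypothetical LLM embedding width


def encode_patches(image):
    """Stand-in for a vision encoder: one flattened feature vector per patch."""
    height, width = len(image), len(image[0])
    feats = []
    for r in range(0, height, PATCH):
        for c in range(0, width, PATCH):
            feats.append([image[r + i][c + j]
                          for i in range(PATCH) for j in range(PATCH)])
    return feats


def project(feats, weight):
    """Projection layer: map each vision feature into the LLM embedding space."""
    return [[sum(f[k] * weight[k][m] for k in range(D_VISION))
             for m in range(D_MODEL)] for f in feats]


rng = random.Random(0)
weight = [[rng.uniform(-1, 1) for _ in range(D_MODEL)] for _ in range(D_VISION)]
image = [[rng.random() for _ in range(4)] for _ in range(6)]  # 6x4 toy "page"

visual_tokens = project(encode_patches(image), weight)
text_tokens = [[0.0] * D_MODEL for _ in range(3)]  # placeholder prompt embeddings
decoder_input = visual_tokens + text_tokens        # sequence the decoder would read

print(len(visual_tokens), len(decoder_input), len(decoder_input[0]))
```

The point of the sketch is only that the decoder sees one sequence of same-width vectors, some derived from pixels and some from text. Real systems replace both stand-ins with trained networks.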
21.3.2 GPT-4V and GPT-4o: Document Strengths
OpenAI's vision models excel at three document tasks: complex form extraction (invoices with nested line items, IRS tax forms), receipt parsing with arithmetic verification, and table-from-image reconstruction. On a 2024 benchmark by Microsoft Research evaluating six VLMs on 800 redacted enterprise invoices, GPT-4o achieved 94.2% field-level accuracy compared with 91.6% for LayoutLMv3-Large and 87.4% for Claude 3 Opus. The cost differential was 65x: GPT-4o at $0.0085 per page versus LayoutLMv3 at $0.00013 per page on a self-hosted A100.
GPT-4o's strongest individual capability is implicit numeric reasoning. When a receipt's listed total ($47.20) is inconsistent with the sum of line items ($47.80), GPT-4o will flag the discrepancy and report the computed sum, the listed total, and a confidence about which is the typo. This kind of cross-field semantic check is essentially impossible for token-classification models like LayoutLMv3 and would require a separate rules engine plus an arithmetic checker downstream.
GPT-4o's weak spots are dense table extraction (where it produces correct row/column structure but occasionally hallucinates cell values that look plausible), low-resolution scans below 200 DPI (where character confusion increases sharply), and any document where output stability matters. Two runs of GPT-4o on the same complex invoice can produce JSON with reordered keys or paraphrased category labels, which breaks downstream consumers that expect deterministic schemas.
21.3.3 Claude Vision: Precision and Tone
Anthropic's Claude 3.5 Sonnet and Claude 3.7 Sonnet have a different profile. Claude's accuracy on simple document tasks (single-receipt extraction, basic form filling) is comparable to GPT-4o, but Claude tends to be more conservative on edge cases. When a field is partially occluded or genuinely ambiguous, Claude is more likely to return null with an explanation than to guess a plausible value. For high-stakes domains (legal contracts, medical records, regulatory filings), this conservatism is preferred even though it lowers raw accuracy numbers.
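A pipeline can only benefit from this conservatism if it treats null as a signal rather than an error. The sketch below shows one way to route null fields to human review; the `FieldResult` shape, the `note` field, and the `route` helper are hypothetical names introduced for illustration, not part of any vendor API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FieldResult:
    name: str
    value: Optional[str]
    note: str = ""  # hypothetical: the model's explanation for a null


def route(fields, critical):
    """Accept populated fields; queue null critical fields for human review."""
    accepted, review_queue = {}, []
    for f in fields:
        if f.value is None and f.name in critical:
            review_queue.append((f.name, f.note))
        else:
            accepted[f.name] = f.value
    return accepted, review_queue


fields = [
    FieldResult("patient_id", "A-10293"),
    FieldResult("dosage", None, "value partially covered by a stamp"),
    FieldResult("fax_number", None, "not present on page"),
]
accepted, review_queue = route(fields, critical={"patient_id", "dosage"})
print(accepted)
print(review_queue)
```

Non-critical nulls pass through as nulls, while critical ones stop the record before it reaches a downstream system.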
Claude 3.5 Sonnet also leads on document VQA reasoning benchmarks: 0.896 ANLS on DocVQA, the highest published score for a closed-source VLM as of January 2026. The strength is on multi-step questions like "What was the percent change in R&D spending between 2022 and 2023?" where the model has to locate two specific cells in a financial table, parse the values, and compute the ratio.
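For readers unfamiliar with the metric, ANLS (Average Normalized Levenshtein Similarity) gives partial credit for near-miss answers and zero credit once the normalized edit distance reaches a threshold, conventionally 0.5. The following is a minimal sketch of the standard definition; the lowercasing and whitespace stripping are a common normalization and an assumption here, and official evaluation scripts may differ in detail.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def anls(predictions, references, tau=0.5):
    """Mean over questions of the best similarity against any reference answer."""
    scores = []
    for pred, golds in zip(predictions, references):
        best = 0.0
        for gold in golds:
            p, g = pred.strip().lower(), gold.strip().lower()
            nl = levenshtein(p, g) / max(len(p), len(g), 1)
            best = max(best, 1.0 - nl if nl < tau else 0.0)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0


preds = ["12.5%", "12.5", "wrong"]
golds = [["12.5%"], ["12.5%"], ["12.5%"]]
score = anls(preds, golds)
print(round(score, 3))
```

In this toy set the exact match scores 1.0, the answer missing the percent sign scores 0.8, and the unrelated answer scores 0, so the mean is 0.6.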
For production document pipelines, run-to-run output stability is often more valuable than peak per-pass accuracy. A model that scores 94% with 0.3% variance is operationally superior to one that scores 96% with 4% variance, because the variance translates into downstream consumer failures (schema mismatches, broken database loads, alarms triggered by spurious changes). Benchmarks that report only a mean accuracy hide this difference.
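Run-to-run stability is straightforward to measure once outputs are canonicalized. The sketch below is one possible approach, with `canonical` and `agreement_rate` as hypothetical helper names: it serializes each run with sorted keys so that key reordering does not count as disagreement, then reports the share of runs that match the most common output.

```python
import json
from collections import Counter


def canonical(doc: dict) -> str:
    """Serialize with sorted keys so key order does not affect comparison."""
    return json.dumps(doc, sort_keys=True, separators=(",", ":"))


def agreement_rate(runs) -> float:
    """Fraction of runs identical to the most common canonical output."""
    counts = Counter(canonical(r) for r in runs)
    return counts.most_common(1)[0][1] / len(runs)


runs = [
    {"total": 47.2, "vendor": "Acme"},
    {"vendor": "Acme", "total": 47.2},       # reordered keys: same content
    {"total": 47.2, "vendor": "ACME Corp"},  # paraphrased label: real drift
]
rate = agreement_rate(runs)
print(round(rate, 3))
```

Canonicalization absorbs the harmless variation (key order) and leaves the harmful kind (changed values) visible, which is the part a mean-accuracy benchmark does not report.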
21.3.4 Gemini: The Long-Context Advantage
Google's Gemini 1.5 Pro and Gemini 2.0 Flash carve out a unique position with their 1-2 million token context windows. The practical implication for document work is the ability to process entire books, multi-hundred-page contracts, or full court case files in a single API call rather than chunking. On a benchmark by Google DeepMind testing 50 contracts averaging 87 pages each, Gemini 1.5 Pro answered 91% of cross-reference questions correctly versus 67% for a chunked-and-retrieved baseline using GPT-4o.
Gemini's other strength is multilingual document handling. On the XFUND multilingual form benchmark (Chinese, Japanese, Spanish, French, Italian, German, Portuguese), Gemini 1.5 Pro averages 84.3 F1 versus 78.1 for GPT-4o and 71.4 for Claude 3.5 Sonnet. This likely reflects Google's larger non-English pretraining corpus.
Cost-wise, Gemini 2.0 Flash is the cheapest frontier VLM by a substantial margin: about $0.00007 in text tokens plus $0.00015 per image, or roughly $0.0002 per page in total. That is roughly 40x cheaper than GPT-4o and 30x cheaper than Claude 3.5 Sonnet. For high-volume back-office document processing, Gemini Flash's cost-accuracy frontier is hard to beat.
21.3.5 Qwen-VL: The Open-Source Frontier
Alibaba's Qwen-VL series is the strongest open-weight VLM as of early 2026. Qwen2.5-VL-72B (released January 2025) reaches DocVQA ANLS of 0.881 and CORD F1 of 96.4, within 2-3 points of the closed-source frontier and substantially ahead of LLaVA, BLIP-3, and Pixtral on document benchmarks specifically.
The model was trained with a deliberate document-AI emphasis: roughly 18% of the supervised fine-tuning data came from synthetic document tasks generated by rendering DocVQA-style questions over PubLayNet and DocLayNet pages. This is visible in the model's behavior: Qwen2.5-VL is unusually willing to emit structured JSON, parse tables into markdown, and reason about layout positions like "the box to the right of the QR code".
The deployment story is the major advantage. Qwen2.5-VL-72B runs on 2x H100 (160 GB total VRAM) with vLLM at 1.2 pages/second, which translates to a per-page cost of about $0.0014 on AWS H100 spot pricing. That is about 7x the per-page price of Gemini 2.0 Flash but roughly half that of Gemini 1.5 Pro, and it comes with no rate limits and full data residency. The 7B variant runs on a single RTX 4090 at 4 pages/second with only minor accuracy degradation (about 4 F1 points on DocVQA).
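The per-page figure follows directly from throughput and the hourly price of the GPU node. The sketch below works the arithmetic in both directions; the function names are hypothetical, the hourly price is back-computed from the numbers in the text rather than taken from a price list, and the calculation assumes the node is busy 100% of the time.

```python
def cost_per_page(node_usd_per_hour: float, pages_per_second: float,
                  utilization: float = 1.0) -> float:
    """Per-page cost of a self-hosted node at a given sustained utilization."""
    pages_per_hour = pages_per_second * 3600 * utilization
    return node_usd_per_hour / pages_per_hour


def implied_hourly_price(usd_per_page: float, pages_per_second: float) -> float:
    """Hourly node price implied by a per-page cost at full utilization."""
    return usd_per_page * pages_per_second * 3600


# Figures quoted in the text: $0.0014 per page at 1.2 pages/second.
hourly = implied_hourly_price(0.0014, 1.2)
print(round(hourly, 2))                             # implied price for the 2-GPU node
print(round(cost_per_page(hourly, 1.2, 0.5), 4))    # same node, half-idle
```

At 1.2 pages/second the node handles 4,320 pages per hour, so $0.0014 per page implies roughly $6.05 per hour for both GPUs together. At 50% utilization the effective cost doubles to $0.0028 per page, which is why sustained volume matters for the self-hosting case.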
| Model | DocVQA ANLS | FUNSD F1 | Per-page Cost | Deployment |
|---|---|---|---|---|
| GPT-4o | 0.881 | 92.4 | $0.0085 | API only |
| Claude 3.5 Sonnet | 0.896 | 93.6 | $0.0064 | API only |
| Gemini 2.0 Flash | 0.852 | 91.2 | $0.0002 | API only |
| Gemini 1.5 Pro | 0.871 | 92.8 | $0.0029 | API only |
| Qwen2.5-VL-72B | 0.881 | 92.1 | $0.0014 | self-hosted |
| LayoutLMv3-Large | 0.832 | 92.1 | $0.00013 | self-hosted |
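One way to read the table is to ask which models are not beaten on both cost and accuracy by some other row. The sketch below computes that cost-accuracy (Pareto) frontier from the DocVQA ANLS and per-page cost columns exactly as listed above; it ignores FUNSD, deployment constraints, and everything else a real decision would weigh.

```python
# (model, DocVQA ANLS, per-page cost in USD), copied from the table above.
MODELS = [
    ("GPT-4o", 0.881, 0.0085),
    ("Claude 3.5 Sonnet", 0.896, 0.0064),
    ("Gemini 2.0 Flash", 0.852, 0.0002),
    ("Gemini 1.5 Pro", 0.871, 0.0029),
    ("Qwen2.5-VL-72B", 0.881, 0.0014),
    ("LayoutLMv3-Large", 0.832, 0.00013),
]


def pareto_frontier(models):
    """Keep models that no other model matches or beats on both axes."""
    frontier = []
    for name, anls, cost in models:
        dominated = any(
            o_anls >= anls and o_cost <= cost and (o_anls > anls or o_cost < cost)
            for _, o_anls, o_cost in models
        )
        if not dominated:
            frontier.append((name, anls, cost))
    return sorted(frontier, key=lambda m: m[2])  # cheapest first


frontier = pareto_frontier(MODELS)
for name, anls, cost in frontier:
    print(f"{name}: ANLS={anls}, ${cost}/page")
```

On these two columns alone, the frontier is LayoutLMv3-Large, Gemini 2.0 Flash, Qwen2.5-VL-72B, and Claude 3.5 Sonnet. GPT-4o and Gemini 1.5 Pro each have a row that is both cheaper and at least as accurate on DocVQA.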
21.3.6 PaperQA and Domain-Specific RAG over Documents
PaperQA (Lála et al., FutureHouse, 2024) is a representative example of a document-AI pattern that combines a VLM with retrieval over a corpus of PDFs. The system takes a scientific question, retrieves the top 10-20 most relevant document chunks (using a vector index over text + table embeddings), feeds them to a VLM with the original page images, and produces a cited answer. On the LitQA benchmark (250 PhD-level biology questions answerable only from the recent literature), PaperQA scores 69.5% versus 23.0% for GPT-4 with web search alone.
The lesson generalizes. A VLM alone can read a single document well, but enterprises rarely have just one document. A typical contracts-search pipeline runs hybrid retrieval (BM25 + dense embedding) over OCR'd text, retrieves the top-k pages as images, and lets a VLM read the pages directly rather than passing pre-OCR'd text. This avoids the OCR error cascade and lets the VLM exploit layout, signatures, and other visual cues that text-only retrieval discards.
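The hybrid retrieval step needs some way to merge the BM25 ranking with the dense-embedding ranking. The text does not say which fusion method such pipelines use; the sketch below uses reciprocal rank fusion (RRF), a common choice, purely as an illustration. The page identifiers and the constant `k=60` are assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60, top_k=3):
    """Merge ranked lists: each list contributes 1 / (k + rank) per page."""
    scores = {}
    for ranking in rankings:
        for rank, page_id in enumerate(ranking, start=1):
            scores[page_id] = scores.get(page_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by page id for repeatability.
    return sorted(scores, key=lambda p: (-scores[p], p))[:top_k]


bm25_ranking = ["p12", "p03", "p07", "p21"]    # hypothetical lexical results
dense_ranking = ["p07", "p12", "p30", "p03"]   # hypothetical embedding results

top_pages = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(top_pages)
```

The fused list contains page identifiers only. In the pipeline described above, the next step would fetch those pages as images and hand them to the VLM, rather than passing along the OCR text used for retrieval.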
21.3.7 Structured Output Extraction
The single most useful production pattern with document VLMs is structured-output extraction: given a document image and a JSON schema, produce a JSON instance that conforms to the schema. Three patterns dominate in 2026.
The first is OpenAI's response_format with strict JSON Schema, which guarantees the output parses as valid JSON and conforms to the supplied schema. Anthropic's tool-use mechanism plays the same role. Google's Gemini supports a similar response_schema parameter. For open-source models, the outlines and instructor libraries enforce schemas via constrained decoding.
The second pattern is few-shot exemplars. Even with schema-constrained decoding, a model needs hints about what each field should contain. Including 2-3 worked examples (image + correct JSON output) in the prompt raises accuracy by 4-8 F1 points on most document tasks. The examples can be hard-coded in the prompt or retrieved dynamically from a small library of canonical cases.
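In chat-style APIs, few-shot exemplars are typically expressed as alternating user and assistant turns that precede the real request. The sketch below only builds the message list in the OpenAI chat format and makes no API call. `build_messages` and `image_part` are hypothetical helpers, and the byte strings are placeholders rather than real PNG data.

```python
import base64
import json

INSTRUCTION = "Extract all invoice fields."


def image_part(png_bytes: bytes) -> dict:
    """Wrap raw PNG bytes as a data-URL image content part."""
    b64 = base64.b64encode(png_bytes).decode()
    return {"type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{b64}"}}


def build_messages(system_prompt, exemplars, target_png):
    """exemplars: list of (png_bytes, expected_json_dict) worked examples."""
    messages = [{"role": "system", "content": system_prompt}]
    for png_bytes, expected in exemplars:
        messages.append({"role": "user", "content": [
            {"type": "text", "text": INSTRUCTION}, image_part(png_bytes)]})
        messages.append({"role": "assistant", "content": json.dumps(expected)})
    # The real document goes last, phrased exactly like the exemplar requests.
    messages.append({"role": "user", "content": [
        {"type": "text", "text": INSTRUCTION}, image_part(target_png)]})
    return messages


exemplars = [
    (b"placeholder-1", {"invoice_number": "INV-001", "total": 120.0}),
    (b"placeholder-2", {"invoice_number": "INV-002", "total": 87.5}),
]
messages = build_messages("You extract structured data from invoice images.",
                          exemplars, b"placeholder-target")
print([m["role"] for m in messages])
```

For dynamic retrieval, the `exemplars` argument would be filled from a library of canonical cases instead of a hard-coded list. Each exemplar image adds to the token bill on every call.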
The third pattern is multi-pass verification. After extracting the initial JSON, a second call asks the model to verify specific fields against the document. This is particularly valuable for monetary totals, dates, and other fields where errors are operationally expensive.
```python
import base64
from pathlib import Path

from openai import OpenAI
from pydantic import BaseModel, Field


# 1. Define the target schema via Pydantic
class LineItem(BaseModel):
    description: str
    quantity: float
    unit_price: float
    total: float


class Invoice(BaseModel):
    invoice_number: str
    issue_date: str = Field(description="ISO 8601 date YYYY-MM-DD")
    vendor_name: str
    vendor_tax_id: str | None  # Nullable: lets the model report "not visible"
    line_items: list[LineItem]
    subtotal: float
    tax: float
    total: float
    currency: str = Field(description="ISO 4217 currency code")


# 2. Encode the page image
image_b64 = base64.b64encode(Path("invoice.png").read_bytes()).decode()

# 3. Call GPT-4o with structured output
client = OpenAI()
response = client.beta.chat.completions.parse(
    model="gpt-4o-2024-11-20",
    messages=[
        {
            "role": "system",
            "content": (
                "You extract structured data from invoice images. "
                "Return null for fields not visible. Verify subtotal+tax=total."
            ),
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract all invoice fields."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{image_b64}",
                        "detail": "high",  # Pay for high-res tile budget
                    },
                },
            ],
        },
    ],
    response_format=Invoice,
    temperature=0,  # Favor run-to-run stability over sampling diversity
)

message = response.choices[0].message
if message.parsed is None:  # e.g. the model refused to answer
    raise RuntimeError(f"No parsed invoice returned: {message.refusal}")
invoice: Invoice = message.parsed
print(invoice.model_dump_json(indent=2))


# 4. Arithmetic verification post-extraction
def verify_totals(invoice: Invoice, tol: float = 0.02) -> None:
    # Line items should add up to the reported subtotal.
    computed = sum(li.total for li in invoice.line_items)
    if abs(computed - invoice.subtotal) > tol:
        print(f"WARN: subtotal mismatch: computed={computed} vs "
              f"reported={invoice.subtotal}")
    # Subtotal plus tax should match the reported grand total.
    if abs(invoice.subtotal + invoice.tax - invoice.total) > tol:
        print(f"WARN: total mismatch: subtotal+tax="
              f"{invoice.subtotal + invoice.tax} vs reported={invoice.total}")


verify_totals(invoice)
```
The response_format=Invoice argument enforces schema compliance via OpenAI's strict mode. Temperature 0 makes output far more repeatable across runs, although it does not guarantee identical results every time. The post-extraction arithmetic check catches the 1-2% of cases where the model misreads a digit; combined with model self-verification, the end-to-end error rate drops below 0.4%.

The "detail": "high" parameter on OpenAI's API enables higher-resolution image processing (multi-tile encoding) at the cost of additional token consumption. For receipts, business cards, and other dense documents, "high" is essential: with "low" detail, GPT-4o's character error rate on small text (8 pt and below) is 3.4x higher. The token premium is about 4x but worth it for any document where field accuracy matters.
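The third pattern, multi-pass verification, can be sketched independently of any particular API. In the example below, `reread_field` stands for a hypothetical wrapper around a second VLM call that re-reads a single named field from the page image; here it is replaced by a stub so that the control flow can run on its own.

```python
from typing import Callable, Optional


def verify_fields(extracted: dict, critical: list,
                  reread_field: Callable[[str], Optional[str]]) -> dict:
    """Re-read each critical field and report where the two passes disagree."""
    disputed = {}
    for name in critical:
        first = extracted.get(name)
        second = reread_field(name)  # hypothetical second model call
        if second is not None and second != first:
            disputed[name] = {"first_pass": first, "second_pass": second}
    return disputed


# Stub standing in for the second pass; a real one would query the VLM.
second_pass_readings = {"total": "47.80", "issue_date": "2025-03-14"}


def stub_reread(name: str) -> Optional[str]:
    return second_pass_readings.get(name)


first_pass = {"total": "47.20", "issue_date": "2025-03-14", "vendor": "Acme"}
disputed = verify_fields(first_pass, ["total", "issue_date"], stub_reread)
print(disputed)
```

Disagreement does not say which pass is right, so disputed fields are best routed to a rules check or a human reviewer rather than silently overwritten. Restricting the second pass to a few expensive fields keeps the extra cost bounded.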
21.3.8 When to Use VLMs vs. Specialists
The 2026 decision framework for production document AI has three dominant factors. The first is volume. Below 100k pages per month, the engineering overhead of self-hosting a specialist model is rarely worth the savings; use a VLM API. Between 100k and 10M pages/month, self-hosted Qwen2.5-VL or LayoutLMv3 starts to pay off, with break-even depending heavily on accuracy requirements. Above 10M pages/month, self-hosted specialist models are almost always the right choice because the cost differential dominates.
The second is task complexity. For simple structured extraction with stable schemas (invoices, purchase orders, ID documents), LayoutLMv3 plus a rules engine matches frontier VLM accuracy at 100x lower cost. For tasks requiring cross-field reasoning, semantic verification, or natural-language explanation alongside extraction, VLMs are mandatory.
The third is data sensitivity. Self-hosted models keep data on-premise, which matters for regulated domains (healthcare, defense, financial services). All four frontier VLM vendors offer enterprise tiers with data residency commitments, but for the most sensitive workloads only a self-hosted deployment is acceptable.
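The three factors can be condensed into a small rule, shown below as one possible reading of the framework rather than a definitive policy. The function name and return labels are hypothetical. Where volume and complexity pull in opposite directions (very high volume but cross-field reasoning required), this sketch assumes capability wins and recommends a self-hosted open-weight VLM.

```python
def recommend(pages_per_month: int, needs_reasoning: bool,
              highly_sensitive: bool) -> tuple:
    """Return (hosting, model class) using the thresholds from the text."""
    # Sensitivity or volume at or above 100k pages/month argues for self-hosting.
    self_host = highly_sensitive or pages_per_month >= 100_000
    if not self_host:
        return ("api", "frontier VLM")
    # Once self-hosting, task complexity picks the model class.
    model = "open-weight VLM" if needs_reasoning else "specialist + rules engine"
    return ("self-hosted", model)


print(recommend(50_000, False, False))
print(recommend(2_000_000, False, False))
print(recommend(20_000_000, True, False))
print(recommend(50_000, True, True))
```

A real decision would also weigh accuracy requirements and team capacity, which the text notes can shift the break-even point substantially in the 100k to 10M range.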
A 2025 viral moment in document-AI Twitter involved a receipt from a Munich bar where one item read "1x Stuhlbein" (literally "chair leg") at EUR 28.50. The bar's owner explained that "Stuhlbein" was the slang name of a specialty cocktail. GPT-4o, Claude 3.5, and Gemini 1.5 all extracted the line item correctly but added different commentary: GPT-4o suggested verifying the item description for accuracy, Claude noted that the price seemed high for furniture, and Gemini quietly returned the JSON without comment. The thread was a charming illustration of the personality differences between frontier VLMs and a reminder that VLM "personality" affects production behavior in subtle ways.
21.3.9 Key Takeaways
- Frontier VLMs (GPT-4V/4o, Claude 3.5 Sonnet, Gemini 1.5/2.0, Qwen2.5-VL) match or exceed specialized document models on every public benchmark.
- The cost differential is roughly 10-100x for most models: specialist models like LayoutLMv3 process pages at ~$0.0001 each; frontier VLMs at $0.001-0.01, with Gemini 2.0 Flash (about $0.0002) the low-cost exception.
- GPT-4o leads on complex form extraction with arithmetic checks; Claude 3.5 leads on conservative accuracy and DocVQA reasoning; Gemini wins on long-context and multilingual; Qwen2.5-VL leads the open-source frontier.
- Structured-output extraction with JSON schemas is the dominant production pattern, built on strict-mode JSON, few-shot exemplars, and multi-pass verification. Few-shot exemplars alone add 4-8 F1 points on most document tasks.
- The volume/complexity/data-sensitivity matrix determines whether to use a VLM API or self-host a specialist. Above ~10M pages/month, specialists almost always win on cost.
21.3.10 Self-Check
Why does declaring vendor_tax_id: str | None rather than vendor_tax_id: str in the Pydantic schema materially affect VLM behavior? Sketch a case where the difference changes the output by a useful margin.

Answer: Declaring vendor_tax_id: str | None gives the model a legitimate way to say "I did not find this," which both improves recall on genuinely missing fields and reduces hallucination on present-but-illegible fields. Concretely, on a receipt where the tax ID block is smudged, the optional schema produces null and the pipeline flags the record for human review; the required schema produces a plausible-looking but fabricated number that silently flows into the accounts-payable database.
Section 21.4 closes the chapter by assembling specialized OCR, layout-aware models, and frontier VLMs into end-to-end document AI pipelines. We will look at how production teams structure ingestion, parsing, table extraction, key-value detection, validation, and reconciliation as a coherent system, with cost and latency budgets at each stage.