
"The hardest data is the data your colleagues already think they have."
Label, Document-Parsing AI Agent
Chapter 20 handled audio; this chapter handles the other dirty modality every enterprise meets: documents. PDFs, scanned forms, tables, and the OCR-plus-VLM pipelines that turn them into structured outputs an LLM can reason over.
Modern OCR (TrOCR), layout-aware models, VLM-based document understanding, and document AI pipelines.
Chapter Overview
A Fortune 500 insurer processes 4 million claim PDFs a year. In 2022, every one of them went through a Tesseract-plus-regex pipeline that hit 78 percent field accuracy and required 40 humans on the QA queue. In 2024 the same workload moved to a LayoutLM-plus-Claude pipeline, hit 96 percent, and the QA team shrank to 6. That is the document AI shift: OCR is no longer about character recognition; it is about end-to-end structured extraction from any layout the world throws at you. This chapter moves from modern end-to-end OCR (TrOCR, Donut), through layout-aware models (LayoutLM family, LiLT), to VLM-based document understanding (GPT-4V, Claude Vision, Gemini, Qwen-VL Doc) and the structured JSON extraction patterns that ship to production.
The combination of layout-aware encoders and frontier VLMs has changed the document AI baseline twice since 2022. This chapter teaches the architectures, the benchmarks (FUNSD, DocLayNet), and the production patterns that survive 2026.
- Explain the architectural difference between TrOCR, Donut, and the LayoutLM family.
- Fine-tune a LayoutLM-style model on FUNSD or a domain document corpus.
- Use frontier VLMs (GPT-4V, Claude Vision, Gemini) for structured JSON extraction from PDFs.
- Architect a production document AI pipeline with ingestion, table detection, KV extraction, and validation.
- Diagnose accuracy regressions in OCR and KV extraction using DocLayNet-style benchmarks.
- Compare the cost-latency-quality envelope of OCR-first vs VLM-direct pipelines for a target workload.
Prerequisites
- Modern LLM landscape from Chapter 7
- LLM APIs from Chapter 11
- Familiarity with one VLM (GPT-4o, Claude, Gemini) at the API level
Sections
- 21.1 Modern OCR: TrOCR and End-to-End Recognition (Entry). TrOCR, Donut, DocLayNet, end-to-end document understanding, and accuracy benchmarks.
- 21.2 Layout-Aware Models: LayoutLM Family (Intermediate). LayoutLM v1, v2, v3, LiLT, Donut, and fine-tuning on FUNSD.
- 21.3 VLM-Based Document Understanding (Advanced). GPT-4V, Claude Vision, Gemini, Qwen-VL Doc, PaperQA, and structured JSON extraction from PDFs.
- 21.4 Building Document AI Pipelines (Advanced). Ingestion, parsing, table detection, KV extraction, validation, reconciliation, and cost-latency tradeoffs.
Objective
Build a pipeline that ingests scanned invoice PDFs and emits structured JSON (vendor, line items, totals) with schema validation. By the end you will have a working extractor measured against a small gold set, and you will know when to reach for a fine-tuned LayoutLM vs. a frontier VLM.
Steps
- Step 1: Get data. Use the public RVL-CDIP-invoice subset (~200 invoices) or generate 30 fake invoices in Word/Pages and save them as PDFs. Hand-label 20 with the ground-truth JSON.
- Step 2: VLM baseline. Send each PDF page as an image to GPT-4o with a strict Pydantic schema (Invoice(vendor, date, line_items: list[LineItem], total)) via the structured-output API. Save the raw outputs.
- Step 3: Validate. For each output, check that (a) total equals the sum of the line-item amounts, (b) date parses to a valid datetime, and (c) vendor is non-empty. Track the failure rate.
- Step 4: Retry on failure. When validation fails, send the failed JSON back to the model with the error message and ask for a corrected output. A single retry usually fixes about half of the failures.
- Step 5: Compare with LayoutLMv3. Run microsoft/layoutlmv3-base as a sequence-tagging model on 10 invoices. Compare field-level F1.
- Step 6: Cost analysis. Tally VLM tokens per invoice against LayoutLM GPU inference time. Recommend a tier policy: LayoutLM for high-volume known templates, a VLM for the long tail.
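The validation checks in Step 3 and the single retry in Step 4 can be sketched as follows. This is a minimal illustration, not the lab's reference solution: it uses stdlib dataclasses instead of Pydantic so it runs without dependencies, and it assumes ISO dates (YYYY-MM-DD), a description field on LineItem, and a rounding tolerance of 0.01. The extract callable is a hypothetical stand-in for whatever function wraps your VLM call.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable


@dataclass
class LineItem:
    description: str  # assumed field; the lab only names `amount`
    amount: float


@dataclass
class Invoice:
    vendor: str
    date: str  # assumed to be an ISO string, e.g. "2024-03-01"
    line_items: list[LineItem]
    total: float


def validate(inv: Invoice, tol: float = 0.01) -> list[str]:
    """Return human-readable errors; an empty list means the invoice passed."""
    errors = []
    items_sum = sum(item.amount for item in inv.line_items)
    if abs(items_sum - inv.total) > tol:
        errors.append(f"total {inv.total} != sum of line items {items_sum:.2f}")
    try:
        datetime.strptime(inv.date, "%Y-%m-%d")
    except ValueError:
        errors.append(f"date {inv.date!r} is not a valid YYYY-MM-DD date")
    if not inv.vendor.strip():
        errors.append("vendor is empty")
    return errors


def extract_with_retry(page, extract: Callable[[object, list[str]], Invoice]):
    """Extract once; if validation fails, retry once with the error messages."""
    inv = extract(page, [])
    errors = validate(inv)
    if errors:
        # Hypothetical contract: `extract` feeds the errors back into the prompt.
        inv = extract(page, errors)
        errors = validate(inv)
    return inv, errors
```

Returning the remaining errors alongside the invoice lets the caller count first-pass and post-retry failures separately, which is the number Step 3 asks you to track.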
Expected Output
Expected time: 3 hours. Difficulty: intermediate. Artifact: a runnable PDF-to-JSON extractor + accuracy/cost report.
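For the accuracy half of the report, one simple way to score an extractor against the hand-labeled gold set is exact-match F1 per field. The sketch below assumes each prediction and gold record is a flat dict and that a missing field is represented as None; the function name and this scoring convention are illustrative choices, and a sequence-tagging evaluation of LayoutLMv3 may use entity-level F1 instead.

```python
def field_f1(predictions: list[dict], gold: list[dict],
             fields: list[str]) -> dict[str, float]:
    """Exact-match F1 per field over paired prediction/gold records."""
    scores = {}
    for field in fields:
        tp = fp = fn = 0
        for pred, ref in zip(predictions, gold):
            p, g = pred.get(field), ref.get(field)
            if p is not None and p == g:
                tp += 1  # predicted value matches the gold value
            else:
                if p is not None:
                    fp += 1  # predicted something, but it was wrong
                if g is not None:
                    fn += 1  # gold value was missed or mispredicted
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores[field] = (2 * precision * recall / (precision + recall)
                         if precision + recall else 0.0)
    return scores
```

Scoring per field rather than per document shows which fields (for example totals versus vendor names) each extractor struggles with, which feeds directly into the tier policy from Step 6.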
What's Next?
Next: Chapter 22: Vision-Language Models. Document understanding was a vertical slice of vision-language; Chapter 22 is the horizontal one. We trace the lineage from ViT and CLIP through SigLIP, BLIP-3, LLaVA, and the omni-modal frontier (GPT-4o, Gemini, Claude 3.5 Sonnet's vision) to see how a single architecture absorbed the entire image-understanding stack. By the end you will know which model to reach for when you need to "look at" anything from a chart to a chest X-ray.