Chapter 21: Document Understanding and OCR

Chapter opener illustration: Document Understanding and OCR.

"The hardest data is the data your colleagues already think they have."
Label, Document-Parsing AI Agent

Looking Back

Chapter 20 handled audio; this chapter handles the other messy modality every enterprise meets: documents. It covers PDFs, scanned forms, tables, and the OCR-plus-VLM pipelines that turn them into structured outputs an LLM can reason over.

Big Picture

This chapter covers modern OCR (TrOCR), layout-aware models, VLM-based document understanding, and document AI pipelines.

Chapter Overview

A Fortune 500 insurer processes 4 million claim PDFs a year. In 2022, every one of them went through a Tesseract-plus-regex pipeline that hit 78 percent field accuracy and required 40 humans on the QA queue. In 2024 the same workload moved to a LayoutLM-plus-Claude pipeline that hit 96 percent, and the QA team shrank to 6. That is the document AI shift: OCR is no longer about character recognition; it is about end-to-end structured extraction from any layout the world throws at you. This chapter moves from modern end-to-end OCR (TrOCR, Donut), through layout-aware models (the LayoutLM family, LiLT), to VLM-based document understanding (GPT-4V, Claude Vision, Gemini, Qwen-VL Doc) and the structured JSON extraction patterns that ship to production.

The combination of layout-aware encoders and frontier VLMs has changed the document AI baseline twice since 2022. This chapter teaches the architectures, the benchmarks (FUNSD, DocLayNet), and the production patterns that survive 2026.

Note: Learning Objectives

Explain the architectural difference between TrOCR, Donut, and the LayoutLM family.
Fine-tune a LayoutLM-style model on FUNSD or a domain document corpus.
Use frontier VLMs (GPT-4V, Claude Vision, Gemini) for structured JSON extraction from PDFs.
Architect a production document AI pipeline with ingestion, table detection, KV extraction, and validation.
Diagnose accuracy regressions in OCR and KV extraction using DocLayNet-style benchmarks.
Compare the cost-latency-quality envelope of OCR-first vs VLM-direct pipelines for a target workload.

Prerequisites

Modern LLM landscape from Chapter 7
LLM APIs from Chapter 11
Familiarity with one VLM (GPT-4o, Claude, Gemini) at the API level

Lab 21: Build an Invoice-to-JSON Extractor With a VLM and a Validation Layer

Objective

Build a pipeline that ingests scanned invoice PDFs and emits structured JSON (vendor, line items, totals) with schema validation. By the end you will have a working extractor measured against a small gold set, and you will know when to reach for a fine-tuned LayoutLM vs. a frontier VLM.
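
The output contract can be written down before any model is called. The following is a minimal sketch of that contract, assuming standard-library dataclasses as a stand-in for the Pydantic schema used later in the lab. The `description` field, the ISO date format, and the `parse_invoice` helper are illustrative assumptions, not part of the lab specification.

```python
import datetime
import json
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class LineItem:
    description: str  # assumed field; the lab itself only names `amount`
    amount: Decimal


@dataclass(frozen=True)
class Invoice:
    vendor: str
    date: datetime.date
    line_items: list[LineItem]
    total: Decimal


def parse_invoice(raw_json: str) -> Invoice:
    """Parse a JSON string into an Invoice.

    Raises KeyError, TypeError, ValueError, or decimal.InvalidOperation
    when a field is missing or malformed. Dates are assumed to be ISO
    formatted (YYYY-MM-DD), which is an assumption of this sketch.
    """
    data = json.loads(raw_json)
    items = [
        # Go through str() so floats like 12.5 become exact Decimals.
        LineItem(description=str(item["description"]),
                 amount=Decimal(str(item["amount"])))
        for item in data["line_items"]
    ]
    return Invoice(
        vendor=str(data["vendor"]),
        date=datetime.date.fromisoformat(data["date"]),
        line_items=items,
        total=Decimal(str(data["total"])),
    )
```

Using `Decimal` rather than `float` for money keeps the later "total equals sum of line items" check free of floating-point noise.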

Steps

Step 1: Get data. Use the public RVL-CDIP invoice subset (~200 invoices), or generate 30 fake invoices in Word or Pages and save them as PDFs. Hand-label 20 of them with ground-truth JSON.
Step 2: VLM baseline. Send each PDF page as an image to GPT-4o with a strict Pydantic schema (Invoice(vendor, date, line_items: list[LineItem], total)) via the structured-output API. Save raw outputs.
Step 3: Validate. For each output, check that (a) total equals sum(line_items.amount) within a small rounding tolerance, (b) date parses to a valid datetime, and (c) vendor is non-empty. Track the failure rate.
Step 4: Retry on failure. When validation fails, send the failed JSON back to the model with the error message and ask for a corrected output. A single retry usually fixes about half of the failures.
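Step 5: Compare with LayoutLMv3. Fine-tune microsoft/layoutlmv3-base as a sequence-tagging model on your labeled invoices (the base checkpoint ships without a trained tagging head), then run it on 10 invoices. Compare field-level F1 against the VLM baseline.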
Step 6: Cost analysis. Tally VLM tokens per invoice vs. LayoutLM GPU inference time. Recommend a tier policy: LayoutLM for high-volume known templates, VLM for the long tail.
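
Steps 3 and 4 can be sketched without committing to any particular vendor SDK. In the sketch below, `call_model` is a hypothetical callable that wraps whatever VLM request you use and returns the raw JSON string. The ISO date format, the 0.01 tolerance, and the retry prompt wording are likewise assumptions for illustration.

```python
import json
from datetime import date
from decimal import Decimal, InvalidOperation
from typing import Callable, Optional


def validate_invoice(data: dict, tolerance: Decimal = Decimal("0.01")) -> list[str]:
    """Return human-readable validation errors; an empty list means all checks pass."""
    errors = []
    # (a) total must equal the sum of line-item amounts, within a tolerance.
    try:
        total = Decimal(str(data["total"]))
        line_sum = sum(
            (Decimal(str(item["amount"])) for item in data["line_items"]),
            Decimal("0"),
        )
        if abs(total - line_sum) > tolerance:
            errors.append(f"total {total} != sum of line items {line_sum}")
    except (KeyError, TypeError, InvalidOperation):
        errors.append("total or line_items missing or non-numeric")
    # (b) date must parse (ISO YYYY-MM-DD is assumed here).
    try:
        date.fromisoformat(str(data.get("date", "")))
    except ValueError:
        errors.append("date is not a valid ISO date (YYYY-MM-DD)")
    # (c) vendor must be non-empty.
    if not str(data.get("vendor") or "").strip():
        errors.append("vendor is empty")
    return errors


def extract_with_retry(
    call_model: Callable[[str], str],  # hypothetical wrapper around the VLM call
    prompt: str,
    max_retries: int = 1,
) -> tuple[Optional[dict], list[str]]:
    """Call the model, validate, and re-prompt with the errors on failure."""
    attempt_prompt = prompt
    data: Optional[dict] = None
    errors = ["no attempt made"]
    for _ in range(max_retries + 1):
        raw = call_model(attempt_prompt)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError as exc:
            data, errors = None, [f"output is not valid JSON: {exc.msg}"]
        else:
            if isinstance(parsed, dict):
                data, errors = parsed, validate_invoice(parsed)
            else:
                data, errors = None, ["top-level JSON is not an object"]
            if not errors:
                return data, []
        # Feed the failed output and the error list back to the model.
        joined = "; ".join(errors)
        attempt_prompt = (
            f"{prompt}\n\nYour previous output was:\n{raw}\n\n"
            f"It failed validation: {joined}\nReturn corrected JSON only."
        )
    return data, errors
```

Passing the model call in as a function keeps the validation logic testable with a fake model, and makes it easy to log the fraction of invoices that needed a retry.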

Expected Output

Expected time: 3 hours. Difficulty: intermediate. Artifact: a runnable PDF-to-JSON extractor + accuracy/cost report.
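
The accuracy and cost numbers in the report can be computed with a few small helpers. This is a rough sketch under stated assumptions: fields are scalar strings compared after lowercasing and trimming, a wrong value counts as both a false positive and a false negative, and all prices are parameters you supply from your own provider and hardware rates. The function names are hypothetical.

```python
def _normalize(value) -> str:
    """Crude normalization for scalar fields: stringify, trim, lowercase."""
    return str(value).strip().lower()


def field_f1(predictions: list[dict], gold: list[dict]) -> dict:
    """Micro-averaged field-level precision, recall, and F1 over paired documents."""
    tp = fp = fn = 0
    for pred, truth in zip(predictions, gold):
        for field, value in truth.items():
            if pred.get(field) in (None, ""):
                fn += 1  # field missing from the prediction
            elif _normalize(pred[field]) == _normalize(value):
                tp += 1
            else:
                fp += 1  # a wrong value was emitted...
                fn += 1  # ...and the right one was missed
        # Predicted fields that do not exist in the gold record are false positives.
        fp += sum(1 for f, v in pred.items() if f not in truth and v not in (None, ""))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def vlm_cost_usd(input_tokens: int, output_tokens: int,
                 usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Per-invoice VLM cost from token counts and per-million-token prices."""
    return (input_tokens * usd_per_m_input + output_tokens * usd_per_m_output) / 1_000_000


def gpu_cost_usd(gpu_seconds: float, usd_per_gpu_hour: float) -> float:
    """Per-invoice cost of local inference from GPU time and an hourly rate."""
    return gpu_seconds * usd_per_gpu_hour / 3600
```

Line items need their own matching rule (for example, aligning rows before comparing amounts), which this sketch deliberately leaves out.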

What's Next?

Next: Chapter 22: Vision-Language Models. Document understanding was a vertical slice of vision-language; Chapter 22 is the horizontal one. We trace the lineage from ViT and CLIP through SigLIP, BLIP-3, LLaVA, and the omni-modal frontier (GPT-4o, Gemini, Claude 3.5 Sonnet) to see how a single architecture absorbed the entire image-understanding stack. By the end you will know which model to reach for when you need to "look at" anything from a chart to a chest X-ray.