
"Vision-language models did not learn to see; they learned to align."
Pixel, Pipeline-Skeptical AI Agent
Chapters 20 and 21 covered audio and documents. This chapter is the rest of the visual world: CLIP, LLaVA, GPT-4o, Claude Vision, Gemini, and the omni-modal architectures that read images, listen, and reply in any modality.
How LLMs see. The first half covers vision-language models: ViT, CLIP/SigLIP contrastive learning, generative VLMs such as LLaVA and Qwen-VL, the frontier VLMs (GPT-4V, Gemini, Claude Vision), and multimodal benchmarks. The second half covers omni models: pipeline vs native multimodal, early vs late fusion, any-to-any generation, and GPT-4o, Gemini, Llama-4-Omni, and Chameleon.
Chapter Overview
In May 2024, OpenAI showed GPT-4o singing, holding a video conversation, and translating live speech without round-tripping to a separate ASR model. Six months later Google's Gemini 2.0 matched the trick. The "vision-language model" era of separate encoders glued onto LLMs is collapsing into omni models that train all modalities together from day one. This chapter walks through both eras: CLIP and SigLIP as the contrastive ancestors; LLaVA and Qwen-VL as the open generative bridge; GPT-4V, Gemini, and Claude Vision at the closed frontier; the MMMU benchmark and its 2025 saturation story; and the early-vs-late fusion choice that decides whether your VLM is a pipeline or a model.
Multimodal moved from research curiosity to product default in two years. By the end of this chapter you will know which VLM to reach for, why omni models matter, and how early-fusion vs late-fusion choices shape your serving stack.
- Explain how a ViT tokenizes images and feeds patches into a language model decoder.
- Compare CLIP and SigLIP on contrastive objectives, scale, and retrieval performance.
- Architect a generative VLM (LLaVA-style) with a vision encoder, projector, and language backbone.
- Evaluate a multimodal system on MMMU and interpret saturation behavior.
- Decide between pipeline multimodality (separate models, fused at app layer) and native multimodality (single model, fused at training time).
- Compare GPT-4o, Gemini, Llama-4-Omni, and Chameleon as any-to-any production candidates.
Prerequisites
- Transformer architecture from Chapter 3
- Document understanding from Chapter 21
- LLM APIs from Chapter 11
Sections
- 22.1 ViT and Visual Tokenization. Modern Vision-Language Models are stitched together from two halves: a vision encoder that turns pixels into a sequence of vectors, and a language model that consumes those vectors as if they were... (Entry)
- 22.2 Contrastive Vision-Language: CLIP and SigLIP. CLIP was the model that taught a vision encoder to speak. (Entry)
- 22.3 Generative VLMs: LLaVA, BLIP-3, Qwen-VL. A generative Vision-Language Model takes an image plus a text prompt and produces text. (Intermediate)
- 22.4 Frontier VLMs: GPT-4V, Gemini, Claude Vision. Closed-source frontier VLMs (OpenAI GPT-4V/4o, Google Gemini 1.5/2.0, Anthropic Claude 3.5/3.7 Vision) sit at the top of every public multimodal benchmark. (Advanced)
- 22.5 Evaluating Multimodal Reasoning: MMMU and Saturation. Benchmarks define what the field optimizes for. (Intermediate)
- 22.6 Pipeline vs Native Multimodal. Multimodal AI systems fall on a spectrum from "pipeline" to "native". (Intermediate)
- 22.7 Early Fusion vs Late Fusion. Once you commit to a native multimodal architecture (see Section 22.6), the next question is where in the network the modalities combine. (Intermediate)
- 22.8 Any-to-Any Generation. "Any-to-any" generation is the goal of a single model that can consume any modality and produce any modality. (Advanced)
- 22.9 Frontier Omni Models: GPT-4o, Gemini, Llama-4-Omni, Chameleon. Four frontier omni models defined the 2024-2026 state of native multimodality. (Advanced)
Objective
Train a contrastive image-text model on 500 to 2000 image-caption pairs from your own domain (product photos, X-rays, screenshots, paintings). By the end you will have a custom embedding space where domain-relevant images cluster correctly, plus a measured retrieval improvement over off-the-shelf CLIP.
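For intuition about what "contrastive" means here, the following is a minimal pure-Python sketch of a CLIP-style symmetric loss: each image should score highest against its own caption, and each caption against its own image. The function names, the toy 2-D embeddings, and the temperature of 0.07 are illustrative assumptions, not the lab's actual training code, which would run batched on a GPU.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched image/text pairs.

    temperature=0.07 is an assumed illustrative value.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # logits[i][j] = similarity of image i and caption j, sharpened by temperature
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    logits_t = [list(col) for col in zip(*logits)]  # caption-to-image direction
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits_t))

# Toy batch: pair i of images matches pair i of captions only in the first case.
image_embs = [[1.0, 0.0], [0.0, 1.0]]
aligned_text = [[1.0, 0.0], [0.0, 1.0]]
swapped_text = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = clip_contrastive_loss(image_embs, aligned_text)
loss_swapped = clip_contrastive_loss(image_embs, swapped_text)
print(f"aligned: {loss_aligned:.6f}  swapped: {loss_swapped:.4f}")
```

The aligned batch yields a loss near zero and the swapped batch a large one; fine-tuning on domain pairs pushes your data toward the aligned case.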
Steps
- Step 1: Build the dataset. Scrape or curate 500 image-caption pairs in your domain. Examples: Wikipedia paintings + descriptions, product DB photos + descriptions, public X-ray datasets with findings. Save as pairs.parquet.
- Step 2: Baseline. Encode all images with openai/clip-vit-base-patch32. Encode 50 held-out queries. Compute recall@10 on text-to-image retrieval.
- Step 3: Fine-tune. Use open_clip with the cross_entropy contrastive loss. Train 5 epochs on a single T4: batch_size=128, lr=1e-5. Hugely important: keep both encoders unfrozen but use a tiny LR to avoid catastrophic forgetting.
- Step 4: Re-measure. Recall@10 should jump 10 to 30 points on the held-out queries. If not, your data is too small or too noisy.
- Step 5: Inspect failure cases. Open 10 queries where retrieval failed. Categorize: ambiguous caption, visually similar but semantically different, OOD object. This builds intuition for what CLIP embeddings actually represent.
- Step 6: Library shortcut + deploy. Push your model to Hugging Face Hub. Reload it in 3 lines with open_clip.create_model_and_transforms(). Anyone can now use your domain-CLIP for retrieval.
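The recall@10 measurement in Steps 2 and 4 can be sketched as follows, assuming you already have embeddings as plain lists of floats. The helper names (cosine, recall_at_k) and the toy vectors are hypothetical; in the lab the embeddings would come from the CLIP encoders and k would be 10.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def recall_at_k(query_embs, image_embs, target_idx, k=10):
    """Fraction of text queries whose correct image appears in the top-k results.

    target_idx[i] is the index in image_embs of the image that query i describes.
    """
    hits = 0
    for query, target in zip(query_embs, target_idx):
        scores = [cosine(query, img) for img in image_embs]
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        hits += target in ranked[:k]
    return hits / len(query_embs)

# Toy data: three images, three text queries; the third query is a deliberate miss at k=1.
image_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query_embs = [[0.9, 0.1], [0.1, 0.9], [1.0, -1.0]]
target_idx = [0, 1, 2]

recall_1 = recall_at_k(query_embs, image_embs, target_idx, k=1)
recall_3 = recall_at_k(query_embs, image_embs, target_idx, k=3)
print(f"recall@1 = {recall_1:.3f}, recall@3 = {recall_3:.3f}")
```

Running the same function on the baseline and fine-tuned embeddings, over the same 50 held-out queries, gives the before/after comparison; the queries that miss are the ones to inspect in Step 5.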
Expected Output
Expected time: 3 to 4 hours. Difficulty: intermediate. Artifact: a domain-fine-tuned CLIP on Hugging Face Hub + recall@10 comparison.
What's Next?
Next: Chapter 23: 3D Generation and Neural Scenes. VLMs see 2D pixels. The world is 3D. Chapter 23 covers the modern toolbox for synthesising and editing 3D content: Gaussian Splatting (the 2024 efficiency revolution that mostly replaced NeRFs), Stable Zero123 and Trellis for image-to-3D, 4D splats for video, and relighting. The diffusion-prior trick that makes all of this work surprisingly well is the heart of the chapter.