Chapter 22: Vision-Language and Omni Models

Chapter opener illustration: Vision-Language and Omni Models.

"Vision-language models did not learn to see; they learned to align."
Pixel, Pipeline-Skeptical AI Agent

Looking Back

Chapters 20 and 21 covered audio and documents. This chapter is the rest of the visual world: CLIP, LLaVA, GPT-4o, Claude Vision, Gemini, and the omni-modal architectures that read images, listen, and reply in any modality.

Big Picture

How LLMs see. The first half covers vision-language models (ViT, CLIP/SigLIP contrastive learning, generative VLMs like LLaVA and Qwen-VL, frontier VLMs GPT-4V/Gemini/Claude Vision, multimodal benchmarks). The second half covers omni models (pipeline vs native multimodal, early vs late fusion, any-to-any generation, GPT-4o/Gemini/Llama-4-Omni/Chameleon).

Chapter Overview

In May 2024, OpenAI demonstrated GPT-4o singing, holding a video conversation, and translating live speech without round-tripping through a separate ASR model. About seven months later, in December 2024, Google's Gemini 2.0 matched the trick. The "vision-language model" era of separate encoders glued onto LLMs is collapsing into omni models that train all modalities together from day one. This chapter walks through both eras: CLIP and SigLIP as the contrastive ancestors, LLaVA and Qwen-VL as the open generative bridge, GPT-4V, Gemini, and Claude Vision at the closed frontier, the MMMU benchmark and its 2025 saturation story, and the early-vs-late fusion choice that decides whether your VLM is a pipeline or a model.

Multimodal moved from research curiosity to product default in two years. By the end of this chapter you will know which VLM to reach for, why omni models matter, and how early-fusion vs late-fusion choices shape your serving stack.

Note: Learning Objectives

Explain how a ViT tokenizes images and feeds patches into a language model decoder.
Compare CLIP and SigLIP on contrastive objectives, scale, and retrieval performance.
Architect a generative VLM (LLaVA-style) with a vision encoder, projector, and language backbone.
Evaluate a multimodal system on MMMU and interpret saturation behavior.
Decide between pipeline multimodality (separate models, fused at app layer) and native multimodality (single model, fused at training time).
Compare GPT-4o, Gemini, Llama-4-Omni, and Chameleon as any-to-any production candidates.
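
As a preview of the first objective, the sketch below shows the arithmetic behind ViT tokenization: an image is cut into fixed-size square patches, each patch is flattened, and the resulting sequence is what the transformer consumes. The function names vit_token_count and patchify are illustrative, not from any library, and the example uses a single-channel image stored as nested lists to stay dependency-free. A real ViT additionally applies a learned linear projection and position embeddings to each flattened patch, which are omitted here.

```python
def vit_token_count(image_size: int, patch_size: int, cls_token: bool = True) -> int:
    """Number of tokens a ViT produces for a square image (hypothetical helper)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2 + (1 if cls_token else 0)


def patchify(image: list[list[float]], patch_size: int) -> list[list[float]]:
    """Split an H x W single-channel image into flattened patches, in row-major order."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch_size x patch_size window into a single vector.
            patch = [
                image[top + dy][left + dx]
                for dy in range(patch_size)
                for dx in range(patch_size)
            ]
            patches.append(patch)
    return patches


if __name__ == "__main__":
    # A 224x224 image with 32x32 patches: 7 * 7 = 49 patches, plus one class token.
    print(vit_token_count(224, 32))                     # 50
    # A 336x336 image with 14x14 patches and no class token: 24 * 24 patches.
    print(vit_token_count(336, 14, cls_token=False))    # 576

    toy = [[row * 4 + col for col in range(4)] for row in range(4)]
    print(patchify(toy, 2)[0])                          # [0, 1, 4, 5]
```

Smaller patches or larger images grow the token count quadratically, which is why image resolution shows up directly in the context budget of a generative VLM.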

Prerequisites

Transformer architecture from Chapter 3
Document understanding from Chapter 21
LLM APIs from Chapter 11

Sections

Lab 22: Fine-Tune a CLIP-Style Embedding on Your Own Image-Text Pairs

Objective

Train a contrastive image-text model on 500 to 2000 image-caption pairs from your own domain (product photos, X-rays, screenshots, paintings). By the end you will have a custom embedding space where domain-relevant images cluster correctly, plus a measured retrieval improvement over off-the-shelf CLIP.

Steps

Step 1: Build the dataset. Scrape or curate 500 to 2000 image-caption pairs in your domain. Examples: Wikipedia paintings + descriptions, product DB photos + descriptions, public X-ray datasets with findings. Save as pairs.parquet.
Step 2: Baseline. Encode all images with openai/clip-vit-base-patch32. Encode 50 held-out queries. Compute recall@10 on text-to-image retrieval.
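Step 3: Fine-tune. Use open_clip with its contrastive objective, a symmetric cross-entropy (InfoNCE) loss over the image-text similarity matrix. Start from the same weights as the baseline (in open_clip, the ViT-B-32 architecture with pretrained='openai'). Train 5 epochs on a single T4: batch_size=128, lr=1e-5. Important: keep both encoders unfrozen, but use a tiny learning rate to avoid catastrophic forgetting.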
Step 4: Re-measure. Recall@10 should jump 10 to 30 points on the held-out queries. If it does not, your data is likely too small or too noisy.
Step 5: Inspect failure cases. Open 10 queries where retrieval failed. Categorize: ambiguous caption, visually similar but semantically different, OOD object. This builds intuition for what CLIP embeddings actually represent.
Step 6: Library shortcut + deploy. Push your model to Hugging Face Hub. Reload it in 3 lines with open_clip.create_model_and_transforms(), passing the repository as an hf-hub: prefixed model name. Anyone can now use your domain-CLIP for retrieval.
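
The contrastive objective used in the fine-tuning step can be sketched in a few lines. This is a dependency-free illustration of the symmetric cross-entropy idea, not open_clip's implementation: the function name clip_style_loss is hypothetical, the temperature is fixed at 0.07 as an assumption (in CLIP-style training the scale is typically a learned parameter), and embeddings are plain Python lists rather than tensors.

```python
import math


def l2_normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def log_softmax(row: list[float]) -> list[float]:
    """Numerically stable log-softmax over one row of logits."""
    peak = max(row)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_total for x in row]


def clip_style_loss(image_embs, text_embs, temperature: float = 0.07) -> float:
    """Symmetric cross-entropy over an N x N image-text similarity matrix.

    Pair k (image k, caption k) is the positive; every other pairing in the
    batch is a negative. The temperature value is an assumed constant.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    # Image-to-text: each row should concentrate probability on its diagonal entry.
    image_to_text = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image: the same criterion applied to the columns.
    columns = [list(col) for col in zip(*logits)]
    text_to_image = -sum(log_softmax(columns[k])[k] for k in range(n)) / n
    return (image_to_text + text_to_image) / 2


if __name__ == "__main__":
    aligned = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
    shuffled = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
    print(f"aligned pairs:  {aligned:.6f}")
    print(f"shuffled pairs: {shuffled:.6f}")
```

Because every other item in the batch serves as a negative, batch size directly controls how hard the task is, which is one reason the lab fixes batch_size rather than treating it as a free memory knob.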

Expected Output

Expected time: 3 to 4 hours. Difficulty: intermediate. Artifact: a domain-fine-tuned CLIP on Hugging Face Hub + recall@10 comparison.
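
The recall@10 comparison in the artifact can be computed as sketched below. This is a minimal illustration that assumes each query has exactly one correct image and that you already have a query-by-image similarity matrix (for example, cosine similarities between text and image embeddings). The function name recall_at_k and the toy scores are made up for the example.

```python
def recall_at_k(similarity: list[list[float]], relevant: list[int], k: int = 10) -> float:
    """Fraction of queries whose correct image appears among the top-k results.

    similarity[q][i] is the score of image i for query q.
    relevant[q] is the index of the single correct image for query q.
    """
    hits = 0
    for scores, target in zip(similarity, relevant):
        # Rank image indices from highest to lowest score and keep the first k.
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(relevant)


if __name__ == "__main__":
    # Toy scores: 3 text queries against 3 images.
    toy_scores = [
        [0.9, 0.1, 0.3],
        [0.2, 0.8, 0.7],
        [0.6, 0.5, 0.1],
    ]
    correct_image = [0, 2, 2]
    for k in (1, 2, 3):
        print(k, recall_at_k(toy_scores, correct_image, k=k))
```

Run the same function on the baseline and fine-tuned similarity matrices to get the before/after numbers. With only 50 held-out queries, each query is worth 2 percentage points, so differences of a few points are within noise.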

What's Next?

Next: Chapter 23: 3D Generation and Neural Scenes. VLMs see 2D pixels. The world is 3D. Chapter 23 covers the modern toolbox for synthesising and editing 3D content: Gaussian Splatting (introduced in 2023, the efficiency revolution that has largely displaced NeRFs), Stable Zero123 and Trellis for image-to-3D, 4D splats for video, and relighting. The diffusion-prior trick that makes all of this work surprisingly well is the heart of the chapter.