A truly multimodal model does not translate between modalities. It thinks in all of them simultaneously.
Pixel, Polyglot Perceiving AI Agent
The field of multimodal AI is undergoing a fundamental architectural shift from pipeline systems to unified models. Early multimodal systems bolted together separate models: a vision encoder, a text decoder, and an audio processor, connected by adapters and projection layers. While functional, these pipeline approaches suffer from information loss at every handoff. GPT-4o, Gemini 2.0/2.5, and similar "omni" models represent a new paradigm: a single transformer trained end-to-end across text, images, audio, and video that can both understand and generate in multiple modalities natively. This section covers the architectures behind these models, the trade-offs between fusion strategies, and how to evaluate multimodal systems using benchmarks like MMMU and MMBench. The embedding concepts from Section 19.1 provide context for understanding shared multimodal embedding spaces.
Prerequisites
This section builds on vision-language models from Section 27.1: Image Generation and Vision-Language Models, audio and video generation in Section 27.2: Audio, Music and Video Generation, and the transformer architecture covered in Section 04.1.
Imagine two ways to handle a conversation in three languages. The first approach (pipeline) hires separate translators: one listens in Japanese, translates to English, another reads English and writes a response, and a third translates the response to French. Each handoff loses nuance. The second approach (native multimodal) is a single person who thinks fluently in all three languages. They hear the Japanese question, understand its meaning directly, and respond in French without any intermediate translation step. Unified multimodal models are the second approach: they process images, text, and audio within a single model that represents all modalities in a shared embedding space, avoiding the information loss that comes from translating between separate specialized models.
1. Pipeline vs. Native Multimodal Architectures
The distinction between pipeline and native multimodal models is the most important architectural concept in this section. A pipeline model chains separate specialist models: CLIP encodes the image, Whisper transcribes the audio, and a language model reasons over the combined text representations. A native multimodal model processes all modalities within a single set of transformer layers, using a shared representation space where image tokens, audio tokens, and text tokens attend to each other directly.
| Property | Pipeline (e.g., LLaVA) | Native (e.g., GPT-4o, Gemini) |
|---|---|---|
| Architecture | Separate encoders + projection + LLM | Single transformer, shared embeddings |
| Cross-modal reasoning | Limited by projection bottleneck | Full attention across modalities |
| Generation | Text only (typically) | Text, images, audio, video |
| Latency | Additive (each model adds delay) | Single forward pass (lower latency) |
| Training cost | Lower (can reuse pretrained components) | Much higher (end-to-end training) |
| Customizability | Can swap individual components | Monolithic; harder to modify |
Multimodal models need special tokenization strategies because different modalities have fundamentally different information densities. A single text token encodes roughly 4 characters (a word fragment). A single image patch token from a 224x224 image encodes a 16x16 pixel region, hundreds of raw channel values. A single audio frame token represents roughly 20 milliseconds of waveform, hundreds of amplitude samples at typical sample rates. This mismatch creates a practical problem: a 1024-token context window can hold a 4,000-character text passage, a handful of low-resolution images, or about 20 seconds of audio at 20 ms per token. Native multimodal models address this through learned compression (visual tokenizers like those in Gemini reduce a 224x224 image to 256 tokens), dynamic resolution (allocating more tokens to complex images and fewer to simple ones), and hierarchical encoding (capturing global structure at low token count and local detail at high token count). Understanding these tokenization trade-offs, which build on the text tokenization foundations from Chapter 02, is essential for designing multimodal applications that balance quality with context window efficiency.
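The density mismatch can be made concrete with a back-of-the-envelope sketch. The ratios below (4 characters per text token, 16x16 pixel patches, 20 ms audio frames) are illustrative approximations, not any specific model's tokenizer:

```python
def text_tokens(chars: int, chars_per_token: float = 4.0) -> int:
    """Approximate text token count (~4 characters per token)."""
    return int(chars / chars_per_token)

def image_tokens(width: int, height: int, patch: int = 16) -> int:
    """Patch-based image tokens: one token per patch x patch pixel region."""
    return (width // patch) * (height // patch)

def audio_tokens(seconds: float, ms_per_token: float = 20.0) -> int:
    """Audio tokens at one token per fixed-length frame (20 ms here)."""
    return int(seconds * 1000 / ms_per_token)

# What fits in a 1024-token budget under these assumptions:
print(text_tokens(4000))       # 1000 tokens: a 4,000-character passage
print(image_tokens(224, 224))  # 196 tokens: one low-resolution image
print(audio_tokens(20.48))     # 1024 tokens: ~20 seconds of audio
```

Under these assumptions, five 224x224 images or a single 512x512 image (1,024 patch tokens) would exhaust the same budget as a short essay, which is why learned compression and dynamic resolution matter so much in practice.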
2. Early Fusion vs. Late Fusion
The distinction between early and late fusion describes where in the model different modalities are combined. In early fusion, raw inputs from all modalities are tokenized and concatenated into a single sequence before the first transformer layer. Every attention layer can attend across modalities from the very start. In late fusion, each modality is processed by separate encoder stacks for many layers, and the representations are only combined in the final layers.
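The structural difference can be shown with a toy NumPy sketch. The "layer" below is a stand-in (a residual map, not real attention), and the layer counts are arbitrary; the point is only where the concatenation happens:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy embedding dimension
W = rng.normal(scale=0.1, size=(d, d))

def layer(x):
    # Stand-in for one transformer layer: residual + nonlinearity.
    # Real layers use attention; this only illustrates WHERE fusion occurs.
    return x + np.tanh(x @ W)

text_tok = rng.normal(size=(5, d))   # 5 text tokens
image_tok = rng.normal(size=(4, d))  # 4 image-patch tokens

# Early fusion: concatenate BEFORE the first layer, so every layer
# can mix information across modalities.
early = np.concatenate([image_tok, text_tok])
for _ in range(6):
    early = layer(early)

# Late fusion: each modality runs through its own stack first;
# representations only meet in the final layers.
t, v = text_tok, image_tok
for _ in range(4):               # modality-specific layers
    t, v = layer(t), layer(v)
late = np.concatenate([v, t])
for _ in range(2):               # shared fusion layers
    late = layer(late)

print(early.shape, late.shape)   # both (9, 8): same sequence, different mixing depth
```

Both paths end with the same sequence shape; the difference is that in the early-fusion path, cross-modal mixing had six layers to happen instead of two.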
Early Fusion (GPT-4o style)
GPT-4o and Gemini 2.0 use early fusion: images are converted to visual tokens by a patch-based tokenizer, audio is converted to audio tokens by a learned encoder, and text uses the standard tokenizer. All tokens are interleaved into a single sequence and fed through the full transformer stack. This allows the model to reason about relationships between modalities from the earliest layers. For example, the model can associate a spoken word with a visual region in the first few layers, rather than having to reconstruct this relationship after separate encoding.
When choosing between early fusion and late fusion multimodal models for your application, consider your fine-tuning budget. Late fusion models let you freeze the vision encoder and only train the cross-attention layers, cutting GPU memory requirements by 40 to 60%. If your task mainly needs visual understanding (image captioning, visual QA), late fusion with a frozen encoder is often sufficient. Reserve early fusion for tasks that require tight cross-modal reasoning, such as models that must describe spatial relationships between objects in a scene while following complex text instructions.
Late Fusion (Flamingo style)
Late fusion models like Flamingo process images through a frozen vision encoder for many layers, producing a compact set of visual features. These features are then injected into the language model via cross-attention layers at specific positions. The advantage is efficiency: the vision encoder can be pretrained independently and frozen, reducing training cost. The disadvantage is that cross-modal interactions are limited to the layers where injection occurs. Figure 27.4.2 compares early fusion, late fusion, and cross-attention fusion architectures side by side. Code Fragment 27.4.1 below puts this into practice.
```python
# Gemini 2.5: native multimodal with early fusion
# Process image + audio + text in a single forward pass for cross-modal reasoning
import google.generativeai as genai
from pathlib import Path

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-2.5-flash")

# Upload an image and audio file
image = genai.upload_file(Path("meeting_whiteboard.jpg"))
audio = genai.upload_file(Path("meeting_recording.mp3"))

# Cross-modal query: reason across image and audio simultaneously
response = model.generate_content([
    image,
    audio,
    "Compare the action items discussed in the audio recording with "
    "what was written on the whiteboard. Identify any items mentioned "
    "verbally but missing from the whiteboard."
])
print(response.text)
```
3. Any-to-Any Generation
The most striking capability of unified multimodal models is any-to-any generation: the ability to take input in any combination of modalities and produce output in any modality. GPT-4o can accept text and produce audio, accept an image and produce text, or accept audio and produce text with an image. This is fundamentally different from pipeline systems where each input-output modality pair requires a separate model chain.
Architecturally, any-to-any generation requires the model to have both encoders and decoders for every supported modality. For text, the standard autoregressive decoder suffices. For images, the model typically includes a diffusion decoder or a discrete visual token decoder that generates images token by token. For audio, a vocoder or audio token decoder converts the model's internal representations into waveforms. The shared transformer backbone provides the cross-modal reasoning, while modality-specific heads handle the final generation step. Code Fragment 27.4.2 below puts this into practice.
```python
# Any-to-any generation: analyze an image, then generate a new one based on the analysis
# Demonstrates cross-modal flow from image understanding to image generation
from openai import OpenAI
import base64

client = OpenAI()

# Step 1: Analyze an image with text
with open("chart.png", "rb") as f:
    image_data = base64.b64encode(f.read()).decode()

analysis = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_data}"}},
            {"type": "text",
             "text": "Describe the trends in this chart. What are the key insights?"}
        ]
    }]
)
print(analysis.choices[0].message.content)

# Step 2: Generate an improved version of the chart (image generation)
image_response = client.images.generate(
    model="gpt-image-1",
    prompt=(
        "Create a clean, professional bar chart showing the following data: "
        f"{analysis.choices[0].message.content[:500]}. "
        "Use a modern color palette with clear labels and a title."
    ),
    size="1024x1024"
)

# gpt-image-1 returns base64-encoded image data (b64_json) rather than a URL
with open("improved_chart.png", "wb") as f:
    f.write(base64.b64decode(image_response.data[0].b64_json))
print("Saved generated chart to improved_chart.png")
```
For quick image captioning without API keys, the transformers pipeline (pip install transformers) handles model loading and inference in a few lines:

```python
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip2-opt-2.7b")
result = captioner("chart.png")
print(result[0]["generated_text"])  # e.g., "a bar chart showing quarterly revenue growth"
```
4. Training Unified Multimodal Models
Training a unified multimodal model requires solving several challenges that do not exist in text-only training. The first is tokenization across modalities: images, audio, and text have fundamentally different structures and information densities. A single image can contain as much information as thousands of text tokens, and audio streams have a very different temporal structure than text sequences. The model needs a tokenization strategy that represents each modality efficiently while allowing meaningful cross-modal attention.
Training Stages
Unified multimodal models typically follow a staged training curriculum, where each phase builds on the previous one:
- Modality-specific pretraining: Train separate encoders and decoders on large unimodal datasets (ImageNet for vision, LibriSpeech for audio, text corpora for language). This gives each modality a strong foundation.
- Alignment pretraining: Train the model on paired multimodal data (image-caption pairs, audio-transcript pairs) to align representations across modalities in a shared embedding space.
- Unified fine-tuning: Fine-tune the complete model end-to-end on interleaved multimodal data where inputs and outputs can be in any modality. This stage teaches the model to reason across modalities and generate in any target modality.
- Instruction tuning: Fine-tune on human-curated instruction-following data that covers multimodal tasks: "describe this image," "transcribe this audio," "generate an image of..." This stage aligns the model with user expectations.
Training data for different modalities varies enormously in availability and quality. There are trillions of text tokens available on the web, but far fewer high-quality image-text pairs, and even fewer high-quality audio-text-image triplets. If training data is not carefully balanced, the model can become "text-dominant," treating visual and audio inputs as secondary signals rather than first-class modalities. Google's Gemini team addressed this by curating dedicated cross-modal datasets and using loss weighting to ensure each modality receives adequate training signal.
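Loss weighting of this kind can be sketched as a simple weighted combination of per-modality losses. The weights and loss values below are illustrative placeholders, not Gemini's actual settings:

```python
def weighted_multimodal_loss(losses: dict, weights: dict) -> float:
    """Combine per-modality losses with normalized weights so that
    scarce modalities (image, audio) are not drowned out by text."""
    total_w = sum(weights[m] for m in losses)
    return sum(weights[m] * losses[m] for m in losses) / total_w

# Example batch losses: text is already well-fit, image/audio lag behind
losses = {"text": 2.1, "image": 3.4, "audio": 3.9}
# Upweight the scarce modalities so their gradients carry more signal
weights = {"text": 1.0, "image": 2.0, "audio": 2.0}

print(round(weighted_multimodal_loss(losses, weights), 3))
```

In real training stacks the weights may also be scheduled over time, for example, emphasizing alignment data early and interleaved any-to-any data later, mirroring the staged curriculum above.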
5. Multimodal Benchmarks
Evaluating multimodal models requires benchmarks that test cross-modal reasoning, not just performance on each modality in isolation. A model that achieves high accuracy on ImageNet and low perplexity on text benchmarks might still fail at tasks requiring joint reasoning across images and text. The major multimodal benchmarks as of 2025 include:
| Benchmark | What It Tests | Key Features |
|---|---|---|
| MMMU | Expert-level multimodal understanding | 11.5k questions from college exams across 30+ subjects requiring image + text reasoning |
| MMBench | Broad multimodal capabilities | Hierarchical evaluation with 20 ability dimensions, bilingual (EN/CN) |
| MathVista | Mathematical reasoning with visuals | Tests interpretation of charts, geometry, and scientific figures |
| Video-MME | Video understanding | Tests temporal reasoning across video frames with multiple-choice questions |
| SEED-Bench | Generative multimodal comprehension | 19k multiple-choice questions covering 12 evaluation dimensions for images and video |
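Most of these benchmarks score multiple-choice answers by exact match on the chosen letter. A minimal scoring helper, with hypothetical model outputs (real harnesses add answer extraction, circular evaluation, and per-dimension breakdowns):

```python
def score_multiple_choice(predictions: list, answers: list) -> float:
    """Exact-match accuracy on the leading answer letter (A/B/C/D).
    Tolerates verbose outputs like 'b) the chart shows growth'."""
    correct = sum(
        p.strip().upper()[:1] == a
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)

preds = ["A", "b) the chart shows growth", "C", "D"]  # illustrative outputs
gold = ["A", "B", "C", "A"]
print(score_multiple_choice(preds, gold))  # 0.75
```

This letter-matching step is itself a known weak point: models that refuse to answer or bury the letter mid-sentence get scored as wrong, which is one motivation for free-form benchmarks like MMMU-Pro.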
When MMMU was first released, researchers noticed that some multimodal models scored better on questions with complex diagrams than on questions with simple text. It turned out the models were using visual features of the question formatting (font size, layout, color coding) as shortcuts to guess the answer, rather than actually reasoning about the diagram content. The benchmark creators had to redesign several questions to eliminate these visual "cheats," a reminder that models will exploit any shortcut available, whether textual or visual.
6. The Current Landscape: GPT-4o, Gemini, and Beyond
As of early 2025, the unified multimodal model landscape is dominated by three families, each with distinct architectural choices and capabilities.
GPT-4o (OpenAI)
GPT-4o ("omni") was the first widely available native multimodal model, launched in May 2024. It processes text, images, and audio natively in a single model. The "o" in the name reflects the "omni" architecture: rather than routing different modalities through separate models, GPT-4o uses a unified transformer that operates on tokens from all modalities. Its image generation capabilities were significantly expanded in 2025, with GPT-4o gaining the ability to produce high-quality images alongside text responses.
Gemini 2.0/2.5 (Google)
Google's Gemini family uses a natively multimodal architecture trained on interleaved text, image, audio, and video data from the ground up. Gemini 2.0 Flash and Gemini 2.5 Pro support long-context windows (up to 1 million tokens for Gemini 2.5 Pro) that can include mixed-modality inputs, making them particularly strong for tasks involving long documents with embedded images, lengthy videos, or multi-turn conversations with visual context. Gemini 2.5 also introduced "thinking" capabilities (extended reasoning) for multimodal tasks.
Open-Weight Alternatives
On the open-weight side, models like LLaVA-OneVision, InternVL2, and Qwen2-VL have closed much of the gap with proprietary models on vision-language tasks. However, true any-to-any generation (accepting and producing images, audio, and video) remains primarily a capability of proprietary systems due to the enormous compute requirements for end-to-end multimodal training.
Who: Customer experience team at a consumer electronics company
Situation: Customers frequently submitted support tickets with photos of defective products, screenshots of error messages, and voice recordings describing their issues. The support team needed to triage and diagnose across all these modalities.
Problem: The existing pipeline used separate OCR for screenshots, speech-to-text for recordings, and a text classifier for routing. Each modality was processed independently, losing the connections between what customers showed, said, and typed.
Dilemma: Upgrading each pipeline component individually would cost months of engineering per modality. A unified approach with GPT-4o would be simpler to deploy but more expensive per query.
Decision: The team replaced the multi-model pipeline with GPT-4o, passing all customer-submitted media (images, audio, text) into a single API call for joint analysis and triage.
How: Customer submissions were routed to GPT-4o with a structured prompt requesting: issue category, severity, affected product, and suggested resolution steps. The model analyzed photos of damage alongside the customer's text description and audio explanation simultaneously.
Result: Triage accuracy improved from 72% to 91%. Average resolution time decreased by 35% because the model could identify the root cause from the photo even when the customer's text description was vague. Engineering maintenance dropped from three separate pipelines to one.
Lesson: Native multimodal models shine when the diagnostic signal is distributed across modalities, because they can correlate visual evidence with spoken or written descriptions in ways that pipeline systems cannot.
Cost management for multimodal API calls. Multimodal API calls are significantly more expensive than text-only calls because image and audio tokens consume large portions of the context window. A single high-resolution image sent to GPT-4o can cost 1,000+ tokens. Production strategies to control costs:
- Resize images to the minimum resolution needed for the task: use detail: "low" for classification and detail: "high" only for OCR or fine-grained analysis.
- For video analysis, sample frames at 1 fps instead of sending every frame; for 24 fps source video this reduces token cost by 24x while preserving most temporal information.
- Use a cheaper model (GPT-4o mini, Gemini 2.0 Flash) as a first-pass filter and route only ambiguous cases to the more expensive model.
- Cache multimodal embeddings so repeated analysis of the same image reuses a cached representation rather than re-encoding.
Gemini's 1M-token context window is particularly cost-effective for long-document or multi-image tasks where you can batch many images into a single call.
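A small estimator makes the image-token costs concrete. This follows OpenAI's published accounting rules for GPT-4o-class models at the time of writing (85 base tokens; high detail resizes to fit 2048x2048 with shortest side 768, then adds 170 tokens per 512px tile); verify the constants against current pricing docs before budgeting with them:

```python
import math

def gpt4o_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate image token cost for GPT-4o-style vision input.
    Constants from OpenAI's published rules; may change between models."""
    if detail == "low":
        return 85  # flat cost regardless of resolution
    # Downscale to fit within 2048 x 2048
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Downscale so the shortest side is at most 768
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    # 170 tokens per 512px tile, plus an 85-token base
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles

print(gpt4o_image_tokens(1024, 1024, "low"))   # 85
print(gpt4o_image_tokens(1024, 1024, "high"))  # 765
```

Comparing the two calls shows why defaulting to detail: "low" for classification-style tasks is a 9x saving per image.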
- Native multimodal models eliminate the information loss inherent in pipeline approaches. By processing all modalities in a single transformer with cross-modal attention, they enable richer reasoning than chaining separate specialist models.
- Early fusion allows deeper cross-modal reasoning than late fusion. GPT-4o and Gemini use early fusion (interleaved tokens from all modalities), while older systems like Flamingo use late fusion (separate encoders joined late). The trade-off is compute cost vs. reasoning depth.
- Any-to-any generation requires modality-specific decoders sharing a common backbone. The shared transformer provides cross-modal reasoning; separate heads for text, images, and audio handle the final generation step.
- Training unified models requires careful data balance across modalities. Text data is abundant; high-quality cross-modal data is scarce. Without explicit balancing, models become text-dominant.
- Evaluate multimodal models with cross-modal benchmarks (MMMU, MMBench), not just unimodal scores. A model with high ImageNet accuracy and high text scores may still fail at tasks requiring joint reasoning across modalities.
- Open-weight models are closing the gap on vision-language tasks, but true any-to-any generation remains primarily proprietary. The compute requirements for end-to-end multimodal training are a significant barrier for open-source efforts.
Open Questions:
- Can unified multimodal models match the quality of specialist models in every modality, or is there a fundamental trade-off between breadth and depth? Current evidence suggests unified models lag behind specialists in audio generation quality and fine-grained image understanding.
- How should we tokenize video efficiently? Current approaches (sampling frames, using 3D patch embeddings) are computationally expensive. Learned video tokenizers that compress temporal redundancy could make long-video understanding practical.
Recent Developments (2024-2025):
- Google's Gemini 2.5 (2025) introduced "thinking" capabilities for multimodal reasoning, showing that test-time compute scaling (as discussed in Section 22.5) extends beyond text to cross-modal tasks.
- OpenAI's native image generation in GPT-4o (March 2025) demonstrated that a single model can both understand and generate images at high quality, moving toward true any-to-any generation without external diffusion models.
- The MMMU-Pro benchmark (2024) raised the bar for multimodal evaluation by requiring free-form reasoning rather than multiple-choice answers, exposing significant gaps in current models' cross-modal reasoning abilities.
Explore Further: Compare the same cross-modal reasoning task (e.g., "compare the whiteboard diagram with the audio explanation") on a pipeline system (separate vision encoder + speech-to-text + LLM) and a native multimodal model (Gemini 2.5). Analyze where the pipeline loses information that the native model preserves.
Exercises
Compare pipeline multimodal systems (separate models connected by code) with native multimodal models (single model processing multiple modalities). What are the advantages of each approach?
Answer Sketch
Pipeline: easier to build, each component can be optimized independently, easier to debug and replace individual parts. Native: lower latency (single forward pass), better cross-modal understanding (visual and text features interact at every layer), can capture subtle relationships that pipeline approaches miss. Pipeline is pragmatic today; native is the future direction.
Explain early fusion and late fusion in multimodal architectures. Draw (or describe) the information flow in each approach and discuss when each is preferred.
Answer Sketch
Early fusion: combine modalities at the input level (e.g., interleave image tokens with text tokens before processing). Advantage: deep cross-modal interaction from the first layer. Late fusion: process each modality with separate encoders, then combine the representations later. Advantage: each encoder can be pre-trained independently. Early fusion is better for tasks requiring tight integration (VQA); late fusion is better for tasks where modalities are loosely related (retrieval).
Write a multimodal prompt using the Anthropic API that sends an image along with a text question. The prompt should ask the model to describe the image and identify any text visible in it.
Answer Sketch
Use anthropic.Anthropic().messages.create() with a message containing both an image content block (base64-encoded or URL) and a text content block with the question. The model processes both together and returns a unified response that references visual and textual content from the image.
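A worked version of this answer sketch is below. The filename "screenshot.png" and the model string are illustrative placeholders (check the current Anthropic docs for available model names); the API call only runs when ANTHROPIC_API_KEY is set:

```python
import base64
import os

def build_image_question(image_path: str, question: str) -> list:
    """Anthropic-style content blocks: one base64 image block, one text block."""
    with open(image_path, "rb") as f:
        data = base64.standard_b64encode(f.read()).decode()
    return [
        {"type": "image",
         "source": {"type": "base64", "media_type": "image/png", "data": data}},
        {"type": "text", "text": question},
    ]

if os.environ.get("ANTHROPIC_API_KEY"):
    import anthropic  # pip install anthropic
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder: check current model names
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": build_image_question(
                "screenshot.png",  # hypothetical input file
                "Describe this image and identify any text visible in it."),
        }],
    )
    print(response.content[0].text)
```

Note that the image and the question travel in one user message, so the model attends to both jointly rather than answering from the text alone.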
Describe the concept of 'any-to-any' generation in unified multimodal models. What architectural innovations make it possible for a single model to both understand and generate across modalities?
Answer Sketch
Any-to-any means the model can take any combination of modalities as input and produce any modality as output (text to image, image to text, audio to text, etc.). Key innovations: (1) shared tokenization across modalities (images, audio, and text all become token sequences), (2) a single transformer that processes all modalities, and (3) modality-specific decoders that convert output tokens back to images, audio, or text.
Compare the performance of GPT-4o, Gemini, and Claude on multimodal benchmarks (MMMU, MathVista). What patterns emerge in their strengths and weaknesses?
Answer Sketch
GPT-4o: strong on visual reasoning and chart understanding. Gemini: strong on long-context multimodal tasks and video understanding. Claude: strong on document analysis and careful instruction following. Patterns: all models struggle with tasks requiring precise spatial reasoning or counting objects in complex scenes. Performance on text-heavy images (documents, code screenshots) is generally better than on natural scenes requiring fine-grained visual understanding.
What Comes Next
In the next chapter, Chapter 28: LLM Applications, we turn from model architectures to practical applications, exploring how these multimodal and agentic capabilities are deployed in real-world products.
Bibliography
OpenAI. (2024). "GPT-4o System Card."
Google DeepMind. (2023). "Gemini: A Family of Highly Capable Multimodal Models." arXiv:2312.11805
Alayrac, J.-B., Donahue, J., Luc, P., et al. (2022). "Flamingo: a Visual Language Model for Few-Shot Learning." arXiv:2204.14198
Liu, H., Li, C., Wu, Q., Lee, Y. J. (2023). "Visual Instruction Tuning." arXiv:2304.08485
Yue, X., Ni, Y., Zhang, K., et al. (2024). "MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark." arXiv:2311.16502
Liu, Y., Duan, H., Zhang, Y., et al. (2024). "MMBench: Is Your Multi-modal Model an All-around Player?" arXiv:2307.06281