Part VII: AI Applications
Chapter 27: Multimodal Generation

Unified Multimodal Models & Omni-Architectures

A truly multimodal model does not translate between modalities. It thinks in all of them simultaneously.

Big Picture

The field of multimodal AI is undergoing a fundamental architectural shift from pipeline systems to unified models. Early multimodal systems bolted together separate models: a vision encoder, a text decoder, and an audio processor, connected by adapters and projection layers. While functional, these pipeline approaches suffer from information loss at every handoff. GPT-4o, Gemini 2.0/2.5, and similar "omni" models represent a new paradigm: a single transformer trained end-to-end across text, images, audio, and video that can both understand and generate in multiple modalities natively. This section covers the architectures behind these models, the trade-offs between fusion strategies, and how to evaluate multimodal systems using benchmarks like MMMU and MMBench. The embedding concepts from Section 19.1 provide context for understanding shared multimodal embedding spaces.

Prerequisites

This section builds on vision-language models from Section 27.1: Image Generation and Vision-Language Models, audio and video generation in Section 27.2: Audio, Music and Video Generation, and the transformer architecture covered in Section 04.1.

Key Insight

Imagine two ways to handle a conversation in three languages. The first approach (pipeline) hires separate translators: one listens in Japanese, translates to English, another reads English and writes a response, and a third translates the response to French. Each handoff loses nuance. The second approach (native multimodal) is a single person who thinks fluently in all three languages. They hear the Japanese question, understand its meaning directly, and respond in French without any intermediate translation step. Unified multimodal models are the second approach: they process images, text, and audio within a single model that represents all modalities in a shared embedding space, avoiding the information loss that comes from translating between separate specialized models.

1. Pipeline vs. Native Multimodal Architectures

The distinction between pipeline and native multimodal models is the most important architectural concept in this section. A pipeline model chains separate specialist models: CLIP encodes the image, Whisper transcribes the audio, and a language model reasons over the combined text representations. A native multimodal model processes all modalities within a single set of transformer layers, using a shared representation space where image tokens, audio tokens, and text tokens attend to each other directly.

Figure 27.4.1: Pipeline approach (left) chains separate encoders through projection layers, losing information at each handoff. Native multimodal models (right) process all modalities through a single transformer with cross-modal attention at every layer, preserving the full signal.
Property Comparison

| Property | Pipeline (e.g., LLaVA) | Native (e.g., GPT-4o, Gemini) |
| --- | --- | --- |
| Architecture | Separate encoders + projection + LLM | Single transformer, shared embeddings |
| Cross-modal reasoning | Limited by projection bottleneck | Full attention across modalities |
| Generation | Text only (typically) | Text, images, audio, video |
| Latency | Additive (each model adds delay) | Single forward pass (lower latency) |
| Training cost | Lower (can reuse pretrained components) | Much higher (end-to-end training) |
| Customizability | Can swap individual components | Monolithic; harder to modify |
Key Insight

Multimodal models need special tokenization strategies because different modalities have fundamentally different information densities. A single text token encodes roughly 4 characters (a word fragment). A single image patch token from a 224x224 image encodes a 16x16 pixel region containing hundreds of pixel values. A single audio frame token represents 20 milliseconds of waveform with thousands of amplitude samples. This mismatch creates a practical problem: a 1024-token context window can hold a 4,000-character text passage, a handful of compressed images, or roughly 20 seconds of audio at one token per 20 ms frame. Native multimodal models address this through learned compression (visual tokenizers like those in Gemini reduce a 224x224 image to 256 tokens), dynamic resolution (allocating more tokens to complex images and fewer to simple ones), and hierarchical encoding (capturing global structure at low token count and local detail at high token count). Understanding these tokenization trade-offs, which build on the text tokenization foundations from Chapter 02, is essential for designing multimodal applications that balance quality with context window efficiency.
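The information-density mismatch above is easy to make concrete with back-of-the-envelope arithmetic. The patch size, frame duration, and compression budget below are the illustrative figures from this section, not exact numbers for any specific model:

```python
# Back-of-the-envelope token budgets for a multimodal context window.
# Values are illustrative figures, not exact numbers for any real model.

CONTEXT_WINDOW = 1024          # tokens
CHARS_PER_TEXT_TOKEN = 4       # rough average for English text

# A 224x224 image split into 16x16 patches -> one token per patch
patches_per_side = 224 // 16                       # 14
image_tokens_raw = patches_per_side ** 2           # 196 tokens
image_tokens_compressed = 256  # e.g., a learned visual tokenizer budget

# Audio at one token per 20 ms frame
audio_tokens_per_second = 1000 // 20               # 50 tokens/s

print(f"Text capacity: {CONTEXT_WINDOW * CHARS_PER_TEXT_TOKEN} characters")
print(f"Raw image: {image_tokens_raw} tokens per 224x224 image")
print(f"Compressed images that fit: {CONTEXT_WINDOW // image_tokens_compressed}")
print(f"Audio that fits: {CONTEXT_WINDOW / audio_tokens_per_second:.1f} seconds")
```

The arithmetic shows why dynamic resolution matters: at a fixed 256-token budget, only four images fit in this window, regardless of how simple each one is.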

2. Early Fusion vs. Late Fusion

The distinction between early and late fusion describes where in the model different modalities are combined. In early fusion, raw inputs from all modalities are tokenized and concatenated into a single sequence before the first transformer layer. Every attention layer can attend across modalities from the very start. In late fusion, each modality is processed by separate encoder stacks for many layers, and the representations are only combined in the final layers.

Early Fusion (GPT-4o style)

GPT-4o and Gemini 2.0 use early fusion: images are converted to visual tokens by a patch-based tokenizer, audio is converted to audio tokens by a learned encoder, and text uses the standard tokenizer. All tokens are interleaved into a single sequence and fed through the full transformer stack. This allows the model to reason about relationships between modalities from the earliest layers. For example, the model can associate a spoken word with a visual region in the first few layers, rather than having to reconstruct this relationship after separate encoding.
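A minimal sketch of what "interleaved into a single sequence" means, using plain Python stand-ins for the real tokenizers (the tagging scheme and token names here are invented for illustration):

```python
# Early fusion sketch: tag tokens with their modality and interleave them
# into one sequence before the first transformer layer. The tokenizers
# and token names here are invented stand-ins for illustration.

def text_tokens(s):
    return [("text", tok) for tok in s.split()]

def image_tokens(n_patches):
    return [("image", f"patch_{i}") for i in range(n_patches)]

def audio_tokens(n_frames):
    return [("audio", f"frame_{i}") for i in range(n_frames)]

# Build one interleaved sequence: every token, regardless of modality,
# occupies a position that every attention layer can see from layer 1.
sequence = (
    text_tokens("describe the whiteboard :")
    + image_tokens(4)              # visual tokens from a patch tokenizer
    + text_tokens("and the speech :")
    + audio_tokens(3)              # audio tokens from a learned encoder
)

print(len(sequence), [m for m, _ in sequence])
```

Because all 15 positions live in one sequence, an attention head in the first layer can already relate `patch_2` to `frame_1`; in a late fusion design that relationship only becomes possible after the fusion point.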

Tip

When choosing between early fusion and late fusion multimodal models for your application, consider your fine-tuning budget. Late fusion models let you freeze the vision encoder and only train the cross-attention layers, cutting GPU memory requirements by 40 to 60%. If your task mainly needs visual understanding (image captioning, visual QA), late fusion with a frozen encoder is often sufficient. Reserve early fusion for tasks that require tight cross-modal reasoning, such as models that must describe spatial relationships between objects in a scene while following complex text instructions.

Late Fusion (Flamingo style)

Late fusion models like Flamingo process images through a frozen vision encoder for many layers, producing a compact set of visual features. These features are then injected into the language model via cross-attention layers at specific positions. The advantage is efficiency: the vision encoder can be pretrained independently and frozen, reducing training cost. The disadvantage is that cross-modal interactions are limited to the layers where injection occurs. Figure 27.4.2 compares early fusion, late fusion, and cross-attention fusion architectures side by side, and Code Fragment 27.4.1 below shows the contrasting early-fusion approach in practice.
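The "gated" cross-attention used by Flamingo has a property worth sketching: the gate is initialized to zero, so a freshly inserted injection layer initially leaves the frozen LLM's activations untouched and only gradually learns to mix in visual features. A toy scalar version follows; real implementations operate on vectors with learned projections, and the numbers here are arbitrary:

```python
# Toy sketch of tanh-gated cross-attention (Flamingo-style injection).
# Scalars stand in for hidden-state vectors; values are arbitrary.
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def gated_cross_attention(hidden, visual_feats, gate_alpha):
    """Text hidden state (query) attends over visual features (keys/values).

    gate_alpha is a learned parameter initialized to 0, so tanh(0) == 0
    and the layer is an identity function at the start of training.
    """
    scores = softmax([hidden * v for v in visual_feats])   # attention weights
    attended = sum(w * v for w, v in zip(scores, visual_feats))
    # Residual connection through the tanh gate: inserting this layer
    # does not disturb the pretrained LLM until the gate opens.
    return hidden + math.tanh(gate_alpha) * attended

h, vis = 0.5, [0.2, -0.1, 0.9]
print(gated_cross_attention(h, vis, gate_alpha=0.0))  # == h at initialization
print(gated_cross_attention(h, vis, gate_alpha=1.0))  # visual signal mixed in
```

This zero-init gating is why cross-attention fusion "preserves LLM capabilities": the frozen language model behaves exactly as before training begins, and visual influence grows only as the gates learn to open.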

Figure 27.4.2: Three multimodal fusion strategies. Early fusion (left) allows cross-modal attention at every layer. Late fusion (center) combines separately encoded modalities in a shared decoder. Cross-attention fusion (right) inserts gated cross-attention at specific layers of a frozen LLM, with vision features serving as keys/values.

# Gemini 2.5: native multimodal with early fusion
# The model processes image + audio + text in a single forward pass,
# enabling cross-modal reasoning without intermediate transcription
import google.generativeai as genai
from pathlib import Path

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-2.5-flash")

# Upload an image and audio file
image = genai.upload_file(Path("meeting_whiteboard.jpg"))
audio = genai.upload_file(Path("meeting_recording.mp3"))

# Cross-modal query: reason across image and audio simultaneously
response = model.generate_content([
    image,
    audio,
    "Compare the action items discussed in the audio recording with "
    "what was written on the whiteboard. Identify any items mentioned "
    "verbally but missing from the whiteboard."
])

print(response.text)

Output:

Comparing the audio discussion with the whiteboard notes:

**On whiteboard but not discussed verbally:**
- "Update CI/CD pipeline" (written in bottom-right corner)

**Discussed verbally but missing from whiteboard:**
- Migrate staging database to PostgreSQL 16 (mentioned at ~3:45)
- Schedule design review with the frontend team (mentioned at ~7:20)
...
Code Fragment 27.4.1: Gemini 2.5: native multimodal with early fusion

3. Any-to-Any Generation

The most striking capability of unified multimodal models is any-to-any generation: the ability to take input in any combination of modalities and produce output in any modality. GPT-4o can accept text and produce audio, accept an image and produce text, or accept audio and produce text with an image. This is fundamentally different from pipeline systems where each input-output modality pair requires a separate model chain.

Architecturally, any-to-any generation requires the model to have both encoders and decoders for every supported modality. For text, the standard autoregressive decoder suffices. For images, the model typically includes a diffusion decoder or a discrete visual token decoder that generates images token by token. For audio, a vocoder or audio token decoder converts the model's internal representations into waveforms. The shared transformer backbone provides the cross-modal reasoning, while modality-specific heads handle the final generation step. Code Fragment 27.4.2 below puts this into practice.
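The "shared backbone plus modality-specific heads" arrangement described above can be sketched as a simple dispatch: one function produces a shared representation, and a registry of decoder heads converts it into the requested output modality. All names and the stub decoders below are invented for illustration:

```python
# Sketch of any-to-any routing: a shared backbone produces one internal
# representation; modality-specific heads decode it into the target
# modality. All class and function names are illustrative stubs.

def shared_backbone(inputs):
    # Stand-in for the unified transformer: fuse whatever modalities arrive.
    return {"fused": [kind for kind, _ in inputs]}

def decode_text(rep):
    return f"text describing {rep['fused']}"

def decode_image(rep):
    return f"<image tokens conditioned on {rep['fused']}>"

def decode_audio(rep):
    return f"<waveform from vocoder, conditioned on {rep['fused']}>"

DECODERS = {"text": decode_text, "image": decode_image, "audio": decode_audio}

def generate(inputs, target_modality):
    rep = shared_backbone(inputs)        # cross-modal reasoning happens here
    head = DECODERS[target_modality]     # modality-specific generation head
    return head(rep)

# Audio + image in, text out -- no per-pair model chain required.
print(generate([("audio", "recording.mp3"), ("image", "photo.jpg")], "text"))
```

The key structural point is that the number of supported input-output pairs grows with the number of heads, not with the number of pairs: three decoders cover every combination, whereas a pipeline system would need a separate chain per pair.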


# Any-to-any generation: analyze an image, then generate a new one based on the analysis
# Demonstrates cross-modal flow from image understanding to image generation
from openai import OpenAI
import base64

client = OpenAI()

# Step 1: Analyze an image with text (GPT-4o: native multimodal understanding)
with open("chart.png", "rb") as f:
    image_data = base64.b64encode(f.read()).decode()

analysis = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_data}"}},
            {"type": "text",
             "text": "Describe the trends in this chart. What are the key insights?"}
        ]
    }]
)
print(analysis.choices[0].message.content)

# Step 2: Generate an improved version of the chart (image generation)
# Note: gpt-image-1 returns base64-encoded image data rather than a URL
image_response = client.images.generate(
    model="gpt-image-1",
    prompt=(
        f"Create a clean, professional bar chart showing the following data: "
        f"{analysis.choices[0].message.content[:500]}. "
        f"Use a modern color palette with clear labels and a title."
    ),
    size="1024x1024"
)
with open("improved_chart.png", "wb") as f:
    f.write(base64.b64decode(image_response.data[0].b64_json))
print("Saved improved chart to improved_chart.png")

Output:

The chart shows a steady upward trend in quarterly revenue from Q1 ($2.1M) to Q4 ($4.2M), representing approximately 100% year-over-year growth. The steepest increase occurred between Q2 and Q3, suggesting a seasonal acceleration...

Saved improved chart to improved_chart.png
Code Fragment 27.4.2: Any-to-any generation: analyze an image, then generate a new one based on the analysis
Library Shortcut: transformers pipeline in Practice

For quick image captioning without API keys, the transformers pipeline (pip install transformers) handles model loading and inference in 3 lines:


from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip2-opt-2.7b")
result = captioner("chart.png")
print(result[0]["generated_text"]) # "a bar chart showing quarterly revenue growth"
Code Fragment 27.4.3: Quick image captioning with the transformers pipeline

4. Training Unified Multimodal Models

Training a unified multimodal model requires solving several challenges that do not exist in text-only training. The first is tokenization across modalities: images, audio, and text have fundamentally different structures and information densities. A single image can contain as much information as thousands of text tokens, and audio streams have a very different temporal structure than text sequences. The model needs a tokenization strategy that represents each modality efficiently while allowing meaningful cross-modal attention.

Training Stages

Unified multimodal models typically follow a staged training curriculum, where each phase builds on the previous one:

  1. Modality-specific pretraining: Train separate encoders and decoders on large unimodal datasets (ImageNet for vision, LibriSpeech for audio, text corpora for language). This gives each modality a strong foundation.
  2. Alignment pretraining: Train the model on paired multimodal data (image-caption pairs, audio-transcript pairs) to align representations across modalities in a shared embedding space.
  3. Unified fine-tuning: Fine-tune the complete model end-to-end on interleaved multimodal data where inputs and outputs can be in any modality. This stage teaches the model to reason across modalities and generate in any target modality.
  4. Instruction tuning: Fine-tune on human-curated instruction-following data that covers multimodal tasks: "describe this image," "transcribe this audio," "generate an image of..." This stage aligns the model with user expectations.
Warning

Training data for different modalities varies enormously in availability and quality. There are trillions of text tokens available on the web, but far fewer high-quality image-text pairs, and even fewer high-quality audio-text-image triplets. If training data is not carefully balanced, the model can become "text-dominant," treating visual and audio inputs as secondary signals rather than first-class modalities. Google's Gemini team addressed this by curating dedicated cross-modal datasets and using loss weighting to ensure each modality receives adequate training signal.
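The loss-weighting idea in the warning above can be sketched directly: scale each modality's loss so that abundant text does not dominate the gradient. The inverse-frequency scheme and all numbers below are illustrative, not the actual recipe used by Gemini or any other production model:

```python
# Sketch of per-modality loss weighting to counter text dominance.
# The inverse-frequency scheme and the token counts are illustrative,
# not the recipe used by any production model.

batch_losses = {"text": 2.1, "image": 3.4, "audio": 3.9}   # per-modality losses
tokens_seen = {"text": 1_000_000, "image": 50_000, "audio": 10_000}

total_tokens = sum(tokens_seen.values())

# Weight each modality inversely to its share of the training data,
# then normalize so the weights sum to 1.
raw = {m: total_tokens / n for m, n in tokens_seen.items()}
norm = sum(raw.values())
weights = {m: w / norm for m, w in raw.items()}

weighted_loss = sum(weights[m] * batch_losses[m] for m in batch_losses)
print({m: round(w, 3) for m, w in weights.items()})
print(round(weighted_loss, 3))
```

With these counts, the scarce audio data receives the largest weight and text the smallest, so a gradient step cannot reduce total loss by improving text prediction alone while ignoring the rarer modalities.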

5. Multimodal Benchmarks

Evaluating multimodal models requires benchmarks that test cross-modal reasoning, not just performance on each modality in isolation. A model that achieves high accuracy on ImageNet and low perplexity on text benchmarks might still fail at tasks requiring joint reasoning across images and text. The major multimodal benchmarks as of 2025 include:

Benchmark Comparison

| Benchmark | What It Tests | Key Features |
| --- | --- | --- |
| MMMU | Expert-level multimodal understanding | 11.5k questions from college exams across 30+ subjects requiring image + text reasoning |
| MMBench | Broad multimodal capabilities | Hierarchical evaluation with 20 ability dimensions, bilingual (EN/CN) |
| MathVista | Mathematical reasoning with visuals | Tests interpretation of charts, geometry, and scientific figures |
| Video-MME | Video understanding | Tests temporal reasoning across video frames with multiple-choice questions |
| SEED-Bench | Generative multimodal comprehension | 19k multiple-choice questions covering 12 evaluation dimensions for images and video |
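Benchmarks like MMBench report accuracy per ability dimension rather than a single aggregate score, which is what makes them useful for localizing weaknesses. A minimal scorer in that spirit (the dimension names and result records are fabricated examples):

```python
# Minimal per-dimension accuracy scorer in the spirit of MMBench-style
# hierarchical evaluation. Dimension names and records are fabricated.
from collections import defaultdict

results = [
    {"dimension": "spatial_reasoning", "correct": True},
    {"dimension": "spatial_reasoning", "correct": False},
    {"dimension": "ocr",               "correct": True},
    {"dimension": "ocr",               "correct": True},
    {"dimension": "chart_reading",     "correct": False},
]

totals, hits = defaultdict(int), defaultdict(int)
for r in results:
    totals[r["dimension"]] += 1
    hits[r["dimension"]] += r["correct"]

per_dim = {d: hits[d] / totals[d] for d in totals}
overall = sum(hits.values()) / sum(totals.values())

print(per_dim)            # accuracy per ability dimension
print(round(overall, 2))  # aggregate accuracy: 0.6
```

Here the aggregate 0.6 hides the fact that the hypothetical model is perfect at OCR but fails entirely at chart reading, exactly the kind of pattern a per-dimension breakdown is designed to expose.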
Fun Fact

When MMMU was first released, researchers noticed that some multimodal models scored better on questions with complex diagrams than on questions with simple text. It turned out the models were using visual features of the question formatting (font size, layout, color coding) as shortcuts to guess the answer, rather than actually reasoning about the diagram content. The benchmark creators had to redesign several questions to eliminate these visual "cheats," a reminder that models will exploit any shortcut available, whether textual or visual.

6. The Current Landscape: GPT-4o, Gemini, and Beyond

As of early 2025, the unified multimodal model landscape is dominated by three families, each with distinct architectural choices and capabilities.

GPT-4o (OpenAI)

GPT-4o ("omni") was the first widely available native multimodal model, launched in May 2024. It processes text, images, and audio natively in a single model. The "o" in the name reflects the "omni" architecture: rather than routing different modalities through separate models, GPT-4o uses a unified transformer that operates on tokens from all modalities. Its image generation capabilities were significantly expanded in 2025, with GPT-4o gaining the ability to produce high-quality images alongside text responses.

Gemini 2.0/2.5 (Google)

Google's Gemini family uses a natively multimodal architecture trained on interleaved text, image, audio, and video data from the ground up. Gemini 2.0 Flash and Gemini 2.5 Pro support long-context windows (up to 1 million tokens for Gemini 2.5 Pro) that can include mixed-modality inputs, making them particularly strong for tasks involving long documents with embedded images, lengthy videos, or multi-turn conversations with visual context. Gemini 2.5 also introduced "thinking" capabilities (extended reasoning) for multimodal tasks.

Open-Weight Alternatives

On the open-weight side, models like LLaVA-OneVision, InternVL2, and Qwen2-VL have closed much of the gap with proprietary models on vision-language tasks. However, true any-to-any generation (accepting and producing images, audio, and video) remains primarily a capability of proprietary systems due to the enormous compute requirements for end-to-end multimodal training.

Real-World Scenario: Multimodal Customer Support with GPT-4o

Who: Customer experience team at a consumer electronics company

Situation: Customers frequently submitted support tickets with photos of defective products, screenshots of error messages, and voice recordings describing their issues. The support team needed to triage and diagnose across all these modalities.

Problem: The existing pipeline used separate OCR for screenshots, speech-to-text for recordings, and a text classifier for routing. Each modality was processed independently, losing the connections between what customers showed, said, and typed.

Dilemma: Upgrading each pipeline component individually would cost months of engineering per modality. A unified approach with GPT-4o would be simpler to deploy but more expensive per query.

Decision: The team replaced the multi-model pipeline with GPT-4o, passing all customer-submitted media (images, audio, text) into a single API call for joint analysis and triage.

How: Customer submissions were routed to GPT-4o with a structured prompt requesting: issue category, severity, affected product, and suggested resolution steps. The model analyzed photos of damage alongside the customer's text description and audio explanation simultaneously.

Result: Triage accuracy improved from 72% to 91%. Average resolution time decreased by 35% because the model could identify the root cause from the photo even when the customer's text description was vague. Engineering maintenance dropped from three separate pipelines to one.

Lesson: Native multimodal models shine when the diagnostic signal is distributed across modalities, because they can correlate visual evidence with spoken or written descriptions in ways that pipeline systems cannot.

Self-Check
Q1: What is the key advantage of native multimodal models over pipeline approaches?
Answer:
Native multimodal models process all modalities within a single transformer with cross-modal attention, allowing every layer to reason across modalities directly. Pipeline approaches chain separate encoders through projection layers, losing information at each handoff. The result is that native models can perform richer cross-modal reasoning (such as associating a spoken word with a visual region) that pipeline models cannot because the relevant information was lost during projection.
Q2: Explain the difference between early fusion and late fusion in multimodal architectures.
Answer:
In early fusion, all modalities are tokenized and concatenated into a single sequence before the first transformer layer. Every attention layer processes all modalities together from the start, enabling rich cross-modal reasoning throughout the network. In late fusion, each modality is processed by separate encoder stacks for many layers, and representations are only combined in the final layers (typically through cross-attention injection). Early fusion enables deeper cross-modal reasoning but requires more compute. Late fusion is more efficient and allows reuse of pretrained encoders but limits cross-modal interactions to the fusion layers.
Q3: Why is data balance a critical challenge when training unified multimodal models?
Answer:
Training data availability varies enormously across modalities: trillions of text tokens exist on the web, but far fewer high-quality image-text pairs, and even fewer audio-text-image triplets. If training is not carefully balanced, the model becomes text-dominant, treating visual and audio inputs as secondary signals rather than first-class modalities. Solutions include curating dedicated cross-modal datasets, using loss weighting to ensure each modality receives adequate training signal, and oversampling rare modality combinations.
Production Tip

Cost management for multimodal API calls. Multimodal API calls are significantly more expensive than text-only calls because image and audio tokens consume large portions of the context window. A single high-resolution image sent to GPT-4o can cost 1,000+ tokens. Production strategies to control costs: (1) resize images to the minimum resolution needed for the task (use detail: "low" for classification, detail: "high" only for OCR or fine-grained analysis); (2) for video analysis, sample frames at 1 fps instead of sending every frame, which reduces token cost by 24x while preserving most temporal information; (3) use a cheaper model (GPT-4o mini, Gemini 2.0 Flash) as a first-pass filter and route only ambiguous cases to the more expensive model; (4) cache multimodal embeddings so repeated analysis of the same image uses a cached representation rather than re-encoding. Gemini's 1M-token context window is particularly cost-effective for long-document or multi-image tasks where you can batch many images into a single call.
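The savings from strategy (2) above are easy to quantify. A sketch using the frame-sampling numbers from the tip; the per-frame token count and per-token price are illustrative assumptions, so check current provider pricing before budgeting:

```python
# Back-of-the-envelope multimodal cost estimate. The per-frame token
# count and price are illustrative assumptions, not current rates.

TOKENS_PER_FRAME = 1000              # assumed tokens per high-detail image
PRICE_PER_1K_INPUT_TOKENS = 0.0025   # assumed USD, illustration only

def video_cost(duration_s, fps):
    frames = int(duration_s * fps)
    tokens = frames * TOKENS_PER_FRAME
    return frames, tokens * PRICE_PER_1K_INPUT_TOKENS / 1000

# A 60-second clip: every frame at 24 fps vs. sampling at 1 fps
frames_full, cost_full = video_cost(60, 24)
frames_sampled, cost_sampled = video_cost(60, 1)

print(f"All frames: {frames_full} frames, ${cost_full:.2f}")
print(f"1 fps:      {frames_sampled} frames, ${cost_sampled:.2f}")
print(f"Reduction:  {frames_full / frames_sampled:.0f}x")
```

Under these assumptions a single minute of video at full frame rate costs as much as hundreds of text-only calls, which is why frame sampling and cheap-model pre-filtering are usually the first two levers to pull.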

Research Frontier

Explore Further: Compare the same cross-modal reasoning task (e.g., "compare the whiteboard diagram with the audio explanation") on a pipeline system (separate vision encoder + speech-to-text + LLM) and a native multimodal model (Gemini 2.5). Analyze where the pipeline loses information that the native model preserves.

Exercises

Exercise 27.4.1: Pipeline vs. Native Multimodal Conceptual

Compare pipeline multimodal systems (separate models connected by code) with native multimodal models (single model processing multiple modalities). What are the advantages of each approach?

Answer Sketch

Pipeline: easier to build, each component can be optimized independently, easier to debug and replace individual parts. Native: lower latency (single forward pass), better cross-modal understanding (visual and text features interact at every layer), can capture subtle relationships that pipeline approaches miss. Pipeline is pragmatic today; native is the future direction.

Exercise 27.4.2: Early vs. Late Fusion Conceptual

Explain early fusion and late fusion in multimodal architectures. Draw (or describe) the information flow in each approach and discuss when each is preferred.

Answer Sketch

Early fusion: combine modalities at the input level (e.g., interleave image tokens with text tokens before processing). Advantage: deep cross-modal interaction from the first layer. Late fusion: process each modality with separate encoders, then combine the representations later. Advantage: each encoder can be pre-trained independently. Early fusion is better for tasks requiring tight integration (VQA); late fusion is better for tasks where modalities are loosely related (retrieval).

Exercise 27.4.3: Multimodal Prompt Design Coding

Write a multimodal prompt using the Anthropic API that sends an image along with a text question. The prompt should ask the model to describe the image and identify any text visible in it.

Answer Sketch

Use anthropic.Anthropic().messages.create() with a message containing both an image content block (base64-encoded or URL) and a text content block with the question. The model processes both together and returns a unified response that references visual and textual content from the image.

Exercise 27.4.4: Any-to-Any Generation Conceptual

Describe the concept of 'any-to-any' generation in unified multimodal models. What architectural innovations make it possible for a single model to both understand and generate across modalities?

Answer Sketch

Any-to-any means the model can take any combination of modalities as input and produce any modality as output (text to image, image to text, audio to text, etc.). Key innovations: (1) shared tokenization across modalities (images, audio, and text all become token sequences), (2) a single transformer that processes all modalities, and (3) modality-specific decoders that convert output tokens back to images, audio, or text.

Exercise 27.4.5: Multimodal Benchmark Analysis Analysis

Compare the performance of GPT-4o, Gemini, and Claude on multimodal benchmarks (MMMU, MathVista). What patterns emerge in their strengths and weaknesses?

Answer Sketch

GPT-4o: strong on visual reasoning and chart understanding. Gemini: strong on long-context multimodal tasks and video understanding. Claude: strong on document analysis and careful instruction following. Patterns: all models struggle with tasks requiring precise spatial reasoning or counting objects in complex scenes. Performance on text-heavy images (documents, code screenshots) is generally better than on natural scenes requiring fine-grained visual understanding.

What Comes Next

In the next chapter, Chapter 28: LLM Applications, we turn from model architectures to practical applications, exploring how these multimodal and agentic capabilities are deployed in real-world products.

Bibliography

Unified Multimodal Models

OpenAI. (2024). "GPT-4o System Card." OpenAI

The system card for GPT-4o describing its native multimodal architecture, safety evaluations, and capability assessments across text, vision, and audio. Essential reference for understanding the first widely deployed omni-model.
Model Documentation

Google DeepMind. (2023). "Gemini: A Family of Highly Capable Multimodal Models." arXiv:2312.11805

Technical report for the Gemini model family, describing the natively multimodal architecture trained on interleaved text, image, audio, and video data. Covers training methodology, benchmark results, and the architectural choices behind early fusion.
Technical Report
Multimodal Architectures

Alayrac, J.-B., Donahue, J., Luc, P., et al. (2022). "Flamingo: a Visual Language Model for Few-Shot Learning." arXiv:2204.14198

Introduces the Flamingo architecture for late-fusion multimodal learning. Demonstrates how cross-attention layers can inject visual features into a frozen language model. Important baseline for understanding the evolution toward early fusion approaches.
Architecture

Liu, H., Li, C., Wu, Q., Lee, Y. J. (2023). "Visual Instruction Tuning." arXiv:2304.08485

The LLaVA paper that popularized the pipeline approach to multimodal LLMs using a vision encoder, projection layer, and language model. Important reference for understanding the pipeline baseline that native multimodal models aim to surpass.
Architecture
Benchmarks

Yue, X., Ni, Y., Zhang, K., et al. (2024). "MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark." arXiv:2311.16502

Introduces MMMU, the standard benchmark for expert-level multimodal understanding. Contains 11.5k questions from college exams requiring joint image-text reasoning across 30+ disciplines. Essential for evaluating unified multimodal models.
Benchmark

Liu, Y., Duan, H., Zhang, Y., et al. (2024). "MMBench: Is Your Multi-modal Model an All-around Player?" arXiv:2307.06281

Presents MMBench with its hierarchical evaluation framework covering 20 ability dimensions. The bilingual (English/Chinese) design and systematic ability decomposition make it valuable for identifying specific multimodal weaknesses.
Benchmark