Section 22.7: Early Fusion vs Late Fusion

"Where you mix the modalities determines how deeply they ever talk to each other."
Pixel, Fusion-Curious AI Agent

Big Picture

Once you commit to a native multimodal architecture (see Section 22.6), the next question is where in the network the modalities combine. Early-fusion models tokenize every modality and feed all tokens into a single transformer from the input embedding layer. Late-fusion models keep modality-specific encoders that produce features, then combine those features near the top of the network. The choice has deep consequences for cross-modal reasoning depth, training data efficiency, and the kind of capabilities the final model exhibits. This section examines the two patterns and the gradient of intermediate "mid-fusion" designs (LLaVA-style projection layers, Q-Formers, perceivers) that dominate the open-source landscape.

Prerequisites

This section assumes the VLM architecture from Section 22.1 through Section 22.4, and the cross-attention mechanics from Section 4.3.

[Figure: three architecture diagrams showing early fusion (all modalities tokenized at the input), mid fusion (modality encoders feeding a projection layer into the LLM), and late fusion (modality encoders combined near the output).]

Figure 22.7.1: The fusion spectrum. Early fusion shares all transformer layers across modalities. Late fusion shares almost nothing until the final layers. Mid-fusion (e.g., LLaVA) sits in between with a projection layer that maps a frozen vision encoder's output into the LLM's embedding space.

22.7.1 Three Points on the Spectrum

Fun Fact

Meta's Chameleon applies early fusion to both of its modalities, text and images. It was built to generate interleaved text and images, yet its public release ships with image generation disabled because the team decided the safety risk was too high. The model that could do the most was the model that was allowed to do the least.

Key Insight: Mental Model: A Three-Stage Onion

Think of fusion as how soon an image's bytes start influencing a sentence's grammar. Three food metaphors cover the spectrum:

Early fusion is a smoothie. Text, image, and audio tokens are blended at layer 1, and from then on the transformer has no modality-specific pathway: every layer processes every token type with the same weights.
Mid-fusion is a sandwich. Each modality is prepared separately (CLIP for images, Whisper for audio), then layered into a fixed slot at the LLM input.
Late fusion is a side-by-side tasting flight. Two encoders run in parallel and only meet at the contrastive head.

The training cost order matches the depth of mixing. Chameleon (early) required from-scratch training of one giant model. LLaVA (mid) initially required training only a small projector (a few million parameters for a linear layer, roughly 20 million for a 2-layer MLP) against a frozen Vicuna, itself a Llama fine-tune. CLIP (late) required two encoders trained against each other with a single dot product as the bridge.

Where this model breaks down: the onion is a one-dimensional axis; real systems often mix strategies (for example, a late-fusion retriever such as CLIP selects candidate images that a mid-fusion VLM then reads), so "early/mid/late" is a useful taxonomy for one-modality-at-a-time, not a hard architecture rule.

Fusion is a continuum, but three operating points cover most production systems:

Early fusion (Chameleon, Llama-4-Omni input side): every modality is tokenized at the input. Text uses BPE tokens, images use VQ-VAE or similar discrete codes, audio uses a neural codec like SoundStream. All tokens are interleaved and processed by the same transformer from layer 1 onward.
Mid-fusion (LLaVA, Qwen-VL, Idefics2): each modality has its own pretrained, usually frozen, encoder (CLIP-ViT for images, Whisper for audio, etc.) that produces feature vectors. A small projection layer (linear or MLP) maps those features into the LLM's embedding space. The projected features are then prefixed to or interleaved with the text token embeddings and fed into a vanilla text LLM.
Late fusion (CLIP-style retrieval models): modality-specific encoders produce embeddings that meet only at the contrastive head or the very last attention layer. Late fusion is dominant in retrieval (Section 33.1) but rare in generation.
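The late-fusion end of the spectrum is the easiest to show in code, because the only cross-modal interaction is a table of dot products. The sketch below is illustrative: the three-dimensional vectors are made-up stand-ins for encoder outputs, and the helper names `l2_normalize` and `similarity_matrix` are hypothetical, not part of any library.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def similarity_matrix(image_embs, text_embs):
    """Late fusion: the two modalities meet only in this table of dot products."""
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(img, txt)) for txt in texts]
            for img in images]

# Toy stand-ins for encoder outputs (assumed values, not real CLIP embeddings).
image_embs = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.8]]
text_embs = [[1.0, 0.0, 0.1], [0.1, 0.1, 1.0]]

scores = similarity_matrix(image_embs, text_embs)
# For each image, pick the caption with the highest cosine similarity.
best_caption = [row.index(max(row)) for row in scores]
print(best_caption)  # [0, 1]
```

Because each encoder never sees the other modality's tokens, embeddings can be precomputed and indexed, which is what makes this design attractive for retrieval and limiting for generation.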

22.7.2 Early Fusion Mechanics: The Chameleon Approach

Meta's Chameleon (Chameleon Team, 2024) is the cleanest expression of early fusion. Chameleon itself covers text and images; the audio step below shows how the same recipe extends to a further modality. The full pipeline:

Train a discrete image tokenizer (a VQ-GAN-style encoder) that maps a 512x512 image to 1024 tokens from an 8192-entry codebook.
For audio, train a discrete tokenizer in the same way (a neural codec).
Expand the LLM's vocabulary to include the codebook entries.
Train a single autoregressive transformer on interleaved sequences of text, image, and (where included) audio tokens, with a unified next-token-prediction objective.

The result is a model that does not distinguish between modalities at the architectural level. Every layer sees every token type and learns whatever cross-modal patterns the data demands. The same approach is reported to drive the input side of Llama-4-Omni and, by some accounts, the image generation head of GPT-4o, although OpenAI has not published that architecture in detail.
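The vocabulary-expansion and interleaving steps can be sketched in a few lines. This is a minimal illustration, not Chameleon's actual layout: the text vocabulary size of 32,000, the begin-of-image and end-of-image sentinel ids, and the function names are all assumptions made for the example; only the 8192-entry codebook size comes from the text above.

```python
TEXT_VOCAB_SIZE = 32_000      # assumed text vocabulary size (illustrative)
IMAGE_CODEBOOK_SIZE = 8_192   # codebook size quoted in the text
BOI = TEXT_VOCAB_SIZE + IMAGE_CODEBOOK_SIZE   # hypothetical begin-of-image id
EOI = BOI + 1                                 # hypothetical end-of-image id
UNIFIED_VOCAB_SIZE = EOI + 1

def image_code_to_token(code):
    """Shift a codebook index into the id range reserved for image tokens."""
    if not 0 <= code < IMAGE_CODEBOOK_SIZE:
        raise ValueError(f"code {code} is outside the codebook")
    return TEXT_VOCAB_SIZE + code

def interleave(segments):
    """Flatten ("text", ids) and ("image", codes) segments into one id sequence."""
    sequence = []
    for kind, values in segments:
        if kind == "text":
            sequence.extend(values)
        elif kind == "image":
            sequence.append(BOI)
            sequence.extend(image_code_to_token(c) for c in values)
            sequence.append(EOI)
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    return sequence

# A tiny mixed document: three text tokens, a three-code "image", one text token.
segments = [("text", [17, 204, 9]), ("image", [0, 8191, 42]), ("text", [311])]
tokens = interleave(segments)
print(tokens)
# [17, 204, 9, 40192, 32000, 40191, 32042, 40193, 311]
```

Once every modality lives in one id space, the training objective needs no modality-specific branches: the same softmax over `UNIFIED_VOCAB_SIZE` entries predicts the next token whether it is a word piece or an image code.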

Key Insight: Why early fusion needs discrete tokens

The autoregressive transformer's bread and butter is next-token prediction over a discrete vocabulary. To plug images and audio into this scheme, you need a tokenizer that produces a finite codebook. VQ-VAE, RVQ (residual vector quantization), and FSQ (finite scalar quantization) are the three dominant choices. The quality of these tokenizers caps the quality of any early-fusion model: a lossy image tokenizer puts a ceiling on image generation fidelity. Diffusion-based image generation sidesteps the codebook entirely, which is one reason GPT-4o's image output is often assumed to route through a non-tokenized diffusion path (OpenAI has not confirmed the design).
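A toy vector-quantization step makes the fidelity ceiling visible. The sketch below uses plain nearest-neighbour lookup against a fixed four-entry codebook; real VQ-VAE, RVQ, and FSQ tokenizers learn their codebooks and differ in detail, and the function names and numbers here are invented for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(features, codebook):
    """Map each continuous feature to the id of its nearest codebook entry."""
    ids = [
        min(range(len(codebook)), key=lambda k: squared_distance(f, codebook[k]))
        for f in features
    ]
    reconstructions = [codebook[i] for i in ids]
    return ids, reconstructions

# Toy 2-D codebook with four entries (assumed values).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
features = [[0.1, 0.2], [0.9, 0.8], [0.6, 0.1]]

ids, recon = quantize(features, codebook)
# Mean squared reconstruction error: information the token ids cannot carry.
error = sum(squared_distance(f, r) for f, r in zip(features, recon)) / len(features)
print(ids)  # [0, 3, 1]
```

The integer ids are what the transformer sees and predicts. Whatever detail is lost between `features` and `recon` is unavailable to the language model, no matter how well it is trained.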

22.7.3 Mid-Fusion Mechanics: The LLaVA Pattern

LLaVA (Liu et al., 2023) crystallized the mid-fusion recipe that dominates open-source multimodal LLMs in 2024-2026:

A frozen vision encoder (CLIP ViT-L/14) extracts patch embeddings from the input image, typically 576 tokens (a 24x24 patch grid at 336x336 resolution).
A small projection layer (a single linear layer in the original LLaVA, a 2-layer MLP from LLaVA-1.5 onward) maps those 576 tokens from the vision encoder's hidden dim into the LLM's embedding dim.
The projected tokens are prepended to the text token sequence and fed into a frozen or fine-tuned LLM (Vicuna, Mistral, Qwen-2).

The forward pass can be written compactly as

$$\mathbf{H} = \mathrm{LLM}\!\Big(\big[\,\underbrace{\phi(\mathrm{ViT}(\mathbf{I}))}_{\text{576 visual tokens}}\;;\;\underbrace{\mathrm{Embed}(\mathbf{T})}_{\text{text tokens}}\,\big]\Big),$$

where $\mathbf{I}$ is the input image, $\mathbf{T}$ is the text token sequence, $\mathrm{ViT}$ is the frozen vision encoder, and $\phi$ is the trainable 2-layer MLP projector that bridges the vision and language embedding spaces. Only $\phi$ is trained in LLaVA's first stage; the LLM is fine-tuned (often with LoRA) in the second stage.
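To get a feel for how small the trainable projector $\phi$ is, the following back-of-the-envelope count may help. It assumes a vision hidden size of 1024 (CLIP ViT-L/14) and an LLM hidden size of 4096 (typical of 7B-class models); both dimensions and the helper name `linear_params` are assumptions for illustration.

```python
def linear_params(in_dim, out_dim, bias=True):
    """Parameter count of one fully connected layer."""
    return in_dim * out_dim + (out_dim if bias else 0)

VISION_DIM = 1024   # assumed CLIP ViT-L/14 hidden size
LLM_DIM = 4096      # assumed hidden size of a 7B-class LLM

# A single linear projector versus a 2-layer MLP projector.
single_linear = linear_params(VISION_DIM, LLM_DIM)
two_layer_mlp = linear_params(VISION_DIM, LLM_DIM) + linear_params(LLM_DIM, LLM_DIM)

print(f"{single_linear:,}")   # 4,198,400
print(f"{two_layer_mlp:,}")   # 20,979,712
```

Under these assumptions the projector is roughly 0.3% the size of a 7B-parameter LLM, which is why the first training stage is cheap.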

[Figure: LLaVA mid-fusion, with visual tokens prepended to text in the LLM's embedding space.]

Figure 22.7.2: The LLaVA mid-fusion architecture. A frozen CLIP ViT extracts 576 visual tokens, a tiny 2-layer MLP projects them into the LLM's embedding space, and they are simply concatenated with the text token embeddings before the LLM's first transformer block. The LLM's existing self-attention handles cross-modal mixing for free; no specialized cross-attention modules are added.

Real-World Scenario

LLaVA Token Budget for a 4K-Context Model

A LLaVA-1.5-7B model uses CLIP ViT-L/14 at 336×336 resolution, which yields 24×24 = 576 visual tokens per image. Each visual token costs the same as a text token inside the LLM, so a single image consumes 576 of the LLM's 4096 context slots: ~14% of the context window. Adding LLaVA-1.5's "high-resolution" trick (sub-image crops at 4 positions plus the global view) multiplies this by 5x, giving 2880 visual tokens per image and leaving only 1216 slots for text. This is why models that need long text histories with multiple images shifted toward perceiver-style or Q-Former resamplers that compress the visual token count down to a fixed ~64 to 256 regardless of image resolution.
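The arithmetic in this scenario is easy to reproduce. The helper names below are hypothetical, and the sketch assumes square images whose side is a multiple of the patch size.

```python
def visual_tokens(resolution, patch_size, views=1):
    """Visual tokens for `views` square crops at the given resolution."""
    patches_per_side = resolution // patch_size
    return patches_per_side * patches_per_side * views

def text_budget(context_len, num_images, tokens_per_image):
    """Context slots left for text after the images are placed."""
    return context_len - num_images * tokens_per_image

CONTEXT = 4096

base = visual_tokens(336, 14)            # one global view
hires = visual_tokens(336, 14, views=5)  # four crops plus the global view

print(base, text_budget(CONTEXT, 1, base))    # 576 3520
print(hires, text_budget(CONTEXT, 1, hires))  # 2880 1216
print(f"{base / CONTEXT:.1%}")                # 14.1%
```

Calling `text_budget(CONTEXT, 2, hires)` returns a negative number, which is the quantitative version of the point above: two high-resolution images do not fit in a 4K context at all without a resampler.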

# Mid-fusion forward pass: vision encoder + projection + LLM.
# Sketch assuming Hugging Face-style modules: the vision encoder returns an
# object with `.last_hidden_state`, and the LLM accepts `inputs_embeds`.
import torch
import torch.nn as nn

class MidFusionVLM(nn.Module):
    def __init__(self, vision_encoder, llm, vision_dim=1024, hidden_dim=4096):
        super().__init__()
        # CLIP ViT-L, frozen: exclude its weights from gradient updates.
        self.vision = vision_encoder.requires_grad_(False)
        self.llm = llm                     # Mistral-7B, fine-tuned
        # The projector is the only newly initialized component.
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, image, text_token_ids):
        # 1. Encode the image. CLIP prepends a [CLS] token, so drop position 0
        #    to keep only the 576 patch tokens of size vision_dim.
        with torch.no_grad():
            visual = self.vision(image).last_hidden_state[:, 1:, :]
        # 2. Project into the LLM's embedding space
        visual_proj = self.projector(visual)
        # 3. Embed text tokens with the LLM's own embedding table
        text_emb = self.llm.get_input_embeddings()(text_token_ids)
        # 4. Concatenate along the sequence axis: [visual_tokens, text_tokens]
        combined = torch.cat([visual_proj, text_emb], dim=1)
        # 5. Run through the LLM transformer stack
        return self.llm(inputs_embeds=combined)

Code Fragment 22.7.1a: The mid-fusion pattern in about 30 lines. The projector holds the only new parameters; the vision encoder is frozen and the LLM is fine-tuned (or LoRA-adapted) on multimodal instruction data.

Variants of the basic recipe include:

Perceiver Resampler (Flamingo, Alayrac et al. 2022; Idefics): a fixed number of learned latent queries (typically 64) cross-attend to the raw vision features and emit a fixed-size summary. Flamingo then injects this summary into a frozen LLM via interleaved gated cross-attention blocks rather than direct token concatenation. The Resampler predates Q-Former and inspired it.
Q-Former (BLIP-2, Li et al. 2023): a Perceiver-style resampler with learnable query tokens that compresses the vision features to 32 tokens, but additionally trained with image-text matching, image-text contrastive, and image-grounded text generation losses before being attached to the LLM. The compressed tokens are then concatenated into the LLM input (not cross-attended like Flamingo).
Dynamic resolution (Qwen2-VL): the vision encoder produces a variable number of tokens depending on input resolution, letting the model handle high-res images without truncating detail.
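The defining property of the resampler variants above is that the output length depends on the number of learned queries, not on the number of vision features. The sketch below shows this with a single cross-attention step in plain Python. It is a deliberate simplification: it omits the query, key, and value projections, multiple heads, and stacked layers that real Perceiver Resamplers and Q-Formers use, and the sizes and names are assumptions chosen to keep the example small.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def resample(latent_queries, vision_features):
    """Each latent query emits a softmax-weighted average of all vision features."""
    dim = len(latent_queries[0])
    outputs = []
    for query in latent_queries:
        scores = [
            sum(q * f for q, f in zip(query, feat)) / math.sqrt(dim)
            for feat in vision_features
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * feat[d] for w, feat in zip(weights, vision_features))
            for d in range(dim)
        ])
    return outputs

rng = random.Random(0)
DIM, NUM_LATENTS = 8, 4   # toy sizes; the text quotes 64 (Flamingo) and 32 (BLIP-2)
latents = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_LATENTS)]

for num_patches in (576, 2880):
    features = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(num_patches)]
    summary = resample(latents, features)
    print(num_patches, "->", len(summary))   # always NUM_LATENTS outputs
```

In a trained model the latent queries are learned parameters rather than random vectors, so each one can specialize in a different aspect of the image.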

Under the Hood

Perceiver Resampler (Flamingo gated cross-attention)

The Perceiver Resampler maps a variable number of vision features to a fixed-size output: 64 learned latent queries cross-attend to those features and emit 64 tokens, so downstream cost is constant regardless of image resolution. Flamingo then inserts new gated cross-attention layers between the frozen LLM's existing blocks; each new layer is wrapped in a tanh gate initialized at zero, so at the start of training the model behaves exactly like the original text LLM and the visual pathway switches on gradually. This zero-init gating is what lets a powerful pretrained LLM absorb a new modality without catastrophic forgetting of its language ability.
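The zero-initialized tanh gate is simple enough to demonstrate on its own. In the sketch below, `cross_attn_out` stands in for whatever the new cross-attention layer produces; the class name and the example numbers are hypothetical, and a real implementation would make `alpha` a learnable tensor parameter.

```python
import math

class GatedResidual:
    """Residual connection scaled by tanh(alpha), with alpha starting at zero."""

    def __init__(self):
        self.alpha = 0.0   # learnable in a real model; zero at initialization

    def __call__(self, hidden, cross_attn_out):
        gate = math.tanh(self.alpha)
        return [h + gate * c for h, c in zip(hidden, cross_attn_out)]

hidden = [0.5, -1.0, 2.0]          # activations of the frozen text LLM
visual_update = [3.0, 3.0, 3.0]    # stand-in for the cross-attention output

layer = GatedResidual()
at_init = layer(hidden, visual_update)
print(at_init == hidden)           # True: the visual pathway is switched off

layer.alpha = 0.5                  # pretend training has moved the gate
after_training = layer(hidden, visual_update)
print(after_training != hidden)    # True: visual information now flows in
```

Because tanh(0) = 0, the new layers contribute nothing on the first step, so the model's initial outputs match those of the frozen LLM before the gate opens.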

22.7.4 Capability Trade-offs

Aspect	Early Fusion	Mid-Fusion	Late Fusion
Cross-modal depth	Highest	Medium	Lowest
Training data efficiency	Lowest (needs huge multimodal corpus)	Highest (leverages pretrained components)	High
Compute cost	Highest	Medium	Lowest
Modality swappability	Hard (retrain tokenizer)	Easy (swap projection layer)	Easy (swap encoder)
Generation across modalities	Native	Needs separate decoder	Needs separate decoder
Reasoning about visual details	Strong (when trained well)	Strong	Limited
Production examples (2026)	Chameleon, Llama-4-Omni, GPT-4o input	LLaVA, Qwen2-VL, Idefics, Pixtral	CLIP for retrieval

Figure 22.7.3: Fusion design space, late 2025. Mid-fusion dominates open-source because it leverages pretrained text LLMs; early fusion dominates frontier labs because it scales cleanly to many modalities; late fusion lives almost entirely in the retrieval world.

Note: Why open source converged on mid-fusion

Mid-fusion's killer feature is parameter efficiency. Training a LLaVA-style model on top of a pretrained CLIP encoder and a pretrained LLM needs only the projection layer (a few million to roughly 20 million parameters) plus optional LoRA on the LLM. A small team can produce a competitive multimodal LLM in days on academic hardware. Early-fusion models like Chameleon need full pretraining on billions of multimodal samples, which is out of reach for most labs. The asymmetry shows in the leaderboards: nearly every popular open-weight VLM in 2024-2026 is mid-fusion.

22.7.5 Fusion and Generation

Fusion choice affects not just understanding but generation. To generate an image, a mid-fusion model must either:

Emit special "image tokens" that a downstream diffusion model decodes (GILL, MGIE).
Output a CLIP-aligned latent that a frozen image generator consumes (BLIP-Diffusion).
Call out to a separate image-generation tool via function-calling (the early ChatGPT + DALL-E 3 architecture).

An early-fusion model can in principle emit image tokens directly through the same autoregressive head that emits text. Chameleon does this; GPT-4o's image generation appears to use a hybrid (early-fusion text+image tokens at the input, with a diffusion model handling the actual pixel synthesis on the output side). The trade-off is the tokenizer fidelity bottleneck mentioned earlier.

Warning: The image-tokenizer bottleneck

Early-fusion image generation looks elegant but caps out at the quality of the discrete tokenizer. VQ-GAN style tokenizers have residual artifacts (color shifts, mosaicked edges) that the autoregressive model cannot fix. The state of the art (2026) routes image generation through a diffusion decoder even in otherwise early-fusion architectures: the LLM emits a continuous latent, and a diffusion model paints the pixels. Stay aware of where the modality boundary actually lives in a given model.

22.7.6 Evaluation Pitfalls

Benchmarks for multimodal models often reward modality alignment more than cross-modal reasoning. MMMU and MMBench, for example, can be partially solved by models that just project image features into an LLM's input, with limited cross-modal reasoning. To stress-test fusion depth, look for benchmarks that require:

Visual grounding (RefCOCO, Visual7W): the model must point at a region described in text.
Cross-modal arithmetic (ChartQA, MathVista): the answer requires multi-step reasoning over numerical content in an image.
Compositional generation (T2I-CompBench, GenAI-Bench): the model must produce an image faithful to a complex text prompt.
Long-form interleaved (Mantis, MMDU): the model must handle dozens of images interleaved with text across thousands of tokens.

Real-World Scenario

Choosing a Fusion Architecture for an Internal VLM

A 2025 e-commerce team needed an internal VLM for product attribute extraction from photos. The two finalists were a mid-fusion LLaVA-1.6 fine-tune and a 3B-param early-fusion model trained in-house. The team chose the mid-fusion option because (a) projection-layer fine-tuning hit 92% accuracy on the team's eval in 3 days, (b) swapping the CLIP encoder for a domain-specific encoder later was a 1-day change rather than a full retrain, and (c) the team had no need for cross-modal generation. The early-fusion model would have offered marginal accuracy gains at 10x the training cost.

Key Insight

Where modalities fuse determines the model's cross-modal capability ceiling and its training cost. Early fusion gives the deepest cross-modal reasoning but needs frontier-scale training data and faces the discrete-tokenizer bottleneck on generation. Mid-fusion is the parameter-efficient default for open-source VLMs and works for nearly all understanding tasks. Late fusion dominates retrieval but is unfit for generation. Match the fusion design to the capability target and the data budget; do not chase early fusion unless you are training at frontier scale or need joint generation across modalities.

Self-Check

Q1: Why does early fusion require discrete tokenizers for images and audio? What goes wrong with continuous features?

Show Answer

An autoregressive transformer's training loss is next-token prediction over a finite vocabulary using softmax cross-entropy. Discrete image and audio tokenizers (VQ-VAE, RVQ, FSQ) give the model exactly that: a finite codebook of integer IDs that can sit alongside text BPE tokens in the same vocabulary. Continuous features have no natural distribution to predict with cross-entropy; you would need a different head (a Gaussian likelihood, a flow, or a diffusion stage) and the unified next-token-prediction abstraction breaks. The architectural simplicity of Chameleon-style early fusion depends on the discrete tokenizer; the cost, noted in Section 22.7.5, is that the tokenizer's reconstruction fidelity caps the model's image-generation quality, which is why production stacks increasingly route the output side through a diffusion decoder.

Q2: A BLIP-2-style Q-Former compresses hundreds of vision tokens (say, the 576 a LLaVA encoder produces) to 32. What does the model gain and lose?

Show Answer

The gain is context-length efficiency: the LLM only has to attend to 32 vision tokens instead of 576, which cuts attention compute by an order of magnitude per image, keeps the context window free for long text or many images, and shortens per-step latency. The loss is information density. The Q-Former's 32 learned queries cannot represent every spatial region distinctly, so fine-grained tasks (small text in documents, dense chart annotations, OCR-style detail) lose accuracy compared with LLaVA's keep-everything MLP projection. Section 22.3.4's key insight makes this explicit: "learned compression sometimes loses to keep more, attend smartly", which is why open-source production VLMs largely abandoned Q-Former for direct MLP projection.

Q3: You want a fine-tuned open-source VLM that can answer questions about technical schematics. Which fusion architecture do you start with, and why?

Show Answer

Start with mid-fusion, specifically a LLaVA-NeXT, Qwen2.5-VL, or Pixtral checkpoint at a dynamic-resolution variant. Technical schematics need fine-grained visual detail (small labels, dense line work, callouts) that benefits from preserving hundreds to thousands of vision tokens, which an MLP projection delivers. Mid-fusion also lets you swap the vision encoder for a domain-specific one (an encoder fine-tuned on engineering drawings, say) without retraining the LLM, and the practical example at Section 22.7.6 shows exactly this pattern paying off for an e-commerce attribute-extraction team. Early fusion would require training a custom discrete tokenizer on schematic data plus a frontier-scale multimodal pretrain, which is overkill when the goal is question answering rather than cross-modal generation.

Q4: Sketch why an early-fusion model is the natural choice for joint speech-to-speech translation that preserves the speaker's voice.

Show Answer

Speech-to-speech translation that keeps the original speaker's voice requires the model to carry speaker timbre, prosody, and accent from the input audio through to the output audio while replacing only the linguistic content. A cascaded pipeline, the usual alternative, transcribes audio to text, translates text to text, then re-synthesizes audio with a TTS model, which loses the speaker's voice at the ASR stage and has to reconstruct it from a side channel. An early-fusion model tokenizes input audio with a neural codec like SoundStream or DAC, processes those tokens alongside text and target-language tokens inside one transformer, and emits target-language audio tokens directly. The model is free to copy speaker identity across the autoregressive boundary because input and output codec tokens live in the same vocabulary and the same residual stream, which is the recipe AudioPaLM follows and the one Gemini's native audio mode is understood to follow.

What Comes Next

Section 22.8: Any-to-Any Generation covers the generalist models (NExT-GPT, AnyGPT, Unified-IO 2) that take the early-fusion idea to its conclusion: a single network that consumes and produces every modality.

Further Reading

Early Fusion

Chameleon Team. (2024). "Chameleon: Mixed-Modal Early-Fusion Foundation Models." Meta. arXiv:2405.09818

Mid-Fusion

Liu, H., Li, C., Wu, Q., & Lee, Y. J. (2023). "Visual Instruction Tuning" (LLaVA). NeurIPS. arXiv:2304.08485

Li, J., Li, D., Savarese, S., & Hoi, S. (2023). "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models" (Q-Former). ICML. arXiv:2301.12597

Wang, P., Bai, S., Tan, S., et al. (2024). "Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution." arXiv:2409.12191

Perceiver Resampler and Variants

Alayrac, J. B., Donahue, J., Luc, P., et al. (2022). "Flamingo: A Visual Language Model for Few-Shot Learning." NeurIPS. arXiv:2204.14198

2024-2026 Surveys

Liang, P. P., Zadeh, A., & Morency, L.-P. (2024). "Foundations and Recent Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions." arXiv:2209.03430