Linear Attention, Hybrids, Benchmarks & Neuromorphic

Section 75.3a

Linear attention promised O(n) cost and was a little vague about quality. Quadratic attention promised quality and was very specific about cost. Hybrids tried to negotiate. Mamba and the rest of the state-space crowd were not invited.

Frontier, Architecturally Curious AI Agent
Big Picture

This section continues Section 75.3, which covered the foundational non-transformer alternatives: the scaling problem with self-attention and state-space models (S4, Mamba, Mamba-2). Here we cover the rest of the design space: linear attention and recurrent alternatives (RWKV, RetNet, kernel methods), hybrid architectures, efficiency comparisons, decision criteria for when to reach beyond transformers, and the neuromorphic and event-driven approaches at the frontier. These architecture variants matter for the next generation of LLMs and long-horizon agents, since extending context beyond a million tokens or shrinking decoder cost is what unlocks workloads (codebase-scale reasoning, lifetime memory) that quadratic transformers cannot afford.

Prerequisites

This section continues from Section 75.3, which introduced the quadratic scaling problem of self-attention and the state-space models (S4, Mamba, Mamba-2) that respond to it. Familiarity with state-space models and with the benchmark suites used to compare new architectures (LongBench, RULER) is assumed. Chapter 3 (transformer attention) and Chapter 6 (scaling laws) provide helpful background.

Fun Fact: Mamba's Naming Problem

When Albert Gu and Tri Dao released Mamba in late 2023, the architecture itself was elegant but the name confusion was instant. Mamba is also a popular Python package manager. And a snake. And a basketball legend. For months, half the search-engine traffic for 'Mamba LLM' landed on conda-forge documentation. The lesson: naming a new architecture is harder than designing it, and the namespace of cool-sounding biology terms is fully booked. Future authors are encouraged to consider obscure mollusks.

75.3.3 Linear Attention and Recurrent Alternatives

The long-context attention landscape splits into four families that each attack the quadratic bottleneck from a different angle: sparse attention (Longformer, BigBird) limits which positions attend to which; linear attention (Performer, Linformer) approximates the attention matrix through kernel feature maps or a low-rank projection; hierarchical and block attention (Hierarchical Attention, Compressive Transformer) chunk the sequence and route information up a tree; and recurrent or SSM approaches (Mamba, RWKV, RetNet) abandon explicit attention for a fixed-size state. The rest of this section walks through the sparse, linear, and recurrent lineages in detail.

Sparse-attention approaches such as Longformer (Beltagy et al., 2020) and BigBird (Zaheer et al., 2020) approximate full attention by zeroing out most of the $n \times n$ attention matrix. Longformer combines a sliding window of width $w$ around each token with a small set $G$ of global tokens that attend to everything; BigBird adds a sparse set of random connections on top. The cost per layer drops from $O(n^2)$ to:

$$ \mathrm{cost}_{\text{sparse}}(n) = O\big(n \cdot (w + |G| + r)\big), $$

where $w$ is the window width, $|G|$ the number of global tokens, and $r$ the number of random connections per token. With $w = 512$, $|G| = 8$, and $r = 3$, a 16K-token document costs about $30 \times$ less than dense attention. BigBird's analysis also shows that expressiveness is preserved: a transformer restricted to this sparse pattern remains a universal approximator of sequence-to-sequence functions.

Figure 75.3.5: Three attention-mask layouts side by side. Filled cells indicate non-zero attention. Dense (left) is the standard transformer baseline. Longformer (middle) keeps a diagonal sliding window plus a small set of global tokens. BigBird (right) adds a few random off-diagonal entries to preserve theoretical universality.
# sparse_attention_mask.py: build a Longformer-style mask in pure PyTorch.
import torch

def longformer_mask(n, window=128, global_idx=(0,)):
    """Boolean (n, n) mask: True where attention is allowed."""
    rows = torch.arange(n).unsqueeze(1)
    cols = torch.arange(n).unsqueeze(0)
    band = (cols - rows).abs() <= window // 2          # sliding window
    g    = torch.zeros(n, dtype=torch.bool)
    g[list(global_idx)] = True
    mask = band | g.unsqueeze(0) | g.unsqueeze(1)       # union with globals
    return mask

n, window = 8192, 512
mask = longformer_mask(n, window=window, global_idx=(0,))
density = mask.float().mean().item()
print(f"non-zero fraction: {density:.4f}  (dense baseline = 1.0000)")
# HuggingFace's LongformerModel expresses the same pattern through its
# attention_window config value and the global_attention_mask input (1 = global).
Output: non-zero fraction: 0.0619  (dense baseline = 1.0000)

Code Fragment 75.3.4a: A minimal Longformer attention mask. HuggingFace's LongformerModel encodes the same pattern through attention_window and global_attention_mask rather than materializing an n-by-n boolean tensor.

Numeric Example: Longformer Cost at 16K Tokens

For $n = 16{,}384$, dense self-attention requires $n^2 \approx 2.68 \times 10^8$ query-key dot products per head per layer. A Longformer configuration with window $w = 512$ and $|G| = 8$ global tokens requires only $n \cdot (w + |G|) = 16384 \cdot 520 \approx 8.5 \times 10^6$ dot products, a roughly $31\times$ reduction. The non-zero fraction printed by the code above for $n = 8192, w = 512$ (with a single global token) is about $0.062$, slightly below the analytical $(w + 1) / n \approx 0.0626$ because the window is truncated at both ends of the sequence. BigBird's additional $r = 3$ random connections per token raise the constant from $520$ to $523$, a negligible compute increase that buys provable universality of the resulting sparse attention pattern.
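The arithmetic in this example can be reproduced in a few lines of Python. This is a minimal sketch: the helper names `dense_cost` and `sparse_cost` are illustrative, and the count ignores edge truncation and the extra work done by the global tokens' own rows.

```python
# Illustrative helpers (not from any library): count query-key dot products
# per head per layer under the cost model n * (w + |G| + r).

def dense_cost(n: int) -> int:
    """Every query is scored against every key."""
    return n * n

def sparse_cost(n: int, window: int, n_global: int = 0, n_random: int = 0) -> int:
    """Each query is scored against at most window + n_global + n_random keys."""
    return n * (window + n_global + n_random)

n = 16_384
longformer = sparse_cost(n, window=512, n_global=8)
bigbird = sparse_cost(n, window=512, n_global=8, n_random=3)

print(f"dense:      {dense_cost(n):,}")
print(f"Longformer: {longformer:,}  ({dense_cost(n) / longformer:.1f}x fewer)")
print(f"BigBird:    {bigbird:,}  ({dense_cost(n) / bigbird:.1f}x fewer)")
```

Because the sparse cost is linear in $n$, doubling the document length doubles the sparse cost but quadruples the dense cost, so the reduction factor itself grows with sequence length.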

Under the Hood: Linear attention

Standard attention computes softmax over every query-key pair, forcing a quadratic n-by-n matrix. Linear attention drops the softmax and replaces each score with a kernel feature map applied separately to queries and keys, $\text{sim}(q,k)=\phi(q)^\top \phi(k)$. Because the map is applied independently, the computation can be reassociated: instead of building the n-by-n matrix, the model first accumulates the outer products of feature-mapped keys with their values into a fixed-size state matrix, then multiplies each query into that state. Cost and memory then grow linearly with sequence length, and for causal models the state is a running sum updated token by token, exactly like an RNN. The price is approximation error, since the feature map only mimics softmax, which is why pure linear-attention models trail on precise retrieval.
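The reassociation described above can be checked numerically. The sketch below uses NumPy and assumes the `elu(x) + 1` feature map from Katharopoulos et al. (2020); the function names are illustrative. It computes causal linear attention twice, once by building the full n-by-n score matrix and once with a fixed-size running state, and the two results agree.

```python
import numpy as np

def feature_map(x):
    """elu(x) + 1: a positive feature map (an assumed choice of phi)."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attention_quadratic(q, k, v):
    """Reference form: materializes the causal (n, n) score matrix."""
    fq, fk = feature_map(q), feature_map(k)
    scores = np.tril(fq @ fk.T)                       # zero out future positions
    return (scores @ v) / scores.sum(axis=1, keepdims=True)

def linear_attention_recurrent(q, k, v):
    """Same output from a fixed-size state updated one token at a time."""
    n, d_k = q.shape
    d_v = v.shape[1]
    S = np.zeros((d_k, d_v))    # running sum of outer(phi(k_t), v_t)
    z = np.zeros(d_k)           # running sum of phi(k_t), used to normalize
    out = np.empty((n, d_v))
    for t in range(n):
        fq, fk = feature_map(q[t]), feature_map(k[t])
        S += np.outer(fk, v[t])
        z += fk
        out[t] = (fq @ S) / (fq @ z)
    return out

rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(6, 4)), rng.normal(size=(6, 4)), rng.normal(size=(6, 3))
print(np.allclose(linear_attention_quadratic(q, k, v),
                  linear_attention_recurrent(q, k, v)))
```

The state `S` has shape `(d_k, d_v)` regardless of how many tokens have been processed, which is where the constant-memory inference property comes from.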

75.3.3.1 RWKV: Reinventing RNNs for the Transformer Era

RWKV (Peng et al., 2023) takes a different approach: rather than inventing a new mechanism, it reformulates the transformer architecture to eliminate the quadratic attention computation while retaining the parallel training properties that made transformers successful. The name reflects its four core operations: Receptance (R), Weight (W), Key (K), and Value (V).

The key innovation is the WKV (Weighted Key-Value) mechanism, which replaces softmax attention with an exponentially decaying sum. Instead of computing attention scores between all pairs of tokens, RWKV maintains a running numerator and denominator that can be updated incrementally:

# RWKV WKV mechanism (simplified, single-head)
import torch

def rwkv_wkv(w, u, k, v):
    """
    RWKV attention replacement.
    w: per-channel log-decay (d_model,) - learned; expected <= 0 so exp(w) is in (0, 1]
    u: bonus for current token (d_model,) - learned
    k: keys (batch, seq_len, d_model)
    v: values (batch, seq_len, d_model)
    returns: output (batch, seq_len, d_model)
    """
    batch, seq_len, d = k.shape
    outputs = []
    # Running state: exponentially weighted sums, stored relative to state_max
    state_num = torch.zeros(batch, d, device=k.device, dtype=k.dtype)
    state_den = torch.zeros(batch, d, device=k.device, dtype=k.dtype)
    state_max = torch.full((batch, d), -float('inf'), device=k.device, dtype=k.dtype)
    for t in range(seq_len):
        kt = k[:, t]  # (batch, d)
        vt = v[:, t]  # (batch, d)
        # Rescale by the largest key seen so far to keep exponentials bounded
        new_max = torch.maximum(state_max, kt)
        # Combine historical state with current token
        exp_prev = torch.exp(state_max - new_max)
        exp_curr = torch.exp(kt - new_max)
        exp_bonus = torch.exp(u + kt - new_max)
        # Output: weighted combination of history and the bonus-weighted current token
        wkv = (exp_prev * state_num + exp_bonus * vt) / (
            exp_prev * state_den + exp_bonus
        )
        outputs.append(wkv)
        # Update running state with decay
        state_num = torch.exp(w) * exp_prev * state_num + exp_curr * vt
        state_den = torch.exp(w) * exp_prev * state_den + exp_curr
        state_max = new_max
    # Return only after every position has been processed
    return torch.stack(outputs, dim=1)
Code Fragment 75.3.4: Simplified RWKV WKV attention replacement (single-head). The exponential decay w controls how quickly the model forgets older tokens, functioning as a learned "memory horizon" per channel.

RWKV has reached competitive quality at scale. RWKV-6 models at 1.6B, 3B, 7B, and 14B parameters show performance comparable to similarly sized transformers on standard benchmarks, while offering constant-memory inference. The RWKV community has trained models in multiple languages, and the architecture is fully open-source with an active ecosystem.

75.3.3.2 RetNet: Retentive Networks

RetNet (Sun et al., 2023) from Microsoft Research proposes a "retention" mechanism that supports three computation modes: parallel (for training efficiency), recurrent (for $O(1)$ per-token inference with a fixed-size state), and chunkwise (a hybrid for long-sequence processing). The retention mechanism uses complex-valued exponential decay rather than softmax normalization.

In the parallel mode, retention can be expressed as a matrix operation similar to attention, enabling efficient GPU utilization during training. In the recurrent mode, it becomes an RNN-like update with fixed-size state, enabling constant memory during inference. The chunkwise mode divides the sequence into fixed-size chunks, processes each chunk in parallel mode, and propagates state between chunks in recurrent mode. This triple formulation gives RetNet flexibility to optimize for the deployment scenario at hand.
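The equivalence of the three modes can be verified on a toy example. The NumPy sketch below is a simplification under stated assumptions: it uses a single head with a real scalar decay `gamma`, and it omits the complex-valued rotation, scaling, and normalization of the full RetNet layer. Function names are illustrative.

```python
import numpy as np

def retention_parallel(q, k, v, gamma):
    """Training-style form: (Q K^T * D) V with a causal decay mask D."""
    idx = np.arange(q.shape[0])
    diff = idx[:, None] - idx[None, :]
    D = np.where(diff >= 0, gamma ** np.maximum(diff, 0), 0.0)
    return ((q @ k.T) * D) @ v

def retention_recurrent(q, k, v, gamma):
    """Inference-style form: one fixed-size state, updated per token."""
    S = np.zeros((q.shape[1], v.shape[1]))
    out = np.empty((q.shape[0], v.shape[1]))
    for t in range(q.shape[0]):
        S = gamma * S + np.outer(k[t], v[t])
        out[t] = q[t] @ S
    return out

def retention_chunkwise(q, k, v, gamma, chunk):
    """Parallel inside each chunk, recurrent state carried between chunks."""
    S = np.zeros((q.shape[1], v.shape[1]))
    out = np.empty((q.shape[0], v.shape[1]))
    for start in range(0, q.shape[0], chunk):
        qc, kc, vc = q[start:start + chunk], k[start:start + chunk], v[start:start + chunk]
        m = qc.shape[0]
        pos = np.arange(m)
        inner = retention_parallel(qc, kc, vc, gamma)        # within-chunk terms
        cross = (qc @ S) * (gamma ** (pos + 1))[:, None]     # contribution of earlier chunks
        out[start:start + m] = inner + cross
        # Decay the carried state across the chunk, then add this chunk's keys/values
        S = gamma ** m * S + (kc * (gamma ** (m - 1 - pos))[:, None]).T @ vc
    return out

rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(10, 4)), rng.normal(size=(10, 4)), rng.normal(size=(10, 3))
ref = retention_parallel(q, k, v, gamma=0.9)
print(np.allclose(ref, retention_recurrent(q, k, v, gamma=0.9)))
print(np.allclose(ref, retention_chunkwise(q, k, v, gamma=0.9, chunk=4)))
```

The chunk size is then a pure deployment knob: larger chunks use more parallel hardware per step, smaller chunks use less memory, and the output is unchanged.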

75.3.3.3 Griffin and RecurrentGemma

Google DeepMind's Griffin architecture (De et al., 2024) combines linear recurrences with local attention in a hybrid design. Griffin uses a Real-Gated Linear Recurrent Unit (RG-LRU) layer whose recurrence is diagonal, so each channel carries its own gated scalar decay, interleaved with local sliding-window attention layers that handle short-range dependencies. The RecurrentGemma model series implements this architecture at the 2B and 9B parameter scales.

The practical significance of Griffin is that it demonstrates a design pattern: use efficient recurrence for the "backbone" of sequence processing, and add sparse attention layers only where they provide clear benefit (local context, retrieval-like operations). This hybrid approach often outperforms pure SSM or pure attention models of the same size.
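To make the recurrent "backbone" concrete, here is a minimal NumPy sketch of a gated diagonal linear recurrence in the spirit of the RG-LRU described by De et al. (2024). It is a simplification under stated assumptions: a single layer with no surrounding projections or convolution, illustrative parameter names, and the constant `c = 8` taken as the paper's setting.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rg_lru(x, W_a, b_a, W_x, b_x, lam, c=8.0):
    """x: (seq_len, d). Returns the hidden state at every step, shape (seq_len, d)."""
    a = sigmoid(lam)                      # per-channel base decay in (0, 1)
    h = np.zeros(x.shape[1])              # fixed-size state: one value per channel
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        r = sigmoid(x[t] @ W_a + b_a)     # recurrence gate
        i = sigmoid(x[t] @ W_x + b_x)     # input gate
        a_t = a ** (c * r)                # input-dependent diagonal decay
        h = a_t * h + np.sqrt(1.0 - a_t ** 2) * (i * x[t])
        out[t] = h
    return out

rng = np.random.default_rng(0)
d = 4
x = rng.normal(size=(6, d))
W = rng.normal(size=(d, d)) * 0.1
out = rg_lru(x, W, np.zeros(d), W, np.zeros(d), lam=np.full(d, 2.0))
print(out.shape)
```

Because the decay is elementwise rather than a full matrix, each step costs $O(d)$ for the state update, and the state never grows with sequence length.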

Key Insight

The attention versus efficiency tradeoff is not all-or-nothing. The research trajectory is moving away from "replace attention entirely" toward "use attention surgically." Pure SSM models sacrifice recall precision on tasks that require exact matching or retrieval from earlier in the context. Pure attention models pay quadratic cost for every token, even when most tokens do not need to attend to most other tokens. The emerging consensus is that hybrid architectures (attention for precision-critical layers, linear recurrence for everything else) may dominate both pure approaches. For practitioners, this means that the inference optimization techniques from Chapter 9 (KV cache management, continuous batching) will remain relevant even as architectures evolve, because attention layers will likely persist in some form.

75.3.4 Hybrid Architectures: Combining Strengths

If the lesson is "use attention surgically," the field needs concrete architectural recipes that prove it works at scale. Jamba is the most prominent demonstration: it interleaves Mamba blocks, attention blocks, and MoE feed-forward layers into a single 52B-parameter model that runs in 128K context on a single GPU. The hybrid design choices it makes are now templates that other labs are imitating.

75.3.4.1 Jamba: Mamba Meets Transformers

AI21 Labs' Jamba model (Lieber et al., 2024) is the most prominent hybrid architecture, interleaving Mamba layers with transformer attention layers and Mixture-of-Experts (MoE) modules. The released configuration uses one attention layer for every seven Mamba layers (the paper also reports experiments with a 1:3 ratio), with MoE replacing the feed-forward component in every other layer. This design achieves three goals simultaneously: the long-context handling of Mamba, the precise retrieval capability of attention, and the parameter efficiency of MoE.

# Jamba-style hybrid architecture (conceptual)
# MultiHeadAttention, SelectiveSSM, MoELayer, FeedForward, and RMSNorm are
# placeholder modules assumed to be defined elsewhere.
from torch import nn

class JambaBlock(nn.Module):
    """
    Hybrid block: alternates between Mamba and Attention layers.
    Every 4th layer uses attention by default; the rest use Mamba.
    MoE replaces standard FFN in selected layers.
    """
    def __init__(
        self,
        d_model: int,
        layer_idx: int,
        n_heads: int = 16,
        mamba_state_dim: int = 16,
        num_experts: int = 16,
        active_experts: int = 2,
        attention_every_n: int = 4,
        moe_every_n: int = 2,
    ):
        super().__init__()
        self.layer_idx = layer_idx
        self.use_attention = (layer_idx % attention_every_n == 0)
        self.use_moe = (layer_idx % moe_every_n == 0)
        # Sequence mixing: either Mamba or Attention
        if self.use_attention:
            self.seq_mixer = MultiHeadAttention(d_model, n_heads)
        else:
            self.seq_mixer = SelectiveSSM(d_model, mamba_state_dim)
        # Channel mixing: either MoE or standard FFN, chosen independently of the mixer
        if self.use_moe:
            self.channel_mixer = MoELayer(d_model, num_experts, active_experts)
        else:
            self.channel_mixer = FeedForward(d_model)
        # Every block needs both norms, whichever mixers it uses
        self.norm1 = RMSNorm(d_model)
        self.norm2 = RMSNorm(d_model)

    def forward(self, x, attention_mask=None):
        # Pre-norm residual connections; only the attention path takes a mask
        normed = self.norm1(x)
        if self.use_attention:
            mixed = self.seq_mixer(normed, mask=attention_mask)
        else:
            mixed = self.seq_mixer(normed)
        h = x + mixed
        out = h + self.channel_mixer(self.norm2(h))
        return out
Code Fragment 75.3.5a: Conceptual Jamba-style hybrid block. The architectural ratio (how frequently attention layers appear) is a key design decision that trades recall precision for throughput.
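The effect of the two interleaving parameters is easiest to see by listing the resulting layer stack. The helper below is a hypothetical sketch in plain Python that applies the same modulo rule as the conceptual block (`layer_idx % n == 0`); it does not claim to reproduce the exact layer ordering of any released model.

```python
from collections import Counter

def layer_schedule(n_layers, attention_every_n, moe_every_n):
    """Return a (sequence mixer, channel mixer) pair for each layer index."""
    schedule = []
    for idx in range(n_layers):
        mixer = "attention" if idx % attention_every_n == 0 else "mamba"
        ffn = "moe" if idx % moe_every_n == 0 else "mlp"
        schedule.append((mixer, ffn))
    return schedule

for every_n in (4, 8):
    sched = layer_schedule(8, attention_every_n=every_n, moe_every_n=2)
    counts = Counter(mixer for mixer, _ in sched)
    print(f"attention_every_n={every_n}: "
          f"{counts['mamba']} Mamba layers, {counts['attention']} attention layers")
```

With eight layers, `attention_every_n=4` yields three Mamba layers per attention layer and `attention_every_n=8` yields seven. Only the attention layers keep a KV cache, so this ratio directly sets how much cache memory the stack needs at long context.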

Jamba's 256K-token context window with only 12B active parameters (52B total with MoE) demonstrates the efficiency gains possible with hybrid designs. On the NVIDIA A100, AI21 reported roughly 3x the throughput of a comparable pure-attention model at 128K context length because the Mamba layers eliminate most of the KV cache memory pressure. Independent benchmarks have varied, with throughput ratios from 1.8x to 3.5x depending on batch size, prompt length distribution, and serving framework, so the 3x figure should be read as a rough indication of the advantage rather than a guaranteed multiplier in your stack.

Mixture of Experts Gating.

In an MoE layer with $E$ experts, a gating network routes each token $x$ to the top-$k$ experts:

$$\begin{aligned} g(x) &= \operatorname{softmax}(W_{g} \cdot x) \\ \operatorname{TopK}(x) &= \operatorname{argtopk}\big(g(x)\big) \\ y &= \sum_{i \in \operatorname{TopK}(x)} g_{i}(x) \cdot \operatorname{Expert}_{i}(x) \end{aligned}$$

Only the top-$k$ experts (typically $k = 2$) are activated per token, so the compute cost scales with $k$ rather than $E$. To prevent expert collapse (all tokens routed to the same expert), a load balancing auxiliary loss is added:

$$\mathcal{L}_{\text{balance}} = \alpha \cdot E \cdot \sum_{i=1}^{E} f_{i} \cdot P_{i}$$

where $f_i$ is the fraction of tokens dispatched to expert $i$, $P_i$ is the mean gate probability for expert $i$ across the batch, and $\alpha$ is a small coefficient (typically 0.01). Minimizing this term encourages a uniform distribution of tokens across experts.
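The gating and balancing equations translate into code fairly directly. The following is a minimal NumPy sketch under stated assumptions: experts are passed in as plain callables, the gate values are not renormalized over the selected experts, and $f_i$ is computed as a fraction of routing slots (tokens times $k$) so that it sums to 1. The name `moe_forward` is illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def moe_forward(x, W_g, experts, k=2, alpha=0.01):
    """x: (tokens, d); W_g: (d, E); experts: list of E callables on (m, d) arrays."""
    E = W_g.shape[1]
    gates = softmax(x @ W_g)                       # g(x), shape (tokens, E)
    topk = np.argsort(-gates, axis=-1)[:, :k]      # indices of the k largest gates
    y = np.zeros_like(x)
    for i in range(E):
        rows = np.where((topk == i).any(axis=-1))[0]   # tokens routed to expert i
        if rows.size:                                  # unselected experts do no work
            y[rows] += gates[rows, i:i + 1] * experts[i](x[rows])
    f = np.bincount(topk.ravel(), minlength=E) / topk.size   # dispatch fractions
    P = gates.mean(axis=0)                                   # mean gate probability
    balance_loss = alpha * E * np.sum(f * P)
    return y, balance_loss

rng = np.random.default_rng(0)
d, E = 8, 4
x = rng.normal(size=(5, d))
W_g = rng.normal(size=(d, E))
weights = [rng.normal(size=(d, d)) for _ in range(E)]
experts = [lambda h, W=W: h @ W for W in weights]
y, loss = moe_forward(x, W_g, experts, k=2)
print(y.shape, float(loss))
```

When both $f$ and $P$ are uniform at $1/E$, the loss evaluates to exactly $\alpha$, which is its minimum; concentration of tokens and gate mass on the same experts pushes it higher.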

Real-World Scenario
When would you choose a hybrid architecture in production?

Who: An NLP engineer at a contract analytics startup building a system to extract and cross-reference clauses from 100-page legal contracts.

Situation: The system needed to answer queries like "what does Section 4.2(b) say about the indemnification clause mentioned in Section 15.1?" across documents averaging 128K tokens.

Problem: A pure transformer required 40 GB of GPU memory for the KV cache at 128K tokens, making it impractical on the team's L4 GPUs. A pure Mamba model handled the length efficiently but scored 15% lower on cross-reference retrieval tasks that required precise attention over distant sections.

Decision: The team evaluated a Jamba-style hybrid architecture that used Mamba layers for efficient sequential processing of the bulk document and interspersed attention layers (every 4th layer) for precise cross-reference retrieval.

Result: The hybrid model fit on a single L4 GPU with 24 GB of memory, matched the pure transformer's retrieval accuracy within 2%, and processed contracts 3x faster. Per-document analysis cost dropped from $0.18 to $0.05.

Lesson: Hybrid SSM-attention architectures unlock practical long-context processing by using efficient SSM layers for the bulk of the sequence and reserving attention layers for the sub-tasks that genuinely require precise long-range retrieval, such as resolving cross-references between distant sections.

75.3.4.2 Design Principles for Hybrid Architectures

Empirical studies from multiple research groups have converged on three design principles for hybrid SSM-attention architectures, each answering a different "where do I put the attention?" question:

75.3.5 Efficiency Comparisons and Benchmarks

Comparing architectures requires examining multiple dimensions: quality (benchmark scores, perplexity), throughput (tokens per second during training and inference), memory consumption, and latency characteristics. The following table summarizes approximate comparisons at the 7B parameter scale with 128K context length:

| Architecture | Attention Complexity | Inference Memory (128K) | Throughput (relative to transformer) | In-Context Retrieval |
|---|---|---|---|---|
| Standard Transformer | $O(n^{2})$ | ~40 GB KV cache | Baseline (1x) | Excellent |
| Mamba-2 | $O(n)$ | ~200 MB state | ~5x at 128K | Good (degrades at extreme range) |
| RWKV-6 | $O(n)$ | ~150 MB state | ~4x at 128K | Good |
| Jamba (Hybrid) | $O(n)$ amortized | ~8 GB (reduced KV) | ~3x at 128K | Very good |
| Griffin | $O(n)$ with local attn | ~2 GB | ~3.5x at 128K | Good |

Table 75.3.1b: Approximate efficiency comparison of architectures at 7B parameters, 128K context. Throughput multiples are relative to standard transformer. Actual numbers vary by implementation and hardware.

Library Shortcut: einops in Practice

Use einops for readable tensor reshaping in attention and SSM layers.

# pip install einops
from einops import rearrange, repeat
import torch
# Reshape for multi-head attention: (batch, seq, d) to (batch, heads, seq, head_dim)
x = torch.randn(2, 128, 512)
heads = rearrange(x, "b s (h d) -> b h s d", h=8)
print(f"Multi-head shape: {heads.shape}") # (2, 8, 128, 64)
# Repeat a state vector across batch dimension
state = torch.randn(1, 64)
batched = repeat(state, "1 d -> b d", b=32)
print(f"Batched state: {batched.shape}") # (32, 64)
Code Fragment 75.3.6: Reshaping and broadcasting tensors with einops.
Library Shortcut: JAX, Flax, and Optax in Practice

Define a minimal sequence model layer and set up its optimizer using the JAX ecosystem.

# pip install jax flax optax
import jax
import jax.numpy as jnp
import flax.linen as nn
import optax
class SSMBlock(nn.Module):
    state_dim: int = 64
    @nn.compact
    def __call__(self, x):
        d = x.shape[-1]
        A = self.param("A", nn.initializers.normal(0.01), (self.state_dim, self.state_dim))
        B = self.param("B", nn.initializers.normal(0.01), (self.state_dim, d))
        C = self.param("C", nn.initializers.normal(0.01), (d, self.state_dim))
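        # Simplified: projects through the state dimension and adds a skip
        # connection; there is no recurrence over the sequence axis here.
        return x + (x @ B.T @ A.T @ C.T)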
model = SSMBlock()
key = jax.random.PRNGKey(0)
params = model.init(key, jnp.ones((1, 128, 256)))
optimizer = optax.adamw(learning_rate=1e-3)
opt_state = optimizer.init(params)
print(f"Parameter count: {sum(p.size for p in jax.tree.leaves(params))}")
Code Fragment 75.3.7: A minimal SSM-style Flax module initialized with an Optax AdamW optimizer.
Key Insight

The "Needle in a Haystack" test reveals the retrieval gap. In this test, a specific fact is inserted at a random position in a long document, and the model is asked to retrieve it after reading the whole document. Transformers with full attention typically achieve near-perfect accuracy across positions within their trained context length. Pure SSM models show degradation for facts placed in the middle of very long sequences (the "lost in the middle" effect is amplified by compressed state). Hybrid models split the difference, achieving strong retrieval when the fact falls within an attention window and moderate retrieval otherwise. This test remains the clearest diagnostic for evaluating alternative architectures.
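A needle-in-a-haystack sweep needs very little scaffolding. The sketch below is a toy harness in plain Python: `build_haystack`, `run_sweep`, and the filler text are all hypothetical, and `last_window_model` is a stand-in that only "remembers" the final 200 characters of its prompt, used here purely to show the kind of depth-dependent failure the test exposes. A real evaluation would pass a function that calls an actual model.

```python
FILLER = "The sky was grey that morning."
NEEDLE = "The access code is 7421."

def build_haystack(filler, needle, n_sentences, depth):
    """Insert the needle at a relative depth: 0.0 = start, 1.0 = end."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be in [0, 1]")
    sentences = [filler] * n_sentences
    position = round(depth * n_sentences)
    sentences.insert(position, needle)
    return " ".join(sentences), position

def run_sweep(ask_model, depths, n_sentences, question, expected):
    """Return {depth: True/False} for whether the answer contains the expected string."""
    results = {}
    for depth in depths:
        context, _ = build_haystack(FILLER, NEEDLE, n_sentences, depth)
        answer = ask_model(f"{context}\n\nQuestion: {question}")
        results[depth] = expected.lower() in answer.lower()
    return results

def last_window_model(prompt, window=200):
    """Toy stand-in: echoes only the most recent characters it has seen."""
    return prompt[-window:]

results = run_sweep(last_window_model, depths=[0.0, 0.5, 1.0], n_sentences=50,
                    question="What is the access code?", expected="7421")
print(results)
```

Plotting such results over a grid of depths and context lengths gives the familiar heatmap used to compare architectures on long-range recall.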

75.3.6 When to Consider Non-Transformer Architectures

For most production applications in 2025-2026, the transformer remains the default choice. The ecosystem of tools, pretrained models, fine-tuning frameworks (covered in Chapter 17), and serving infrastructure is built around transformers. Choosing an alternative architecture introduces friction at every stage of the pipeline. That said, three scenarios justify the switch, each driven by a different hardware or workload constraint.

75.3.6.1 Extremely Long Contexts with Constrained Hardware

If your application requires processing 100K+ tokens and you cannot afford the GPU memory for a transformer's KV cache, Mamba or RWKV models provide a practical path forward. The memory savings can be the difference between requiring one GPU and requiring four.
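A quick calculation shows why the KV cache dominates at long context. The sketch below assumes a hypothetical 7B-class configuration (32 layers, head dimension 128, 16-bit cache) and compares full multi-head attention with a grouped-query variant that keeps 8 KV heads; real models differ in all of these values, so treat the output as illustrative only.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2, batch=1):
    """Keys and values (the factor 2) for every layer, KV head, and cached position."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem * batch

GIB = 2 ** 30

for seq_len in (8_192, 32_768, 131_072):
    mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=seq_len)
    gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=seq_len)
    print(f"{seq_len:>7} tokens: MHA {mha / GIB:5.1f} GiB, GQA (8 KV heads) {gqa / GIB:5.1f} GiB")
```

The cache grows linearly with `seq_len`, so under these assumptions a 128K-token context needs 64 GiB with full multi-head attention and 16 GiB with 8 KV heads, before counting model weights. A recurrent state has no `seq_len` term at all, which is the source of the savings discussed here.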

75.3.6.2 High-Throughput Streaming Applications

For applications that process continuous streams of text (real-time transcription analysis, social media monitoring, log analysis), the constant-memory inference of SSMs is a natural fit. Each new token costs the same regardless of how many tokens have been processed, unlike transformers where the per-token cost grows with the KV cache.

75.3.6.3 Edge and Mobile Deployment

When deploying models on devices with limited memory and no access to cloud GPUs, SSM architectures can offer a favorable quality-per-byte ratio for long-context tasks. The small state footprint (hundreds of megabytes vs. tens of gigabytes) makes on-device long-context processing feasible.

Real-World Scenario
Decision matrix for architecture selection

Who: A CTO at a healthcare startup building a clinical decision support system that summarized patient records and flagged medication interactions.

Situation: Patient records averaged 50K tokens each. The system needed to both summarize the full record and perform safety-critical medication interaction checks that required precise retrieval of drug names, dosages, and contraindications scattered across the record.

Problem: The team evaluated three architecture options. Option A: a transformer with FlashAttention and 4-bit quantization (covered in Section 9.4) required a single A100 GPU ($2.50/hour). Option B: a Mamba-2 model handled the same records on an L4 GPU at one-quarter the cost but missed 8% of medication interactions in testing. Option C: a Jamba-style hybrid preserved attention-level recall for safety checks while using Mamba layers for the bulk of the record.

Decision: The startup chose Option C. The medication interaction task was safety-critical and required attention-level recall precision, while the overall record summarization could leverage the efficiency of SSM layers. They deployed on a single A10G GPU ($1.00/hour), splitting the cost difference.

Result: The hybrid architecture caught 99.2% of medication interactions (matching the pure transformer) while processing records 2.4x faster. Monthly infrastructure costs were 60% lower than the transformer-only option.

Lesson: Architecture selection should be driven by the most demanding sub-task. When one component requires high-precision retrieval and another needs efficient long-context processing, a hybrid architecture lets you optimize each independently rather than paying the cost of the most expensive requirement everywhere.

When NOT to switch. If your application works with contexts under 8K tokens, if you need the vast ecosystem of fine-tuned transformer models on Hugging Face, if your team lacks experience with newer architectures, or if your task is primarily about precise in-context retrieval, the transformer remains the better choice. The efficiency advantages of alternative architectures only materialize at longer sequence lengths, and the model availability gap is significant.

75.3.7 Neuromorphic and Event-Driven Approaches

A more speculative line of research explores architectures inspired by biological neural computation. Spiking neural networks (SNNs) process information as discrete spikes rather than continuous values, offering potential energy efficiency gains on specialized neuromorphic hardware like Intel's Loihi 2 and IBM's NorthPole.

SpikeGPT and similar projects have demonstrated that language modeling is possible with spiking architectures, though quality lags behind conventional networks at comparable scale. The primary advantage is energy consumption: neuromorphic chips are reported to process inference workloads at 10-100x lower energy per operation than GPUs. If this efficiency gap translates to language models at scale, the implications for sustainability and edge deployment would be transformative.
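The basic unit of a spiking network is simple enough to simulate in a few lines. The sketch below implements a discrete-time leaky integrate-and-fire neuron in plain Python; the threshold, leak, and reset values are arbitrary illustrative choices, not parameters of any particular chip or model.

```python
def lif_neuron(inputs, threshold=1.0, leak=0.9, reset=0.0):
    """Leaky integrate-and-fire: emit 1 when the membrane potential crosses threshold."""
    potential = 0.0
    spikes = []
    for current in inputs:
        potential = leak * potential + current   # leak, then integrate the input
        if potential >= threshold:
            spikes.append(1)                     # fire
            potential = reset                    # and reset
        else:
            spikes.append(0)
    return spikes

print(lif_neuron([0.5] * 6))   # steady input: periodic spikes
print(lif_neuron([0.0] * 6))   # no input: no spikes
```

With a constant input of 0.5 the neuron fires on every third step (`[0, 0, 1, 0, 0, 1]`), and with zero input it never fires. On event-driven hardware, steps without spikes trigger no downstream computation, which is the mechanism behind the energy savings described above.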

Event-driven architectures extend this concept to data processing. Rather than processing all tokens uniformly, event-driven models activate computation only when the input changes significantly from the model's current prediction. For tasks like real-time document monitoring where most content is unchanged between updates, this can reduce compute costs by orders of magnitude. These approaches remain in the research stage and are not yet practical for production deployment.
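The gating idea can be illustrated with a deliberately small example. In the plain-Python sketch below, the "prediction" is assumed to be the last input that was actually processed, and `expensive_fn` stands in for a model forward pass; both choices, along with the function name, are simplifications made for illustration.

```python
def event_driven_process(stream, expensive_fn, threshold):
    """Recompute only when the input moves more than `threshold` from the last processed value."""
    last_input = None
    last_output = None
    n_calls = 0
    outputs = []
    for x in stream:
        if last_input is None or abs(x - last_input) > threshold:
            last_output = expensive_fn(x)   # an "event": run the full computation
            last_input = x
            n_calls += 1
        outputs.append(last_output)         # otherwise reuse the cached result
    return outputs, n_calls

stream = [1.0, 1.01, 1.02, 2.0, 2.0, 2.01]
outputs, n_calls = event_driven_process(stream, lambda x: x * 10, threshold=0.1)
print(outputs, n_calls)
```

Here six inputs trigger only two calls to the expensive function. The compute saved scales with how rarely the input changes, which is why the approach suits monitoring workloads where most content is static between updates.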

Key Takeaways
Research Frontier

The convergence of architectures. Mamba-2's state space duality theorem suggests that SSMs and attention may be endpoints on a spectrum rather than fundamentally different approaches.

Recent work on "linear attention" (Katharopoulos et al., Yang et al.) and "gated linear attention" further blurs the boundary. The research community is moving toward a unified framework where the architectural choice is a hyperparameter (how much to compress the state) rather than a philosophical commitment. Watch for architectures that can dynamically adjust their compression ratio per layer and per input, spending full attention on tokens that need it and using compressed state for the rest.

Exercise 33.3.1:
Conceptual
Quadratic Attention vs. Linear Alternatives

Explain why standard self-attention has $O(n^2)$ time and memory complexity, and describe how state space models (SSMs) and linear attention variants achieve $O(n)$ complexity. For each alternative, explain: (a) the core mechanism that replaces pairwise token comparisons, (b) what capability is lost compared to full attention, (c) the empirical performance gap on standard language modeling benchmarks, and (d) the types of tasks where the linear alternatives perform comparably to full attention.

Exercise 33.3.2:
Analysis
Mamba Architecture Deep Dive

Analyze the Mamba architecture (a selective state space model). Describe: (a) how the "selection mechanism" allows the model to decide what information to remember or forget from the input, (b) how this differs from a fixed-parameter state space model, (c) why Mamba achieves comparable perplexity to transformers on language modeling despite not having explicit attention, and (d) the inference speed advantage of Mamba for long sequences (no KV cache, constant memory per token). Calculate the memory savings for a 1M-token sequence compared to a transformer with a standard KV cache.

Exercise 33.3.3:
Coding
Benchmarking Sequence Length Scaling

Write a Python benchmarking script that measures the memory usage and forward-pass time of a standard self-attention layer versus a simulated linear-time alternative as sequence length increases from 512 to 32,768 tokens. For the attention layer, use PyTorch's scaled_dot_product_attention. For the linear alternative, implement a simple recurrent scan that processes tokens sequentially with a fixed-size hidden state. Plot both memory and time on log-log axes and verify that attention shows $O(n^2)$ scaling while the linear alternative shows $O(n)$.

Exercise 33.3.4:
Discussion
Hybrid Architectures: The Best of Both Worlds?

Several recent models (Jamba, Zamba, Griffin) combine attention layers with SSM or linear attention layers in a hybrid architecture. Discuss: (a) Why would mixing attention and SSM layers work better than either alone? (b) What is the optimal ratio of attention to SSM layers, and how might this depend on the task? (c) Do hybrid architectures represent a transitional step or a permanent design pattern? (d) What would need to be true about a pure SSM architecture for it to completely replace transformers?

Exercise 33.3.5:
Conceptual
In-Context Learning Without Attention

One of the strongest arguments for attention is that it enables in-context learning (the ability to learn from examples in the prompt). If SSMs can achieve comparable perplexity on language modeling, can they also perform in-context learning as effectively as transformers? Discuss: (a) What properties of attention enable in-context learning (the implicit cross-entropy theory from Section 6.7), (b) whether SSMs' fixed-size hidden state limits their ability to "remember" all examples in a long prompt, and (c) what recent empirical evidence shows about SSMs' in-context learning capabilities.

What Comes Next

This section surveyed the architectures challenging transformer dominance. The next section, Section 75.4: World Models, explores how neural networks are learning to simulate physical environments through video generation, interactive 3D worlds, and embodied reasoning for agent planning.

See Also

For the sequence-model and attention foundations these architectures extend, see Section 2.2 and Section 3.1. For inference-side optimizations these architectures co-design with, see Section 9.1 and Section 9.4. For PEFT applied to frontier architectures, see Chapter 17.

Further Reading

State Space Models

Gu, A., Goel, K., and Re, C. (2022). "Efficiently Modeling Long Sequences with Structured State Spaces." ICLR 2022. The foundational S4 paper that introduced structured state spaces for sequence modeling. This work laid the mathematical groundwork for all subsequent SSM architectures discussed in this section.
Gu, A. and Dao, T. (2023). "Mamba: Linear-Time Sequence Modeling with Selective State Spaces." arXiv:2312.00752. Introduces input-dependent selection into state space models, achieving transformer-quality language modeling with linear scaling. The most widely adopted pure SSM architecture at the time of writing.
Dao, T. and Gu, A. (2024). "Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality." ICML 2024. Reveals a deep mathematical duality between transformers and state space models, unifying the two paradigms under a single framework. This Mamba-2 paper simplifies understanding of when each approach excels.

Linear Attention & RNN Revivals

Peng, B. et al. (2023). "RWKV: Reinventing RNNs for the Transformer Era." EMNLP 2023 Findings. Demonstrates that a carefully designed RNN can match transformer quality at multi-billion parameter scale while maintaining constant memory during inference. A key data point for the viability of non-attention architectures.
Sun, Y. et al. (2023). "Retentive Network: A Successor to Transformer for Large Language Models." arXiv:2307.08621. Proposes retention mechanisms that combine the training parallelism of transformers with the constant-cost inference of RNNs. Illustrates the design space between pure attention and pure recurrence.
Katharopoulos, A. et al. (2020). "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention." ICML 2020. An early and influential paper showing that attention can be linearized by kernel approximation, enabling RNN-like inference. This theoretical insight motivates much of the linear attention research covered in this section.

Hybrid Architectures

De, S. et al. (2024). "Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models." arXiv:2402.19427. Shows that interleaving gated linear recurrence layers with local attention windows outperforms pure approaches at scale. Represents the hybrid design philosophy that is emerging as the practical consensus.
Lieber, O. et al. (2024). "Jamba: A Hybrid Transformer-Mamba Language Model." arXiv:2403.19887. The first production-scale hybrid combining Mamba layers, attention layers, and mixture-of-experts in a single 52B-parameter model. Demonstrates that hybrid architectures can be deployed at frontier scale.