Linear attention promised O(n) cost and was a little vague about quality. Quadratic attention promised quality and was very specific about cost. Hybrids tried to negotiate. Mamba and the rest of the state-space crowd were not invited.
Frontier, Architecturally Curious AI Agent
This section continues Section 75.3, which covered the foundational non-transformer alternatives: the scaling problem with self-attention and state-space models (S4, Mamba, Mamba-2). Here we cover the rest of the design space: linear attention and recurrent alternatives (RWKV, RetNet, kernel methods), hybrid architectures, efficiency comparisons, decision criteria for when to reach beyond transformers, and the neuromorphic and event-driven approaches at the frontier. These architecture variants matter for the next generation of LLMs and long-horizon agents, since extending context beyond a million tokens or shrinking decoder cost is what unlocks workloads (codebase-scale reasoning, lifetime memory) that quadratic transformers cannot afford.
Prerequisites
This section continues from Section 75.3, which introduced sparse-attention architectures (Longformer, BigBird) and the broader frontier of attention alternatives. Familiarity with linear attention, state-space models, and the benchmark suites used to compare new architectures (LongBench, RULER) is assumed. Cross-references to Chapter 3 (transformer attention) and Chapter 6 (scaling laws) will help.
When Albert Gu and Tri Dao released Mamba in late 2023, the architecture itself was elegant but the name confusion was instant. Mamba is also a popular Python package manager. And a snake. And a basketball legend. For months, half the search-engine traffic for 'Mamba LLM' landed on conda-forge documentation. The lesson: naming a new architecture is harder than designing it, and the namespace of cool-sounding biology terms is fully booked. Future authors are encouraged to consider obscure mollusks.
75.3.3 Linear Attention and Recurrent Alternatives
The long-context attention landscape splits into four families that each attack the quadratic bottleneck from a different angle: sparse attention (Longformer, BigBird) limits which positions attend to which; linear attention (Performer, Linformer) approximates the attention matrix with a low-rank factorization; hierarchical and block attention (Hierarchical Attention, Compressive Transformer) chunk the sequence and route information up a tree; and recurrent or SSM approaches (Mamba, RWKV, RetNet) abandon explicit attention for a fixed-size state. The rest of this section walks each lineage in detail.
Sparse-attention approaches such as Longformer (Beltagy et al., 2020) and BigBird (Zaheer et al., 2020) approximate full attention by zeroing out most of the $n \times n$ attention matrix. Longformer combines a sliding window of width $w$ around each token with a small set $G$ of global tokens that attend to everything; BigBird adds a sparse set of random connections on top. The cost per layer drops from $O(n^2)$ to:
$$ \mathrm{cost}_{\text{sparse}}(n) = O\big(n \cdot (w + |G| + r)\big), $$
where $w$ is the window width, $|G|$ the number of global tokens, and $r$ the number of random connections per token. With $w = 512$, $|G| = 8$, and $r = 3$, a 16K-token document costs about $30 \times$ less than dense attention and still preserves the universality property: the resulting sparse graph is a universal sequence-to-sequence approximator.
# sparse_attention_mask.py: build a Longformer-style mask in pure PyTorch.
import torch
def longformer_mask(n, window=128, global_idx=(0,)):
"""Boolean (n, n) mask: True where attention is allowed."""
rows = torch.arange(n).unsqueeze(1)
cols = torch.arange(n).unsqueeze(0)
band = (cols - rows).abs() <= window // 2 # sliding window
g = torch.zeros(n, dtype=torch.bool)
g[list(global_idx)] = True
mask = band | g.unsqueeze(0) | g.unsqueeze(1) # union with globals
return mask
n, window = 8192, 512
mask = longformer_mask(n, window=window, global_idx=(0,))
density = mask.float().mean().item()
print(f"non-zero fraction: {density:.4f} (dense baseline = 1.0000)")
# Reuse the mask inside any HuggingFace LongformerModel by setting
# attention_window=window and marking CLS as global_attention_mask=1.
Code Fragment 75.3.4a: A minimal Longformer attention mask. The same boolean tensor is the input that HuggingFace's LongformerModel uses internally via attention_window and global_attention_mask.
For $n = 16{,}384$, dense self-attention requires $n^2 = 2.68 \times 10^8$ query-key dot products per head per layer. A Longformer configuration with window $w = 512$ and $|G| = 8$ global tokens requires only $n \cdot (w + |G|) = 16384 \cdot 520 \approx 8.5 \times 10^6$ dot products, a $31\times$ reduction. The non-zero fraction printed by the code above for $n = 8192, w = 512$ is about $0.063$, exactly matching the analytical $(w + 1) / n$. BigBird's additional $r = 3$ random connections per token raise the constant from $520$ to $523$, a negligible compute increase that buys provable universality of the resulting sparse attention graph.
Standard attention computes softmax over every query-key pair, forcing a quadratic n-by-n matrix. Linear attention drops the softmax and replaces each score with a kernel feature map applied separately to queries and keys, $\text{sim}(q,k)=\phi(q)^\top \phi(k)$. Because the map is applied independently, the computation can be reassociated: instead of building the n-by-n matrix, the model first accumulates the outer products of feature-mapped keys with their values into a fixed-size state matrix, then multiplies each query into that state. Cost and memory then grow linearly with sequence length, and for causal models the state is a running sum updated token by token, exactly like an RNN. The price is approximation error, since the feature map only mimics softmax, which is why pure linear-attention models trail on precise retrieval.
75.3.3.1 RWKV: Reinventing RNNs for the Transformer Era
RWKV (Peng et al., 2023) takes a different approach: rather than inventing a new mechanism, it reformulates the transformer architecture to eliminate the quadratic attention computation while retaining the parallel training properties that made transformers successful. The name reflects its four core operations: Receptance (R), Weight (W), Key (K), and Value (V).
The key innovation is the WKV (Weighted Key-Value) mechanism, which replaces softmax attention with an exponentially decaying sum. Instead of computing attention scores between all pairs of tokens, RWKV maintains a running numerator and denominator that can be updated incrementally:
# RWKV WKV mechanism (simplified, single-head)
import torch
def rwkv_wkv(w, u, k, v):
"""
RWKV attention replacement.
w: decay factors (d_model,) - learned per-channel decay
u: bonus for current token (d_model,) - learned
k: keys (batch, seq_len, d_model)
v: values (batch, seq_len, d_model)
returns: output (batch, seq_len, d_model)
"""
batch, seq_len, d = k.shape
outputs = []
# Running state: exponentially weighted sum
state_num = torch.zeros(batch, d, device=k.device)
state_den = torch.zeros(batch, d, device=k.device)
state_max = torch.full((batch, d), -float('inf'), device=k.device)
for t in range(seq_len):
kt = k[:, t] # (batch, d)
vt = v[:, t] # (batch, d)
# Numerically stable exponential moving average
new_max = torch.maximum(state_max, kt)
# Combine historical state with current token
exp_prev = torch.exp(state_max - new_max)
exp_curr = torch.exp(kt - new_max)
exp_bonus = torch.exp(u + kt - new_max)
# Output: weighted combination
wkv = (exp_prev * state_num + exp_bonus * vt) / \
(exp_prev * state_den + exp_bonus)
outputs.append(wkv)
# Update running state with decay
state_num = torch.exp(w) * exp_prev * state_num + exp_curr * vt
state_den = torch.exp(w) * exp_prev * state_den + exp_curr
state_max = new_max
return torch.stack(outputs, dim=1)
Code 34.3.3: Simplified RWKV WKV attention replacement. The exponential decay w controls how quickly the model forgets older tokens, functioning as a learned "memory horizon" per channel.
RWKV has reached competitive quality at scale. RWKV-6 models at 1.6B, 3B, 7B, and 14B parameters show performance comparable to similarly-sized transformers on standard benchmarks, while offering constant-memory inference. The RWKV community has trained models in multiple languages, and the architecture is fully open-source with an active ecosystem.
75.3.3.2 RetNet: Retentive Networks
RetNet (Sun et al., 2023) from Microsoft Research proposes a "retention" mechanism that supports three computation modes: parallel (for training efficiency), recurrent (for $O(1)$ inference), and chunkwise (a hybrid for long-sequence processing). The retention mechanism uses complex-valued exponential decay rather than softmax normalization.
In the parallel mode, retention can be expressed as a matrix operation similar to attention, enabling efficient GPU utilization during training. In the recurrent mode, it becomes an RNN-like update with fixed-size state, enabling constant memory during inference. The chunkwise mode divides the sequence into fixed-size chunks, processes each chunk in parallel mode, and propagates state between chunks in recurrent mode. This triple formulation gives RetNet flexibility to optimize for the deployment scenario at hand.
75.3.3.3 Griffin and RecurrentGemma
Google DeepMind's Griffin architecture (De et al., 2024) combines linear recurrences with local attention in a hybrid design. Griffin uses a Real-Gated Linear Recurrence (RGLRU) layer that maintains a diagonal state matrix, interleaved with local sliding-window attention layers that handle short-range dependencies. The RecurrentGemma model series implements this architecture at the 2B and 9B parameter scales.
The practical significance of Griffin is that it demonstrates a design pattern: use efficient recurrence for the "backbone" of sequence processing, and add sparse attention layers only where they provide clear benefit (local context, retrieval-like operations). This hybrid approach often outperforms pure SSM or pure attention models of the same size.
The attention versus efficiency tradeoff is not all-or-nothing. The research trajectory is moving away from "replace attention entirely" toward "use attention surgically." Pure SSM models sacrifice recall precision on tasks that require exact matching or retrieval from earlier in the context. Pure attention models pay quadratic cost for every token, even when most tokens do not need to attend to most other tokens. The emerging consensus is that hybrid architectures (attention for precision-critical layers, linear recurrence for everything else) may dominate both pure approaches. For practitioners, this means that the inference optimization techniques from Chapter 9 (KV cache management, continuous batching) will remain relevant even as architectures evolve, because attention layers will likely persist in some form.
75.3.4 Hybrid Architectures: Combining Strengths
If the lesson is "use attention surgically," the field needs concrete architectural recipes that prove it works at scale. Jamba is the most prominent demonstration: it interleaves Mamba blocks, attention blocks, and MoE feed-forward layers into a single 52B-parameter model that runs in 128K context on a single GPU. The hybrid design choices it makes are now templates that other labs are imitating.
75.3.4.1 Jamba: Mamba Meets Transformers
AI21 Labs' Jamba model (Lieber et al., 2024) is the most prominent hybrid architecture, interleaving Mamba layers with transformer attention layers and Mixture-of-Experts (MoE) modules. The architecture uses a ratio of roughly 3:1 Mamba-to-attention layers, with MoE applied to the feed-forward components. This design achieves three goals simultaneously: the long-context handling of Mamba, the precise retrieval capability of attention, and the parameter efficiency of MoE.
from torch import nn
# Jamba-style hybrid architecture (conceptual)
class JambaBlock(nn.Module):
"""
Hybrid block: alternates between Mamba and Attention layers.
Every 4th layer uses attention; the rest use Mamba.
MoE replaces standard FFN in selected layers.
"""
def __init__(
self,
d_model: int,
layer_idx: int,
n_heads: int = 16,
mamba_state_dim: int = 16,
num_experts: int = 16,
active_experts: int = 2,
attention_every_n: int = 4,
moe_every_n: int = 2,
):
super().__init__()
self.layer_idx = layer_idx
self.use_attention = (layer_idx % attention_every_n == 0)
self.use_moe = (layer_idx % moe_every_n == 0)
# Sequence mixing: either Mamba or Attention
if self.use_attention:
self.seq_mixer = MultiHeadAttention(d_model, n_heads)
else:
self.seq_mixer = SelectiveSSM(d_model, mamba_state_dim)
# Channel mixing: either MoE or standard FFN
if self.use_moe:
self.channel_mixer = MoELayer(
d_model, num_experts, active_experts
)
else:
self.channel_mixer = FeedForward(d_model)
self.norm1 = RMSNorm(d_model)
self.norm2 = RMSNorm(d_model)
def forward(self, x, attention_mask=None):
# Pre-norm residual connections
h = x + self.seq_mixer(self.norm1(x), mask=attention_mask)
out = h + self.channel_mixer(self.norm2(h))
return out
Code 34.3.4: Conceptual Jamba-style hybrid block. The architectural ratio (how frequently attention layers appear) is a key design decision that trades recall precision for throughput.
Jamba's 256K-token context window with only 12B active parameters (52B total with MoE) demonstrates the efficiency gains possible with hybrid designs. On the NVIDIA A100, AI21 reported roughly 3x the throughput of a comparable pure-attention model at 128K context length because the Mamba layers eliminate most of the KV cache memory pressure. Independent benchmarks have varied, with throughput ratios from 1.8x to 3.5x depending on batch size, prompt length distribution, and serving framework, so the 3x figure should be read as an order-of-magnitude advantage rather than a guaranteed multiplier in your stack.
Mixture of Experts Gating.
In an MoE layer with E experts, a gating network routes each token x to the top-k experts:
$$\begin{aligned}g(x) &= \operatorname{softmax}(W_{g} \cdot x) \\ \operatorname{TopK} &= \text{argtop}-k(g(x)) \\ y &= \sum _{i \in \operatorname{TopK}} g_{i}(x) \cdot Expert_{i}(x)\end{aligned}$$Only the top-k experts (typically k = 2) are activated per token, so the compute cost scales with k rather than E. To prevent expert collapse (all tokens routed to the same expert), a load balancing auxiliary loss is added:
$$\mathscr{L}_{\text{balance}} = \alpha \cdot E \cdot \sum _{i=1..E} f_{i} \cdot P_{i}$$where fi is the fraction of tokens dispatched to expert i, Pi is the mean gate probability for expert i across the batch, and α is a small coefficient (typically 0.01). This encourages uniform distribution of tokens across experts.
Who: An NLP engineer at a contract analytics startup building a system to extract and cross-reference clauses from 100-page legal contracts.
Situation: The system needed to answer queries like "what does Section 4.2(b) say about the indemnification clause mentioned in Section 15.1?" across documents averaging 128K tokens.
Problem: A pure transformer required 40 GB of GPU memory for the KV cache at 128K tokens, making it impractical on the team's L4 GPUs. A pure Mamba model handled the length efficiently but scored 15% lower on cross-reference retrieval tasks that required precise attention over distant sections.
Decision: The team evaluated a Jamba-style hybrid architecture that used Mamba layers for efficient sequential processing of the bulk document and interspersed attention layers (every 4th layer) for precise cross-reference retrieval.
Result: The hybrid model fit on a single L4 GPU with 24 GB of memory, matched the pure transformer's retrieval accuracy within 2%, and processed contracts 3x faster. Per-document analysis cost dropped from $0.18 to $0.05.
Lesson: Hybrid SSM-attention architectures unlock practical long-context processing by using efficient SSM layers for the bulk of the sequence and reserving attention layers for the sub-tasks that genuinely require precise cross-document retrieval.
75.3.4.2 Design Principles for Hybrid Architectures
Empirical studies from multiple research groups have converged on three design principles for hybrid SSM-attention architectures, each answering a different "where do I put the attention?" question:
- Attention layer placement matters. Placing attention layers at regular intervals (every 4th or 8th layer) works better than clustering them. The attention layers serve as "information consolidation" points that allow the model to perform precise retrieval operations on the compressed state maintained by the SSM layers.
- The ratio depends on the task. Tasks requiring heavy in-context retrieval (question answering, coding with large contexts) benefit from more frequent attention layers (every 2nd or 3rd layer). Tasks dominated by sequential generation (creative writing, summarization) work well with sparser attention (every 6th to 8th layer).
- Sliding-window attention is often sufficient. Instead of full global attention, using a sliding window of 4,096 to 8,192 tokens in the attention layers preserves local precision while keeping memory costs bounded. The SSM layers handle global context propagation.
75.3.5 Efficiency Comparisons and Benchmarks
Comparing architectures requires examining multiple dimensions: quality (benchmark scores, perplexity), throughput (tokens per second during training and inference), memory consumption, and latency characteristics. The following table summarizes approximate comparisons at the 7B parameter scale with 128K context length:
| Architecture | Attention Complexity | Inference Memory (128K) | Throughput (tok/s) | In-Context Retrieval |
|---|---|---|---|---|
| Standard Transformer | $O(n^{2})$ | ~40 GB KV cache | Baseline (1x) | Excellent |
| Mamba-2 | $O(n)$ | ~200 MB state | ~5x at 128K | Good (degrades at extreme range) |
| RWKV-6 | $O(n)$ | ~150 MB state | ~4x at 128K | Good |
| Jamba (Hybrid) | $O(n)$ amortized | ~8 GB (reduced KV) | ~3x at 128K | Very good |
| Griffin | $O(n)$ with local attn | ~2 GB | ~3.5x at 128K | Good |
Table 75.3.1b: Approximate efficiency comparison of architectures at 7B parameters, 128K context. Throughput multiples are relative to standard transformer. Actual numbers vary by implementation and hardware.
Use einops for readable tensor reshaping in attention and SSM layers.
Show code
# pip install einops
from einops import rearrange, repeat
import torch
# Reshape for multi-head attention: (batch, seq, d) to (batch, heads, seq, head_dim)
x = torch.randn(2, 128, 512)
heads = rearrange(x, "b s (h d) -> b h s d", h=8)
print(f"Multi-head shape: {heads.shape}") # (2, 8, 128, 64)
# Repeat a state vector across batch dimension
state = torch.randn(1, 64)
batched = repeat(state, "1 d -> b d", b=32)
print(f"Batched state: {batched.shape}") # (32, 64)
Define and train a minimal sequence model layer using the JAX ecosystem.
Show code
# pip install jax flax optax
import jax
import jax.numpy as jnp
import flax.linen as nn
import optax
class SSMBlock(nn.Module):
state_dim: int = 64
@nn.compact
def __call__(self, x):
d = x.shape[-1]
A = self.param("A", nn.initializers.normal(0.01), (self.state_dim, self.state_dim))
B = self.param("B", nn.initializers.normal(0.01), (self.state_dim, d))
C = self.param("C", nn.initializers.normal(0.01), (d, self.state_dim))
return x + (x @ B.T @ A.T @ C.T) # simplified skip connection
model = SSMBlock()
key = jax.random.PRNGKey(0)
params = model.init(key, jnp.ones((1, 128, 256)))
optimizer = optax.adamw(learning_rate=1e-3)
opt_state = optimizer.init(params)
print(f"Parameter count: {sum(p.size for p in jax.tree.leaves(params))}")
The "Needle in a Haystack" test reveals the retrieval gap. In this test, a specific fact is inserted at a random position in a long document, and the model must retrieve it from a distant point. Transformers with full attention achieve near-perfect accuracy at all positions and context lengths. Pure SSM models show degradation for facts placed in the middle of very long sequences (the "lost in the middle" effect is amplified by compressed state). Hybrid models split the difference, achieving strong retrieval when the fact falls within an attention window and moderate retrieval otherwise. This test remains the clearest diagnostic for evaluating alternative architectures.
75.3.6 When to Consider Non-Transformer Architectures
For most production applications in 2025-2026, the transformer remains the default choice. The ecosystem of tools, pretrained models, fine-tuning frameworks (covered in Chapter 17), and serving infrastructure is built around transformers. Choosing an alternative architecture introduces friction at every stage of the pipeline. That said, three scenarios justify the switch, each driven by a different hardware or workload constraint.
75.3.6.1 Extremely Long Contexts with Constrained Hardware
If your application requires processing 100K+ tokens and you cannot afford the GPU memory for a transformer's KV cache, Mamba or RWKV models provide a practical path forward. The memory savings can be the difference between requiring one GPU and requiring four.
75.3.6.2 High-Throughput Streaming Applications
For applications that process continuous streams of text (real-time transcription analysis, social media monitoring, log analysis), the constant-memory inference of SSMs is a natural fit. Each new token costs the same regardless of how many tokens have been processed, unlike transformers where the per-token cost grows with the KV cache.
75.3.6.3 Edge and Mobile Deployment
When deploying models on devices with limited memory and no access to cloud GPUs, SSM architectures offer the best quality-per-byte ratio for long-context tasks. The small state footprint (hundreds of megabytes vs. tens of gigabytes) makes on-device long-context processing feasible.
Who: A CTO at a healthcare startup building a clinical decision support system that summarized patient records and flagged medication interactions.
Situation: Patient records averaged 50K tokens each. The system needed to both summarize the full record and perform safety-critical medication interaction checks that required precise retrieval of drug names, dosages, and contraindications scattered across the record.
Problem: The team evaluated three architecture options. Option A: a transformer with FlashAttention and 4-bit quantization (covered in Section 9.4) required a single A100 GPU ($2.50/hour). Option B: a Mamba-2 model handled the same records on an L4 GPU at one-quarter the cost but missed 8% of medication interactions in testing. Option C: a Jamba-style hybrid preserved attention-level recall for safety checks while using Mamba layers for the bulk of the record.
Decision: The startup chose Option C. The medication interaction task was safety-critical and required attention-level recall precision, while the overall record summarization could leverage the efficiency of SSM layers. They deployed on a single A10G GPU ($1.00/hour), splitting the cost difference.
Result: The hybrid architecture caught 99.2% of medication interactions (matching the pure transformer) while processing records 2.4x faster. Monthly infrastructure costs were 60% lower than the transformer-only option.
Lesson: Architecture selection should be driven by the most demanding sub-task. When one component requires high-precision retrieval and another needs efficient long-context processing, a hybrid architecture lets you optimize each independently rather than paying the cost of the most expensive requirement everywhere.
When NOT to switch. If your application works with contexts under 8K tokens, if you need the vast ecosystem of fine-tuned transformer models on Hugging Face, if your team lacks experience with newer architectures, or if your task is primarily about precise in-context retrieval, the transformer remains the better choice. The efficiency advantages of alternative architectures only materialize at longer sequence lengths, and the model availability gap is significant.
75.3.7 Neuromorphic and Event-Driven Approaches
A more speculative line of research explores architectures inspired by biological neural computation. Spiking neural networks (SNNs) process information as discrete spikes rather than continuous values, offering potential energy efficiency gains on specialized neuromorphic hardware like Intel's Loihi 2 and IBM's NorthPole.
SpikingGPT and similar projects have demonstrated that language modeling is possible with spiking architectures, though quality lags behind conventional networks at comparable scale. The primary advantage is energy consumption: neuromorphic chips can process inference workloads at 10-100x lower energy per operation than GPUs. If this efficiency gap translates to language models at scale, the implications for sustainability and edge deployment would be transformative.
Event-driven architectures extend this concept to data processing. Rather than processing all tokens uniformly, event-driven models activate computation only when the input changes significantly from the model's current prediction. For tasks like real-time document monitoring where most content is unchanged between updates, this can reduce compute costs by orders of magnitude. These approaches remain in the research stage and are not yet practical for production deployment.
- Self-attention scales quadratically with sequence length. This fundamental bottleneck drives research into linear-time alternatives for long-context processing.
- State space models (Mamba) achieve linear scaling. They process sequences in $O(n)$ time by maintaining compressed hidden states rather than attending to all previous tokens.
- Hybrid architectures combine the strengths of both paradigms. Interleaving attention and SSM layers captures both local precision and long-range efficiency.
- No single architecture dominates all tasks. Transformers still excel at in-context learning and complex retrieval, while SSMs shine at long-range dependency tasks.
The convergence of architectures. Mamba-2's state space duality theorem suggests that SSMs and attention may be endpoints on a spectrum rather than fundamentally different approaches.
Recent work on "linear attention" (Katharopoulos et al., Yang et al.) and "gated linear attention" further blurs the boundary. The research community is moving toward a unified framework where the architectural choice is a hyperparameter (how much to compress the state) rather than a philosophical commitment. Watch for architectures that can dynamically adjust their compression ratio per layer and per input, spending full attention on tokens that need it and using compressed state for the rest.
Explain why standard self-attention has $O(n^2)$ time and memory complexity, and describe how state space models (SSMs) and linear attention variants achieve $O(n)$ complexity. For each alternative, explain: (a) the core mechanism that replaces pairwise token comparisons, (b) what capability is lost compared to full attention, (c) the empirical performance gap on standard language modeling benchmarks, and (d) the types of tasks where the linear alternatives perform comparably to full attention.
Analyze the Mamba architecture (a selective state space model). Describe: (a) how the "selection mechanism" allows the model to decide what information to remember or forget from the input, (b) how this differs from a fixed-parameter state space model, (c) why Mamba achieves comparable perplexity to transformers on language modeling despite not having explicit attention, and (d) the inference speed advantage of Mamba for long sequences (no KV cache, constant memory per token). Calculate the memory savings for a 1M-token sequence compared to a transformer with a standard KV cache.
Write a Python benchmarking script that measures the memory usage and forward-pass time of a standard self-attention layer versus a simulated linear-time alternative as sequence length increases from 512 to 32,768 tokens. For the attention layer, use PyTorch's scaled_dot_product_attention. For the linear alternative, implement a simple recurrent scan that processes tokens sequentially with a fixed-size hidden state. Plot both memory and time on log-log axes and verify that attention shows $O(n^2)$ scaling while the linear alternative shows $O(n)$.
Several recent models (Jamba, Zamba, Griffin) combine attention layers with SSM or linear attention layers in a hybrid architecture. Discuss: (a) Why would mixing attention and SSM layers work better than either alone? (b) What is the optimal ratio of attention to SSM layers, and how might this depend on the task? (c) Do hybrid architectures represent a transitional step or a permanent design pattern? (d) What would need to be true about a pure SSM architecture for it to completely replace transformers?
One of the strongest arguments for attention is that it enables in-context learning (the ability to learn from examples in the prompt). If SSMs can achieve comparable perplexity on language modeling, can they also perform in-context learning as effectively as transformers? Discuss: (a) What properties of attention enable in-context learning (the implicit cross-entropy theory from Section 6.7), (b) whether SSMs' fixed-size hidden state limits their ability to "remember" all examples in a long prompt, and (c) what recent empirical evidence shows about SSMs' in-context learning capabilities.
What Comes Next
This section surveyed the architectures challenging transformer dominance. The next section, Section 75.4: World Models, explores how neural networks are learning to simulate physical environments through video generation, interactive 3D worlds, and embodied reasoning for agent planning.
For the sequence-model and attention foundations these architectures extend, see Section 2.2 and Section 3.1. For inference-side optimizations these architectures co-design with, see Section 9.1 and Section 9.4. For PEFT applied to frontier architectures, see Chapter 17.