Section 3.5: Transformer Variants & Efficiency

BERT reads both ways, GPT only looks left, and T5 just converts everything into a text-to-text problem. Family dinners are awkward.
Norm, Architecturally Confused AI Agent

Big Picture: The Design Space Around Attention

Building on the attention mechanism from Section 3.1, this section catalogs the major design choices a modern Transformer makes within the attention paradigm: which architectural family to use (encoder, decoder, encoder-decoder); how to encode position (sinusoidal, learned, RoPE, ALiBi); how to make attention efficient (sparse, linear, FlashAttention, GQA, MQA, Differential); how many heads to use and why; and where to place LayerNorm. Stepping outside dense softmax attention entirely (SSMs, MoE, MLA) is the subject of Section 3.8.

Key Insight: Remember

Three families, one mental shortcut: BERT (encoder) reads, GPT (decoder) writes, T5 (encoder-decoder) reads-then-writes. Positional encoding is how attention learns word order, since the math itself is permutation-invariant. Without it, "dog bites man" and "man bites dog" look identical to the model.

3.5.1 The Three Architectural Families

If someone asks you "which Transformer should I use?", your first question should be "what are you trying to do?" The original Transformer is an encoder-decoder model, but since 2017 three distinct families have emerged. Each family uses a different subset of the Transformer's components and targets different types of tasks. Understanding when to use which architecture is a fundamental skill in applied NLP.

Fun Fact: Encoders, Decoders, and the Tool Shed

A workshop has three tools that look almost identical at first: a saw cuts in one direction, a plane shaves a flat surface, and a router does both with attachments. Encoder-only models are the saws (read once, classify), decoder-only models are the planes (generate forward), and encoder-decoder models are the routers (read then generate). Pick the wrong one and you can still get the job done; pick the right one and the work becomes effortless.

Three Transformer families: encoder-only, decoder-only, encoder-decoder

Figure 3.5.1: The three Transformer architectural families. The decoder-only pattern dominates modern LLMs.

Three friendly robots representing the three Transformer architectural families: an encoder-only robot, a decoder-only robot, and an encoder-decoder robot, each with distinct visual characteristics — **Figure 3.5.2**: The three Transformer families: encoder-only models (like BERT) read bidirectionally, decoder-only models (like GPT) generate left to right with causal attention, and encoder-decoder models (like T5) combine both approaches.

Prerequisites

This section builds on the complete Transformer architecture from Section 3.1 and the hands-on implementation in Section 3.3. Understanding self-attention, cross-attention, and causal masking from Section 2.3 is essential. The efficiency techniques discussed here connect directly to the inference-optimization strategies covered later in the book.

3.5.1.1 Encoder-Only (BERT Family)

Encoder-only models use bidirectional attention: every token can attend to every other token, including those to its right. This makes them excellent for understanding tasks (classification, token-level labeling, sentence similarity) but unsuitable for generation, since there is no causal structure.

BERT (Devlin et al., 2018) is trained with masked language modeling (MLM): 15% of input tokens are randomly masked, and the model must predict them. This pretraining objective forces the model to build rich bidirectional representations. Key descendants include RoBERTa (larger data, longer training), DeBERTa (disentangled attention with relative position), and ELECTRA (replaced-token detection, which is more sample-efficient than MLM).

3.5.1.2 Decoder-Only (GPT Family)

Decoder-only models use causal (left-to-right) attention with the auto-regressive language modeling objective. They are the dominant architecture for modern LLMs because: (a) the training objective is simple, scalable, and naturally aligns with text generation; (b) they can be prompted to perform virtually any task (classification, translation, reasoning) through in-context learning; and (c) they are straightforward to scale.

GPT-2 (Radford et al., 2019) demonstrated that a sufficiently large language model develops emergent abilities. GPT-3 (Brown et al., 2020) showed that these abilities improve reliably with scale. Today, virtually all frontier LLMs (GPT-4, Claude, LLaMA, Gemini, Mistral) are decoder-only.

3.5.1.3 Encoder-Decoder (T5 Family)

Encoder-decoder models process an input sequence with bidirectional attention (encoder) and generate an output sequence with causal attention (decoder), using cross-attention to condition the decoder on the encoder's output. This is the original Transformer architecture and remains the best choice for tasks with a clear input/output structure where the input benefits from bidirectional processing: translation, summarization, and speech recognition (Whisper).

T5 (Raffel et al., 2020) reframed all NLP tasks as text-to-text problems, demonstrating that a single encoder-decoder model could handle classification, translation, summarization, and question answering with the same architecture, just different input/output text formats.

3.5.2 Positional Encoding Variants

Key Insight

Why RoPE extrapolates and sinusoidal embeddings do not

The deeper reason RoPE wins is that it makes attention scores a function of position difference, not absolute position, so a context-length extension only requires extrapolating one scalar (the rotation rate) instead of teaching the model a new vocabulary of position vectors. Sinusoidal embeddings break at extrapolation because the absolute-position basis becomes out-of-distribution; learned embeddings cannot extrapolate at all. RoPE's frequencies form a geometric series, borrowed from Fourier analysis: low-frequency dimensions encode long-range position, high-frequency dimensions encode local position. NTK-aware scaling and YaRN exploit this by rescaling only the low-frequency band, preserving the local geometry the model was trained on.

Transformers have no built-in notion of order. Without positional information, the model sees a sequence as an unordered set: "the cat sat on the mat" and "the mat sat on the cat" would be indistinguishable. Every positional encoding scheme solves this problem differently, and the choice has real consequences for how well a model generalizes to sequence lengths it has not seen during training. Four main approaches have emerged since 2017.

3.5.2.1 Sinusoidal Encoding (Vaswani et al., 2017)

The original "Attention Is All You Need" paper used a fixed, non-learned encoding: each position is represented by a vector of sine and cosine values at geometrically spaced frequencies. For position $pos$ and dimension index $i$:

PE_{\text{pos}, 2i} = \sin\!\left(\frac{\text{pos}}{10000^{2i/d}}\right), \quad PE_{\text{pos}, 2i+1} = \cos\!\left(\frac{\text{pos}}{10000^{2i/d}}\right)

The low-frequency dimensions change slowly across positions (giving a coarse sense of where in the sequence we are) while high-frequency dimensions oscillate rapidly (giving fine-grained local position information). Together they form a unique "fingerprint" for every position.

Note: Why This Surprises First-Time Readers

The positional encoding is added to the token embedding, not concatenated. Many readers expect that adding a position vector should destroy the token's meaning, the way adding noise to a JPEG destroys the picture. It does not, because the embedding dimensionality (say 768) is much larger than the effective rank needed to encode either tokens or positions, and the optimizer learns to allocate non-overlapping subspaces. The model can recover both "which token" and "which position" from the sum, similar to how you can recover two superimposed sine waves from their sum if their frequencies differ.

Key advantage: because the encoding is a fixed mathematical function, it can be evaluated at any position, including positions beyond the training sequence length. A model trained on sequences up to length 512 can in principle receive a sinusoidal encoding for position 1000. Whether it attends to that position correctly is a separate empirical question, but the encoding itself does not break.

Key limitation: sinusoidal encoding represents absolute positions. The model must implicitly learn that position 5 and position 15 are 10 steps apart; that relative relationship is not directly injected. This becomes a problem for tasks that depend heavily on relative structure rather than absolute location.

3.5.2.2 Learned Positional Embeddings (BERT, GPT-2)

A simpler alternative: treat positions exactly like tokens. Create an embedding table nn.Embedding(max_length, d_model) and look up a learned vector for each position index. This is precisely what our Section 3.3 implementation does with self.pos_emb.

Key advantage: the model can learn whatever positional representation is most useful for the task. If the corpus has strong structural patterns (e.g., the first token is often a subject, the last token is often a period), the learned embeddings can capture those patterns directly. In practice, learned embeddings perform comparably to or slightly better than sinusoidal on standard benchmarks.

Key limitation: the embedding table has exactly max_length rows. If the model encounters a sequence longer than max_length at inference time, it has no embedding for those positions. BERT's maximum length of 512 tokens is a direct consequence of this fixed table. Workarounds (interpolating learned embeddings, fine-tuning on longer sequences) exist but add complexity.

Note: Absolute vs. Relative Position

Both sinusoidal and learned embeddings encode absolute position: they tell the model "this token is at position 7," but not "this token is 3 steps before that one." For many NLP tasks, relative position is what matters: whether a verb is close to its subject is more important than whether the subject is at position 7 or 700. The next two approaches address this directly.

3.5.2.3 Rotary Position Embedding (RoPE)

RoPE (Su et al., 2021) has become the dominant positional encoding scheme in modern LLMs (used in LLaMA, Mistral, Qwen, Gemma). Instead of adding positional information to the input embeddings, RoPE applies a rotation to the query and key vectors in each attention head. The rotation angle depends on the position and the dimension index.

The key insight: after applying RoPE to queries and keys, their dot product depends only on their relative position, not their absolute positions. This is achieved by rotating pairs of dimensions by $pos \times \theta _{i}$:

\text{RoPE}(x, \text{pos})_{2i} = x_{2i} \cos(\text{pos} \cdot \theta_i) - x_{2i+1} \sin(\text{pos} \cdot \theta_i)$$ $$\text{RoPE}(x, \text{pos})_{2i+1} = x_{2i} \sin(\text{pos} \cdot \theta_i) + x_{2i+1} \cos(\text{pos} \cdot \theta_i)

Algorithm 3.5.1: Rotary Position Embedding (RoPE)

Algorithm: Apply RoPE to Q or K
Input:  vector x in R^{T x d_k} (one head), base frequency theta_0 = 10000
Output: rotated x' in R^{T x d_k}

  // 1. Precompute per-pair inverse frequencies (i = 0..d_k/2 - 1)
  For i = 0..d_k/2 - 1:
      theta_i := theta_0^{ -2 i / d_k }         // decays geometrically with dim index

  // 2. Build cos/sin tables of shape (T, d_k/2)
  For pos = 0..T-1, for i = 0..d_k/2 - 1:
      cos_table[pos, i] := cos(pos * theta_i)
      sin_table[pos, i] := sin(pos * theta_i)

  // 3. For each token position, rotate each 2-dim pair
  For pos = 0..T-1, for i = 0..d_k/2 - 1:
      a := x[pos, 2 i]
      b := x[pos, 2 i + 1]
      x'[pos, 2 i]     := a * cos_table[pos, i] - b * sin_table[pos, i]
      x'[pos, 2 i + 1] := a * sin_table[pos, i] + b * cos_table[pos, i]
  Return x'

Property (relative position): for any two positions m, n with rotation matrix R_pos,
    < R_m q, R_n k > = q^T R_{n - m} k
so the post-rotation dot product depends only on (n - m), not on m or n alone.

Source: Su et al., "RoFormer: Enhanced Transformer with Rotary Position Embedding" (arXiv:2104.09864, 2021). This relative-position property is what makes context-window extension via NTK-aware scaling and YaRN possible: rescaling theta_i shifts every position's rotation in a coherent way, so the trained relative-distance behavior generalizes to longer sequences.

import torch
# Rotary Position Embedding (RoPE): rotate pairs of dimensions by
# position-dependent angles so relative distance is encoded in dot products.
def apply_rope(x, freqs_cos, freqs_sin):
 """Apply Rotary Position Embedding to queries or keys.
    x: (B, n_heads, T, d_k)
    freqs_cos, freqs_sin: (T, d_k//2) precomputed cos/sin of rotation angles
    """
# Split into pairs and rotate
x_r = x.float().reshape(*x.shape[:-1], -1, 2) # (..., d_k//2, 2)
x0, x1 = x_r[..., 0], x_r[..., 1]
cos = freqs_cos.unsqueeze(0).unsqueeze(0) # broadcast over B, n_heads
sin = freqs_sin.unsqueeze(0).unsqueeze(0)
# 2D rotation: [cos -sin; sin cos] @ [x0; x1]
out0 = x0 * cos - x1 * sin
out1 = x0 * sin + x1 * cos
out = torch.stack([out0, out1], dim=-1).flatten(-2)
return out.type_as(x)

Code Fragment 3.5.1a: Split into pairs and rotate.

RoPE rotates embedding dimensions by position angle; dot product depends only on relative distance

Figure 3.5.3: RoPE rotates each pair of embedding dimensions by an angle proportional to the token's position. Because the dot product of two rotated vectors depends only on the angle between them (the relative position), RoPE naturally captures relative position without explicit position IDs.

RoPE advantages: (1) it naturally encodes relative positions, (2) it requires no additional parameters, (3) it can be extended to longer sequences through frequency scaling (NTK-aware scaling, YaRN), and (4) it has strong empirical performance.

Note: PI, NTK-Aware, and YaRN Frequency Scaling

Three extension schemes have emerged to push a model trained at context length $L_{\text{train}}$ out to $L_{\text{test}} = s \cdot L_{\text{train}}$ without retraining from scratch, where $s$ is the scale factor (e.g., $s = 4$ to go from 4K to 16K). Each rescales the RoPE frequencies $\theta_i = b^{-2i/d}$ (with base $b = 10000$ in the original RoPE) but with a different bias toward high or low frequencies, and the differences matter because rotation collisions at small positions break in-distribution behavior even as they fix out-of-distribution extrapolation.

Position Interpolation (PI) (Chen et al., 2023) divides every position index by $s$ before rotation, equivalently scaling all frequencies as $\theta_i' = \theta_i / s$. PI keeps the rotation pattern globally compressed and is mathematically equivalent to interpolating the position embedding inside the trained window; it requires only ~1000 fine-tuning steps to recover quality. The drawback is that high-frequency bands (small $i$, which carry sub-token-distance information) get squeezed below their training distribution, so the model loses some precise short-range positional acuity. NTK-aware scaling (bloc97, 2023) addresses this by scaling the base rather than the frequencies: $b' = b \cdot s^{d/(d-2)}$, equivalently $\theta_i' = \theta_i \cdot s^{-2i/(d-2)}$. The exponent in $s^{-2i/(d-2)}$ is near 0 for the highest-frequency bands (small $i$) and approaches $-1$ for the lowest-frequency bands (large $i$). High-frequency rotations are left near their training values; only the low-frequency, long-range bands get stretched. This works zero-shot for modest $s$ but degrades for $s > 4$.

YaRN (Peng et al., 2023) combines NTK-aware scaling with two refinements. First, it partitions the $d/2$ frequency bands by their wavelength relative to the training context: bands whose wavelength is much shorter than $L_{\text{train}}$ are left untouched (these never wrap around the trained window, so they need no fix); bands whose wavelength exceeds $L_{\text{train}}$ are divided by $s$ (full PI for these); intermediate bands are smoothly interpolated. The smooth interpolation function $\gamma(i)$ uses a ramp between the two regimes:

\theta_i^{\mathrm{YaRN}} = \bigl(1 - \gamma(i)\bigr) \cdot \theta_i \;+\; \gamma(i) \cdot \frac{\theta_i}{s},

with $\gamma(i)$ a clamped linear ramp in the band index. Second, YaRN multiplies the attention logits by a small temperature factor $\sqrt{1 + 0.1 \ln s}$ to compensate for the slightly different score distribution induced by the rescaled rotations. Empirically YaRN extends Llama-2 from 4K to 64K and 128K with only ~400 fine-tuning steps and matches or beats PI on long-context evaluations. References: Su et al., "RoFormer," arXiv:2104.09864 (2021); Chen et al., "Extending Context Window via Position Interpolation," arXiv:2306.15595 (2023); Peng et al., "YaRN: Efficient Context Window Extension of Large Language Models," arXiv:2309.00071 (2023).

Figure 3.5.4 contrasts how the three schemes treat the RoPE frequency spectrum: PI squeezes every band uniformly, NTK-aware scaling spares the high-frequency (short-range) bands and stretches only the low-frequency ones, and YaRN partitions the spectrum into untouched, ramped, and fully interpolated regions.

Comparison of three RoPE context-extension schemes across the frequency spectrum, from high-frequency short-range bands on the left to low-frequency long-range bands on the right. Position Interpolation divides every band by the scale factor s uniformly. NTK-aware scaling leaves high-frequency bands near their trained values and progressively stretches the low-frequency bands. YaRN keeps short-wavelength bands untouched, fully interpolates long-wavelength bands by s, and uses a smooth ramp for the intermediate band region. — **Figure 3.5.4**: The three context-extension schemes differ in *which* RoPE frequency bands they rescale. PI (top) divides every band by the scale factor $s$, which damages short-range acuity. NTK-aware scaling (middle) rescales the base instead, leaving high-frequency bands near their trained values and stretching mostly the low-frequency, long-range bands. YaRN (bottom) makes that boundary explicit: short-wavelength bands are untouched, long-wavelength bands get full PI, and a smooth ramp $\gamma(i)$ blends the intermediate region, which is why YaRN extends context furthest with the fewest fine-tuning steps.

RoPE is powerful but still fundamentally couples the model to a maximum trained sequence length. An alternative approach avoids learned or computed embeddings entirely, instead injecting position information directly into the attention computation itself.

3.5.2.4 ALiBi (Attention with Linear Biases)

ALiBi (Press et al., 2022) takes a minimalist approach: it adds a linear bias to the attention scores that penalizes distant positions. No positional encoding is added to the embeddings at all. For head $h$, a bias of $-m_{h} \cdot |i - j|$ is added to the attention score between positions $i$ and $j$, where $m_{h}$ is a head-specific slope (a fixed, non-learned geometric sequence: 1/2, 1/4, 1/8, …). The attention formula becomes:

\text{score}(i, j) = \frac{q_i \cdot k_j}{\sqrt{d_k}} - m_h \cdot |i - j|

The distance penalty grows linearly with separation. Each head has a different slope, so some heads are more "local" (high slope, sharp penalty) and others are more "global" (low slope, gentler penalty). The title of the original paper, "Train Short, Test Long," captures the key benefit: a model trained on sequences of length 1024 can extrapolate well to length 4096 because the bias function is just a linear ramp and never encounters an out-of-range index. ALiBi is used in BLOOM (176B) and several Falcon variants.

Key advantage: outstanding length extrapolation; extremely simple to implement (two lines of code); no additional parameters.

Key limitation: unlike RoPE, ALiBi biases are fixed and not learned, so the model cannot adapt its distance-sensitivity to the data. Empirically, RoPE tends to outperform ALiBi when the training and evaluation lengths are the same; ALiBi wins when lengths differ substantially.

3.5.2.5 Comparing the Four Approaches

The four positional encoding strategies occupy distinct points in a design space defined by whether position is absolute or relative, whether encodings are learned or fixed, and how well the approach generalizes to sequence lengths beyond training. The SVG diagram below shows each approach schematically, and the table beneath gives a quick reference.

Four positional encoding strategies: Sinusoidal, Learned, RoPE, ALiBi

Figure 3.5.5: The four major positional encoding strategies. Sinusoidal and learned embeddings inject absolute position before the first attention layer. RoPE rotates Q and K vectors in each attention layer so their dot product depends on relative distance. ALiBi bypasses embeddings entirely and subtracts a distance penalty from attention scores. RoPE has become the dominant choice in frontier LLMs.

Table 3.5.1b: Positional Encoding Comparison (as of 2026).

Method	Position type	Length generalization	Parameters	Key models	Compute cost
Sinusoidal	Absolute	Moderate (can evaluate beyond training length, but quality degrades)	None (fixed formula)	Original Transformer, some T5 variants	Negligible (precomputed)
Learned	Absolute	None (hard cap at max_length)	max_length × d_model	BERT, GPT-2, RoBERTa, ELECTRA	Negligible (one lookup per token)
RoPE (Sec 3.5.2.3)	Relative (via rotation)	Good; extendable with NTK/YaRN scaling	None (computed from frequencies)	LLaMA 1-3, Mistral, Gemma, Qwen, Gemini, GPT-NeoX	Small (rotation per head per layer)
ALiBi (Sec 3.5.2.4)	Relative (linear penalty)	Excellent (linear extrapolation)	None (fixed slopes)	BLOOM, some Falcon variants, MPT	Negligible (one addition per attention score)

For a survey of which frontier models use which positional encoding in their production configurations, see Section 7.1. The KV cache interaction with RoPE (specifically, how cached keys and values retain their rotated coordinates across steps) is covered in Section 9.3.

Key Takeaways

Three architectural families (encoder-only, decoder-only, encoder-decoder) serve different use cases; decoder-only dominates modern LLMs.
Four positional encoding strategies exist: sinusoidal (fixed, absolute, length-generalizable), learned (trainable, absolute, fixed max length), RoPE (relative, no parameters, strong extrapolation), and ALiBi (relative linear penalty, excellent extrapolation). RoPE is now standard for frontier LLMs.
Length extrapolation depends on how positions are encoded: learned embeddings fail beyond their lookup table, sinusoidal degrades gracefully, RoPE benefits from NTK-aware or YaRN scaling, and ALiBi extrapolates by construction.

Self-Check

1. Why did decoder-only models become dominant over encoder-decoder models for general-purpose LLMs?

Show Answer

Decoder-only models have a simpler training objective (next-token prediction), scale straightforwardly, and can handle any task through prompting (in-context learning). The auto-regressive structure naturally supports generation, and prompting eliminates the need for task-specific architectural modifications.

2. A model uses learned positional embeddings with max_length=512. What problem arises if you pass it a sequence of 600 tokens, and how do sinusoidal and RoPE encodings handle the same situation differently?

Show Answer

With learned embeddings, positions 513 through 600 have no embedding in the lookup table, causing an index error (or requiring a workaround like clamping, which produces incorrect encodings). Sinusoidal encodings can be evaluated at any position by formula, so positions beyond 512 get a valid encoding, though the model may not have learned to use it well. RoPE applies a rotation computed from the position index at attention time; it can be computed for any position. Extensions like NTK-aware scaling and YaRN further improve RoPE's quality at lengths beyond training.

3. What is the key property of RoPE that makes the dot product of two vectors depend on their relative position rather than absolute positions?

Show Answer

RoPE rotates each query or key vector by an angle proportional to its position. When you compute the dot product of a rotated query at position i and a rotated key at position j, the rotation angles partially cancel: the result depends on (i - j), the relative offset, not on i and j separately. This follows from the property of 2D rotations: rotating vector a by angle α and vector b by angle β, then taking the dot product, gives a result that depends on (α - β), i.e. the relative rotation.

What's Next?

The discussion continues in Section 3.5a: Efficient Attention, Multi-Head Reasoning & Normalization, which covers sparse attention, linear attention, FlashAttention, MQA and GQA, Differential Attention, why multiple attention heads matter, and the Pre-Norm vs. Post-Norm question (plus RMSNorm). After that, Section 3.6: GPU Fundamentals & Systems turns to systems-level concepts.

Further Reading

Transformer Variants

Devlin, J. et al. (2019). "BERT: Pretraining of Deep Bidirectional Transformers for Language Understanding." NAACL 2019. The foundational encoder-only Transformer paper, establishing masked language modeling and the [CLS] classification paradigm. Compare the bidirectional architecture here with the decoder-only approach in Section 3.3 to understand the design space.

Raffel, C. et al. (2020). "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer." JMLR. Introduces T5 and the encoder-decoder text-to-text framework, with systematic comparisons of architecture choices and pretraining objectives. The most comprehensive empirical study of Transformer design decisions available.

Positional Encoding

Su, J., Lu, Y., Pan, S., Wen, B., & Liu, Y. (2021). "RoFormer: Enhanced Transformer with Rotary Position Embedding." arXiv:2104.09864. Introduces RoPE, the dominant positional encoding in modern LLMs. The paper proves that rotating Q and K vectors by position-dependent angles causes their dot product to depend only on relative distance. This section's explanation and SVG diagrams are based on the formulation in this paper. Required reading for anyone implementing or debugging attention in a modern LLM.

Press, O., Smith, N. A., & Lewis, M. (2022). "Train Short, Test Long: Attention with Linear Biases Enables Input Length Generalization." ICLR 2022. Introduces ALiBi, replacing positional embeddings with a linear bias subtracted from attention scores. Demonstrates that a model trained on 1024-token sequences can reliably process 2048-token sequences at inference. The title is a memorable summary of the core contribution: simplicity in the encoding enables robustness in deployment.