Chapter 3: The Transformer Architecture

"Attention is all you need. Well, that and a few hundred billion parameters, a small country's worth of electricity, and a team of researchers who haven't slept since 2017."
Attn, Sleep-Deprived AI Agent

Looking Back

Chapter 2 introduced attention as a mechanism. This chapter assembles attention into the Transformer, the architecture every modern LLM is built on. You will see how one token gets computed end-to-end (Section 3.1), then build a working decoder-only Transformer from scratch in PyTorch (Section 3.3). The 300 lines of code at the end of this chapter are the most important code in the book; everything from here on is engineering on top of this core.

Chapter Overview

This is the central chapter of the entire book. The Transformer, introduced in the landmark 2017 paper "Attention Is All You Need," is the architecture behind every modern large language model. Building on the attention mechanisms introduced in Chapter 2, this chapter dissects the Transformer layer by layer, builds one from scratch, surveys the many variants that have emerged since (explored further in Chapter 07: Modern LLM Landscape), examines the GPU hardware it runs on, and explores the theoretical limits of what Transformers can and cannot compute.

By the end of this chapter you will be able to read a Transformer implementation, modify it confidently, reason about its computational cost (a foundation for the inference optimization techniques in Chapter 09), and understand why certain architectural choices (positional encoding, layer normalization, residual connections) are not arbitrary but deeply principled.

Fun Fact: The Architecture That Ate AI

In 1997 a strange new building, the Guggenheim Museum Bilbao, opened in northern Spain, and within a decade every museum on Earth was hiring Frank Gehry. The 2017 Transformer paper had the same effect on AI. Every modern LLM, every multimodal model, almost every protein folder is a variation on the same titanium-clad blueprint, and every chapter after this one assumes you can read it.

Big Picture

The Transformer is the architecture every chapter after this one assumes you understand. We build it from the ground up: token to embedding, attention, residual plus LayerNorm, feed-forward, repeat. By the end of the chapter you will be able to sketch a 300-line decoder-only Transformer from memory and reason about its memory and compute budget.
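As a first taste of what "reasoning about the budget" looks like, here is a rough back-of-the-envelope sketch in plain Python. It assumes a GPT-style decoder block (four d × d attention projections and a two-matrix MLP with hidden width 4d), tied input and output embeddings, and it ignores biases and LayerNorm parameters. It also uses the common rule of thumb of about 2 FLOPs per parameter per token for a forward pass, which leaves out the attention-score cost that grows with context length. The function name `transformer_budget` and the example configuration are illustrative choices, not part of the chapter's code.

```python
def transformer_budget(vocab_size, d_model, n_layers, bytes_per_param=2):
    """Rough parameter, memory, and compute estimate for a decoder-only Transformer.

    Assumptions (illustrative only): GPT-style blocks, MLP hidden width of
    4 * d_model, tied input/output embeddings, biases and LayerNorm
    parameters ignored.
    """
    attn = 4 * d_model * d_model        # W_q, W_k, W_v, W_o
    mlp = 2 * d_model * (4 * d_model)   # up-projection and down-projection
    per_block = attn + mlp              # equals 12 * d_model**2
    embedding = vocab_size * d_model    # shared with the LM head when tied
    total = n_layers * per_block + embedding
    return {
        "params_per_block": per_block,
        "embedding_params": embedding,
        "total_params": total,
        "weight_bytes": total * bytes_per_param,   # 2 bytes/param ~ fp16 or bf16
        "forward_flops_per_token": 2 * total,      # rule of thumb, excludes attention scores
    }


# Hypothetical configuration in the 7B-class range.
budget = transformer_budget(vocab_size=32_000, d_model=4096, n_layers=32)
for name, value in budget.items():
    print(f"{name:>24}: {value:,}")
```

With these illustrative settings the estimate comes to roughly 6.6 billion parameters, or about 13 GB of weights at 2 bytes per parameter. Real models in this size class differ in detail (gated MLPs, untied embeddings, grouped-query attention), so treat the numbers as an order-of-magnitude check rather than an exact count.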

**Figure 4.0.2**: Token to next-token in seven stages (Transformer anatomy diagram). Every modern LLM is this diagram with the block in step 4 repeated N times; Mamba, MoE, LoRA, FlashAttention, and quantization are all substitutions into specific boxes, not changes to the skeleton. Step 1 is the tokens (integer ids, shape T). Step 2 is the token embedding (a vocab × d matrix). Step 3 adds positional information. Step 4 is the Transformer block, a dashed-bordered region containing 4a, multi-head causal self-attention with LayerNorm and a residual, and 4b, a feed-forward MLP with LayerNorm and a residual; the block repeats N times. Step 5 is the final LayerNorm. Step 6 is the LM head (a linear layer using E-transpose, that is, weights tied to the embedding) producing logits. Step 7 samples the next token; an autoregressive loop arrow returns to step 1. A second panel breaks down where the parameters live for a 7B example.
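The seven stages described in the figure can be sketched in a few dozen lines of PyTorch. This is a minimal illustration, not the chapter's 300-line implementation: it assumes pre-LayerNorm blocks, learned absolute position embeddings for step 3, a GELU MLP with hidden width 4d, and plain multinomial sampling for step 7. The class names `Block` and `TinyDecoder` and all sizes are made up for the example, and the model is untrained, so the sampled ids are meaningless.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Block(nn.Module):
    """Step 4: one pre-LayerNorm block (attention and MLP, each with a residual)."""

    def __init__(self, d, n_heads):
        super().__init__()
        self.n_heads = n_heads
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.qkv = nn.Linear(d, 3 * d)    # fused query/key/value projection
        self.proj = nn.Linear(d, d)       # attention output projection
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def attention(self, x):
        B, T, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (B, T, d) -> (B, heads, T, head_dim)
        q, k, v = (t.view(B, T, self.n_heads, d // self.n_heads).transpose(1, 2)
                   for t in (q, k, v))
        scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
        # Causal mask: position t may only attend to positions <= t.
        causal = torch.tril(torch.ones(T, T, dtype=torch.bool, device=x.device))
        scores = scores.masked_fill(~causal, float("-inf"))
        out = F.softmax(scores, dim=-1) @ v
        return self.proj(out.transpose(1, 2).reshape(B, T, d))

    def forward(self, x):
        x = x + self.attention(self.ln1(x))   # 4a: attention + residual
        x = x + self.mlp(self.ln2(x))         # 4b: MLP + residual
        return x


class TinyDecoder(nn.Module):
    def __init__(self, vocab=100, d=32, n_heads=4, n_layers=2, max_len=64):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab, d)      # step 2: vocab x d matrix E
        self.pos_emb = nn.Embedding(max_len, d)    # step 3: learned positions (assumption)
        self.blocks = nn.ModuleList([Block(d, n_heads) for _ in range(n_layers)])
        self.ln_f = nn.LayerNorm(d)                # step 5: final LayerNorm

    def forward(self, ids):                        # step 1: ids has shape (B, T)
        T = ids.size(1)
        x = self.tok_emb(ids) + self.pos_emb(torch.arange(T, device=ids.device))
        for block in self.blocks:                  # step 4, repeated N times
            x = block(x)
        x = self.ln_f(x)
        return x @ self.tok_emb.weight.T           # step 6: tied LM head, logits (B, T, vocab)


torch.manual_seed(0)
model = TinyDecoder()
ids = torch.randint(0, 100, (1, 5))                # a 5-token prompt of random ids
with torch.no_grad():
    for _ in range(3):                             # autoregressive loop back to step 1
        logits = model(ids)
        probs = F.softmax(logits[:, -1, :], dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)   # step 7: sample
        ids = torch.cat([ids, next_id], dim=1)
print(ids.shape)  # torch.Size([1, 8])
```

The loop recomputes the full sequence on every step for clarity; a practical implementation would cache keys and values instead.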

Note: Learning Objectives

Walk through the original Transformer paper and explain every component from positional encodings to output probabilities
Implement a complete decoder-only Transformer in ~300 lines of PyTorch (using skills from Chapter 00) and train it on a small dataset
Compare encoder-only, decoder-only, and encoder-decoder architectures with concrete use cases, preparing you for Chapter 06's pretraining discussion
Explain efficient attention mechanisms (linear attention, sparse attention, FlashAttention) and their tradeoffs
Describe State Space Models (SSMs/Mamba), Mixture-of-Experts (MoE), RWKV, Gated Attention, and Multi-head Latent Attention
Understand GPU architecture (SMs, memory hierarchy, bandwidth) and write a basic Triton kernel
State the universal approximation and computational complexity results for Transformers, and explain how chain-of-thought reasoning (a concept revisited in Chapter 11: Interpretability) extends their power

Prerequisites

Chapter 0: Comfortable with PyTorch tensors, autograd, and training loops
Chapter 1: Familiarity with word embeddings, vector spaces, and similarity measures
Chapter 1: Understanding of tokenization and vocabulary construction
Chapter 2: Solid grasp of attention mechanisms (dot-product attention, multi-head attention, causal masking)
Linear algebra: matrix multiplication, softmax, norms

Sections

What's Next?

Next: Chapter 4: Decoding Strategies & Text Generation. A trained transformer is a probability distribution over the next token. That is not text. Chapter 4 covers the algorithms that turn that distribution into useful output: greedy and beam search, top-k and nucleus sampling, speculative decoding, and the newer diffusion-language-model approach that breaks the left-to-right assumption entirely. Spoiler: temperature 0 is rarely what you want.