Chapter 04: The Transformer Architecture

"Attention is all you need. Well, that and a few hundred billion parameters, a small country's worth of electricity, and a team of researchers who haven't slept since 2017."
Attn, Sleep-Deprived AI Agent

**Figure 4.0.1**: The Transformer, depicted as a layered machine of self-attention and feed-forward blocks, is the architectural blueprint behind nearly every modern large language model.

Chapter Overview

This is the central chapter of the book. The Transformer, introduced in the landmark 2017 paper "Attention Is All You Need," is the architecture behind nearly every modern large language model. Building on the attention mechanisms introduced in Chapter 3, this chapter dissects the Transformer layer by layer and builds one from scratch. It then surveys the many variants that have emerged since (explored further in Chapter 7: Modern LLM Landscape), examines the GPU hardware Transformers run on, and explores the theoretical limits of what they can and cannot compute.

By the end of this chapter you will be able to read a Transformer implementation, modify it confidently, reason about its computational cost (a foundation for the inference optimization techniques in Chapter 9), and understand why certain architectural choices (positional encoding, layer normalization, residual connections) are not arbitrary but deeply principled.
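
As a first taste of reasoning about computational cost, the sketch below estimates the parameter count and matrix-multiply FLOPs of a single Transformer layer. It is a back-of-the-envelope approximation, not an exact profile: the function name `layer_cost` is hypothetical, biases, layer norms, and softmax are ignored, one multiply-add is counted as 2 FLOPs, and the feed-forward width defaults to the common choice of 4 times the model width.

```python
def layer_cost(n, d_model, d_ff=None):
    """Rough per-layer cost for a sequence of n tokens (assumptions in the text above)."""
    d_ff = 4 * d_model if d_ff is None else d_ff

    # Parameters: four d_model x d_model attention projections (Q, K, V, output)
    # plus the two feed-forward matrices.
    attn_params = 4 * d_model * d_model
    ffn_params = 2 * d_model * d_ff

    # FLOPs: each weight is used once per token (2 FLOPs per multiply-add).
    proj_flops = 2 * n * attn_params
    ffn_flops = 2 * n * ffn_params
    # Attention scores (Q K^T) and the weighted sum over V: two n x n x d_model matmuls.
    score_flops = 2 * (2 * n * n * d_model)

    return {
        "params": attn_params + ffn_params,
        "proj_flops": proj_flops,
        "ffn_flops": ffn_flops,
        "score_flops": score_flops,
    }


# Illustrative sizes: d_model=512 with a 2048-wide feed-forward block.
short = layer_cost(n=1024, d_model=512)
long = layer_cost(n=2048, d_model=512)
print(short["params"])
print(long["score_flops"] / short["score_flops"])  # growth of the attention-score term
print(long["ffn_flops"] / short["ffn_flops"])      # growth of the feed-forward term
```

Doubling the sequence length doubles the projection and feed-forward terms but quadruples the attention-score term, which is the quadratic cost that the efficient-attention variants in this chapter try to reduce.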

Prerequisites

Chapter 0: Comfort with PyTorch tensors, autograd, and training loops
Chapter 1: Familiarity with word embeddings, vector spaces, and similarity measures
Chapter 2: Understanding of tokenization and vocabulary construction
Chapter 3: Solid grasp of attention mechanisms (dot-product attention, multi-head attention, causal masking)
Linear algebra: matrix multiplication, softmax, norms
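
A quick way to check these prerequisites is to read the short sketch below and predict what it does before running it. It is a minimal single-head illustration of causal scaled dot-product attention, assuming PyTorch is installed; the function name `causal_attention` and the tensor sizes are illustrative choices, not code from an earlier chapter.

```python
import math
import torch


def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v: tensors of shape (seq_len, d_k).
    Returns the attended values and the attention weights.
    """
    seq_len, d_k = q.shape
    scores = q @ k.T / math.sqrt(d_k)  # (seq_len, seq_len) similarity scores
    # Lower-triangular mask: position i may attend only to positions <= i.
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    scores = scores.masked_fill(~mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)  # each row sums to 1
    return weights @ v, weights


torch.manual_seed(0)
q, k, v = (torch.randn(4, 8) for _ in range(3))
out, weights = causal_attention(q, k, v)
print(out.shape)
print(weights)
```

If the upper triangle of zeros in `weights`, the rows summing to 1, and the division by the square root of `d_k` all look familiar, you are ready for this chapter.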

Learning Objectives

Walk through the original Transformer paper and explain every component from positional encodings to output probabilities
Implement a complete decoder-only Transformer in ~300 lines of PyTorch (using skills from Chapter 0) and train it on a small dataset
Compare encoder-only, decoder-only, and encoder-decoder architectures with concrete use cases, preparing you for Chapter 6's pretraining discussion
Explain efficient attention mechanisms (linear attention, sparse attention, FlashAttention) and their tradeoffs
Describe State Space Models (SSMs/Mamba), Mixture-of-Experts (MoE), RWKV, Gated Attention, and Multi-head Latent Attention
Understand GPU architecture (SMs, memory hierarchy, bandwidth) and write a basic Triton kernel
State the universal approximation and computational complexity results for Transformers, and explain how chain-of-thought reasoning (a concept revisited in Chapter 18: Interpretability) extends their power

Sections

What's Next?

The next section, Section 4.1: Transformer Architecture Deep Dive, walks through the complete Transformer architecture and examines how each component contributes to the whole.

Bibliography & Further Reading

Foundational Papers

Vaswani, A. et al. (2017). "Attention Is All You Need." NeurIPS 2017. arxiv.org/abs/1706.03762

The paper that introduced the Transformer architecture, replacing recurrence entirely with self-attention and positional encodings.

Devlin, J. et al. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." NAACL 2019. arxiv.org/abs/1810.04805

Introduces the encoder-only Transformer for masked language modeling, setting new benchmarks across NLP tasks.

Radford, A. et al. (2019). "Language Models are Unsupervised Multitask Learners." OpenAI (GPT-2 paper). openai.com

Demonstrates that decoder-only Transformers trained on large corpora can perform diverse tasks without task-specific fine-tuning.

Dao, T. et al. (2022). "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." NeurIPS 2022. arxiv.org/abs/2205.14135

Introduces an IO-aware attention algorithm that reduces attention's memory usage from quadratic to linear in sequence length while still computing exact attention.

Gu, A. & Dao, T. (2023). "Mamba: Linear-Time Sequence Modeling with Selective State Spaces." arxiv.org/abs/2312.00752

Proposes a selective state space model as an alternative to Transformers, achieving linear-time complexity for sequence modeling.

Shazeer, N. et al. (2017). "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer." ICLR 2017. arxiv.org/abs/1701.06538

Introduces the sparsely-gated Mixture-of-Experts layer with learned gating, enabling models to scale capacity without proportional compute increases.

Su, J. et al. (2021). "RoFormer: Enhanced Transformer with Rotary Position Embedding." arxiv.org/abs/2104.09864

Introduces Rotary Position Embeddings (RoPE), now the dominant positional encoding scheme in modern LLMs including LLaMA and Mistral.

Key Books

Jurafsky, D. & Martin, J. H. (2024). Speech and Language Processing (3rd ed. draft), Chapter 10: Transformers and Large Language Models. web.stanford.edu/~jurafsky/slp3

A clear textbook treatment of the Transformer architecture, covering self-attention, positional encoding, and language model heads.

Phuong, M. & Hutter, M. (2022). "Formal Algorithms for Transformers." arxiv.org/abs/2207.09238

A mathematically precise specification of Transformer algorithms, useful as a reference for implementers seeking unambiguous definitions.

Tools & Libraries

The Annotated Transformer (Harvard NLP). nlp.seas.harvard.edu/annotated-transformer

A line-by-line PyTorch implementation of the original Transformer paper with extensive commentary and visualizations.

Hugging Face Transformers Library. github.com/huggingface/transformers

The most widely used library for working with pretrained Transformer models, supporting hundreds of architectures and model weights.

OpenAI Triton. github.com/triton-lang/triton

A Python-based language for writing GPU kernels, used to implement custom attention and fused operations discussed in the GPU systems section.

Karpathy, A. "nanoGPT." github.com/karpathy/nanoGPT

A minimal, readable implementation of GPT training and inference whose model definition and training loop are each about 300 lines of PyTorch, ideal for learning the decoder-only Transformer.