Part I: Foundations
Chapter 04: The Transformer Architecture

The Transformer Architecture

"Attention is all you need. Well, that and a few hundred billion parameters, a small country's worth of electricity, and a team of researchers who haven't slept since 2017."

Attn Attn, Sleep-Deprived AI Agent
Figure 4.0.1: The Transformer, depicted as a layered machine of self-attention and feed-forward blocks, is the architectural blueprint behind every modern large language model.

Chapter Overview

This is the central chapter of the entire book. The Transformer, introduced in the landmark 2017 paper "Attention Is All You Need," is the architecture behind every modern large language model. Building on the attention mechanisms introduced in Chapter 3, this chapter dissects the Transformer layer by layer, builds one from scratch, surveys the many variants that have emerged since (explored further in Chapter 7: Modern LLM Landscape), examines the GPU hardware it runs on, and explores the theoretical limits of what Transformers can and cannot compute.

By the end of this chapter you will be able to read a Transformer implementation, modify it confidently, reason about its computational cost (a foundation for the inference optimization techniques in Chapter 9), and understand why certain architectural choices (positional encoding, layer normalization, residual connections) are not arbitrary but deeply principled.
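To make those architectural choices concrete before the chapter proper begins, here is a minimal sketch of a single Transformer block in NumPy: causal self-attention followed by a position-wise feed-forward network, each wrapped in a residual connection with layer normalization. This is a simplified illustration, not a reference implementation: it uses a single attention head, random untrained weights, and the pre-LN arrangement common in modern models (the original 2017 paper used post-LN); all weight names (`Wq`, `W1`, etc.) are invented for the example.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def causal_self_attention(x, Wq, Wk, Wv):
    n, d = x.shape
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)           # (n, n) scaled dot-product logits
    mask = np.triu(np.ones((n, n)), k=1)    # upper triangle = future positions
    scores = np.where(mask == 1, -1e9, scores)  # causal mask: no peeking ahead
    return softmax(scores) @ v

def transformer_block(x, Wq, Wk, Wv, W1, W2):
    # Pre-LN: normalize, transform, then add back onto the residual stream.
    x = x + causal_self_attention(layer_norm(x), Wq, Wk, Wv)
    x = x + np.maximum(0, layer_norm(x) @ W1) @ W2  # ReLU feed-forward network
    return x

rng = np.random.default_rng(0)
n, d = 8, 16                                # sequence length, model width
x = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
W1 = rng.normal(size=(d, 4 * d)) * 0.1      # FFN expands width 4x, as in the original paper
W2 = rng.normal(size=(4 * d, d)) * 0.1
y = transformer_block(x, Wq, Wk, Wv, W1, W2)
print(y.shape)
```

Because the block maps a `(sequence, width)` array to another array of the same shape, blocks can be stacked arbitrarily deep; the residual additions are what keep gradients flowing through that stack. Note also the `(n, n)` score matrix: this is the quadratic cost in sequence length that the inference optimizations in Chapter 9 work to tame.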

Prerequisites

Learning Objectives

Sections

What's Next?

In the next section, Section 4.1: Transformer Architecture Deep Dive, we examine the complete Transformer architecture component by component, seeing how each part contributes to the whole.

Bibliography & Further Reading

Foundational Papers

Vaswani, A. et al. (2017). "Attention Is All You Need." NeurIPS 2017. arxiv.org/abs/1706.03762
The paper that introduced the Transformer architecture, replacing recurrence entirely with self-attention and positional encodings.
Devlin, J. et al. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." NAACL 2019. arxiv.org/abs/1810.04805
Introduces the encoder-only Transformer for masked language modeling, setting new benchmarks across NLP tasks.
Radford, A. et al. (2019). "Language Models are Unsupervised Multitask Learners." OpenAI. openai.com (GPT-2 paper)
Demonstrates that decoder-only Transformers trained on large corpora can perform diverse tasks without task-specific fine-tuning.
Dao, T. et al. (2022). "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." NeurIPS 2022. arxiv.org/abs/2205.14135
Introduces an IO-aware attention algorithm that reduces memory usage from quadratic to linear while maintaining exact computation.
Gu, A. & Dao, T. (2023). "Mamba: Linear-Time Sequence Modeling with Selective State Spaces." arxiv.org/abs/2312.00752
Proposes a selective state space model as an alternative to Transformers, achieving linear-time complexity for sequence modeling.
Shazeer, N. et al. (2017). "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer." ICLR 2017. arxiv.org/abs/1701.06538
Introduces the Mixture-of-Experts architecture with learned gating, enabling models to scale capacity without proportional compute increases.
Su, J. et al. (2021). "RoFormer: Enhanced Transformer with Rotary Position Embedding." arxiv.org/abs/2104.09864
Introduces Rotary Position Embeddings (RoPE), now the dominant positional encoding scheme in modern LLMs including LLaMA and Mistral.

Key Books

Jurafsky, D. & Martin, J. H. (2024). Speech and Language Processing (3rd ed. draft), Chapter 10: Transformers and Large Language Models. web.stanford.edu/~jurafsky/slp3
A clear textbook treatment of the Transformer architecture, covering self-attention, positional encoding, and language model heads.
Phuong, M. & Hutter, M. (2022). "Formal Algorithms for Transformers." arxiv.org/abs/2207.09238
A mathematically precise specification of Transformer algorithms, useful as a reference for implementers seeking unambiguous definitions.

Tools & Libraries

The Annotated Transformer (Harvard NLP). nlp.seas.harvard.edu/annotated-transformer
A line-by-line PyTorch implementation of the original Transformer paper with extensive commentary and visualizations.
Hugging Face Transformers Library. github.com/huggingface/transformers
The most widely used library for working with pretrained Transformer models, supporting hundreds of architectures and model weights.
OpenAI Triton. github.com/openai/triton
A Python-based language for writing GPU kernels, used to implement custom attention and fused operations discussed in the GPU systems section.
Karpathy, A. "nanoGPT." github.com/karpathy/nanoGPT
A minimal, readable implementation of GPT training and inference in about 300 lines of PyTorch, ideal for learning the decoder-only Transformer.