Mathematical Foundations

The essential linear algebra, probability, calculus, and information theory that power every transformer

Figure A.0.1: Friendly geometric shapes including vectors, matrices, and gradient arrows collaborating on a chalkboard while a neural network peeks in from the corner.
Big Picture

Every transformer is, mathematically, a long composition of matrix products, probability distributions, and gradient updates. This appendix collects the four bodies of math that recur most often in the book: linear algebra (vectors, matrices, eigendecomposition), probability and statistics (distributions, Bayes, expectation), calculus for ML (gradients, backpropagation, optimization), and information theory (entropy, KL divergence, cross-entropy loss). It is not a substitute for a full course in any of them; it is a targeted reference that ties each concept directly to its role in large language models.

These topics matter in practice. Understanding why attention divides its scores by sqrt(d_k), why cross-entropy is the natural loss for language modeling, or why KL divergence appears in RLHF requires fluency with the material here. Readers who treat these as black boxes will struggle to debug training runs or interpret research papers.
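As a small illustration of the sqrt(d_k) point, the sketch below estimates the variance of query-key dot products before and after scaling. It assumes, as the usual argument does, that query and key components are independent with zero mean and unit variance. Under that assumption the raw dot product has variance close to d_k, and dividing by sqrt(d_k) brings it back to roughly 1. The function name `dot_product_variance` and the sample sizes are illustrative choices, not taken from the book.

```python
import math
import random
import statistics


def dot_product_variance(d_k, n_samples=2000, seed=0):
    """Estimate the variance of q.k and of q.k / sqrt(d_k).

    Assumption: components of q and k are i.i.d. standard normal.
    Returns (variance of raw scores, variance of scaled scores).
    """
    rng = random.Random(seed)
    raw_scores = []
    for _ in range(n_samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        raw_scores.append(sum(qi * ki for qi, ki in zip(q, k)))
    # Scaling every score by 1/sqrt(d_k) divides the variance by d_k.
    scaled_scores = [s / math.sqrt(d_k) for s in raw_scores]
    return statistics.pvariance(raw_scores), statistics.pvariance(scaled_scores)


for d_k in (16, 64):
    raw_var, scaled_var = dot_product_variance(d_k)
    print(f"d_k={d_k:3d}  raw variance={raw_var:7.2f}  scaled variance={scaled_var:5.2f}")
```

Without the scaling, larger d_k produces larger score magnitudes, which push the softmax toward near one-hot outputs with very small gradients. Keeping the score variance near 1 avoids that regardless of head dimension.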

This appendix is most useful for readers who studied these subjects previously but need a refresher targeted to LLMs, and for practitioners who want to understand the "why" behind formulas they already use. If these topics feel entirely new, supplement with a linear algebra or probability textbook before proceeding.

The mathematical foundations here underpin everything in Chapter 04 (Transformer Architecture), which is the primary destination for applying this material. Optimization and gradient concepts connect directly to Chapter 00 (ML and PyTorch Foundations). Information-theoretic concepts such as cross-entropy and KL divergence reappear in Chapter 06 (Pretraining and Scaling Laws) and in Chapter 34 (Evaluation).

Note: Prerequisites

This appendix has no chapter prerequisites; it is designed to be read before the main text. Comfort with basic algebra and function notation is assumed. If you have taken a calculus or introductory statistics course, much of the material here will be review. If not, treat the examples as the primary learning vehicle and do not be discouraged by notation that looks unfamiliar at first.

Real-World Scenario: When to Use This Appendix

Read this appendix at the start of the course if your mathematics background is rusty, or use it as a lookup reference when a main chapter uses notation or a concept that needs clarification. In particular, return here when Chapter 04's attention mechanism formulas feel opaque, when loss functions in Chapter 06 need grounding, or when PEFT methods in Chapter 19 involve matrix decompositions. If you are already comfortable with all four domains, a quick skim of Section A.5 (Connecting the Pieces) may be sufficient.

Sections

A.6 Information Theory for Language Models: Entropy, cross-entropy, perplexity, KL divergence. The four information-theoretic quantities that anchor language modeling.