Building Conversational AI with LLMs and Agents
Appendix A

Mathematical Foundations

The essential linear algebra, probability, calculus, and information theory that power every transformer

Big Picture

This appendix collects the mathematical background you will encounter throughout the textbook. It is not a substitute for a full course in linear algebra or probability; rather, it is a targeted reference that connects each concept directly to its role in large language models. Four domains are covered: linear algebra (vectors, matrices, eigendecomposition), probability and statistics (distributions, Bayes' rule, expectation), calculus for ML (gradients, backpropagation, optimization), and information theory (entropy, KL divergence, cross-entropy loss).

These topics matter because every layer of a transformer is ultimately a sequence of matrix operations, probability distributions, and gradient-based updates. Understanding why attention scores are scaled by sqrt(d_k), why cross-entropy is the natural loss for language modeling, and why KL divergence appears in RLHF requires fluency with the material here. Readers who treat these as black boxes will struggle to debug training runs or interpret research papers.
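As a taste of the kind of reasoning this appendix supports, the sqrt(d_k) scaling can be checked numerically. The sketch below (a minimal illustration using NumPy; the variable names q, k, and d_k are ours, chosen to mirror the usual attention notation) shows that dot products of d_k-dimensional vectors with unit-variance entries have variance roughly d_k, so dividing by sqrt(d_k) restores variance near 1 before the softmax.

```python
import numpy as np

# Why attention divides scores by sqrt(d_k): a dot product of two
# d_k-dimensional vectors with independent unit-variance entries is a
# sum of d_k terms, each with variance 1, so its variance is ~ d_k.
# Without scaling, softmax inputs grow with dimension and saturate.
rng = np.random.default_rng(0)
d_k = 64
q = rng.standard_normal((10_000, d_k))  # 10,000 sample "query" vectors
k = rng.standard_normal((10_000, d_k))  # 10,000 sample "key" vectors

scores = (q * k).sum(axis=1)            # raw dot products
scaled = scores / np.sqrt(d_k)          # scaled as in attention

print(f"raw variance:    {scores.var():.1f}")   # close to d_k = 64
print(f"scaled variance: {scaled.var():.2f}")   # close to 1
```

Section A.1's discussion of matrix products and Section A.2's treatment of variance make the argument precise; Chapter 4 applies it to the full attention formula.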

This appendix is most useful for readers who studied these subjects previously but need a refresher targeted to LLMs, and for practitioners who want to understand the "why" behind formulas they already use. If these topics feel entirely new, supplement with a linear algebra or probability textbook before proceeding.

The mathematical foundations here underpin everything in Chapter 4 (Transformer Architecture), which is the primary destination for applying this material. Optimization and gradient concepts connect directly to Chapter 0 (ML and PyTorch Foundations). Information-theoretic concepts such as cross-entropy and KL divergence reappear in Chapter 6 (Pretraining and Scaling Laws) and throughout evaluation in Chapter 29 (Evaluation).

Prerequisites

No specific chapter prerequisites are required for this appendix; it is designed to be read before the main text. Comfort with basic algebra and function notation is assumed. If you have taken any calculus or introductory statistics course, the material here will be review. If not, treat the examples as the primary learning vehicle and do not be discouraged by notation that looks unfamiliar at first.

When to Use This Appendix

Read this appendix at the start of the course if your mathematics background is rusty, or use it as a lookup reference when a main chapter uses notation or a concept that needs clarification. In particular, return here when Chapter 4's attention mechanism formulas feel opaque, when loss functions in Chapter 6 need grounding, or when PEFT methods in Chapter 15 involve matrix decompositions. If you are already comfortable with all four domains, a quick skim of Section A.5 (Connecting the Pieces) may be sufficient.

Sections