Building Conversational AI with LLMs and Agents
Appendix A: Mathematical Foundations

A.5 Connecting the Pieces

Every concept in this appendix appears in the transformer architecture and its training. Here is how they fit together in a single forward pass and weight update:

  1. Embedding lookup converts token IDs to vectors (linear algebra).
  2. Self-attention computes scaled dot products between query and key vectors, applies softmax to get attention weights (probability), and takes a weighted sum of value vectors (linear algebra).
  3. Feed-forward layers apply matrix multiplications followed by activation functions (linear algebra, calculus).
  4. Output projection and softmax produce a probability distribution over the vocabulary (probability).
  5. Cross-entropy loss compares the predicted distribution to the true next token (information theory).
  6. Backpropagation computes gradients of the loss with respect to every weight (calculus, chain rule).
  7. Gradient descent updates the weights to reduce the loss (calculus, optimization).
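The seven steps above can be sketched numerically. The following is a minimal toy model in NumPy, not a real transformer: the vocabulary size, model width, token IDs, and learning rate are all made up for illustration, and the backward pass is written out only for the output projection to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab, d = 10, 8               # toy vocabulary size and model width
tokens = np.array([3, 1, 4])   # token IDs for one short input sequence
target = 7                     # ID of the true next token

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# 1. Embedding lookup: each token ID selects a row of the embedding matrix
E = rng.normal(scale=0.1, size=(vocab, d))
x = E[tokens]                                # shape (3, d)

# 2. Self-attention: scaled dot products, softmax weights, weighted sum of values
Wq, Wk, Wv = (rng.normal(scale=0.1, size=(d, d)) for _ in range(3))
Q, K, V = x @ Wq, x @ Wk, x @ Wv
weights = softmax(Q @ K.T / np.sqrt(d))      # (3, 3) attention weights
attn = weights @ V                           # (3, d)

# 3. Feed-forward: matrix multiplications with a ReLU activation between them
W1 = rng.normal(scale=0.1, size=(d, 4 * d))
W2 = rng.normal(scale=0.1, size=(4 * d, d))
h = np.maximum(attn @ W1, 0.0) @ W2

# 4. Output projection and softmax give a distribution over the vocabulary
Wout = rng.normal(scale=0.1, size=(d, vocab))
p = softmax(h[-1] @ Wout)                    # predict from the last position

# 5. Cross-entropy loss against the true next token
loss = -np.log(p[target])

# 6. Backprop, written out for Wout only: dL/dlogits = p - one_hot(target)
dlogits = p.copy()
dlogits[target] -= 1.0
dWout = np.outer(h[-1], dlogits)

# 7. One gradient-descent step on Wout
Wout -= 0.1 * dWout

# After the update, the model assigns the target a higher probability
p_new = softmax(h[-1] @ Wout)
```

In a real framework, step 6 would be a single `loss.backward()` call that differentiates through every weight, not just the output projection; the point here is only to make each stage of the pipeline concrete.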

Big Picture

You do not need to compute these operations by hand. PyTorch and similar frameworks handle the gradient calculations automatically. But understanding what happens under the hood gives you the ability to diagnose problems (why is my loss NaN?), interpret research papers, and make informed decisions about architecture and hyperparameter choices. The mathematics here is the shared vocabulary of the field.
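As one example of the kind of problem this vocabulary helps diagnose, a NaN loss is often caused by numerical overflow in a naively computed softmax. A minimal sketch (the logit values here are invented for illustration):

```python
import numpy as np

logits = np.array([1000.0, 1000.5])   # very large logits, e.g. from exploding activations

# Naive softmax: exp(1000) overflows to inf, and inf / inf is NaN
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(logits) / np.exp(logits).sum()
print(naive)                          # [nan nan]

# Stable softmax subtracts the max first; mathematically the result is identical
shifted = logits - logits.max()
stable = np.exp(shifted) / np.exp(shifted).sum()
print(stable)                         # a valid distribution summing to 1
```

Library implementations of softmax and cross-entropy apply this max-subtraction trick internally, which is one reason to prefer built-in loss functions over hand-rolled ones.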

What Comes Next

Continue to Appendix B: Machine Learning Essentials, the next reference appendix in this collection.

References and Further Reading
Textbooks

Strang, G. (2016). Introduction to Linear Algebra, 5th Edition. Wellesley-Cambridge Press.

The standard reference for linear algebra with excellent geometric intuition.

Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press.

Chapters 2 through 4 provide a thorough treatment of linear algebra, probability, and numerical computation for deep learning.

Online Resources

3Blue1Brown (2016). Essence of Linear Algebra. YouTube series.

Outstanding visual explanations of vectors, matrices, eigenvalues, and transformations.

Papers

Shannon, C. E. (1948). "A Mathematical Theory of Communication." Bell System Technical Journal, 27(3), 379-423.

The founding paper of information theory. Still remarkably readable.
