Section A.1: Linear Algebra Essentials

Linear algebra provides the data structures and operations at the heart of every neural network. Tokens become vectors, layers become matrices, and the entire forward pass of a transformer is a sequence of linear transformations interleaved with nonlinearities.

**Figure A.1.1**: The dot product as projection. When two vectors point in similar directions, the cosine of the angle between them is close to 1 and their dot product is large. This is exactly how attention decides "how much should query Q listen to key K" inside every transformer layer.

Vectors and Vector Spaces

A vector is an ordered list of numbers. In LLM contexts, a word embedding is a vector of, say, 768 dimensions. Each dimension captures some learned feature of the word's meaning, though individual dimensions are rarely interpretable on their own. Code Fragment A.1.1 below puts this into practice.

# NumPy computation
import numpy as np

# A simple 4-dimensional embedding vector
word_vector = np.array([0.23, -0.87, 1.42, 0.05])

# Vector addition: combining two representations
context_vector = np.array([0.10, 0.30, -0.50, 0.80])
combined = word_vector + context_vector  # element-wise addition

Code Fragment A.1.1: A vector is an ordered list of numbers.

import numpy as np
# Matrix multiplication in NumPy
X = np.random.randn(4, 768) # batch of 4 tokens, each 768-dim
W = np.random.randn(768, 3072) # weight matrix
b = np.random.randn(3072) # bias vector
Y = X @ W + b # shape: (4, 3072)
# The @ operator is Python's matrix multiplication

Code Fragment A.1.2: Creating and combining embedding vectors with NumPy. Element-wise addition merges two representations into a single vector, a pattern used throughout attention and residual connections.

Key properties of vectors you will encounter repeatedly:

Magnitude (norm): The length of a vector. The L2 norm is $||v|| = sqrt(v_1^2 + v_2^2 + ... + v_n^2)$. Normalizing vectors to unit length is common before computing cosine similarity.
Dot product: $a \cdot b = a_1*b_1 + a_2*b_2 + ... + a_n*b_n$. This operation is the backbone of attention scores in transformers. Two vectors with a high dot product point in similar directions.
Cosine similarity: $cos( \theta ) = (a \cdot b) / (||a|| * ||b||)$. By dividing out the magnitudes, cosine similarity measures direction alignment regardless of vector length. This is how embedding search systems compare query vectors to document vectors.

Key Insight: Why Dot Products Matter in Attention

The attention mechanism in transformers computes $score(Q, K) = Q \cdot K^T / sqrt(d_k)$. This is just a scaled dot product. When a query vector and a key vector point in similar directions, their dot product is large, which means the model "pays more attention" to that key's corresponding value. The entire magic of attention reduces to geometry.

Matrices and Matrix Multiplication

A matrix is a 2D grid of numbers. In neural networks, weight matrices transform one vector space into another. If your input is a 768-dimensional vector and you multiply it by a 768×3072 matrix, you get a 3072-dimensional vector. This is exactly what happens in the feed-forward layers of a transformer.

Y = X \cdot W + b

This formula describes a linear layer: input $X$ (a batch of vectors) is multiplied by weight matrix $W$, then bias vector $b$ is added. Nearly every layer in a neural network begins with this operation. Code Fragment A.1.2 below puts this into practice.

# Linear layer in NumPy: implements Y = X @ W + b
# Projects a batch of 4 token vectors from 768-dim to 3072-dim output
# This is the core computation inside every transformer feed-forward block
import numpy as np

np.random.seed(42)
X = np.random.randn(4, 768)    # batch of 4 token vectors, each 768-dim
W = np.random.randn(768, 3072) # weight matrix expanding to 3072 dimensions
b = np.random.randn(3072)      # bias term added to every output row

Y = X @ W + b  # matrix multiply then broadcast-add bias
print(f"Input  X: {X.shape}")   # (4, 768)
print(f"Weight W: {W.shape}")   # (768, 3072)
print(f"Bias   b: {b.shape}")   # (3072,)
print(f"Output Y: {Y.shape}")   # (4, 3072)

Output: Input X: (4, 768) Weight W: (768, 3072) Bias b: (3072,) Output Y: (4, 3072)

Code Fragment A.1.3: A linear layer in NumPy: multiplying a batch of 4 token vectors (768-dim) by a weight matrix to produce 3072-dim outputs, then adding a bias. This is the core operation inside every feed-forward block.

Transpose and Symmetry

The transpose of a matrix flips rows and columns: if $A$ has shape (m, n), then $A^T$ has shape (n, m). In attention, we compute $Q \cdot K^T$ because Q has shape (seq_len, d_k) and K has the same shape; transposing K to (d_k, seq_len) makes the multiplication produce a (seq_len, seq_len) attention matrix.

Eigenvalues and Eigenvectors

An eigenvector of a matrix $A$ is a vector $v$ such that multiplying by the matrix merely scales it:

A \cdot v = \lambda \cdot v

The scalar $\lambda$ is the eigenvalue. While you will not compute eigenvalues during routine LLM work, they appear in several important contexts: understanding the stability of training (exploding or vanishing gradients), principal component analysis for visualizing embedding spaces, and analyzing the rank of weight matrices (relevant for LoRA and low-rank approximations in Chapter 18).

Note: Practical Connection: Low-Rank Factorization

LoRA (Low-Rank Adaptation) freezes the original weight matrix and adds a small trainable update: $W' = W + B \cdot A$, where $B$ is (d, r) and $A$ is (r, d) with rank $r << d$. This works because weight updates during fine-tuning tend to live in a low-dimensional subspace, a fact rooted in the eigenstructure of the update matrices.

Further Reading

Foundational Textbooks

Strang, G. (2016). Introduction to Linear Algebra (5th ed.). Wellesley-Cambridge Press. The standard undergraduate reference; chapters on eigenvalues, SVD, and projection underpin every embedding and attention computation in this book.

Trefethen, L. N., & Bau, D. (1997). Numerical Linear Algebra. SIAM. people.maths.ox.ac.uk/trefethen/text.html. The reference for numerically stable algorithms; relevant to mixed-precision training and inference quantization.

Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.). Johns Hopkins University Press. Encyclopedic reference for matrix algorithms; the source for FlashAttention-style tiling analyses.

Modern Treatments for ML

Deisenroth, M. P., Faisal, A. A., & Ong, C. S. (2020). Mathematics for Machine Learning. Cambridge University Press. mml-book.github.io. Free online textbook; covers the linear algebra and probability prerequisites for modern deep learning.

Petersen, K. B., & Pedersen, M. S. (2012). "The Matrix Cookbook." Matrix Cookbook (PDF). Concise reference of matrix derivative identities used throughout backpropagation derivations.

Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Chapter 2: Linear Algebra. deeplearningbook.org. Linear-algebra primer tuned specifically to deep learning notation and conventions.