Section A.2: Probability and Statistics

Language models are, at their core, probability machines. A transformer predicts the next token by outputting a probability distribution over the entire vocabulary. Understanding probability is therefore not optional; it is the lens through which every LLM output must be interpreted.

Logits to probabilities via softmax — **Figure A.2.1**: Softmax turns raw logits (which may be negative or unbounded) into a proper probability distribution over the vocabulary. Notice how the largest logit (2.0 for "the") receives 0.45 of the probability mass, while the negative logit ("of" at −1.0) is squashed near zero. The output is what the LLM samples from at every generation step.

Probability Distributions

A probability distribution assigns a probability to every possible outcome such that all probabilities sum to 1. For a language model with a vocabulary of 50,000 tokens, the output at each step is a distribution over 50,000 possibilities. Code Fragment A.2.1 below puts this into practice.

# PyTorch implementation
import torch
import torch.nn.functional as F

# Raw model outputs (logits) for a vocabulary of 5 tokens
logits = torch.tensor([2.0, 1.0, 0.5, -1.0, 0.1])

# Convert to probabilities using softmax
probs = F.softmax(logits, dim=-1)
# tensor([0.4466, 0.1642, 0.0996, 0.0222, 0.0667])
# All probabilities sum to 1.0

Code Fragment A.2.1: Converting raw logits to a probability distribution with PyTorch's softmax. The output sums to 1.0, with higher logits receiving proportionally larger probabilities.

The layer normalization function is the bridge between raw model scores (logits) and probabilities:

\operatorname{softmax}(z_i) = \exp(z_i) / \sum _j \exp(z_j)

Conditional Probability and Bayes' Theorem

Conditional probability is the probability of an event given that another event has occurred: $P(A | B) = P(A \cap B) / P(B)$. Language modeling is fundamentally about conditional probability: what is the probability of the next token given all previous tokens?

P(\text{token}_t | \text{token}_1, \text{token}_2, ..., \text{token}_{\text{t-1}})

Bayes' theorem lets us reverse the direction of conditioning:

P(A | B) = P(B | A) \cdot P(A) / P(B)

This appears in retrieval-augmented generation (RAG), where we want to find documents relevant to a query, and in classification tasks where we update beliefs about a label given observed features.

Common Distributions

Table A.2.1: Common Distributions Comparison (as of 2026).

Distribution	Type	LLM Relevance
Categorical	Discrete	The output of softmax; used to sample the next token
Gaussian (Normal)	Continuous	Weight initialization, noise in diffusion models, VAE latent spaces
Uniform	Both	Random sampling baselines, certain initialization schemes
Bernoulli	Discrete	Dropout masks (each neuron kept with probability p)

The Multivariate Gaussian

The single-variable normal distribution generalises in a straightforward way to a vector-valued random variable. The standard construction starts from a bag of independent standard normals $W_1, \dots, W_m \sim \mathcal{N}(0, 1)$, picks a mean vector $\mu \in \mathbb{R}^n$ and a matrix $A \in \mathbb{R}^{n \times m}$, and assembles the affine combination:

X = AW + \mu, \qquad X \sim \mathcal{N}(\mu, \Sigma), \quad \Sigma = AA^T \in \mathbb{R}^{n \times n}.

The covariance matrix $\Sigma = AA^T$ is automatically symmetric and positive semi-definite by construction, which is exactly the regularity needed for it to act as a covariance. When $\Sigma$ is invertible, the density of $X$ takes the canonical closed form:

\mathcal{N}(x; \mu, \Sigma) = (2\pi)^{-n/2} \, |\Sigma|^{-1/2} \exp\!\left(-\tfrac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)\right).

This is the density behind nearly every "Gaussian" appearing later in the book: the prior over the latent code in a VAE, the noise injected at each step of a diffusion model ($x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$), the analytic prior on transformer weights used in initialisation arguments, and the kernel-form posterior of a Gaussian-process retriever.

Joint Gaussian Conditioning

The single most useful property of multivariate Gaussians is that conditioning preserves Gaussianity in closed form. Stack two jointly Gaussian vectors $X \in \mathbb{R}^n$ and $Y \in \mathbb{R}^m$ into a block-partitioned distribution:

\begin{bmatrix} X \\ Y \end{bmatrix} \sim \mathcal{N}\!\left( \begin{bmatrix} \mu_X \\ \mu_Y \end{bmatrix}, \begin{bmatrix} \Sigma_X & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_Y \end{bmatrix} \right).

Then the conditional distribution of $X$ given an observation $Y = y$ is again multivariate Gaussian, with a mean shift linear in the observation and a reduced covariance given by the Schur complement:

X \mid Y = y \;\sim\; \mathcal{N}\!\left( \, \underbrace{\mu_X + \Sigma_{XY} \Sigma_Y^{-1} (y - \mu_Y)}_{\text{posterior mean}}, \;\; \underbrace{\Sigma_X - \Sigma_{XY} \Sigma_Y^{-1} \Sigma_{YX}}_{\text{posterior covariance}} \, \right).

Two things deserve emphasis. First, the posterior mean is a linear regression of $X$ on $Y$ with coefficient matrix $\Sigma_{XY} \Sigma_Y^{-1}$, which is exactly the linear minimum mean-square-error (MMSE) estimator. Second, the posterior covariance is strictly smaller (in the Loewner sense) than the prior covariance whenever $\Sigma_{XY} \neq 0$, so observing a correlated variable always tightens the belief about $X$ by the Schur-complement amount, independent of the actual value of $y$.

Expected Value and Variance

The expected value (mean) of a distribution tells you the average outcome: $E[X] = \sum x_i \cdot P(x_i)$. The variance measures spread: $Var(X) = E[(X - E[X])^2]$. Section 4.1 in transformers works by subtracting the mean and dividing by the standard deviation (the square root of variance), ensuring that activations stay in a well-behaved range throughout the network.

Practical Example: Temperature Sampling

When generating text, the temperature parameter reshapes the probability distribution. Given logits $z$, we compute $\operatorname{softmax}(z / T)$. A temperature of 1.0 is the default distribution. Temperatures below 1.0 make the distribution sharper (more confident), while temperatures above 1.0 flatten it (more random). At $T \rightarrow 0$, the model always picks the highest-probability token (greedy decoding). At $T \rightarrow \infty$, all tokens become equally likely.

Further Reading

Foundational Textbooks

Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. Springer. A graduate-level survey of probability and statistics that is small enough to actually finish; matches the level of treatment used in this appendix.

Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. Chapter 1 and 2 cover the probability theory used in modern ML; the canonical reference for Bayesian formulations of classification and regression.

Murphy, K. P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press. probml.github.io/pml-book. Up-to-date free textbook; probability chapters connect directly to language-model likelihoods and information-theoretic objectives.

Modern Treatments for ML

Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer. hastie.su.domains/ElemStatLearn. The standard graduate-level statistics-for-ML reference; covers the bias-variance tradeoff in detail.

Mohri, M., Rostamizadeh, A., & Talwalkar, A. (2018). Foundations of Machine Learning (2nd ed.). MIT Press. The reference for PAC learning bounds and concentration inequalities used in modern scaling-law derivations.