Language models are, at their core, probability machines. A transformer predicts the next token by outputting a probability distribution over the entire vocabulary. Understanding probability is therefore not optional; it is the lens through which every LLM output must be interpreted.
Probability Distributions
A probability distribution assigns a probability to every possible outcome such that all probabilities sum to 1. For a language model with a vocabulary of 50,000 tokens, the output at each step is a distribution over 50,000 possibilities. Code Fragment A.2.1 below puts this into practice.
```python
# PyTorch implementation
# Key operation: softmax normalization
import torch
import torch.nn.functional as F

# Raw model outputs (logits) for a vocabulary of 5 tokens
logits = torch.tensor([2.0, 1.0, 0.5, -1.0, 0.1])

# Convert to probabilities using softmax
probs = F.softmax(logits, dim=-1)
# tensor([0.5586, 0.2055, 0.1246, 0.0278, 0.0835])

# All probabilities sum to 1.0
print(probs.sum())  # tensor(1.0000)
```
The softmax function is the bridge between raw model scores (logits) and probabilities:

$$\operatorname{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$$

Exponentiation makes every score positive, and dividing by the sum guarantees the results form a valid distribution.
Conditional Probability and Bayes' Theorem
Conditional probability is the probability of an event given that another event has occurred: $P(A | B) = P(A \cap B) / P(B)$. Language modeling is fundamentally about conditional probability: what is the probability of the next token given all previous tokens?
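The chain rule of probability makes this concrete: the probability of a whole sequence is the product of each token's conditional probability given the tokens before it. The sketch below uses made-up per-step logits (a real model would produce them from the context); the token IDs and values are purely illustrative.

```python
import torch
import torch.nn.functional as F

# Hypothetical per-step logits over a 5-token vocabulary.
# Row i plays the role of the model's output distribution at step i,
# already conditioned on the tokens chosen at earlier steps.
step_logits = torch.tensor([
    [2.0, 1.0, 0.5, -1.0, 0.1],   # P(t1)
    [0.3, 2.5, 0.0, 0.2, -0.5],   # P(t2 | t1)
    [1.0, 0.0, 3.0, 0.5, 0.5],    # P(t3 | t1, t2)
])
sequence = [0, 1, 2]  # the token chosen at each step

# Chain rule: P(t1, t2, t3) = P(t1) * P(t2 | t1) * P(t3 | t1, t2).
# Summing log-probabilities avoids underflow for long sequences.
log_probs = F.log_softmax(step_logits, dim=-1)
seq_log_prob = sum(log_probs[i, t] for i, t in enumerate(sequence))
seq_prob = seq_log_prob.exp()
```

Working in log space is the standard trick here: for a realistic sequence length, multiplying hundreds of probabilities below 1 would underflow to zero.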
Bayes' theorem lets us reverse the direction of conditioning:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
This appears in retrieval-augmented generation (RAG), where we want to find documents relevant to a query, and in classification tasks where we update beliefs about a label given observed features.
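A small numeric sketch of that belief update, with entirely hypothetical numbers for a document-relevance setting:

```python
# Hypothetical numbers: the prior P(relevant) and the two likelihoods
# P(match | relevant) and P(match | not relevant) are invented for illustration.
p_relevant = 0.05          # prior: 5% of documents are relevant
p_match_given_rel = 0.9    # a relevant doc matches the query terms 90% of the time
p_match_given_not = 0.1    # an irrelevant doc still matches 10% of the time

# Law of total probability: P(B) = P(B|A) P(A) + P(B|not A) P(not A)
p_match = p_match_given_rel * p_relevant + p_match_given_not * (1 - p_relevant)

# Bayes' theorem: P(relevant | match) = P(match | relevant) P(relevant) / P(match)
p_rel_given_match = p_match_given_rel * p_relevant / p_match
# ≈ 0.321: observing a match raises the belief from 5% to about 32%
```

Note how the weak prior keeps the posterior well below the 90% likelihood; this interplay between prior and evidence is exactly what Bayes' theorem formalizes.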
Common Distributions
| Distribution | Type | LLM Relevance |
|---|---|---|
| Categorical | Discrete | The output of softmax; used to sample the next token |
| Gaussian (Normal) | Continuous | Weight initialization, noise in diffusion models, VAE latent spaces |
| Uniform | Both | Random sampling baselines, certain initialization schemes |
| Bernoulli | Discrete | Dropout masks (each neuron kept with probability p) |
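Each of these distributions is directly available in PyTorch. The snippet below is a minimal sketch of the three uses named in the table; the shapes and the 0.02 standard deviation are illustrative choices, not prescribed values.

```python
import torch

torch.manual_seed(0)  # for reproducible sampling

# Categorical: sample a next token from a softmax distribution
probs = torch.tensor([0.5586, 0.2055, 0.1246, 0.0278, 0.0835])
token = torch.distributions.Categorical(probs=probs).sample()

# Bernoulli: a dropout-style mask keeps each neuron with probability p
p_keep = 0.8
mask = torch.bernoulli(torch.full((10,), p_keep))

# Gaussian: initialize a weight matrix from N(0, 0.02^2)
weights = torch.randn(4, 4) * 0.02
```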
Expected Value and Variance
The expected value (mean) of a distribution tells you the average outcome: $E[X] = \sum_i x_i \cdot P(x_i)$. The variance measures spread: $\operatorname{Var}(X) = E[(X - E[X])^2]$. Layer normalization in transformers works by subtracting the mean and dividing by the standard deviation (the square root of variance), ensuring that activations stay in a well-behaved range throughout the network.
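That normalization step can be sketched in a few lines. This is a minimal version without the learned scale and shift parameters that a full layer norm adds:

```python
import torch
import torch.nn.functional as F

# One activation vector; layer norm operates over the last dimension.
x = torch.tensor([[1.0, 2.0, 3.0, 4.0]])

mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
eps = 1e-5  # guards against division by zero
normed = (x - mean) / torch.sqrt(var + eps)

# After normalization, each row has mean ~0 and variance ~1,
# matching F.layer_norm(x, (4,)) with no affine parameters.
```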
When generating text, the temperature parameter reshapes the probability distribution. Given logits $z$, we compute $\operatorname{softmax}(z / T)$. A temperature of 1.0 leaves the distribution unchanged. Temperatures below 1.0 make the distribution sharper (more confident), while temperatures above 1.0 flatten it (more random). As $T \rightarrow 0$, the model always picks the highest-probability token (greedy decoding); as $T \rightarrow \infty$, all tokens become equally likely.
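The effect is easy to observe directly. Reusing the five-token logits from earlier, the probability of the top token shrinks as the temperature rises:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.5, -1.0, 0.1])

for T in (0.5, 1.0, 2.0):
    probs = F.softmax(logits / T, dim=-1)
    # Lower T concentrates mass on the top token; higher T spreads it out.
    print(f"T={T}: top-token probability = {probs.max():.3f}")
```

At $T = 0.5$ the top token takes roughly 83% of the mass, at $T = 1.0$ about 56%, and at $T = 2.0$ only about 37%; with five tokens, the flat limit is 20% each.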