Language models are, at their core, probability machines. A transformer predicts the next token by outputting a probability distribution over the entire vocabulary. Understanding probability is therefore not optional; it is the lens through which every LLM output must be interpreted.
Probability Distributions
A probability distribution assigns a probability to every possible outcome such that all probabilities sum to 1. For a language model with a vocabulary of 50,000 tokens, the output at each step is a distribution over 50,000 possibilities. Code Fragment A.2.1 below puts this into practice.
```python
# PyTorch implementation
# Key operation: softmax normalization
import torch
import torch.nn.functional as F

# Raw model outputs (logits) for a vocabulary of 5 tokens
logits = torch.tensor([2.0, 1.0, 0.5, -1.0, 0.1])

# Convert to probabilities using softmax
probs = F.softmax(logits, dim=-1)
# tensor([0.5586, 0.2055, 0.1246, 0.0278, 0.0835])

# All probabilities sum to 1.0
print(probs.sum())  # tensor(1.0000)
```
The softmax function is the bridge between raw model scores (logits) and probabilities:

$$\operatorname{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$$

Exponentiation makes every score positive, and dividing by the sum guarantees the results form a valid distribution.
Conditional Probability and Bayes' Theorem
Conditional probability is the probability of an event given that another event has occurred: $P(A | B) = P(A \cap B) / P(B)$. Language modeling is fundamentally about conditional probability: what is the probability of the next token given all previous tokens?
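The chain rule of probability makes this concrete: the probability of a whole sequence is the product of each token's conditional probability given the tokens before it. The sketch below uses made-up per-step logits (a real model would produce them from the context); the token IDs and values are purely illustrative.

```python
import torch
import torch.nn.functional as F

# Hypothetical per-step logits over a 5-token vocabulary.
# Row i plays the role of the model's output distribution at step i,
# already conditioned on the tokens chosen at earlier steps.
step_logits = torch.tensor([
    [2.0, 1.0, 0.5, -1.0, 0.1],   # P(t1)
    [0.3, 2.5, 0.0, 0.2, -0.5],   # P(t2 | t1)
    [1.0, 0.0, 3.0, 0.5, 0.5],    # P(t3 | t1, t2)
])
sequence = [0, 1, 2]  # the token chosen at each step

# Chain rule: P(t1, t2, t3) = P(t1) * P(t2 | t1) * P(t3 | t1, t2).
# Summing log-probabilities avoids underflow for long sequences.
log_probs = F.log_softmax(step_logits, dim=-1)
seq_log_prob = sum(log_probs[i, t] for i, t in enumerate(sequence))
seq_prob = seq_log_prob.exp()
```

Working in log space is the standard trick here: for a realistic sequence length, multiplying hundreds of probabilities below 1 would underflow to zero.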
Bayes' theorem lets us reverse the direction of conditioning:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
This appears in retrieval-augmented generation (RAG), where we want to find documents relevant to a query, and in classification tasks where we update beliefs about a label given observed features.
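A small numeric sketch of that belief update, with entirely hypothetical numbers for a document-relevance setting:

```python
# Hypothetical numbers: the prior P(relevant) and the two likelihoods
# P(match | relevant) and P(match | not relevant) are invented for illustration.
p_relevant = 0.05          # prior: 5% of documents are relevant
p_match_given_rel = 0.9    # a relevant doc matches the query terms 90% of the time
p_match_given_not = 0.1    # an irrelevant doc still matches 10% of the time

# Law of total probability: P(B) = P(B|A) P(A) + P(B|not A) P(not A)
p_match = p_match_given_rel * p_relevant + p_match_given_not * (1 - p_relevant)

# Bayes' theorem: P(relevant | match) = P(match | relevant) P(relevant) / P(match)
p_rel_given_match = p_match_given_rel * p_relevant / p_match
# ≈ 0.321: observing a match raises the belief from 5% to about 32%
```

Note how the weak prior keeps the posterior well below the 90% likelihood; this interplay between prior and evidence is exactly what Bayes' theorem formalizes.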
Common Distributions
| Distribution | Type | LLM Relevance |
|---|---|---|
| Categorical | Discrete | The output of softmax; used to sample the next token |
| Gaussian (Normal) | Continuous | Weight initialization, noise in diffusion models, VAE latent spaces |
| Uniform | Both | Random sampling baselines, certain initialization schemes |
| Bernoulli | Discrete | Dropout masks (each neuron kept with probability p) |
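Each of these distributions is directly available in PyTorch. The snippet below is a minimal sketch of the three uses named in the table; the shapes and the 0.02 standard deviation are illustrative choices, not prescribed values.

```python
import torch

torch.manual_seed(0)  # for reproducible sampling

# Categorical: sample a next token from a softmax distribution
probs = torch.tensor([0.5586, 0.2055, 0.1246, 0.0278, 0.0835])
token = torch.distributions.Categorical(probs=probs).sample()

# Bernoulli: a dropout-style mask keeps each neuron with probability p
p_keep = 0.8
mask = torch.bernoulli(torch.full((10,), p_keep))

# Gaussian: initialize a weight matrix from N(0, 0.02^2)
weights = torch.randn(4, 4) * 0.02
```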
Expected Value and Variance
The expected value (mean) of a distribution tells you the average outcome: $E[X] = \sum_i x_i \cdot P(x_i)$. The variance measures spread: $\operatorname{Var}(X) = E[(X - E[X])^2]$. Layer normalization in transformers works by subtracting the mean and dividing by the standard deviation (the square root of variance), ensuring that activations stay in a well-behaved range throughout the network.
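That normalization step can be sketched in a few lines. This is a minimal version without the learned scale and shift parameters that a full layer norm adds:

```python
import torch
import torch.nn.functional as F

# One activation vector; layer norm operates over the last dimension.
x = torch.tensor([[1.0, 2.0, 3.0, 4.0]])

mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
eps = 1e-5  # guards against division by zero
normed = (x - mean) / torch.sqrt(var + eps)

# After normalization, each row has mean ~0 and variance ~1,
# matching F.layer_norm(x, (4,)) with no affine parameters.
```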
When generating text, the temperature parameter reshapes the probability distribution. Given logits $z$, we compute $\operatorname{softmax}(z / T)$. A temperature of 1.0 leaves the distribution unchanged. Temperatures below 1.0 make the distribution sharper (more confident), while temperatures above 1.0 flatten it (more random). As $T \rightarrow 0$, the model always picks the highest-probability token (greedy decoding); as $T \rightarrow \infty$, all tokens become equally likely.
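The effect is easy to observe directly. Reusing the five-token logits from earlier, the probability of the top token shrinks as the temperature rises:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.5, -1.0, 0.1])

for T in (0.5, 1.0, 2.0):
    probs = F.softmax(logits / T, dim=-1)
    # Lower T concentrates mass on the top token; higher T spreads it out.
    print(f"T={T}: top-token probability = {probs.max():.3f}")
```

At $T = 0.5$ the top token takes roughly 83% of the mass, at $T = 1.0$ about 56%, and at $T = 2.0$ only about 37%; with five tokens, the flat limit is 20% each.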