Information theory, pioneered by Claude Shannon in 1948, gives us a mathematical framework for measuring uncertainty and the "surprise" in a message. These concepts are deeply woven into how we train and evaluate language models.
Entropy
Entropy measures the average amount of surprise (or information) in a probability distribution:

$H(P) = -\sum_i p_i \log_2 p_i$
A distribution where one outcome has probability 1.0 has zero entropy (no surprise). A uniform distribution over many outcomes has maximum entropy (maximum uncertainty). When a language model is confident about the next token, the entropy of its output distribution is low. When it is uncertain, entropy is high. Code Fragment A.4.1 below puts this into practice.
# Code Fragment A.4.1: entropy of a discrete distribution
import numpy as np
def entropy(probs):
    """Compute the entropy of a probability distribution, in bits."""
    # Filter out zero probabilities to avoid log(0)
    probs = probs[probs > 0]
    return -np.sum(probs * np.log2(probs))
# Confident distribution: low entropy
confident = np.array([0.9, 0.05, 0.03, 0.02])
print(f"Confident entropy: {entropy(confident):.3f} bits") # ~0.618
# Uncertain distribution: high entropy
uncertain = np.array([0.25, 0.25, 0.25, 0.25])
print(f"Uncertain entropy: {entropy(uncertain):.3f} bits") # 2.0
Cross-Entropy
Cross-entropy measures how well a predicted distribution $Q$ matches a true distribution $P$:

$H(P, Q) = -\sum_i p_i \log_2 q_i$
This is the standard loss function for training language models. The "true distribution" $P$ is the one-hot vector for the actual next token (probability 1 for the correct token, 0 for everything else). The predicted distribution $Q$ is the model's softmax output. Minimizing cross-entropy loss means making the model assign higher probability to the correct next token.
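The reduction described above can be verified directly. The sketch below (with a made-up 4-token vocabulary and illustrative probabilities) shows that cross-entropy against a one-hot target is just the negative log-probability of the correct token:

```python
import numpy as np

def cross_entropy(p, q):
    """Cross-entropy H(P, Q) in bits, with p as the true and q as the predicted distribution."""
    return -np.sum(p * np.log2(q))

# One-hot "true" distribution: the correct next token is index 2
p = np.array([0.0, 0.0, 1.0, 0.0])
# A hypothetical softmax output over the same 4-token vocabulary
q = np.array([0.1, 0.2, 0.6, 0.1])

# With a one-hot target, only the correct token's term survives the sum
print(f"H(P, Q) = {cross_entropy(p, q):.4f} bits")  # ~0.7370
print(f"-log2(q[2]) = {-np.log2(q[2]):.4f} bits")   # same value
```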
Perplexity, the standard metric for language models, is simply $2^{\text{cross-entropy}}$ (or $e^{\text{cross-entropy}}$ if using natural log). A perplexity of 20 means the model is, on average, as uncertain as if it were choosing uniformly among 20 tokens. Lower perplexity means better predictions. When papers report that a model achieves "perplexity 8.5 on WikiText-103," they are describing the exponential of the average cross-entropy loss on that dataset.
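As a quick illustration of the relationship between loss and perplexity (the per-token losses here are made-up numbers, not from any real model):

```python
import numpy as np

# Hypothetical per-token cross-entropy losses, in nats, over a short sequence
token_losses = np.array([2.1, 3.0, 1.5, 2.4])

avg_loss = token_losses.mean()           # average cross-entropy in nats
perplexity = np.exp(avg_loss)            # e^{cross-entropy}, since the losses use natural log
print(f"Average loss: {avg_loss:.3f} nats")
print(f"Perplexity: {perplexity:.2f}")   # ~9.49
```

The model is, on average, about as uncertain as a uniform choice among 9 to 10 tokens.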
KL Divergence
The Kullback-Leibler divergence measures how one probability distribution differs from another:

$D_{KL}(P || Q) = \sum_i p_i \log_2 \frac{p_i}{q_i}$
KL divergence has a critical property: it is always non-negative, and equals zero only when $P = Q$. However, it is not symmetric: $D_{KL}(P || Q) \neq D_{KL}(Q || P)$.
In LLM work, KL divergence appears in several important places:
- Knowledge distillation (Chapter 15): The student model is trained to minimize the KL divergence between its output distribution and the teacher's output distribution.
- RLHF and DPO (Chapter 16): A KL penalty prevents the fine-tuned model from drifting too far from the base model, preserving general capabilities while aligning behavior.
- Variational methods: The evidence lower bound (ELBO) involves KL divergence between approximate and true posterior distributions.
Code Fragment A.4.2 below puts this into practice.
# Code Fragment A.4.2: KL divergence in PyTorch
import torch
import torch.nn.functional as F
# KL divergence between two distributions
p = torch.tensor([0.4, 0.3, 0.2, 0.1]) # "true" distribution
q = torch.tensor([0.25, 0.25, 0.25, 0.25]) # predicted distribution
# PyTorch's kl_div expects log-probabilities for the input (the prediction)
kl_pq = F.kl_div(q.log(), p, reduction='sum')
print(f"KL(P || Q) = {kl_pq.item():.4f}") # ~0.1064 nats
# Swapping the arguments gives a different value: KL divergence is not symmetric
kl_qp = F.kl_div(p.log(), q, reduction='sum')
print(f"KL(Q || P) = {kl_qp.item():.4f}") # ~0.1218 nats
Mutual Information
Mutual information measures how much knowing one variable tells you about another: $I(X; Y) = H(X) - H(X | Y)$. It is symmetric, unlike KL divergence. In NLP, mutual information can quantify how much a word in one position tells you about a word in another position, and it has been used to analyze what transformers learn about linguistic structure.
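A minimal sketch of the computation, using a toy joint distribution over two binary variables (the numbers are illustrative, not drawn from any corpus):

```python
import numpy as np

def mutual_information(joint):
    """I(X; Y) in bits, computed from a joint probability table P(X, Y)."""
    px = joint.sum(axis=1, keepdims=True)   # marginal P(X)
    py = joint.sum(axis=0, keepdims=True)   # marginal P(Y)
    mask = joint > 0                        # skip zero cells to avoid log(0)
    return np.sum(joint[mask] * np.log2(joint[mask] / (px * py)[mask]))

# Correlated variables: knowing X tells you something about Y
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])
print(f"I(X; Y) = {mutual_information(joint):.4f} bits")  # ~0.2781

# Independent variables carry no information about each other
indep = np.outer([0.5, 0.5], [0.5, 0.5])
print(f"I(X; Y) = {mutual_information(indep):.4f} bits")  # 0.0000
```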
Claude Shannon estimated the entropy of English at about 1.0 to 1.5 bits per character by having people guess the next letter in a text. Modern language models, with their vast training data and billions of parameters, achieve cross-entropy rates that approach this fundamental limit. In a sense, LLMs are playing Shannon's guessing game at superhuman scale.