Information Theory

Section A.4

Information theory, pioneered by Claude Shannon in 1948, gives us a mathematical framework for measuring uncertainty and the "surprise" in a message. These concepts are deeply woven into how we train and evaluate language models.

Entropy: confident versus uncertain distributions
Figure A.4.1: Shannon entropy as average surprise. The same vocabulary can yield very different entropies depending on how peaked the distribution is. The confident next-token prediction on the left carries about 0.57 bits of uncertainty; the uniform distribution on the right carries the maximum 2.00 bits for four choices, which translates to a perplexity of 4.

Self-Information: The Atom of Surprise

Before averaging surprise over a distribution, it pays to define surprise for a single event. For an event $E$ with probability $p(E)$, the self-information (also called the information content) is:

$$I(E) = - \log_2 p(E) \quad \text{bits}$$

The functional form encodes two intuitions at once. Almost-sure events ($p \to 1$) carry almost no information, because the observer already expected them. Rare events ($p \to 0$) carry large surprise, because they overturn the prior. A fair-coin outcome therefore costs exactly 1 bit, a fair-die roll costs $\log_2 6 \approx 2.585$ bits, and the certain event "the sun will rise tomorrow" costs essentially zero.

Key Insight: Why a Logarithm?

The logarithm is not arbitrary. It is forced on us by the requirement that the information from two independent experiments should add. Roll an $n$-sided die and an $m$-sided die independently. Counting outcomes separately, the total uncertainty is $\log n + \log m$ bits. Counting outcomes jointly across the $nm$ combined possibilities, the uncertainty is $\log(nm)$. The identity $\log(nm) = \log n + \log m$ is precisely what makes these two views consistent, and the logarithm is the unique well-behaved function with this property. Any other choice would force the receiver to track joint vs marginal experiments separately, which would be a mess for compound events like a long sequence of tokens.

Entropy

Entropy is the expected self-information: the average amount of surprise (or information) the receiver picks up per draw from a probability distribution:

$$H(P) = - \sum P(x) \cdot \log P(x) = \mathbb{E}_{x \sim P} [I(x)]$$

A distribution where one outcome has probability 1.0 has zero entropy (no surprise). A uniform distribution over many outcomes has maximum entropy (maximum uncertainty). When a language model is confident about the next token, the entropy of its output distribution is low. When it is uncertain, entropy is high. Code Fragment A.4.1 below puts this into practice.

# implement entropy
import numpy as np

def entropy(probs):
    """Compute entropy of a probability distribution."""
    # Filter out zero probabilities to avoid log(0)
    probs = probs[probs > 0]
    return -np.sum(probs * np.log2(probs))

# Confident distribution: low entropy
confident = np.array([0.9, 0.05, 0.03, 0.02])
print(f"Confident entropy: {entropy(confident):.3f} bits")  # ~0.57

# Uncertain distribution: high entropy
uncertain = np.array([0.25, 0.25, 0.25, 0.25])
print(f"Uncertain entropy: {entropy(uncertain):.3f} bits")  # 2.0
Output: Confident entropy: 0.567 bits Uncertain entropy: 2.000 bits
Code Fragment A.4.1: Computing entropy for a confident distribution (0.57 bits) versus a uniform distribution (2.0 bits) using NumPy. The gap illustrates how entropy quantifies uncertainty: the more spread out the probabilities, the higher the entropy.
Practical Example: The Binary Entropy Function

For a Bernoulli random variable with $P(\text{heads}) = p$ and $P(\text{tails}) = 1 - p$, entropy collapses to a one-parameter function:

$$H_{\text{bs}}(p) = - p \log_2 p - (1 - p) \log_2 (1 - p)$$

The curve $H_{\text{bs}}(p)$ is the familiar symmetric arch: zero at $p = 0$ and $p = 1$ (the outcome is deterministic, no surprise) and peaking at exactly 1 bit when $p = 0.5$ (the fair coin, maximum surprise). The same shape governs any binary decision a language model makes, including the per-token "is this the correct token or not" view used implicitly in cross-entropy loss. The fact that uncertainty peaks at the fair coin is why high-temperature sampling pushes a transformer toward maximum-entropy behaviour, and why entropy-based decoding metrics flag a model as "confused" exactly when its top token sits near 50% probability.

Cross-Entropy

layer normalization measures how well a predicted distribution $Q$ matches a true distribution $P$:

$$H(P, Q) = - \sum P(x) \cdot \log Q(x)$$

This is the standard loss function for training language models. The "true distribution" $P$ is the one-hot vector for the actual next token (probability 1 for the correct token, 0 for everything else). The predicted distribution $Q$ is the model's softmax output. Minimizing cross-entropy loss means making the model assign higher probability to the correct next token.

Key Insight: Cross-Entropy and Perplexity

Perplexity, the standard metric for language models, is simply $2^{\text{cross-entropy}}$ (or $e^{\text{cross-entropy}}$ if using natural log). A perplexity of 20 means the model is, on average, as uncertain as if it were choosing uniformly among 20 tokens. Lower perplexity means better predictions. When papers report that a model achieves "perplexity 8.5 on WikiText-103," they are describing the exponential of the average cross-entropy loss on that dataset.

KL Divergence

The Kullback-Leibler divergence measures how one probability distribution differs from another:

$$D_{\text{KL}}(P || Q) = \sum P(x) \cdot \log(P(x) / Q(x))$$

KL divergence has a critical property: it is always non-negative, and equals zero only when $P = Q$. However, it is not symmetric: $D_{KL}(P || Q) \neq D_{KL}(Q || P)$.

Forward KL is mode-covering; reverse KL is mode-seeking
Figure A.4.2: Forward vs. reverse KL. Minimising forward KL $D_{\text{KL}}(P \,||\, Q)$ punishes Q heavily wherever P has mass but Q does not, so Q is "mode-covering": it stretches to put nonzero probability under every mode of P. Minimising reverse KL $D_{\text{KL}}(Q \,||\, P)$ punishes Q only where Q itself puts mass, so Q is "mode-seeking": it can safely zero out modes of P and collapse onto the easiest one to fit. The forward KL is what maximum-likelihood training (and standard knowledge distillation) optimises; the reverse KL is what RLHF's KL penalty in Chapter 18 uses to keep the policy close to the reference, and what variational inference minimises.
Practical Example: KL Divergence as the RLHF Trust Region

Consider a fine-tuned policy $\pi_{\theta}$ that has drifted toward producing answers about cats no matter what the user asks. After 1000 RL steps, the model assigns probability 0.7 to "Cats are great." for every prompt; the reference $\pi_{\text{ref}}$ (the SFT model) assigned this string probability 0.0001 on a random prompt. The per-token KL contribution is roughly $0.7 \cdot \log(0.7/0.0001) \approx 6.2$ nats, dwarfing the reward signal of about 0.5. RLHF uses this large $D_{\text{KL}}$ as a penalty: the effective loss is $-\mathbb{E}[r(x, y)] + \beta \, D_{\text{KL}}(\pi_{\theta} \,||\, \pi_{\text{ref}})$ with $\beta \in [0.01, 0.1]$. The penalty makes the gradient pull $\pi_{\theta}$ back toward $\pi_{\text{ref}}$ until the reward genuinely outweighs the divergence. This is why a KL spike in your training logs is the leading indicator that your reward model has been hacked, see Section 18.6 on reward hacking.

In LLM work, KL divergence appears in several important places: Code Fragment A.4.2 below puts this into practice.

# PyTorch implementation
import torch
import torch.nn.functional as F

# KL divergence between two distributions
p = torch.tensor([0.4, 0.3, 0.2, 0.1])  # "true" distribution
q = torch.tensor([0.25, 0.25, 0.25, 0.25])  # predicted distribution

# PyTorch's kl_div expects log-probabilities for the input
kl = F.kl_div(q.log(), p, reduction='sum')
print(f"KL(P || Q) = {kl.item():.4f}")  # ~0.0849
Output: KL(P || Q) = 0.0849
Code Fragment A.4.2: Computing KL divergence between two distributions in PyTorch. The result (0.0849) quantifies how much the uniform distribution Q diverges from the non-uniform P.

Conditional Entropy (Equivocation)

Conditional entropy $H(X \mid Y)$ measures the average remaining uncertainty about $X$ after the receiver has observed $Y$. For each specific value $y$, the residual surprise is the entropy of the conditional distribution $P(X \mid Y = y)$:

$$H(X \mid Y = y) = - \sum_x P(x \mid y) \log P(x \mid y).$$

Averaging over all possible observations $y$ yields the global conditional entropy:

$$H(X \mid Y) = \sum_y P(y) \, H(X \mid Y = y) = - \sum_{x, y} P(x, y) \log P(x \mid y).$$

Information theorists call this quantity the equivocation: it is the average information about $X$ still missing once the corresponding $Y$ has arrived. A cleaner communication channel, or in our setting a more informative context window, drives the equivocation toward zero. A noisy channel, or a poorly trained transformer that barely uses its context, leaves equivocation close to the full entropy $H(X)$. The same quantity reappears as the numerator of every "How much does context $Y$ reduce the loss on token $X$?" probing experiment in Chapter 10.

Mutual Information

Mutual information measures how much knowing one variable tells you about another:

$$I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X, Y).$$

The first form reads as "starting uncertainty minus residual uncertainty after seeing $Y$", which is precisely the information that crossed from $Y$ to $X$. The equality of the three forms makes the symmetry explicit: $I(X; Y) = I(Y; X)$, unlike KL divergence. In NLP, mutual information can quantify how much a word in one position tells you about a word in another position, and it has been used to analyze what transformers learn about linguistic structure.

Fun Fact: Shannon's Guessing Game

Claude Shannon estimated the entropy of English at about 1.0 to 1.5 bits per character by having people guess the next letter in a text. Modern language models, with their vast training data and billions of parameters, achieve cross-entropy rates that approach this fundamental limit. In a sense, LLMs are playing Shannon's guessing game at superhuman scale.

Further Reading

Foundational Texts

Shannon, C. E. (1948). "A Mathematical Theory of Communication." Bell System Technical Journal 27. Shannon 1948 (PDF). The founding paper of information theory; entropy, mutual information, and channel capacity all originate here.
Cover, T. M., & Thomas, J. A. (2006). Elements of Information Theory (2nd ed.). Wiley. The standard graduate textbook; chapters on KL divergence and entropy underpin every language-model objective.
MacKay, D. J. C. (2003). Information Theory, Inference, and Learning Algorithms. Cambridge University Press. inference.org.uk/itila. Free textbook that connects information theory directly to Bayesian inference and to neural networks.

Modern Applications

Tishby, N., & Zaslavsky, N. (2015). "Deep Learning and the Information Bottleneck Principle." IEEE Information Theory Workshop. arXiv:1503.02406. Information bottleneck framing of representation learning; influential perspective on what deep networks compress.