Information theory, pioneered by Claude Shannon in 1948, gives us a mathematical framework for measuring uncertainty and the "surprise" in a message. These concepts are deeply woven into how we train and evaluate language models.
Entropy
Entropy measures the average amount of surprise (or information) in a probability distribution:

$H(P) = -\sum_i p_i \log_2 p_i$
A distribution where one outcome has probability 1.0 has zero entropy (no surprise). A uniform distribution over many outcomes has maximum entropy (maximum uncertainty). When a language model is confident about the next token, the entropy of its output distribution is low. When it is uncertain, entropy is high. Code Fragment A.4.1 below puts this into practice.
# Code Fragment A.4.1: entropy of a discrete distribution
import numpy as np
def entropy(probs):
    """Compute the entropy of a probability distribution, in bits."""
    # Filter out zero probabilities to avoid log(0)
    probs = probs[probs > 0]
    return -np.sum(probs * np.log2(probs))
# Confident distribution: low entropy
confident = np.array([0.9, 0.05, 0.03, 0.02])
print(f"Confident entropy: {entropy(confident):.3f} bits") # ~0.618
# Uncertain distribution: high entropy
uncertain = np.array([0.25, 0.25, 0.25, 0.25])
print(f"Uncertain entropy: {entropy(uncertain):.3f} bits") # 2.0
Cross-Entropy
Cross-entropy measures how well a predicted distribution $Q$ matches a true distribution $P$:

$H(P, Q) = -\sum_i p_i \log_2 q_i$
This is the standard loss function for training language models. The "true distribution" $P$ is the one-hot vector for the actual next token (probability 1 for the correct token, 0 for everything else). The predicted distribution $Q$ is the model's softmax output. Minimizing cross-entropy loss means making the model assign higher probability to the correct next token.
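The reduction described above can be verified directly. The sketch below (with a made-up 4-token vocabulary and illustrative probabilities) shows that cross-entropy against a one-hot target is just the negative log-probability of the correct token:

```python
import numpy as np

def cross_entropy(p, q):
    """Cross-entropy H(P, Q) in bits, with p as the true and q as the predicted distribution."""
    return -np.sum(p * np.log2(q))

# One-hot "true" distribution: the correct next token is index 2
p = np.array([0.0, 0.0, 1.0, 0.0])
# A hypothetical softmax output over the same 4-token vocabulary
q = np.array([0.1, 0.2, 0.6, 0.1])

# With a one-hot target, only the correct token's term survives the sum
print(f"H(P, Q) = {cross_entropy(p, q):.4f} bits")  # ~0.7370
print(f"-log2(q[2]) = {-np.log2(q[2]):.4f} bits")   # same value
```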
Perplexity, the standard metric for language models, is simply $2^{\text{cross-entropy}}$ (or $e^{\text{cross-entropy}}$ if using natural log). A perplexity of 20 means the model is, on average, as uncertain as if it were choosing uniformly among 20 tokens. Lower perplexity means better predictions. When papers report that a model achieves "perplexity 8.5 on WikiText-103," they are describing the exponential of the average cross-entropy loss on that dataset.
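As a quick illustration of the relationship between loss and perplexity (the per-token losses here are made-up numbers, not from any real model):

```python
import numpy as np

# Hypothetical per-token cross-entropy losses, in nats, over a short sequence
token_losses = np.array([2.1, 3.0, 1.5, 2.4])

avg_loss = token_losses.mean()           # average cross-entropy in nats
perplexity = np.exp(avg_loss)            # e^{cross-entropy}, since the losses use natural log
print(f"Average loss: {avg_loss:.3f} nats")
print(f"Perplexity: {perplexity:.2f}")   # ~9.49
```

The model is, on average, about as uncertain as a uniform choice among 9 to 10 tokens.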
KL Divergence
The Kullback-Leibler divergence measures how one probability distribution differs from another:

$D_{KL}(P || Q) = \sum_i p_i \log_2 \frac{p_i}{q_i}$
KL divergence has a critical property: it is always non-negative, and equals zero only when $P = Q$. However, it is not symmetric: $D_{KL}(P || Q) \neq D_{KL}(Q || P)$.
In LLM work, KL divergence appears in several important places:
- Knowledge distillation (Chapter 15): The student model is trained to minimize the KL divergence between its output distribution and the teacher's output distribution.
- RLHF and DPO (Chapter 16): A KL penalty prevents the fine-tuned model from drifting too far from the base model, preserving general capabilities while aligning behavior.
- Variational methods: The evidence lower bound (ELBO) involves KL divergence between approximate and true posterior distributions.
Code Fragment A.4.2 below puts this into practice.
# Code Fragment A.4.2: KL divergence in PyTorch
import torch
import torch.nn.functional as F
# KL divergence between two distributions
p = torch.tensor([0.4, 0.3, 0.2, 0.1]) # "true" distribution
q = torch.tensor([0.25, 0.25, 0.25, 0.25]) # predicted distribution
# PyTorch's kl_div expects log-probabilities for the input (the prediction)
kl_pq = F.kl_div(q.log(), p, reduction='sum')
print(f"KL(P || Q) = {kl_pq.item():.4f}") # ~0.1064 nats
# Swapping the arguments gives a different value: KL divergence is not symmetric
kl_qp = F.kl_div(p.log(), q, reduction='sum')
print(f"KL(Q || P) = {kl_qp.item():.4f}") # ~0.1218 nats
Mutual Information
Mutual information measures how much knowing one variable tells you about another: $I(X; Y) = H(X) - H(X | Y)$. It is symmetric, unlike KL divergence. In NLP, mutual information can quantify how much a word in one position tells you about a word in another position, and it has been used to analyze what transformers learn about linguistic structure.
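A minimal sketch of the computation, using a toy joint distribution over two binary variables (the numbers are illustrative, not drawn from any corpus):

```python
import numpy as np

def mutual_information(joint):
    """I(X; Y) in bits, computed from a joint probability table P(X, Y)."""
    px = joint.sum(axis=1, keepdims=True)   # marginal P(X)
    py = joint.sum(axis=0, keepdims=True)   # marginal P(Y)
    mask = joint > 0                        # skip zero cells to avoid log(0)
    return np.sum(joint[mask] * np.log2(joint[mask] / (px * py)[mask]))

# Correlated variables: knowing X tells you something about Y
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])
print(f"I(X; Y) = {mutual_information(joint):.4f} bits")  # ~0.2781

# Independent variables carry no information about each other
indep = np.outer([0.5, 0.5], [0.5, 0.5])
print(f"I(X; Y) = {mutual_information(indep):.4f} bits")  # 0.0000
```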
Claude Shannon estimated the entropy of English at about 1.0 to 1.5 bits per character by having people guess the next letter in a text. Modern language models, with their vast training data and billions of parameters, achieve cross-entropy rates that approach this fundamental limit. In a sense, LLMs are playing Shannon's guessing game at superhuman scale.