Section 0.5: Reinforcement Learning Foundations

The reward was sparse, the episodes were long, and my discount factor suggested I should have studied supervised learning instead.
Tensor, Reward-Starved AI Agent

Big Picture

Why RL in an LLM book? Because the most powerful technique for making LLMs helpful, harmless, and honest, Reinforcement Learning from Human Feedback (RLHF), is built on reinforcement learning. Before we can understand how ChatGPT was fine-tuned to follow instructions, we need to understand the RL vocabulary and core algorithms that make it possible.

Prerequisites

This section assumes you understand cross-entropy and loss functions from Section 0.1: ML Basics and neural network training from Section 0.2: Deep Learning Essentials. Comfort with probability (expected values, distributions) is helpful. The RL concepts here directly prepare you for understanding RLHF later in the book.

0.5.1 The Reinforcement Learning Framework

A cartoon dog trying different tricks while a friendly trainer rewards the correct behavior with treats, illustrating the agent-environment-reward loop — **Figure 0.5.1**: Reinforcement learning is like training a dog. The dog (agent) tries different actions in the room (environment) and receives treats (rewards) only for the desired behavior, gradually learning which tricks earn the biggest reward.

Supervised learning needs labeled data: input X, correct output Y. But what if there is no single "correct" answer, only better and worse ones? What if the learner must try something, observe the consequence, and improve over time? That is reinforcement learning.

Think of training a dog. You say "sit." The dog tries something (maybe it lies down). You give a treat only when it actually sits. Over many repetitions, the dog learns which action earns the treat. This is exactly the RL loop: an agent (the dog) interacts with an environment (you and the room), takes actions (sitting, lying down), and receives rewards (treats or nothing).

Key Insight: LLM Connection

Replace the dog with an LLM. The LLM (agent) generates a response (a sequence of actions, one per token). A reward model (environment) scores the response. Through RL, the LLM learns which responses earn high scores, just as the dog learns which behaviors earn treats.

Real-World Scenario

RL Vocabulary Confusion Delays an RLHF Project

Who: NLP team (3 engineers) at a legal tech company attempting to fine-tune a 7B parameter LLM to generate better contract summaries using RLHF

Situation: The team had strong NLP backgrounds but no reinforcement learning experience. They followed an open-source RLHF tutorial to implement RLHF-based fine-tuning.

Problem: After two weeks of implementation, the reward signal was not improving. Log analysis revealed they had configured the "state" as only the user prompt (ignoring previously generated tokens) and set the discount factor to 0.0, treating each token decision as independent.

Dilemma: They debated whether to hire an RL specialist (expensive, 3-month search), switch to DPO (simpler but less flexible), or invest one week in learning RL fundamentals before retrying.

Decision: The tech lead mandated a one-week study sprint covering the exact material in this section: states, actions, rewards, the Bellman equation, and policy gradients.

How: After the sprint, they corrected the state definition to include the full prompt plus all generated tokens so far, set gamma to 0.99, and added a KL penalty to prevent the model from drifting too far from the base policy.

Result: Reward model scores improved by 34% over 3,000 training steps. Human evaluators preferred the RLHF-tuned summaries 71% of the time over the base model. The one-week investment saved months of trial-and-error debugging.

Lesson: RL vocabulary (state, action, reward, discount factor) is not optional for LLM practitioners. Misunderstanding these concepts leads to silent configuration errors that no amount of hyperparameter tuning can fix.

Let us define each piece of the RL framework precisely:

Table 0.5.1a: Concept Comparison (as of 2026).

Concept	Definition	LLM Analogy
Agent	The learner and decision-maker	The language model
Environment	Everything the agent interacts with	The reward model + user prompt
State (s)	A snapshot of the current situation	The prompt + all tokens generated so far
Action (a)	A choice the agent makes at each step	Choosing the next token from the vocabulary
Reward (r)	A scalar signal indicating how good the action was	The reward model's score for the full response
Episode	One complete interaction from start to finish	Generating one complete response to a prompt

Agent-environment interaction loop with actions, rewards, and state updates

Figure 0.5.2b: The agent-environment interaction loop. At each step the agent observes a state, takes an action, and receives a reward.

0.5.1.1 Why "Deep" RL? The State-Space Argument

Before neural networks, RL agents stored value estimates in a table: one entry per state-action pair. That works only when the state space is small enough to enumerate. Tic-Tac-Toe has 255,168 reachable states, easily tabulated. Atari frames are 84 by 84 pixels in 256 grayscale levels, giving $256^{7056}$ possible states, larger than the number of atoms in the observable universe. Tabular methods are not just slow here; they are impossible. Deep RL replaces the table with a neural network that generalizes across similar states, the same generalization trick that lets a convolutional network recognize an unseen cat photo. The DeepMind Atari paper (Mnih et al., 2013) was the first to show that this scaling works in practice, and the same insight underwrites RLHF on LLMs, where the "state" is a billion-dimensional token sequence no table could ever hold.

Note: DQN, the First Famous Deep RL Algorithm

DQN (Deep Q-Network) is the canonical example. A CNN reads raw Atari pixels and outputs $Q(s, a)$ for each joystick action. Two stabilization tricks made training work: a replay buffer that samples past transitions to break temporal correlations, and a target network (a slowly updated copy of the Q-network) that prevents the moving-target instability. DQN matched or beat humans on 49 Atari games and kicked off the modern deep-RL era. Modern LLM alignment uses policy-gradient methods (PPO, DPO) rather than value-based methods like DQN, but the same generalization-via-neural-net pattern carries over.

Library Shortcut: Gymnasium (formerly OpenAI Gym)

The community-standard playground for RL is Gymnasium (the maintained fork of OpenAI Gym). It provides a uniform reset() / step(action) API across CartPole, MountainCar, Atari, MuJoCo robotics, and dozens of other environments. Every RL paper and tutorial worth reading uses this API, including the custom SimpleGridWorld in the code fragments below.

Show code

# pip install gymnasium
import gymnasium as gym
env = gym.make("CartPole-v1")           # 4D state, 2 actions, +1 per surviving step
state, info = env.reset(seed=0)
for _ in range(200):
    action = env.action_space.sample()   # random policy
    state, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        break
env.close()

Code Fragment 0.5.0a: The Gymnasium API every RL practitioner sees in the wild. The same loop drives Atari, MountainCar, and the LLM-as-policy code in Chapter 18.

0.5.1.2 Multi-Armed Bandits: The Stateless Warm-Up

Before tackling the full RL problem, consider its simplest cousin: the multi-armed bandit (MAB). Imagine $n$ slot machines, each with an unknown reward distribution. The agent has a budget of $T$ pulls and wants to maximize total winnings. There is no "state" to speak of, only the choice of which arm to pull and the noisy reward that follows.

The agent's only tool is an action-value estimate $Q_k(a)$, the empirical mean reward observed for arm $a$ after $k$ pulls of it:

Q_k(a) = \frac{1}{k} \sum_{i=1}^{k} r_i

Storing every observation is wasteful; the incremental-mean update maintains the same estimate using only the previous mean and the new reward:

Q_{k+1}(a) = Q_k(a) + \frac{1}{k+1} \bigl(r_{k+1} - Q_k(a)\bigr)

This "new estimate equals old estimate plus learning-rate times prediction error" template recurs in every RL update rule that follows, including the TD error and the policy gradient. Recognize the shape now and the rest of the section will feel familiar.

Key Insight: Exploration vs. Exploitation

If the agent always pulls the arm with the highest current $Q$ estimate (pure exploitation), it may miss a better arm whose true mean is high but whose first few samples happened to be low. If it always pulls a random arm (pure exploration), it never cashes in on what it has learned. The explore-exploit trade-off is the central tension in RL, and every algorithm in this section is, at heart, a different answer to it.

Two simple exploration policies cover most cases:

$\epsilon$-greedy: with probability $1 - \epsilon$ pick the current best arm, otherwise pick a uniformly random arm. The lab at the end of this section uses this policy with $\epsilon = 0.1$. Simple, robust, and the default in introductory RL.
Softmax (Boltzmann) selection: sample arm $a$ with probability $\frac{\exp(Q(a) / \tau)}{\sum_{a'} \exp(Q(a') / \tau)}$. The temperature $\tau$ smoothly interpolates between greedy ($\tau \to 0$) and uniform random ($\tau \to \infty$). This is the same softmax formula an LLM uses to sample its next token, with $Q$-values replaced by logits, which is why temperature feels familiar to anyone who has tuned an OpenAI generation call.

0.5.1.3 Contextual Bandits: Adding State

A pure bandit ignores context. But many real problems present a feature vector before each decision: the user profile in an ad-recommendation system, the patient chart before a treatment choice, the prompt before an LLM response. A contextual bandit generalizes MAB so that the action-value depends on the context $x$: the agent learns $Q(x, a)$, typically as a neural network that maps the context to a vector of per-action expected rewards. Crucially, the action does not change the next context; each round is still independent. That single restriction is what separates contextual bandits from the full RL problem and from the Markov decision process introduced next.

0.5.1.4 The Markov Property and Transition Kernels

RL adds the missing piece: the action changes the future. The agent's choice now affects which state arrives next, which affects the rewards available later, which affects every subsequent decision. To make this tractable, RL imposes the Markov property:

Key Insight: The Markov Property in One Line

The future depends on the present, not the past. Formally: $P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots) = P(s_{t+1} \mid s_t, a_t)$. The current state is a complete summary of history for the purpose of predicting what happens next.

The Markov property is a modeling choice, not a physical law. Driving a car is approximately Markov: the current position, velocity, and traffic snapshot tell you what will happen next; the entire route history adds little. Picking stocks is famously not Markov in the price alone: yesterday's price trajectory carries information the current price does not. The cure is usually to enrich the state until Markovness holds, for example by including velocity, momentum, or a window of recent prices. For LLMs, the state must include every previously generated token, because the next-token distribution depends on the full context.

The third core MDP object, after states and actions, is the transition kernel:

P(s' \mid s, a) = \text{probability of landing in } s' \text{ after taking action } a \text{ in state } s

In a deterministic environment, $P$ collapses to a single next-state. In a stochastic one (most real problems, including LLM token generation under sampling), it is a full distribution. The transition kernel together with the reward function $R(s, a)$, the discount $\gamma$, and the state and action spaces, defines a Markov Decision Process (MDP). Everything from this point on, including the Bellman equation, the value functions, and PPO, lives inside this MDP framework.

0.5.2 Policies: The Agent's Strategy

A policy is the agent's strategy: a rule that maps each state to an action. It answers the question, "Given what I see right now, what should I do?"

Policies come in two flavors. A deterministic policy always picks the same action for a given state (if the dog sees a hand signal, it always sits). A stochastic policy assigns probabilities to each possible action, then samples from that distribution.

Key Insight: LLMs Are Stochastic Policies

An LLM is a stochastic policy. Given a state (the prompt plus tokens generated so far), it outputs a probability distribution over the entire vocabulary. The next token is sampled from that distribution. This is why the same prompt can produce different completions: the policy is stochastic by nature.

Formally, a stochastic policy is written as π(a | s), the probability of choosing action a when in state s. For an LLM, this is exactly the Transformer architecture output: the probability the model assigns to each token given the context so far. The entire goal of RL training is to adjust the parameters of π so that high-reward actions become more probable.

0.5.3 Value Functions and the Bellman Equation

Rewards tell us how good a single step was. But we need a way to evaluate long-term prospects. Is this state a good place to be? Is this action a wise choice? Value functions answer these questions.

State-Value Function V(s)

What: V(s) estimates the total future reward the agent expects to accumulate starting from state s and following its current policy onward. Why it matters: It lets the agent judge whether its current situation is promising or dire. How it works: Think of it as a "mood meter." A high V(s) means "things are going well from here"; a low V(s) means "trouble ahead."

Action-Value Function Q(s, a)

What: Q(s, a) estimates the total future reward if the agent takes action a in state s and then follows its policy. Why it matters: It lets the agent compare actions and pick the best one. How it works: If V(s) is the mood meter, Q(s, a) is like asking, "If I take this specific action right now, will things go better or worse than average?"

The Bellman Equation (Intuition)

Tip: The Marshmallow Test for AI

The discount factor gamma is essentially the AI version of the famous "marshmallow test." A high gamma (close to 1) means the agent can delay gratification for a bigger payoff later. A low gamma (close to 0) means the agent grabs whatever reward is available right now. For RLHF, you almost always want a high gamma, because the quality of an LLM response depends on how the entire sequence fits together, not just the next token.

The Bellman equation expresses a simple but powerful recursive idea: the value of a state equals the immediate reward plus the (discounted) value of the next state. In plain English: "How good is it to be here? Well, how much reward do I get right now, plus how good is the place I end up?"

V(s) = E[ r + \gamma \cdot V(s') ]

This recursion is the engine behind nearly all RL algorithms. The discount factor γ (gamma, between 0 and 1) controls how much the agent values future rewards relative to immediate ones. When γ is close to 1, the agent is patient and plans ahead. When γ is close to 0, it is greedy and focuses on immediate rewards.

Warning: Common Misconception

The $E[\,\cdot\,]$ is doing real work here. Readers often skim past it and treat the Bellman equation as a deterministic recursion, but the expectation averages over three sources of randomness: the action the policy samples, the next state the environment transitions to, and any stochastic reward. Skipping the expectation is exactly the kind of state misdefinition that broke the legal-tech team in the case study above; it also leads to the wrong gradient when implementing actor-critic methods.

Note

In LLM training with RLHF, the "value" intuition still applies: we want the model to learn that certain token choices early in a response lead to higher overall reward at the end, even if individual tokens are not scored.

0.5.3.1 The Optimal Policy and Two Roads to It

The formal goal of RL is to find an optimal policy $\pi^*$, defined as a policy that maximizes the expected cumulative discounted reward from every state:

\pi^* = \arg\max_\pi \; \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t r_t \right]

Two algorithmic families pursue $\pi^*$ from opposite directions:

Value-based methods learn $Q(s, a)$ first, then read off the policy as $\pi(s) = \arg\max_a Q(s, a)$. Q-learning, SARSA, and DQN live here. The policy is derived from the value function.
Policy-based methods learn $\pi(a \mid s)$ directly via gradient ascent on expected return, treating $V$ and $Q$ as either unnecessary or as auxiliary critics. REINFORCE, Actor-Critic, PPO, and DPO live here. The value function is at most a helper for variance reduction.

RLHF-style LLM alignment uses policy-based methods exclusively. The reason is practical: an LLM's action space (the full vocabulary, $\geq 30{,}000$ tokens) is far too large to ever compute $\arg\max_a Q(s, a)$ at every step, so directly parameterizing $\pi(a \mid s)$ as the model's softmax-over-logits is the only sane option. Knowing both families exist, however, is the right vocabulary for reading the broader RL literature.

0.5.4 Policy Gradients: Learning by Trial and Feedback

Fun Fact

Reinforcement learning famously taught a computer to play Atari games in 2013, but researchers often omit that the agent also discovered bizarre exploits no human would try, like pausing the game indefinitely to avoid losing. RL agents are less "intelligent learner" and more "chaotic rules lawyer."

Now we arrive at the core question: how does the agent improve its policy? The answer is the policy gradient theorem, and the intuition is remarkably straightforward.

Imagine the dog tries ten different behaviors. Three of them earned treats. The training strategy is simple: make those three behaviors more likely in the future. Behaviors that earned nothing (or a scolding) become less likely. That is the entire idea behind policy gradients.

More precisely: the agent samples actions from its policy, observes the rewards, and then adjusts its parameters (the neural network weights) to increase the probability of actions that led to high rewards and decrease the probability of actions that led to low rewards. The magnitude of each adjustment is proportional to the reward received.

# A 4x4 GridWorld environment: the agent navigates from (0,0) to the goal.
# Each step returns (next_state, reward, done) just like OpenAI Gym.
import numpy as np
class SimpleGridWorld:
    """A 4x4 grid where an agent must reach the goal.
    State: (row, col) position. Actions: 0=up, 1=right, 2=down, 3=left.
    Reward: +1 for reaching the goal, -0.01 per step (to encourage speed).
    LLM analogy: each 'step' is like generating one token, and the final
    reward scores the complete trajectory (the full response)."""
    def __init__(self, size=4):
        self.size = size
        self.goal = (size - 1, size - 1)
        self.reset()
    def reset(self):
        self.pos = (0, 0)
        return self.pos
    def step(self, action):
        r, c = self.pos
        if action == 0: r = max(r - 1, 0) # up
        elif action == 1: c = min(c + 1, self.size - 1) # right
        elif action == 2: r = min(r + 1, self.size - 1) # down
        elif action == 3: c = max(c - 1, 0) # left
        self.pos = (r, c)
        done = (self.pos == self.goal)
        reward = 1.0 if done else -0.01
        return self.pos, reward, done
# Run one episode with a random policy
env = SimpleGridWorld()
state = env.reset()
total_reward = 0
for step in range(100):
    action = np.random.randint(4) # random stochastic policy
    state, reward, done = env.step(action)
    total_reward += reward
    if done:
        break
    print(f"Episode finished in {step + 1} steps, total reward: {total_reward:.2f}")

Output: Episode finished in 27 steps, total reward: 0.74

Code Fragment 0.5.1b: More precisely: the agent samples actions from its policy, observes the rewards.

# REINFORCE algorithm: a policy network outputs action probabilities,
# and the loss weights each log-probability by its discounted return.
import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions import Categorical
class PolicyNetwork(nn.Module):
    """A small neural network that maps states to action probabilities.
    In an LLM, this role is played by the entire transformer: it maps
    the token sequence (state) to a distribution over the next token (action)."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
            nn.Softmax(dim=-1)
            )
    def forward(self, state):
        return self.net(state)
        # Simplified REINFORCE training loop
        policy = PolicyNetwork(state_dim=2, n_actions=4)
        optimizer = optim.Adam(policy.parameters(), lr=1e-3)
    def reinforce_episode(env, policy):
        """Collect one episode and update the policy.
        Core idea: increase probability of actions that led to positive reward,
        decrease probability of actions that led to negative reward."""
        state = env.reset()
        log_probs = []
        rewards = []
        # 1. Roll out an episode using the current policy
        for _ in range(200):
            state_tensor = torch.FloatTensor(state)
            probs = policy(state_tensor)
            dist = Categorical(probs)
            action = dist.sample() # stochastic action selection
            log_probs.append(dist.log_prob(action))
            state, reward, done = env.step(action.item())
            rewards.append(reward)
            if done:
                break
                # 2. Compute discounted returns (what was the total reward from each step?)
                returns = []
                G = 0
                for r in reversed(rewards):
                    G = r + 0.99 * G
                    returns.insert(0, G)
                    returns = torch.FloatTensor(returns)
                    returns = (returns - returns.mean()) / (returns.std() + 1e-8)
                    # 3. Policy gradient: nudge probabilities toward high-reward actions
                    loss = -sum(lp * R for lp, R in zip(log_probs, returns))
                    optimizer.zero_grad()
                    loss.backward()
                    optimizer.step()
                    return sum(rewards)

Code Fragment 0.5.2: A minimal RL environment (Grid World). The SimpleGridWorld class implements a 4x4 grid where an agent navigates toward a goal using discrete actions. This environment mirrors how LLM token generation works: each step chooses one action from a set of possibilities, and the cumulative reward scores the complete trajectory.

Key Insight: Tokens as Actions

In the grid world above, the agent chooses one of four directions at each step. An LLM does something analogous but at a vastly larger scale: at each step, it chooses one token from a vocabulary of 30,000 to 100,000 possibilities. The policy gradient approach works the same way, adjusting the probability distribution over all those tokens.

# PPO sketch: collect trajectories, compute advantage estimates,
# and update the policy with a clipped surrogate objective.
import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions import Categorical
import random
# Simple environment: agent must learn to pick action 2 (out of 0,1,2,3)
# Reward = +1 if action == 2, else -0.1
class SimpleEnv:
    def reset(self):
        return [random.random(), random.random()] # random 2D state
    def step(self, action):
        reward = 1.0 if action == 2 else -0.1
        done = True # single-step episodes for clarity
        return self.reset(), reward, done
    policy = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 4), nn.Softmax(dim=-1))
    optimizer = optim.Adam(policy.parameters(), lr=3e-3)
    env = SimpleEnv()
    for episode in range(1, 1001):
        state = torch.FloatTensor(env.reset())
        probs = policy(state)
        dist = Categorical(probs)
        action = dist.sample()
        log_prob = dist.log_prob(action)
        _, reward, _ = env.step(action.item())
        loss = -log_prob * reward
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if episode % 200 == 0:
            # Evaluate: check how often the policy picks action 2
            correct = sum(1 for _ in range(100)
                if policy(torch.FloatTensor(env.reset())).argmax().item() == 2)
            print(f"Episode {episode:4d} | Last reward: {reward:+.1f} | Action 2 rate: {correct}%")

Code Fragment 0.5.3: REINFORCE policy gradient sketch in PyTorch. The PolicyNetwork maps states to action probabilities, mirroring how an LLM transformer maps token sequences to next-token distributions. The reinforce_episode function collects a trajectory, computes discounted returns, and nudges the policy toward actions that led to higher rewards.

Note: Modify and Observe

Change the target action from 2 to 0. Does the policy learn equally fast?
Reduce the learning rate to 1e-4. How many episodes does it take to reach 90%?
Change the reward for wrong actions from -0.1 to 0.0. What happens and why?

Warning

Vanilla REINFORCE (shown above) works in theory but suffers from high variance: training is noisy and unstable. In practice, researchers use more sophisticated algorithms. The most important one for LLM training is PPO, which we cover next. Before that, the next subsection introduces the missing bridge: Actor-Critic methods and the advantage baseline that PPO inherits.

REINFORCE gradient tape: a rollout produces a chain of states and actions; the policy emits log-probabilities at each step; discounted returns weight every log-probability; the negative sum is the loss whose backward pass updates the policy parameters. — **Figure 0.5.2c**: The REINFORCE gradient tape. Every action sampled during the rollout records its log-probability on the autograd tape. After the episode ends, each log-probability is multiplied by its discounted return, the products are summed and negated to form the loss, and a single `loss.backward()` sends gradients back through the whole policy network in one pass.

Stripped of bookkeeping, the per-step REINFORCE update is just one line:

import torch
from torch.distributions import Categorical

# Minimal REINFORCE step (one sampled action, advantage = discounted return)
logits = policy(state)               # policy is any nn.Module that emits action logits
dist = Categorical(logits=logits)
action = dist.sample()
log_pi = dist.log_prob(action)       # stays on the autograd tape

# After the episode, G_t is the discounted return from time t.
loss = -(log_pi * advantage).mean()   # advantage = G_t (or G_t - baseline)
optimizer.zero_grad(); loss.backward(); optimizer.step()

Code Fragment 0.5.4a: The minimum-viable REINFORCE update. The product log_pi * advantage is the policy-gradient estimator; setting advantage = G_t recovers vanilla REINFORCE, and replacing it with $G_t - V_\phi(s_t)$ from a learned critic is exactly the actor-critic upgrade introduced in Section 0.5.4b.

Note: Why Log-Probabilities, Not Raw Probabilities

Look at the REINFORCE loss: it sums $\log \pi(a_t \mid s_t) \cdot G_t$ rather than $\pi(a_t \mid s_t) \cdot G_t$ directly. Three reasons. First, log space is numerically stable: probabilities of long token sequences shrink to underflow ($10^{-400}$ is common in language modeling), while their logs stay in a friendly $[-1000, 0]$ range. Second, products become sums, so the loss for a trajectory is a clean sum of log-probs instead of a product that PyTorch would have to differentiate through. Third, $\nabla \log \pi$ is exactly the policy-gradient quantity the REINFORCE theorem produces, so optimizing $-\log \pi \cdot G$ corresponds one-to-one with maximizing expected return. The same log-prob trick reappears in PPO, DPO, and every LLM alignment loss.

Warning: Policy Collapse

Stochastic policies are not automatically exploratory forever. Gradient descent can drive the policy toward a near-deterministic distribution, after which exploration stops and the agent gets stuck on whatever local optimum it found first. Common countermeasures include an entropy bonus added to the loss ($-\beta \sum_a \pi(a) \log \pi(a)$, which is maximized when the distribution is uniform), or a KL penalty against a reference policy (the trick RLHF uses to keep the fine-tuned LLM from collapsing onto a single high-reward phrase, covered in Section 0.5.6).

0.5.4b Actor-Critic and the Advantage Baseline

Vanilla REINFORCE weights each action's gradient by the raw return $G_t$. That works, but the weights are very noisy: $G_t$ swings whenever the rest of the trajectory swings, even when the action under examination was perfectly reasonable. The fix is to subtract a baseline that captures "how well things were going on average from state $s$, independent of which action we picked." That baseline is the state-value $V(s)$ from Section 0.5.3, and the resulting quantity is the advantage:

$$A(s, a) = G_t - V(s)$$

The advantage is positive when an action did better than expected from $s$ and negative when it did worse. Weighting the policy gradient by $A$ rather than $G$ leaves the gradient direction unchanged in expectation but slashes its variance, often dramatically. Training becomes faster, more stable, and easier to tune.

Actor-Critic is the architecture that makes this practical. Two networks (or, often, two heads on a shared body) cooperate:

The actor is the policy network $\pi_\theta(a \mid s)$, exactly as in REINFORCE.
The critic is a value network $V_\phi(s)$ trained to predict the expected return from $s$ under the current policy.

The loss becomes the sum of two terms, one per network:

L = -\log \pi_\theta(a \mid s) \cdot \bigl(G_t - V_\phi(s)\bigr) \;+\; \bigl(G_t - V_\phi(s)\bigr)^2

The first term is the advantage-weighted policy gradient; the second is the regression loss that pulls the critic toward the actual return.

Actor-critic architecture: a shared trunk feeds an actor head that outputs the action distribution and a critic head that outputs the scalar state value; the advantage A equals the return G minus the critic's prediction V and weights the actor's policy gradient. — **Figure 0.5.3a**: Actor-critic architecture. A shared trunk feeds two heads: the actor outputs the policy $\pi_\theta(a \mid s)$ and the critic outputs the scalar state value $V_\phi(s)$. The advantage $A = G - V$ is then used in two ways: it scales the actor's policy gradient and supplies the regression target that trains the critic.

The pieces above translate directly into PyTorch. The actor head emits action logits, the critic head emits a scalar value, and a single combined loss drives both via shared backbone gradients.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical

class ActorCritic(nn.Module):
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.actor = nn.Linear(hidden, n_actions)  # logits
        self.critic = nn.Linear(hidden, 1)         # scalar V(s)

    def forward(self, s):
        h = self.trunk(s)
        return Categorical(logits=self.actor(h)), self.critic(h).squeeze(-1)

# One on-policy update step
policy = ActorCritic(state_dim=4, n_actions=2)
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

s, a, G = batch["state"], batch["action"], batch["return"]
dist, V = policy(s)
advantage = (G - V).detach()                   # critic gives the baseline
actor_loss = -(dist.log_prob(a) * advantage).mean()
critic_loss = F.mse_loss(V, G)
loss = actor_loss + 0.5 * critic_loss
opt.zero_grad(); loss.backward(); opt.step()

Code Fragment 0.5.4b: A minimal actor-critic update in PyTorch. The actor is a categorical distribution built from logits and the critic is a one-dimensional value head; the same backbone feeds both. The advantage is detached so its gradient does not flow back through the critic when updating the actor, exactly the same pattern PPO uses inside PPOTrainer.

Key Insight: Why the Value Head Appears in PPO

The four-model PPO setup in Section 18.1.3 (policy, frozen reference, reward model, value head) is exactly an actor-critic. The "value head" is the critic $V_\phi$; PPO uses it to compute the advantage that scales every token-level policy update. Without Section 0.5.4b's advantage baseline, the value head in 18.1.3 looks like an arbitrary fourth network. With it, the value head is inevitable: it is the same critic that makes any policy-gradient method usable on real problems.

0.5.4b.1 One-Step TD and the Bootstrap Trick

Computing $G_t$ requires waiting for the episode to end so the future rewards are known. For long episodes (LLM responses can be thousands of tokens), that wait is expensive. Temporal Difference (TD) learning sidesteps the wait by bootstrapping: replace the unknown future return with the critic's current estimate of the next state's value. The resulting one-step TD target is

r_t + \gamma V_\phi(s_{t+1})

and the one-step TD advantage that replaces $G_t - V_\phi(s_t)$ is

\delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)

This single quantity, the TD error, is the workhorse of modern RL. It is biased (it trusts the critic, which is still learning) but very low-variance (only one transition contributes). $n$-step methods interpolate between the extremes: $n = 1$ gives pure TD, $n = T$ recovers full Monte Carlo (vanilla REINFORCE). Most modern algorithms use an intermediate $n$ via Generalized Advantage Estimation (GAE), which takes an exponentially weighted average over all $n$-step TD advantages and is the exact recipe PPO uses on LLM responses (see the GAE callout in Section 18.1.3).

Note: From REINFORCE to PPO in Three Steps

(1) REINFORCE weights gradients by the raw return $G_t$. (2) Actor-Critic subtracts a learned baseline $V(s)$ to get the advantage $A = G - V$, cutting variance. (3) PPO replaces the on-policy gradient with a clipped importance-sampling ratio that bounds how far each update can move, so the same trajectory can be reused for several gradient steps without instability. Each step adds one ingredient; nothing is mysterious in the final recipe.

Research Frontier

Distributed Actor-Critic and Hybrid Engines

For environments where rollouts are slow (game engines, robotic simulators, LLM generation), training scales better when many actors collect trajectories in parallel and feed a single learner. A3C (Asynchronous Advantage Actor-Critic, Mnih et al. 2016) and its synchronous cousin A2C were the breakthroughs that proved this; today's LLM RLHF stacks (TRL with vLLM rollouts plus an FSDP-sharded trainer) are direct descendants of the same idea. The split between fast rollout workers and a slow gradient-update worker is one of the few infrastructure patterns that survives intact from Atari-era deep RL to billion-parameter LLM alignment.

Key Takeaways

RL is a learning paradigm where an agent improves through trial and feedback, not labeled examples. The core loop is: observe state, take action, receive reward, update policy.
An LLM is a policy that maps a context (state) to a probability distribution over tokens (actions). This framing reused later in RLHF.
Value functions (V and Q) estimate long-term expected reward. The Bellman equation gives them a recursive structure.
Policy gradients adjust the policy to make high-reward actions more probable. REINFORCE is the canonical algorithm; actor-critic adds a baseline to cut variance.
The advantage A = G - V isolates how much better an action was than the critic expected, and is the quantity PPO will clip in Section 0.5a.

Self-Check

1. In the RLHF framework for LLM training, what plays the role of the RL "action"?

Show Answer

Selecting the next token from the vocabulary. Each individual token selection is one action. The complete response is the result of an entire episode of actions. The prompt is the initial state, and the reward model's score is the reward signal.

2. Why does subtracting a state-dependent baseline V(s) from the return G keep the policy gradient unbiased?

Show Answer

Because the expected value of (b(s) times the score function grad log pi(a|s)) is zero for any baseline that depends only on the state and not the action. The baseline simply re-centers the random variable G(s,a) around its mean V(s), reducing variance without shifting the expectation.

What's Next?

The discussion continues in Section 0.5a: PPO, RLHF Pipeline & the Full RL-to-LLM Alignment Picture, which adds PPO clipping on top of actor-critic, then maps the entire RL vocabulary onto LLM training and walks through the complete RLHF pipeline. After that, Chapter 1 moves to NLP and text representation.

Further Reading

Foundational Papers & Textbooks

Sutton, R. S. & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press. The definitive RL textbook, covering everything from multi-armed bandits to policy gradient methods. Chapters 2, 3, and 13 are most relevant to this section's treatment of MDPs, value functions, and REINFORCE. Freely available online and suitable for both beginners and researchers seeking rigorous foundations.

Williams, R. J. (1992). "Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning." Machine Learning, 8, 229-256. The original REINFORCE paper, introducing the policy gradient theorem that forms the basis of the code examples in this section. It provides the mathematical derivation of the log-probability trick used throughout modern RL. Essential for researchers; practitioners can focus on sections 3 and 4 for the core algorithm.

Mnih, V., Kavukcuoglu, K., Silver, D., et al. (2013). "Playing Atari with Deep Reinforcement Learning." The DeepMind paper that introduced DQN and launched the modern deep-RL era. Demonstrates that a single CNN-based agent can learn to play 49 Atari games from raw pixels, motivating the state-space argument for "why deep" in Section 0.5.1.1.

Mnih, V., Badia, A. P., Mirza, M., et al. (2016). "Asynchronous Methods for Deep Reinforcement Learning." ICML 2016. Introduces A3C and the family of asynchronous actor-critic algorithms whose actor-learner split foreshadows the rollout-worker / trainer-worker architecture of modern LLM RLHF stacks. The canonical reference for the actor-critic framing of Section 0.5.4b.

Schulman, J., Moritz, P., Levine, S., Jordan, M., & Abbeel, P. (2016). "High-Dimensional Continuous Control Using Generalized Advantage Estimation." ICLR 2016. Introduces GAE, the exponentially weighted average of $n$-step TD advantages that PPO uses to score every token in an LLM response. Provides the rigorous bridge between the one-step TD error of Section 0.5.4b and the advantage estimator deployed in Section 18.1.3.