Calculus provides the machinery for training neural networks. The entire training loop, in which a model's weights are adjusted to minimize a loss function, rests on computing gradients and following them downhill.
Derivatives and Gradients
A derivative tells you how fast a function's output changes when you nudge its input. For a function $f(x)$, the derivative $f'(x)$ or $df/dx$ is the slope of the function at point $x$.
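The "slope from a nudge" intuition can be checked numerically with a finite difference. A minimal sketch (the function $f(x) = x^2$ and the step size $h$ are illustrative choices, not from the text):

```python
# Numerical check of a derivative via a central finite difference.
def f(x):
    return x ** 2

def numerical_derivative(f, x, h=1e-6):
    # Central difference: (f(x+h) - f(x-h)) / (2h) approximates f'(x).
    return (f(x + h) - f(x - h)) / (2 * h)

# Analytic derivative of x^2 is 2x, so at x = 3.0 we expect a value near 6.0.
print(numerical_derivative(f, 3.0))
```

Shrinking $h$ makes the estimate approach the analytic slope, which is exactly what the limit definition of the derivative formalizes.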
When a function has many inputs (as a loss function that depends on millions of weights does), we compute partial derivatives with respect to each input. The collection of all partial derivatives is called the gradient:

$$\nabla L = \left( \frac{\partial L}{\partial w_1}, \frac{\partial L}{\partial w_2}, \ldots, \frac{\partial L}{\partial w_n} \right)$$
The gradient points in the direction of steepest increase. To minimize the loss, we move in the opposite direction.
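Both claims can be verified directly with autograd. A small sketch, using a toy two-variable function $f(x, y) = x^2 + 3y^2$ chosen for illustration (its partials are $2x$ and $6y$):

```python
import torch

# Gradient of f(x, y) = x^2 + 3y^2 at (1, 1); the partials are 2x and 6y.
xy = torch.tensor([1.0, 1.0], requires_grad=True)
loss = xy[0] ** 2 + 3 * xy[1] ** 2
loss.backward()
print(xy.grad)  # tensor([2., 6.]) -- the direction of steepest increase

# A small step *against* the gradient lowers the function value.
with torch.no_grad():
    new_xy = xy - 0.1 * xy.grad
new_loss = new_xy[0] ** 2 + 3 * new_xy[1] ** 2
print(new_loss < loss)  # True
```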
Gradient Descent
The weight update rule for gradient descent is remarkably simple:

$$w \leftarrow w - \eta \, \nabla L(w)$$
Here, $\eta$ (eta) is the learning rate, a small positive number that controls step size. Too large, and training diverges; too small, and convergence is painfully slow. Finding the right learning rate (and scheduling it to change over time) is one of the most important hyperparameter decisions in LLM training. Code Fragment A.3.1 below puts this into practice.
```python
# Simplified gradient descent in PyTorch
import torch

# Suppose we have a simple loss function: L = (prediction - target)^2
w = torch.tensor([1.0], requires_grad=True)
target = torch.tensor([3.0])
learning_rate = 0.1

for step in range(10):
    prediction = w * 2.0            # simple linear model
    loss = (prediction - target) ** 2
    loss.backward()                 # compute gradient
    with torch.no_grad():
        w -= learning_rate * w.grad # update weight
        w.grad.zero_()              # reset gradient for next step
```
Each iteration computes the loss, calls backward() to obtain the gradient, updates the weight, and zeros the gradient for the next step.

The Chain Rule and Backpropagation
Neural networks are compositions of many functions: the output of one layer feeds into the next. To compute how the loss depends on a weight deep in the network, we need the chain rule:

$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial y_n} \cdot \frac{\partial y_n}{\partial y_{n-1}} \cdots \frac{\partial y_1}{\partial w}$$
Each factor in this chain corresponds to one layer. Backpropagation is simply the efficient, automated application of the chain rule, working backwards from the loss to compute gradients for every weight in the network.
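The chain rule can be checked by hand against autograd on a tiny two-step composition. A sketch with illustrative values (the model $y = wx$, $z = y^2$ is an assumption made for the example):

```python
import torch

# Two "layers": y = w * x, then z = y ** 2, with z playing the role of the loss.
# Chain rule: dz/dw = (dz/dy) * (dy/dw) = 2y * x.
x = torch.tensor(3.0)
w = torch.tensor(2.0, requires_grad=True)
y = w * x                 # y = 6
z = y ** 2                # z = 36
z.backward()              # backprop applies the chain rule automatically

manual = 2 * y.item() * x.item()   # 2 * 6 * 3 = 36
print(w.grad.item(), manual)       # 36.0 36.0
```

Backpropagation does exactly this bookkeeping, but for millions of weights at once, reusing each intermediate factor instead of recomputing it.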
If the chain rule multiplies many factors less than 1, the gradient shrinks exponentially as it flows backward through layers (the vanishing gradient problem). If factors exceed 1, gradients explode. Techniques like residual connections (adding the input of a layer to its output), layer normalization, and careful initialization all exist to keep this product near 1. The transformer architecture uses residual connections around every sub-layer, which is one reason it can be trained to hundreds of layers.
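The shrinking product is easy to demonstrate. A minimal sketch comparing a 20-layer stack of squashing functions with the same stack plus residual connections (the depth and the `tanh(0.5 * h)` layer are illustrative assumptions, not the transformer's actual sub-layers):

```python
import torch

depth = 20

# Plain stack: each layer's local derivative is at most 0.5, so the
# chain-rule product shrinks roughly like 0.5^depth.
x = torch.tensor(0.5, requires_grad=True)
h = x
for _ in range(depth):
    h = torch.tanh(0.5 * h)
h.backward()
plain_grad = x.grad.item()

# Residual stack: each layer computes h + f(h), so its local derivative
# is 1 + f'(h), keeping the product from collapsing toward zero.
x2 = torch.tensor(0.5, requires_grad=True)
h = x2
for _ in range(depth):
    h = h + torch.tanh(0.5 * h)
h.backward()
residual_grad = x2.grad.item()

print(plain_grad, residual_grad)  # plain gradient is tiny; residual stays healthy
```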
Common Activation Functions and Their Derivatives
| Function | Formula | Derivative | Used In |
|---|---|---|---|
| ReLU | max(0, x) | 0 if x < 0, else 1 | Early transformers, CNNs |
| GELU | x · Φ(x) | Φ(x) + x · φ(x) | BERT, GPT-2+ |
| SiLU (Swish) | x · σ(x) | σ(x)(1 + x(1 − σ(x))) | LLaMA, modern LLMs |
| Sigmoid | 1 / (1 + exp(-x)) | σ(x)(1 - σ(x)) | Gating mechanisms |
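The derivative formulas in the table can be confirmed against autograd. A sketch checking the sigmoid and SiLU rows at one illustrative point ($x = 0.7$ is an arbitrary choice):

```python
import torch
import torch.nn.functional as F

# Sigmoid: autograd gradient vs. the formula sigma(x) * (1 - sigma(x)).
x = torch.tensor(0.7, requires_grad=True)
torch.sigmoid(x).backward()
s = torch.sigmoid(torch.tensor(0.7))
sigmoid_formula = (s * (1 - s)).item()
print(x.grad.item(), sigmoid_formula)  # the two values agree

# SiLU: autograd gradient vs. the formula sigma(x) * (1 + x * (1 - sigma(x))).
x2 = torch.tensor(0.7, requires_grad=True)
F.silu(x2).backward()
silu_formula = (s * (1 + 0.7 * (1 - s))).item()
print(x2.grad.item(), silu_formula)    # the two values agree
```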