For a comprehensive treatment of loss functions, gradient descent, and optimization, see Section 0.1: ML Basics: Features, Optimization & Generalization. For hands-on PyTorch training loops, see Section 0.3: PyTorch Tutorial.
This page collects the most commonly referenced loss functions and optimizers in a single lookup table. For derivations, intuition, and worked examples, see the main text references above.
## Loss Functions Quick Reference
| Loss Function | Formula | Use Case |
|---|---|---|
| Cross-Entropy | $-\sum_i y_i \log(p_i)$ | Language modeling, classification |
| Mean Squared Error | $\frac{1}{n} \sum_i (y_i - p_i)^2$ | Regression, reward modeling |
| Binary Cross-Entropy | $-[y \log(p) + (1-y) \log(1-p)]$ | Binary classification, DPO preference pairs |
| Contrastive Loss | Various formulations | Embedding training (Chapter 18), CLIP |
| Hinge Loss | $\max(0, 1 - y \cdot p)$ | SVMs, ranking tasks |
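The first three formulas above can be computed by hand in a few lines. The following sketch implements them in plain Python (no framework dependencies); the function names are ours, not a library API:

```python
import math

def cross_entropy(y, p):
    """Cross-entropy between a one-hot target y and predicted probabilities p."""
    return -sum(yi * math.log(pi) for yi, pi in zip(y, p) if yi > 0)

def mse(y, p):
    """Mean squared error between targets y and predictions p."""
    return sum((yi - pi) ** 2 for yi, pi in zip(y, p)) / len(y)

def binary_cross_entropy(y, p):
    """Binary cross-entropy for a single label y in {0, 1} and probability p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# One-hot target over 3 classes; the model puts 0.7 on the correct class.
print(cross_entropy([0, 1, 0], [0.2, 0.7, 0.1]))   # -log(0.7) ≈ 0.357
print(mse([1.0, 2.0], [1.5, 1.5]))                 # (0.25 + 0.25) / 2 = 0.25
print(binary_cross_entropy(1, 0.9))                # -log(0.9) ≈ 0.105
```

In practice you would use the framework's fused, numerically stable versions (e.g. PyTorch's `torch.nn.functional.cross_entropy`, which takes raw logits rather than probabilities), but the arithmetic is exactly this.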
## Optimizers Quick Reference
| Optimizer | Key Property | Typical Use |
|---|---|---|
| SGD | Simple gradient updates; requires LR tuning | Classical ML, some vision models |
| Adam | Adaptive per-parameter learning rates (1st + 2nd moment) | Default for most neural networks |
| AdamW | Adam with decoupled weight decay | BERT, GPT, most modern LLMs |
| Adafactor | Memory-efficient factorized 2nd moments | Very large models where Adam is too memory-heavy |
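To make "adaptive per-parameter learning rates" and "decoupled weight decay" concrete, here is a minimal single-scalar sketch of one AdamW step (default hyperparameters are illustrative; real implementations vectorize this over all parameters):

```python
import math

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single scalar parameter.

    m, v are running 1st/2nd moment estimates; t is the step count (1-based).
    Decoupled weight decay shrinks the parameter directly, instead of being
    folded into the gradient as L2 regularization would be in plain Adam.
    """
    m = beta1 * m + (1 - beta1) * grad           # 1st moment: running mean of grads
    v = beta2 * v + (1 - beta2) * grad ** 2      # 2nd moment: running mean of grad^2
    m_hat = m / (1 - beta1 ** t)                 # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v
```

Setting `weight_decay=0` recovers plain Adam; dropping the moment estimates and using `theta - lr * grad` recovers SGD, which is why SGD needs more careful learning-rate tuning.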
Most LLM training runs use a warmup phase (learning rate increases linearly from 0 to its peak over the first few thousand steps) followed by a cosine decay (learning rate decreases smoothly back toward 0). This schedule prevents early instability and allows fine-grained convergence later in training. For implementation details, see Section 0.1.
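The warmup-plus-cosine schedule described above fits in one function. This is a sketch with made-up hyperparameter values (peak rate, warmup length, and total steps are assumptions, not values from the text):

```python
import math

def lr_schedule(step, peak_lr=3e-4, warmup_steps=2000, total_steps=100_000,
                min_lr=0.0):
    """Linear warmup from 0 to peak_lr, then cosine decay toward min_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps     # linear ramp over warmup
    # Cosine decay: progress goes 0 -> 1 over the remaining steps,
    # so the multiplier falls smoothly from 1 to 0.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

At step 0 the rate is 0, at `warmup_steps` it hits the peak, and at `total_steps` it has decayed to `min_lr`; frameworks expose the same shape via, e.g., PyTorch's `LambdaLR` or Hugging Face's `get_cosine_schedule_with_warmup`.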