For a comprehensive treatment of loss functions, gradient descent, and optimization, see Section 0.1: ML Basics: Features, Optimization & Generalization. For hands-on PyTorch training loops, see Section 0.3: PyTorch Tutorial.
This page collects the most commonly referenced loss functions and optimizers in a single lookup table. For derivations, intuition, and worked examples, see the main text references above.
## Loss Functions Quick Reference
| Loss Function | Formula | Use Case |
|---|---|---|
| Cross-Entropy | $-\sum_i y_i \log(p_i)$ | Language modeling, classification |
| Mean Squared Error | $\frac{1}{n} \sum_i (y_i - p_i)^2$ | Regression, reward modeling |
| Binary Cross-Entropy | $-[y \log(p) + (1-y) \log(1-p)]$ | Binary classification, DPO preference pairs |
| Contrastive Loss | Various formulations | Embedding training (Chapter 18), CLIP |
| Hinge Loss | $\max(0, 1 - y \cdot p)$ | SVMs, ranking tasks |
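The first three formulas above can be computed by hand in a few lines. The following sketch implements them in plain Python (no framework dependencies); the function names are ours, not a library API:

```python
import math

def cross_entropy(y, p):
    """Cross-entropy between a one-hot target y and predicted probabilities p."""
    return -sum(yi * math.log(pi) for yi, pi in zip(y, p) if yi > 0)

def mse(y, p):
    """Mean squared error between targets y and predictions p."""
    return sum((yi - pi) ** 2 for yi, pi in zip(y, p)) / len(y)

def binary_cross_entropy(y, p):
    """Binary cross-entropy for a single label y in {0, 1} and probability p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# One-hot target over 3 classes; the model puts 0.7 on the correct class.
print(cross_entropy([0, 1, 0], [0.2, 0.7, 0.1]))   # -log(0.7) ≈ 0.357
print(mse([1.0, 2.0], [1.5, 1.5]))                 # (0.25 + 0.25) / 2 = 0.25
print(binary_cross_entropy(1, 0.9))                # -log(0.9) ≈ 0.105
```

In practice you would use the framework's fused, numerically stable versions (e.g. PyTorch's `torch.nn.functional.cross_entropy`, which takes raw logits rather than probabilities), but the arithmetic is exactly this.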
## Optimizers Quick Reference
| Optimizer | Key Property | Typical Use |
|---|---|---|
| SGD | Simple gradient updates; requires LR tuning | Classical ML, some vision models |
| Adam | Adaptive per-parameter learning rates (1st + 2nd moment) | Default for most neural networks |
| AdamW | Adam with decoupled weight decay | BERT, GPT, most modern LLMs |
| Adafactor | Memory-efficient factorized 2nd moments | Very large models where Adam is too memory-heavy |
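To make "adaptive per-parameter learning rates" and "decoupled weight decay" concrete, here is a minimal single-scalar sketch of one AdamW step (default hyperparameters are illustrative; real implementations vectorize this over all parameters):

```python
import math

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single scalar parameter.

    m, v are running 1st/2nd moment estimates; t is the step count (1-based).
    Decoupled weight decay shrinks the parameter directly, instead of being
    folded into the gradient as L2 regularization would be in plain Adam.
    """
    m = beta1 * m + (1 - beta1) * grad           # 1st moment: running mean of grads
    v = beta2 * v + (1 - beta2) * grad ** 2      # 2nd moment: running mean of grad^2
    m_hat = m / (1 - beta1 ** t)                 # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v
```

Setting `weight_decay=0` recovers plain Adam; dropping the moment estimates and using `theta - lr * grad` recovers SGD, which is why SGD needs more careful learning-rate tuning.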
Most LLM training runs use a warmup phase (learning rate increases linearly from 0 to its peak over the first few thousand steps) followed by a cosine decay (learning rate decreases smoothly back toward 0). This schedule prevents early instability and allows fine-grained convergence later in training. For implementation details, see Section 0.1.
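The warmup-plus-cosine schedule described above fits in one function. This is a sketch with made-up hyperparameter values (peak rate, warmup length, and total steps are assumptions, not values from the text):

```python
import math

def lr_schedule(step, peak_lr=3e-4, warmup_steps=2000, total_steps=100_000,
                min_lr=0.0):
    """Linear warmup from 0 to peak_lr, then cosine decay toward min_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps     # linear ramp over warmup
    # Cosine decay: progress goes 0 -> 1 over the remaining steps,
    # so the multiplier falls smoothly from 1 to 0.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

At step 0 the rate is 0, at `warmup_steps` it hits the peak, and at `total_steps` it has decayed to `min_lr`; frameworks expose the same shape via, e.g., PyTorch's `LambdaLR` or Hugging Face's `get_cosine_schedule_with_warmup`.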