Building Conversational AI with LLMs and Agents
Appendix B: Machine Learning Essentials

B.2 Loss Functions and Optimization

Covered in Detail

For a comprehensive treatment of loss functions, gradient descent, and optimization, see Section 0.1: ML Basics: Features, Optimization & Generalization. For hands-on PyTorch training loops, see Section 0.3: PyTorch Tutorial.

This page collects the most commonly referenced loss functions and optimizers in a single lookup table. For derivations, intuition, and worked examples, see the main text references above.

Loss Functions Quick Reference

Common Loss Functions

| Loss Function | Formula | Use Case |
|---|---|---|
| Cross-Entropy | $- \sum_i y_i \log(p_i)$ | Language modeling, classification |
| Mean Squared Error | $\frac{1}{n} \sum_i (y_i - p_i)^2$ | Regression, reward modeling |
| Binary Cross-Entropy | $-[y \log(p) + (1-y) \log(1-p)]$ | Binary classification, DPO preference pairs |
| Contrastive Loss | Various formulations | Embedding training (Chapter 18), CLIP |
| Hinge Loss | $\max(0, 1 - y \cdot p)$ | SVMs, ranking tasks |
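The first three formulas in the table can be sketched directly in plain Python. This is a minimal illustration of the math, not production code; in a real training loop you would use the numerically stable library versions (e.g. `torch.nn.functional.cross_entropy`), and the function names here are our own:

```python
import math

def cross_entropy(y, p):
    """Cross-entropy: -sum_i y_i * log(p_i), with y a target distribution."""
    return -sum(yi * math.log(pi) for yi, pi in zip(y, p) if yi > 0)

def mse(y, p):
    """Mean squared error: (1/n) * sum_i (y_i - p_i)^2."""
    return sum((yi - pi) ** 2 for yi, pi in zip(y, p)) / len(y)

def binary_cross_entropy(y, p):
    """BCE for a single label y in {0, 1} and predicted probability p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# A one-hot target concentrated on the model's top class gives low loss:
# cross_entropy([0, 1, 0], [0.1, 0.8, 0.1]) == -log(0.8) ≈ 0.223
```

Note that with a one-hot target, cross-entropy reduces to the negative log-probability of the correct class, which is exactly the per-token loss used in language modeling.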

Optimizers Quick Reference

Optimizers Commonly Used in LLM Training

| Optimizer | Key Property | Typical Use |
|---|---|---|
| SGD | Simple gradient updates; requires LR tuning | Classical ML, some vision models |
| Adam | Adaptive per-parameter learning rates (1st + 2nd moment) | Default for most neural networks |
| AdamW | Adam with decoupled weight decay | BERT, GPT, most modern LLMs |
| Adafactor | Memory-efficient factorized 2nd moments | Very large models where Adam is too memory-heavy |

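To make the "1st + 2nd moment" and "decoupled weight decay" properties concrete, here is a single AdamW update step written out in plain Python. This is a sketch of the update rule only (scalar parameters, illustrative hyperparameters); in practice you would use `torch.optim.AdamW`:

```python
import math

def adamw_step(params, grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update over a list of scalar parameters (t is 1-indexed)."""
    updated = []
    for i, (p, g) in enumerate(zip(params, grads)):
        m[i] = beta1 * m[i] + (1 - beta1) * g       # 1st moment: running mean of grads
        v[i] = beta2 * v[i] + (1 - beta2) * g * g   # 2nd moment: running mean of grad^2
        m_hat = m[i] / (1 - beta1 ** t)             # bias correction for zero init
        v_hat = v[i] / (1 - beta2 ** t)
        p = p - lr * weight_decay * p               # decoupled weight decay (the "W")
        updated.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
    return updated
```

Because the step size is divided by $\sqrt{\hat{v}}$ per parameter, parameters with consistently large gradients get smaller effective learning rates, which is what "adaptive per-parameter learning rates" means in the table. Plain Adam applies weight decay through the gradient instead, which interacts badly with this adaptive scaling; AdamW's decoupled decay avoids that.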
Learning Rate Scheduling

Most LLM training runs use a warmup phase (learning rate increases linearly from 0 to its peak over the first few thousand steps) followed by a cosine decay (learning rate decreases smoothly back toward 0). This schedule prevents early instability and allows fine-grained convergence later in training. For implementation details, see Section 0.1.
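The warmup-then-cosine schedule described above can be sketched as a small function. The hyperparameters here (peak learning rate, warmup and total step counts) are illustrative defaults, not recommendations; real runs typically use the framework's scheduler (e.g. `torch.optim.lr_scheduler`):

```python
import math

def lr_schedule(step, peak_lr=3e-4, warmup_steps=2000, total_steps=100_000,
                min_lr=0.0):
    """Linear warmup from 0 to peak_lr, then cosine decay toward min_lr."""
    if step < warmup_steps:
        # Warmup: LR rises linearly over the first warmup_steps steps.
        return peak_lr * step / warmup_steps
    # Cosine decay: progress goes 0 -> 1 over the remaining steps,
    # so the LR falls smoothly from peak_lr to min_lr.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# lr_schedule(0) -> 0.0; lr_schedule(2000) -> 3e-4; lr_schedule(100_000) -> ~0.0
```

Halfway through warmup the learning rate is half of peak; at the final step the cosine term reaches $\cos(\pi) = -1$ and the rate bottoms out at `min_lr`.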