Building Conversational AI with LLMs and Agents
Appendix B: Machine Learning Essentials

B.3 Overfitting, Regularization, and Validation

Covered in Detail

For a comprehensive treatment of overfitting, the bias-variance tradeoff, and regularization techniques (L1, L2, dropout), see Section 0.1: ML Basics: Features, Optimization & Generalization. For dropout and batch normalization in neural networks, see Section 0.2: Deep Learning Essentials.

This page provides a quick-reference lookup for regularization techniques and data-splitting conventions. For worked examples, visualizations, and the bias-variance tradeoff derivation, see the main text references above.

Regularization Techniques Quick Reference

Regularization Techniques for LLM Practitioners
| Technique | How It Works | LLM Usage |
|---|---|---|
| Dropout | Randomly zeroes a fraction of activations during training | Used in BERT; less common in modern autoregressive LLMs |
| Weight Decay (L2) | Adds a penalty proportional to weight magnitude to the loss | Standard in all LLM training (via AdamW) |
| L1 Regularization | Adds a penalty proportional to the absolute value of weights; drives some weights to exactly zero | Feature selection in classical ML; rarely used in LLMs |
| Early Stopping | Stops training when validation performance stops improving | Common in fine-tuning; pretraining usually runs to a compute budget |
| Data Augmentation | Creates synthetic training examples by transforming existing ones | Paraphrasing, back-translation, synthetic data (Chapter 12) |
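To make the first two rows concrete, here is a minimal pure-Python sketch of a single SGD step with decoupled weight decay (the idea behind AdamW) and of inverted dropout. The function names and hyperparameter values are illustrative, not from any particular library.

```python
import random

def sgd_step_weight_decay(w, grad, lr=0.1, weight_decay=0.01):
    """One SGD step with decoupled weight decay (the AdamW idea):
    the decay term shrinks the weight directly, independent of the
    loss gradient, pulling weights toward zero each step."""
    return w - lr * grad - lr * weight_decay * w

def dropout(activations, p=0.5, rng=None):
    """Inverted dropout: zero each activation with probability p during
    training, scaling survivors by 1/(1-p) so the expected value of
    each unit is unchanged (no rescaling needed at inference time)."""
    rng = rng or random.Random(0)
    return [0.0 if rng.random() < p else a / (1 - p) for a in activations]
```

Note the design point in `sgd_step_weight_decay`: the decay is applied directly to the weight rather than added to the loss, which is what distinguishes AdamW's decoupled weight decay from classic L2 regularization when adaptive optimizers are involved.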

Data Splitting Conventions

Standard Data Splits
| Split | Typical Size | Purpose |
|---|---|---|
| Training | ~80% | Model learns from this data |
| Validation | ~10% | Tune hyperparameters, detect overfitting |
| Test | ~10% | Final unbiased performance estimate (evaluate once) |
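The 80/10/10 convention above can be implemented with a single shuffle; this is a minimal stdlib sketch (the function name and fraction defaults are illustrative).

```python
import random

def train_val_test_split(examples, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle indices once with a fixed seed, then carve off the
    validation and test portions; everything left over is training data."""
    idx = list(range(len(examples)))
    random.Random(seed).shuffle(idx)
    n_val = int(len(examples) * val_frac)
    n_test = int(len(examples) * test_frac)
    val = [examples[i] for i in idx[:n_val]]
    test = [examples[i] for i in idx[n_val:n_val + n_test]]
    train = [examples[i] for i in idx[n_val + n_test:]]
    return train, val, test
```

Fixing the seed keeps the split reproducible across runs, which matters because the test split must stay untouched until the final evaluation.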
Warning: Data Contamination in LLMs

Because LLMs are pretrained on massive internet corpora, there is a risk that test set examples appeared in the pretraining data. This is called data contamination, and it can artificially inflate benchmark scores. Always check for contamination when evaluating, and prefer held-out or recently created benchmarks that could not have appeared in the training data.
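One common way to check for contamination is n-gram overlap between test examples and the pretraining corpus. The sketch below is a simplified illustration of that idea (the function names, the 8-gram default, and the any-overlap criterion are assumptions; real contamination audits use larger corpora and more careful matching).

```python
def ngrams(text, n=8):
    """Return the set of word-level n-grams in a lowercased text."""
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contamination_rate(test_examples, corpus_text, n=8):
    """Fraction of test examples that share at least one n-gram
    with the (pretraining) corpus text."""
    corpus = ngrams(corpus_text, n)
    hits = sum(1 for ex in test_examples if ngrams(ex, n) & corpus)
    return hits / len(test_examples)
```

A nonzero rate does not prove the model memorized those examples, but it flags benchmark items whose scores should be treated with suspicion.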