About This Section
This section covers training algorithms, optimization methods, decoding strategies, and core ML techniques. Each entry links to the chapter where the technique is explained in detail.
- Adapter
- A small, trainable module inserted into a frozen pretrained model to specialize it for a downstream task. Adapters are a core parameter-efficient fine-tuning technique: you train only the adapter weights (often less than 1% of the model) while leaving the original parameters untouched.
- See Section 15.1 (Parameter-Efficient Methods)
- Backpropagation
- The algorithm that computes gradients of the loss function with respect to every parameter in a neural network by applying the chain rule layer by layer. These gradients tell the optimizer how to adjust each weight to reduce the loss.
- See Section 00.3 (Training Loop and Optimization)
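The chain rule described above can be sketched on a one-parameter model. This is a toy illustration with made-up numbers, not the book's implementation; the analytic gradient is checked against a finite-difference estimate.

```python
# Backpropagation in miniature: y_hat = w * x with squared-error loss.
# The chain rule gives dL/dw = dL/dy_hat * dy_hat/dw.

def forward(w, x):
    return w * x

def loss(y_hat, y):
    return (y_hat - y) ** 2

def grad_w(w, x, y):
    y_hat = forward(w, x)
    dL_dyhat = 2 * (y_hat - y)   # derivative of (y_hat - y)^2
    dyhat_dw = x                 # derivative of w * x w.r.t. w
    return dL_dyhat * dyhat_dw   # chain rule

w, x, y = 3.0, 2.0, 10.0
analytic = grad_w(w, x, y)

# Sanity check with a central finite difference.
eps = 1e-6
numeric = (loss(forward(w + eps, x), y) - loss(forward(w - eps, x), y)) / (2 * eps)
print(analytic, round(numeric, 3))  # both ≈ -16.0
```

In a deep network the same idea repeats layer by layer: each layer's local derivative is multiplied into the gradient flowing back from the loss.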
- Batch Size
- The number of training examples processed together in a single forward and backward pass. Larger batches improve GPU utilization but require more memory. When GPU memory is limited, gradient accumulation simulates larger effective batch sizes.
- See Section 14.3 (Training Hyperparameters)
- Beam Search
- A decoding strategy that maintains multiple candidate sequences (beams) at each step, expanding and pruning them by cumulative probability. Beam search produces more coherent outputs than greedy decoding but can be less creative than sampling methods.
- See Section 05.2 (Beam Search)
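A minimal beam search over a toy next-token table makes the expand-and-prune loop concrete. The probability numbers are invented for illustration; they are chosen so that beam search finds a sequence greedy decoding would miss.

```python
import math

# Hypothetical next-token distributions keyed by the last token only.
PROBS = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a":   {"cat": 0.9, "dog": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}

def beam_search(beam_width=2, max_len=4):
    # Each beam is (tokens, cumulative log-probability).
    beams = [(["<s>"], 0.0)]
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == "</s>":          # finished beams carry over
                candidates.append((tokens, score))
                continue
            for tok, p in PROBS[tokens[-1]].items():
                candidates.append((tokens + [tok], score + math.log(p)))
        # Prune: keep only the beam_width highest-scoring candidates.
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
    return beams

best_tokens, best_score = beam_search()[0]
print(best_tokens)  # ['<s>', 'a', 'cat', '</s>']
```

Greedy decoding would commit to "the" (p = 0.6) and end with total probability 0.3 at best, while the beam keeps "a" alive long enough to find "a cat" (0.4 × 0.9 = 0.36).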
- BFloat16 (Brain Float 16)
- A 16-bit floating-point format that keeps the same exponent range as 32-bit floats while halving memory usage. BFloat16 is the standard precision for LLM training and inference on A100/H100 GPUs because it avoids overflow issues that plague FP16.
- See Section 09.2 (Quantization)
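The format's tradeoff is visible in pure Python: bfloat16 is a float32 with the low 16 mantissa bits dropped (1 sign bit, 8 exponent bits, 7 mantissa bits). The sketch below truncates with round-to-nearest-even; it ignores edge cases like NaN payloads and is for illustration only.

```python
import struct

def to_bfloat16(x: float) -> float:
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # float32 bit pattern
    # Round to nearest even, then drop the low 16 mantissa bits.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# Same dynamic range as FP32: huge values survive ...
print(to_bfloat16(3e38))       # ≈ 3e38 (FP16 would overflow to inf)
# ... but only about 2-3 decimal digits of precision remain.
print(to_bfloat16(1.2345678))  # 1.234375
```

FP16 spends its extra bits on mantissa instead of exponent, which is why it overflows around 65504 while bfloat16 does not.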
- Chain-of-Thought (CoT)
- A prompting strategy that instructs the model to show intermediate reasoning steps before giving a final answer. CoT significantly boosts performance on arithmetic, logic, and multi-step tasks by forcing the model to "think out loud" rather than jump to conclusions.
- See Section 11.2 (Advanced Prompting Techniques)
- Chinchilla Scaling Laws
- Research by Hoffmann et al. (2022) showing that, for a fixed compute budget, model size and training data should be scaled roughly equally. This overturned earlier assumptions that favored larger models trained on less data, reshaping how frontier labs plan training runs.
- See Section 06.3 (Scaling Laws)
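The result is often reduced to a rule of thumb: train on roughly 20 tokens per parameter, with training cost estimated as 6ND FLOPs. The numbers below are back-of-the-envelope figures derived from that heuristic, not values from the paper's tables.

```python
def chinchilla_optimal(params: float):
    tokens = 20 * params          # ~20 tokens per parameter (heuristic)
    flops = 6 * params * tokens   # standard 6*N*D training-cost estimate
    return tokens, flops

for n in (1e9, 7e9, 70e9):
    tokens, flops = chinchilla_optimal(n)
    print(f"{n/1e9:.0f}B params -> {tokens/1e9:.0f}B tokens, {flops:.1e} FLOPs")
# 7B params -> 140B tokens; 70B params -> 1400B tokens
```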
- Cross-Entropy Loss
- The standard loss function for language modeling. It measures the gap between the model's predicted token probability distribution and the true next token. Lower cross-entropy means the model assigns higher probability to the correct tokens.
- See Section 06.1 (Pretraining Objectives)
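For a single prediction, cross-entropy reduces to the negative log-probability the model assigned to the correct token. A sketch with a made-up three-token vocabulary:

```python
import math

def softmax(logits):
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    probs = softmax(logits)
    return -math.log(probs[target_index])

logits = [2.0, 1.0, 0.1]   # model's scores over a toy 3-token vocabulary
print(cross_entropy(logits, 0))  # low loss: correct token got the top score
print(cross_entropy(logits, 2))  # high loss: correct token was unlikely
```

Training averages this quantity over every token position in the batch; its exponential is the perplexity.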
- Decoding
- The process of generating text from a language model by iteratively selecting the next token. Strategies range from deterministic (greedy, beam search) to stochastic (top-k, nucleus/top-p sampling), each offering different tradeoffs between coherence and diversity.
- See Section 05.1 (Autoregressive Generation)
- Distillation (Knowledge Distillation)
- A technique for training a smaller "student" model to replicate a larger "teacher" model's behavior. The student learns from the teacher's soft probability distributions rather than hard labels, transferring knowledge into a cheaper, faster form factor.
- See Section 16.1 (Knowledge Distillation)
- DPO (Direct Preference Optimization)
- An alignment technique that trains on pairs of preferred and rejected outputs without needing a separate reward model. DPO simplifies the RLHF pipeline by directly optimizing the policy from preference data using a classification-style loss, making alignment more accessible.
- See Section 17.3 (DPO and Alternatives)
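The classification-style loss can be written in a few lines. Given summed token log-probabilities of the chosen and rejected responses under the policy and the frozen reference model, DPO minimizes the negative log-sigmoid of a scaled log-ratio margin. The numbers below are invented for illustration.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Implicit "reward" of each response: its log-ratio vs. the reference.
    chosen_logratio = policy_chosen - ref_chosen
    rejected_logratio = policy_rejected - ref_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    return -math.log(1 / (1 + math.exp(-margin)))   # -log sigmoid(margin)

# Policy favors the chosen response more than the reference does: low loss.
low = dpo_loss(policy_chosen=-12.0, policy_rejected=-20.0,
               ref_chosen=-14.0, ref_rejected=-15.0)
# Policy favors the rejected response: high loss, so gradients push back.
high = dpo_loss(policy_chosen=-16.0, policy_rejected=-12.0,
                ref_chosen=-14.0, ref_rejected=-15.0)
print(low, high)
```

The `beta` coefficient plays the role of the KL penalty strength in RLHF: smaller values keep the policy closer to the reference model.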
- Dropout
- A regularization technique that randomly zeroes out a fraction of neuron activations during training, forcing the network to learn redundant representations. During inference all neurons are active. Dropout helps prevent overfitting, especially with limited training data.
- See Section 00.3 (Training Loop and Optimization)
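The standard "inverted dropout" formulation scales the surviving activations by 1/(1-p) during training, so the expected activation matches inference, where dropout is a no-op. A minimal sketch:

```python
import random

def dropout(activations, p=0.5, training=True):
    if not training:
        return list(activations)        # inference: all neurons active
    scale = 1.0 / (1.0 - p)
    return [a * scale if random.random() >= p else 0.0
            for a in activations]

random.seed(0)
print(dropout([1.0, 2.0, 3.0, 4.0]))   # some entries zeroed, survivors doubled
print(dropout([1.0, 2.0, 3.0, 4.0], training=False))  # unchanged
```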
- Evol-Instruct
- A synthetic data technique that iteratively prompts an LLM to make existing instructions more complex, generating progressively harder training examples. Introduced in the WizardLM paper, it addresses the challenge of creating diverse, high-quality instruction-tuning data at scale.
- See Section 13.3 (Synthetic Data Generation Techniques)
- Fine-Tuning
- Further training a pretrained model on a task-specific dataset to specialize its behavior. Full fine-tuning updates all weights; parameter-efficient variants (LoRA, QLoRA, adapters) update only a small fraction. Fine-tuning is how you turn a general model into a domain expert.
- See Section 14.1 (Fine-Tuning Fundamentals) and Section 15.1 (PEFT)
- Gradient Accumulation
- A technique that sums gradients over multiple mini-batches before performing a weight update, effectively simulating a larger batch size. This is essential when the desired batch size exceeds available GPU memory, which is common when fine-tuning LLMs.
- See Section 14.3 (Training Hyperparameters)
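The pattern is easiest to see without a framework. The sketch below trains a one-parameter model y = w·x on four micro-batches of size 1, summing (averaged) gradients and stepping once, which is exactly one update with an effective batch of 4. All numbers are toy values.

```python
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # (x, y) pairs, y = 2x
w, lr, accum_steps = 0.0, 0.01, 4
grad_sum = 0.0

for step, (x, y) in enumerate(data, start=1):
    grad = 2 * (w * x - y) * x          # dL/dw for squared error
    grad_sum += grad / accum_steps      # average over the effective batch
    if step % accum_steps == 0:         # step only every accum_steps batches
        w -= lr * grad_sum              # one update covering all 4 examples
        grad_sum = 0.0

print(w)  # moved toward the true slope 2.0
```

In PyTorch-style code the same idea is `loss / accum_steps` before `backward()`, calling `optimizer.step()` and `zero_grad()` only every `accum_steps` micro-batches.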
- KV Cache (Key-Value Cache)
- A memory optimization for autoregressive inference that stores previously computed key and value tensors so they are not recomputed at each generation step. The KV cache is the primary memory bottleneck during long-context inference, growing linearly with sequence length.
- See Section 09.1 (Inference Optimization Fundamentals)
- LoRA (Low-Rank Adaptation)
- A parameter-efficient fine-tuning method that freezes pretrained weights and injects small, low-rank matrices into each target layer. LoRA typically trains less than 1% of original parameters while matching or approaching full fine-tuning quality, making it the most popular PEFT technique.
- See Section 15.1 (LoRA and Parameter-Efficient Methods)
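The forward pass is just the frozen path plus a scaled low-rank correction: y = xW + (α/r)·xAB. The pure-Python matmul and the tiny 4-dimensional weights below are for illustration only; real LoRA applies this per attention/MLP projection.

```python
def matmul(X, Y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)]
            for row in X]

def add(X, Y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def scale(X, s):
    return [[s * a for a in row] for row in X]

d, r, alpha = 4, 1, 2
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen
A = [[0.25] for _ in range(d)]   # d x r, trainable
B = [[0.5, 0.0, 0.0, 0.0]]       # r x d, trainable (init to zero in practice)

x = [[1.0, 2.0, 3.0, 4.0]]
y = add(matmul(x, W), scale(matmul(matmul(x, A), B), alpha / r))
print(y)  # [[3.5, 2.0, 3.0, 4.0]]: base output plus a rank-1 update

print(2 * d * r, "trainable of", d * d)  # 8 of 16 here; tiny when d is large
```

With realistic sizes (d in the thousands, r around 8-64), the trainable fraction 2dr/d² is well under 1%, matching the figure in the definition.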
- Model Collapse
- A degradation phenomenon where models trained on AI-generated data progressively lose diversity and accuracy across generations. Model collapse motivates careful quality filtering and mixing real human data into synthetic data pipelines.
- See Section 13.4 (Quality and Risks of Synthetic Data)
- Model Merging
- Combining weights from multiple fine-tuned models into a single model without additional training. Methods include linear interpolation (LERP), SLERP, TIES, and DARE. Merging creates multi-skill models cheaply, and the open-source community uses it extensively.
- See Section 16.3 (Model Merging)
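The simplest of these methods, linear interpolation, is a one-liner per tensor. The flat weight lists and checkpoint labels below are made up; a real merge applies this across every tensor in the state dict.

```python
def lerp(weights_a, weights_b, t=0.5):
    # (1 - t) of model A plus t of model B, elementwise.
    return [(1 - t) * a + t * b for a, b in zip(weights_a, weights_b)]

model_a = [0.2, -1.0, 0.8]   # e.g. a coding-tuned checkpoint (toy values)
model_b = [0.4,  1.0, 0.0]   # e.g. a chat-tuned checkpoint (toy values)

print(lerp(model_a, model_b))         # elementwise midpoint of the two
print(lerp(model_a, model_b, t=0.2))  # stays closer to model_a
```

SLERP interpolates along the sphere instead of the straight line, while TIES and DARE first resolve sign conflicts and prune redundant deltas before combining.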
- Next-Token Prediction
- The core pretraining objective of autoregressive LLMs: given a token sequence, predict the probability distribution over the next token. Despite its simplicity, this single objective drives the acquisition of grammar, facts, reasoning, and coding abilities.
- See Section 06.1 (Pretraining Objectives)
- Nucleus Sampling (Top-p Sampling)
- A decoding strategy that samples from the smallest set of tokens whose cumulative probability exceeds a threshold p. Unlike fixed top-k, nucleus sampling dynamically adjusts the candidate pool size, producing a better balance of diversity and coherence.
- See Section 05.3 (Sampling Strategies)
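The filtering step can be sketched directly from the definition: sort by probability, keep tokens until the cumulative mass reaches p, renormalize, then sample. The toy distribution below is made up.

```python
import random

def nucleus_filter(probs, p=0.9):
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:           # stop once the nucleus covers p
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}

probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zebra": 0.05}
nucleus = nucleus_filter(probs, p=0.9)
print(nucleus)   # "zebra" falls outside the nucleus; the rest renormalize

random.seed(0)
tokens, weights = zip(*nucleus.items())
print(random.choices(tokens, weights=weights, k=5))
```

On a peaked distribution the nucleus may contain only one or two tokens; on a flat one it can contain hundreds, which is the dynamic behavior fixed top-k lacks.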
- PEFT (Parameter-Efficient Fine-Tuning)
- A family of techniques that fine-tune only a small subset of model parameters while keeping the rest frozen. Methods include LoRA, QLoRA, adapters, and prefix tuning. PEFT dramatically reduces memory and compute requirements, often by 10x or more compared to full fine-tuning.
- See Chapter 15 (Parameter-Efficient Fine-Tuning)
- Prefix Tuning
- A PEFT technique that prepends trainable "virtual tokens" to the input at each transformer layer. Only these prefix parameters are updated during training. Prefix tuning works well for generation tasks but has largely been superseded by LoRA in practice.
- See Section 15.1 (Parameter-Efficient Methods)
- Pretraining
- The initial training phase where a model learns general language representations from a massive unlabeled text corpus. Pretraining is the most expensive stage (millions of GPU hours for frontier models) and produces the foundation model that downstream fine-tuning and alignment build upon.
- See Section 06.1 (Pretraining Objectives)
- QLoRA (Quantized Low-Rank Adaptation)
- A fine-tuning method that combines 4-bit quantization of the base model with LoRA adapters trained in higher precision. QLoRA enables fine-tuning 65B-parameter models on a single 48GB GPU with minimal quality loss, democratizing LLM adaptation.
- See Section 15.2 (QLoRA and Quantized PEFT)
- Quantization
- Reducing the numerical precision of model weights (e.g., from 16-bit to 4-bit) to shrink memory footprint and accelerate inference. Modern quantization formats like GPTQ, AWQ, and GGUF achieve 3 to 4x compression with minimal quality degradation.
- See Section 09.2 (Quantization)
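A symmetric int4 round trip shows the core idea; real formats like GPTQ and AWQ add per-group scales, calibration, and error compensation on top of this. The weight values are made up.

```python
def quantize(weights, bits=4):
    qmax = 2 ** (bits - 1) - 1                     # 7 for int4
    scale = max(abs(w) for w in weights) / qmax    # one scale for the group
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

weights = [0.42, -1.37, 0.08, 0.91, -0.55]
q, scale = quantize(weights)
restored = dequantize(q, scale)
print(q)          # [2, -7, 0, 5, -3]: 4 bits each instead of 16
print(restored)   # approximately the original weights
```

Only the small integers and one scale per group are stored, which is where the 3 to 4x memory savings come from.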
- Reward Model
- A model trained on human preference data to score the quality of LLM outputs. During RLHF, the reward model serves as a proxy for human judgment, providing the training signal that steers the policy model toward more helpful, harmless behavior.
- See Section 17.2 (RLHF)
- RLHF (Reinforcement Learning from Human Feedback)
- An alignment technique that uses a reward model trained on human preferences to fine-tune an LLM via reinforcement learning (typically PPO). RLHF was the key breakthrough behind ChatGPT's helpfulness, but it is complex and expensive, which motivated simpler alternatives like DPO.
- See Section 17.2 (RLHF)
- Scaling Laws
- Empirical relationships showing that model performance improves predictably as a power law with increases in parameters, data, and compute. Scaling laws let researchers estimate performance of larger models before committing the resources to train them.
- See Section 06.3 (Scaling Laws)
- SFT (Supervised Fine-Tuning)
- A training stage where a pretrained model learns from labeled (input, output) pairs, typically high-quality instruction-response datasets. SFT is the first post-pretraining stage and teaches the model to follow instructions before alignment (RLHF or DPO) refines its behavior.
- See Section 14.2 (Supervised Fine-Tuning)
- Speculative Decoding
- An inference optimization where a fast, small "draft" model generates candidate tokens that are then verified in parallel by the larger target model. Accepted tokens skip individual decoding steps, yielding 2 to 3x speedups while provably preserving the target model's output distribution.
- See Section 09.3 (Speculative Decoding)
- Temperature
- A scaling parameter applied to logits before softmax during text generation. Higher values (e.g., 1.5) increase randomness and creativity; lower values (e.g., 0.1) make outputs more deterministic. Temperature 0 is equivalent to greedy decoding.
- See Section 05.3 (Sampling Strategies)
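The effect on the distribution is easy to see numerically: dividing the logits by T before softmax sharpens (T < 1) or flattens (T > 1) the probabilities. The logits below are toy values.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)                      # subtract max for stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
print(softmax_with_temperature(logits, 1.0))   # baseline distribution
print(softmax_with_temperature(logits, 0.1))   # near one-hot: ~greedy
print(softmax_with_temperature(logits, 1.5))   # flatter: more randomness
```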
- Token
- The basic unit of text processed by a language model. Tokens can be words, subwords, or individual characters depending on the tokenizer. A useful heuristic: one token is roughly 3/4 of an English word, or about 4 characters.
- See Section 02.1 (Tokenization Fundamentals)
- Tokenizer
- An algorithm that converts raw text into a sequence of integer token IDs from a fixed vocabulary. Common algorithms include BPE (byte pair encoding), WordPiece, and SentencePiece. The tokenizer is a critical but often overlooked component; a bad tokenizer can bottleneck the entire system.
- See Chapter 02 (Tokenization and Subword Models)
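One merge step of BPE, sketched on characters with a made-up micro-corpus: count adjacent symbol pairs, pick the most frequent, and fuse it into a new vocabulary symbol. Real BPE training repeats this until the target vocabulary size is reached.

```python
from collections import Counter

def most_frequent_pair(words):
    pairs = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if tuple(symbols[i:i + 2]) == pair:
                out.append(symbols[i] + symbols[i + 1])  # fuse into one symbol
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: words as symbol tuples with frequencies.
words = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("s", "l", "o"): 3}
pair = most_frequent_pair(words)
print(pair)                      # ('l', 'o'): appears 10 times
print(merge_pair(words, pair))   # 'lo' is now a single symbol everywhere
```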
- Top-k Sampling
- A decoding strategy that restricts next-token selection to the k most probable tokens before sampling. Top-k prevents the model from selecting very unlikely tokens while maintaining diversity. Typical values range from 10 to 50.
- See Section 05.3 (Sampling Strategies)
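The filter itself is a sort, a slice, and a renormalization. The toy distribution below is made up for illustration.

```python
import random

def top_k_filter(probs, k=2):
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zebra": 0.05}
filtered = top_k_filter(probs, k=2)
print(filtered)   # only "the" and "a" survive, rescaled to sum to 1

random.seed(0)
tokens, weights = zip(*filtered.items())
print(random.choices(tokens, weights=weights, k=5))
```

In a real decoder this runs over logits at every generation step, usually combined with temperature scaling before the sample is drawn.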
- Transfer Learning
- Applying knowledge gained during pretraining on a large general corpus to downstream, task-specific problems. Fine-tuning a pretrained model is the most common form: the model's general language understanding transfers to your specific domain.
- See Section 14.1 (Fine-Tuning Fundamentals)
- Weight Decay
- A regularization technique that penalizes large parameter values. Classically this means adding an L2 term proportional to weight magnitude to the loss; AdamW, the standard LLM optimizer, instead applies the decay directly to the weights, decoupled from the gradient update, which is important for training stability.
- See Section 14.3 (Training Hyperparameters)
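The difference between the two formulations is easiest to see with plain SGD, sketched below with toy numbers. For SGD the gap is tiny; with Adam it matters, because an L2 term folded into the gradient gets divided by the adaptive denominator while decoupled decay does not.

```python
lr, wd = 0.1, 0.01

# L2 regularization: the penalty's gradient (wd * w) is folded into the
# gradient BEFORE the optimizer sees it.
def sgd_l2_step(w, grad):
    return w - lr * (grad + wd * w)

# Decoupled weight decay: the gradient step and the shrink-toward-zero
# step are applied separately (the AdamW formulation).
def sgd_decoupled_step(w, grad):
    w = w - lr * grad
    return w - lr * wd * w

w, grad = 2.0, 0.5
print(sgd_l2_step(w, grad))        # 2 - 0.1*(0.5 + 0.02)  = 1.948
print(sgd_decoupled_step(w, grad)) # (2 - 0.05)*(1 - 0.001) = 1.94805
```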
- Zero-Shot Chain-of-Thought
- A prompting technique that elicits reasoning by appending "Let's think step by step" to a zero-shot prompt, without any worked examples. This simple addition can dramatically improve performance on reasoning tasks with no extra engineering effort.
- See Section 11.2 (Advanced Prompting Techniques)