Part I: Foundations
Chapter 05: Decoding & Text Generation

Decoding Strategies & Text Generation

"The question is not what the model knows, but what you let it say."

Figure 5.0.1: A probability distribution alone does not produce text. Decoding strategies are the bridge between a trained model and the words it generates, each method trading off creativity, coherence, and speed.

Chapter Overview

A language model learns a probability distribution over sequences of tokens, but that distribution alone does not produce text. The bridge between a trained transformer model and the words it generates is the decoding strategy: the algorithm that selects which token comes next (or, in newer paradigms, which tokens appear all at once). The choice of decoding method profoundly affects quality, diversity, coherence, speed, and even the safety of generated output.
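To make the idea concrete, here is a minimal sketch (not tied to any particular model's API) of the simplest possible decoding step: converting a vector of logits into a probability distribution and greedily selecting the single most probable next token.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the vocabulary dimension.
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def greedy_next_token(logits):
    # Greedy decoding: always pick the single most probable token.
    return int(np.argmax(logits))

# Toy logits over a 5-token vocabulary.
logits = np.array([1.0, 3.2, 0.5, 2.9, -1.0])
probs = softmax(logits)          # full distribution the model defines
token = greedy_next_token(logits)  # the decoding strategy's choice
```

In a real system this step runs inside a loop: the chosen token is appended to the context and fed back into the model to produce the next distribution.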

This chapter walks through the full landscape of text generation, from the simplest deterministic methods (greedy search, beam search) through stochastic sampling techniques (temperature, top-k, top-p, min-p) to advanced and emerging approaches (contrastive decoding, speculative decoding, structured generation, watermarking, and diffusion-based language models). By the end, you will understand not just what each method does, but when and why to choose one over another.
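As a preview of the stochastic methods covered later, the following sketch (an illustrative implementation under simplifying assumptions, not a production decoder) combines two of the techniques named above: temperature scaling reshapes the distribution, and nucleus (top-p) filtering restricts sampling to the smallest set of tokens whose cumulative probability reaches the threshold.

```python
import numpy as np

def top_p_sample(logits, temperature=0.8, top_p=0.9, rng=None):
    # Temperature scaling: <1 sharpens the distribution, >1 flattens it.
    z = logits / temperature
    z = z - z.max()
    probs = np.exp(z) / np.exp(z).sum()
    # Nucleus (top-p) filtering: keep the smallest set of tokens whose
    # cumulative probability mass reaches top_p.
    order = np.argsort(probs)[::-1]          # tokens, most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1
    keep = order[:cutoff]
    kept = probs[keep] / probs[keep].sum()   # renormalize over the nucleus
    rng = rng or np.random.default_rng(0)
    return int(rng.choice(keep, p=kept))

logits = np.array([4.0, 3.5, 1.0, 0.2, -2.0])
token = top_p_sample(logits)  # always drawn from the nucleus
```

Setting `top_p=1.0` recovers plain temperature sampling, while `temperature → 0` collapses toward greedy search, which hints at how these strategies sit on a single spectrum between determinism and diversity.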

What's Next?

In the next section, Section 5.1: Deterministic Decoding Strategies, we examine greedy search and beam search, along with their strengths and trade-offs.

Bibliography & Further Reading

Foundational Papers

Holtzman, A. et al. (2020). "The Curious Case of Neural Text Degeneration." ICLR 2020. arxiv.org/abs/1904.09751
Introduces nucleus (top-p) sampling and provides a thorough analysis of why greedy and beam search produce degenerate, repetitive text.
Fan, A., Lewis, M., & Dauphin, Y. (2018). "Hierarchical Neural Story Generation." ACL 2018. arxiv.org/abs/1805.04833
Introduces top-k sampling for open-ended text generation, demonstrating improved diversity over beam search for creative writing.
Li, X. L. et al. (2023). "Contrastive Decoding: Open-ended Text Generation as Optimization." ACL 2023. arxiv.org/abs/2210.15097
Proposes contrastive decoding, which improves generation quality by penalizing tokens favored by a weaker "amateur" model.
Leviathan, Y., Kalman, M., & Matias, Y. (2023). "Fast Inference from Transformers via Speculative Decoding." ICML 2023. arxiv.org/abs/2211.17192
Introduces speculative decoding, using a small draft model to propose tokens verified in parallel by the large model for lossless speedup.
Kirchenbauer, J. et al. (2023). "A Watermark for Large Language Models." ICML 2023. arxiv.org/abs/2301.10226
Presents a statistical watermarking scheme for LLM outputs using pseudorandom "green list" token biasing, enabling reliable detection.
Sahoo, S. et al. (2024). "Simple and Effective Masked Diffusion Language Models." arxiv.org/abs/2406.07524
Introduces MDLM, a masked diffusion approach that generates text by iteratively denoising masked tokens in parallel.

Key Books & Papers

Jurafsky, D. & Martin, J. H. (2024). Speech and Language Processing (3rd ed. draft), Chapter 10: Language Models and Decoding. web.stanford.edu/~jurafsky/slp3
Covers autoregressive generation, beam search, and sampling strategies with clear mathematical exposition.
Meister, C. et al. (2020). "If Beam Search Is the Answer, What Was the Question?" EMNLP 2020. arxiv.org/abs/2010.02650
A theoretical analysis of beam search as regularized decoding, providing insights into when and why beam search works (or fails).

Tools & Libraries

Hugging Face Transformers: Text Generation. huggingface.co/docs/transformers
Official documentation for the generate() API in Hugging Face Transformers, supporting all major decoding strategies out of the box.
Outlines: Structured Generation for LLMs. github.com/outlines-dev/outlines
A library for grammar-constrained and JSON schema-enforced decoding, ensuring every generated token conforms to a specified structure.
vLLM: High-throughput LLM Serving. github.com/vllm-project/vllm
A fast inference engine supporting continuous batching, PagedAttention, and speculative decoding for production LLM serving.
llama.cpp: Efficient LLM Inference in C/C++. github.com/ggerganov/llama.cpp
A C/C++ implementation for efficient local LLM inference, supporting various sampling strategies including min-p, mirostat, and grammar-constrained generation.