Part I: Foundations
Chapter 05: Decoding & Text Generation

Decoding Strategies & Text Generation

"The question is not what the model knows, but what you let it say."

Figure 5.0.1: A probability distribution alone does not produce text. Decoding strategies are the bridge between a trained model and the words it generates, each method trading off creativity, coherence, and speed.

Chapter Overview

A language model learns a probability distribution over sequences of tokens, but that distribution alone does not produce text. The bridge between a trained transformer model and the words it generates is the decoding strategy: the algorithm that selects which token comes next (or, in newer paradigms, which tokens appear all at once). The choice of decoding method profoundly affects quality, diversity, coherence, speed, and even the safety of generated output.
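To make the idea concrete, here is a minimal sketch (not tied to any particular model's API) of the simplest possible decoding step: converting a vector of logits into a probability distribution and greedily selecting the single most probable next token.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the vocabulary dimension.
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def greedy_next_token(logits):
    # Greedy decoding: always pick the single most probable token.
    return int(np.argmax(logits))

# Toy logits over a 5-token vocabulary.
logits = np.array([1.0, 3.2, 0.5, 2.9, -1.0])
probs = softmax(logits)          # full distribution the model defines
token = greedy_next_token(logits)  # the decoding strategy's choice
```

In a real system this step runs inside a loop: the chosen token is appended to the context and fed back into the model to produce the next distribution.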

This chapter walks through the full landscape of text generation, from the simplest deterministic methods (greedy search, beam search) through stochastic sampling techniques (temperature, top-k, top-p, min-p) to advanced and emerging approaches (contrastive decoding, speculative decoding, structured generation, watermarking, and diffusion-based language models). By the end, you will understand not just what each method does, but when and why to choose one over another.
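As a preview of the stochastic methods covered later, the following sketch (an illustrative implementation under simplifying assumptions, not a production decoder) combines two of the techniques named above: temperature scaling reshapes the distribution, and nucleus (top-p) filtering restricts sampling to the smallest set of tokens whose cumulative probability reaches the threshold.

```python
import numpy as np

def top_p_sample(logits, temperature=0.8, top_p=0.9, rng=None):
    # Temperature scaling: <1 sharpens the distribution, >1 flattens it.
    z = logits / temperature
    z = z - z.max()
    probs = np.exp(z) / np.exp(z).sum()
    # Nucleus (top-p) filtering: keep the smallest set of tokens whose
    # cumulative probability mass reaches top_p.
    order = np.argsort(probs)[::-1]          # tokens, most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1
    keep = order[:cutoff]
    kept = probs[keep] / probs[keep].sum()   # renormalize over the nucleus
    rng = rng or np.random.default_rng(0)
    return int(rng.choice(keep, p=kept))

logits = np.array([4.0, 3.5, 1.0, 0.2, -2.0])
token = top_p_sample(logits)  # always drawn from the nucleus
```

Setting `top_p=1.0` recovers plain temperature sampling, while `temperature → 0` collapses toward greedy search, which hints at how these strategies sit on a single spectrum between determinism and diversity.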

What's Next?

In the next section, Section 5.1: Deterministic Decoding Strategies, we examine greedy search and beam search, along with their strengths and trade-offs.

Bibliography & Further Reading

Foundational Papers

Holtzman, A. et al. (2020). "The Curious Case of Neural Text Degeneration." ICLR 2020. arxiv.org/abs/1904.09751
Introduces nucleus (top-p) sampling and provides a thorough analysis of why greedy and beam search produce degenerate, repetitive text.
Fan, A., Lewis, M., & Dauphin, Y. (2018). "Hierarchical Neural Story Generation." ACL 2018. arxiv.org/abs/1805.04833
Introduces top-k sampling for open-ended text generation, demonstrating improved diversity over beam search for creative writing.
Li, X. L. et al. (2023). "Contrastive Decoding: Open-ended Text Generation as Optimization." ACL 2023. arxiv.org/abs/2210.15097
Proposes contrastive decoding, which improves generation quality by penalizing tokens favored by a weaker "amateur" model.
Leviathan, Y., Kalman, M., & Matias, Y. (2023). "Fast Inference from Transformers via Speculative Decoding." ICML 2023. arxiv.org/abs/2211.17192
Introduces speculative decoding, using a small draft model to propose tokens verified in parallel by the large model for lossless speedup.
Kirchenbauer, J. et al. (2023). "A Watermark for Large Language Models." ICML 2023. arxiv.org/abs/2301.10226
Presents a statistical watermarking scheme for LLM outputs using pseudorandom "green list" token biasing, enabling reliable detection.
Sahoo, S. et al. (2024). "Simple and Effective Masked Diffusion Language Models." arxiv.org/abs/2406.07524
Introduces MDLM, a masked diffusion approach that generates text by iteratively denoising masked tokens in parallel.

Key Books & Papers

Jurafsky, D. & Martin, J. H. (2024). Speech and Language Processing (3rd ed. draft), Chapter 10: Language Models and Decoding. web.stanford.edu/~jurafsky/slp3
Covers autoregressive generation, beam search, and sampling strategies with clear mathematical exposition.
Meister, C. et al. (2020). "If Beam Search Is the Answer, What Was the Question?" EMNLP 2020. arxiv.org/abs/2010.02650
A theoretical analysis of beam search as regularized decoding, providing insights into when and why beam search works (or fails).

Tools & Libraries

Hugging Face Transformers: Text Generation. huggingface.co/docs/transformers
Official documentation for the generate() API in Hugging Face Transformers, supporting all major decoding strategies out of the box.
Outlines: Structured Generation for LLMs. github.com/outlines-dev/outlines
A library for grammar-constrained and JSON schema-enforced decoding, ensuring every generated token conforms to a specified structure.
vLLM: High-throughput LLM Serving. github.com/vllm-project/vllm
A fast inference engine supporting continuous batching, PagedAttention, and speculative decoding for production LLM serving.
llama.cpp: Efficient LLM Inference in C/C++. github.com/ggerganov/llama.cpp
A C/C++ implementation for efficient local LLM inference, supporting various sampling strategies including min-p, mirostat, and grammar-constrained generation.