Part I: Foundations
Chapter 02: Tokenization & Subword Models

"The first step toward understanding a language is deciding where one word ends and another begins."

Token Token, Boundary-Obsessed AI Agent
Figure 2.0.1: Before a language model can read a single word, it must first slice text into pieces; tokenization is the invisible gatekeeper that decides where one "word" ends and the next begins.

Chapter Overview

Before a language model can process a single word, it must first decide what a "word" even means. Tokenization is the gateway between raw text and the numerical world of neural networks, and the choices made at this stage ripple through every aspect of model behavior: the languages it handles well, the cost of running it, the errors it makes, and the size of its context window.

This chapter starts by building intuition for why tokenization matters so much, exploring the fundamental tradeoff between vocabulary size and sequence length. We then take a deep dive into the algorithms that power modern tokenizers: Byte Pair Encoding, WordPiece, Unigram, and their byte-level variants. Along the way, you will implement BPE from scratch and compare tokenizers across languages and modalities. Finally, we examine practical concerns: special tokens, chat templates, multilingual fertility, multimodal tokenization, and how tokenization directly impacts your API bill.
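As a preview of the from-scratch implementation later in this chapter, the heart of BPE training is a two-step loop: count every adjacent pair of symbols in the corpus, then merge the most frequent pair into a single new symbol. The sketch below illustrates that loop on a toy corpus; the helper names and corpus are illustrative only, not the chapter's final code:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a corpus.

    `words` maps a tuple of symbols to its corpus frequency.
    """
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: each word is split into characters, with a frequency.
corpus = {tuple("lower"): 5, tuple("lowest"): 2,
          tuple("newer"): 6, tuple("wider"): 3}
for _ in range(3):  # learn three merges
    pair = most_frequent_pair(corpus)
    corpus = merge_pair(corpus, pair)
    print("merged:", pair)
# learned merges: ('e', 'r'), then ('w', 'er'), then ('l', 'o')
```

Each merge adds one entry to the vocabulary and shortens the sequences it applies to, which is the vocabulary-size versus sequence-length tradeoff in miniature.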

What's Next?

In the next section, Section 2.1: Why Tokenization Matters, we examine how tokenization choices shape model performance, vocabulary efficiency, and multilingual coverage.

Bibliography & Further Reading

Foundational Papers

Gage, P. (1994). "A New Algorithm for Data Compression." C Users Journal, 12(2), 23–38. en.wikipedia.org/wiki/Byte_pair_encoding
The original BPE algorithm for data compression, later adapted by Sennrich et al. for subword tokenization in NLP.
Sennrich, R., Haddow, B., & Birch, A. (2016). "Neural Machine Translation of Rare Words with Subword Units." ACL 2016. arxiv.org/abs/1508.07909
Applies BPE to machine translation, demonstrating that subword tokenization elegantly handles rare and compound words.
Kudo, T. (2018). "Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates." ACL 2018. arxiv.org/abs/1804.10959
Introduces the Unigram language model tokenizer and subword regularization, showing that sampling multiple segmentations improves robustness.
Schuster, M. & Nakajima, K. (2012). "Japanese and Korean Voice Search." IEEE ICASSP 2012. doi.org/10.1109/ICASSP.2012.6289079
Introduces the WordPiece algorithm used in BERT and many Google models for vocabulary construction.
Xue, L. et al. (2022). "ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models." TACL, 10, 291–306. arxiv.org/abs/2105.13626
Explores tokenizer-free models that operate directly on UTF-8 bytes, trading longer sequences for elimination of vocabulary-related biases.
Yu, L. et al. (2023). "MEGABYTE: Predicting Million-Byte Sequences with Multiscale Transformers." NeurIPS 2023. arxiv.org/abs/2305.07185
Proposes a multi-scale architecture for byte-level modeling, enabling efficient processing of very long byte sequences.

Key Books & Papers

Jurafsky, D. & Martin, J. H. (2024). Speech and Language Processing (3rd ed. draft), Chapter 2: Regular Expressions, Text Normalization, Edit Distance. web.stanford.edu/~jurafsky/slp3/2.pdf
Covers text normalization, word segmentation, and the linguistic foundations of tokenization in an accessible format.
Kudo, T. & Richardson, J. (2018). "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing." EMNLP 2018. arxiv.org/abs/1808.06226
Describes the SentencePiece library that implements BPE and Unigram tokenization in a language-agnostic manner, used by many modern LLMs.

Tools & Libraries

Hugging Face Tokenizers. github.com/huggingface/tokenizers
A fast, Rust-backed tokenizer library supporting BPE, WordPiece, and Unigram, with Python bindings and integration into the Transformers ecosystem.
tiktoken: OpenAI's fast BPE tokenizer. github.com/openai/tiktoken
OpenAI's open-source BPE tokenizer used by GPT-3.5 and GPT-4, useful for estimating token counts and API costs.
SentencePiece. github.com/google/sentencepiece
Google's unsupervised text tokenizer supporting BPE and Unigram models, used in T5, LLaMA, and many multilingual models.