Foundations of NLP & Text Representation

Chapter opener illustration: Foundations of NLP & Text Representation.

"A word is characterized by the company it keeps."

Lexica, Distributional AI Agent

Looking Back

Chapter 0 gave you the PyTorch foundations: tensors, autograd, the training loop. Now we turn from generic deep learning to the specific problem this book is about: how do you represent text as numbers a model can learn from? The chapter starts with the classical answers (bag-of-words, n-grams, TF-IDF), shows what they got wrong, and ends with the embedding revolution (Word2Vec, GloVe) that made transformers possible.

Chapter Overview

Every modern AI system that can read, write, or converse builds on the ideas in this chapter.

How do machines learn to read? This chapter traces the evolution of text representation from counting words to understanding meaning. Building on the neural network and optimization fundamentals from Chapter 00: ML and PyTorch Foundations, we start with the fundamental challenge of turning raw human language into numbers, work through classical techniques like Bag-of-Words and TF-IDF, then explore the revolution sparked by Word2Vec and dense word embeddings.
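To make "counting words" concrete before the full treatment later in the chapter, here is a minimal sketch of Bag-of-Words and TF-IDF in plain Python. The toy corpus, the whitespace tokenizer, and the unsmoothed `log(N / df)` weighting are assumptions chosen for illustration; libraries typically use smoothed variants and more careful tokenization.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cats and the dogs are pets",
]

def tokenize(text):
    # Naive whitespace tokenizer; real pipelines also handle punctuation etc.
    return text.lower().split()

tokenized = [tokenize(d) for d in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})

def bag_of_words(tokens):
    # One raw count per vocabulary word, in vocabulary order.
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

def idf(word):
    # Unsmoothed inverse document frequency: log(N / df).
    df = sum(1 for doc in tokenized if word in doc)
    return math.log(len(tokenized) / df)

def tf_idf(tokens):
    # Term frequency (count / document length) scaled by IDF.
    counts = Counter(tokens)
    return [counts[w] / len(tokens) * idf(w) for w in vocab]

bow = [bag_of_words(doc) for doc in tokenized]
tfidf = [tf_idf(doc) for doc in tokenized]

the_idx, cat_idx = vocab.index("the"), vocab.index("cat")
print("count of 'the' in doc 0:", bow[0][the_idx])
print("tf-idf of 'the' in doc 0:", tfidf[0][the_idx])
print("tf-idf of 'cat' in doc 0:", round(tfidf[0][cat_idx], 3))
```

In this sketch "the" appears in every document, so its IDF is zero and TF-IDF discards it, while "cat" appears in only one document and keeps a positive weight. That is the basic intuition behind down-weighting common words.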

Along the way, you will build a complete text preprocessing pipeline, train word embeddings from scratch, explore the famous king/queen analogy, and see how contextual embeddings (ELMo) paved the road to the transformer models covered in Chapter 04: Transformer Architecture. Understanding this progression is essential: much of the history of NLP is a quest for better representations of meaning, and each technique you learn here is a building block for everything that follows.

Fun Fact: Before Computers Could Read

For most of human history, reading meant a person staring at marks on a surface and decoding them through years of practice. Teaching a computer to read took three big tricks: turn marks into numbers (tokenization), turn numbers into meaningful coordinates (embeddings), and let those coordinates shift with context (contextual embeddings). This chapter is the story of how those three tricks went from research curiosities to the bedrock of every modern LLM.

Big Picture

Every LLM is built on top of representations of text: how you turn words into numbers determines what the model can learn. This chapter traces the path from one-hot vectors through word embeddings to contextual representations, the conceptual ancestors of the transformer attention mechanism in Chapter 3. Understanding why these representations evolved as they did is the fastest way to build intuition for everything that follows.
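A small sketch shows why the move from one-hot vectors to dense embeddings matters. The three-word vocabulary and the 2-D "embedding" values below are hand-picked for illustration, not learned; real embeddings come out of training and have many more dimensions.

```python
import math

vocab = ["king", "queen", "apple"]

def one_hot(word):
    # A vector of zeros with a single 1 at the word's vocabulary index.
    return [1.0 if w == word else 0.0 for w in vocab]

# Hypothetical 2-D dense vectors, invented for this example.
dense = {
    "king": [0.9, 0.8],
    "queen": [0.8, 0.9],
    "apple": [-0.7, 0.1],
}

def cosine(u, v):
    # Cosine similarity: dot product divided by the product of the norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

one_hot_sim = cosine(one_hot("king"), one_hot("queen"))
dense_sim = cosine(dense["king"], dense["queen"])
dense_far = cosine(dense["king"], dense["apple"])

print("one-hot king vs queen:", one_hot_sim)
print("dense   king vs queen:", round(dense_sim, 3))
print("dense   king vs apple:", round(dense_far, 3))
```

Any two distinct one-hot vectors are orthogonal, so their similarity is always zero and the representation cannot say that "king" is closer to "queen" than to "apple". Dense vectors can encode that kind of graded similarity, which is what the model then learns from.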

Note: Learning Objectives

Prerequisites

Sections

What's Next?

Next: Chapter 2: Sequence Models & the Attention Mechanism. Tokens and embeddings give us a way to represent text as vectors, but text is not a bag of words: order matters, and faraway tokens often depend on each other. Chapter 2 traces the path from recurrent networks (which forget) to attention (which lets every position look at every other position). By the end you will see exactly why "Attention Is All You Need" is more than marketing.
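The claim that order matters can be checked in a few lines. This is a minimal sketch with two invented sentences: they have identical bags of words, yet they mean different things, and even simple bigrams already tell them apart.

```python
from collections import Counter

# Two toy sentences with the same words in a different order.
a = "dog bites man".split()
b = "man bites dog".split()

# Bag-of-words view: only counts survive, so the sentences look identical.
same_bag = Counter(a) == Counter(b)

def bigrams(tokens):
    # Adjacent token pairs keep a small amount of word-order information.
    return list(zip(tokens, tokens[1:]))

same_bigrams = bigrams(a) == bigrams(b)

print("same bag of words:", same_bag)
print("same bigrams:", same_bigrams)
```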