Chapter 03: Sequence Models & the Attention Mechanism

"A sequence model without attention is like a student who reads an entire textbook, then tries to answer questions from the one sentence they can still remember."
Attn, Bottlenecked AI Agent

**Figure 3.0.1**: From the forgetful telegraph operator of the RNN to the spotlight of attention, this chapter traces the breakthrough idea that lets a model learn *where to look* instead of compressing everything into a single vector.

Chapter Overview

This chapter traces one of the most important arcs in deep learning history: the journey from recurrent neural networks to the attention mechanism. We begin with the workhorse of early sequence modeling, the RNN, and uncover why its sequential nature creates both a mathematical bottleneck (gradients that vanish over long sequences) and a practical one (time steps that must be computed one after another rather than in parallel). Then we introduce the attention mechanism, the breakthrough idea that lets a model learn where to look in a source sequence rather than compressing everything into a single fixed vector. Finally, we formalize attention using the query, key, value framework and build multi-head attention, the engine that powers the Transformer architecture you will study in Chapter 04.

Understanding this progression is essential. You cannot fully appreciate why Transformers revolutionized NLP without first understanding the limitations they were designed to overcome. Each section builds directly on the last, and by the end of this chapter you will have implemented attention from scratch and be ready to assemble the full Transformer.
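
As a rough preview of that contrast, the sketch below compares a fixed-vector summary (the final encoder state) with an attention-weighted context vector. It uses NumPy rather than PyTorch to stay dependency-light. The names `encoder_states`, `decoder_state`, and `attend`, the random values, and the plain dot-product scoring are illustrative assumptions, not the implementation developed later in the chapter.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    shifted = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def attend(decoder_state, encoder_states):
    """Return (context, weights) using dot-product scores (an assumed scoring choice)."""
    scores = encoder_states @ decoder_state   # one score per source position
    weights = softmax(scores)                 # non-negative, sums to 1
    context = weights @ encoder_states        # weighted average of all encoder states
    return context, weights

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(5, 4))      # 5 source positions, hidden size 4
decoder_state = rng.normal(size=4)

# Bottleneck: only the final encoder state is handed to the decoder.
fixed_vector = encoder_states[-1]

# Attention: every encoder state contributes, weighted by relevance to the decoder state.
context, weights = attend(decoder_state, encoder_states)

print("attention weights:", np.round(weights, 3))
print("context shape:", context.shape)
```

The fixed vector is the same no matter what the decoder is currently producing, while the context vector is recomputed for each decoder state, which is the sense in which the model learns where to look.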

Prerequisites

Chapter 00: Backpropagation, chain rule, gradient descent, PyTorch basics
Chapter 01: Word embeddings, distributional semantics, vector representations of text
Chapter 02: Tokenization, subword vocabularies, input representation pipelines
Linear Algebra: Matrix multiplication, softmax, dot products, projections

Learning Objectives

Explain how RNNs process sequences and why vanishing gradients limit their effectiveness on long sequences
Describe the gating mechanisms of LSTM and GRU cells and explain how they mitigate the vanishing gradient problem
Derive Bahdanau additive attention and Luong dot-product attention, and explain how backpropagation flows through the attention layer
Define the query, key, value abstraction and compute scaled dot-product attention from first principles
Implement multi-head self-attention in PyTorch, including causal masking for autoregressive generation
Analyze the O(n²) time and memory complexity of self-attention in the sequence length n and explain why it limits context length
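
Several of these objectives meet in one small computation. The following is a minimal NumPy sketch of scaled dot-product attention with an optional causal mask, assuming single-head, unbatched inputs of shape (n, d). The function name `scaled_dot_product_attention` and the `causal` flag are chosen here for illustration; the chapter's own PyTorch implementation may be organized differently.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, causal=False):
    """Compute softmax(Q K^T / sqrt(d_k)) V for 2-D arrays of shape (n, d)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # (n, n) matrix: the source of the O(n^2) cost
    if causal:
        # Position i may attend only to positions j <= i, so mask the strict upper triangle.
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    # Row-wise softmax; subtracting the row max keeps the exponentials stable.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d_k = 4, 8
Q, K, V = (rng.normal(size=(n, d_k)) for _ in range(3))

out, weights = scaled_dot_product_attention(Q, K, V, causal=True)
print(weights.shape)          # (4, 4)
print(np.round(weights, 2))   # entries above the diagonal are 0
```

Doubling `n` quadruples the number of entries in `scores`, which is the quadratic growth the last objective refers to.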

Sections

What's Next?

Next, Section 3.1: Recurrent Neural Networks & Their Limitations examines recurrent neural networks and the fundamental limitations that motivated the search for better architectures.

Bibliography & Further Reading

Foundational Papers

Hochreiter, S. & Schmidhuber, J. (1997). "Long Short-Term Memory." Neural Computation, 9(8), 1735–1780. doi.org/10.1162/neco.1997.9.8.1735

The foundational LSTM paper that introduced gating mechanisms to address the vanishing gradient problem in recurrent networks.

Cho, K. et al. (2014). "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation." EMNLP 2014. arxiv.org/abs/1406.1078

Introduces the GRU cell and the encoder-decoder architecture for sequence-to-sequence learning.

Bahdanau, D., Cho, K., & Bengio, Y. (2015). "Neural Machine Translation by Jointly Learning to Align and Translate." ICLR 2015. arxiv.org/abs/1409.0473

The landmark paper introducing additive attention for neural machine translation, allowing the model to learn soft alignment between source and target.

Luong, M.-T., Pham, H., & Manning, C. D. (2015). "Effective Approaches to Attention-based Neural Machine Translation." EMNLP 2015. arxiv.org/abs/1508.04025

Presents dot-product and general attention variants, comparing global and local attention strategies for translation.

Sutskever, I., Vinyals, O., & Le, Q. V. (2014). "Sequence to Sequence Learning with Neural Networks." NeurIPS 2014. arxiv.org/abs/1409.3215

Demonstrates that deep LSTMs can map variable-length input sequences to variable-length output sequences for machine translation.

Vaswani, A. et al. (2017). "Attention Is All You Need." NeurIPS 2017. arxiv.org/abs/1706.03762

Introduces the Transformer and scaled dot-product multi-head attention, eliminating recurrence entirely in favor of self-attention.

Key Books

Jurafsky, D. & Martin, J. H. (2024). Speech and Language Processing (3rd ed. draft), Chapters 9–10. web.stanford.edu/~jurafsky/slp3

Covers RNNs, LSTMs, encoder-decoder models, and the attention mechanism with clear notation and worked examples.

Goldberg, Y. (2017). Neural Network Methods for Natural Language Processing. Morgan & Claypool. doi.org/10.2200/S00762ED1V01Y201703HLT037

Provides an accessible treatment of RNNs, LSTMs, and conditioned generation for NLP practitioners.

Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning, Chapter 10: Sequence Modeling. deeplearningbook.org/contents/rnn.html

A rigorous treatment of recurrent networks, backpropagation through time, and the vanishing gradient problem.

Tools & Libraries

PyTorch RNN/LSTM Documentation. pytorch.org/docs/stable/nn.html#recurrent-layers

Official PyTorch reference for nn.RNN, nn.LSTM, and nn.GRU, including bidirectional and multi-layer configurations.

The Annotated Transformer (Harvard NLP). nlp.seas.harvard.edu/annotated-transformer

A line-by-line annotated implementation of the original Transformer paper in PyTorch, excellent for understanding multi-head attention.

Olah, C. (2015). "Understanding LSTM Networks." colah.github.io/posts/2015-08-Understanding-LSTMs

A widely cited visual explanation of LSTM gating mechanisms, ideal for building intuition before diving into code.