Part I: Foundations
Chapter 03: Sequence Models & Attention

Sequence Models & the Attention Mechanism

"A sequence model without attention is like a student who reads an entire textbook, then tries to answer questions from the one sentence they can still remember."

Figure 3.0.1: From the forgetful telegraph operator of the RNN to the spotlight of attention, this chapter traces the breakthrough idea that lets a model learn where to look instead of compressing everything into a single vector.

Chapter Overview

This chapter traces one of the most important arcs in deep learning history: the journey from recurrent neural networks to the attention mechanism. We begin with the workhorse of early sequence modeling, the RNN, and uncover why its sequential nature creates both mathematical and practical bottlenecks. Then we introduce the attention mechanism, the breakthrough idea that lets a model learn where to look in a source sequence rather than compressing everything into a single fixed-size vector. Finally, we formalize attention using the query-key-value framework and build multi-head attention, the engine that powers the Transformer architecture you will study in Chapter 04.

Understanding this progression is essential. You cannot fully appreciate why Transformers revolutionized NLP without first understanding the limitations they were designed to overcome. Each section builds directly on the last, and by the end of this chapter you will have implemented attention from scratch and be ready to assemble the full Transformer.
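To preview where the chapter is headed, here is a minimal NumPy sketch of the scaled dot-product attention you will derive and implement in full later. The function name and the toy shapes are illustrative choices, not notation from this chapter; the formula, softmax(QKᵀ/√d_k)·V, is the one introduced by Vaswani et al. (2017), listed in the bibliography below.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Return softmax(Q K^T / sqrt(d_k)) V and the attention weights."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # query-key similarities
    scores -= scores.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V, weights                        # weighted sum of values

# Toy example: 2 queries attending over 3 key/value pairs, d_k = 4
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))

out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)       # one context vector per query: (2, 4)
print(w.sum(axis=-1))  # each row of attention weights sums to 1
```

The key property to notice, and the one that resolves the RNN bottleneck, is that each output row is a learned weighted average over *all* value vectors, so no information has to be squeezed through a single hidden state.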

Prerequisites

Learning Objectives

Sections

What's Next?

In the next section, Section 3.1: Recurrent Neural Networks & Their Limitations, we start by examining recurrent neural networks and the fundamental limitations that motivated the search for better architectures.

Bibliography & Further Reading

Foundational Papers

Hochreiter, S. & Schmidhuber, J. (1997). "Long Short-Term Memory." Neural Computation, 9(8), 1735–1780. doi.org/10.1162/neco.1997.9.8.1735
The foundational LSTM paper that introduced gating mechanisms to address the vanishing gradient problem in recurrent networks.
Cho, K. et al. (2014). "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation." EMNLP 2014. arxiv.org/abs/1406.1078
Introduces the GRU cell and the encoder-decoder architecture for sequence-to-sequence learning.
Bahdanau, D., Cho, K., & Bengio, Y. (2015). "Neural Machine Translation by Jointly Learning to Align and Translate." ICLR 2015. arxiv.org/abs/1409.0473
The landmark paper introducing additive attention for neural machine translation, allowing the model to learn soft alignment between source and target.
Luong, M.-T., Pham, H., & Manning, C. D. (2015). "Effective Approaches to Attention-based Neural Machine Translation." EMNLP 2015. arxiv.org/abs/1508.04025
Presents dot-product and general attention variants, comparing global and local attention strategies for translation.
Sutskever, I., Vinyals, O., & Le, Q. V. (2014). "Sequence to Sequence Learning with Neural Networks." NeurIPS 2014. arxiv.org/abs/1409.3215
Demonstrates that deep LSTMs can map variable-length input sequences to variable-length output sequences for machine translation.
Vaswani, A. et al. (2017). "Attention Is All You Need." NeurIPS 2017. arxiv.org/abs/1706.03762
Introduces the Transformer and scaled dot-product multi-head attention, eliminating recurrence entirely in favor of self-attention.

Key Books

Jurafsky, D. & Martin, J. H. (2024). Speech and Language Processing (3rd ed. draft), Chapters 9–10. web.stanford.edu/~jurafsky/slp3
Covers RNNs, LSTMs, encoder-decoder models, and the attention mechanism with clear notation and worked examples.
Goldberg, Y. (2017). Neural Network Methods for Natural Language Processing. Morgan & Claypool. doi.org/10.2200/S00762ED1V01Y201703HLT037
Provides an accessible treatment of RNNs, LSTMs, and conditioned generation for NLP practitioners.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning, Chapter 10: Sequence Modeling. deeplearningbook.org/contents/rnn.html
A rigorous treatment of recurrent networks, backpropagation through time, and the vanishing gradient problem.

Tools & Libraries

PyTorch RNN/LSTM Documentation. pytorch.org/docs/stable/nn.html#recurrent-layers
Official PyTorch reference for nn.RNN, nn.LSTM, and nn.GRU, including bidirectional and multi-layer configurations.
The Annotated Transformer (Harvard NLP). nlp.seas.harvard.edu/annotated-transformer
A line-by-line annotated implementation of the original Transformer paper in PyTorch, excellent for understanding multi-head attention.
Olah, C. (2015). "Understanding LSTM Networks." colah.github.io/posts/2015-08-Understanding-LSTMs
A widely cited visual explanation of LSTM gating mechanisms, ideal for building intuition before diving into code.