Contextual Embeddings: ELMo & the Path to Transformers

Section 1.4

A word's meaning depends on context, they said. So does my performance review, but nobody built ELMo for that.

Lexica, Context-Dependent AI Agent
Big Picture: Why Contextual Embeddings Matter

Static embeddings (Word2Vec, GloVe) assign one fixed vector per word, ignoring context entirely. Contextual embeddings, popularized by ELMo, remove this limitation by producing different vectors for the same word in different sentences. This idea of context-dependent representation became the foundation for every modern LLM, from BERT to GPT-4. Understanding how ELMo works illuminates the "pretrain then fine-tune" paradigm that dominates NLP today and sets the stage for the Transformer architecture in Section 3.1.

Prerequisites

This section assumes you understand static word embeddings (Word2Vec, GloVe) from Section 1.3 and their key limitation: one vector per word regardless of context. The "pretrain then fine-tune" paradigm introduced in this section is a key bridge to the transformer architecture covered later in the book.

The Polysemy Problem

Word2Vec, GloVe, and FastText share a fundamental limitation: each word gets exactly one vector, regardless of context. But language is deeply contextual.

Fun Fact: One Word, Many Personalities

Think of the word "bank" as an actor who changes costume between scenes: a financial institution in one act, a riverbank in another, a verb of trust in a third. Word2Vec hired one actor and asked them to play all three roles in the same outfit, which is why the result feels blurry. ELMo finally let the actor change costumes between scenes, which is the entire reason contextual embeddings won.

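Consider the word "bank", which can name a financial institution, the land beside a river, or (as a verb) the act of relying on something.

With Word2Vec, all three uses map to the same vector: a compromise that captures none of the meanings well. This is called the polysemy problem.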

Illustration of the polysemy problem showing the word bank with multiple meanings (financial institution, river bank, to rely on) all mapped to a single blurred vector in static embeddings
Figure 1.4.1: The polysemy trap. Static embeddings assign one vector to "bank" regardless of context, producing a blurry average that captures none of its distinct meanings well.
Static vs contextual embeddings: one vector per word vs different vectors per usage
Figure 1.4.2: Static vs. contextual embeddings. Word2Vec assigns one fixed vector per word regardless of context, while contextual models produce different vectors for each usage.
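
To make the limitation concrete, here is a minimal sketch of a static lookup table. The three-dimensional vectors and the `static_embed` helper are made up for illustration; they are not real Word2Vec or GloVe values. The point is only that a lookup never consults the rest of the sentence.

```python
# Toy static embedding table: one fixed vector per word TYPE.
# The 3-dimensional values are invented for illustration only.
STATIC_VECTORS = {
    "bank": [0.41, 0.38, 0.12],
    "river": [0.05, 0.90, 0.10],
    "money": [0.88, 0.07, 0.15],
}


def static_embed(sentence):
    """Look up each known word; the surrounding words are never consulted."""
    words = sentence.lower().split()
    return {w: STATIC_VECTORS[w] for w in words if w in STATIC_VECTORS}


bank_in_river = static_embed("I sat by the river bank")["bank"]
bank_in_money = static_embed("The bank approved my money transfer")["bank"]

# Same word type, same vector, no matter which sense the sentence intends.
print(bank_in_river == bank_in_money)  # True
```

A contextual model replaces the dictionary lookup with a function of the whole sentence, so the two occurrences of "bank" would no longer be forced to share a vector.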
Quick Check: Polysemy Spotting

How many different meanings does the word "run" have in these sentences? Would Word2Vec give them different or identical vectors?

  1. "I went for a run this morning." (exercise)
  2. "There was a run on the bank." (financial panic)
  3. "She has a run in her stockings." (tear in fabric)
  4. "The program takes a long time to run." (execute)
Answer

Four completely different meanings, but Word2Vec assigns one single vector for all of them. That vector would be a blurry average of all four meanings, capturing none of them well. This is exactly the problem contextual embeddings solve.

Key Insight: Meaning as Use

Static embeddings treat meaning as a fixed property of the word itself: one entry, one vector. Contextual embeddings treat meaning as a property of the word in its specific sentence. Linguists have long distinguished between these views: the surrounding context selects one sense of an ambiguous word and suppresses the others, a process usually studied under the name word sense disambiguation. The shift from Word2Vec to ELMo mirrors this insight: from fixed dictionary definitions to meaning that depends on use.

The polysemy problem is not merely an engineering annoyance; it is the central challenge that motivated the next generation of embedding models. The solution came from a conceptual leap: instead of assigning one vector per word type (the dictionary entry), assign a different vector for every word token (each occurrence in context).

ELMo: Embeddings from Language Models (2018)

Key Insight: From Types to Tokens

Word2Vec assigns one vector per word type (the dictionary entry for "bank"). ELMo assigns a different vector for every word token (each individual occurrence of "bank" in a sentence). This distinction, which philosophers and linguists have drawn for more than a century, was put to work at scale by ELMo. Contextual models since then, including BERT and GPT, follow the same principle.

ELMo (Peters et al., 2018) was published under a name that hints at the playful culture of NLP research: it stands for Embeddings from Language Models, but the acronym was chosen to reference the Sesame Street character. Its successor, BERT, kept the tradition alive.

Note: Quick Refresher: What Is an LSTM?

An LSTM (Long Short-Term Memory network, covered in Chapter 0) is a type of recurrent neural network designed to remember information over long sequences. Unlike a basic RNN, which struggles with vanishing gradients, an LSTM uses gating mechanisms (input, forget, and output gates) to selectively retain or discard information at each time step. This makes LSTMs well suited for processing sequential data like text.

ELMo (Peters et al., 2018) was the first widely successful contextual embedding model. The key idea: run the entire sentence through a deep bidirectional LSTM, and use the hidden states as word representations. The forward LSTM has read everything to the left of a word and the backward LSTM everything to its right, so the combined representation of each word is influenced by its full context.

How it works:

  1. Train a bidirectional language model (forward LSTM + backward LSTM) on a large corpus
  2. For each word in a sentence, extract hidden states from all layers
  3. The ELMo embedding is a learned weighted combination of all layer representations
ELMo computation graph: tokens through character embeddings and bidirectional LSTMs
Figure 1.4.3: The ELMo computation graph. Input tokens pass through a character embedding layer, then through forward and backward LSTMs. The final ELMo vector for each token is a learned weighted sum across all layers.
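
The three steps above can be written compactly. The block below restates the formulation in Peters et al. (2018) with notation adapted to this section: the layer weights are written as \alpha_j to match the figure (the paper writes s_j), t_k is the k-th token of an N-token sentence, \mathbf{h}_{k,j} is the representation of token k at layer j (layer 0 being the character-based input layer), and L is the number of LSTM layers. Treat it as a summary sketch, not a full specification of the model.

```latex
% Step 1: the bidirectional language model maximizes the joint
% log-likelihood of the forward and backward directions.
\mathcal{L}_{\text{biLM}} = \sum_{k=1}^{N} \Big(
    \log p\big(t_k \mid t_1, \dots, t_{k-1}\big)
  + \log p\big(t_k \mid t_{k+1}, \dots, t_N\big)
\Big)

% Step 2: each LSTM layer contributes the concatenation of its
% forward and backward hidden states for token k.
\mathbf{h}_{k,j} = \big[\, \overrightarrow{\mathbf{h}}_{k,j} \,;\, \overleftarrow{\mathbf{h}}_{k,j} \,\big],
\qquad j = 1, \dots, L

% Step 3: the task-specific ELMo vector is a scaled, softmax-weighted
% sum over all layers, including the input layer j = 0.
\mathrm{ELMo}_k^{\text{task}} = \gamma^{\text{task}} \sum_{j=0}^{L} \alpha_j^{\text{task}} \, \mathbf{h}_{k,j},
\qquad \alpha_j^{\text{task}} \ge 0, \quad \sum_{j=0}^{L} \alpha_j^{\text{task}} = 1
```

The scalar \gamma^{\text{task}} lets the downstream model rescale the whole ELMo vector, while the \alpha weights decide how much each layer contributes.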

The breakthrough: "bank" in "river bank" now gets a different vector than "bank" in "bank account", because the LSTM hidden states are conditioned on the entire sentence.

ELMo architecture diagram showing a forward LSTM and backward LSTM processing a sentence in both directions, with layer outputs combined into a weighted contextual embedding for each word
Figure 1.4.4: The ELMo architecture. A forward and backward LSTM each process the full sentence, and the outputs from multiple layers are combined with learned weights to produce a contextual embedding for each token.

Why Different Layers Capture Different Information

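A remarkable finding from the ELMo paper: different layers of the network capture different types of linguistic information. Lower layers capture morphology and word identity, middle layers encode syntax, and upper layers represent semantics and word sense.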

Key Insight: "bank" Across Layers

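Consider the word "bank" in the sentence "She walked along the river bank." The lowest layer represents little more than the word form itself, the middle layer adds its grammatical role (a noun), and the top layer settles on the riverside sense. Each layer refines the representation, moving from raw identity to grammatical role to contextual meaning.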

Information hierarchy across ELMo layers from morphology to semantics
Figure 1.4.5: The information hierarchy across ELMo layers. Lower layers capture morphology and word identity, middle layers encode syntax, and upper layers represent semantics and word sense.

This is why ELMo uses a weighted combination of all layers (the α weights in the diagram above): different downstream tasks benefit from different layers. A POS (part-of-speech) tagger might weight Layer 1 heavily, while a sentiment classifier might rely more on Layer 2. The weights are learned while training each downstream task model, with the pretrained language model itself typically kept frozen.
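
The weighted combination described above can be sketched in a few lines of plain Python. This is an illustrative toy, not ELMo's actual implementation: the two-dimensional layer vectors, the raw weight values, and the function names `softmax` and `mix_layers` are all assumptions made for the example. In a real system the raw weights and the scale `gamma` would be learned by gradient descent alongside the task model.

```python
import math


def softmax(scores):
    """Turn arbitrary raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_layers(layer_vectors, raw_weights, gamma=1.0):
    """Return gamma * sum_j softmax(raw_weights)[j] * layer_vectors[j]."""
    weights = softmax(raw_weights)
    dim = len(layer_vectors[0])
    return [
        gamma * sum(w * vec[i] for w, vec in zip(weights, layer_vectors))
        for i in range(dim)
    ]


# Made-up 2-dimensional vectors for one token at three layers.
layers = [
    [1.0, 0.0],  # layer 0: character-based token representation
    [0.0, 1.0],  # layer 1: first LSTM layer
    [1.0, 1.0],  # layer 2: second LSTM layer
]

# Equal raw weights give a plain average of the layers.
uniform = mix_layers(layers, raw_weights=[0.0, 0.0, 0.0])

# Hypothetical weights for a syntax-oriented task that favors layer 1.
syntax_heavy = mix_layers(layers, raw_weights=[0.0, 3.0, 0.0])

print(uniform)
print(syntax_heavy)
```

With equal raw weights the result is the mean of the three layers; raising the raw weight for layer 1 pulls the mixed vector toward that layer, which is how a tagger could come to emphasize syntactic information.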

Note: Historical Context

ELMo improved the state of the art on every benchmark reported in the paper, six tasks in all, including question answering, sentiment analysis, NER (named entity recognition, labeling spans like people, places, and organizations), and coreference resolution (linking pronouns and noun phrases that refer to the same entity). The paper reports relative error reductions of roughly 6 to 20% over strong baselines, a large jump by the standards of 2018. This made a strong case that contextual representations were the future, and set the stage for BERT later the same year.

Key Insight: The Paradigm Shift: Pretrain, Then Fine-tune

ELMo helped establish what would become the dominant paradigm in NLP: pretrain a model on a large unlabeled corpus (learning general language understanding), then reuse it for specific tasks, either by feeding its representations into a task model (as ELMo does) or by fine-tuning the whole network. BERT, GPT, and all modern LLMs follow the same recipe, just at a much larger scale.

Contextual Embeddings in Code

While ELMo itself is rarely used directly today (BERT and transformers have superseded it), we can demonstrate the concept of contextual embeddings using Hugging Face Transformers, which makes it easy to extract hidden states from any model:

# Demonstrating contextual embeddings: same word, different vectors
import torch
from scipy.spatial.distance import cosine
from transformers import AutoModel, AutoTokenizer

# Load a BERT model (the modern successor to ELMo)
# NOTE: First run will download bert-base-uncased (~420 MB).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")


def get_word_embedding(sentence, word):
    """Extract the contextual embedding for a specific word in a sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Find the token position for our word.
    # WARNING: This assumes the target word is a single token. If BERT's tokenizer
    # splits the word into subwords (a hypothetical split such as "dep" + "##osit"),
    # this lookup raises ValueError. For production code, you would need to handle
    # multi-token words by averaging their subword embeddings.
    tokens = tokenizer.tokenize(sentence)
    word_idx = tokens.index(word) + 1  # +1 for the [CLS] token
    # Return the final-layer hidden state at that position
    return outputs.last_hidden_state[0, word_idx].numpy()


# "bank" in two different contexts
bank_river = get_word_embedding("I sat by the river bank", "bank")
bank_money = get_word_embedding("I went to the bank to deposit money", "bank")

# Measure how different the two "bank" vectors are
distance = cosine(bank_river, bank_money)
print(f"Cosine distance between 'bank' in different contexts: {distance:.3f}")
# Reported output: ~0.35, substantially different vectors for the same word

# Compare: is "bank" (river) closer to "shore" than to "bank" (money)?
shore = get_word_embedding("We walked along the shore", "shore")
print(f"Distance bank(river) to shore: {cosine(bank_river, shore):.3f}")
print(f"Distance bank(river) to bank(money): {cosine(bank_river, bank_money):.3f}")
# In the reported run, bank(river) is CLOSER to "shore" than to bank(money).
# This is exactly what contextual embeddings solve.
Output:
Cosine distance between 'bank' in different contexts: 0.349
Distance bank(river) to shore: 0.218
Distance bank(river) to bank(money): 0.349
Code Fragment 1.4.1a: Extracting BERT hidden states for "bank" in two different sentences and comparing them with cosine distance.
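
The WARNING comment in the code above mentions averaging subword embeddings when a word is split into several pieces. The sketch below shows one way that could look, using plain Python lists so it runs without downloading a model. The token list, the two-dimensional hidden states, and the helper names `find_span` and `average_span` are all hypothetical; in a real run the pieces would come from `tokenizer.tokenize(word)` and the vectors from `outputs.last_hidden_state`, with the index shifted by one for the [CLS] token.

```python
def find_span(tokens, pieces):
    """Return the start index of the first occurrence of `pieces` in `tokens`."""
    n = len(pieces)
    for start in range(len(tokens) - n + 1):
        if tokens[start:start + n] == pieces:
            return start
    raise ValueError(f"{pieces} not found in {tokens}")


def average_span(vectors, start, length):
    """Element-wise mean of vectors[start:start + length]."""
    span = vectors[start:start + length]
    dim = len(span[0])
    return [sum(v[i] for v in span) / length for i in range(dim)]


# Hypothetical WordPiece-style split of one word into three pieces.
tokens = ["the", "em", "##bed", "##ding", "works"]
pieces = ["em", "##bed", "##ding"]

# Made-up 2-dimensional hidden states, one per token (no [CLS]/[SEP] here).
hidden = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0]]

start = find_span(tokens, pieces)
word_vector = average_span(hidden, start, len(pieces))
print(start, word_vector)  # 1 [3.0, 4.0]
```

Averaging is only one pooling choice; taking the first subword's vector is another common convention, and which works better depends on the task.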
Library Shortcut

For production sentence embeddings, sentence-transformers handles tokenization, pooling, and normalization in two lines:

from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode([
    "I sat by the river bank",
    "I went to the bank to deposit money",
    "We walked along the shore"
])
# embeddings.shape: (3, 384), ready for cosine similarity
Code Fragment 1.4.8: Encoding three sentences with sentence-transformers to produce dense 384-dimensional vectors; the two "bank" sentences land in different regions of the space, showing how contextual embeddings disambiguate polysemy before any cosine comparison.

pip install sentence-transformers

Note: Why We Used BERT Instead of ELMo

ELMo (2018) proved the concept, but BERT (released later in 2018) does the same job better and faster using transformers instead of LSTMs. Both produce contextual embeddings, and the extract-and-compare recipe above applies to either, although the loading code would differ for ELMo. We use BERT here because it is readily available via Hugging Face and is the tool you would actually use in practice. The concept (same word gets different vectors in different contexts) is ELMo's contribution; the implementation is modern.

Real-World Scenario
Contextual Embeddings Fix a Medical NER System

Who: NLP team at a health-tech startup building a named entity recognition (NER) system to extract drug names, conditions, and procedures from clinical notes

Situation: The initial system used GloVe embeddings (300-dimensional, trained on Wikipedia) as input features for a BiLSTM-CRF tagger. It achieved 79% F1 on a labeled test set of 3,000 clinical notes.

Problem: The system consistently failed on ambiguous medical terms. "Discharge" could mean a patient leaving the hospital or fluid leaving a wound. "Culture" could mean a lab test or a patient's background. GloVe gave each word a single vector regardless of context, causing the tagger to misclassify these terms roughly 40% of the time.

Dilemma: The team considered building hand-crafted disambiguation rules (brittle, incomplete), training domain-specific GloVe on medical text (would not solve the fundamental polysemy issue), or replacing GloVe with contextual embeddings from a pretrained model.

Decision: They replaced GloVe inputs with contextual embeddings from BioBERT (a BERT model pretrained on PubMed articles), feeding BioBERT's hidden states into the same BiLSTM-CRF architecture.

How: Extracted the last hidden layer from BioBERT for each token, then used these 768-dimensional vectors as input features. The rest of the pipeline remained identical.

Result: F1 jumped from 79% to 89.4%. Accuracy on ambiguous terms specifically improved from 60% to 87%. The model correctly distinguished "discharge" (a patient leaving the hospital) from "discharge" (fluid leaving a wound) based on surrounding clinical context.

Lesson: Contextual embeddings are not a marginal improvement over static embeddings; they are transformative for tasks involving polysemy. In domains where the same word carries different meanings in different contexts (medicine, law, finance), the switch from static to contextual representations is often among the highest-impact changes you can make.

From ELMo to Transformers: What Changed

Let us compare the approaches side by side to see the progression clearly:

Table 1.4.1b: Static embeddings, ELMo, and Transformer models compared (as of 2026).

| Property | Word2Vec / GloVe | ELMo | BERT / GPT (next chapters) |
|---|---|---|---|
| Context-aware? | No (static) | Yes (bi-LSTM) | Yes (self-attention) |
| Pretrained? | Yes | Yes | Yes (much larger scale) |
| Architecture | Shallow network | Deep bi-LSTM | Transformer |
| Approx. parameters | Vocabulary size × dimensions, embedding table only (e.g., 400K words × 300 dims ≈ 120M) | ~93M | ~110M (BERT-base) to 175B (GPT-3) |
| Handles polysemy? | No | Yes | Yes (better) |
| Parallelizable? | N/A | No (sequential) | Yes (all at once!) |
| Long-range context? | Window only | Limited by LSTM memory | Full sequence via attention |

The key limitation of ELMo was its reliance on LSTMs, which process text sequentially (one word at a time). This made training slow and limited the model's ability to capture very long-range dependencies. The Transformer architecture (Chapter 3) solves this by processing all words simultaneously using self-attention.
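
The contrast between sequential and parallel processing can be shown with a deliberately tiny sketch. Both functions below are toys invented for this comparison: `recurrent_states` stands in for an LSTM (it has no gates), and `attention_output` stands in for self-attention (it uses the raw scalar inputs as query, key, and value, with no learned projections). What matters is the dependency structure, not the arithmetic.

```python
import math


def recurrent_states(inputs, decay=0.5):
    """Toy recurrence: state t depends on state t-1, so steps must run in order."""
    states, h = [], 0.0
    for x in inputs:  # one step per token, each waiting on the previous one
        h = decay * h + x
        states.append(h)
    return states


def attention_output(inputs, position):
    """Toy self-attention for one position over the whole input sequence."""
    scores = [inputs[position] * key for key in inputs]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return sum((e / total) * value for e, value in zip(exps, inputs))


inputs = [1.0, 0.0, 2.0, 1.0]

sequential = recurrent_states(inputs)

# Each position reads only `inputs`, never another position's output,
# so these calls could run in any order or all at once.
parallel = [attention_output(inputs, t) for t in range(len(inputs))]

print(sequential)
print(parallel)
```

In the recurrent version, the last token's state cannot be computed until every earlier state exists. In the attention version, every position looks directly at the full sequence, which is what allows Transformer training to be parallelized across positions.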

Real-World Scenario
Choosing the Right Layer for Feature Extraction

Who: Applied ML researcher at an academic institution building a part-of-speech tagger for low-resource African languages

Situation: Using multilingual BERT (mBERT) to extract features for POS tagging in Yoruba. The researcher needed to decide which of the 12 transformer layers to use as input features for the downstream CRF tagger.

Problem: Using only the final layer (layer 12) gave 74% accuracy, which was disappointing given that English POS tagging with the same setup reached 97%.

Dilemma: The options were to fine-tune the entire mBERT model (expensive, risk of catastrophic forgetting on 500 labeled sentences), use a fixed layer (which one?), or use a learned weighted combination of all layers (the ELMo approach).

Decision: Following the ELMo insight that lower layers encode syntax while upper layers encode semantics, they tested individual layers and found that layers 6 through 8 performed best for POS tagging. They then used the ELMo-style weighted sum across all 12 layers, letting the model learn the optimal mixture.

How: Extracted hidden states from all 12 layers, introduced 12 learnable scalar weights (initialized uniformly), and computed the weighted sum as the input representation. Only these 12 weights and the CRF layer were trained.

Result: Accuracy improved from 74% (last layer only) to 82% (weighted combination). The learned weights confirmed that middle layers (6 through 9) received the highest weights, consistent with the syntactic-layer hypothesis from ELMo.

Lesson: The final layer of a pretrained model is not always the best feature for every task. Syntactic tasks benefit from middle layers; semantic tasks benefit from upper layers. When in doubt, use a learned weighted combination across all layers.

Summary: The Representation Journey

This module traced the evolution of how we represent text for machines. Each step solved a problem that the previous approach could not handle:

Analogy comparing word embeddings to a GPS for words, where coordinates in vector space encode semantic relationships and directions represent meaningful linguistic properties
Figure 1.4.6: Word embeddings as a GPS for meaning. Just as GPS coordinates let you compute distances and directions between locations, embedding vectors let you compute semantic relationships between words.

Step back, and the path from sparse to dense to contextual representations becomes a single arc. Figure 1.4.7a lines up the five milestones (BoW, TF-IDF, Word2Vec, ELMo, Transformers) so you can see at a glance which limitation each new representation was designed to fix.

Evolution from BoW to Transformers, each solving the previous representation's limitation
Figure 1.4.7a: The evolution of text representation. Each step solved a specific limitation: BoW lost word order, TF-IDF added importance weighting, Word2Vec captured semantics, ELMo added context, and Transformers enabled parallel contextual processing.
Key Insight: The Thread That Connects Everything

The entire history of NLP can be read as a quest for better representations of meaning. Each breakthrough, from TF-IDF to Word2Vec to ELMo to Transformers, made the representation denser (fewer dimensions, more information per number), more contextual (same word, different meaning in different contexts), and more general (works across tasks without task-specific engineering). Understanding this trajectory is the key to understanding where the field is heading next.

Key Takeaways

What's Next?

The discussion continues in Section 1.4a: Contextual Embeddings Lab, BERT Pretraining & Exercises, which puts these ideas into practice with a Word2Vec-vs-BERT polysemy lab, a step-by-step walk through BERT's pretraining recipe (MLM + NSP), and exercises on the static-to-contextual story. After that, Section 1.5 turns to tokenization.

Further Reading

Contextual Embeddings

Peters, M. E., Neumann, M., Iyyer, M., et al. (2018). "Deep contextualized word representations." NAACL 2018. The ELMo paper, which showed contextual embeddings improving on the previous state of the art across all six NLP tasks it tested, using bidirectional LSTMs to generate context-dependent word representations. Introduced the idea of combining features from different layers for different tasks.
Howard, J. & Ruder, S. (2018). "Universal Language Model Fine-tuning for Text Classification." ACL 2018. ULMFiT popularized the "pretrain, then fine-tune" recipe for text classification, with discriminative fine-tuning and gradual unfreezing, an approach that later models such as BERT built upon. Demonstrated that language model pretraining transfers effectively to downstream tasks with minimal labeled data.