Section 1.4a: Contextual Embeddings Lab, BERT Pretraining & Exercises

I trained a Word2Vec model and then a BERT model on the same corpus. Word2Vec finished before lunch. BERT finished my electricity budget.
Lexica, Carbon-Aware AI Agent

Big Picture

This section continues from Section 1.4, which walked through the polysemy problem, ELMo, contextual embeddings in code, and what changed when Transformers replaced LSTMs. Here we put it all into practice: a hands-on lab comparing static and contextual embeddings, a step-by-step walk through BERT's pretraining recipe (MLM + NSP), and exercises that consolidate the static-to-contextual story.

Prerequisites

This section continues from Section 1.4. Familiarity with ELMo, the polysemy problem, static vs contextual embeddings, and the basic biLSTM architecture covered there is assumed.

Fun Fact: BERT's Mask Was an Accident

Figure: Static vs contextual embeddings. Word2Vec assigns one vector per word; BERT assigns one vector per (word, context) pair.

The 15% masking rate in BERT's MLM objective was not derived from theory. The original authors tried a few values, noticed 15% worked, and moved on. Follow-up work such as RoBERTa, SpanBERT, and ELECTRA exists partly to relitigate choices like this one. The lesson: in pretraining, hyperparameters that look like sacred constants are often the first value someone tried on a Tuesday.

Having traced the representation journey from static to contextual in Section 1.4, we now consolidate with a lab, the canonical BERT pretraining recipe, and exercises that test the static-vs-contextual distinction.

Note: What You Built in This Chapter

A text preprocessing pipeline in both NLTK and spaCy
Bag-of-Words and TF-IDF vectorizers with sklearn
A Word2Vec model trained from scratch with Gensim
Cosine similarity computations and word analogy queries
A FastText model that handles out-of-vocabulary words
GloVe vectors loaded and compared with Word2Vec
A similarity heatmap and t-SNE visualization
Contextual embeddings extracted from BERT, proving that "bank" gets different vectors in different contexts

You now have hands-on experience with every major text representation technique from the past 30 years. Not bad for one chapter.

Key Insight: Quick Check on Embedding Concepts

Q1: Word2Vec assigns "bank" the same vector in "river bank" and "investment bank." ELMo produces different vectors. What architectural choice enables this?

Answer

ELMo uses a bidirectional LSTM that reads the full sentence before computing each word's representation. The hidden state at each position is conditioned on all surrounding tokens, so "bank" near "river" receives a representation shaped by the water-related context, while "bank" near "investment" is shaped by the financial context. Fine-tuning then specializes these context-sensitive representations for the target task with far less labeled data than training task-specific models from scratch.

Q2: ELMo uses a weighted combination of all LSTM layers. What does each layer tend to capture, and why is the learned weighting better than always using the top layer?

Answer

Lower layers capture syntactic information (POS, morphology, dependency structure); upper layers capture semantic information (word sense, coreference). Syntactic tasks such as POS tagging benefit more from the lower layers; semantic tasks such as coreference resolution benefit more from the upper layers. The task-specific learned weights let the model draw on whichever combination is most useful, rather than committing to the top layer alone, which may over-represent semantics at the expense of syntax for syntax-sensitive tasks.

Q3: Predict which approach is better for (a) question answering, (b) text generation: ELMo's bidirectional LSTM or GPT-1's unidirectional Transformer.

Answer

(a) QA: ELMo/bidirectional wins. Understanding a question requires attending to the full question before locating the answer span; a unidirectional model cannot condition each token on future tokens. (b) Generation: GPT/unidirectional wins, and is in fact the only viable choice. Generation is autoregressive; the model must predict the next token without seeing future ones. Both predictions match practice: BERT outperformed GPT-1 on comprehension; GPT models dominate generation.

Lab: Word2Vec and Contextual Embeddings

Lab: Explore Word2Vec and Contextual Embeddings

Duration: ~60 min. Level: Intermediate

Objective

Train a Word2Vec model from scratch with gensim, visualize word relationships using t-SNE, then compare static embeddings with contextual embeddings from a pretrained transformer to see how context changes word representations.

Skills Practiced

Training Word2Vec (Skip-gram and CBOW) using gensim
Querying word analogies and nearest neighbors in embedding space
Visualizing high-dimensional embeddings with t-SNE
Extracting contextual embeddings from a pretrained model
Comparing static vs. contextual representations for polysemous words

Setup

Install the required packages for this lab.

pip install gensim matplotlib scikit-learn transformers torch numpy

Code Fragment 1.4.2a: Install gensim, scikit-learn, transformers, and torch for the Word2Vec and contextual-embedding lab.

Steps

Step 1: Train Word2Vec on a sample corpus

Use gensim to train a Skip-gram Word2Vec model. Even on a small corpus, the model learns meaningful word relationships through the distributional hypothesis.

from gensim.models import Word2Vec

# Sample corpus (in practice, use a larger dataset)
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["a", "prince", "is", "the", "son", "of", "a", "king"],
    ["a", "princess", "is", "the", "daughter", "of", "a", "king"],
    ["the", "man", "works", "in", "the", "city"],
    ["the", "woman", "works", "in", "the", "city"],
    ["a", "boy", "plays", "in", "the", "park"],
    ["a", "girl", "plays", "in", "the", "park"],
    ["the", "dog", "runs", "in", "the", "park"],
    ["the", "cat", "sleeps", "on", "the", "couch"],
] * 100  # Repeat for better training

# sg=1 selects Skip-gram (sg=0 would select CBOW)
model = Word2Vec(sentences, vector_size=50, window=3,
    min_count=1, sg=1, epochs=50)

# Query nearest neighbors
for word in ["king", "queen", "man"]:
    neighbors = model.wv.most_similar(word, topn=3)
    print(f"{word}: {[(w, f'{s:.2f}') for w, s in neighbors]}")

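Code Fragment 1.4.3a: Training a Skip-gram Word2Vec model with gensim on a toy corpus and printing the three nearest neighbors of each query word. In practice, use a larger dataset.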

Step 2: Visualize embeddings with t-SNE

Project the 50-dimensional word vectors down to 2D for visualization. Words with similar meanings should cluster together.

import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

# Collect every vocabulary word and its vector from the Step 1 model
words = list(model.wv.key_to_index.keys())
vectors = np.array([model.wv[w] for w in words])

# Perplexity must be smaller than the number of points being projected
tsne = TSNE(n_components=2, random_state=42, perplexity=min(5, len(words) - 1))
coords = tsne.fit_transform(vectors)

fig, ax = plt.subplots(figsize=(10, 8))
ax.scatter(coords[:, 0], coords[:, 1], s=40, alpha=0.6)
for i, word in enumerate(words):
    ax.annotate(word, (coords[i, 0], coords[i, 1]),
        fontsize=9, ha="center", va="bottom")

# Title, grid, and saving run once, after every label has been drawn
ax.set_title("Word2Vec Embeddings (t-SNE projection)")
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("word2vec_tsne.png", dpi=150)
plt.show()

Code Fragment 1.4.4a: Projecting 50-dimensional Word2Vec embeddings to 2D with t-SNE. Words that share similar distributional contexts in the training corpus should cluster together in the visualization, confirming that the model has captured semantic relationships.

Step 3: Compare static vs. contextual embeddings

Use a pretrained transformer to show that "bank" gets different representations depending on context, unlike Word2Vec where it always maps to the same vector. This is the core insight from ELMo that this section covers.

from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np
from numpy.linalg import norm

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

sentences = [
    "I deposited money at the bank",
    "We sat on the river bank",
    "The bank approved the loan",
    "Fish swim near the bank of the stream",
]

def get_word_embedding(sentence, target_word):
    """Extract BERT embedding for a target word in context."""
    inputs = tokenizer(sentence, return_tensors="pt")
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    with torch.no_grad():
        outputs = bert(**inputs)
    # Find the target word's token index (first exact match; this assumes
    # the word is a single WordPiece token, which holds for "bank")
    idx = next(i for i, t in enumerate(tokens) if t == target_word)
    return outputs.last_hidden_state[0, idx].numpy()

# Get "bank" embeddings in each context
embeddings = [get_word_embedding(s, "bank") for s in sentences]

# Compute pairwise cosine similarity
print("Cosine similarity between 'bank' in different contexts:")
for i in range(len(sentences)):
    for j in range(i + 1, len(sentences)):
        cos_sim = np.dot(embeddings[i], embeddings[j]) / (
            norm(embeddings[i]) * norm(embeddings[j]))
        print(f" [{i}] vs [{j}]: {cos_sim:.3f}")
        print(f" '{sentences[i]}' vs '{sentences[j]}'")

Output:
Cosine similarity between 'bank' in different contexts:
 [0] vs [1]: 0.761
 'I deposited money at the bank' vs 'We sat on the river bank'
 [0] vs [2]: 0.943
 'I deposited money at the bank' vs 'The bank approved the loan'
 [0] vs [3]: 0.779
 'I deposited money at the bank' vs 'Fish swim near the bank of the stream'
 [1] vs [2]: 0.748
 'We sat on the river bank' vs 'The bank approved the loan'
 [1] vs [3]: 0.912
 'We sat on the river bank' vs 'Fish swim near the bank of the stream'
 [2] vs [3]: 0.762
 'The bank approved the loan' vs 'Fish swim near the bank of the stream'

Code Fragment 1.4.5a: Extracting BERT contextual embeddings for the word "bank" across four sentences and computing pairwise cosine similarity. Financial uses of "bank" score higher with each other than with geographic uses, demonstrating that BERT produces distinct representations depending on context.

Expected pattern

Financial sentences (0 and 2) should have higher cosine similarity with each other than with the river sentences (1 and 3). This confirms that BERT produces different representations for "bank" depending on context, unlike Word2Vec.
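One way to check the expected pattern numerically is to average the similarities within a sense and across senses. The sketch below hard-codes the six values printed in the Step 3 output rather than recomputing them, so it runs without downloading BERT. The `sense` labels are this example's own annotation of the four sentences, not something the model produces.

```python
# Pairwise cosine similarities copied from the Step 3 output.
# Keys are (i, j) sentence indices.
reported = {
    (0, 1): 0.761, (0, 2): 0.943, (0, 3): 0.779,
    (1, 2): 0.748, (1, 3): 0.912, (2, 3): 0.762,
}

# Hand-assigned sense of "bank" in each sentence (an annotation for this sketch).
sense = {0: "financial", 1: "river", 2: "financial", 3: "river"}

# Split the pairs into same-sense and different-sense groups.
within = [s for (i, j), s in reported.items() if sense[i] == sense[j]]
across = [s for (i, j), s in reported.items() if sense[i] != sense[j]]

mean_within = sum(within) / len(within)
mean_across = sum(across) / len(across)

print(f"mean within-sense similarity: {mean_within:.3f}")
print(f"mean cross-sense similarity:  {mean_across:.3f}")
print(f"gap:                          {mean_within - mean_across:.3f}")
```

With only four sentences this is an illustration rather than a statistical test; a larger sentence set would be needed before drawing conclusions about a particular word.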

Step 4: Visualize contextual differences

Plot a heatmap of cosine similarities to see the clustering of financial vs. geographic senses.

import numpy as np
import matplotlib.pyplot as plt
from numpy.linalg import norm

# Reuses the `embeddings` list computed in Step 3
n = len(embeddings)
sim_matrix = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        sim_matrix[i, j] = np.dot(embeddings[i], embeddings[j]) / (
            norm(embeddings[i]) * norm(embeddings[j]))

fig, ax = plt.subplots(figsize=(6, 5))
im = ax.imshow(sim_matrix, cmap="RdYlGn", vmin=0.7, vmax=1.0)
short_labels = ["money at bank", "river bank", "bank approved", "bank of stream"]
ax.set_xticks(range(n))
ax.set_xticklabels(short_labels, rotation=45, ha="right", fontsize=9)
ax.set_yticks(range(n))
ax.set_yticklabels(short_labels, fontsize=9)

# Write each similarity value inside its cell
for i in range(n):
    for j in range(n):
        ax.text(j, i, f"{sim_matrix[i,j]:.2f}", ha="center", va="center", fontsize=10)

ax.set_title("Contextual Similarity of 'bank' (BERT)")
plt.colorbar(im, ax=ax, shrink=0.8)
plt.tight_layout()
plt.savefig("contextual_bank.png", dpi=150)
plt.show()

Code Fragment 1.4.6a: Rendering a heatmap of cosine similarities between contextual "bank" embeddings. Because the sentences alternate between financial and river senses, the high-similarity cells form a checkerboard pattern, which visually confirms that BERT groups the financial and geographic senses into separate clusters.
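Because the four sentences alternate between the financial and river senses, the similarity matrix is easier to read when rows and columns are grouped by sense. The following sketch reorders the matrix using the values from the Step 3 output. The `order` list (financial sentences first) is a choice made for this example, not part of the lab code.

```python
# Similarities from the Step 3 output; the diagonal is 1.0 by definition.
sim = [
    [1.000, 0.761, 0.943, 0.779],
    [0.761, 1.000, 0.748, 0.912],
    [0.943, 0.748, 1.000, 0.762],
    [0.779, 0.912, 0.762, 1.000],
]
labels = ["money at bank", "river bank", "bank approved", "bank of stream"]

# Put the two financial sentences first, then the two river sentences.
order = [0, 2, 1, 3]
grouped = [[sim[i][j] for j in order] for i in order]
grouped_labels = [labels[i] for i in order]

# Print the reordered matrix one row per sentence.
for label, row in zip(grouped_labels, grouped):
    print(f"{label:>15}  " + "  ".join(f"{value:.3f}" for value in row))
```

After reordering, the two high-similarity pairs sit in the top-left and bottom-right 2x2 corners. Converting `grouped` to a NumPy array and plotting it with `grouped_labels` would give a heatmap with visible blocks instead of a checkerboard.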

Stretch Goals

Try the same polysemy experiment with other ambiguous words like "bat," "crane," or "spring."
Extract embeddings from different BERT layers and compare how the representations change from layer 1 (more syntactic) to layer 12 (more semantic).
Train Word2Vec on a larger corpus (e.g., the text8 dataset available through gensim's downloader) and test the classic king - man + woman = queen analogy.

Key Takeaways

Context changes meaning. Unlike static embeddings, contextual embeddings produce different vectors for the same word in different sentences, solving the polysemy problem.
ELMo popularized pretrained contextual representations for transfer. A bidirectional LSTM language model is pretrained on a large corpus, then its internal representations are used as features for downstream tasks.
Different layers capture different information. Lower ELMo layers encode syntax (part of speech, word order), while upper layers capture semantics (word sense, sentiment).
ELMo's limitations motivated the move to Transformers. Sequential processing limits parallelism and makes long-range dependencies hard to capture, which is why contextual models moved to the attention-based architecture covered in Chapter 04.

Exercises & Self-Check Questions

Tip: How to Use These Exercises

The conceptual questions test your understanding of the why behind each technique. Try answering them in your own words before moving on. The coding exercises are hands-on challenges you should run in a Jupyter notebook.

Conceptual Questions

Representation evolution: In your own words, explain why the transition from sparse vectors (BoW) to dense vectors (Word2Vec) was such a big deal. What specific problems did it solve, and what new capabilities did it unlock?
The distributional hypothesis: The phrase "you shall know a word by the company it keeps" is the foundation of Word2Vec. Can you think of cases where this assumption breaks down? (Hint: think about antonyms. "Hot" and "cold" appear in very similar contexts...)
Static vs. contextual: Give three sentences where the word "play" means different things. Explain why Word2Vec would struggle with these but ELMo would handle them well.
Why pretrain? ELMo was pretrained on a large corpus, then used for specific tasks. Why is this better than training a model from scratch for each task? What does the pretraining capture that task-specific training would miss?
Trade-offs: A colleague argues "TF-IDF is obsolete; just use embeddings for everything." Give two scenarios where TF-IDF would actually be the better choice, and explain why.

Coding Exercises

Preprocessing exploration: Take a paragraph from a news article. Run it through the preprocessing pipeline from Section 1.2. Then experiment: what happens if you do not remove stop words? What if you use stemming instead of lemmatization? How do the resulting BoW vectors differ?
Analogy hunting: Using the pretrained word2vec-google-news-300 vectors, find 5 analogies that work well (beyond king/queen) and 3 that fail. Can you explain why the failures happen?
Similarity exploration: Pick 20 words from three different categories (e.g., sports, food, technology). Compute all pairwise cosine similarities. Do words within a category have higher similarity than cross-category pairs? Visualize the similarity matrix as a heatmap.
Word2Vec from scratch (challenge): Implement the Skip-gram model with negative sampling in pure PyTorch (no Gensim). Train it on a small text corpus and verify that similar words end up with similar vectors. Compare your results with Gensim's output.
Scaffolding hint: Your implementation will need four key components: (1) a SkipGramDataset class that generates (center, context, negative) tuples from your corpus; (2) two nn.Embedding layers, one for center words and one for context words; (3) a negative sampling loss function that maximizes the dot product for positive pairs and minimizes it for negative pairs (see the formula in Section 1.3); and (4) a training loop that iterates over batches, computes the loss, and calls optimizer.step(). Start with a small vocabulary (a few hundred words) and short embedding dimension (50) to debug before scaling up.
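As a starting point for component (3), here is a minimal sketch of the negative-sampling loss for a single training pair, written in plain Python so the arithmetic is visible. It assumes the standard skip-gram negative-sampling objective, whose notation may differ from the formula in Section 1.3. The vectors are hypothetical values chosen for illustration, and the function names are this example's own.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def sgns_loss(center, context, negatives):
    """Negative-sampling loss for one (center, context) pair.

    Pushes the center/context dot product up and the
    center/negative dot products down.
    """
    positive_term = log_sigmoid(dot(center, context))
    negative_term = sum(log_sigmoid(-dot(center, neg)) for neg in negatives)
    return -(positive_term + negative_term)

# Hypothetical 3-dimensional vectors for illustration only.
center = [0.5, -0.2, 0.1]
aligned_context = [0.6, -0.1, 0.2]     # points the same way as center
opposed_context = [-0.6, 0.1, -0.2]    # points the opposite way
negatives = [[-0.3, 0.4, 0.0], [0.1, 0.1, -0.5]]

loss_aligned = sgns_loss(center, aligned_context, negatives)
loss_opposed = sgns_loss(center, opposed_context, negatives)
print(f"loss with aligned context: {loss_aligned:.4f}")
print(f"loss with opposed context: {loss_opposed:.4f}")
```

The aligned context yields the lower loss, which is the behavior the training loop should reinforce. In the PyTorch version the same computation would be batched over the two `nn.Embedding` lookups rather than looped in Python.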

Topic	Paper / Resource	Why Read It
Word2Vec	Mikolov et al., "Efficient Estimation of Word Representations in Vector Space" (2013)	The original paper. Surprisingly short and readable.
GloVe	Pennington et al., "GloVe: Global Vectors for Word Representation" (2014)	Elegant math showing why co-occurrence ratios encode meaning.
FastText	Bojanowski et al., "Enriching Word Vectors with Subword Information" (2017)	The subword approach to word vectors, a close relative of the subword tokenizers used by later models.
ELMo	Peters et al., "Deep contextualized word representations" (2018)	The paper that showed contextual embeddings improve results across a broad range of NLP tasks.
Word2Vec explained	Jay Alammar, "The Illustrated Word2Vec"	The best visual explanation of Word2Vec on the internet.
Embeddings theory	Levy & Goldberg, "Neural Word Embedding as Implicit Matrix Factorization" (2014)	The mathematical connection between Word2Vec and GloVe.

Exercises

Exercise 1.4.1: Static vs Contextual (Conceptual)

For the sentence "I deposited the check at the bank by the river", state what (a) Word2Vec, (b) ELMo, and (c) BERT each produce for the word "bank", and explain how each progressively addresses the polysemy problem.

Answer Sketch

(a) Word2Vec: a single static vector for "bank" averaging the financial and river senses; identical regardless of sentence. (b) ELMo: a context-dependent vector formed by combining a forward and backward LSTM's hidden states at this position; different for "bank" in financial vs river contexts but produced from a shallow biLM. (c) BERT: a fully bidirectional transformer hidden state at the "bank" position; richer context integration via self-attention over the whole sentence and far higher representational capacity. The trajectory is "more context, more parameters, less polysemy"; modern LLM internal representations are the natural endpoint.

Exercise 1.4.2: Predict ELMo's Layer-Wise Behavior (Predictive)

ELMo combines representations from multiple LSTM layers. For a downstream task, predict (a) which layer's representation is most useful for syntax tasks (POS, chunking), (b) which is most useful for semantics (NER, sentiment), and (c) why ELMo learns a per-task layer-mixing weight rather than picking one.

Answer Sketch

(a) Lower layers are most useful for syntax: closer to surface form, encoding morphology and POS-style information. (b) Higher layers are most useful for semantics: more abstracted, encoding word sense and entity-type information. (c) ELMo learns per-task layer weights because the optimal mix varies: a chunking task wants more lower-layer, a question-answering task wants more upper-layer. A learned weighted sum lets the same pretrained model serve all tasks without specialized re-training. This same insight drives modern probe-based analyses of Transformer layers (Chapter 11 Interpretability).
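The layer-mixing idea in (c) can be sketched in a few lines. Following the ELMo paper, the task vector for a token is a scalar `gamma` times a softmax-weighted sum of that token's per-layer vectors. The layer vectors and raw weights below are hypothetical numbers; in a real model the weights would be learned jointly with the downstream task.

```python
import math

def softmax(weights):
    """Normalize raw weights so they are positive and sum to 1."""
    peak = max(weights)
    exps = [math.exp(w - peak) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]

def scalar_mix(layer_vectors, raw_weights, gamma=1.0):
    """Collapse one token's per-layer vectors into a single task vector."""
    s = softmax(raw_weights)
    dim = len(layer_vectors[0])
    return [
        gamma * sum(s[j] * layer_vectors[j][d] for j in range(len(layer_vectors)))
        for d in range(dim)
    ]

# Hypothetical 3-layer, 2-dimensional representations of one token.
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

uniform = scalar_mix(layers, [0.0, 0.0, 0.0])      # every layer weighted equally
lower_heavy = scalar_mix(layers, [5.0, 0.0, 0.0])  # leans almost entirely on layer 0

print("uniform mix:    ", [round(v, 3) for v in uniform])
print("lower-heavy mix:", [round(v, 3) for v in lower_heavy])
```

A syntax-oriented task would be expected to learn weights closer to the lower-heavy setting, while a semantics-oriented task would shift weight upward; the pretrained layers themselves stay the same.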

Exercise 1.4.3: Compute Contextual Embeddings (Code Tweak)

Sketch a 6-line snippet that uses Hugging Face transformers to extract the contextual BERT embedding for a specific token in a sentence. Show how the same token has different embeddings in two sentences.

Answer Sketch

import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(sentence, target_word):
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():  # inference only, so skip gradient tracking
        out = model(**enc).last_hidden_state[0]
    # First exact match; assumes the word is a single WordPiece token
    idx = enc.tokens().index(target_word)
    return out[idx]

v1 = embed("I deposited money at the bank.", "bank")
v2 = embed("We sat by the river bank.", "bank")
print(torch.cosine_similarity(v1, v2, dim=0))  # below 1.0; the lab measured about 0.76 for a near-identical pair

Code Fragment 1.4.7b: Extracting the contextual BERT embedding of "bank" from two sentences with Hugging Face transformers and comparing the two vectors by cosine similarity.

The output is clearly below 1.0 (the lab above measured about 0.76 for a near-identical sentence pair), demonstrating that BERT produces different vectors for the same surface token in different contexts. Word2Vec would return exactly 1.0, because it looks up the same vector both times. This single experiment is one of the cleanest empirical demonstrations of why contextual embeddings replaced static ones.
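For contrast, the static case can be written out with a toy lookup table in place of a trained Word2Vec model. The table and its values are hypothetical; the point is only that a static lookup never consults the sentence, so both calls return the same vector.

```python
import math

# Hypothetical static embedding table: exactly one vector per word type.
static_table = {"bank": [0.2, -0.7, 0.4]}

def static_embed(sentence, target_word):
    """Static lookup: the sentence argument is accepted but never used."""
    return static_table[target_word]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

v1 = static_embed("I deposited money at the bank.", "bank")
v2 = static_embed("We sat by the river bank.", "bank")
similarity = cosine(v1, v2)
print(f"{similarity:.3f}")  # prints 1.000
```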

Exercise 1.4.4: Why ELMo Lost the Race (Failure Mode)

ELMo was the first widely-used contextual embedding model in 2018; by 2019 BERT had largely replaced it. Identify three architectural or workflow reasons ELMo lost, beyond just "BERT is bigger." For each, explain why the same reason applies to current LLMs vs older RNN-based systems.

Answer Sketch

(1) Sequential bottleneck: ELMo's biLSTM processes tokens one at a time, limiting parallelism on GPUs. BERT's self-attention is parallel over sequence length. The same advantage carries over: modern Transformers train and infer faster than equivalent-capacity RNNs. (2) Fixed-depth context flow: ELMo's information has to traverse many recurrent steps to combine distant tokens, decaying en route. Self-attention combines any two positions in one step. (3) Pretraining objective: ELMo trained on left-only and right-only LM losses separately; BERT's masked LM uses true bidirectional context per prediction, yielding a much stronger representation per parameter. The general lesson: each pre-LLM architecture had an exploitable architectural bottleneck; LLM dominance came from systematically removing those.

Research Frontier

Contextual representations have evolved far beyond ELMo. Modern models like GPT-4, Claude 3.5, and Gemini 2.0 produce contextual embeddings as a byproduct of their architecture. Active research includes sparse autoencoders for understanding what these representations encode (see Section 10.1), representation engineering for steering model behavior, and linear probing to identify which layers capture which linguistic phenomena.

BERT Pretraining, Step by Step

BERT (Devlin et al., 2019) is the canonical contextual encoder, so it is worth walking through its pretraining recipe once, end to end. Most encoder-only models since have used the same recipe with variations, and it establishes the vocabulary ([MASK], [CLS], [SEP]) that the rest of the book will treat as known.

BERT is pretrained on raw text with two self-supervised objectives jointly optimized: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). Both are designed so that the supervision signal can be manufactured from unlabeled text alone, which is what enables training on entire Wikipedia + BookCorpus dumps.

MLM: The Fill-in-the-Blank Objective

During pretraining, roughly 15% of input tokens are selected for prediction. Of those, 80% are replaced with the special token [MASK], 10% are replaced with a random token from the vocabulary, and 10% are left unchanged. The model sees the corrupted sequence, runs every token through the bidirectional Transformer encoder, and at each selected position a classifier head on top of the contextual vector predicts the original token. The loss is the standard cross-entropy over the vocabulary, computed only at the selected positions (including those left unchanged or replaced with a random token).

The genius of MLM is that any contiguous text becomes labeled data. Each passage supplies its own training labels: the model has to use the bidirectional context (both left and right) to guess the missing word, which forces the hidden vectors to encode genuinely contextual meaning rather than just the identity of the surface token.
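The corruption procedure can be sketched in plain Python. This is a simplified illustration, not BERT's actual data pipeline: it selects each token independently with probability 0.15 (an assumption, so the number selected is only approximately 15%), skips special tokens, and draws random replacements from a tiny hypothetical vocabulary. The function name `mlm_corrupt` and the use of `None` for "no loss here" are this example's own conventions.

```python
import random

MASK = "[MASK]"
SPECIAL = {"[CLS]", "[SEP]", "[PAD]"}

def mlm_corrupt(tokens, vocab, select_prob=0.15, rng=None):
    """Return (corrupted_tokens, labels) following the 80/10/10 rule.

    labels[i] holds the original token at selected positions and
    None everywhere else, so the loss is computed only where a label exists.
    """
    if rng is None:
        rng = random.Random()
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if token in SPECIAL or rng.random() >= select_prob:
            continue                      # not selected: no change, no loss
        labels[i] = token                 # selected: model must predict this
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK           # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return corrupted, labels

# Hypothetical toy vocabulary and input.
vocab = ["the", "river", "bank", "loan", "money", "fish", "sat", "approved"]
tokens = ["[CLS]", "the", "bank", "approved", "the", "loan", "[SEP]"]

corrupted, labels = mlm_corrupt(tokens, vocab, select_prob=0.5, rng=random.Random(0))
print("input:    ", tokens)
print("corrupted:", corrupted)
print("labels:   ", labels)
```

The demo call raises `select_prob` to 0.5 only so that a seven-token sentence is likely to show some corruption; the default of 0.15 matches the rate described above.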

NSP and the `[CLS]` Sentence Embedding

The second objective addresses a property MLM does not directly reward: sentence-pair coherence. Each pretraining example is two sentences A and B separated by [SEP], with a special [CLS] token prepended to the whole sequence. Half of the time B is the actual next sentence after A in the corpus; the other half it is a random sentence sampled from elsewhere. A binary classifier sits on top of the final hidden vector of [CLS] and predicts whether B follows A.
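Building NSP training pairs from raw text can be sketched as follows. This is a simplified illustration under stated assumptions: sentences are whitespace-tokenized instead of WordPiece-tokenized, and negatives are drawn from any other sentence in the corpus. The trailing [SEP] and the segment ids (0 for the A side, 1 for the B side) follow BERT's input format as described in the BERT paper; the function name and example sentences are hypothetical.

```python
import random

def make_nsp_examples(documents, rng):
    """Return (tokens, segment_ids, is_next) triples from adjacent sentences."""
    all_sentences = [s for doc in documents for s in doc]
    examples = []
    for doc in documents:
        for a, true_next in zip(doc, doc[1:]):
            if rng.random() < 0.5:
                b, is_next = true_next, 1     # B really follows A
            else:
                # B is a random sentence that is neither A nor its true successor
                candidates = [s for s in all_sentences if s not in (a, true_next)]
                b, is_next = rng.choice(candidates), 0
            a_tokens, b_tokens = a.split(), b.split()
            tokens = ["[CLS]"] + a_tokens + ["[SEP]"] + b_tokens + ["[SEP]"]
            # Segment ids mark which tokens belong to A (0) and which to B (1)
            segment_ids = [0] * (len(a_tokens) + 2) + [1] * (len(b_tokens) + 1)
            examples.append((tokens, segment_ids, is_next))
    return examples

# Hypothetical two-document corpus.
documents = [
    ["the bank approved the loan",
     "the customer signed the papers",
     "the money arrived on monday"],
    ["we sat on the river bank",
     "fish swam near the shore"],
]

examples = make_nsp_examples(documents, random.Random(0))
for tokens, segment_ids, is_next in examples:
    print(is_next, " ".join(tokens))
```

The labels cost nothing to produce: whether B follows A is known from the corpus order, which is what makes NSP self-supervised.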

The mechanical consequence is that the [CLS] vector is forced to summarize the entire input, because it is the single hidden state that feeds the NSP head. After pretraining, the [CLS] vector is therefore the standard sentence embedding used for downstream classification: sentiment analysis, NLI, intent detection, and so on. A common alternative is to mean-pool all context vectors rather than rely on [CLS], especially when fine-tuning with sentence-transformers-style contrastive losses.
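The two pooling strategies differ only in which hidden states they read. The sketch below uses plain Python lists in place of tensors so it runs without any model; the hidden states are hypothetical numbers for a five-position input ([CLS], two word tokens, [SEP], and one padding position). The attention mask, which marks real tokens with 1 and padding with 0, is an assumption about how the input was batched.

```python
def cls_pool(hidden_states):
    """Sentence embedding = the hidden state at position 0, i.e. [CLS]."""
    return list(hidden_states[0])

def mean_pool(hidden_states, attention_mask):
    """Sentence embedding = average of hidden states at non-padding positions."""
    kept = [vec for vec, keep in zip(hidden_states, attention_mask) if keep]
    dim = len(hidden_states[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

# Hypothetical final-layer hidden states, one 2-dimensional vector per position.
hidden_states = [
    [1.0, 0.0],   # [CLS]
    [0.0, 2.0],   # first word token
    [2.0, 2.0],   # second word token
    [1.0, 0.0],   # [SEP]
    [9.0, 9.0],   # [PAD], must not influence the mean
]
attention_mask = [1, 1, 1, 1, 0]

cls_vector = cls_pool(hidden_states)
mean_vector = mean_pool(hidden_states, attention_mask)
print("CLS pooling: ", cls_vector)    # [1.0, 0.0]
print("mean pooling:", mean_vector)   # [1.0, 1.0]
```

Masking out padding matters: averaging over all five positions here would pull the result toward the padding vector, which carries no information about the sentence.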

The joint loss is simply the sum of the MLM and NSP losses:

\mathcal{L}_{\text{BERT}} = \mathcal{L}_{\text{MLM}} + \mathcal{L}_{\text{NSP}}
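Spelled out, one common way to write the two terms is shown below. The notation is an assumption of this sketch rather than the book's: \tilde{x} is the corrupted input sequence, M is the set of positions selected for prediction, x_i is the original token at position i, and y is the binary is-next label predicted from the [CLS] vector.

```latex
\mathcal{L}_{\text{MLM}} = -\sum_{i \in M} \log p_\theta\left(x_i \mid \tilde{x}\right),
\qquad
\mathcal{L}_{\text{NSP}} = -\log p_\theta\left(y \mid \tilde{x}\right)
```

Implementations commonly average the MLM term over |M| instead of summing, which only rescales that term.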

Note: RoBERTa, the cleaner BERT

RoBERTa (Liu et al., 2019) revisits BERT's recipe and finds that the NSP loss is not pulling its weight: removing it and training longer on more data, with dynamic masking and full-sentence packing (no NSP-shaped sentence pairs), produces a stronger model on the benchmarks the paper reports. RoBERTa also switches from BERT's WordPiece tokenizer to a byte-level BPE tokenizer (similar to GPT-2's). The takeaway: MLM is the heavy lifter; NSP is convenient pedagogy but optional. Most newer encoders (DeBERTa, ELECTRA, modern multilingual variants) drop NSP entirely.
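The difference between static and dynamic masking is when the mask positions are drawn. The sketch below is a simplified illustration: it tracks only which positions are selected, uses independent per-token selection at 0.15 (an assumption), and treats "epoch" as one pass over a single 200-token sequence. The helper name `choose_mask_positions` is hypothetical.

```python
import random

def choose_mask_positions(num_tokens, rng, select_prob=0.15):
    """Pick the set of positions to corrupt for one pass over a sequence."""
    return frozenset(i for i in range(num_tokens) if rng.random() < select_prob)

num_tokens, num_epochs = 200, 3

# Static masking: positions are drawn once during preprocessing and reused.
static_positions = choose_mask_positions(num_tokens, random.Random(0))
static_masks = [static_positions for _ in range(num_epochs)]

# Dynamic masking: positions are redrawn each time the sequence is served.
dynamic_rng = random.Random(0)
dynamic_masks = [choose_mask_positions(num_tokens, dynamic_rng)
                 for _ in range(num_epochs)]

print("distinct static masks: ", len(set(static_masks)))
print("distinct dynamic masks:", len(set(dynamic_masks)))
```

With dynamic masking the model sees a different prediction problem on every pass over the same text, which matters more the longer training runs.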

Key Insight: BERT in one line

BERT = bidirectional Transformer encoder + MLM (fill-in-the-blank with 15% masking) + NSP (does sentence B follow sentence A?). The [CLS] vector is the default sentence embedding; the per-token contextual vectors feed every other downstream head. RoBERTa is BERT minus NSP plus more data; it is a common production default whenever an encoder-only model is what you need.

Key Takeaways

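The lab demonstrates the polysemy point empirically. Word2Vec returns cosine similarity 1.0 for "bank" across contexts because it has only one vector per word; in the lab's output, BERT returns roughly 0.75 to 0.78 across senses and 0.91 to 0.94 within a sense. That gap is the value contextual embeddings add.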
BERT = bidirectional Transformer encoder + MLM (15% masking) + NSP. The [CLS] vector is the default sentence embedding; per-token vectors feed downstream heads.
RoBERTa is BERT minus NSP plus more data, and is a common production default whenever an encoder-only model is what you need.
Different BERT layers encode different linguistic levels: lower layers for POS-level information, middle layers for syntax, upper layers for semantics. Probing studies (Tenney et al., 2019) confirm the ELMo layer hypothesis at Transformer scale.

What's Next?

In the next section, Section 1.5: Why Tokenization Matters, we turn to tokenization, the critical first step that determines how models see and process text.

Further Reading

BERT and Relatives

Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." Replaced ELMo's LSTMs with Transformers and introduced masked language modeling, achieving state-of-the-art results on eleven NLP tasks. BERT's bidirectional attention mechanism is a direct evolution of the contextual embedding idea explored in this section. Required reading for all NLP practitioners.

Liu, Y., Ott, M., Goyal, N., et al. (2019). "RoBERTa: A Robustly Optimized BERT Pretraining Approach." Revisits BERT's pretraining recipe and shows that removing NSP, increasing batch size, training longer on more data, and using dynamic masking with byte-level BPE produces a stronger encoder on the benchmarks it reports. The standard "cleaner BERT" reference and a practical default whenever an encoder-only model is needed.

Layer Analysis and Probing

Tenney, I., Das, D., & Pavlick, E. (2019). "BERT Rediscovers the Classical NLP Pipeline." ACL 2019. Showed that different BERT layers encode different linguistic levels (POS tagging in early layers, syntax in middle layers, semantics in upper layers), confirming the ELMo layer hypothesis at transformer scale. Provides practical guidance on which layers to use for feature extraction.

Ethayarajh, K. (2019). "How Contextual are Contextualized Word Representations?" EMNLP 2019. Quantified how much context changes word representations across layers in ELMo, BERT, and GPT-2, finding that upper layers produce more context-specific representations. Introduces anisotropy as a key metric for understanding embedding spaces.

Rogers, A., Kovaleva, O., & Rumshisky, A. (2020). "A Primer in BERTology: What We Know About How BERT Works." TACL. Comprehensive survey synthesizing over 150 studies on what each BERT layer encodes, how attention heads specialize, and which layers matter for which tasks. The go-to reference for making informed decisions about layer selection and feature extraction.
