Part 2: Understanding LLMs
Chapter 06: Pretraining and Scaling Laws

In-Context Learning Theory

Show a model three examples and it figures out the pattern. Show it zero examples and it still tries, with alarming confidence. Nobody trained it to do this; it just started happening one day, and theorists have been catching up ever since.


Prerequisites

This section assumes understanding of the attention mechanism from Chapter 04 and the concept of in-context learning introduced in Section 6.1 (GPT-3 discussion). Some familiarity with Bayesian inference is helpful but not required; key concepts are explained as needed.

Big Picture

In-context learning (ICL) is one of the most surprising capabilities of large language models. When you provide a few examples in a prompt, the model adapts its behavior to the demonstrated pattern without any gradient updates to its parameters. This section explores the theoretical frameworks that attempt to explain this phenomenon: the Bayesian inference interpretation, the implicit gradient descent hypothesis, the role of task vectors in internal representations, and the mesa-optimization perspective. Understanding these theories is essential for designing effective few-shot prompts and for reasoning about the capabilities and limitations of in-context learning. The attention mechanism from Section 3.3 is central to several of these explanations.

1. The Mystery of In-Context Learning

In-context learning depicted as an open-book exam where the model uses examples provided in the prompt
Figure 6.7.1: In-context learning is the ultimate open-book exam: give the model a few examples in the prompt and watch it figure out the pattern.

Consider a standard few-shot prompting scenario: you provide a large language model with several input-output pairs followed by a new input, as in Code Fragment 6.7.1 below.

# Few-shot classification example
prompt = """
Review: "This movie was absolutely wonderful!"
Sentiment: Positive

Review: "Terrible acting and a boring plot."
Sentiment: Negative

Review: "The cinematography was stunning but the story fell flat."
Sentiment: Mixed

Review: "I laughed and cried, a true masterpiece."
Sentiment:"""
Code Fragment 6.7.1: Few-shot classification example.

The model outputs "Positive" without any fine-tuning. Its weights are frozen. Yet it has somehow "learned" the sentiment classification task from just three examples. How?

This is not simply pattern matching or memorization. GPT-3 demonstrated that models can perform ICL on novel tasks that could not have appeared in the training data, such as classifying inputs using randomly assigned labels. The model is not recalling a memorized mapping; it is constructing a task-specific computation from the prompt.

Leveraging In-Context Learning for Rapid Prototype Classification

Who: A data scientist at a consumer electronics company tasked with classifying customer feedback into new, previously undefined categories.

Situation: After a major product launch, the team needed to categorize thousands of support tickets into categories that did not exist in their existing taxonomy (e.g., "battery swelling concern," "wireless charging interference," "MagSafe alignment issue").

Problem: Fine-tuning a classifier required labeled training data they did not yet have. Manually labeling enough examples for supervised learning would take weeks.

Dilemma: Wait weeks for labeled data and a fine-tuned model, or use a rapid approach that might sacrifice accuracy. Zero-shot prompting produced inconsistent results because the categories were domain-specific and unfamiliar to the base model.

Decision: They used 5-shot in-context learning with GPT-4, providing five hand-labeled examples per category directly in the prompt.

How: The team manually labeled 40 examples (5 per category for 8 categories), constructed a few-shot prompt template, and processed all 12,000 tickets through the API. They validated against a random sample of 200 tickets checked by human reviewers.

Result: The few-shot approach achieved 87% accuracy on the human-validated sample and was operational within 2 hours of the categories being defined. A fine-tuned classifier built two weeks later achieved 93% accuracy. The few-shot system handled the critical first two weeks while labeled data accumulated.

Lesson: In-context learning is a powerful rapid-prototyping tool: it lets you deploy a working classifier in hours, buying time to collect the labeled data needed for a fine-tuned production model.

In-context learning clearly works in practice, but how? What is happening inside the model when it reads a few examples and suddenly "learns" a new task? Several theoretical frameworks attempt to explain this phenomenon, and they point to a surprisingly elegant mechanism.

2. The Bayesian Inference Interpretation

Fun Fact

Nobody explicitly programmed in-context learning into LLMs. It emerged as a side effect of next-token prediction at scale. Researchers are still debating exactly how it works, which makes it one of the few features in software engineering that shipped before anyone understood why it existed.

Xie et al. (2022) proposed that in-context learning can be understood as implicit Bayesian inference over a latent concept variable. The idea is that pre-training on diverse documents effectively teaches the model a prior distribution over "tasks" or "concepts." When few-shot examples are provided in the prompt, the model performs approximate Bayesian updating to identify which concept generated those examples, and then uses that posterior to predict the answer for the query.

More formally, the model implicitly computes:

$$P(y_{q} | x_{q}, D) \approx \sum _{c} P(y_{q} | x_{q}, c) \cdot P(c | D)$$

where $D = \{(x_{1},y_{1}), \ldots, (x_{k},y_{k})\}$ is the set of demonstrations, $c$ is the latent concept, and $(x_{q}, y_{q})$ is the query. The demonstrations narrow the posterior $P(c | D)$ to the correct concept, enabling accurate prediction.

This framework explains several observed properties of ICL: more examples improve performance (they narrow the posterior), the order of examples matters (the model processes them sequentially), and ICL works best for tasks similar to those encountered during pre-training (they must be within the model's prior).
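This updating can be made concrete with a toy computation. The sketch below (hypothetical numbers, not from Xie et al.) places a discrete prior over three candidate concepts; each additional demonstration multiplies in its likelihood, and the posterior concentrates on the concept that best explains the demonstrations:

```python
# Toy Bayesian updating over latent concepts (hypothetical numbers,
# assuming i.i.d. demonstrations; P(c | D) computed by brute force).
prior = {"sentiment": 0.5, "topic": 0.3, "language-id": 0.2}
# Likelihood that each concept generates one observed demonstration;
# "sentiment" is the true task here.
likelihood = {"sentiment": 0.9, "topic": 0.4, "language-id": 0.2}

def posterior_after(k):
    """P(concept | k demonstrations) via Bayes' rule."""
    unnorm = {c: prior[c] * likelihood[c] ** k for c in prior}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}

for k in [0, 1, 3, 8]:
    p = posterior_after(k)
    print(k, {c: round(v, 3) for c, v in p.items()})
```

More demonstrations concentrate the posterior on "sentiment", mirroring the observation that additional examples improve ICL accuracy with diminishing returns once the posterior is peaked.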

Bayesian interpretation: pre-training as prior, few-shot narrows posterior
Figure 6.7.2: In the Bayesian interpretation, pre-training establishes a prior over tasks. Few-shot examples narrow the posterior to the correct task, enabling accurate prediction.

3. In-Context Learning as Implicit Gradient Descent

A more mechanistic explanation, proposed independently by Akyurek et al. (2023) and Von Oswald et al. (2023), is that transformer attention layers implement something functionally equivalent to gradient descent. When the model processes few-shot examples, the attention mechanism computes updates to an internal "hypothesis" that are analogous to gradient steps on a loss function defined by the demonstrations.

The key insight comes from analyzing the structure of a single attention head. Consider a linear attention head (attention without softmax) operating on a sequence that contains input-output pairs. The attention output at the query position can be decomposed as:

$$f(x_{q}) = W_{V} X^{T} X W_{K}^{T} W_{Q} x_{q}$$

This has the same mathematical form as one step of gradient descent on a linear regression problem. The "training data" is the set of in-context examples encoded in the key-value pairs, the "query" is the test input, and the attention mechanism computes a prediction by comparing the query against the examples. Code Fragment 6.7.2 below puts this into practice.


# Linear attention as one gradient descent step: show that a single
# linear attention layer implicitly performs regression on in-context examples.
import torch
import torch.nn as nn

class LinearAttentionAsGD(nn.Module):
    """
    Demonstrates how linear attention on in-context examples
    implements one step of gradient descent on a regression task.
    """
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W_K = nn.Parameter(torch.randn(d_in, d_in) * 0.1)
        self.W_Q = nn.Parameter(torch.randn(d_in, d_in) * 0.1)
        self.W_V = nn.Parameter(torch.randn(d_out, d_out) * 0.1)

    def forward(self, X_ctx, y_ctx, x_query):
        """
        X_ctx: (n_examples, d_in)  - in-context inputs
        y_ctx: (n_examples, d_out) - in-context outputs
        x_query: (d_in,)           - query input
        """
        # Key-query similarity between the query and each example
        keys = X_ctx @ self.W_K.T       # (n, d_in)
        query = self.W_Q @ x_query      # (d_in,)
        attn = keys @ query             # (n,) linear attention (no softmax)

        # Value projection of the in-context outputs (the W_V term)
        values = y_ctx @ self.W_V.T     # (n, d_out)
        output = attn @ values          # (d_out,)
        return output

# Compare with explicit gradient descent
def one_step_gd(X_ctx, y_ctx, x_query, lr=0.01):
    """One step of GD on MSE linear regression, starting from w=0."""
    # Gradient of MSE at w=0 is -2/n * X^T @ y; the constant factor 2
    # is absorbed into the learning rate here.
    w = lr * X_ctx.T @ y_ctx / len(X_ctx)
    return x_query @ w

# Example with toy data
torch.manual_seed(42)
X = torch.randn(5, 3)   # 5 examples, 3 features
w_true = torch.tensor([1.0, -0.5, 0.3])
y = X @ w_true + torch.randn(5) * 0.1  # noisy labels
x_q = torch.randn(3)

gd_pred = one_step_gd(X, y, x_q)
true_val = x_q @ w_true

print(f"True value:    {true_val.item():.4f}")
print(f"GD prediction: {gd_pred.item():.4f}")
print(f"(Attention-based ICL implements a similar computation)")
True value:    0.7766
GD prediction: 0.1352
(Attention-based ICL implements a similar computation)
Code Fragment 6.7.2: Linear attention as one gradient descent step.

Note that a single step with a small learning rate moves the prediction only part of the way toward the true value; closing the gap requires more steps, which is exactly what depth provides.
Key Insight

Multi-layer transformers implement multi-step gradient descent. While a single attention layer corresponds to one gradient step, stacking multiple layers allows the transformer to implement iterative refinement. Each layer takes the current "hypothesis" and refines it using the in-context examples, analogous to multiple steps of an optimization algorithm. Deeper transformers can solve more complex in-context tasks because they effectively run more optimization iterations.
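The depth-as-iterations intuition can be sanity-checked with a toy experiment (a sketch assuming plain full-batch gradient descent on linear regression, not an actual transformer): run k explicit GD steps and watch the recovered weights approach the truth as k grows, just as deeper stacks run more implicit steps.

```python
# Toy sketch: more GD steps (standing in for more layers) reduce error.
import torch

torch.manual_seed(0)
X = torch.randn(20, 3)                  # in-context "training" inputs
w_true = torch.tensor([1.0, -0.5, 0.3])
y = X @ w_true                          # noiseless in-context labels

def gd_weights(X, y, k, lr=0.05):
    """Weights after k full-batch GD steps on MSE, starting from w = 0."""
    w = torch.zeros(X.shape[1])
    for _ in range(k):
        grad = 2.0 / len(X) * X.T @ (X @ w - y)  # gradient of the MSE loss
        w = w - lr * grad
    return w

for k in [1, 5, 50]:
    err = torch.norm(gd_weights(X, y, k) - w_true).item()
    print(f"{k:3d} steps: ||w - w_true|| = {err:.4f}")
```

The error shrinks monotonically with the number of steps, which is the optimization-theoretic analogue of deeper transformers solving harder in-context tasks.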

4. Task Vectors

Todd et al. (2024) and Hendel et al. (2023) identified a concrete mechanism by which transformers implement ICL: task vectors. When a transformer processes few-shot demonstrations, it constructs a vector in its activation space that encodes the task being demonstrated. This task vector can be extracted and transplanted into other forward passes to induce the same behavior without the original demonstrations.

The experimental evidence is compelling. Researchers found that: (1) adding a task vector extracted from a few-shot prompt to the activations of a zero-shot forward pass recovers much of the few-shot behavior, showing the vector itself carries the task information; (2) task vectors for the same task derived from different demonstration sets are highly similar, indicating a consistent task encoding; and (3) task vectors are localized to specific, typically middle, layers, revealing where in the network the ICL computation occurs.

Warning

Note: This code example is conceptual and requires downloading a language model (e.g., GPT-2) to run. The purpose is to illustrate the task vector extraction logic, not to provide a standalone runnable script. A full working version would need approximately 500 MB of model weights.

# Conceptual demonstration of task vector extraction
import torch

def extract_task_vector(model, tokenizer, few_shot_prompt, zero_shot_prompt):
    """
    Extract the task vector by comparing activations of
    few-shot vs zero-shot prompts at a specific layer.
    """
    activations = {}

    def hook_fn(name):
        def hook(module, input, output):
            activations[name] = output[0][:, -1, :]  # last-token activations
        return hook

    # Register a forward hook at a middle layer
    target_layer = model.layers[len(model.layers) // 2]
    handle = target_layer.register_forward_hook(hook_fn("mid"))

    # Activations for the few-shot prompt
    tokens_fs = tokenizer(few_shot_prompt, return_tensors="pt")
    with torch.no_grad():
        model(**tokens_fs)
    act_few_shot = activations["mid"].clone()

    # Activations for the zero-shot prompt
    tokens_zs = tokenizer(zero_shot_prompt, return_tensors="pt")
    with torch.no_grad():
        model(**tokens_zs)
    act_zero_shot = activations["mid"].clone()

    handle.remove()

    # Task vector = difference in activations
    task_vector = act_few_shot - act_zero_shot
    return task_vector

# The task vector can then be added to zero-shot activations
# to induce few-shot behavior without the demonstrations
Code Fragment 6.7.3: Conceptual demonstration of task vector extraction.

5. Mesa-Optimization

Russian nesting dolls illustrating mesa-optimization, where an inner optimizer emerges inside the trained model
Figure 6.7.3: Mesa-optimization is the Russian nesting doll of AI: an optimizer inside your optimizer, learning its own objectives during training.

A more speculative but intellectually provocative perspective comes from the mesa-optimization framework (Hubinger et al., 2019). The hypothesis is that sufficiently large transformers do not merely implement fixed input-output mappings but actually learn to run internal optimization algorithms. The pre-training process (the "base optimizer") creates a model that itself contains an optimizer (the "mesa-optimizer") that runs at inference time.

Under this view, when a transformer performs in-context learning, it is literally running an optimization algorithm inside its forward pass: the few-shot examples define an objective, and the stacked attention layers iteratively optimize an internal representation to minimize that objective. The model is not just pattern matching; it is optimizing.

Evidence for this perspective includes the gradient descent equivalence discussed above, the observation that ICL performance improves with model depth (more optimization steps), and the finding that transformers can learn to implement various learning algorithms (ridge regression, logistic regression, decision trees) from in-context examples alone.
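As a point of reference for that last finding, the sketch below computes the closed-form ridge-regression solution on a set of in-context examples, i.e., the kind of learning algorithm Akyurek et al. (2023) report transformers can approximate from demonstrations alone (toy data and the `lam` value are assumptions for illustration):

```python
# Closed-form ridge regression fitted to in-context examples: the
# reference algorithm that trained transformers have been shown to mimic.
import torch

torch.manual_seed(1)
X = torch.randn(8, 4)                   # in-context inputs
w_true = torch.randn(4)
y = X @ w_true + 0.05 * torch.randn(8)  # noisy in-context labels
x_q = torch.randn(4)

def ridge_predict(X, y, x_q, lam=0.1):
    """Solve (X^T X + lam I) w = X^T y, then predict at the query."""
    d = X.shape[1]
    w = torch.linalg.solve(X.T @ X + lam * torch.eye(d), X.T @ y)
    return x_q @ w

pred = ridge_predict(X, y, x_q)
print(f"ridge prediction {pred.item():+.4f} vs true {(x_q @ w_true).item():+.4f}")
```

In the mesa-optimization reading, a transformer performing ICL on such data is effectively executing this computation inside its forward pass.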

Warning

The mesa-optimization perspective remains an active area of debate. It is unclear whether the internal computations of real LLMs are truly optimizing a coherent objective or merely performing pattern matching that resembles optimization in controlled settings. The theoretical frameworks provide useful intuitions but have not been conclusively validated on production-scale models.

6. Practical Implications: Why Prompt Design Matters

These theoretical frameworks have direct implications for prompt engineering; the table in Section 8 below maps each theory to a concrete few-shot strategy.

Key Insight: In-Context Learning and the Duality of Learning

In-context learning reveals a striking duality in how neural networks acquire capabilities. Traditional machine learning updates parameters (weights) through gradient descent over many examples. ICL achieves something functionally equivalent by updating activations (hidden states) through a single forward pass. This mirrors a distinction in biology between evolutionary adaptation (slow, across generations, encoded in DNA) and neural plasticity (fast, within a lifetime, encoded in synaptic connections). Pre-training is the evolutionary process that builds the general architecture; in-context learning is the real-time adaptation that specializes it for a task. The mesa-optimization perspective pushes this analogy further: if transformers truly learn to optimize internally, they have crossed a threshold from being tools that are optimized to being systems that optimize, a qualitative shift that connects to fundamental questions in philosophy of mind about the nature of agency and goal-directed behavior.

7. Limitations of In-Context Learning

Despite its power, ICL has systematic limitations. It is sensitive to the order and format of demonstrations, can be biased toward recently seen labels, and struggles on tasks far outside the pre-training distribution. Viewed as implicit gradient descent, it is also bounded by model depth in how many "optimization steps" it can run, which limits complex multi-step reasoning.

8. Connection to Few-Shot Prompting Practice

Understanding ICL theory improves practical few-shot prompting. The table below connects theoretical insights to actionable strategies.

Theory             | Implication                            | Practical Strategy
Bayesian inference | Examples narrow the task posterior     | Choose diverse, representative examples
Implicit GD        | More layers = more optimization steps  | Use larger models for harder ICL tasks
Task vectors       | Task representation converges quickly  | 3-5 examples often suffice
Mesa-optimization  | Model implements a learning algorithm  | Format examples like "training data"
Research Frontier: ICL Failure Modes

Despite its power, in-context learning fails in systematic and poorly understood ways. ICL can be sensitive to the order of examples (permuting few-shot examples sometimes changes the answer), to the label space (models can be biased toward labels seen more recently), and to the format of examples (small formatting changes can cause large performance swings). Understanding when ICL will fail and why remains an open research question. Practical advice: always test ICL setups with multiple example orderings and formats, and consider fine-tuning when reliability is critical.
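The "test multiple orderings" advice above is easy to script. The sketch below (model-agnostic; sending each variant to a model is left as an assumption) enumerates every ordering of the chapter's sentiment examples so the resulting predictions can be compared for stability:

```python
# Generate every ordering of a few-shot prompt to probe order sensitivity.
from itertools import permutations

examples = [
    ('"This movie was absolutely wonderful!"', "Positive"),
    ('"Terrible acting and a boring plot."', "Negative"),
    ('"The cinematography was stunning but the story fell flat."', "Mixed"),
]

def build_prompt(ordered_examples, query):
    """Assemble a few-shot prompt from one specific example ordering."""
    blocks = [f"Review: {r}\nSentiment: {s}" for r, s in ordered_examples]
    blocks.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(blocks)

query = '"I laughed and cried, a true masterpiece."'
prompts = [build_prompt(order, query) for order in permutations(examples)]
print(f"Generated {len(prompts)} prompt variants")  # 3! = 6 orderings
# In practice: run each variant through the model; large disagreement
# across orderings signals an unstable ICL setup.
```

If the model's answers diverge across the six variants, the few-shot setup is order-sensitive and a fine-tuned classifier may be the safer choice.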

Self-Check
1. How does the Bayesian interpretation explain why more in-context examples generally improve performance?
Show Answer
In the Bayesian framework, each in-context example provides evidence for the correct latent concept (task). With more examples, the posterior distribution P(concept | demonstrations) becomes more concentrated around the true concept, reducing uncertainty. This is analogous to how observing more data points narrows a Bayesian posterior. However, there are diminishing returns: once the posterior is sufficiently peaked, additional examples provide little additional information.
2. What mathematical equivalence exists between linear attention and gradient descent?
Show Answer
A single linear attention layer computing output = V^T K^T Q x_query has the same mathematical form as one step of gradient descent on a linear regression problem. The key-value pairs formed from in-context examples play the role of training data, and the query projection plays the role of the test input. The attention computation effectively fits a linear model to the demonstrations and evaluates it at the query point. Multi-layer transformers extend this to multiple gradient steps with nonlinear activations between them.
3. What is a task vector and how does it provide evidence for ICL mechanisms?
Show Answer
A task vector is the difference in a model's internal activations between a few-shot prompt and an equivalent zero-shot prompt. It encodes the "task" demonstrated by the few-shot examples as a direction in the model's representation space. Task vectors provide evidence for ICL mechanisms because: (1) adding a task vector to zero-shot activations recovers few-shot performance, proving the vector carries task information; (2) task vectors for the same task from different example sets are similar, showing task encoding is consistent; (3) they are localized to specific layers, revealing where in the network ICL computations occur.
4. Why might ICL fail on tasks requiring complex multi-step reasoning?
Show Answer
ICL, viewed as implicit gradient descent, implements a limited number of optimization steps (bounded by model depth). Complex multi-step reasoning requires composing many sequential operations, each dependent on the previous result. The implicit optimizer may not have enough steps to converge on such tasks. Additionally, the Bayesian interpretation suggests that complex reasoning tasks are unlikely to appear as coherent "concepts" in the pre-training distribution, making them hard to identify from demonstrations. Finally, the task vector mechanism may be too simple to represent tasks that require conditional branching or recursive computation.

Key Takeaways

In-context learning adapts a frozen model's behavior from prompt examples alone, with no gradient updates. Four complementary theories explain it: Bayesian inference (demonstrations narrow a posterior over tasks learned in pre-training), implicit gradient descent (attention layers emulate optimization steps, with depth acting as iteration count), task vectors (a single activation-space direction encodes the demonstrated task), and mesa-optimization (the model may run a learned internal optimizer, though this remains debated). Because ICL is sensitive to example order and format, few-shot setups should be validated empirically before production use.

Research Frontier

Mechanistic understanding of ICL. Recent work by Todd et al. (2024) has identified "function vectors" that encode specific input-output mappings in model activations, extending the task vector framework to fine-grained function identification. Meanwhile, the Many-Shot ICL paradigm (Agarwal et al., 2024) shows that with sufficiently long context windows (hundreds of examples), ICL performance approaches fine-tuning quality on many benchmarks, blurring the boundary between in-context adaptation and gradient-based learning. The connection between ICL and interpretability (Section 17.1) continues to deepen, with induction head analysis now used as a diagnostic tool for model quality.

What's Next?

In the next chapter, Chapter 07: Modern LLM Landscape, we survey the modern LLM landscape, comparing both closed-source and open-weight models across capabilities and architectures.

References & Further Reading
Empirical Foundations

Brown, T. B. et al. (2020). "Language Models are Few-Shot Learners." NeurIPS 2020.

The GPT-3 paper that first systematically demonstrated in-context learning at scale. Showed that large language models can perform tasks from a handful of examples in the prompt, without any parameter updates.

📄 Paper

Min, S. et al. (2022). "Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?" EMNLP 2022.

Reveals the surprising finding that correct input-label mappings in demonstrations matter less than the format, label space, and input distribution. Challenges naive assumptions about why in-context learning works.

📄 Paper
Theoretical Explanations

Xie, S. M. et al. (2022). "An Explanation of In-context Learning as Implicit Bayesian Inference." ICLR 2022.

Proposes that in-context learning performs implicit Bayesian inference over latent document-generating concepts. Provides a theoretical framework for understanding how prompts activate different "concept" distributions learned during pre-training.

📄 Paper

Von Oswald, J. et al. (2023). "Transformers Learn In-Context by Gradient Descent." ICML 2023.

Demonstrates that transformer attention layers can implement gradient descent steps on in-context examples. Provides a mechanistic connection between ICL and traditional learning algorithms executed within the forward pass.

📄 Paper

Akyurek, E. et al. (2023). "What Learning Algorithm Is In-Context Learning? Investigations with Linear Models." ICLR 2023.

Investigates which learning algorithms transformers implement during in-context learning on linear regression tasks. Finds evidence of ridge regression and gradient descent, offering concrete algorithmic interpretations of ICL behavior.

📄 Paper
Mechanistic Interpretability

Olsson, C. et al. (2022). "In-context Learning and Induction Heads." Transformer Circuits Thread.

Identifies "induction heads" as a key circuit responsible for in-context learning, showing that specific attention head compositions copy and complete patterns from the context. Connects ICL to concrete architectural mechanisms.

📄 Paper