In-Context Learning Theory

Section 6.7

Show a model three examples and it figures out the pattern. Show it zero examples and it still tries, with alarming confidence. Nobody trained it to do this; it just started happening one day, and theorists have been catching up ever since.

ScaleScale, Zero-Shot Confident AI Agent
Big Picture

In-context learning (ICL) is one of the most surprising capabilities of large language models. When you provide a few examples in a prompt, the model adapts its behavior to the demonstrated pattern without any gradient updates to its parameters. This section explores the theoretical frameworks that attempt to explain this phenomenon: the Bayesian inference interpretation, the implicit gradient descent hypothesis, the role of task vectors in internal representations, and the mesa-optimization perspective. Understanding these theories is essential for designing effective few-shot prompts and for reasoning about the capabilities and limitations of in-context learning. The attention mechanism from Section 2.3 is central to several of these explanations.

Key Insight: Remember

In-context learning is gradient descent in disguise: the model's attention treats the prompt's examples the way SGD treats training batches, just over the forward pass instead of the backward pass. Nobody trained the model to do this; it emerged once scale was high enough that the prompt could carry "training data" the attention could absorb.

Prerequisites

This section assumes understanding of the attention mechanism from Section 3.1 and the concept of in-context learning introduced in Section 6.1 (GPT-3 discussion). Some familiarity with Bayesian inference is helpful but not required; key concepts are explained as needed.

6.7.1 The Mystery of In-Context Learning

In-context learning depicted as an open-book exam where the model uses examples provided in the prompt
Figure 6.7.1: In-context learning is the ultimate open-book exam: give the model a few examples in the prompt and watch it figure out the pattern.

Consider a standard few-shot prompting scenario. You provide a large language model with several input-output pairs followed by a new input:

# Few-shot classification example
prompt = """
Review: "This movie was absolutely wonderful!"
Sentiment: Positive
Review: "Terrible acting and a boring plot."
Sentiment: Negative
Review: "The cinematography was stunning but the story fell flat."
Sentiment: Mixed
Review: "I laughed and cried, a true masterpiece."
Sentiment:"""
Code Fragment 6.7.1a: Consider a standard few-shot prompting scenario.
# Conceptual demonstration of task vector extraction (see Section 6.7.4)
import torch

def extract_task_vector(model, tokenizer, few_shot_prompt, zero_shot_prompt):
    """
    Extract the task vector by comparing activations of
    few-shot vs zero-shot prompts at a specific layer.
    """
    activations = {}

    def hook_fn(name):
        def hook(module, input, output):
            # Assumes the layer returns a tuple whose first element holds the
            # hidden states with shape (batch, seq_len, d_model).
            activations[name] = output[0][:, -1, :].detach()  # last token
        return hook

    # Register hook at a middle layer. `model.layers` is a placeholder: the
    # attribute path depends on the architecture
    # (e.g., model.transformer.h for Hugging Face GPT-2).
    target_layer = model.layers[len(model.layers) // 2]
    handle = target_layer.register_forward_hook(hook_fn("mid"))
    try:
        # Get activations for few-shot prompt
        tokens_fs = tokenizer(few_shot_prompt, return_tensors="pt")
        with torch.no_grad():
            model(**tokens_fs)
        act_few_shot = activations["mid"].clone()

        # Get activations for zero-shot prompt
        tokens_zs = tokenizer(zero_shot_prompt, return_tensors="pt")
        with torch.no_grad():
            model(**tokens_zs)
        act_zero_shot = activations["mid"].clone()
    finally:
        # Always detach the hook, even if a forward pass fails
        handle.remove()

    # Task vector = difference in activations
    return act_few_shot - act_zero_shot

# The task vector can then be added to zero-shot activations
# to induce few-shot behavior without the demonstrations
Code Fragment 6.7.2: Conceptual demonstration of task vector extraction (the mechanism is discussed in Section 6.7.4).

The model outputs "Positive" without any fine-tuning. Its weights are frozen. Yet it has somehow "learned" the sentiment classification task from just three examples. How?

This is not simply memorization. Brown et al. (2020) showed that GPT-3 can perform ICL on tasks constructed to be unlikely in its training data, such as using a made-up word correctly in a sentence after seeing a single definition. The model is not recalling a memorized mapping; it appears to construct a task-specific computation from the prompt.

Real-World Scenario
Leveraging In-Context Learning for Rapid Prototype Classification

Who: A data scientist at a consumer electronics company tasked with classifying customer feedback into new, previously undefined categories.

Situation: After a major product launch, the team needed to categorize thousands of support tickets into categories that did not exist in their existing taxonomy (e.g., "battery swelling concern," "wireless charging interference," "MagSafe alignment issue").

Problem: Fine-tuning a classifier required labeled training data they did not yet have. Manually labeling enough examples for supervised learning would take weeks.

Dilemma: Wait weeks for labeled data and a fine-tuned model, or use a rapid approach that might sacrifice accuracy. Zero-shot prompting produced inconsistent results because the categories were domain-specific and unfamiliar to the base model.

Decision: They used 5-shot in-context learning with GPT-4, providing five hand-labeled examples per category directly in the prompt.

How: The team manually labeled 40 examples (5 per category for 8 categories), constructed a few-shot prompt template, and processed all 12,000 tickets through the API. They validated against a random sample of 200 tickets checked by human reviewers.

Result: The few-shot approach achieved 87% accuracy on the human-validated sample, operational within 2 hours of the categories being defined. A fine-tuned classifier built two weeks later achieved 93% accuracy. The few-shot system handled the critical first two weeks while labeled data accumulated.

Lesson: In-context learning is a powerful rapid-prototyping tool: it lets you deploy a working classifier in hours, buying time to collect the labeled data needed for a fine-tuned production model.
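The "few-shot prompt template" step in this scenario can be sketched in a few lines. The function name `build_few_shot_prompt`, the `Ticket:`/`Category:` layout, and the sample tickets below are illustrative assumptions, not the team's actual template.

```python
from typing import List, Tuple


def build_few_shot_prompt(
    examples: List[Tuple[str, str]], query: str, categories: List[str]
) -> str:
    """Assemble a classification prompt from hand-labeled (ticket, category) pairs.

    The layout is a hypothetical template chosen for illustration.
    """
    lines = ["Classify each support ticket into one of: " + ", ".join(categories) + ".", ""]
    for text, label in examples:
        lines.append(f'Ticket: "{text}"')
        lines.append(f"Category: {label}")
        lines.append("")
    # The query uses the same format, with the label left for the model to fill in.
    lines.append(f'Ticket: "{query}"')
    lines.append("Category:")
    return "\n".join(lines)


# Invented sample tickets, two of the eight categories shown for brevity.
demo_examples = [
    ("Phone back cover is bulging near the camera.", "battery swelling concern"),
    ("Charging pad stops working when my key fob is nearby.", "wireless charging interference"),
]
demo_categories = ["battery swelling concern", "wireless charging interference"]

prompt = build_few_shot_prompt(
    demo_examples, "The case looks puffed up after charging overnight.", demo_categories
)
print(prompt)
```

Keeping the query in exactly the same format as the demonstrations matters: the model completes the pattern it sees, so the prompt ends on the label field it is expected to fill.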

In-context learning clearly works in practice, but how? What is happening inside the model when it reads a few examples and suddenly "learns" a new task? Several theoretical frameworks attempt to explain this phenomenon, and they point to a surprisingly elegant mechanism.

6.7.2 The Bayesian Inference Interpretation

Fun Fact

Nobody explicitly programmed in-context learning into LLMs. It emerged as a side effect of next-token prediction at scale. Researchers are still debating exactly how it works, which makes it one of the few features in software engineering that shipped before anyone understood why it existed.

Xie et al. (2022) proposed that in-context learning can be understood as implicit Bayesian inference over a latent concept variable. The idea is that pretraining on diverse documents effectively teaches the model a prior distribution over "tasks" or "concepts." When few-shot examples are provided in the prompt, the model performs approximate Bayesian updating to identify which concept generated those examples, and then uses that posterior to predict the answer for the query.

More formally, the model implicitly computes:

$$P(y_{q} | x_{q}, D) \approx \sum _{c} P(y_{q} | x_{q}, c) \cdot P(c | D)$$

where $D = \{(x_{1},y_{1}), \ldots, (x_{k},y_{k})\}$ is the set of demonstrations, $c$ is the latent concept, and $(x_{q}, y_{q})$ is the query. The demonstrations narrow the posterior $P(c | D)$ to the correct concept, enabling accurate prediction.

This framework explains several observed properties of ICL: more examples improve performance (they narrow the posterior), the order of examples matters (the model processes them sequentially), and ICL works best for tasks similar to those encountered during pretraining (they must be within the model's prior).
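The posterior-narrowing idea can be made concrete with a toy example. The three candidate concepts, the 10% label-noise rate, and the uniform prior below are assumptions chosen for illustration; a real model's "concepts" are not enumerable like this.

```python
from typing import Callable, Dict, List, Tuple

NOISE = 0.1  # assumed probability that a demonstration label is flipped

# Hypothetical latent concepts: each maps an integer input to a binary label.
CONCEPTS: Dict[str, Callable[[int], int]] = {
    "is_even": lambda x: int(x % 2 == 0),
    "greater_than_5": lambda x: int(x > 5),
    "always_one": lambda x: 1,
}


def likelihood(y: int, x: int, concept: str) -> float:
    """P(y | x, c): the concept's label with probability 1 - NOISE."""
    return 1.0 - NOISE if CONCEPTS[concept](x) == y else NOISE


def posterior(demos: List[Tuple[int, int]]) -> Dict[str, float]:
    """P(c | D) under a uniform prior, via Bayes' rule over the demonstrations."""
    weights = {c: 1.0 / len(CONCEPTS) for c in CONCEPTS}
    for x, y in demos:
        for c in weights:
            weights[c] *= likelihood(y, x, c)
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}


def predict_one(x_q: int, demos: List[Tuple[int, int]]) -> float:
    """P(y_q = 1 | x_q, D) = sum_c P(y_q = 1 | x_q, c) * P(c | D)."""
    post = posterior(demos)
    return sum(likelihood(1, x_q, c) * p for c, p in post.items())


demos = [(2, 1), (7, 0), (4, 1), (9, 0)]  # all consistent with "is_even"
for k in range(1, len(demos) + 1):
    print(k, "demos ->", {c: round(p, 3) for c, p in posterior(demos[:k]).items()})
print("P(y=1 | x=3, D) =", round(predict_one(3, demos), 3))
```

Each additional demonstration multiplies in another likelihood term, so concepts that disagree with the examples lose posterior mass quickly and the prediction for the query is dominated by the surviving concept.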

Bayesian interpretation: pre-training as prior, few-shot narrows posterior
Figure 6.7.2a: In the Bayesian interpretation, pretraining establishes a prior over tasks. Few-shot examples narrow the posterior to the correct task, enabling accurate prediction.

6.7.3 In-Context Learning as Implicit Gradient Descent

A more mechanistic explanation, proposed independently by Akyurek et al. (2023) and Von Oswald et al. (2023), is that transformer attention layers implement something functionally equivalent to gradient descent. When the model processes few-shot examples, the attention mechanism computes updates to an internal "hypothesis" that are analogous to gradient steps on a loss function defined by the demonstrations.

The key insight comes from analyzing the structure of a single attention head. Consider a linear attention head (attention without softmax) operating on a sequence that contains input-output pairs. The attention output at the query position can be decomposed as:

$$f(x_{q}) = W_{V} Y^{T} (X W_{K}^{T} W_{Q} x_{q})$$

where $X$ is the matrix of in-context inputs and $Y$ is the matrix of in-context outputs (used as the values). This has the same mathematical form as one step of gradient descent on a linear regression problem (von Oswald et al., 2023, arXiv:2212.07677; see also Schlag et al., 2021, arXiv:2102.11174 for the equivalence to fast-weight programmers). The "training data" is the set of in-context examples encoded in the key-value pairs, the "query" is the test input, and the attention mechanism computes a prediction by comparing the query against the examples.
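To see where the correspondence comes from, here is a short derivation under simplifying assumptions: a squared-error loss, a weight matrix initialized at zero, a learning rate $\eta$ (a symbol introduced here), and $X$, $Y$ stacking the $k$ demonstrations as rows. It is a sketch of the idea rather than the exact construction in the cited papers.

$$L(W) = \frac{1}{2k} \sum_{i=1}^{k} \lVert W x_{i} - y_{i} \rVert^{2}, \qquad \nabla_{W} L(0) = -\frac{1}{k} \sum_{i=1}^{k} y_{i} x_{i}^{T} = -\frac{1}{k} Y^{T} X$$

$$W_{1} = 0 - \eta \nabla_{W} L(0) = \frac{\eta}{k} Y^{T} X, \qquad W_{1} x_{q} = \frac{\eta}{k} Y^{T} X x_{q}$$

Comparing with $f(x_{q}) = W_{V} Y^{T} (X W_{K}^{T} W_{Q} x_{q})$, the two expressions coincide when $W_{V} = I$ and $W_{K}^{T} W_{Q} = \frac{\eta}{k} I$: the key-query product sets the step size, and the values carry the labels.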

# Linear attention as one gradient descent step: show that a single
# linear attention layer implicitly performs regression on in-context examples.
import torch
import torch.nn as nn

class LinearAttentionAsGD(nn.Module):
    """
    Demonstrates how linear attention on in-context examples
    implements one step of gradient descent on a regression task.
    """
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W_K = nn.Parameter(torch.randn(d_in, d_in) * 0.1)
        self.W_Q = nn.Parameter(torch.randn(d_in, d_in) * 0.1)
        # Value projection acts on the outputs, so it is (d_out, d_out)
        self.W_V = nn.Parameter(torch.randn(d_out, d_out) * 0.1)

    def forward(self, X_ctx, y_ctx, x_query):
        """
        X_ctx: (n_examples, d_in)  - in-context inputs
        y_ctx: (n_examples, d_out) - in-context outputs
        x_query: (d_in,)           - query input
        """
        # Key-Query similarity (like GD step direction)
        keys = X_ctx @ self.W_K.T    # (n, d_in)
        query = self.W_Q @ x_query   # (d_in,)
        attn = keys @ query          # (n,) linear attention, no softmax
        # Value-weighted output (like GD update): W_V Y^T (X W_K^T W_Q x_q)
        values = y_ctx @ self.W_V.T  # (n, d_out)
        output = attn @ values       # (d_out,)
        return output

# Compare with explicit gradient descent
def one_step_gd(X_ctx, y_ctx, x_query, lr=0.01):
    """One step of GD on linear regression, starting from w=0."""
    # Gradient of MSE at w=0 is -(2/n) * X^T @ y; the factor 2 is folded into lr
    w = lr * X_ctx.T @ y_ctx / len(X_ctx)
    return x_query @ w

# Example with toy data
torch.manual_seed(42)
X = torch.randn(5, 3)                  # 5 examples, 3 features
w_true = torch.tensor([1.0, -0.5, 0.3])
y = X @ w_true + torch.randn(5) * 0.1  # noisy labels
x_q = torch.randn(3)
gd_pred = one_step_gd(X, y, x_q)
true_val = x_q @ w_true
# One small step from w=0 moves the prediction only part of the way
# toward the true value, so the two numbers are not expected to match.
print(f"True value:    {true_val.item():.4f}")
print(f"GD prediction: {gd_pred.item():.4f}")
print(f"(Attention-based ICL implements a similar computation)")
Output (exact values may vary with the PyTorch version):
True value:    0.7766
GD prediction: 0.1352
(Attention-based ICL implements a similar computation)
Code Fragment 6.7.3: Implementation of one_step_gd
Key Insight

Multi-layer transformers implement multi-step gradient descent. While a single attention layer corresponds to one gradient step, stacking multiple layers allows the transformer to implement iterative refinement. Each layer takes the current "hypothesis" and refines it using the in-context examples, analogous to multiple steps of an optimization algorithm. Deeper transformers can solve more complex in-context tasks because they effectively run more optimization iterations.
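The "one layer, one step" picture can be illustrated numerically by running explicit gradient descent for a varying number of steps and watching the query error shrink. This is a sketch of the analogy only: the data, learning rate, and function name `gd_steps` are assumptions, and no transformer is involved.

```python
from typing import List


def gd_steps(X: List[List[float]], y: List[float], n_steps: int, lr: float = 0.5) -> List[float]:
    """Run n_steps of gradient descent on mean squared error, starting from w = 0.

    Each step stands in for one "layer" in the implicit-GD analogy.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(n_steps):
        grad = [0.0] * d
        for xi, yi in zip(X, y):
            residual = sum(wj * xj for wj, xj in zip(w, xi)) - yi
            for j in range(d):
                grad[j] += residual * xi[j] / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


# Noise-free toy "in-context examples" generated by w_true (assumed values).
w_true = [1.0, -0.5]
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
y = [sum(wj * xj for wj, xj in zip(w_true, xi)) for xi in X]
x_query = [2.0, 1.0]
target = sum(wj * xj for wj, xj in zip(w_true, x_query))

depths = (1, 2, 4, 8)
errors = []
for steps in depths:
    w = gd_steps(X, y, steps)
    pred = sum(wj * xj for wj, xj in zip(w, x_query))
    errors.append(abs(pred - target))
    print(f"{steps} step(s): query error = {errors[-1]:.4f}")
```

For this particular data the error contracts by a constant factor of 0.625 per step, so doubling the number of steps squares the remaining fraction of the error; that is the sense in which more "layers" buy a better in-context fit.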

Warning: Common Misconception

"ICL is gradient descent" is a structural equivalence, not a literal one. The model's parameters are still frozen at inference; nothing is being updated by backprop. What changes is the residual-stream activation at the query position, which mathematically matches the formula for one step of GD on a regression problem whose data is the in-context examples. Two practical consequences follow: ICL cannot retain information across separate API calls (no actual weight change persists), and adding more demonstrations gives the implicit optimizer more "training data" only up to the context window, while the number of implicit steps stays bounded by model depth. This is also why ICL works best for tasks that the prior (the frozen weights) already supports.

6.7.4 Task Vectors

Todd et al. (2024) and Hendel et al. (2023) identified a concrete mechanism by which transformers implement ICL: task vectors. When a transformer processes few-shot demonstrations, it constructs a vector in its activation space that encodes the task being demonstrated. This task vector can be extracted and transplanted into other forward passes to induce the same behavior without the original demonstrations.

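The experimental evidence is compelling. Researchers found that (1) adding a task vector to the activations of a zero-shot prompt recovers much of the few-shot performance, so the vector carries task information; (2) task vectors extracted for the same task from different sets of examples are similar, so the encoding is consistent; and (3) task vectors are localized to specific layers, which indicates where in the network the ICL computation takes place.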

Note: Conceptual Code Example

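This code example is conceptual and requires downloading model weights to run: the two Llama 3 8B checkpoints it loads take roughly 16 GB each in float16, and access to them is gated on the Hugging Face Hub. Its purpose is to illustrate the logic, not to provide a standalone runnable script. Note that it shows weight-space task vectors (Ilharco et al., 2023), the difference between fine-tuned and base parameters, which is a related but distinct idea from the activation-space task vectors described above; the activation-space extraction is sketched in Code Fragment 6.7.2.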

# Weight-space task vector: the DIFFERENCE between a fine-tuned model and its base
# is itself a "task vector" you can add, subtract, or scale.
# Ilharco et al., "Editing Models with Task Arithmetic" (ICLR 2023).
# This is distinct from the activation-space task vectors of Hendel et al. and Todd et al.
import torch
from transformers import AutoModelForCausalLM

# Hugging Face Hub ids (gated; access must be requested)
base_id = "meta-llama/Meta-Llama-3-8B"
ft_id   = "meta-llama/Meta-Llama-3-8B-Instruct"

base = AutoModelForCausalLM.from_pretrained(base_id,  torch_dtype=torch.float16)
ft   = AutoModelForCausalLM.from_pretrained(ft_id,    torch_dtype=torch.float16)

# task_vector = theta_ft - theta_base (one tensor per parameter)
# Build each state dict once instead of once per parameter.
base_sd, ft_sd = base.state_dict(), ft.state_dict()
task_vector = {n: ft_sd[n] - base_sd[n] for n in base_sd}

# Apply at any scale alpha to the base model; alpha=1 reproduces ft,
# alpha=0 reproduces base, negative alpha "subtracts" the fine-tune behavior.
def apply_task_vector(model, task_vector, alpha: float):
    sd = model.state_dict()
    for name, delta in task_vector.items():
        sd[name] = sd[name] + alpha * delta
    model.load_state_dict(sd)

# Half the instruction-tuning strength
apply_task_vector(base, task_vector, alpha=0.5)
# Composing multiple task vectors gives multi-task models without extra training.
Code Fragment 6.7.4: Conceptual demonstration of weight-space task vectors (task arithmetic).

6.7.5 Mesa-Optimization

Russian nesting dolls illustrating mesa-optimization, where an inner optimizer emerges inside the trained model
Figure 6.7.3a: Mesa-optimization is the Russian nesting doll of AI: an optimizer inside your optimizer, learning its own objectives during training.

A more speculative but intellectually provocative perspective comes from the mesa-optimization framework (Hubinger et al., 2019). The hypothesis is that sufficiently large transformers do not merely implement fixed input-output mappings but actually learn to run internal optimization algorithms. The pretraining process (the "base optimizer") creates a model that itself contains an optimizer (the "mesa-optimizer") that runs at inference time.

Under this view, when a transformer performs in-context learning, it is literally running an optimization algorithm inside its forward pass: the few-shot examples define an objective, and the stacked attention layers iteratively optimize an internal representation to minimize that objective. The model is not just pattern matching; it is optimizing.

Evidence for this perspective includes the gradient descent equivalence discussed above, the observation that ICL performance improves with model depth (more optimization steps), and the finding that transformers can learn to implement various learning algorithms (ridge regression, logistic regression, decision trees) from in-context examples alone.

Warning

The mesa-optimization perspective remains an active area of debate. It is unclear whether the internal computations of real LLMs are truly optimizing a coherent objective or merely performing pattern matching that resembles optimization in controlled settings. The theoretical frameworks provide useful intuitions but have not been conclusively validated on production-scale models.

6.7.6 Practical Implications: Why Prompt Design Matters

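Theory becomes concrete the moment you start writing prompts. Four design choices follow directly from the frameworks above: which examples you choose, how large a model you use, how many examples you include, and how you format them. Table 6.7.1b summarizes the mapping from each theory to a strategy.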

Key Insight
Worked Example: Example Order Swings Accuracy by More Than 30 Points

Lu et al. (2022, "Fantastically Ordered Prompts and Where to Find Them," ACL) ran a controlled experiment on GPT-3 with the SST-2 sentiment task. They picked four labeled examples and tested every one of the 24 possible orderings as a 4-shot prompt. Best ordering: 88.7 percent accuracy. Worst ordering, same four examples, same model, same query: 54.3 percent. That 34-point gap from reordering alone is the Bayesian and implicit-GD story made tangible: the prompt is not a bag of examples but a sequence of conditional updates, and ordering matters the same way it matters in an actual stochastic gradient descent run. If you have ever shipped a prompt that "stopped working" after a teammate moved one example to the end, you have already lived through this experiment.
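Enumerating every ordering of a small demonstration set is straightforward to reproduce. The sketch below only builds the 24 prompts; scoring them requires a model call, which is left out. The four reviews and the `format_prompt` helper are illustrative assumptions, not the examples used by Lu et al.

```python
from itertools import permutations
from typing import List, Sequence, Tuple

# Hypothetical 4-shot demonstration set for a binary sentiment task.
examples: List[Tuple[str, str]] = [
    ("A warm, funny, and moving film.", "Positive"),
    ("Flat characters and a plot that goes nowhere.", "Negative"),
    ("One of the best performances of the year.", "Positive"),
    ("I checked my watch every five minutes.", "Negative"),
]
query = "The script sparkles from start to finish."


def format_prompt(ordered: Sequence[Tuple[str, str]], query_text: str) -> str:
    """Render demonstrations in the given order, followed by the unlabeled query."""
    blocks = [f"Review: {text}\nSentiment: {label}" for text, label in ordered]
    blocks.append(f"Review: {query_text}\nSentiment:")
    return "\n\n".join(blocks)


# 4! = 24 distinct orderings of the same four examples.
prompts = [format_prompt(order, query) for order in permutations(examples)]
print(len(prompts), "prompts that differ only in example order")
print(prompts[0])
```

Running each of these prompts over a held-out set and reporting the spread between the best and worst ordering, rather than a single accuracy number, is a cheap way to check how fragile a few-shot prompt is before shipping it.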

Key Insight: In-Context Learning and the Duality of Learning

In-context learning reveals a striking duality in how neural networks acquire capabilities. Traditional machine learning updates parameters (weights) through gradient descent over many examples. ICL achieves something functionally equivalent by updating activations (hidden states) through a single forward pass. This mirrors a distinction in biology between evolutionary adaptation (slow, across generations, encoded in DNA) and neural plasticity (fast, within a lifetime, encoded in synaptic connections). Pretraining is the evolutionary process that builds the general architecture; in-context learning is the real-time adaptation that specializes it for a task. The mesa-optimization perspective pushes this analogy further: if transformers truly learn to optimize internally, they have crossed a threshold from being tools that are optimized to being systems that optimize, a qualitative shift that connects to fundamental questions in philosophy of mind about the nature of agency and goal-directed behavior.

6.7.7 Limitations of In-Context Learning

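Despite its power, ICL has systematic limitations. Each one is a concrete failure mode you will see in production within weeks of shipping a few-shot pipeline. The ones discussed in this section are sensitivity to example order, bias toward recently seen labels, sensitivity to small formatting changes, the ceiling the context window places on the number of demonstrations, and the lack of any persistence across separate calls.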

6.7.8 Connection to Few-Shot Prompting Practice

Understanding ICL theory improves practical few-shot prompting. The table below connects theoretical insights to actionable strategies.

Table 6.7.1b: Connecting ICL theory to few-shot prompting practice (as of 2026).

| Theory | Implication | Practical Strategy |
| --- | --- | --- |
| Bayesian inference | Examples narrow the task posterior | Choose diverse, representative examples |
| Implicit GD | More layers = more optimization steps | Use larger models for harder ICL tasks |
| Task vectors | Task representation converges quickly | 3-5 examples often suffice |
| Mesa-optimization | Model implements a learning algorithm | Format examples like "training data" |

Key Takeaways
Self-Check
1. How does the Bayesian interpretation explain why more in-context examples generally improve performance?
Answer:
In the Bayesian framework, each in-context example provides evidence for the correct latent concept (task). With more examples, the posterior distribution P(concept | demonstrations) becomes more concentrated around the true concept, reducing uncertainty. This is analogous to how observing more data points narrows a Bayesian posterior. However, there are diminishing returns: once the posterior is sufficiently peaked, additional examples provide little additional information.
2. What mathematical equivalence exists between linear attention and gradient descent?
Answer:
A single linear attention layer computing $f(x_{q}) = W_{V} Y^{T} (X W_{K}^{T} W_{Q} x_{q})$ has the same mathematical form as one step of gradient descent on a linear regression problem. The key-value pairs formed from in-context examples play the role of training data, and the query projection plays the role of the test input. The attention computation effectively takes one step toward fitting a linear model to the demonstrations and evaluates the result at the query point. Multi-layer transformers extend this to multiple gradient steps with nonlinear activations between them.
3. What is a task vector and how does it provide evidence for ICL mechanisms?
Answer:
A task vector is the difference in a model's internal activations between a few-shot prompt and an equivalent zero-shot prompt. It encodes the "task" demonstrated by the few-shot examples as a direction in the model's representation space. Task vectors provide evidence for ICL mechanisms because: (1) adding a task vector to zero-shot activations recovers few-shot performance, proving the vector carries task information; (2) task vectors for the same task from different example sets are similar, showing task encoding is consistent; (3) they are localized to specific layers, revealing where in the network ICL computations occur.
4. Why might ICL fail on tasks requiring complex multi-step reasoning?
Answer:
ICL, viewed as implicit gradient descent, implements a limited number of optimization steps (bounded by model depth). Complex multi-step reasoning requires composing many sequential operations, each dependent on the previous result. The implicit optimizer may not have enough steps to converge on such tasks. Additionally, the Bayesian interpretation suggests that complex reasoning tasks are unlikely to appear as coherent "concepts" in the pretraining distribution, making them hard to identify from demonstrations. Finally, the task vector mechanism may be too simple to represent tasks that require conditional branching or recursive computation.

Exercises

Exercise 6.7.1: ICL is Not Learning (Conceptual)

"In-context learning" is a misnomer in one important sense. (a) What weights are updated during in-context learning? (b) If the model isn't really learning, why does showing examples in the prompt help so much? (c) What does this imply for tasks where the desired behavior was never seen in any form during pretraining?

Answer Sketch

(a) None: every weight is frozen at inference time. The "learning" happens entirely in the forward pass through context activations, not in any weight update. (b) The examples function as a task identifier and an output-format anchor: they let the model retrieve the right behavior from its pretraining-acquired skill library, conditioning the next-token distribution on the inferred task identity. (c) On truly novel tasks (no analogue in pretraining), in-context examples buy you very little; you typically need fine-tuning or external tools. This is why ICL works astonishingly well for "translate", "summarize", "extract", and surprisingly poorly for, say, novel mathematical reasoning over symbolic schemas the model has never seen.

Exercise 6.7.2: Predict the Few-Shot Curve (Predictive)

You measure GSM8K accuracy on a frontier model with 0, 1, 2, 4, 8, 16 in-context examples. Predict the qualitative shape of the curve: (a) Where do you expect the largest jump? (b) Where do diminishing returns set in? (c) What single intervention would shift the entire curve up more than adding more examples ever could?

Answer Sketch

(a) The biggest jump is from 0 -> 1: the single example provides the format anchor (chain-of-thought style with "Let's think step by step", or a worked answer pattern) and is worth far more than any subsequent example. (b) Diminishing returns kick in by 4-8 examples; on most reasoning tasks 16 examples is statistically indistinguishable from 8. (c) Switching from few-shot to chain-of-thought prompting shifts the curve up by ~10-30 points on math and logical reasoning, dwarfing the few-shot effect. Even a single chain-of-thought example beats 16 plain examples. This is why 2026 production prompts almost universally include CoT scaffolding rather than relying on raw few-shot.

Exercise 6.7.3: Inspect a Task Vector (Code Tweak)

The "task vector" hypothesis says that ICL produces an internal vector at one layer that encodes the task identity. Sketch a 6-line probe to test it: feed a model a few-shot prompt for sentiment analysis, capture the residual stream at layer L on the last few-shot example, then inject that captured vector into a zero-shot prompt at the same layer and check if accuracy rises. What is the expected outcome?

Answer Sketch
# run_capture and evaluate are hypothetical helpers; L is the chosen layer index
vec = model.run_capture(few_shot_prompt, layer=L)[-1]  # residual at last token
def hook(module, inp, out): out[0][:, -1, :] += vec; return out  # inject at last position
handle = model.layers[L].register_forward_hook(hook)
acc_zero_shot_with_vec = evaluate(model, zero_shot_prompts)
handle.remove()  # detach the hook so later runs are unaffected
Code Fragment 6.7.5: Sketch of a task-vector injection probe (pseudocode; the helper names are placeholders).

Expected: accuracy on the zero-shot prompts rises substantially (often within 5-10% of the few-shot baseline), confirming that much of ICL's effect can be compressed into a single vector at one layer. The Hendel et al. (2023) "task vectors" paper formalized this; it gives empirical weight to the "ICL is Bayesian task identification" interpretation.

Exercise 6.7.4: ICL Failure Modes (Failure Mode)

List four ways in-context learning can fail in production, beyond simple "wrong answer". For each, give one prompting or systems mitigation.

Answer Sketch

(1) Format leakage: model copies the exact wording of an example label rather than reasoning. Mitigation: vary the examples' surface forms. (2) Recency bias: predictions skew toward the answer in the most recent example. Mitigation: shuffle example order, or use more examples to dilute the bias. (3) Distractor sensitivity: an irrelevant example or a near-duplicate example dramatically changes the answer. Mitigation: dynamic example selection by similarity to the query. (4) Length-induced degradation: adding too many examples pushes important context out of the model's effective attention range. Mitigation: cap shots, or use a smaller relevant subset retrieved at query time. These four failure modes are why "more examples is always better" is wrong in practice.
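Mitigations (3) and (4) both come down to picking a small, relevant subset of demonstrations per query. A minimal sketch, assuming a whitespace tokenizer and Jaccard overlap as the similarity measure (production systems typically use embedding similarity instead); the function names and the example pool are illustrative.

```python
from typing import List, Set, Tuple


def tokenize(text: str) -> Set[str]:
    """Crude lowercase whitespace tokenizer (an assumption for this sketch)."""
    return set(text.lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_examples(pool: List[Tuple[str, str]], query: str, k: int = 2) -> List[Tuple[str, str]]:
    """Return the k labeled examples whose text overlaps most with the query."""
    q_tokens = tokenize(query)
    ranked = sorted(pool, key=lambda ex: jaccard(tokenize(ex[0]), q_tokens), reverse=True)
    return ranked[:k]


pool = [
    ("This movie was absolutely wonderful!", "Positive"),
    ("Terrible acting and a boring plot.", "Negative"),
    ("The cinematography was stunning but the story fell flat.", "Mixed"),
    ("I laughed and cried, a true masterpiece.", "Positive"),
]

selected = select_examples(pool, "the acting was terrible", k=2)
for text, label in selected:
    print(f"{label}: {text}")
```

Capping `k` keeps the prompt short, and ranking by similarity means irrelevant or distracting examples are less likely to make it into the prompt at all.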

Research Frontier

Mechanistic understanding of ICL

Recent work by Todd et al. (2024) has identified "function vectors" that encode specific input-output mappings in model activations, extending the task vector framework to fine-grained function identification. Meanwhile, the Many-Shot ICL paradigm (Agarwal et al., 2024) shows that with sufficiently long context windows (hundreds of examples), ICL performance approaches fine-tuning quality on many benchmarks, blurring the boundary between in-context adaptation and gradient-based learning. The connection between ICL and interpretability (Section 11.1) continues to deepen, with induction head analysis now used as a diagnostic tool for model quality.

ICL failure modes

Despite its power, in-context learning fails in systematic and poorly understood ways. ICL can be sensitive to the order of examples (permuting few-shot examples sometimes changes the answer), to the label space (models can be biased toward labels seen more recently), and to the format of examples (small formatting changes can cause large performance swings). Understanding when ICL will fail and why remains an open research question. Practical advice: always test ICL setups with multiple example orderings and formats, and consider fine-tuning when reliability is critical.

What's Next?

In the next section, Section 6.8: Production LLM Training Systems: Megatron, Elastic Training, and Fault Tolerance, we turn to the systems used to train LLMs in production.

Further Reading

Empirical Foundations

Brown, T. B. et al. (2020). "Language Models are Few-Shot Learners." NeurIPS 2020. The GPT-3 paper that first systematically demonstrated in-context learning at scale. Showed that large language models can perform tasks from a handful of examples in the prompt, without any parameter updates.
Min, S. et al. (2022). "Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?" EMNLP 2022. Reveals the surprising finding that correct input-label mappings in demonstrations matter less than the format, label space, and input distribution. Challenges naive assumptions about why in-context learning works.

Theoretical Explanations

Xie, S. M. et al. (2022). "An Explanation of In-context Learning as Implicit Bayesian Inference." ICLR 2022. Proposes that in-context learning performs implicit Bayesian inference over latent document-generating concepts. Provides a theoretical framework for understanding how prompts activate different "concept" distributions learned during pretraining.
Von Oswald, J. et al. (2023). "Transformers Learn In-Context by Gradient Descent." ICML 2023. Demonstrates that transformer attention layers can implement gradient descent steps on in-context examples. Provides a mechanistic connection between ICL and traditional learning algorithms executed within the forward pass.
Akyurek, E. et al. (2023). "What Learning Algorithm Is In-Context Learning? Investigations with Linear Models." ICLR 2023. Investigates which learning algorithms transformers implement during in-context learning on linear regression tasks. Finds evidence of ridge regression and gradient descent, offering concrete algorithmic interpretations of ICL behavior.

Mechanistic Interpretability

Olsson, C. et al. (2022). "In-context Learning and Induction Heads." Transformer Circuits Thread. Identifies "induction heads" as a key circuit responsible for in-context learning, showing that specific attention head compositions copy and complete patterns from the context. Connects ICL to concrete architectural mechanisms.