Adapting Models for Long Text

Section 16.7

"The model was trained on 4K tokens. I fed it 32K tokens. It handled the first thousand and last thousand beautifully. Everything in the middle? Lost, presumably on vacation."

Finetune, Context-Stretching AI Agent
Big Picture

Most real-world documents are longer than the sequences models were trained on. Legal contracts, research papers, codebases, and book manuscripts routinely exceed the 4K or 8K context windows that many models were originally trained with. Simply passing a longer sequence to a model trained on shorter sequences causes severe quality degradation because the positional encodings extrapolate into regions the model has never seen. This section covers the techniques for extending a model's effective context length: mathematical adjustments to positional encodings, continued pretraining (Section 7.1) on long documents, and practical chunking strategies for when extension is not enough (chunking is also central to RAG systems). The positional encoding foundations from Section 3.1 explain why position information is necessary and how RoPE encodes it.

Prerequisites

Before starting, make sure you are familiar with the fine-tuning overview in Section 16.1: When and Why to Fine-Tune.

16.7.1 The Long Context Challenge

Transformer models encode position information through positional embeddings or positional encodings. When a model trained with a maximum sequence length of 4,096 tokens receives a sequence of 8,192 tokens, the positions beyond 4,096 are "out of distribution." The model has never learned what those position values mean, leading to degraded attention patterns and poor generation quality.

Fun Note

The "lost in the middle" phenomenon is one of the most counterintuitive findings in LLM research. Models with 128K context windows can reliably use information at the beginning and end of the context, but often ignore information placed in the middle. It is like reading a novel where you remember the first chapter and the last chapter but forget everything in between. Researchers discovered this by hiding a critical fact at various positions in a long context and measuring retrieval accuracy, which follows a distinctive U-shaped curve.

Warning
Common Mistake: Assuming Larger Context Windows Mean Better Comprehension

A model that accepts 128K tokens does not necessarily use all 128K tokens effectively. The "lost in the middle" phenomenon means that information placed in the middle third of a long context is frequently ignored, even by models specifically trained for long contexts. Do not assume that feeding an entire document into a long-context model will produce better results than a well-designed RAG pipeline that retrieves only the relevant chunks. Always test retrieval accuracy at different positions within your actual context lengths.
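One way to act on this advice is a small needle-in-a-haystack harness. The sketch below is illustrative only: `build_needle_prompt`, `position_sweep`, and the `ask_model` callable are hypothetical names rather than part of any library, and the filler text, needle, and toy model are invented so the harness can run offline. It inserts one fact at several relative depths and records whether the answer comes back.

```python
from typing import Callable, Dict, List


def build_needle_prompt(filler: List[str], needle: str, depth: float, question: str) -> str:
    """Insert `needle` at relative `depth` (0.0 = start, 1.0 = end) of the filler sentences."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    index = round(depth * len(filler))
    sentences = filler[:index] + [needle] + filler[index:]
    return " ".join(sentences) + "\n\nQuestion: " + question


def position_sweep(
    ask_model: Callable[[str], str],  # hypothetical: prompt in, model answer out
    filler: List[str],
    needle: str,
    question: str,
    expected: str,
    depths: List[float],
) -> Dict[float, bool]:
    """For each depth, report whether the expected answer appears in the model output."""
    results = {}
    for depth in depths:
        prompt = build_needle_prompt(filler, needle, depth, question)
        results[depth] = expected.lower() in ask_model(prompt).lower()
    return results


def toy_model(prompt: str) -> str:
    """Stand-in for a real model: it only 'reads' the first and last 200 characters,
    mimicking an extreme lost-in-the-middle failure."""
    context = prompt.split("\n\nQuestion:")[0]
    visible = context[:200] + context[-200:]
    return "7421" if "7421" in visible else "I do not know"


filler = [f"Filler sentence number {i}." for i in range(200)]
needle = "The vault code is 7421."
sweep = position_sweep(
    toy_model, filler, needle, "What is the vault code?", "7421",
    depths=[0.0, 0.25, 0.5, 0.75, 1.0],
)
for depth, found in sweep.items():
    print(f"depth={depth:.2f} found={found}")
```

To test a real system, replace `toy_model` with a function that calls your model, use filler drawn from your own documents, and sweep the context lengths your application actually sends.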

16.7.1.1 Why Models Fail on Long Sequences

The failure mode depends on the type of positional encoding. Models using absolute positional embeddings (original BERT, GPT-2) have a hard limit: positions beyond the embedding table size simply do not exist. Models using Rotary Position Embeddings (RoPE), which are standard in modern LLMs like Llama, Mistral, and Qwen, can technically process longer sequences, but the rotation angles for unseen positions are extrapolated, causing attention scores to become increasingly noisy. The figure below shows this quality degradation beyond the training window.

Key Insight

Mental Model: The Telescope Extension. Think of adapting models for long text as extending a telescope. The base model's context window is the default focal length: sharp and clear within its range, but blind beyond it. Techniques like RoPE scaling and YaRN are lens adjustments that extend the focal range, letting the model 'see' further into the document. The trade-off is that stretching the view can reduce sharpness (per-token attention quality) at the extremes, so you must verify that extended-context performance holds for your specific use case.

Perplexity versus sequence length chart showing degradation beyond training window without context extension, and stable performance with context extension
Figure 16.7.1: Without context extension, perplexity degrades sharply beyond the training window. Context extension techniques maintain quality at longer lengths.
Lost in the middle: U-shaped retrieval accuracy curve across long context positions
Figure 16.7.2: The "lost in the middle" U-curve. Liu et al. (2023) hid a needle fact at different positions inside a long context and measured retrieval accuracy. Even models with 128K-token windows (Llama-3.1, GPT-4-128K, Claude 2.1 and later Claude 3.5) show a sharp dip in the middle 50% of positions: accuracy can fall from roughly 98% near the start (primacy) and 95% near the end (recency) down to about 45% at the midpoint. The implication is that a long-context model is not a free RAG replacement: even when the answer fits in the window, the model may not find it. Always test retrieval accuracy at the positions your application actually uses.

16.7.2 Context Extension Techniques

Several techniques have been developed to extend a model's effective context length without retraining from scratch. These techniques modify the positional encoding scheme so that longer sequences map to position values the model has already learned to handle.

16.7.2.1 RoPE Scaling (Linear Interpolation)

The simplest context extension method is linear scaling, also called position interpolation. Instead of using raw position indices (0, 1, 2, ..., 8191) for an 8K sequence, you scale them down to fit within the original training range: (0, 0.5, 1, 1.5, ..., 4095.5). Formally, RoPE rotates the $i$-th query/key dimension pair by angle $\theta_i(m) = m \cdot \omega_i$ at position $m$, where $\omega_i = \theta_{\text{base}}^{-2i/d}$. Linear interpolation with factor $s$ replaces the position $m$ with $m / s$:

$$\theta_i^{\text{scaled}}(m) = \frac{m}{s} \cdot \omega_i.$$

This ensures all position values fall within the range the model was trained on. Code Fragment 16.7.1b shows this approach in practice.

Linear RoPE scaling (s=2): squeeze new positions into the training range
Figure 16.7.3: Linear RoPE scaling (also called Position Interpolation). The blue bar shows the rotation-angle range the model saw during pretraining (positions 0 to 4095). To handle a longer 8192-token inference, every position index is divided by the scale factor s (here s=2), so the new positions land inside the trained range. Angular resolution per token halves, which is why a short fine-tune on long contexts is needed to recover crisp local attention.

Why interpolation works where extrapolation fails. RoPE encodes position by rotating each pair of query/key dimensions through an angle proportional to the position index. The model has only seen rotation angles up to its training horizon; outside that range, the rotations land on values the attention dot product was never tuned for. Interpolation forces every position in the new long sequence to live inside the model's familiar angular range, so attention scores stay well-behaved. The tax is angular resolution: tokens that were originally 1 position apart are now $1/s$ positions apart (0.5 for $s = 2$, 0.25 for $s = 4$), so the model is less able to tell adjacent positions apart, and brief fine-tuning is required to recover the local precision. This is why interpolation has become standard while naive extrapolation has not: keep the model inside the rotation regime it already understands, then add a small amount of fine-tuning to sharpen its perception of fractional positions.

Position Interpolation in One Equation

Position Interpolation (PI) makes the scale factor explicit. Choose $s = L_{\text{target}} / L_{\text{train}}$, the ratio of the desired context length to the original training length, and map every position $m$ to $m / s$. Because RoPE's angle for the $i$-th frequency band is $\theta_i(m) = m \, \omega_i$, the rescaled angle is simply

$$\theta_i'(m) = \frac{m}{s}\,\omega_i = \frac{L_{\text{train}}}{L_{\text{target}}}\, m\, \omega_i .$$

The point of this rescaling is that the largest angle the model ever sees stays put. At the new maximum position $m = L_{\text{target}} - 1$, the angle becomes $\theta_i'(L_{\text{target}} - 1) = (L_{\text{train}} - 1/s)\,\omega_i$, which lies within one position step of the maximum angle $(L_{\text{train}} - 1)\,\omega_i$ the model encountered during pretraining. Every position therefore maps to an angle below $L_{\text{train}}\,\omega_i$, so no band is asked to extrapolate meaningfully beyond the rotations it was trained on. The cost, visible in the equation, is that adjacent positions $m$ and $m + 1$ are now only $\omega_i / s$ radians apart instead of $\omega_i$, which is the angular-resolution tax described above.
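The Position Interpolation arithmetic can be checked numerically. The following is a minimal sketch, assuming a head dimension of 128 and the common base of 10000 (both are illustrative choices, and `rope_frequencies` and `rope_angle` are hypothetical helper names). It compares the largest unwrapped angle with and without scaling for a fast, a middle, and a slow band.

```python
def rope_frequencies(dim: int, base: float = 10000.0) -> list:
    """Per-band angular frequencies omega_i = base^(-2i/dim) for i = 0 .. dim/2 - 1."""
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]


def rope_angle(position: float, omega: float, scale: float = 1.0) -> float:
    """Unwrapped rotation angle for one band; scale > 1 applies Position Interpolation."""
    return (position / scale) * omega


L_train, L_target = 4096, 8192        # example lengths used in the text
s = L_target / L_train                # scale factor, here 2.0
omegas = rope_frequencies(dim=128)    # dim=128 is an assumed head dimension

for i in (0, 32, 63):                 # fastest, middle, and slowest band
    w = omegas[i]
    trained_max = rope_angle(L_train - 1, w)
    extrapolated = rope_angle(L_target - 1, w)           # raw positions, no scaling
    interpolated = rope_angle(L_target - 1, w, scale=s)  # Position Interpolation
    step = rope_angle(1, w, scale=s)                     # spacing between adjacent tokens
    print(
        f"band {i:2d}: trained max={trained_max:10.4f}  "
        f"extrapolated={extrapolated:10.4f}  interpolated={interpolated:10.4f}  "
        f"step={step:.6f} (unscaled {w:.6f})"
    )
```

For every band the interpolated maximum stays under $L_{\text{train}}\,\omega_i$ while the per-token step shrinks by the factor $s$, which is the trade described above.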

Why Naive Interpolation Hurts: NTK-Aware Scaling and YaRN

Linear PI applies the same factor $s$ to every band, and that uniformity is its weakness. RoPE's frequencies span a spectrum: high-frequency bands (large $\omega_i$) complete a full rotation over just a few tokens and encode short-range position, the signal the model relies on for local syntax; low-frequency bands (small $\omega_i$) turn slowly and encode long-range position. Dividing every position by $s$ compresses all bands equally, so it also squeezes the fast bands the model uses to tell neighbouring tokens apart. That is why naive linear interpolation degrades local resolution: the fast bands already sweep through many full rotations inside the training window and never needed interpolating, yet they are compressed just as hard as the slow long-range bands that did.

NTK-aware scaling fixes this by stretching the frequency base instead of the positions. Replacing $\theta_{\text{base}}$ with a larger $\theta_{\text{base}}' = \theta_{\text{base}} \cdot s^{\,d/(d-2)}$ rescales the bands non-uniformly: the slow, low-frequency long-range bands are stretched (the ones that must reach further), while the fast, high-frequency short-range bands are left almost untouched, sparing the local resolution that linear PI sacrifices. The effect is that local syntax stays sharp and only the long-range positional structure is interpolated.
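The non-uniform effect of the base change can be made concrete by computing how much each band slows down. This is a small sketch under assumed values (head dimension 128, base 10000, $s = 4$); `ntk_scaled_base` and `band_frequencies` are hypothetical helper names.

```python
def ntk_scaled_base(base: float, scale: float, dim: int) -> float:
    """NTK-aware base: base * scale^(dim / (dim - 2))."""
    return base * scale ** (dim / (dim - 2))


def band_frequencies(dim: int, base: float) -> list:
    """omega_i = base^(-2i/dim) for i = 0 .. dim/2 - 1."""
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]


dim, base, s = 128, 10000.0, 4.0      # assumed head dimension, base, and scale factor
original = band_frequencies(dim, base)
stretched = band_frequencies(dim, ntk_scaled_base(base, s, dim))

# Per-band slowdown: 1.0 means the band is untouched, s means "same as linear PI".
slowdown = [w / w_new for w, w_new in zip(original, stretched)]
for i in (0, 16, 32, 48, 63):
    print(f"band {i:2d}: omega={original[i]:.3e}  slowdown={slowdown[i]:.3f}")
```

Algebraically the slowdown of band $i$ is $s^{2i/(d-2)}$: exactly 1 for the fastest band and exactly $s$ for the slowest, so only the slowest band is interpolated as strongly as under linear PI.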

YaRN (Yet another RoPE extensioN) makes that intuition explicit by partitioning the spectrum into three regions: high-frequency bands are left untouched (full local resolution preserved), low-frequency bands are interpolated like linear PI (they can safely stretch), and a middle band is smoothly ramped between the two so there is no discontinuity. YaRN also applies a small temperature correction to the attention logits to offset the entropy shift that scaling introduces. The result is that YaRN reaches longer extensions with less fine-tuning than linear PI, because it only stretches the bands that tolerate stretching. The full per-band derivation of RoPE and these scaling schemes lives in Section 3.5.2; the rest of this section focuses on the fine-tuning angle: how to continue training a model at the new length so it adapts to the rescaled angles.
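The three-region split can be sketched as a per-band blend. The code below is a simplified illustration, not a reference implementation: the ramp thresholds `alpha=1` and `beta=32` and the scale `0.1 * ln(s) + 1` are the values I understand the YaRN paper to suggest for Llama-family models, and the function names, head dimension, and lengths are assumptions made for the example.

```python
import math


def yarn_frequencies(dim, base, scale, train_len, alpha=1.0, beta=32.0):
    """Blend interpolated and original RoPE frequencies band by band."""
    blended = []
    for i in range(dim // 2):
        omega = base ** (-2.0 * i / dim)
        # Number of full turns this band completes inside the training window.
        rotations = train_len * omega / (2.0 * math.pi)
        # ramp = 0: interpolate fully (slow band); ramp = 1: leave untouched (fast band).
        ramp = min(1.0, max(0.0, (rotations - alpha) / (beta - alpha)))
        blended.append((1.0 - ramp) * omega / scale + ramp * omega)
    return blended


def yarn_qk_scale(scale):
    """Scale applied to rotated queries and keys; attention logits scale by its square."""
    return 0.1 * math.log(scale) + 1.0


dim, base, s, L_train = 128, 10000.0, 8.0, 4096   # assumed values: 4K -> 32K extension
freqs = yarn_frequencies(dim, base, s, L_train)
for i in (0, 20, 40, 63):
    omega = base ** (-2.0 * i / dim)
    print(f"band {i:2d}: slowdown={omega / freqs[i]:.3f}")
print(f"query/key scale: {yarn_qk_scale(s):.4f}")
```

Fast bands report a slowdown of 1 (untouched), the slowest band reports the full factor $s$, and bands in between fall on the ramp.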

Fine-Tuning at the New Length

Scaling alone repositions the angles, but the model's weights were optimized for the original resolution, so a short bout of continued pretraining at the target length is what recovers crisp attention. Two data choices dominate this stage. First, the fine-tuning corpus must actually contain sequences near $L_{\text{target}}$; if the training documents are all 2K tokens, the model never practices attending across the freshly opened 32K window and the extension stays theoretical. Second, the length distribution matters: a corpus skewed entirely toward maximum-length documents can erode short-context quality, so practitioners mix lengths (a blend of short, medium, and long documents) so the model sharpens its long-range behaviour without forgetting the short-range behaviour it already had. With NTK-aware or YaRN scaling the required amount of this continued pretraining shrinks (sometimes to a few hundred steps) precisely because the high-frequency bands were never disturbed and need little relearning.
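The length-mixing idea can be sketched as bucketed sampling. Everything in this example is hypothetical: the helper names, the bucket edges, the document lengths, and especially the mixture weights, which are placeholders rather than a recommended recipe.

```python
import random
from typing import Dict, List


def bucket_by_length(doc_lengths: List[int], edges: List[int]) -> Dict[str, List[int]]:
    """Group document indices into buckets defined by ascending token-count edges."""
    buckets: Dict[str, List[int]] = {}
    for doc_id, n_tokens in enumerate(doc_lengths):
        label = f">{edges[-1]}"          # longer than the target window
        for edge in edges:
            if n_tokens <= edge:
                label = f"<={edge}"
                break
        buckets.setdefault(label, []).append(doc_id)
    return buckets


def sample_mixture(buckets, weights, n_samples, seed=0):
    """Draw document ids so that each bucket appears with its mixture weight."""
    rng = random.Random(seed)
    labels = [label for label in weights if buckets.get(label)]
    if not labels:
        raise ValueError("no documents available in the weighted buckets")
    picks = rng.choices(labels, weights=[weights[l] for l in labels], k=n_samples)
    return [rng.choice(buckets[label]) for label in picks]


doc_lengths = [1_000, 3_500, 9_000, 15_000, 20_000, 31_000, 40_000]  # made-up token counts
buckets = bucket_by_length(doc_lengths, edges=[4096, 16384, 32768])
weights = {"<=4096": 0.3, "<=16384": 0.3, "<=32768": 0.4}            # placeholder mixture
batch = sample_mixture(buckets, weights, n_samples=10)
print(buckets)
print([doc_lengths[i] for i in batch])
```

Documents longer than the target window land in the `>32768` bucket and are never sampled here; in practice they would need truncating or splitting before training.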

Code Fragment 16.7.1b demonstrates the linear RoPE scaling configuration.

# Configure linear RoPE scaling to extend context length 4x
# Position indices are interpolated to fit within the original training range
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

# Method 1: Linear scaling (Position Interpolation)
# Extend a 4K context model to handle 16K sequences.
# Llama-2 7B is used here because its native window is 4,096 tokens.
model_name = "meta-llama/Llama-2-7b-hf"
config = AutoConfig.from_pretrained(model_name)

# Set the RoPE scaling configuration.
# Newer transformers releases name the first key "rope_type"; check your installed version.
config.rope_scaling = {
    "type": "linear",
    "factor": 4.0,  # Extend 4x: 4K -> 16K
}
# Update max position embeddings to match
config.max_position_embeddings = 16384  # 4096 * 4

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    config=config,
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

print(f"Max positions: {model.config.max_position_embeddings}")
print(f"RoPE scaling: {model.config.rope_scaling}")
Output:
Max positions: 16384
RoPE scaling: {'type': 'linear', 'factor': 4.0}
Code Fragment 16.7.1b: Configure linear RoPE scaling to extend context length 4x

While linear interpolation works well with additional fine-tuning, dynamic NTK scaling (Code Fragment 16.7.2c) adjusts the frequency base automatically at inference time, based on the actual sequence length, and is typically used without any fine-tuning.

from transformers import AutoConfig, AutoModelForCausalLM

# Method 2: Dynamic NTK scaling
model_name = "meta-llama/Llama-2-7b-hf"  # a base model with a 4K native context window
config = AutoConfig.from_pretrained(model_name)
config.rope_scaling = {
    "type": "dynamic",
    "factor": 4.0,  # Extend 4x
}
config.max_position_embeddings = 16384

# Dynamic NTK computes the scaling based on actual sequence length
# at inference time, so it adapts to varying input lengths
model_dynamic = AutoModelForCausalLM.from_pretrained(
    model_name,
    config=config,
    torch_dtype="auto",
    device_map="auto",
)
print(f"RoPE scaling: {model_dynamic.config.rope_scaling}")
Output: RoPE scaling: {'type': 'dynamic', 'factor': 4.0}
Code Fragment 16.7.2c: Method 2: Dynamic NTK scaling
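To show what "dynamic" means in practice, the sketch below recomputes the RoPE base from the current sequence length. The formula is modelled on the one used in common open-source implementations as I understand them, so treat its exact form as an assumption; `dynamic_ntk_base` is a hypothetical helper, and `train_len` stands for the length below which no adjustment is applied.

```python
def dynamic_ntk_base(base: float, dim: int, seq_len: int, train_len: int, factor: float) -> float:
    """Return the RoPE base to use for a sequence of `seq_len` tokens."""
    if seq_len <= train_len:
        return base  # inside the trained window: leave RoPE unchanged
    # Grows linearly with how far the sequence exceeds the trained window.
    stretch = (factor * seq_len / train_len) - (factor - 1)
    return base * stretch ** (dim / (dim - 2))


base, dim, train_len, factor = 10000.0, 128, 4096, 4.0   # assumed example values
for seq_len in (2048, 4096, 8192, 16384):
    adjusted = dynamic_ntk_base(base, dim, seq_len, train_len, factor)
    print(f"seq_len={seq_len:6d}  base={adjusted:12.1f}")
```

Short inputs keep the original base, so short-context behaviour is unchanged, and the base only grows once the input exceeds the threshold length.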

When input documents exceed the extended context window, you need a chunking strategy. Code Fragment 16.7.3a implements two approaches: fixed-window chunking with token-level overlap (simple, predictable chunk sizes) and semantic chunking that splits at paragraph or section boundaries (preserves meaning but produces variable-length chunks).

# Two chunking strategies: fixed-window with overlap and semantic splitting
# Both return per-chunk metadata (token counts; the fixed-window version also records offsets)
import re
from typing import List

from transformers import AutoTokenizer


def chunk_with_overlap(
    text: str,
    chunk_size: int = 512,
    overlap: int = 64,
    tokenizer=None,
) -> List[dict]:
    """Chunk text with token-level overlap."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    if tokenizer is None:
        tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    tokens = tokenizer.encode(text, add_special_tokens=False)
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        chunk_tokens = tokens[start:end]
        chunks.append({
            "text": tokenizer.decode(chunk_tokens, skip_special_tokens=True),
            "start_token": start,
            "end_token": end,
            "num_tokens": len(chunk_tokens),
        })
        # Stop once the final token has been covered
        if end >= len(tokens):
            break
        # Move forward by (chunk_size - overlap)
        start += chunk_size - overlap
    return chunks


def semantic_chunk(
    text: str,
    max_chunk_tokens: int = 512,
    tokenizer=None,
) -> List[dict]:
    """Split text at semantic boundaries (paragraphs, sections)."""
    # Split on paragraph boundaries (double newlines) and section headers
    segments = re.split(r'\n\n+|\n(?=#)', text)
    segments = [s.strip() for s in segments if s.strip()]
    if tokenizer is None:
        tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    chunks = []
    current_segments = []
    current_tokens = 0
    for segment in segments:
        seg_tokens = len(tokenizer.encode(segment, add_special_tokens=False))
        if current_tokens + seg_tokens > max_chunk_tokens and current_segments:
            # Save current chunk and start a new one
            chunks.append({
                "text": "\n\n".join(current_segments),
                "num_tokens": current_tokens,
                "num_segments": len(current_segments),
            })
            current_segments = []
            current_tokens = 0
        # A single segment longer than max_chunk_tokens becomes its own oversized chunk
        current_segments.append(segment)
        current_tokens += seg_tokens
    # Save the last chunk
    if current_segments:
        chunks.append({
            "text": "\n\n".join(current_segments),
            "num_tokens": current_tokens,
            "num_segments": len(current_segments),
        })
    return chunks


# Example
sample_text = (
    "First paragraph about topic A. " * 50 + "\n\n"
    + "Second paragraph about topic B. " * 30 + "\n\n"
    + "Third paragraph wrapping up. " * 20
)
overlap_chunks = chunk_with_overlap(sample_text, chunk_size=128, overlap=32)
semantic_chunks = semantic_chunk(sample_text, max_chunk_tokens=256)
print(f"Overlap chunking: {len(overlap_chunks)} chunks")
print(f"Semantic chunking: {len(semantic_chunks)} chunks")
Output (with the bert-base-uncased tokenizer):
Overlap chunking: 6 chunks
Semantic chunking: 3 chunks
Code Fragment 16.7.3a: Two chunking strategies: fixed-window with overlap and semantic splitting
Library Shortcut: RecursiveCharacterTextSplitter

In production, prefer langchain_text_splitters.RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64) over a hand-rolled window loop. It tries paragraph and sentence boundaries first, falling back through a list of separators, and the same package provides token-based and Markdown-aware splitters (TokenTextSplitter, MarkdownTextSplitter). Note that chunk_size counts characters by default, not tokens. The hand-rolled version above is useful to see the mechanics; the library version is what you would ship.

# Library chunker: paragraph-aware splitting with a character budget and overlap.
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Placeholder input; replace with your own document text.
long_document = "\n\n".join(
    f"Paragraph {i}. " + "Some filler text. " * 40 for i in range(20)
)

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,     # measured in characters because length_function=len
    chunk_overlap=64,
    separators=["\n\n", "\n", ". ", " ", ""],  # fall through if needed
    length_function=len,
)
chunks = splitter.split_text(long_document)
print(f"{len(chunks)} chunks, avg {sum(map(len, chunks))/len(chunks):.0f} chars")
Code Fragment 16.7.5: Library chunker: paragraph-aware splitting with a character budget and overlap.

Even with chunking, models tend to attend more to the beginning and end of their context window than to the middle. Code Fragment 16.7.4a implements reordering strategies that place the most relevant passages in high-attention positions.

# Practical strategies for mitigating lost-in-the-middle
from typing import List, Dict


def reorder_context_for_retrieval(
    query: str,
    retrieved_passages: List[Dict],
    strategy: str = "important_first_last"
) -> List[Dict]:
    """Reorder passages to mitigate the lost-in-the-middle effect."""
    if strategy == "important_first_last":
        # Place the most relevant passages at the start and end
        # Less relevant passages go in the middle
        sorted_passages = sorted(
            retrieved_passages,
            key=lambda p: p["relevance_score"],
            reverse=True
        )
        n = len(sorted_passages)
        reordered = [None] * n
        # Alternate between start and end positions
        left, right = 0, n - 1
        for i, passage in enumerate(sorted_passages):
            if i % 2 == 0:
                reordered[left] = passage
                left += 1
            else:
                reordered[right] = passage
                right -= 1
        return reordered
    elif strategy == "reverse_rank":
        # Put least relevant first, most relevant last
        # (recency bias helps with last items)
        return sorted(
            retrieved_passages,
            key=lambda p: p["relevance_score"]
        )
    return retrieved_passages


# Example: 10 passages ranked by relevance
passages = [
    {"text": f"Passage {i}", "relevance_score": 1.0 - i * 0.1}
    for i in range(10)
]
reordered = reorder_context_for_retrieval("query", passages)
positions = [(p["text"], f"score={p['relevance_score']:.1f}") for p in reordered]
for i, (text, score) in enumerate(positions):
    position_label = "START" if i < 2 else "END" if i >= 8 else "middle"
    print(f"  Position {i:2d} [{position_label:6s}]: {text} ({score})")
Output:
  Position  0 [START ]: Passage 0 (score=1.0)
  Position  1 [START ]: Passage 2 (score=0.8)
  Position  2 [middle]: Passage 4 (score=0.6)
  Position  3 [middle]: Passage 6 (score=0.4)
  Position  4 [middle]: Passage 8 (score=0.2)
  Position  5 [middle]: Passage 9 (score=0.1)
  Position  6 [middle]: Passage 7 (score=0.3)
  Position  7 [middle]: Passage 5 (score=0.5)
  Position  8 [END   ]: Passage 3 (score=0.7)
  Position  9 [END   ]: Passage 1 (score=0.9)
Code Fragment 16.7.4a: Practical strategies for mitigating lost-in-the-middle
Key Insight

Structure your prompts with the U-shape in mind. Place the most critical information (key instructions, the most relevant retrieved passages, essential context) at the very beginning and the very end of your prompt. Less critical supporting information can go in the middle. This simple reordering can improve retrieval accuracy by 10% to 20% on long-context tasks without any model changes.

16.7.3 Long-Document Classification: A Strategy Catalog

Sections 16.7.1 and 16.7.2 framed the long-context problem from the generative side: how to extend a model's context window so it can continue writing across a long document. Classification asks a different question: given a document longer than the model's context, how do you produce a single label (or set of labels) for the entire document? No single technique dominates; the right choice depends on document length, label granularity, available labels, and inference budget. Seven distinct strategies appear in production systems, and most mature pipelines combine two or three.

16.7.3.1 The Seven Strategies

  1. Truncate to the first $N$ tokens. The simplest possible strategy: keep the first 512 (or 4,096) tokens and discard the rest. Works surprisingly well when the label is determined by the document's opening, as in news topic classification (the headline and lead carry the topic) or email intent detection (the subject line and first sentence reveal the request). Fails badly when the discriminative content lives in the middle or end (legal contract red flags, scientific paper conclusions).
  2. Sliding window plus aggregate. Slide a fixed-size window across the document, run the classifier independently on each window, then combine the per-window predictions. The aggregator is either a majority vote over the predicted labels, an average of the softmax logits, or a max-pool across windows (useful when the task is "does this document contain X?"). Window overlap (typically 10 to 20 percent) reduces boundary-effect errors. Cheap, requires no architectural change, and surprisingly strong on most production tasks.
  3. Hierarchical transformer. Two transformers stacked vertically. The lower transformer encodes each window into a single embedding (mean-pooled or [CLS]); the upper transformer (or a small RNN, or a simple mean pool) takes the sequence of window embeddings as input and emits the document-level label. The hierarchical structure preserves cross-window context that pure aggregation loses. Standard pattern in the Hierarchical Attention Networks line of work and the basis of most long-document classification baselines in academic benchmarks.
  4. Long-sequence transformers. Architectures specifically designed for longer attention spans without quadratic cost. Longformer combines a sliding local window with a small set of global attention tokens to reach 4K tokens; BigBird adds random attention to the local + global pattern to provably approximate full attention with linear cost, scaling to roughly 16K. Both ship as drop-in replacements for BERT and require fine-tuning on long sequences to exploit the extended window. The right tool when the labeled long-document dataset is large enough to fine-tune a dedicated long-sequence encoder.
  5. Summarize-first, then classify. Run a summarization model (or an LLM with a "summarize" prompt) over the long document to produce a 256- or 512-token summary, then classify the summary with a normal short-context model. Two-pass and slow, but extremely effective when the labels are abstract and require global understanding (sentiment of a multi-page review, regulatory classification of a contract).
  6. Zero-shot generative classification. Feed the long document directly to an instruction-tuned LLM with a prompt like "Classify the following document into one of {labels}." Works for any length the LLM accepts and requires zero labeled training data, but is by far the most expensive per call and inherits the lost-in-the-middle failure mode when the decisive evidence sits in the middle of its context window.
  7. Retrieval-augmented classification. Chunk and embed the document at index time. At classification time, embed a small set of label-related queries ("does the document discuss termination clauses?"), retrieve the top-$k$ matching chunks, and pass only those chunks to a short-context classifier (or a zero-shot LLM). Combines retrieval's scalability with the classifier's discriminative power; the right choice for very long documents where only a small fraction of the text is relevant to the label.
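As an illustration of strategy 2, the sketch below builds overlapping window spans and combines per-window classifier outputs with the three aggregators mentioned in the list. It is a minimal sketch: the helper names are hypothetical, the logits are invented, and "mean" is interpreted here as averaging softmax probabilities.

```python
import math
from collections import Counter
from typing import List, Tuple


def window_spans(n_tokens: int, window: int = 512, overlap: int = 64) -> List[Tuple[int, int]]:
    """(start, end) token spans that cover the document with the given overlap."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    spans, start = [], 0
    while True:
        end = min(start + window, n_tokens)
        spans.append((start, end))
        if end >= n_tokens:
            return spans
        start += window - overlap


def softmax(logits: List[float]) -> List[float]:
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values: List[float]) -> int:
    return max(range(len(values)), key=values.__getitem__)


def aggregate(window_logits: List[List[float]], method: str = "mean") -> int:
    """Combine per-window logits into one document-level class index."""
    probs = [softmax(row) for row in window_logits]
    n_classes = len(probs[0])
    if method == "vote":   # majority vote over per-window predictions
        return Counter(argmax(p) for p in probs).most_common(1)[0][0]
    if method == "mean":   # average class probability across windows
        return argmax([sum(p[c] for p in probs) / len(probs) for c in range(n_classes)])
    if method == "max":    # strongest single-window evidence per class
        return argmax([max(p[c] for p in probs) for c in range(n_classes)])
    raise ValueError(f"unknown method: {method}")


spans = window_spans(1000, window=512, overlap=64)
# Invented logits for three windows and two classes (class 1 = "contains X").
window_logits = [[2.0, -1.0], [1.5, 0.0], [-3.0, 4.0]]
for method in ("vote", "mean", "max"):
    print(method, aggregate(window_logits, method))
```

In this toy input only the last window shows strong evidence for class 1, so voting and averaging pick class 0 while max-pooling picks class 1, which is why max-pooling suits "does this document contain X?" tasks.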

16.7.3.2 Decision Table

Table 16.7.1c: Long-Document Classification Strategy Selector.
Strategy | Best When | Avoid When | Inference Cost
Truncate | Label is determined by document opening (news topic, email intent) | Discriminative content is in the middle or end | Lowest (1 short-context call)
Sliding window + aggregate | Label is local; majority vote / max-pool over windows is meaningful | Label requires global cross-section reasoning | Linear in document length
Hierarchical transformer | Large labeled long-document dataset; cross-window context matters | No long-document labels available | Linear; two stacked transformers
Longformer / BigBird | Documents under ~16K tokens; can afford to fine-tune a long-seq encoder | Documents exceed the long-seq attention window | Linear-in-length (vs. quadratic baseline)
Summarize-first | Labels are abstract; summarization model is strong in the domain | Subtle local cues drive the label and would be lost in summary | 2 model passes (summary + classify)
Zero-shot generative | No labeled data; small label set; budget allows LLM calls | High-volume production with tight latency / cost budgets | Highest (long-context LLM per call)
RAG classification | Very long documents where only a small fraction matters per label | Label depends on global properties of the full document | Embedding lookup + short classifier call
Tip: Start with Strategies 1 and 2

Many teams burn weeks fine-tuning Longformer or wiring up RAG-classification only to discover that plain truncation reaches 95 percent of the eventual accuracy. The production playbook is: (1) measure a truncate-to-512 baseline, (2) measure a sliding-window-512 + majority-vote baseline, and only then (3) invest in hierarchical or long-sequence architectures. The strategy you ship is almost always the cheapest one whose accuracy is "close enough" to the most expensive one.
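The "cheapest one that is close enough" rule is easy to encode once the baselines are measured. The snippet below is only a sketch: `cheapest_close_enough` is a hypothetical helper, and the accuracies and relative costs are invented placeholders, not measurements.

```python
from typing import List, Tuple


def cheapest_close_enough(
    results: List[Tuple[str, float, float]],  # (strategy name, accuracy, relative cost)
    tolerance: float = 0.01,
) -> str:
    """Return the cheapest strategy whose accuracy is within `tolerance` of the best."""
    best_accuracy = max(accuracy for _, accuracy, _ in results)
    eligible = [r for r in results if best_accuracy - r[1] <= tolerance]
    return min(eligible, key=lambda r: r[2])[0]


# Placeholder numbers for illustration only.
measured = [
    ("truncate-512", 0.91, 1.0),
    ("sliding-window-512 + vote", 0.935, 6.0),
    ("long-sequence encoder", 0.94, 20.0),
]
choice = cheapest_close_enough(measured, tolerance=0.01)
print(choice)
```

The tolerance is a product decision: widening it to 0.05 in this made-up example would select plain truncation instead.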

Note
Long-Context Generation vs. Long-Document Classification

The seven strategies above are classification patterns. Long-context generation (summarizing or answering questions over a long document) has overlapping but distinct trade-offs: RoPE scaling (Section 16.7.2) and chunked retrieval (the RAG strategy above, applied at generation time) dominate. Perceiver-AR and similar cross-attention architectures extend the generative side further but are still rare in production. Pick from the classification table when the output is a discrete label; pick from Section 16.7.2 when the output is free-form text.

Key Takeaways
Key Insight

Context length and context utilization are different problems (the deep mechanics of RoPE/YaRN context-window scaling live in Section 9.3). Extending a model's context window (via RoPE scaling, YaRN, or continued pretraining with longer sequences) only solves the length problem. Research on the "lost in the middle" phenomenon shows that models often ignore information placed in the middle of long contexts, even when they can technically process the full sequence. For practical applications, this means that extending context length alone does not guarantee your RAG system will use all retrieved documents effectively. Chunking strategies and retrieval ordering remain important even with long-context models.

Self-Check
Q1: Why does a model trained with a 4K context window produce poor results on an 8K sequence, even though it can technically process the tokens?
The model's positional encodings (typically RoPE) encode position information using mathematical functions that the model learned to interpret during pretraining. Positions beyond 4,096 produce encoding values that the model has never seen during training, making them "out of distribution." The attention mechanism relies on these position encodings to compute relative distances between tokens, so out-of-range positions produce noisy, meaningless attention patterns that degrade output quality.
Q2: What is the key difference between linear RoPE scaling and Dynamic NTK scaling?
Linear scaling uniformly compresses all position indices by the scaling factor (e.g., dividing all positions by 4 to fit 16K into a 4K range). This treats all frequency components equally, including the high-frequency components that encode fine-grained local position. Dynamic NTK scaling enlarges the RoPE frequency base instead, which rescales the components non-uniformly: low-frequency components (which encode coarse, long-range position) are stretched the most, while high-frequency components (which encode fine-grained position distinctions) are left largely unchanged. The "dynamic" part is that the adjustment is computed from the actual sequence length at inference time. This preserves local attention patterns better, resulting in higher quality at larger scaling factors.
Q3: Why is FlashAttention essential for long-context training?
Standard self-attention requires $O(n^2)$ memory to materialize the full attention matrix, where n is the sequence length. Doubling the sequence length quadruples memory usage. For a 32K sequence, this would require approximately 16x the memory of an 8K sequence, exceeding the capacity of even 80GB GPUs for 7B+ models. FlashAttention computes attention in blocks using a tiling algorithm that never materializes the full n x n matrix, reducing memory to $O(n)$. This makes long-context training feasible on available hardware.
Q4: What is the "lost-in-the-middle" phenomenon, and how can you mitigate it?
The lost-in-the-middle phenomenon is the observation that language models recall information much better from the beginning and end of the context window than from the middle, producing a U-shaped accuracy curve. Mitigation strategies include: (1) placing the most important information at the start and end of the prompt, (2) reordering retrieved passages so the most relevant appear at boundary positions, (3) using hierarchical summarization that processes chunks independently before combining, and (4) fine-tuning specifically on tasks that require middle-context retrieval.
Q5: When should you use chunking strategies instead of context extension?
Show Answer
Use chunking when: (1) the document exceeds even the extended context window (e.g., a 500-page book exceeds any practical context length), (2) you cannot afford the compute cost of long-context fine-tuning, (3) you need to process many documents and long-context inference is too slow, or (4) your task is naturally decomposable into independent chunks (e.g., searching for specific facts rather than reasoning over the entire document). Context extension is better when the task requires global reasoning across the full document, such as summarizing all themes in a long report or detecting contradictions across sections.
Real-World Scenario
Extending Context Length for Quarterly Earnings Call Analysis

Who: A quantitative research team at an investment firm analyzing earnings call transcripts that typically span 15,000 to 25,000 tokens.

Situation: They used Llama-2 7B (4,096-token context window) to extract sentiment, forward-looking statements, and risk factors from earnings calls. Transcripts had to be chunked into 4 to 6 pieces, processed independently, and merged.

Problem: Chunking caused the model to miss cross-reference patterns (e.g., a CEO's optimistic guidance contradicted by the CFO's cautious revenue projections later in the call). Their analysts estimated that chunking artifacts caused 22% of extracted insights to be incomplete or misleading.

Dilemma: They could switch to a model with a longer native context window (GPT-4 at 128K tokens, but at 15x the cost per call), implement better chunking with overlap (incremental improvement, still misses long-range dependencies), or extend the Llama-2 context window through RoPE scaling and continued pretraining.

Decision: They used YaRN (Yet another RoPE extensioN) to extend Llama-2's context window from 4K to 32K tokens, followed by a short continued pretraining run on 500 earnings call transcripts (no labels needed, just next-token prediction on long documents).

How: They applied YaRN scaling with a scale factor of 8, enabled FlashAttention 2 (required for the memory-intensive long sequences), and ran continued pretraining for 1,000 steps with gradient checkpointing on 4 A100 GPUs. They then fine-tuned with LoRA on 2,000 labeled earnings call analysis examples using LongLoRA's shifted sparse attention during training.

Result: The extended model processed full earnings calls (up to 25K tokens) in a single pass. Analyst-rated extraction quality improved by 28%, with the largest gains in cross-speaker contradiction detection and multi-section trend analysis. Per-call inference cost was $0.04 (self-hosted), compared to $0.45 for GPT-4. The total training cost was $320.

Lesson: RoPE scaling methods (especially YaRN) combined with a short continued pretraining phase on domain-specific long documents are a highly cost-effective way to extend context windows. FlashAttention 2 is effectively a requirement for making long-context training and inference practical.

Research Frontier

Context windows now exceed 1 million tokens in some production models (as in Gemini 1.5), and for open models, positional encoding innovations such as YaRN and NTK-aware interpolation have been a key enabler of context extension. Research on retrieval-augmented generation as a context extension alternative suggests that for many tasks, chunked retrieval over long documents outperforms brute-force context extension at lower cost.

The open frontier is training models that can genuinely reason over extremely long contexts rather than simply retrieving from them, as current evaluations like RULER reveal performance degradation on multi-hop reasoning at long ranges.

Exercises

Exercise 16.7.1: Context length limitations (Conceptual)

Explain why a model trained with a 4K context window performs poorly on 16K-token inputs, even if the architecture technically supports longer sequences.

Answer Sketch

The model's positional encodings (e.g., RoPE) were trained only on positions 0 to 4095. At positions 4096 and above, the encodings extrapolate into regions the model has never seen, producing unpredictable attention patterns. The model may ignore distant tokens, hallucinate, or produce incoherent output. The attention patterns learned during training do not generalize to much longer sequences without additional adaptation.

Exercise 16.7.2: RoPE scaling methods (Conceptual)

Compare three approaches to extending RoPE-based models to longer contexts: linear interpolation (PI), NTK-aware scaling, and YaRN. What tradeoff does each make?

Answer Sketch

Linear interpolation (PI): scales all RoPE frequencies uniformly, extending context but slightly degrading short-context performance. NTK-aware: scales only the low frequencies (which encode global position), preserving high-frequency (local) patterns. YaRN: combines NTK-aware scaling with attention temperature adjustment, achieving the best quality across both short and long contexts. Tradeoff: simpler methods require more continued pretraining; YaRN works well with minimal fine-tuning.
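The attention temperature adjustment in YaRN can be made concrete. The YaRN paper reports that for LLaMA-family models the fit $\sqrt{1/t} = 0.1 \ln(s) + 1$ works well, where $s$ is the context scale factor and the rotary query and key embeddings are multiplied by $\sqrt{1/t}$. The sketch below simply evaluates that fit; the function name is illustrative, and other model families may need a different fit.

```python
import math

def yarn_attention_factor(scale):
    """Multiplier sqrt(1/t) = 0.1 * ln(s) + 1 from the YaRN paper's LLaMA fit.

    Returns 1.0 when no extension is applied (scale <= 1).
    """
    if scale <= 1.0:
        return 1.0
    return 0.1 * math.log(scale) + 1.0

for s in (1.0, 4.0, 8.0):
    print(f"scale {s:>4}: attention factor {yarn_attention_factor(s):.4f}")
```

The factor grows only logarithmically with the scale factor, so even an 8x extension changes the effective attention temperature modestly.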

Exercise 16.7.3: Context extension fine-tuning (Coding)

Write the key configuration parameters for extending a Llama model from 8K to 32K context using RoPE scaling and continued pretraining. Include the scaling factor and recommended training steps.

Answer Sketch

Set rope_scaling={'type': 'yarn', 'factor': 4.0} (32K/8K = 4x). Use a long-document dataset with sequences of 16K to 32K tokens. Train for 1000 to 2000 steps with learning rate 2e-5 and batch size that fills GPU memory. Key: use gradient checkpointing to fit long sequences in memory. Validate on a needle-in-a-haystack test at various context positions to verify the model attends to information at all positions.
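The settings above can be collected into a small configuration sketch. The `rope_scaling` keys mirror the answer sketch; the exact key names accepted depend on the library and version in use, and the `training` dictionary keys are hypothetical placeholders rather than the arguments of any specific trainer.

```python
def rope_scaling_config(original_ctx, target_ctx, method="yarn"):
    """Build a rope_scaling dict whose factor is target length / original length."""
    if target_ctx <= original_ctx:
        raise ValueError("target context must exceed the original context")
    return {"type": method, "factor": target_ctx / original_ctx}

rope_scaling = rope_scaling_config(original_ctx=8192, target_ctx=32768)

# Hypothetical training settings that follow the answer sketch.
training = {
    "max_seq_length": 32768,         # train on sequences up to the target length
    "max_steps": 1500,               # within the suggested 1000 to 2000 steps
    "learning_rate": 2e-5,
    "gradient_checkpointing": True,  # trade compute for memory on long sequences
}

print(rope_scaling)
```

Deriving the factor from the two context lengths, rather than hard-coding it, avoids a mismatch if the target length changes later.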

Exercise 16.7.4: Chunking strategies (Analysis)

When context extension is not feasible, describe three chunking strategies for processing a 50-page legal contract with a 4K-token model. Compare their tradeoffs for a question-answering task.

Answer Sketch

1. Fixed-size chunks (512 tokens, 50% overlap): simple but may split relevant passages across chunks. 2. Semantic chunking (split on section/paragraph boundaries): preserves logical units but produces variable-length chunks. 3. Hierarchical: first summarize each section, then answer questions against summaries (with retrieval of full sections when needed). For QA: semantic chunking with retrieval works best because legal contracts have clear section boundaries and questions typically target specific sections.
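The first strategy is simple enough to sketch directly. The version below works on any list of tokens and uses the 512-token, 50% overlap setting from the answer sketch as defaults; the function name is illustrative.

```python
def chunk_with_overlap(tokens, chunk_size=512, overlap=256):
    """Split `tokens` into fixed-size chunks where consecutive chunks share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    stride = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a chunk has reached the end of the document.
        if start + chunk_size >= len(tokens):
            break
    return chunks

tokens = list(range(1200))  # stand-in for token IDs
chunks = chunk_with_overlap(tokens)
print([(c[0], c[-1]) for c in chunks])
```

The overlap means a passage cut at one chunk boundary appears intact in the neighboring chunk, at the cost of processing roughly twice as many tokens.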

Exercise 16.7.5: Needle-in-a-haystack test (Coding)

Describe the needle-in-a-haystack evaluation for long-context models. Write pseudocode for a test that checks whether a model can retrieve a planted fact at various positions in a long document.

Answer Sketch

Insert a unique fact (e.g., 'The special magic number is 42.') at position P in a long padding document. Ask: 'What is the special magic number?' Vary P from 0% to 100% of the context length. For each P, check if the model's response contains '42'. Plot accuracy vs. position. A good long-context model should achieve near-100% accuracy at all positions. Failures at specific positions indicate that the model's attention does not reach those regions effectively.
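The procedure above can be sketched as a small harness. Here `generate` is a hypothetical callable that maps a prompt string to the model's response; the toy stand-in below only "reads" the first and last 200 characters of the prompt, to mimic a model that loses the middle. A real evaluation would average several trials per depth and context length.

```python
def build_prompt(filler_sentences, needle, depth, question):
    """Insert `needle` at a relative depth (0.0 = start, 1.0 = end) of the filler text."""
    index = round(depth * len(filler_sentences))
    sentences = filler_sentences[:index] + [needle] + filler_sentences[index:]
    return " ".join(sentences) + "\n\nQuestion: " + question + "\nAnswer:"

def needle_in_haystack(generate, filler_sentences, depths, needle, question, expected):
    """Return {depth: True/False} for whether the response contains `expected`."""
    results = {}
    for depth in depths:
        prompt = build_prompt(filler_sentences, needle, depth, question)
        results[depth] = expected in generate(prompt)
    return results

def toy_generate(prompt, window=200):
    """Stand-in model (not a real LLM): sees only the first and last `window` characters."""
    visible = prompt[:window] + prompt[-window:]
    return "42" if "magic number is 42" in visible else "I don't know"

results = needle_in_haystack(
    generate=toy_generate,
    filler_sentences=["The sky was grey that morning."] * 200,
    depths=[0.0, 0.5, 1.0],
    needle="The special magic number is 42.",
    question="What is the special magic number?",
    expected="42",
)
print(results)
```

Replacing `toy_generate` with a call to an actual model and plotting `results` against depth gives the accuracy-by-position curve described above.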

RoPE (Rotary Position Embedding) scaling lets models handle longer sequences by adjusting the frequencies of the position encodings. The math is elegant, but the intuition is simple: you are teaching the model that position 50,000 is just a stretched version of position 5,000.

See Also

For the attention-and-positional-encoding mechanisms (RoPE, ALiBi, YaRN) that determine long-context behavior, see Section 3.5: Positional Encodings. For the long-context retrieval alternative (RAG often beats long-context for cost and recall), see Section 32.1: RAG Foundations. For the inference-time KV cache and paged-attention techniques that make long context affordable to serve, see Section 9.4: KV cache.

What Comes Next

In the next chapter, Chapter 17: Parameter-Efficient Fine-Tuning (PEFT), we explore methods that achieve results comparable to full fine-tuning while updating only a small fraction of model parameters. Context window extension builds on the positional encoding foundations from Section 3.5.

Further Reading

Context Extension Techniques

Peng, B. et al. (2024). YaRN: Efficient Context Window Extension of Large Language Models. ICLR 2024. Introduces YaRN (Yet another RoPE extensioN), which combines NTK-aware interpolation with attention scaling to extend context windows with minimal fine-tuning. YaRN is the recommended RoPE scaling method discussed in this section due to its superior quality at long ranges. Useful for context extension projects.
Chen, Y. et al. (2024). LongLoRA: Efficient Fine-Tuning of Long-Context Large Language Models. ICLR 2024. Combines shifted sparse attention during training with LoRA to efficiently extend context windows at a fraction of full fine-tuning cost. LongLoRA demonstrates that context extension does not require updating all model parameters. Directly relevant to the parameter-efficient context extension approach discussed here.
Su, J. et al. (2024). RoFormer: Enhanced Transformer with Rotary Position Embedding. Neurocomputing. The original RoPE paper that introduced rotary position embeddings, now the standard positional encoding in most open-source LLMs. Understanding RoPE's mathematical properties is essential for grasping why and how the scaling techniques in this section work. Foundational reference for all RoPE-based context extension.

Long-Context Challenges and Infrastructure

Liu, N. F. et al. (2024). Lost in the Middle: How Language Models Use Long Contexts. TACL. Reveals that models tend to focus on information at the beginning and end of long contexts while neglecting the middle. This "lost in the middle" phenomenon is a critical limitation discussed in this section and motivates the chunking strategies presented as alternatives to pure context extension.
Dao, T. (2023). FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. Presents FlashAttention-2, which achieves near-optimal GPU utilization for attention computation and is mandatory for practical long-context training and inference. Without FlashAttention-2, the memory requirements for long sequences make training infeasible on consumer hardware. Essential infrastructure for the techniques in this section.
Press, O. et al. (2022). Train Short, Test Long: Attention with Linear Biases Enables Input Length Generalization. ICLR 2022. Introduces ALiBi (Attention with Linear Biases), an alternative to learned positional embeddings that generalizes naturally to longer sequences without any modification. ALiBi provides useful contrast to the RoPE scaling methods and represents a different philosophical approach to length generalization.
Beltagy, I., Peters, M. E., & Cohan, A. (2020). Longformer: The Long-Document Transformer. Introduces Longformer's combination of local sliding-window attention with task-specific global attention tokens, reaching 4,096-token contexts at linear cost. The reference architecture for long-document classification strategy #4 in Section 16.7.3.
Zaheer, M. et al. (2020). Big Bird: Transformers for Longer Sequences. NeurIPS 2020. BigBird augments Longformer's local + global pattern with random attention and shows theoretically that the resulting sparse attention retains key expressiveness properties of full attention at linear cost. Handles sequences up to about 8x longer than was previously possible on similar hardware and is the second canonical choice when an off-the-shelf long-sequence encoder is needed.
Hawthorne, C. et al. (2022). General-purpose, long-context autoregressive modeling with Perceiver AR. ICML 2022. Perceiver-AR uses cross-attention to a small latent array to compress long inputs before autoregressive generation, sidestepping the quadratic cost of self-attention on the full sequence. A research-frontier option for very long generative contexts noted in Section 16.7.3.2.