Stochastic Sampling Methods

Section 4.2

At temperature 0.0, I am a boring but reliable narrator. At temperature 2.0, I am a jazz musician who has lost the sheet music.

Greedy, Temperature-Unstable AI Agent
Big Picture

Why add randomness? Deterministic decoding (Section 4.1) produces the same output every time, which is great for translation but terrible for creative writing, conversation, and brainstorming. If you have not yet read Section 4.1, start there first; it introduces the greedy and beam search foundations that stochastic methods build on. Human language is inherently varied: ask ten people to complete the same sentence, and you will get ten different answers. Stochastic sampling introduces controlled randomness into the decoding process, producing diverse, interesting, human-like text. The challenge is finding the right balance: too little randomness yields repetitive, robotic text; too much yields incoherent gibberish. This section covers every major technique for controlling that balance. Every parameter introduced here (temperature, top-p, top-k, min-p) is a knob you will turn on every LLM API call, so understanding their joint effect on the sampling distribution is what separates a tuned LLM product from one that randomly hallucinates or returns boilerplate.

Key Insight: Remember

Temperature is sampling boldness: T near 0 always picks the safest token; high T rolls dice on the unlikely. Top-p crops the long tail before sampling, so the model can be bold among reasonable options without ever quoting from the 5% of nonsense at the end. Use temperature to set the energy scale; use top-p to set the ceiling.

Prerequisites

This section builds directly on the deterministic decoding strategies (greedy search, beam search) from Section 4.1. Understanding softmax, probability distributions, and how a model produces logits (from Section 3.1) is essential. The temperature and sampling parameters introduced here are the same ones you will use when calling LLM APIs later in the book.

A DJ at a mixing console with a large temperature dial, where low settings produce orderly musical notes and high settings produce wild, chaotic sound waves
Figure 4.2.1: Temperature controls the entropy of the sampling distribution, like a DJ controls the energy of a mix. Low temperature produces focused, deterministic outputs; high temperature flattens the distribution, allowing rare tokens to surface more often.

4.2.1 Pure Random Sampling

Key Insight
Cross-Field: Sampling Temperature as a Boltzmann Distribution

The softmax over logits at temperature T is the Boltzmann distribution from statistical physics: P(token) = exp(logit/T) / Z, where Z is the partition function. T = 0 is absolute zero: the model always picks the single lowest-energy (highest-logit) token. Higher T populates higher-energy (lower-probability) states. Top-p sampling truncates the partition function at a free-energy threshold. These two parameters are not redundant: temperature sets the energy scale of the full distribution; top-p sets a hard ceiling on how far up the energy ladder the model can sample. Use both deliberately, not as arbitrary dials.

The most direct form of stochastic decoding is ancestral sampling: at each step, sample the next token from the full probability distribution. If the model says "the" has probability 0.15, "a" has 0.10, "quantum" has 0.0001, and so on across the entire 50,000-token vocabulary, you sample according to those exact probabilities.

This produces maximally diverse output, but the quality is often poor. The long tail of the vocabulary contains thousands of tokens that are individually very unlikely but collectively hold significant probability mass. Even if each improbable token has only a 0.001% chance, tens of thousands of such tokens add up, so sampling from the full distribution will occasionally draw a rare, contextually inappropriate word and derail the generation.

Tip: The Long Tail Problem in One Number

In a typical 50,000-token vocabulary, the top 500 tokens might hold 95% of the probability mass for any given position. That means 49,500 tokens share the remaining 5%. Pure sampling treats that 5% as fair game, which is why you occasionally get bizarre outputs like "The president announced a new policy of flamingos." Every truncation method in this section (top-k, top-p, min-p) is a different strategy for taming this tail.
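The arithmetic behind the long tail can be checked with a small simulation. The sketch below assumes a Zipf-like distribution (weight proportional to 1 / rank^1.5) over 50,000 tokens; that shape and exponent are illustrative assumptions, not measurements from any real model, so the exact percentages will differ from the 95% / 5% split quoted above.

```python
import random

VOCAB_SIZE = 50_000   # vocabulary size used in the text
HEAD_SIZE = 500       # the "top 500 tokens" from the tip above

# Hypothetical Zipf-like distribution: weight of rank r is 1 / r**1.5.
weights = [1.0 / (rank ** 1.5) for rank in range(1, VOCAB_SIZE + 1)]
total = sum(weights)
probs = [w / total for w in weights]  # sorted, most probable first

tail_mass = sum(probs[HEAD_SIZE:])     # mass shared by the other 49,500 tokens
largest_tail_prob = probs[HEAD_SIZE]   # the most likely token inside the tail

# Ancestral sampling: draw from the full distribution with no truncation.
rng = random.Random(0)
draws = rng.choices(range(VOCAB_SIZE), weights=probs, k=10_000)
tail_hits = sum(1 for token_id in draws if token_id >= HEAD_SIZE)

print(f"mass outside the top {HEAD_SIZE}: {tail_mass:.3f}")
print(f"largest single tail probability: {largest_tail_prob:.6f}")
print(f"tail tokens drawn in 10,000 samples: {tail_hits}")
```

No single tail token is likely, yet the tail as a whole is hit regularly; that gap between individual and collective probability is what the truncation methods target.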

Real-World Scenario
Tuning Sampling Parameters for a Creative Writing Assistant

Who: A product team at an edtech company building an AI creative writing assistant for middle school students.

Situation: The assistant needed to generate story continuations that were creative and surprising, while remaining coherent and age-appropriate.

Problem: With default parameters (temperature 1.0, no truncation), the model produced outputs that frequently veered into nonsensical territory or included vocabulary too advanced for the target audience.

Dilemma: Lowering temperature made outputs safe but predictable and boring for students. Raising it produced exciting but often incoherent text. Top-k filtering helped, but finding the right k value was tricky because different story contexts needed different amounts of creativity.

Decision: The team adopted nucleus sampling (top-p = 0.92) combined with a moderate temperature of 0.85, plus a min-p filter of 0.05 to prune junk tokens.

How: They ran A/B tests with 200 students over two weeks, measuring engagement (time spent reading continuations), coherence ratings (teacher evaluation), and student satisfaction surveys across five parameter configurations.

Result: The nucleus sampling configuration increased student engagement by 28% compared to greedy decoding and reduced incoherent outputs from 15% to under 3%. Teacher coherence ratings improved from 3.2/5 to 4.4/5.

Lesson: Nucleus sampling adapts naturally to context: it allows more diversity when the model is uncertain and constrains output when the model is confident, making it more robust than fixed top-k across varied prompts.
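For concreteness, the team's configuration (temperature 0.85, top-p 0.92, min-p 0.05) can be sketched in plain Python. The scenario does not say in which order the filters were applied or which library was used, so the order below (temperature, then the min-p floor, then the top-p cutoff) and the helper names `softmax` and `filter_tokens` are assumptions for illustration only.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def filter_tokens(probs, top_p=0.92, min_p=0.05):
    """Apply a min-p floor, then a top-p cutoff; return {index: renormalized prob}."""
    floor = min_p * max(probs)
    candidates = sorted(
        ((p, i) for i, p in enumerate(probs) if p >= floor), reverse=True
    )
    kept, cumulative = [], 0.0
    for p, i in candidates:
        kept.append((p, i))
        cumulative += p
        if cumulative >= top_p:  # the token that crosses top_p is kept
            break
    mass = sum(p for p, _ in kept)
    return {i: p / mass for p, i in kept}

# Hypothetical next-token logits (the eight-token toy vocabulary used later in this section).
logits = [5.0, 3.5, 2.0, 1.0, 0.5, 0.1, -1.0, -2.0]
probs = softmax(logits, temperature=0.85)
filtered = filter_tokens(probs, top_p=0.92, min_p=0.05)

rng = random.Random(0)
token = rng.choices(list(filtered), weights=list(filtered.values()), k=1)[0]
print(f"kept token indices: {sorted(filtered)}; sampled index: {token}")
```

Production libraries may apply these filters in a different order or renormalize between steps, so treat this as a reading aid rather than a reference implementation.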

Nucleus sampling controls which tokens are eligible for selection, but it does not change the relative probabilities among them. To control how sharply the model distributes probability across candidates, we need a different knob: temperature.

4.2.2 Temperature Scaling

Key Insight

Temperature in language models is not merely borrowed vocabulary from physics; it is the exact same mathematical object. In statistical mechanics, the Boltzmann distribution gives the probability of a system being in state $i$ as $P(i) \propto e^{-E_i / kT}$, where $E_i$ is the energy, $k$ is Boltzmann's constant, and $T$ is temperature. The softmax function with temperature, $P(i) \propto e^{z_i / T}$, is identical in structure, with logits $z_i$ playing the role of negative energies. This is not a coincidence: both are maximum-entropy distributions subject to a constraint on an expected value (the expected energy in physics, the expected logit in language models). At high temperature, the system explores many states nearly uniformly (high entropy); at low temperature, it collapses toward the lowest-energy (highest-probability) state. The same picture explains why temperature 1.0 is the "natural" setting: it recovers the distribution the model was trained to produce, much like measuring energy in units where $kT = 1$. Any other temperature distorts the distribution away from what the model learned.
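Written side by side, the correspondence is a substitution. Setting $E_i = -z_i$ and absorbing Boltzmann's constant into the temperature ($k = 1$, a choice of units assumed here for illustration) turns the physical expression into the temperature-scaled softmax, with the normalizer $Z$ playing the role of the partition function:

$$P(i) = \frac{e^{-E_i / kT}}{\sum_j e^{-E_j / kT}} \quad \longrightarrow \quad P(i) = \frac{e^{z_i / T}}{\sum_j e^{z_j / T}}, \qquad Z = \sum_j e^{z_j / T}$$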

Fun Fact

The term "temperature" comes from statistical mechanics, where it controls the randomness of particle states in a physical system. Setting temperature to zero makes a language model maximally deterministic, just as cooling a physical system to absolute zero forces all particles into their lowest energy state. Physicists developed the mathematical framework more than a century before anyone thought of applying it to language generation.

Warning
Common Misconception: Temperature Does Not Control "Creativity"

It is tempting to say "higher temperature = more creative output," but this is misleading. Temperature reshapes the probability distribution over the vocabulary, making unlikely tokens more likely to be sampled. That is not creativity; it is increased randomness. True creativity involves novel combinations of ideas that are coherent and purposeful. A high-temperature model does not "think more creatively"; it simply rolls a less biased die across tokens, which sometimes produces surprising text and sometimes produces incoherent nonsense. The correct framing: temperature controls the entropy of the sampling distribution. High entropy means more uniform sampling (diverse but noisy); low entropy means peaked sampling (focused but repetitive). When people say "set temperature to 0.9 for creative writing," what they really mean is "allow more sampling diversity so the output is less predictable," which is a useful heuristic but not the same as creativity.

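Temperature is the most fundamental control knob for stochastic sampling. You will encounter it again as a practical API parameter in Chapter 11 and as a training hyperparameter for knowledge distillation in Chapter 17. Before applying softmax, we divide the logits by a temperature parameter T:

$$P(x_{i}) = \frac{\exp(z_{i} / T)}{\sum_{j} \exp(z_{j} / T)}$$

The effect is intuitive: T < 1 sharpens the distribution and concentrates probability on the top tokens; T = 1 leaves the model's distribution unchanged; T > 1 flattens the distribution and gives low-probability tokens a larger share.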

Note: Why This Surprises First-Time Readers

The formula divides logits by T, so the literal value T=0 would divide by zero. Inference stacks therefore treat it as a special case instead of evaluating the softmax: hosted APIs typically fall back to argmax (greedy decoding), while libraries such as Hugging Face transformers reject a zero temperature when sampling is enabled and expect greedy decoding to be requested explicitly (do_sample=False). Even then, "temperature=0" outputs can differ across servers: floating-point addition is not associative, so GPU reductions that run in a different order can produce slightly different logits, and argmax may return a different token when the top two logits are nearly tied. If you need bit-identical output, setting temperature to 0 is not enough; you also need to pin the hardware, the inference framework, and its reduction order, and fix the random seed whenever sampling is enabled.
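A minimal sketch of that special case, in dependency-free Python. The function name `sample_token` and its signature are hypothetical and do not mirror any provider's actual implementation; the point is only that the T=0 branch never touches the softmax.

```python
import math
import random

def sample_token(logits, temperature, rng=random):
    """Return a token index; temperature == 0 falls back to argmax."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Greedy fallback: no softmax is evaluated, so nothing is divided by zero.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

# Argmax is stable when one logit clearly dominates...
print(sample_token([5.0, 3.5, 2.0], temperature=0))

# ...but a near-tie flips with a perturbation on the order of rounding noise.
print(sample_token([1.0, 1.0 + 1e-7], temperature=0))
print(sample_token([1.0 + 2e-7, 1.0 + 1e-7], temperature=0))
```

The last two calls differ only by a shift of about 1e-7 in one logit, yet they select different tokens, which is the mechanism behind cross-server differences at temperature 0.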

Same logits, three temperatures: at T=0.3 the distribution is sharply peaked on the top token; at T=1.0 it follows the model's raw distribution; at T=2.0 it flattens, giving low-probability tokens a meaningful chance
Figure 4.2.2: Temperature controls the "peakiness" of the distribution. Lower temperatures concentrate probability on top tokens; higher temperatures spread it more evenly.
# Temperature scaling: divide logits by T before softmax.
# Low T sharpens the distribution; high T flattens it toward uniform.
import torch
import torch.nn.functional as F
# Simulating temperature effect on a small vocabulary
logits = torch.tensor([5.0, 3.5, 2.0, 1.0, 0.5, 0.1, -1.0, -2.0])
tokens = ["the", "cat", "dog", "it", "my", "old", "an", "..."]
for temp in [0.3, 0.7, 1.0, 1.5, 2.0]:
    probs = F.softmax(logits / temp, dim=-1)
    top_prob = probs[0].item()
    entropy = -(probs * probs.log()).sum().item()
    print(f"T={temp:.1f} | P('the')={top_prob:.3f} | entropy={entropy:.3f} | dist={[f'{p:.3f}' for p in probs.tolist()]}")
Output:
T=0.3 | P('the')=0.993 | entropy=0.041 | dist=['0.993', '0.007', '0.000', '0.000', '0.000', '0.000', '0.000', '0.000']
T=0.7 | P('the')=0.879 | entropy=0.435 | dist=['0.879', '0.103', '0.012', '0.003', '0.001', '0.001', '0.000', '0.000']
T=1.0 | P('the')=0.762 | entropy=0.779 | dist=['0.762', '0.170', '0.038', '0.014', '0.008', '0.006', '0.002', '0.001']
T=1.5 | P('the')=0.592 | entropy=1.243 | dist=['0.592', '0.218', '0.080', '0.041', '0.029', '0.023', '0.011', '0.006']
T=2.0 | P('the')=0.476 | entropy=1.534 | dist=['0.476', '0.225', '0.106', '0.064', '0.050', '0.041', '0.024', '0.014']
Code Fragment 4.2.1a: Temperature scaling: divide logits by T before softmax.
import torch.nn.functional as F
import torch
# Nucleus (top-p) sampling: sort tokens by probability, accumulate until
# the cumulative mass reaches p, then zero out all remaining tokens.
def top_p_sampling(logits, p=0.9, temperature=1.0):
    """Apply nucleus (top-p) filtering then sample."""
    scaled_logits = logits / temperature
    probs = F.softmax(scaled_logits, dim=-1)
    # Sort probabilities in descending order
    sorted_probs, sorted_indices = torch.sort(probs, descending=True)
    cumulative_probs = torch.cumsum(sorted_probs, dim=-1)
    # Mask tokens whose preceding cumulative mass already exceeds p;
    # the token that crosses the threshold stays inside the nucleus
    sorted_mask = cumulative_probs - sorted_probs > p
    sorted_probs[sorted_mask] = 0.0
    # Renormalize
    sorted_probs /= sorted_probs.sum()
    # Sample from filtered distribution
    sampled_index = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_indices[sampled_index]
# Demonstrate adaptive behavior
confident_logits = torch.tensor([8.0, 4.0, 1.0, 0.5, 0.1, -1.0, -2.0, -3.0])
uncertain_logits = torch.tensor([2.0, 1.8, 1.6, 1.4, 1.2, 1.0, 0.8, 0.5])
for name, logits in [("Confident", confident_logits), ("Uncertain", uncertain_logits)]:
    probs = F.softmax(logits, dim=-1)
    sorted_probs, _ = torch.sort(probs, descending=True)
    cumsum = torch.cumsum(sorted_probs, dim=-1)
    nucleus_size = (cumsum < 0.9).sum().item() + 1
    print(f"{name}: nucleus size = {nucleus_size} tokens for p=0.9")
    print(f" Probs: {[f'{p:.3f}' for p in sorted_probs.tolist()]}")
    print(f" Cumsum: {[f'{c:.3f}' for c in cumsum.tolist()]}\n")
Output:
Confident: nucleus size = 1 tokens for p=0.9
 Probs: ['0.980', '0.018', '0.001', '0.001', '0.000', '0.000', '0.000', '0.000']
 Cumsum: ['0.980', '0.998', '0.999', '0.999', '1.000', '1.000', '1.000', '1.000']

Uncertain: nucleus size = 7 tokens for p=0.9
 Probs: ['0.228', '0.187', '0.153', '0.125', '0.103', '0.084', '0.069', '0.051']
 Cumsum: ['0.228', '0.415', '0.568', '0.694', '0.796', '0.880', '0.949', '1.000']
Code Fragment 4.2.2a: Nucleus (top-p) sampling: keep the smallest set of top tokens whose cumulative mass reaches p, then renormalize and sample.
Note: Practical Guidance

Common temperature ranges: 0.1 to 0.4 for factual Q&A and code generation (favoring accuracy); 0.6 to 0.8 for general conversation; 0.9 to 1.2 for creative writing and brainstorming. Temperatures above 1.5 are rarely useful in production. Most API providers (OpenAI, Anthropic, Google) expose temperature as a parameter, and it is typically the first knob users should tune. For practical guidance on configuring these parameters through APIs, see Chapter 12: Prompt Engineering.

Key Insight: The Boltzmann Distribution and Free Energy

The temperature-scaled softmax is not merely named after statistical mechanics; it is mathematically identical to the Boltzmann distribution that describes the probability of physical states in a thermal system. In physics, P(state) is proportional to exp(negative energy / kT), where T is temperature and k is Boltzmann's constant. In language models, the logits play the role of negative energy: higher-logit tokens are "lower energy" (more favored) states. This connection runs deeper than notation. The Boltzmann distribution maximizes entropy subject to a constraint on expected energy, meaning it is the least biased distribution consistent with the model's preferences. Temperature thus controls the tradeoff between exploitation (low T, choosing high-confidence tokens) and exploration (high T, sampling diversely), the exact same exploration-exploitation tradeoff that governs simulated annealing in optimization, thermodynamic processes in chemistry, and reinforcement learning in Section 18.1.

4.2.3 Top-k Sampling

Top-k sampling (Fan et al., 2018) restricts sampling to the k most probable tokens at each step. All other tokens have their probability set to zero, and the remaining probabilities are renormalized to sum to 1.

$$P'(x_{i}) = \begin{cases} \frac{P(x_{i})}{\sum_{j \in \text{top-}k} P(x_{j})} & \text{if } x_{i} \in \text{top-}k \\ 0 & \text{otherwise} \end{cases}$$

This eliminates the long tail problem: no matter how flat the distribution is, only k tokens are ever considered. However, top-k has a significant limitation: the optimal value of k varies depending on the context. When the model is very confident (e.g., after "The capital of France is"), even k=10 might include irrelevant tokens. When the model is uncertain (e.g., after "I enjoy"), k=10 might be too restrictive, cutting off perfectly valid continuations.

import torch.nn.functional as F
import torch
# Top-k sampling: keep only the k highest-scoring tokens,
# set the rest to -inf, then sample from the truncated distribution.
def top_k_sampling(logits, k=50, temperature=1.0):
    """Apply top-k filtering then sample from the result."""
    # Apply temperature
    scaled_logits = logits / temperature
    # Find the k-th largest value as threshold
    top_k_values, _ = torch.topk(scaled_logits, k)
    threshold = top_k_values[..., -1, None]
    # Mask everything below the threshold with -inf (zero probability after softmax)
    filtered = scaled_logits.masked_fill(scaled_logits < threshold, float('-inf'))
    # Convert to probabilities and sample
    probs = F.softmax(filtered, dim=-1)
    return torch.multinomial(probs, num_samples=1)
# Example: sampling with different k values
logits = torch.tensor([5.0, 3.5, 2.0, 1.0, 0.5, 0.1, -1.0, -2.0])
tokens = ["the", "cat", "dog", "it", "my", "old", "an", "..."]
for k in [2, 4, 6]:
    filtered = logits.clone()
    threshold = torch.topk(filtered, k).values[-1]
    filtered[filtered < threshold] = float('-inf')
    probs = F.softmax(filtered, dim=-1)
    active = [f"{tokens[i]}({probs[i]:.3f})" for i in range(len(tokens)) if probs[i] > 0]
    print(f"k={k}: {', '.join(active)}")
Output:
k=2: the(0.818), cat(0.182)
k=4: the(0.774), cat(0.173), dog(0.039), it(0.014)
k=6: the(0.763), cat(0.170), dog(0.038), it(0.014), my(0.008), old(0.006)
Code Fragment 4.2.3: Top-k sampling: keep the k highest-scoring tokens, mask the rest, and sample from the renormalized distribution.
Warning
Common Misconception: "Larger k Always Means More Diversity"

Beginners often pick top-k=200 or top-k=500 thinking it will produce more creative text. In practice, the top tokens already absorb most of the probability mass, so once k is past ~50, increasing it further changes almost nothing; you only reach junk tokens whose probability is so low they would rarely be sampled anyway. Diversity is governed jointly by temperature (which redistributes mass) and k (which caps the candidate set). Cranking k alone with temperature=0.7 yields outputs barely distinguishable from k=50.
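The diminishing return from a larger k is easy to check numerically. The sketch below assumes hypothetical logits that decay with the log of the token's rank (a Zipf-like shape chosen for illustration, not taken from a real model) and measures how much probability mass the top k tokens cover at temperature 0.7.

```python
import math

VOCAB_SIZE = 50_000
TEMPERATURE = 0.7

# Hypothetical logits that decay with the log of the rank (Zipf-like).
logits = [-1.5 * math.log(rank) for rank in range(1, VOCAB_SIZE + 1)]

# Temperature-scaled softmax, with the max subtracted for stability.
scaled = [z / TEMPERATURE for z in logits]
peak = max(scaled)
exps = [math.exp(s - peak) for s in scaled]
total = sum(exps)
probs = [e / total for e in exps]  # already sorted, most probable first

def top_k_mass(k):
    """Probability mass covered by the k most probable tokens."""
    return sum(probs[:k])

for k in (10, 50, 200, 500):
    print(f"k={k:>3}: top-k covers {top_k_mass(k):.4f} of the probability mass")
```

Under these assumed logits, the top 50 tokens already cover more than 95% of the mass, and moving to k=500 adds less than two percentage points, so the extra candidates are almost never drawn.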

4.2.4 Nucleus (Top-p) Sampling

Nucleus sampling (Holtzman et al., 2020) addresses top-k's fixed-size problem with an elegant idea: instead of keeping a fixed number of tokens, keep the smallest set of tokens whose cumulative probability exceeds a threshold p. This adapts automatically to the shape of the distribution.

$$V_{p} = \text{smallest set such that } \sum_{x \in V_{p}} P(x) \geq p$$

When the model is confident, the nucleus might contain only 2 or 3 tokens. When the model is uncertain, it might contain 100 or more. This adaptivity is a large part of why top-p is one of the most widely used sampling methods in production systems.

Top-p sampling comparison: confident distribution with 2-token nucleus versus uncertain distribution with 5-token nucleus
Figure 4.2.3a: Top-p sampling adapts the number of candidate tokens to model confidence. When confident, few tokens suffice; when uncertain, the nucleus expands.
Library Shortcut

In practice, you rarely need to implement sampling by hand. The transformers library handles temperature, top-k, top-p, and repetition penalty in a single generate() call:

from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
inputs = tokenizer("The meaning of life is", return_tensors="pt")
output = model.generate(
    **inputs, max_new_tokens=50,
    do_sample=True, temperature=0.8, top_p=0.92, top_k=50,
    repetition_penalty=1.2
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
Code Fragment 4.2.4: Library shortcut for stochastic decoding. A single transformers.generate() call applies temperature, top-k, top-p, and repetition_penalty in one pass; do_sample=True is the flag that switches from greedy to the sampling-based decoders developed earlier in the section.

pip install transformers

# Production equivalent using model.generate()
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
inputs = tokenizer("The future of AI", return_tensors="pt")
output = model.generate(
    **inputs, max_new_tokens=50,
    temperature=0.8, top_k=50, top_p=0.95,
    repetition_penalty=1.2, do_sample=True,
)
Code Fragment 4.2.5: A near-identical sampling configuration written as a standalone production snippet (top_p is 0.95 here rather than 0.92). top_k bounds the candidate pool, top_p adapts it by cumulative mass, repetition_penalty discourages loops, and temperature controls the overall sharpness; tune one knob at a time when debugging output quality.
Warning
Common Misconception: Temperature and Top-p Are Not Redundant

Temperature reshapes the entire probability distribution (sharper or flatter). Top-p then truncates the reshaped distribution by removing the tail. Setting temperature=0.1 with top-p=0.9 is almost identical to temperature=0.1 alone, because the distribution is already so peaked that the nucleus contains only 1 to 2 tokens. To see top-p's effect, you need moderate temperature (0.7 to 1.0).
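The interaction can be made concrete by counting how many tokens fall inside the nucleus at different temperatures. This sketch reuses the eight-token toy logits from Code Fragment 4.2.1a; the helper names `softmax` and `nucleus_size` are illustrative, and the nucleus is defined as in the formula above (cumulative mass at least p).

```python
import math

def softmax(logits, temperature):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nucleus_size(probs, p):
    """Smallest number of top tokens whose cumulative mass reaches p."""
    cumulative = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += prob
        if cumulative >= p:
            return count
    return len(probs)  # guard against rounding when p is very close to 1

logits = [5.0, 3.5, 2.0, 1.0, 0.5, 0.1, -1.0, -2.0]
for temperature in (0.1, 0.7, 1.0, 2.0):
    probs = softmax(logits, temperature)
    size = nucleus_size(probs, p=0.9)
    print(f"T={temperature}: top prob={max(probs):.3f}, nucleus size for p=0.9 is {size}")
```

At T=0.1 the top token alone exceeds the 0.9 threshold, so top-p has nothing left to remove; the nucleus only starts to matter once temperature spreads the mass across several tokens.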

4.2.5 Min-p Sampling

Min-p sampling is a newer technique that takes a different approach to adaptive filtering. Instead of specifying a cumulative probability threshold, min-p sets a minimum relative probability: any token whose probability is less than min_p × max_probability is discarded.

$$\text{Keep token } x_{i} \text{ if } P(x_{i}) \geq \mathrm{min\_p} \times \max_{j} P(x_{j})$$

This is conceptually simple and has appealing properties. When the model is very confident (top token at 0.95), even a min_p of 0.1 only keeps tokens above 0.095, resulting in a tiny nucleus. When the model is uncertain (top token at 0.05), the threshold drops to 0.005, allowing many tokens through. The behavior adapts naturally without the cumulative probability bookkeeping of top-p.

import torch.nn.functional as F
import torch
# Min-p sampling: discard any token whose probability falls below
# min_p times the top token's probability, adapting the cutoff dynamically.
def min_p_sampling(logits, min_p=0.1, temperature=1.0):
    """Apply min-p filtering then sample."""
    scaled_logits = logits / temperature
    probs = F.softmax(scaled_logits, dim=-1)
    # Threshold: min_p * max probability
    max_prob = probs.max()
    threshold = min_p * max_prob
    # Zero out tokens below threshold
    filtered_probs = probs.clone()
    filtered_probs[probs < threshold] = 0.0
    # Renormalize and sample
    filtered_probs /= filtered_probs.sum()
    return torch.multinomial(filtered_probs, num_samples=1)
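# Compare min-p behavior on the confident and uncertain logits from the top-p example
confident_logits = torch.tensor([8.0, 4.0, 1.0, 0.5, 0.1, -1.0, -2.0, -3.0])
uncertain_logits = torch.tensor([2.0, 1.8, 1.6, 1.4, 1.2, 1.0, 0.8, 0.5])
for name, logits in [("Confident", confident_logits), ("Uncertain", uncertain_logits)]:
    probs = F.softmax(logits, dim=-1)
    max_p = probs.max().item()
    threshold = 0.1 * max_p
    kept = (probs >= threshold).sum().item()
    print(f"{name}: max_p={max_p:.3f}, threshold={threshold:.4f}, kept={kept} tokens")
Output:
Confident: max_p=0.980, threshold=0.0980, kept=1 tokens
Uncertain: max_p=0.228, threshold=0.0228, kept=8 tokens
Code Fragment 4.2.6: Min-p sampling: discard tokens whose probability falls below min_p times the top token's probability, then renormalize and sample.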
Fun Fact

Top-k, top-p, and min-p are three approaches to the same anxiety: "what if the model picks something stupid from the long tail?" Top-k says "only consider the 50 most likely tokens" (rigid). Top-p says "consider just enough tokens to cover 90% of the probability mass" (adaptive). Min-p says "consider any token that is at least 10% as likely as the most likely one" (adaptive in a different direction). They all work, they all have evangelists, and choosing between them is a sampling debate that has consumed approximately one billion Reddit comments since 2022.
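To compare the three filters on equal footing, the sketch below counts how many tokens each one keeps on the confident and uncertain toy logits used in the top-p example. The settings (k=3, p=0.9, min_p=0.1) are illustrative choices for an eight-token vocabulary, not recommended defaults, and the helper names are hypothetical.

```python
import math

def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kept_by_top_k(probs, k):
    """Top-k keeps a fixed number of tokens regardless of shape."""
    return min(k, len(probs))

def kept_by_top_p(probs, p):
    """Top-p keeps the smallest prefix whose cumulative mass reaches p."""
    cumulative = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += prob
        if cumulative >= p:
            return count
    return len(probs)

def kept_by_min_p(probs, min_p):
    """Min-p keeps every token at least min_p times as likely as the top one."""
    floor = min_p * max(probs)
    return sum(1 for prob in probs if prob >= floor)

distributions = {
    "confident": [8.0, 4.0, 1.0, 0.5, 0.1, -1.0, -2.0, -3.0],
    "uncertain": [2.0, 1.8, 1.6, 1.4, 1.2, 1.0, 0.8, 0.5],
}
for name, logits in distributions.items():
    probs = softmax(logits)
    print(
        f"{name}: top-k(3) keeps {kept_by_top_k(probs, 3)}, "
        f"top-p(0.9) keeps {kept_by_top_p(probs, 0.9)}, "
        f"min-p(0.1) keeps {kept_by_min_p(probs, 0.1)}"
    )
```

Top-k returns the same count for both distributions, while top-p and min-p shrink to a single token when the model is confident and expand when it is not.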

4.2.6 Typical Sampling

Typical sampling (Meister et al., 2023) takes an information-theoretic approach. The idea is that humans tend to produce words that are neither too predictable nor too surprising. Formally, typical sampling keeps tokens whose information content (negative log-probability) is close to the entropy of the distribution (the expected information content).

A token with probability 0.9 carries very little surprise (low information). A token with probability 0.0001 carries enormous surprise. Typical sampling favors the middle ground: tokens that are about as surprising as you would expect on average. This tends to produce text that feels natural and avoids both boring and incoherent extremes.

Key Insight

Typical sampling reframes the generation question: instead of asking "which tokens are most probable?" it asks "which tokens are most typical given the model's uncertainty?" This is a subtle but important distinction. In a high-entropy context, typical tokens might have relatively low individual probability, while in a low-entropy context, only the top 1 or 2 tokens are typical.
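The filtering step can be sketched in a few lines of plain Python. This is a simplified reading of the idea described above, not the authors' reference implementation: tokens are ranked by how far their surprisal (-log p) lies from the entropy, and the closest ones are kept until their combined probability reaches a target. The parameter name `mass` and the value 0.5 are assumptions chosen to make the effect visible on an eight-token toy distribution.

```python
import math

def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def typical_filter(probs, mass=0.9):
    """Keep tokens whose surprisal is closest to the entropy, up to `mass`."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    # Rank token indices by |surprisal - entropy|, smallest gap first.
    ranked = sorted(
        (i for i, p in enumerate(probs) if p > 0),
        key=lambda i: abs(-math.log(probs[i]) - entropy),
    )
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= mass:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}  # renormalized distribution

# The "uncertain" toy logits from the top-p example (index 0 is the most probable token).
uncertain_logits = [2.0, 1.8, 1.6, 1.4, 1.2, 1.0, 0.8, 0.5]
probs = softmax(uncertain_logits)
filtered = typical_filter(probs, mass=0.5)
print(f"kept token indices: {sorted(filtered)}")
```

With these logits and mass=0.5, the filter keeps four mid-ranked tokens and drops the single most probable one, which is exactly where typical sampling departs from top-k, top-p, and min-p: those methods always retain the top token.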

All the sampling strategies we have covered so far control which tokens are considered and how probabilities are distributed. Yet even with well-tuned sampling, language models have a persistent tendency to fall into repetitive loops. The next family of techniques tackles this problem directly by penalizing tokens that have already appeared.

Key Takeaways
Self-Check
1. What is the key advantage of top-p sampling over top-k sampling?
Answer: Top-p (nucleus) sampling adapts the number of candidate tokens to the model's confidence at each step. When the model is confident, the nucleus is small; when uncertain, it expands. Top-k always keeps exactly k tokens regardless of the distribution shape, which can be too many for confident predictions or too few for uncertain ones.
2. If you set temperature to 0.0 (or very close to 0), what decoding strategy does sampling become equivalent to?
Answer: As temperature approaches 0, the softmax distribution becomes infinitely peaked on the highest-logit token, making sampling equivalent to greedy decoding. All probability mass concentrates on a single token, so sampling always selects that token.

What's Next?

The discussion continues in Section 4.2a: Penalties, Combining Methods & Sampling Lab, which covers repetition / frequency / presence penalties, how to stack temperature with top-k/top-p in practice, and a hands-on lab that visualizes what each knob does to the distribution. After that, Section 4.3 turns to advanced decoding and structured generation.

Further Reading

Core Sampling Methods

Holtzman, A. et al. (2020). "The Curious Case of Neural Text Degeneration." ICLR 2020. The landmark paper introducing nucleus (top-p) sampling. Demonstrates that maximization-based decoding leads to degenerate, repetitive text and proposes sampling from the dynamic nucleus of the probability distribution instead.
Fan, A., Lewis, M. & Dauphin, Y. (2018). "Hierarchical Neural Story Generation." ACL 2018. Introduces top-k sampling for open-ended text generation in the context of creative story writing. Shows that truncating the distribution to the top k tokens produces more coherent and interesting narratives than pure random sampling.
Hewitt, J. et al. (2022). "Truncation Sampling as Language Model Desmoothing." EMNLP 2022. Provides a theoretical framework unifying top-k and top-p sampling as forms of "desmoothing" that counteract the model's tendency to spread probability mass too broadly. Offers principled guidance for choosing truncation thresholds.
Meister, C., Vieira, T. & Cotterell, R. (2023). "Locally Typical Sampling." TACL. Proposes sampling tokens whose information content is close to the expected information (entropy) of the distribution. Produces text that is statistically more similar to human writing than top-k or top-p sampling alone.