Part IX: Safety & Strategy
Chapter 32: Safety, Ethics, and Regulation

Hallucination & Reliability

I am confident in my answer. I am also, statistically speaking, occasionally wrong in ways that sound entirely plausible.

— A Confidently Wrong AI Agent
Big Picture

Hallucination is the single biggest obstacle to LLM reliability in production. Models generate fluent, confident text that is factually wrong, internally inconsistent, or unsupported by any source. This section covers the taxonomy of hallucination types, practical detection techniques (self-consistency, citation verification, natural language inference), and mitigation strategies ranging from RAG grounding (as covered in Chapter 20) to constrained generation and calibrated abstention.

Prerequisites

Before starting, make sure you are familiar with security threats from Section 32.1. The retrieval-augmented generation fundamentals from Section 20.1 provide the primary hallucination mitigation strategy, and the evaluation methods from Section 29.1 are essential for measuring hallucination rates in practice.

A cartoon robot confidently presenting a treasure map that leads to an obviously wrong location off a cliff, while a small detective robot with a magnifying glass fact-checks the map against a real atlas, illustrating the tension between model confidence and factual accuracy.
The most dangerous hallucinations are the ones that sound completely right. Confidence and correctness are independent variables in language model outputs.

1. Hallucination Taxonomy

| Type | Description | Example |
|---|---|---|
| Factual fabrication | Inventing facts that sound plausible | Citing a non-existent paper |
| Intrinsic hallucination | Contradicting the provided source text | Summarizing a document with wrong numbers |
| Extrinsic hallucination | Adding information not in the source | Introducing claims absent from context |
| Self-contradiction | Making inconsistent statements within one response | Saying "X is true" then "X is false" |
| Outdated knowledge | Stating facts that were true at training time but not now | Reporting a CEO who has since been replaced |
Fun Fact

In 2023, a lawyer submitted a legal brief containing six case citations fabricated by ChatGPT, complete with plausible docket numbers. The judge was not amused. This incident became the unofficial mascot of the hallucination problem and the strongest argument for always verifying LLM outputs.

Key Insight

Mental Model: The Confident Witness. An LLM that hallucinates is like an eyewitness who fills in gaps in their memory with plausible details and then reports the whole account with equal confidence. The witness is not lying; they genuinely cannot distinguish what they saw from what they inferred. Similarly, the model has no internal "I am making this up" signal. Detection strategies work like a detective cross-examining the witness: ask the same question multiple ways (self-consistency), check the story against records (NLI verification), and note which details change between tellings.

Key Insight

Hallucination rates vary dramatically by domain and query type. A model that hallucinates in 5% of general-knowledge queries may hallucinate in 30% of queries about recent events or niche technical topics. The teams that measure hallucination rates per domain rather than in aggregate catch reliability problems before users do. Combining multiple detection methods (self-consistency, NLI, citation verification) yields better coverage than any single technique alone.

Fun Fact

A 2024 study asked GPT-4 to provide legal citations and found that roughly 30% of the cited cases did not exist. The model had "hallucinated" perfectly formatted court case names, docket numbers, and even plausible judge names. At least two attorneys were sanctioned by courts for filing briefs containing fabricated AI-generated citations they did not verify.

The hallucination problem connects directly to the evaluation challenge from Section 29.1: if you cannot reliably evaluate whether an output is correct, you cannot reliably detect when it is fabricated. This is why hallucination detection combines multiple strategies. No single method catches everything, but the combination of self-consistency checking, NLI-based verification, and RAG-based grounding significantly reduces the rate of undetected hallucination.

2. Self-Consistency Detection

Self-consistency checking samples multiple answers to the same question and checks whether they agree. If the model gives contradictory answers across rephrasings, this signals that at least some responses are hallucinated. Code Fragment 32.2.1 below implements this approach using an LLM judge for cross-answer comparison.

Self-Consistency Hallucination Score.

Given a question q, sample n responses {r1, ..., rn} at temperature T > 0. For each pair (ri, rj), compute an agreement indicator:

$$\mathrm{agree}(r_i, r_j) = \mathrm{NLI}_{\mathrm{entail}}(r_i, r_j) \land \mathrm{NLI}_{\mathrm{entail}}(r_j, r_i)$$

$$\mathrm{consistency}(q) = \frac{2}{n(n-1)} \sum_{i<j} \mathrm{agree}(r_i, r_j)$$

A consistency score near 1.0 indicates the model is confident and self-consistent (lower hallucination risk). A score near 0.0 indicates contradictory outputs (higher hallucination risk). Bidirectional NLI entailment ensures that agreement is symmetric: ri must entail rj and vice versa.
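Given a pairwise agreement predicate, the score itself is only a few lines. In the sketch below, `toy_agree` is an illustrative stand-in for the bidirectional NLI check:

```python
from itertools import combinations

def consistency_score(responses, agree):
    """Fraction of response pairs that mutually agree; `agree` is any
    pairwise predicate, e.g. NLI entailment checked in both directions."""
    pairs = list(combinations(responses, 2))
    if not pairs:
        return 1.0  # a single response is trivially self-consistent
    return sum(agree(a, b) and agree(b, a) for a, b in pairs) / len(pairs)

# Toy agreement predicate for illustration: normalized exact match.
toy_agree = lambda a, b: a.strip().lower() == b.strip().lower()
print(consistency_score(["Paris", "paris", "Lyon"], toy_agree))  # 1 of 3 pairs agree
```

Swapping `toy_agree` for an NLI-based predicate recovers the formula above exactly.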


# Self-consistency check via repeated sampling and an LLM judge
from openai import OpenAI

client = OpenAI()

def self_consistency_check(question: str, n_samples: int = 5, temperature: float = 0.8):
    """Generate multiple answers and check agreement."""
    responses = []
    for _ in range(n_samples):
        r = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": question}],
            temperature=temperature,
        )
        responses.append(r.choices[0].message.content.strip())

    # Use an LLM judge to check consistency across responses
    check_prompt = f"""Given these {n_samples} answers to the same question, are they
consistent with each other? Report any contradictions.

Question: {question}

Answers:
""" + "\n".join(f"{i+1}. {r}" for i, r in enumerate(responses))

    verdict = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": check_prompt}],
        temperature=0.0,
    )
    return {"responses": responses, "consistency_check": verdict.choices[0].message.content}
Code Fragment 32.2.1: Sampling multiple answers at temperature > 0 and using an LLM judge to report contradictions across samples.

# Calibrated abstention: answer only above a confidence threshold
import re

from openai import OpenAI

client = OpenAI()

def calibrated_abstention(question: str, context: str, threshold: float = 0.7):
    """Generate answer with confidence score; abstain if below threshold."""
    prompt = f"""Based ONLY on the context below, answer the question.
After your answer, rate your confidence from 0.0 to 1.0.

Context: {context}
Question: {question}

Format:
Answer: [your answer]
Confidence: [0.0 to 1.0]"""

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    text = response.choices[0].message.content

    # Parse the self-reported confidence score
    conf_match = re.search(r"Confidence:\s*([\d.]+)", text)
    confidence = float(conf_match.group(1)) if conf_match else 0.0

    if confidence < threshold:
        return {"answer": "I don't have enough information to answer confidently.",
                "confidence": confidence, "abstained": True}
    return {"answer": text, "confidence": confidence, "abstained": False}
Code Fragment 32.2.2: Prompting for an answer plus a self-reported confidence score, and abstaining when the score falls below the threshold.

Figure 32.2.1 compares three complementary detection approaches: self-consistency checking, citation verification, and NLI-based entailment scoring.

  • Self-Consistency: sample N responses at high temperature and check agreement across samples; low agreement signals likely hallucination.
  • Citation Verification: extract cited claims from the LLM output and verify each against source documents; unsupported claims signal extrinsic hallucination.
  • NLI-Based: use an NLI model with the source as premise and the output as hypothesis; contradiction signals intrinsic hallucination.
Figure 32.2.1: Three complementary approaches detect different types of hallucination at different stages.

NLI-Based Hallucination Detection

Natural Language Inference (NLI) models classify whether a claim is entailed by, contradicts, or is neutral with respect to a source passage. This makes them a powerful automated factuality checker. Code Fragment 32.2.3 below uses an off-the-shelf NLI model to verify whether generated claims are supported by source documents.


# NLI-based faithfulness check: source as premise, claim as hypothesis
from transformers import pipeline

# An MNLI model scores (premise, hypothesis) pairs directly.
nli = pipeline("text-classification", model="facebook/bart-large-mnli")

def check_faithfulness(source: str, claim: str) -> dict:
    """Check if a claim is supported by the source using NLI
    (premise = source, hypothesis = claim)."""
    result = nli({"text": source, "text_pair": claim})[0]
    label = result["label"].lower()  # entailment / neutral / contradiction
    return {
        "claim": claim,
        "supported": label == "entailment",
        "label": label,
        "confidence": round(result["score"], 3),
    }

source = "The Eiffel Tower was completed in 1889 and is 330 meters tall."
print(check_faithfulness(source, "The Eiffel Tower is 330 meters tall."))
# expected: supported True (entailment)
print(check_faithfulness(source, "The Eiffel Tower was built in 1920."))
# expected: supported False (contradiction)
Code Fragment 32.2.3: Verifying whether generated claims are entailed by a source passage, with the source as NLI premise and each claim as hypothesis.

3. Mitigation Strategies


Tip

For high-stakes applications (medical, legal, financial), combine NLI-based verification with a simple heuristic: if the model's response contains any specific numbers, dates, or proper nouns, verify each one against the source documents. This "fact-check the specifics" rule catches the most dangerous hallucinations (wrong dosage, wrong date, wrong company name) while keeping verification costs manageable. Generic fluency hallucinations are less harmful and can be caught by periodic human review.
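The "fact-check the specifics" rule can be approximated with simple regular expressions. The patterns below are illustrative heuristics, not production-grade entity extractors:

```python
import re

def extract_specifics(text):
    """Pull out the dangerous specifics: numbers and capitalized
    multi-word proper nouns (a rough illustrative heuristic)."""
    numbers = re.findall(r"\b\d+(?:\.\d+)?\b", text)
    proper = re.findall(r"\b(?:[A-Z][a-z]+\s)+[A-Z][a-z]+\b", text)
    return set(numbers) | {p.strip() for p in proper}

def unverified_specifics(response, source):
    """Specifics in the response that never appear in the source text."""
    return {s for s in extract_specifics(response) if s not in source}

src = "The Eiffel Tower was completed in 1889 and is 330 meters tall."
print(unverified_specifics("The Eiffel Tower is 330 meters tall, built in 1920.", src))
# → {'1920'}: the fabricated year is flagged; '330' appears in the source
```

Any specific this filter flags can then be routed to NLI verification or human review.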

Once hallucinations are detected, several mitigation strategies are available. Figure 32.2.2 arranges these along a spectrum from lower effort (RAG grounding) to higher reliability (calibrated abstention).

RAG Grounding (anchor to source docs) → Constrained Generation (JSON schema, regex) → Confidence Calibration (score + threshold) → Abstention ("I don't know"), arranged from lower effort with partial coverage to higher effort with better coverage.
Figure 32.2.2: Mitigation strategies range from RAG grounding (simple but partial) to calibrated abstention (most reliable for high-stakes use cases).

In practice, detection and mitigation work together as a pipeline. Figure 32.2.3 shows how a production system routes LLM output through detection checks and selects the appropriate mitigation response based on the confidence score and use case risk level.

Figure 32.2.3: A production hallucination pipeline routes LLM output through detection checks and selects the response strategy based on confidence score and application risk tolerance.

Warning

LLM self-reported confidence scores are not well calibrated. Models tend to express high confidence even when wrong. Use self-consistency (agreement across multiple samples) as a more reliable confidence proxy than asking the model to rate its own certainty.

Note

RAG reduces but does not eliminate hallucination. Models can hallucinate even with perfect context if the answer requires reasoning that the model performs incorrectly, or if the model ignores the context and relies on parametric knowledge instead. Always verify RAG outputs against their cited sources.

Key Insight

For high-stakes applications (medical, legal, financial), the best strategy is calibrated abstention: the system should refuse to answer when confidence is low rather than risk a convincing but wrong response. Users prefer "I don't know" over a confident hallucination that could lead to harmful decisions.

Self-Check

1. What is the difference between intrinsic and extrinsic hallucination?

Show Answer
Intrinsic hallucination contradicts the provided source text (e.g., stating wrong numbers from a document). Extrinsic hallucination introduces information not present in any source (e.g., adding fabricated claims). Intrinsic hallucinations are easier to detect because they conflict with available evidence.

2. How does self-consistency detect hallucination?

Show Answer
Self-consistency generates multiple responses to the same question at high temperature and checks whether they agree. When the model is confident and correct, different samples tend to converge on the same answer. When the model is hallucinating, samples diverge because there is no grounded answer to converge on. Low agreement signals potential hallucination.

3. Why is NLI useful for hallucination detection in RAG systems?

Show Answer
NLI (Natural Language Inference) models classify whether a hypothesis is entailed by, contradicts, or is neutral to a premise. In RAG, the retrieved context is the premise and each claim in the LLM output is a hypothesis. Claims labeled as "contradiction" indicate intrinsic hallucination; "neutral" claims may indicate extrinsic hallucination.

4. What is calibrated abstention and when should it be used?

Show Answer
Calibrated abstention is a strategy where the system refuses to answer when its confidence falls below a threshold, responding with "I don't have enough information" instead. It should be used in high-stakes domains (medical, legal, financial) where a confident but wrong answer could cause real harm, and where users would rather receive no answer than an unreliable one.

5. Why doesn't RAG completely solve the hallucination problem?

Show Answer
RAG provides relevant context, but the model can still hallucinate for several reasons: it may perform incorrect reasoning over the context, ignore the context and rely on parametric knowledge, combine information from multiple passages in invalid ways, or extrapolate beyond what the context actually states. RAG reduces but does not eliminate the fundamental tendency to generate unsupported claims.
Real-World Scenario: Reducing Hallucination in a Legal Research Assistant

Who: An AI engineering team at a legal technology company

Situation: Their LLM-powered case law research tool generated summaries with case citations. Lawyers relied on these summaries for brief preparation.

Problem: In a pilot with 50 attorneys, 12% of generated citations referred to cases that did not exist. One fabricated citation nearly made it into a court filing, threatening the company's credibility.

Dilemma: Adding citation verification through a legal database API increased response time from 3 seconds to 9 seconds. Lawyers valued speed, but a single fabricated citation could cause professional sanctions.

Decision: They implemented calibrated abstention: the system verified every citation against a case law database before displaying results, and refused to answer when verification failed.

How: Each cited case was checked against a Westlaw API. Verified citations displayed a green indicator. Unverifiable citations were replaced with a notice: "Citation could not be verified; please check manually." If more than 30% of citations failed verification, the entire response was withheld.

Result: Fabricated citation rate dropped from 12% to 0.2%. Response time increased to 7 seconds (acceptable once attorneys understood the verification step). Attorney trust scores rose from 3.1 to 4.6 out of 5.

Lesson: In high-stakes domains, users prefer slower, verified answers over fast, unreliable ones. Calibrated abstention preserves trust even when it reduces throughput.
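The verification-and-withholding policy from this scenario can be sketched as a small gating function. The `verify` callable stands in for the case-law database lookup, and all names are hypothetical:

```python
def gate_citations(citations, verify, fail_threshold=0.3):
    """Verify each citation; annotate failures; withhold the whole
    response when the failure rate exceeds the threshold."""
    results = {c: verify(c) for c in citations}
    fail_rate = sum(not ok for ok in results.values()) / len(citations)
    if fail_rate > fail_threshold:
        return {"withheld": True, "fail_rate": fail_rate}
    display = {c: "verified" if ok else "citation could not be verified; check manually"
               for c, ok in results.items()}
    return {"withheld": False, "citations": display}

# Demo: one fabricated citation out of two exceeds the 30% threshold.
print(gate_citations(["Smith v. Jones", "Fake v. Case"], lambda c: c != "Fake v. Case"))
# → {'withheld': True, 'fail_rate': 0.5}
```

With four citations and one failure (25%), the same policy would display the response but flag the unverified entry, matching the scenario's behavior.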

Key Takeaways
  • Hallucinations come in five types: factual fabrication, intrinsic contradiction, extrinsic addition, self-contradiction, and outdated knowledge.
  • Self-consistency detection generates multiple samples and measures agreement; low agreement signals potential hallucination.
  • NLI-based detection checks whether output claims are entailed by the source context, catching both intrinsic and extrinsic hallucination.
  • RAG grounding reduces hallucination by anchoring responses to retrieved documents, but does not eliminate it entirely.
  • Calibrated abstention is the safest strategy for high-stakes applications: refuse to answer when confidence is low.
  • LLM self-reported confidence is not well calibrated; use self-consistency or NLI scores as more reliable confidence proxies.

5. Privacy Risks and Memorization

LLMs memorize fragments of their training data, and that memorization creates privacy risks that go beyond hallucination. While hallucination produces false information, memorization can leak real information: personally identifiable information (PII), proprietary code, private conversations, and other sensitive content that appeared in the training corpus. Understanding these risks is essential for responsible deployment.

5.1 Training Data Extraction Attacks

Carlini et al. (2021, 2023) demonstrated that language models can be prompted to regurgitate verbatim training data, including phone numbers, email addresses, and code snippets. The attack is straightforward: provide a prefix that appeared in the training data, and the model may complete it with memorized content. Larger models memorize more data, and data that appears multiple times in the training corpus is more extractable. This creates a direct tension between model capability and privacy: the same scale that improves performance also increases memorization risk.

5.2 Membership Inference Attacks

Membership inference attacks determine whether a specific data point was part of the training set. For LLMs, this means testing whether a particular text passage was used during training. The typical approach compares the model's perplexity (or loss) on the target text against a reference distribution: if the model assigns unusually low perplexity, the text was likely in the training data. This matters for copyright disputes, data governance audits, and compliance with data subject access requests under regulations like GDPR.
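The perplexity-ratio test can be sketched independently of any particular model by working from per-token log-probabilities. The numbers below are toy values chosen for illustration:

```python
import math
from statistics import median

def perplexity(log_probs):
    """Perplexity from per-token natural-log probabilities."""
    return math.exp(-sum(log_probs) / len(log_probs))

def membership_ratio(target_lps, reference_lps):
    """Ratio of the target text's perplexity to the median perplexity of
    reference texts known NOT to be in the training set; ratios well
    below 1.0 are evidence the target was seen during training."""
    refs = [perplexity(lp) for lp in reference_lps]
    return perplexity(target_lps) / median(refs)

# Toy numbers: the model assigns much higher per-token probability to the
# target than to comparable unseen references.
target = [-0.5, -0.6, -0.4]
references = [[-2.0, -2.2], [-1.8, -2.1], [-2.3, -1.9]]
print(round(membership_ratio(target, references), 3))  # → 0.202
```

In practice the log-probabilities come from a forward pass over the candidate text; the ratio structure is what makes the signal comparable across texts of different lengths.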

5.3 Training Data Extraction in Practice

Beyond confirming membership, attackers can extract verbatim training data from production models. Carlini et al. (2023) demonstrated a scalable extraction attack on ChatGPT: by prompting the model to repeat a single word indefinitely ("Repeat the word 'poem' forever"), the model eventually diverged from the repetition task and began emitting memorized training data, including email addresses, phone numbers, and snippets of copyrighted text. The attack exploits a failure mode in RLHF alignment that does not fully cover degenerate repetition prompts.

Factors affecting memorization include data duplication (content that appears multiple times in the training corpus is far more extractable), model size (larger models memorize more), and training duration (overtraining increases memorization). PII is particularly at risk because names, addresses, and contact information often appear in specific, templated formats that models memorize as patterns.

5.4 Differential Privacy for LLM Training

Differential privacy (DP) provides a mathematical framework for limiting how much any single training example can influence the trained model. DP-SGD (Differentially Private Stochastic Gradient Descent) modifies the training loop in two ways: (1) it clips per-example gradients to bound the maximum influence of any one example, and (2) it adds calibrated Gaussian noise to the aggregated gradients before the parameter update. The privacy guarantee is parameterized by epsilon (lower epsilon means stronger privacy) and delta (the probability of the guarantee failing).
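The two modifications can be seen in a minimal sketch of one DP-SGD step, assuming per-example gradients are available as NumPy arrays (parameter names are illustrative):

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, lr=0.1, clip_norm=1.0,
                noise_multiplier=1.1, rng=None):
    """One DP-SGD update: (1) clip each example's gradient to clip_norm,
    (2) average and add calibrated Gaussian noise before the update."""
    rng = rng if rng is not None else np.random.default_rng(0)
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))  # (1) clip
    mean_grad = np.mean(clipped, axis=0)
    sigma = noise_multiplier * clip_norm / len(per_example_grads)
    noisy_grad = mean_grad + rng.normal(0.0, sigma, size=mean_grad.shape)  # (2) noise
    return params - lr * noisy_grad

# With noise_multiplier=0 the step reduces to clipped-gradient SGD:
# grads of norm 3 and 4 are clipped to unit norm, giving mean [0.5, 0.5, 0].
params = np.zeros(3)
grads = [np.array([3.0, 0.0, 0.0]), np.array([0.0, 4.0, 0.0])]
print(dp_sgd_step(params, grads, lr=1.0, noise_multiplier=0.0))
```

Clipping bounds any single example's influence; the noise scale ties that bound to the (epsilon, delta) guarantee via the privacy accountant.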

Applying DP-SGD to large language models remains challenging. At privacy budgets that provide meaningful protection (epsilon below 10), model quality degrades noticeably. Current research focuses on techniques to reduce this quality gap: pre-training on public data without DP and then fine-tuning with DP on sensitive data, using larger batch sizes to improve the signal-to-noise ratio, and applying DP selectively to the layers most responsible for memorization.

5.5 Canary Insertion and Privacy Auditing

Canary insertion is a practical technique for measuring memorization risk. Before training, you insert synthetic "canary" strings into the training data: unique sequences (such as random credit card numbers or fabricated email addresses) that do not appear anywhere else. After training, you probe the model to see if it can complete or reproduce the canaries. The extraction rate of canaries provides a lower bound on the model's memorization of real sensitive data.

Privacy auditing methodology formalizes this into a systematic process: (1) define a set of sensitive data categories (PII, proprietary code, copyrighted text), (2) insert canaries for each category at varying duplication levels, (3) after training, run extraction attacks using prefix prompting and repetition-based elicitation, (4) measure the extraction rate per category, and (5) compare against acceptable thresholds defined by your compliance requirements.
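Step 4 of the audit, measuring the extraction rate, might look like the following sketch, where `generate` is any text-completion callable and the canaries are synthetic strings:

```python
def canary_extraction_rate(canaries, generate, prefix_len=9):
    """Prompt with each canary's prefix and count how often the model
    completes the remainder verbatim."""
    hits = 0
    for canary in canaries:
        prefix, secret = canary[:prefix_len], canary[prefix_len:]
        if secret and secret in generate(prefix):
            hits += 1
    return hits / len(canaries)

# Toy stand-in for a model that has memorized exactly one canary.
canaries = ["CANARY-A 4111-1111-1111-1111", "CANARY-B 5500-0000-0000-0004"]
memorized = canaries[0]
toy_model = lambda p: memorized[len(p):] if memorized.startswith(p) else "no idea"
print(canary_extraction_rate(canaries, toy_model))  # → 0.5
```

Running this per sensitive-data category and per duplication level yields the extraction-rate table that is compared against compliance thresholds.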

Real-World Scenario: Measuring Memorization with Perplexity Probes

Who: A privacy engineer on the ML platform team at a telecom company

Situation: The team had fine-tuned an LLM on 2 million customer support transcripts to power an automated troubleshooting assistant. Before launch, the company's data protection officer required evidence that the model had not memorized customer PII.

Problem: Standard output filtering could catch structured PII (phone numbers, emails), but the team had no way to verify whether the model had memorized unstructured personal details embedded in conversation transcripts.

Decision: They built a perplexity-based memorization audit. They computed the model's perplexity on held-out customer records that were not in the training data and compared it to perplexity on records that were included. If the model assigned significantly lower perplexity to training examples, this indicated memorization. They set a threshold: any individual record with a perplexity ratio (training vs. held-out) exceeding 2.0 would trigger deduplication and retraining.

Result: The audit identified 340 customer records with perplexity ratios above 2.0, all of which appeared 5+ times in the training data. After deduplication and retraining, the maximum ratio dropped to 1.3, within the acceptable range. The data protection officer approved launch.

Lesson: Perplexity probes provide a quantitative, auditable signal for memorization that satisfies compliance reviewers in ways that qualitative testing cannot.

5.6 Mitigation Strategies

| Strategy | Mechanism | Trade-off |
|---|---|---|
| Differential privacy (DP-SGD) | Adds calibrated noise to gradients during training, providing a mathematical guarantee that no single training example overly influences the model | Significant performance degradation at strong privacy budgets; current research focuses on reducing this gap |
| Data deduplication | Removes duplicate passages from training data before training (see Section 06.3 on data curation) | Reduces memorization of frequently repeated content; does not protect unique sensitive data |
| Machine unlearning | Post-hoc removal of specific data from a trained model, typically through gradient ascent on the target data or fine-tuning on modified datasets | Active research area; current methods may not fully remove information and can degrade model quality |
| Output filtering | Scanning model outputs for PII patterns (emails, phone numbers, SSNs) and redacting before delivery | Catches structured PII but misses unstructured sensitive information; adds latency |
| Guardrails and system prompts | Instructing the model to avoid reproducing sensitive data | Easily bypassed through prompt engineering; should never be the sole defense |

5.7 PII Leakage in Production

Production LLM systems face PII leakage risks from two directions: the base model may reproduce training data, and user inputs may leak across sessions if conversation data is used for fine-tuning or if context management has bugs. Practical defenses include: running PII detection classifiers on both inputs and outputs, implementing strict data isolation between users, establishing clear data retention policies, and logging potential PII exposures for audit (see Section 32.1 for the broader security framework). For applications handling health, financial, or legal data, these measures are not optional; they are regulatory requirements under frameworks like HIPAA, PCI-DSS, and GDPR.
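A regex-based output filter for structured PII might look like the following sketch. The patterns are deliberately simplified; production systems typically layer dedicated PII-detection classifiers on top:

```python
import re

# Simplified illustrative patterns, not exhaustive PII coverage.
PII_PATTERNS = {
    "email": r"[\w.+-]+@[\w-]+\.[\w.]+",
    "phone": r"\b(?:\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b",
    "ssn": r"\b\d{3}-\d{2}-\d{4}\b",
}

def redact_pii(text):
    """Replace matched PII with a [LABEL] placeholder; return the
    redacted text plus the categories that were found (for audit logs)."""
    found = []
    for label, pattern in PII_PATTERNS.items():
        text, n = re.subn(pattern, f"[{label.upper()}]", text)
        if n:
            found.append(label)
    return text, found

print(redact_pii("Call 555-867-5309 or mail jane@example.com"))
# → ('Call [PHONE] or mail [EMAIL]', ['email', 'phone'])
```

The same function can run on user inputs before they enter logs or fine-tuning datasets, covering both leakage directions described above.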

Key Insight

Privacy protection requires defense in depth: data curation before training (Section 06.3), differential privacy during training, output filtering after generation, and monitoring in production. No single technique is sufficient. The most robust deployments layer all four approaches, treating privacy as a system property rather than a feature of any individual component.

Research Frontier

Open Questions:

  • Can hallucination rates be reduced to acceptable levels for high-stakes applications (legal, medical, financial), or will human verification always be required?
  • What is the relationship between model size, training data quality, and hallucination rates? Scaling alone does not eliminate hallucinations, and the underlying mechanisms remain poorly understood.

Recent Developments (2024-2025):

  • Citation and attribution techniques (2024-2025), where models ground responses in retrieved sources and provide verifiable references, emerged as the most practical approach to reducing hallucination impact in production systems.
  • Mechanistic interpretability research (2024-2025) began identifying specific circuit patterns associated with factual recall versus confabulation, offering potential paths toward architectural solutions to hallucination.

Explore Further: Test a model on 50 factual questions with known answers, with and without RAG. Measure hallucination rates in both conditions and analyze whether hallucinations correlate with question difficulty or topic area.

Exercises

Exercise 32.2.1: Hallucination Types Conceptual

Define and give an example of each hallucination type: factual fabrication, intrinsic hallucination, extrinsic hallucination, and self-contradiction. Explain which type is most dangerous in a medical Q&A system.

Answer Sketch

Factual fabrication: inventing a non-existent research paper. Intrinsic hallucination: summarizing a patient record but changing a dosage number. Extrinsic hallucination: adding a drug interaction not mentioned in the provided context. Self-contradiction: stating a drug is safe and then listing it as contraindicated. Intrinsic hallucination is most dangerous in medical settings because it subtly alters factual content the user trusts as coming from a real source.

Exercise 32.2.2: Self-Consistency Detection Coding

Implement a self-consistency hallucination detector: generate 5 responses to the same question (with temperature > 0), compare them, and flag claims that appear in fewer than 3 of the 5 responses as potentially hallucinated.

Answer Sketch

Call the LLM 5 times with the same prompt at temperature=0.7. For each response, extract key claims (using an LLM to decompose into atomic statements). For each claim, check how many of the 5 responses contain a semantically equivalent claim (using embedding similarity). Claims appearing in fewer than 3 responses are inconsistent and flagged. This works because hallucinated facts are rarely consistent across samples, while true facts tend to be stable.

Exercise 32.2.3: RAG Grounding Effectiveness Analysis

A RAG system reduces hallucination from 15% to 5%. Analyze why the remaining 5% still occurs. Describe three failure modes where RAG does not prevent hallucination.

Answer Sketch

(1) The retrieved context is relevant but incomplete, so the model fills in gaps from its parametric memory (often incorrectly). (2) The retrieved context is wrong or outdated, and the model faithfully generates answers based on bad context. (3) The model over-generates beyond the context, adding plausible-sounding details not in any retrieved document. Mitigation: use faithfulness scoring to catch case 3, improve retrieval quality for case 2, and train the model to say "I don't have enough information" for case 1.

Exercise 32.2.4: Calibrated Abstention Conceptual

Explain the concept of calibrated abstention: when and how should an LLM refuse to answer rather than risk a hallucinated response? Design a set of rules for a legal Q&A system.

Answer Sketch

Calibrated abstention means the model declines to answer when its confidence is below a threshold. For a legal Q&A system: (1) Abstain if the retrieved context does not contain information about the specific jurisdiction. (2) Abstain if the question requires interpretation of case law not in the knowledge base. (3) Abstain if the question asks for specific legal advice (recommend consulting a lawyer). (4) Abstain if the self-consistency score across multiple generations is below 0.6. Implementation: add a confidence estimation step before the final response and route low-confidence queries to human experts.

Exercise 32.2.5: Citation Verification Pipeline Coding

Design a citation verification system that checks whether statements in an LLM response are supported by the cited sources. Outline the architecture, including claim extraction, source retrieval, and entailment checking.

Answer Sketch

Architecture: (1) Claim extraction: use an LLM to decompose the response into (claim, citation) pairs. (2) Source retrieval: fetch the cited document or passage. (3) Entailment checking: use an NLI model or LLM to classify whether the source entails, contradicts, or is neutral toward each claim. (4) Scoring: compute the fraction of claims that are entailed by their cited sources. (5) Flagging: mark claims that are contradicted or unsupported. This pipeline can run as an output guardrail or as a batch evaluation process.

What Comes Next

In the next section, Section 32.3: Bias, Fairness & Ethics, we examine bias, fairness, and ethics in LLM systems, addressing the social responsibilities of deploying AI.

Further Reading & References
Core References

Ji, Z. et al. (2023). Survey of Hallucination in Natural Language Generation. ACM Computing Surveys.

Comprehensive taxonomy of hallucination types across summarization, translation, and dialogue generation tasks. Covers intrinsic vs. extrinsic hallucination with detailed examples. Essential survey for anyone building hallucination detection or mitigation systems.

Survey Paper

Manakul, P., Liusie, A., & Gales, M. (2023). SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models.

Proposes a method to detect hallucinations by checking consistency across multiple sampled responses without external knowledge. Demonstrates that hallucinated facts show lower self-consistency than factual ones. Directly applicable to production hallucination monitoring.

Detection Method

Min, S. et al. (2023). FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation.

Introduces atomic fact decomposition for evaluating factual accuracy, breaking responses into individual verifiable claims. Provides a practical scoring methodology with human-correlated metrics. Key reference for building fine-grained factuality evaluation pipelines.

Evaluation Method

Wang, X. et al. (2023). Self-Consistency Improves Chain of Thought Reasoning in Language Models.

Demonstrates that sampling multiple reasoning paths and selecting the most consistent answer significantly improves accuracy. The self-consistency principle underpins many production hallucination mitigation strategies. Foundational paper for reliability-focused LLM deployments.

Reasoning Research

Honovich, O. et al. (2022). TRUE: Re-evaluating Factual Consistency Evaluation.

Benchmarks NLI-based factual consistency evaluation methods across multiple datasets, revealing strengths and failure modes. Provides practical guidance on choosing the right consistency checker. Useful for teams selecting or building automated fact-verification systems.

Evaluation Benchmark

Lewis, P. et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020.

The foundational RAG paper demonstrating how retrieval grounding reduces hallucination in knowledge-intensive tasks. Establishes the retrieve-then-generate paradigm used in most production LLM applications today. Required reading for understanding the primary hallucination mitigation strategy.

Foundational Paper