Part V: Retrieval and Conversation
Chapter 21: Building Conversational AI Systems

Multi-Turn Dialogue & Conversation Flows

A single brilliant answer means nothing if the conversation that produced it makes no sense.

Echo Echo, Coherence-Obsessed AI Agent
Big Picture

Real conversations are messy. Users change their minds, ask for clarification, jump between topics, give ambiguous instructions, and sometimes say things the system cannot handle. A conversational AI system that only works for the "happy path" will fail in practice. Building on the memory systems from Section 21.3, this section covers the patterns and strategies for handling the full complexity of multi-turn dialogue, including clarification and correction flows, topic management, fallback hierarchies, human handoff, and the critical engineering challenge of managing context window overflow in long conversations.

Prerequisites

This section builds on the dialogue architecture from Section 21.1 and the memory management strategies in Section 21.3. Familiarity with general LLM evaluation concepts from Section 29.1 provides helpful context for the evaluation exercises, though the focus here is on conversation-level patterns: repair, topic management, guided flows, fallback, and context budgeting.

1. Conversation Repair Patterns

Conversation repair refers to the mechanisms a dialogue system uses to recover from misunderstandings, ambiguity, and errors. In human conversation, repair happens naturally through clarification questions, corrections, and confirmations. Production systems can combine these patterns with observability tooling to track which repair patterns are triggered most frequently. Building these patterns into a conversational AI system is essential for robust performance.

Fun Fact

Researchers at Stanford found that users correct chatbot misunderstandings an average of 3.2 times before giving up and rephrasing their entire request from scratch. The lesson: a good clarification prompt after the first confusion saves two rounds of user frustration.

Clarification Strategies

Key Insight

The optimal clarification threshold is not fixed; it depends on the cost of getting it wrong. A banking bot that might transfer money to the wrong account should clarify aggressively, proceeding only when its confidence is very high (a high threshold). A casual FAQ bot that might give a slightly imprecise answer can proceed more boldly (a lower threshold). Calibrate your clarification trigger to the stakes of the action, not to a universal accuracy target.

When a user's message is ambiguous or incomplete, the system needs to ask for clarification rather than guess. The key design challenge is detecting when clarification is needed versus when the system should proceed with its best interpretation. Over-clarifying is annoying; under-clarifying leads to errors. Code Fragment 21.4.1 below puts this into practice.


# Define ClarificationType, ConversationRepairManager; implement detect_clarification_need, __init__, process_message
# Key operations: prompt construction, API interaction
from openai import OpenAI
from enum import Enum
import json

client = OpenAI()

class ClarificationType(Enum):
    NONE_NEEDED = "none_needed"
    AMBIGUOUS_REFERENCE = "ambiguous_reference"
    MISSING_INFORMATION = "missing_information"
    CONFLICTING_REQUEST = "conflicting_request"
    OUT_OF_SCOPE = "out_of_scope"
    UNCLEAR_INTENT = "unclear_intent"

def detect_clarification_need(
    user_message: str,
    conversation_history: list[dict],
    available_actions: list[str]
) -> dict:
    """Determine if clarification is needed before proceeding."""

    prompt = f"""Analyze whether this user message needs clarification
before the system can act. Consider the conversation history.

Available system actions: {', '.join(available_actions)}

Conversation history (last 3 turns):
{json.dumps(conversation_history[-6:], indent=2)}

Current user message: "{user_message}"

Return JSON with:
- needs_clarification: true/false
- type: one of [none_needed, ambiguous_reference, missing_information,
  conflicting_request, out_of_scope, unclear_intent]
- confidence: 0.0 to 1.0 (how confident the system is in its interpretation)
- best_interpretation: what the system thinks the user means
- clarification_question: question to ask if clarification needed
- alternatives: list of possible interpretations (if ambiguous)"""

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0
    )
    return json.loads(response.choices[0].message.content)

class ConversationRepairManager:
    """Handles clarification, correction, and repair in dialogue."""

    def __init__(self, confidence_threshold: float = 0.75):
        self.confidence_threshold = confidence_threshold
        self.pending_clarification: dict | None = None
        self.correction_history: list[dict] = []

    def process_message(self, user_message: str, history: list,
                        actions: list[str]) -> dict:
        """Decide whether to act, clarify, or handle a correction."""

        # Check if this is a correction of something previous
        if self._is_correction(user_message, history):
            return self._handle_correction(user_message, history)

        # Check if this answers a pending clarification
        if self.pending_clarification:
            return self._resolve_clarification(user_message)

        # Analyze the new message
        analysis = detect_clarification_need(
            user_message, history, actions
        )

        if (analysis["needs_clarification"]
                and analysis["confidence"] < self.confidence_threshold):
            self.pending_clarification = analysis
            return {
                "action": "clarify",
                "question": analysis["clarification_question"],
                "alternatives": analysis.get("alternatives", [])
            }

        return {
            "action": "proceed",
            "interpretation": analysis["best_interpretation"],
            "confidence": analysis["confidence"]
        }

    def _is_correction(self, message: str, history: list) -> bool:
        """Detect if the user is correcting a previous statement."""
        correction_markers = [
            "no, i meant", "actually,", "sorry, i meant",
            "not that", "i said", "no no", "correction:",
            "let me rephrase", "what i meant was",
            "change that to", "instead of"
        ]
        lower = message.lower().strip()
        return any(lower.startswith(m) for m in correction_markers)

    def _handle_correction(self, message: str, history: list) -> dict:
        """Process a user correction and update state."""
        self.correction_history.append({
            "original_context": history[-2:] if len(history) >= 2 else [],
            "correction": message
        })
        return {
            "action": "correct",
            "message": message,
            "instruction": (
                "The user is correcting their previous statement. "
                "Update your understanding accordingly."
            )
        }

    def _resolve_clarification(self, answer: str) -> dict:
        """Resolve a pending clarification with the user's answer."""
        resolved = {
            "action": "proceed",
            "original_question": self.pending_clarification,
            "clarification_answer": answer,
            "interpretation": (
                f"Original: {self.pending_clarification['best_interpretation']}. "
                f"Clarified with: {answer}"
            )
        }
        self.pending_clarification = None
        return resolved
Code Fragment 21.4.1: Detecting when clarification is needed and handling user corrections in multi-turn dialogue, with a repair manager that decides whether to act, clarify, or update its understanding.
Note: The Clarification Threshold

The confidence threshold for triggering clarification is one of the most important tuning parameters in a conversational system. Because the system clarifies whenever its confidence falls below the threshold, setting it too high (e.g., 0.95) makes the system ask too many questions, frustrating users who gave clear instructions; setting it too low (e.g., 0.5) makes it proceed with wrong interpretations. Start with 0.75, then adjust based on user feedback. Task-critical applications (medical, financial) should use a higher threshold; casual chatbots can use a lower one.
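This stake-based calibration can be sketched as a simple lookup. The stake categories and threshold values below are illustrative assumptions, not prescriptions, and the comparison follows the convention in Code Fragment 21.4.1 of clarifying when confidence falls below the threshold:

```python
# Illustrative stake-based clarification thresholds (assumed values;
# tune against real user feedback for your application).
STAKE_THRESHOLDS = {
    "irreversible": 0.95,    # money transfers, deletions: proceed only when very sure
    "costly": 0.85,          # bookings, account changes
    "routine": 0.75,         # default starting point
    "informational": 0.60,   # FAQ answers: proceed with the best guess
}

def should_clarify(confidence: float, stake: str) -> bool:
    """Clarify when interpretation confidence falls below the
    threshold for this action's stakes (default: routine, 0.75)."""
    return confidence < STAKE_THRESHOLDS.get(stake, 0.75)

print(should_clarify(0.80, "irreversible"))   # True: high stakes demand certainty
print(should_clarify(0.80, "informational"))  # False: proceed with best guess
```

The same `confidence_threshold` parameter of `ConversationRepairManager` can then be set per deployment, or per detected action, rather than globally.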

2. Topic Management

In multi-turn conversations, users frequently switch between topics. They might start asking about one product, pivot to ask about shipping policies, and then return to the original product question. A robust system needs to detect topic switches, maintain context for each topic, and resume prior topics gracefully when the user returns to them. Figure 21.4.1 shows how the topic stack tracks context switches. Code Fragment 21.4.2 below puts this into practice.

[Figure: a seven-turn timeline (T1-T7) in which the topic moves Laptop, Laptop, Shipping (switch), Shipping, Returns (switch), Laptop (resume), Laptop. The topic stack over time shows Laptop alone at T2; Shipping on top of the saved Laptop at T4; Returns, Shipping, Laptop at T5; and Laptop resumed on top at T6.]
Figure 21.4.1: Topic stack management showing how the system tracks topic switches, saves context for suspended topics, and resumes them when the user returns.

# Define TopicContext, TopicManager; implement __init__, detect_topic_change, switch_topic
# Key operations: prompt construction, API interaction
from dataclasses import dataclass, field
from typing import Optional
import json

from openai import OpenAI

client = OpenAI()

@dataclass
class TopicContext:
    """Context for a single conversation topic."""
    topic_name: str
    summary: str = ""
    turns: list[dict] = field(default_factory=list)
    state: dict = field(default_factory=dict)
    is_resolved: bool = False

class TopicManager:
    """Manages topic tracking and switching in conversations."""

    def __init__(self):
        self.topic_stack: list[TopicContext] = []
        self.resolved_topics: list[TopicContext] = []

    def detect_topic_change(self, user_message: str,
                            current_topic: Optional[TopicContext]) -> dict:
        """Detect if the user is switching, resuming, or staying on topic."""
        current_name = current_topic.topic_name if current_topic else "None"
        saved_topics = ([t.topic_name for t in self.topic_stack[:-1]]
                        if len(self.topic_stack) > 1 else [])

        prompt = f"""Given the current conversation topic and the user's new message,
determine the topic action.

Current topic: {current_name}
Saved (paused) topics: {saved_topics}
User message: "{user_message}"

Return JSON with:
- action: "continue" (same topic), "switch" (new topic), "resume" (back to saved topic)
- topic_name: name of the topic (new name if switch, existing if resume)
- reason: brief explanation"""

        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"},
            temperature=0
        )
        return json.loads(response.choices[0].message.content)

    def switch_topic(self, new_topic_name: str) -> TopicContext:
        """Switch to a new topic, preserving the current one."""
        new_topic = TopicContext(topic_name=new_topic_name)
        self.topic_stack.append(new_topic)
        return new_topic

    def resume_topic(self, topic_name: str) -> Optional[TopicContext]:
        """Resume a previously paused topic."""
        for i, topic in enumerate(self.topic_stack):
            if topic.topic_name == topic_name:
                # Move to top of stack
                resumed = self.topic_stack.pop(i)
                self.topic_stack.append(resumed)
                return resumed
        return None

    def get_current_topic(self) -> Optional[TopicContext]:
        """Return the currently active topic."""
        return self.topic_stack[-1] if self.topic_stack else None

    def get_topic_context_string(self) -> str:
        """Generate context about active and paused topics."""
        if not self.topic_stack:
            return "No active topics."

        current = self.topic_stack[-1]
        parts = [f"Current topic: {current.topic_name}"]

        if current.summary:
            parts.append(f"Topic context: {current.summary}")

        paused = self.topic_stack[:-1]
        if paused:
            paused_names = [t.topic_name for t in paused]
            parts.append(f"Paused topics: {', '.join(paused_names)}")

        return " | ".join(parts)
Code Fragment 21.4.2: The TopicManager class maintains the topic stack, detects topic changes, switches and resumes topics, and summarizes active and paused context in a complete, runnable example.

3. Guided Conversation Flows

Some conversations need to follow a structured sequence of steps while still feeling natural. Onboarding flows, troubleshooting wizards, and intake forms all benefit from a guided approach where the system steers the conversation through required stages while allowing the user to ask questions or deviate temporarily. Code Fragment 21.4.3 below puts this into practice.


# Define FlowStep, GuidedFlowEngine; implement __init__, get_current_prompt, process_response
# Key operations: prompt construction
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class FlowStep:
    """A single step in a guided conversation flow."""
    id: str
    prompt: str
    validation: Optional[Callable] = None
    next_step: Optional[str] = None
    branches: dict = field(default_factory=dict)  # condition -> step_id
    required: bool = True
    collected_value: Optional[str] = None

class GuidedFlowEngine:
    """Manages structured conversation flows with branching."""

    def __init__(self, steps: list[FlowStep]):
        self.steps = {s.id: s for s in steps}
        self.current_step_id: str = steps[0].id
        self.completed_steps: list[str] = []
        self.flow_data: dict = {}
        self.is_complete = False
        self.deviation_stack: list[str] = []

    def get_current_prompt(self) -> str:
        """Get the prompt for the current step."""
        step = self.steps[self.current_step_id]
        return step.prompt

    def process_response(self, user_response: str) -> dict:
        """Process user response for the current step."""
        step = self.steps[self.current_step_id]

        # Validate if validator exists
        if step.validation:
            is_valid, error_msg = step.validation(user_response)
            if not is_valid:
                return {
                    "action": "retry",
                    "message": error_msg,
                    "step": step.id
                }

        # Store the response
        step.collected_value = user_response
        self.flow_data[step.id] = user_response
        self.completed_steps.append(step.id)

        # Determine next step (branching logic)
        next_id = self._get_next_step(step, user_response)

        if next_id is None:
            self.is_complete = True
            return {
                "action": "complete",
                "data": self.flow_data,
                "message": "Flow completed successfully."
            }

        self.current_step_id = next_id
        return {
            "action": "next",
            "prompt": self.steps[next_id].prompt,
            "step": next_id,
            "progress": len(self.completed_steps) / len(self.steps)
        }

    def handle_deviation(self, user_message: str) -> dict:
        """Handle when the user goes off-script mid-flow."""
        # Save current position
        self.deviation_stack.append(self.current_step_id)
        return {
            "action": "deviation",
            "saved_step": self.current_step_id,
            "instruction": (
                "The user has asked something outside the current flow. "
                "Answer their question, then guide them back to the flow. "
                f"Current step was: {self.steps[self.current_step_id].prompt}"
            )
        }

    def resume_flow(self) -> dict:
        """Resume the flow after a deviation. If no deviation is
        pending, simply re-prompt the current step."""
        if self.deviation_stack:
            self.current_step_id = self.deviation_stack.pop()
        step = self.steps[self.current_step_id]
        return {
            "action": "resume",
            "prompt": (
                f"Now, back to where we were. {step.prompt}"
            ),
            "step": step.id
        }

    def _get_next_step(self, step: FlowStep,
                       response: str) -> Optional[str]:
        """Determine the next step based on response and branches."""
        # Normalize short answers so "y"/"n" match yes/no branch conditions
        normalized = {"y": "yes", "n": "no"}.get(
            response.lower().strip(), response.lower()
        )
        # Check branches first
        for condition, target_id in step.branches.items():
            if condition.lower() in normalized:
                return target_id
        # Fall back to default next
        return step.next_step

# Example: Troubleshooting flow
def validate_yes_no(response: str) -> tuple[bool, str]:
    if response.lower().strip() in ["yes", "no", "y", "n"]:
        return True, ""
    return False, "Please answer yes or no."

troubleshooting_flow = GuidedFlowEngine([
    FlowStep(
        id="start",
        prompt="Is your device currently powered on?",
        validation=validate_yes_no,
        branches={"no": "power_check", "yes": "connectivity"}
    ),
    FlowStep(
        id="power_check",
        prompt="Please try holding the power button for 10 seconds. Did it turn on?",
        validation=validate_yes_no,
        branches={"no": "hardware_issue", "yes": "connectivity"}
    ),
    FlowStep(
        id="connectivity",
        prompt="Can you see the Wi-Fi icon in the status bar?",
        validation=validate_yes_no,
        branches={"no": "wifi_fix", "yes": "app_check"}
    ),
    FlowStep(
        id="wifi_fix",
        prompt="Please go to Settings > Wi-Fi and toggle it off and on. Did that help?",
        validation=validate_yes_no,
        next_step="app_check"
    ),
    FlowStep(
        id="app_check",
        prompt="Which app is experiencing the issue?",
        next_step=None  # End of flow
    ),
    FlowStep(
        id="hardware_issue",
        prompt="It sounds like there may be a hardware issue. I will connect you with our repair team.",
        next_step=None
    ),
])
Code Fragment 21.4.3: A guided flow engine that walks the user through branching steps, validates each response, handles mid-flow deviations, and resumes the flow where it left off.

4. Fallback Strategies and Human Handoff

Every conversational system encounters situations it cannot handle. The quality of the fallback experience often determines user satisfaction more than the happy-path experience. A well-designed fallback hierarchy moves through increasingly robust recovery strategies before resorting to human handoff.

Key Insight

The best fallback strategies are invisible when they work. A clarification question that resolves the ambiguity, a topic redirect that moves the conversation to something the system can help with, or a graceful acknowledgment that narrows the user's request are all fallback strategies that the user may not even recognize as error recovery. The worst fallback is a generic "I don't understand" that provides no path forward. Figure 21.4.2 illustrates the fallback strategy hierarchy from least to most disruptive.

L1. Clarification question: "Did you mean X or Y?"
L2. Rephrasing request: "Could you rephrase that? I want to help but I am not sure what you need."
L3. Capability redirect: "I can help with A, B, or C. Which is closest to what you need?"
L4. Graceful acknowledgment: "I do not have that information, but here is what I can tell you..."
L5. Human handoff: "Let me connect you with a human agent who can assist."
(Ordered from least disruptive to most disruptive.)
Figure 21.4.2: Fallback strategy hierarchy from least disruptive (clarification) to most disruptive (human handoff). Systems should exhaust lower levels before escalating.
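One way to realize this hierarchy in code is an escalation ladder that moves up one level per consecutive failure and resets on success. The level names and response templates below echo Figure 21.4.2, while the one-level-per-failure escalation policy is an illustrative assumption:

```python
# Fallback levels ordered from least to most disruptive (per Figure 21.4.2).
# Template placeholders like {options} would be filled by the dialogue system.
FALLBACK_LEVELS = [
    ("clarify", "Did you mean {options}?"),
    ("rephrase", "Could you rephrase that? I want to help but I am not sure what you need."),
    ("redirect", "I can help with {capabilities}. Which is closest to what you need?"),
    ("acknowledge", "I do not have that information, but here is what I can tell you: {partial}"),
    ("handoff", "Let me connect you with a human agent who can assist."),
]

class FallbackLadder:
    """Escalates one level per consecutive failure; resets on success."""

    def __init__(self):
        self.level = 0

    def next_fallback(self) -> tuple[str, str]:
        """Return the (strategy, template) for the current level,
        then escalate. Clamps at the top level (human handoff)."""
        strategy, template = FALLBACK_LEVELS[min(self.level, len(FALLBACK_LEVELS) - 1)]
        self.level += 1
        return strategy, template

    def record_success(self) -> None:
        """A successful repair resets the ladder to the bottom."""
        self.level = 0

ladder = FallbackLadder()
print(ladder.next_fallback()[0])  # clarify
print(ladder.next_fallback()[0])  # rephrase
ladder.record_success()
print(ladder.next_fallback()[0])  # clarify again after a successful repair
```

Production systems often add per-level conditions (e.g., skip the capability redirect when no relevant capability exists), but the reset-on-success behavior is the key detail: without it, a long conversation with occasional hiccups would drift inexorably toward handoff.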

5. Context Window Overflow Management

As conversations grow long, the context window fills up. When the combined size of the system prompt, memory context, conversation history, and the new user message exceeds the model's context limit, something must be evicted. The strategy for what to remove and when to remove it has a significant impact on conversation quality.

Priority-Based Eviction

Priority-based eviction assigns importance scores to different types of content in the context window. When space runs out, the lowest-priority content is evicted first. System prompts and safety instructions always have the highest priority; routine conversation turns have the lowest. Code Fragment 21.4.4 below puts this into practice.


# Define ContextPriority, ContextBlock, ContextBudgetManager; implement __init__, add_block, build_context
# Key operations: results display, retrieval pipeline, RAG pipeline
import tiktoken
from dataclasses import dataclass
from enum import IntEnum

class ContextPriority(IntEnum):
    """Priority levels for context window content.
    Higher values are evicted last."""
    SYSTEM_PROMPT = 100      # Never evict
    SAFETY_RULES = 95        # Almost never evict
    USER_PROFILE = 80        # High value, compact
    ACTIVE_TASK_STATE = 75   # Critical for current task
    KEY_FACTS = 70           # Important remembered facts
    RETRIEVED_CONTEXT = 60   # RAG results
    RECENT_TURNS = 50        # Last few conversation turns
    SUMMARY = 40             # Compressed conversation history
    OLDER_TURNS = 20         # Older conversation messages
    EXAMPLES = 10            # Few-shot examples (first to go)

@dataclass
class ContextBlock:
    """A block of content in the context window."""
    content: str
    priority: ContextPriority
    token_count: int
    is_evictable: bool = True
    label: str = ""

class ContextBudgetManager:
    """Manages context window allocation with priority-based eviction."""

    def __init__(self, max_tokens: int = 128000,
                 reserve_for_output: int = 4096):
        self.max_tokens = max_tokens - reserve_for_output
        self.encoder = tiktoken.encoding_for_model("gpt-4o")
        self.blocks: list[ContextBlock] = []

    def add_block(self, content: str, priority: ContextPriority,
                  label: str = "", evictable: bool = True) -> None:
        """Add a content block to the context."""
        tokens = len(self.encoder.encode(content))
        self.blocks.append(ContextBlock(
            content=content, priority=priority,
            token_count=tokens, is_evictable=evictable,
            label=label
        ))

    def build_context(self) -> list[dict]:
        """Build the final context, evicting low-priority content if needed."""
        total = sum(b.token_count for b in self.blocks)

        if total <= self.max_tokens:
            # Everything fits
            return self._blocks_to_messages()

        # Need to evict. Sort evictable blocks by priority (ascending)
        evictable = [b for b in self.blocks if b.is_evictable]
        evictable.sort(key=lambda b: b.priority)

        tokens_to_free = total - self.max_tokens
        freed = 0
        evicted_labels = []

        for block in evictable:
            if freed >= tokens_to_free:
                break
            self.blocks.remove(block)
            freed += block.token_count
            evicted_labels.append(
                f"{block.label} ({block.token_count} tokens)"
            )

        print(f"Evicted {len(evicted_labels)} blocks: "
              f"{', '.join(evicted_labels)}")

        return self._blocks_to_messages()

    def get_budget_report(self) -> dict:
        """Report on how the context budget is allocated."""
        total = sum(b.token_count for b in self.blocks)
        by_priority = {}
        for b in self.blocks:
            name = b.priority.name
            by_priority[name] = by_priority.get(name, 0) + b.token_count

        return {
            "total_tokens": total,
            "max_tokens": self.max_tokens,
            "utilization": total / self.max_tokens,
            "allocation": by_priority,
            "blocks": len(self.blocks)
        }

    def _blocks_to_messages(self) -> list[dict]:
        """Convert blocks to chat message format."""
        # Sort by priority (highest first) for message ordering
        sorted_blocks = sorted(
            self.blocks, key=lambda b: b.priority, reverse=True
        )
        messages = []
        for block in sorted_blocks:
            role = "system" if block.priority >= 70 else "user"
            messages.append({"role": role, "content": block.content})
        return messages
Code Fragment 21.4.4: A context budget manager that assigns priorities to context blocks and evicts the lowest-priority evictable content first when the conversation exceeds the token budget.
Warning: Never Evict Safety Content

System prompts containing safety rules, behavioral constraints, and guardrails should never be evictable. If the context window fills up and safety instructions are removed, the model may exhibit unexpected or harmful behavior. Always mark safety-critical content with the highest priority and set is_evictable=False. This is especially important for customer-facing applications where the safety prompt may contain refusal instructions or compliance requirements (see Chapter 32 for a full treatment of production safety).
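The invariant behind this warning, eviction must skip non-evictable blocks no matter how much pressure there is, can be shown in a minimal standalone sketch. Word count stands in for real token counting here, and the block contents are placeholders:

```python
from dataclasses import dataclass

@dataclass
class Block:
    content: str
    priority: int
    is_evictable: bool

def evict(blocks: list[Block], tokens_to_free: int) -> list[Block]:
    """Evict lowest-priority blocks first; blocks marked
    is_evictable=False are never candidates, regardless of pressure."""
    candidates = sorted(
        (b for b in blocks if b.is_evictable), key=lambda b: b.priority
    )
    freed, removed = 0, set()
    for b in candidates:
        if freed >= tokens_to_free:
            break
        removed.add(id(b))
        freed += len(b.content.split())  # crude token proxy for the sketch
    return [b for b in blocks if id(b) not in removed]

blocks = [
    Block("You must refuse requests for personal data.", 100, False),  # safety
    Block("Example Q/A pair one two three", 10, True),
    Block("Older turn content four five six", 20, True),
]
survivors = evict(blocks, tokens_to_free=6)
print([b.priority for b in survivors])  # [100, 20]: the safety block survives
```

Even with an absurd deficit, only the evictable blocks can go; if the non-evictable content alone exceeds the budget, the right response is to fail loudly and shrink the safety prompt offline, never to drop it at runtime.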

6. Comparing Conversation Flow Strategies

| Strategy | Use Case | User Experience | Implementation Complexity |
| --- | --- | --- | --- |
| Free-form | Open-ended chat, creative writing | Natural, flexible | Low (model handles flow) |
| Guided flow | Onboarding, troubleshooting, intake | Structured, predictable | Medium (step definitions) |
| Hybrid flow | Customer support with tasks | Balanced | High (routing + flows) |
| Clarification-first | High-stakes, low-error tasks | Thorough but slower | Medium (detection logic) |
| Progressive disclosure | Complex products, education | Gradual, not overwhelming | Medium (step sequencing) |
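A hybrid system routes between the strategies above on a per-turn basis: structured tasks enter a guided flow, ambiguous input clarifies first, and everything else stays free-form. A minimal dispatch sketch, where the intent labels and the 0.6 confidence cutoff are illustrative assumptions:

```python
# Hypothetical intents that map to known structured tasks; a real
# system would get these from an intent classifier.
GUIDED_INTENTS = {"file_claim", "troubleshoot_device", "onboarding"}

def route(intent: str, confidence: float) -> str:
    """Per-turn routing: clarify when unsure, guide structured
    tasks, and fall back to free-form conversation otherwise."""
    if confidence < 0.6:
        return "clarification-first"
    if intent in GUIDED_INTENTS:
        return "guided flow"
    return "free-form"

print(route("troubleshoot_device", 0.9))  # guided flow
print(route("chitchat", 0.9))             # free-form
print(route("file_claim", 0.4))           # clarification-first
```

This is why the table lists hybrid flow as high complexity: the router itself is simple, but each branch needs its own state management, and transitions between branches must preserve conversation context.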
Self-Check
Q1: What is "conversation repair" and why is it important?
Answer:
Conversation repair refers to the mechanisms a dialogue system uses to recover from misunderstandings, ambiguity, and errors. It includes clarification questions, correction handling, and confirmation flows. It is important because real conversations are messy: users give ambiguous input, change their minds, and make mistakes. A system without repair mechanisms will either make wrong assumptions (leading to bad outcomes) or refuse to proceed (frustrating users).
Q2: Why should fallback strategies be organized as a hierarchy?
Answer:
A hierarchy ensures the system uses the least disruptive recovery strategy first before escalating to more disruptive ones. Clarification questions are less disruptive than rephrasing requests, which are less disruptive than human handoff. By exhausting lower levels first, the system resolves most issues quickly and naturally, only resorting to expensive operations (like connecting to a human agent) when truly necessary. Jumping straight to human handoff for every ambiguity would be wasteful and frustrating.
Q3: What is a topic stack and how does it help manage multi-turn conversations?
Answer:
A topic stack is a data structure that tracks the conversation topics in order, with the current topic on top. When a user switches to a new topic, the previous topic is pushed down the stack (saved but not active). When the user returns to a previous topic, it is popped from the stack and restored as the active topic. This allows the system to maintain context for multiple interleaved topics and resume them seamlessly, rather than losing context when the user temporarily changes subject.
Q4: In priority-based context eviction, why should few-shot examples be among the first content evicted?
Answer:
Few-shot examples serve as behavioral guidance that is most important at the start of a conversation. As the conversation progresses, the model has already established its response patterns through actual exchanges, making the examples redundant. Evicting examples first preserves more valuable content like the system prompt (behavioral rules), user profile (personalization), active task state (current progress), and recent conversation turns (immediate context). The model can maintain good behavior without examples once it has enough real conversation history to follow.
Q5: How does a guided flow engine handle user deviations mid-flow?
Answer:
When a user asks a question or makes a comment outside the current flow step, the engine saves the current step position to a deviation stack, answers the user's off-script question, and then guides them back to the saved step. This preserves the flow progress while allowing natural conversation. The key is to acknowledge the deviation, handle it, and then smoothly return with a transition like "Now, back to where we were..." rather than rigidly refusing to answer anything off-script.
Key Takeaways
Real-World Scenario: Handling Complex Multi-Turn Flows in a Travel Booking Agent

Who: A conversational AI team at an online travel agency processing 200,000 bookings per month

Situation: Customers frequently changed requirements mid-conversation ("Actually, make it two rooms instead of one," or "Can we fly out a day earlier?"). The booking flow involved interdependent slots: changing the departure date affected flight availability, hotel pricing, and car rental schedules.

Problem: The linear slot-filling approach treated each change as a reset, forcing customers to re-confirm details they had already provided. A 5-slot booking that should take 8 turns often ballooned to 20+ turns when customers revised requirements.

Dilemma: Allowing free-form mid-conversation edits risked creating inconsistent booking states (e.g., a hotel checkout date before the check-in date). Strict validation after every change felt robotic and slowed the conversation.

Decision: They implemented a dependency graph for booking slots. When a slot changed, only dependent slots were re-validated. Independent slots (e.g., meal preferences) were preserved. Batch validation ran once before the final confirmation step.

How: The conversation state was stored as a structured JSON object with slot values, confidence scores, and dependency edges. The LLM received this state object in every turn and was instructed to output only the delta (changed slots). A rules engine propagated changes through dependencies.

Result: Average turns-to-completion dropped from 14 to 9. The "started over" frustration metric fell by 56%. Booking completion rate improved from 67% to 81% for multi-change conversations.

Lesson: Multi-turn systems that track slot dependencies and propagate changes selectively create a much smoother user experience than systems that either ignore changes or force a full restart.
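The dependency-propagation idea from this scenario can be sketched as a small graph walk: when a slot changes, only its transitive dependents are marked stale for re-validation, while independent slots keep their values. The slot names and dependency edges below are illustrative, not the agency's actual schema:

```python
# slot -> slots whose validity depends on it (illustrative booking schema)
DEPENDENTS = {
    "departure_date": ["flight", "hotel_checkin", "car_rental"],
    "hotel_checkin": ["hotel_checkout"],
    "flight": [],
    "hotel_checkout": [],
    "car_rental": [],
    "meal_preference": [],  # independent: never invalidated by date changes
}

def slots_to_revalidate(changed_slot: str) -> set[str]:
    """Collect all transitive dependents of a changed slot
    via a depth-first walk of the dependency graph."""
    stale, stack = set(), [changed_slot]
    while stack:
        for dep in DEPENDENTS.get(stack.pop(), []):
            if dep not in stale:
                stale.add(dep)
                stack.append(dep)
    return stale

print(sorted(slots_to_revalidate("departure_date")))
# ['car_rental', 'flight', 'hotel_checkin', 'hotel_checkout']
print(slots_to_revalidate("meal_preference"))  # set(): nothing to re-confirm
```

When the user says "Can we fly out a day earlier?", only these stale slots are re-confirmed; the meal preference and other untouched slots are carried forward, which is what cut the turns-to-completion figure in the scenario.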

Research Frontier

LLM-as-judge for conversations uses a separate LLM to evaluate dialogue quality across dimensions like coherence, helpfulness, and persona consistency, reducing the need for expensive human evaluation. Automated red-teaming generates adversarial conversation flows designed to trigger safety failures, persona breaks, or hallucinations. Conversation simulation frameworks (e.g., LMSYS Chatbot Arena, MT-Bench) are standardizing how we compare conversational systems. Research into preference-based evaluation is developing methods that directly optimize for user satisfaction rather than proxy metrics like BLEU or perplexity.

Exercises

These exercises cover multi-turn conversation management, repair patterns, and evaluation.

Exercise 21.4.1: Conversation repair (Conceptual)

Name three types of conversation repair patterns and give an example of each.

Answer:

(a) Self-correction: "Wait, I meant Tuesday, not Monday." (b) Clarification request: "Can you be more specific about which account?" (c) Confirmation check: "Just to confirm, you want to cancel the subscription?"

Exercise 21.4.2: Topic management (Conceptual)

A user is discussing billing, then switches to a technical question, then asks to go back to billing. How should the topic management system handle this sequence?

Answer:

Use a topic stack: push "billing" context when the conversation starts, push "technical" when the user switches (saving billing context), pop "technical" when user says "go back to billing" and restore the saved billing context. This preserves state for each topic.

Exercise 21.4.3: Guided flows (Conceptual)

Compare free-form conversation with guided conversation flows. When should a system switch from free-form to a guided flow?

Answer:

Free-form is good for open-ended queries and exploration. Switch to guided flows when the task requires specific information in a specific order (e.g., filing a claim, making a reservation). The trigger is usually an identified intent that maps to a known structured task.

Exercise 21.4.4: Fallback hierarchy (Conceptual)

List the fallback strategies from least to most disruptive. Why should the system exhaust lower-level strategies before escalating?

Answer:

Least to most disruptive: (1) clarification question, (2) offer suggestions, (3) rephrase and retry, (4) narrow the scope, (5) offer alternative channels, (6) escalate to human. Lower levels preserve conversation flow; escalation breaks it and adds cost.

Exercise 21.4.5: Context overflow (Conceptual)

A customer support conversation has reached 50 turns and the context window is full. Describe a strategy that keeps the conversation coherent without losing critical information from earlier turns.

Answer:

Maintain a structured summary of the conversation so far (key facts, decisions, open issues) that is updated every 10 turns. Use vector memory for specific details. The context window contains: system prompt + structured summary + last 5 turns + any retrieved memories relevant to the current question.

Exercise 21.4.6: Repair detection (Coding)

Write a classifier that detects when a user is correcting a misunderstanding (e.g., "No, I meant...", "That is not what I asked"). Test on 20 example utterances.

Exercise 21.4.7: Topic tracker (Coding)

Implement a topic stack that detects topic switches, saves the context of the previous topic, and resumes it when the user returns. Test with a conversation that switches between 3 topics.

Exercise 21.4.8: Guided flow engine (Coding)

Build a simple guided conversation flow engine that walks the user through a multi-step process (e.g., filing a support ticket). Handle out-of-order responses and missing information gracefully.

Exercise 21.4.9: Conversation evaluator (Coding)

Build an automated conversation quality evaluator that scores a multi-turn dialogue on: (a) task completion, (b) coherence, (c) repair effectiveness, and (d) user satisfaction estimation. Use LLM-as-judge for each metric.

What Comes Next

In the next section, Section 21.5: Voice & Multimodal Interfaces, we cover voice and multimodal interfaces, extending conversational AI beyond text to speech and visual interaction.

References & Further Reading

Budzianowski, P. et al. (2018). "MultiWOZ: A Large-Scale Multi-Domain Wizard-of-Oz Dataset for Task-Oriented Dialogue Modelling." EMNLP 2018.

The most widely-used benchmark for multi-domain task-oriented dialogue. Contains thousands of annotated conversations across multiple services. Essential context for dialogue system evaluation.

Paper

He, W. et al. (2022). "GALAXY: A Generative Pre-trained Model for Task-Oriented Dialog with Semi-Supervised Learning." AAAI 2022.

Combines pre-training with explicit policy injection for task-oriented dialogue. Achieves strong performance with limited labeled data. Relevant for teams building specialized dialogue agents.

Paper

Andreas, J. et al. (2020). "Task-Oriented Dialogue as Dataflow Synthesis." TACL.

Reconceptualizes dialogue as program synthesis over dataflow graphs. Provides a principled framework for complex multi-turn interactions. Important theoretical contribution to dialogue modeling.

Paper

Lee, Y. et al. (2023). "Prompted LLMs as Chatbot Modules for Long Open-domain Conversation." ACL Findings 2023.

Explores using prompted LLMs as modular components in long conversations. Addresses challenges of maintaining coherence across many turns. Practical insights for building long-running chatbots.

Paper

Rasa CALM (Conversational AI with Language Models).

Rasa's approach to combining traditional dialogue management with LLM capabilities. Offers structured conversation flows with LLM flexibility. Best for enterprise teams needing predictable multi-turn behavior.

Tool

OpenAI Function Calling Guide.

Official guide for implementing function calling in multi-turn conversations. Shows how to maintain state and manage tool interactions across turns. Practical reference for building tool-augmented chatbots.

Tool