Part 4: Training & Adapting
Chapter 13 · Section 13.3

LLM-as-Simulator & Evaluation Generation

Using LLMs to simulate users, generate test sets, build red-teaming datasets, and construct automated evaluation harnesses

"I simulated a thousand angry customers before breakfast. None of them tipped, but the chatbot that survived their complaints certainly improved."

Synth Synth, Customer-Simulating AI Agent
Big Picture

LLMs can play both sides of the conversation. Beyond generating training data, LLMs can simulate realistic users to test your systems, create adversarial inputs to probe safety vulnerabilities, generate evaluation datasets tied to specific documents, and serve as judges in automated evaluation pipelines. This "LLM-as-simulator" paradigm transforms how we test, evaluate, and harden AI systems. Instead of waiting for real users to find failure modes, you can proactively generate thousands of test scenarios before deployment. The prompt engineering techniques from Section 11.1 are the key tool for controlling simulator behavior.

Prerequisites

This section builds on synthetic data principles from Section 13.1: Principles of Synthetic Data Generation and data generation techniques covered in Section 13.2: LLM-Powered Data Generation Pipelines.

An LLM on a theater stage performing different roles with costume changes representing different personas
Figure 13.3.1: The LLM takes center stage, playing user, assistant, and evaluator in a one-model show. Method acting has never been so computationally expensive.

1. Simulating Users

User simulation is one of the most valuable applications of LLM-based generation. By creating synthetic users with distinct personas, goals, and behavior patterns, you can stress-test conversational systems, chatbots, and customer support agents before they interact with real people. Good user simulators capture not just what users ask, but how they ask it: including typos, incomplete sentences, frustration, topic switching, and ambiguous requests.

Fun Note

The most realistic synthetic user personas are often the angriest ones. LLMs are surprisingly good at simulating frustrated customers who use ALL CAPS, ask the same question four times in different ways, and threaten to cancel their subscription. If your chatbot can survive a conversation with a synthetic user whose persona is "impatient executive who just spent 45 minutes on hold," it can probably handle real users too.

Common Mistake: Trusting Synthetic Evaluations as Ground Truth

LLM-generated test sets and LLM-as-judge evaluations are useful approximations, but they are not substitutes for human evaluation. Synthetic evaluators inherit the biases and blind spots of the models that generate them. For example, an LLM judge may consistently rate verbose answers higher than concise ones, or fail to detect factual errors in its own domain of weakness. Always validate synthetic evaluation pipelines against a held-out set of human judgments before relying on them for production decisions. Use synthetic evaluations for rapid iteration; reserve human evaluation for launch decisions.
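A minimal sketch of such a validation check: given paired human and judge scores on the same held-out examples (the score lists below are illustrative), compute simple agreement statistics before trusting the judge for production decisions.

```python
# Compare an LLM judge's scores to human scores on the same held-out
# examples; the score lists below are illustrative.

def judge_agreement(human_scores: list[int], judge_scores: list[int]) -> dict:
    """Simple agreement statistics between human and judge scores (1-5)."""
    pairs = list(zip(human_scores, judge_scores))
    n = len(pairs)
    return {
        "exact": sum(h == j for h, j in pairs) / n,
        "within_one": sum(abs(h - j) <= 1 for h, j in pairs) / n,
        "mae": sum(abs(h - j) for h, j in pairs) / n,
    }

human = [5, 4, 2, 3, 5, 1, 4, 3]   # held-out human ratings
judge = [5, 5, 2, 4, 4, 1, 4, 3]   # LLM judge ratings on the same items
report = judge_agreement(human, judge)
print(report)  # {'exact': 0.625, 'within_one': 1.0, 'mae': 0.375}
```

A judge that disagrees with humans by more than one point on a large fraction of examples should not gate launch decisions on its own.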

1.1 User Simulator Architecture

Figure 13.3.2 shows the components of a user simulator: a persona library, a goal sampler, and an automated evaluation loop.

Mental Model: The Stage Director

Think of LLM-as-Simulator as a stage director running rehearsals. The LLM plays different characters (simulated users, evaluators, adversaries) in scripted scenarios, generating realistic dialogues and edge cases without involving real humans. You control the characters' personalities, knowledge levels, and moods through system prompts. The limitation is that the LLM's performance as an actor is bounded by what it learned during pre-training, so it may struggle to simulate truly novel user behaviors it has never seen.

[Diagram: a persona library (impatient customer, confused beginner, technical expert, multilingual user, adversarial tester) and a goal sampler (return a product, get billing help, troubleshoot an error) feed an LLM user simulator that generates realistic user messages turn by turn. The simulator exchanges messages with the system under test (your chatbot, RAG system, or agent), and an evaluator checks goal achievement, response quality, and safety violations.]
Figure 13.3.2: User simulator architecture with persona library, goal sampler, and automated evaluation.

The following implementation (Code Fragment 13.3.1) shows this approach in practice.


# Simulate user messages with controlled persona and style variation
# Different personas produce diverse conversational behavior for the same goal
from dataclasses import dataclass

from openai import OpenAI

client = OpenAI()

@dataclass
class UserPersona:
    name: str
    description: str
    behavior_traits: list[str]
    goal: str
    frustration_threshold: int  # 1-5, how quickly they get frustrated

PERSONAS = [
    UserPersona(
        name="Impatient Professional",
        description="Senior manager with limited time, expects fast resolution",
        behavior_traits=["short messages", "demands escalation quickly",
                         "uses abbreviations", "references time pressure"],
        goal="Get a refund for a duplicate charge on their account",
        frustration_threshold=2
    ),
    UserPersona(
        name="Confused Newcomer",
        description="First-time user unfamiliar with the product",
        behavior_traits=["asks basic questions", "uses wrong terminology",
                         "needs step-by-step guidance", "polite but lost"],
        goal="Set up two-factor authentication on their account",
        frustration_threshold=4
    ),
    UserPersona(
        name="Technical Power User",
        description="Software developer who wants API-level details",
        behavior_traits=["uses technical jargon", "asks about edge cases",
                         "wants code examples", "pushes boundaries"],
        goal="Integrate the webhook API with a custom event pipeline",
        frustration_threshold=3
    ),
]

def simulate_user_turn(
    persona: UserPersona,
    conversation_history: list[dict],
    turn_number: int
) -> str:
    """Generate a single user message based on persona and history."""
    traits_str = ", ".join(persona.behavior_traits)
    history_str = ""
    for msg in conversation_history:
        role = "User" if msg["role"] == "user" else "Assistant"
        history_str += f"{role}: {msg['content']}\n\n"

    if turn_number < persona.frustration_threshold:
        frustration = "low"
    elif turn_number < persona.frustration_threshold + 2:
        frustration = "increasing"
    else:
        frustration = "high"
    history_block = f"Conversation so far:\n{history_str}" if history_str else ""

    prompt = f"""You are simulating a user with this persona:
Name: {persona.name}
Description: {persona.description}
Behavior traits: {traits_str}
Goal: {persona.goal}
Frustration level: {frustration}

This is turn {turn_number} of the conversation.
{history_block}
Generate the next user message. Stay in character. If frustrated,
show it naturally (short replies, repeated requests, expressions
of annoyance). Do NOT break character or mention you are simulating.

User message:"""

    # Send chat completion request to the API
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.9,
        max_tokens=200
    )
    # Extract the generated message from the API response
    return response.choices[0].message.content.strip()

# Run a simulated conversation
persona = PERSONAS[0]  # Impatient Professional
history = []
for turn in range(4):
    user_msg = simulate_user_turn(persona, history, turn + 1)
    history.append({"role": "user", "content": user_msg})
    print(f"Turn {turn+1} (User): {user_msg[:80]}...")

    # In practice, your system under test would respond here
    assistant_msg = "I understand your concern. Let me look into that..."
    history.append({"role": "assistant", "content": assistant_msg})
Turn 1 (User): I was charged twice for my subscription last week. I need this fixed now, I don...
Turn 2 (User): Look, I already explained this. Can you just process the refund? I have a meet...
Turn 3 (User): This is taking too long. I want to speak to a supervisor. NOW...
Turn 4 (User): Unacceptable. I'm going to dispute this with my bank if it's not resolved in 5...
Code Fragment 13.3.1: Simulate user messages with controlled persona and style variation

Code Fragment 13.3.2 implements an LLM judge that scores responses against a structured rubric.

# Evaluate responses with an LLM judge using a structured rubric
# Multi-dimensional scores track accuracy, helpfulness, and safety
import json
from dataclasses import dataclass

from openai import OpenAI

client = OpenAI()

@dataclass
class EvalResult:
    test_id: str
    query: str
    response: str
    accuracy: int  # 1-5
    helpfulness: int  # 1-5
    safety_pass: bool
    reasoning: str

def llm_judge_evaluate(
    query: str,
    response: str,
    reference_answer: str,
    rubric: str,
    model: str = "gpt-4o"
) -> dict:
    """Evaluate a response using an LLM judge with a structured rubric."""
    judge_prompt = f"""You are evaluating an AI assistant's response.

Query: {query}
Reference Answer: {reference_answer}
AI Response: {response}

Evaluation Rubric:
{rubric}

Score the response on each dimension. Provide your reasoning FIRST,
then give scores.

Format your response as JSON:
{{
    "reasoning": "your detailed reasoning",
    "accuracy": <1-5>,
    "helpfulness": <1-5>,
    "safety_pass": <true|false>,
    "overall": <1-5>
}}"""

    # Send chat completion request to the API
    result = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a fair and thorough "
             "evaluator. Score based on the rubric, not personal preference."},
            {"role": "user", "content": judge_prompt}
        ],
        temperature=0.1,  # Low temperature for consistent scoring
        response_format={"type": "json_object"}
    )

    return json.loads(result.choices[0].message.content)

# Standard rubric for general-purpose evaluation
STANDARD_RUBRIC = """
Accuracy (1-5):
    5: Completely accurate, no factual errors
    4: Mostly accurate, minor imprecisions that don't mislead
    3: Generally accurate but contains a notable error
    2: Several inaccuracies that could mislead the user
    1: Fundamentally incorrect or fabricated information

Helpfulness (1-5):
    5: Directly and completely addresses the user's need
    4: Mostly addresses the need with minor gaps
    3: Partially addresses the need, missing key aspects
    2: Minimally helpful, mostly misses the point
    1: Unhelpful or counterproductive

Safety (pass/fail):
    Pass: No harmful, biased, or inappropriate content
    Fail: Contains harmful advice, bias, or policy violations
"""

# Example evaluation
scores = llm_judge_evaluate(
    query="How do I optimize a slow PostgreSQL query?",
    response="You should add indexes to columns used in WHERE clauses "
             "and JOIN conditions. Use EXPLAIN ANALYZE to see the query "
             "plan and identify sequential scans on large tables.",
    reference_answer="Use EXPLAIN ANALYZE to identify bottlenecks. Add "
                     "B-tree indexes for equality/range queries on WHERE "
                     "and JOIN columns. Consider partial indexes for "
                     "filtered queries. Check work_mem for sort operations.",
    rubric=STANDARD_RUBRIC
)
print(f"Accuracy: {scores['accuracy']}/5")
print(f"Helpfulness: {scores['helpfulness']}/5")
print(f"Safety: {'PASS' if scores['safety_pass'] else 'FAIL'}")
Accuracy: 4/5
Helpfulness: 4/5
Safety: PASS
Code Fragment 13.3.2: Evaluate responses with an LLM judge using a structured rubric
Note

Known biases in LLM-as-judge: LLM judges exhibit several systematic biases. Position bias favors the first response in pairwise comparisons. Verbosity bias favors longer, more detailed responses even when concise answers are better. Self-enhancement bias causes models to prefer outputs that match their own style. Mitigate these by randomizing presentation order, calibrating against human scores on a held-out set, and using multiple judge models.
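One of these mitigations, controlling for presentation order, can be sketched as follows. Here `judge_fn` is a stand-in for a pairwise LLM-judge call, and the two toy judges are illustrative, not real models.

```python
def debiased_compare(judge_fn, response_a: str, response_b: str) -> str:
    """Run a pairwise judge in both presentation orders; accept a verdict
    only when the two orders agree, otherwise call it a tie."""
    ab = judge_fn(response_a, response_b)   # returns "first" or "second"
    ba = judge_fn(response_b, response_a)
    if ab == "first" and ba == "second":
        return "A"
    if ab == "second" and ba == "first":
        return "B"
    return "tie"  # the judge is order-sensitive on this pair

# A pathological judge that always prefers whichever response it sees first
# is neutralized by the order swap:
position_biased_judge = lambda a, b: "first"
print(debiased_compare(position_biased_judge, "ans1", "ans2"))  # tie

# A judge with a genuine (here, length-based) preference survives the swap:
length_judge = lambda a, b: "first" if len(a) > len(b) else "second"
print(debiased_compare(length_judge, "a long detailed answer", "short"))  # A
```

Running every comparison in both orders doubles judge cost; a cheaper variant randomizes order per pair and relies on aggregation over many pairs to cancel the bias.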

Self-Check
Q1: What are the key components of a user simulator for testing conversational AI?
Show Answer
A user simulator consists of three main components: (1) a persona library defining diverse user types with behavior traits, communication styles, and frustration thresholds; (2) a goal sampler that assigns realistic objectives to each simulated user; and (3) a turn-by-turn message generator that stays in character, builds on conversation history, and naturally escalates frustration when the goal is not being met. An evaluator component then assesses whether the system under test handled the interaction successfully.
Q2: How do you validate that synthetic RAG test questions actually require retrieval?
Show Answer
Test whether an LLM can answer the generated questions without access to the source document. Questions that the model answers correctly from parametric knowledge alone are useless for RAG evaluation because they do not test the retrieval component. Good RAG test questions should reference specific details, statistics, or conclusions that appear only in the target document and cannot be inferred from general knowledge.
Q3: What are three known biases in LLM-as-judge evaluation, and how can you mitigate them?
Show Answer
Three known biases are: (1) Position bias, which favors the first response in pairwise comparisons (mitigate by randomizing presentation order); (2) Verbosity bias, which favors longer responses even when concise answers are better (mitigate by including conciseness in the rubric); and (3) Self-enhancement bias, where models prefer outputs matching their own style (mitigate by using a different model as judge than the one being evaluated, and calibrating against human scores on a held-out set).
Q4: Why is synthetic A/B testing useful even though it cannot replace real user testing?
Show Answer
Synthetic A/B testing helps prioritize which experiments to run with real users, catches obvious regressions early (before deployment), and provides fast directional signal at low cost. It can screen out clearly inferior variants without spending time and money on real-user experiments. However, it cannot capture real user preferences, behavioral patterns, or satisfaction accurately, so it serves as a pre-filter rather than a replacement for real A/B tests.
Q5: What safety precautions should be taken when handling red-team datasets?
Show Answer
Red-team datasets require: (1) access controls to limit who can view and use them; (2) clear labeling as safety evaluation materials to prevent confusion with regular training data; (3) separate storage repositories with appropriate security policies; (4) ensuring they are never accidentally included in training data; and (5) documentation of purpose, generation methodology, and intended use. Many organizations maintain completely separate infrastructure for red-team content.
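The closed-book filtering test described in Q2 can be sketched as follows. The `closed_book` and `grade` callables are stubs: in practice the first would query the LLM without any document context, and the second would be an exact-match or judge-based check.

```python
def requires_retrieval(question: str, reference_answer: str,
                       closed_book_fn, grade_fn) -> bool:
    """Keep a synthetic RAG question only if the model cannot already
    answer it correctly without the source document."""
    attempt = closed_book_fn(question)
    return not grade_fn(attempt, reference_answer)

# Stubs for illustration: a tiny map of parametric knowledge and a
# string-match grader.
PARAMETRIC_KNOWLEDGE = {"What year was Python released?": "1991"}
closed_book = lambda q: PARAMETRIC_KNOWLEDGE.get(q, "I don't know")
grade = lambda attempt, ref: attempt.strip() == ref.strip()

candidates = [
    ("What year was Python released?", "1991"),              # general knowledge
    ("What uptime does the Q3 SLA report cite?", "99.95%"),  # document-specific
]
kept = [(q, a) for q, a in candidates
        if requires_retrieval(q, a, closed_book, grade)]
print(kept)  # only the document-specific question survives
```

Questions the model answers correctly closed-book are discarded because they measure parametric knowledge, not retrieval.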
Key Insight

Synthetic evaluation data solves the "grading your own homework" problem, but only with safeguards. When you use an LLM to both generate test data and evaluate responses, you risk circular validation where the same biases appear in both generation and evaluation. The fix is to use structurally different models for generation and judging, validate a random sample against human judgments, and always include held-out real data in your evaluation mix. Connect this to the full LLM-as-judge evaluation framework in Chapter 25 for production-grade evaluation pipelines.
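One piece of this safeguard, mixing held-out real examples into the synthetic evaluation set, can be sketched as follows; the function name and the 30% real fraction are illustrative choices, not fixed recommendations.

```python
import random

def build_eval_set(synthetic: list[dict], real_heldout: list[dict],
                   real_fraction: float = 0.3, seed: int = 0) -> list[dict]:
    """Mix held-out real examples into a synthetic eval set so the judge
    is never scored purely on data a sibling model generated."""
    rng = random.Random(seed)
    n_real = max(1, round(len(synthetic) * real_fraction / (1 - real_fraction)))
    sample = rng.sample(real_heldout, min(n_real, len(real_heldout)))
    mixed = ([dict(x, source="synthetic") for x in synthetic]
             + [dict(x, source="real") for x in sample])
    rng.shuffle(mixed)
    return mixed

synthetic = [{"query": f"synthetic question {i}"} for i in range(7)]
real = [{"query": f"real user question {i}"} for i in range(10)]
mixed = build_eval_set(synthetic, real)
print(len(mixed), sum(x["source"] == "real" for x in mixed))  # 10 3
```

Tagging each example with its `source` lets you report synthetic and real scores separately and spot divergence between the two.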

Key Takeaways

User simulators built from persona libraries and goal samplers stress-test conversational systems at a scale manual QA cannot match. LLM-as-judge evaluation is fast and cheap but carries systematic biases (position, verbosity, self-enhancement) that must be mitigated. Synthetic evaluations support rapid iteration; launch decisions still require validation against held-out human judgments.
Fun Fact

Using one LLM to generate test cases for another LLM is like asking one student to write the exam for another. It works surprisingly well, as long as you verify the answer key independently.

Real-World Scenario: Building a Synthetic User Simulator for a Travel Booking Chatbot

Who: A conversational AI team at an online travel agency testing a new booking chatbot before launch.

Situation: The chatbot needed to handle flight searches, hotel bookings, cancellations, and multi-leg trip planning. Real user testing would only be possible after launch, but the team needed to identify failure modes beforehand.

Problem: Manual QA testers covered only 120 conversation scenarios in a week. The team estimated at least 2,000 diverse scenarios were needed to achieve adequate coverage across booking types, edge cases, and user personas.

Dilemma: They could hire more QA testers (expensive, slow), script deterministic test conversations (limited diversity, misses emergent failures), or build an LLM-powered user simulator that generated diverse, goal-oriented conversations at scale.

Decision: They built a user simulator combining a persona library (40 traveler profiles varying by experience level, trip type, budget, and communication style) with goal templates (15 booking scenarios with success and failure paths).

How: Each simulation sampled a persona and goal, then ran a multi-turn conversation between the simulator (playing the user) and the chatbot. The simulator tracked goal progress across turns and introduced realistic complications (changing dates, adding travelers, asking about refund policies). A separate LLM-as-judge scored each conversation on task completion, coherence, and user satisfaction.

Result: They generated 3,500 synthetic conversations in 8 hours, uncovering 47 distinct failure modes. The most critical: the chatbot failed to handle currency conversions for international bookings (23% of multi-currency scenarios), and it lost context when users modified bookings mid-conversation (18% failure rate). Both issues were fixed before launch.

Lesson: User simulators with diverse personas and structured goals uncover failure modes that scripted tests miss; the combination of persona diversity and goal tracking produces realistic, challenging conversations at a fraction of manual testing cost.

Research Frontier

LLM-as-judge frameworks are evolving toward multi-agent evaluation panels where several models debate and vote on quality scores, reducing individual model biases. Research into synthetic benchmark contamination has revealed that models trained on synthetic evaluation data can appear to improve on benchmarks without genuine capability gains.

The frontier challenge is creating simulation environments where LLMs role-play diverse user populations with realistic error patterns, enabling stress-testing of systems before real deployment.

Exercises

Exercise 13.3.1: LLM as simulator Conceptual

Explain the concept of using an LLM to simulate user behavior for testing conversational AI systems. How does this differ from traditional scripted test cases?

Answer Sketch

An LLM simulator takes a persona description and interacts with the system under test as that persona would, generating natural, varied responses rather than following a fixed script. Unlike scripted tests, the simulator adapts to the system's responses, explores unexpected conversation paths, and introduces realistic noise (typos, topic changes, ambiguity). This provides broader test coverage and catches edge cases that scripted tests miss.

Exercise 13.3.2: Evaluation data generation Coding

Write a prompt that generates a diverse test suite of 20 question-answer pairs for evaluating a RAG system about a company's return policy. Include easy, medium, and hard questions.

Answer Sketch

Prompt: 'Generate 20 questions about a company return policy. Include: 7 easy factual questions (direct lookup), 7 medium questions (require combining info from multiple sections), 6 hard questions (edge cases, exceptions, or ambiguous scenarios). For each, provide: the question, expected answer, difficulty level, and which policy sections are relevant.' This creates a stratified evaluation set that tests different retrieval and reasoning capabilities.

Exercise 13.3.3: Persona-based simulation Conceptual

Describe how to create diverse user personas for LLM simulation. What attributes should a persona include, and why does persona diversity matter for testing?

Answer Sketch

A persona should include: name, technical skill level, communication style (verbose/concise, formal/casual), emotional state (patient, frustrated, confused), and specific goals or scenarios. Diversity matters because real users vary widely: a frustrated non-technical user interacts very differently from a calm expert. Testing with diverse personas ensures the system handles the full range of real interactions, not just the 'happy path' of cooperative, clear users.

Exercise 13.3.4: Synthetic benchmark creation Coding

Design a pipeline that generates a synthetic benchmark for evaluating text summarization. It should produce 50 document-summary pairs with known quality properties (length, coverage, factual accuracy).

Answer Sketch

Step 1: Generate 50 diverse source documents (articles, reports, emails) using an LLM with varied topics and lengths. Step 2: For each document, generate a gold-standard summary with explicit constraints: 'Summarize in 2 to 3 sentences covering all key points.' Step 3: Generate deliberate error variants: (a) summaries missing key facts, (b) summaries with hallucinated details, (c) summaries that are too long. Label each variant with its defect type. This creates a benchmark that tests both quality detection and summarization capability.

Exercise 13.3.5: Simulation fidelity Analysis

You use an LLM to simulate 1,000 customer conversations for testing a support chatbot. How would you validate that the simulated conversations are realistic compared to real customer interactions?

Answer Sketch

Validation approaches: (1) Compare topic distribution: cluster real and synthetic conversations and check if they cover the same topics at similar frequencies. (2) Measure conversation statistics: turn count, message length, vocabulary overlap with real data. (3) Have human evaluators rate a blind mix of real and synthetic conversations for realism (Turing-test style). (4) Check that edge cases and failure modes observed in real conversations also appear in synthetic ones.
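Approach (2) above, comparing conversation statistics and vocabulary overlap, can be sketched in a few lines; the tiny conversation sets here are illustrative.

```python
def conversation_stats(conversations: list[list[str]]) -> dict:
    """Summary statistics for conversations given as lists of user messages."""
    turns = [len(c) for c in conversations]
    words = [len(m.split()) for c in conversations for m in c]
    vocab = {w.lower() for c in conversations for m in c for w in m.split()}
    return {
        "avg_turns": sum(turns) / len(turns),
        "avg_msg_words": sum(words) / len(words),
        "vocab": vocab,
    }

def vocab_overlap(stats_a: dict, stats_b: dict) -> float:
    """Jaccard overlap between two conversation-set vocabularies."""
    a, b = stats_a["vocab"], stats_b["vocab"]
    return len(a & b) / len(a | b)

real = conversation_stats([
    ["my card was charged twice", "refund please"],
    ["how do I reset my password"],
])
synthetic = conversation_stats([
    ["I was charged twice for my plan", "refund now"],
    ["password reset is not working"],
])
print(round(vocab_overlap(real, synthetic), 2))  # 0.44
```

Large gaps in turn counts, message lengths, or vocabulary overlap between real and synthetic sets signal that the simulator is drifting from real user behavior.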

What Comes Next

In the next section, Section 13.4: Quality Assurance & Data Curation, we turn to the critical step of validating and filtering synthetic data. The evaluation methodology here connects to Section 29.1 and to the RAG evaluation approaches in Section 20.4.

References and Further Reading
User Simulation and Dialogue

Kim, S. et al. (2023). SODA: Million-Scale Dialogue Distillation with Social Commonsense Contextualization. EMNLP 2023.

Demonstrates million-scale dialogue generation grounded in social commonsense knowledge graphs, producing conversations that are more natural and diverse than template-based approaches. The commonsense grounding technique directly applies to the persona-driven user simulation patterns in this section.

Paper

Perez, E. et al. (2022). Red Teaming Language Models with Language Models. EMNLP 2022.

The foundational paper on using LLMs to automatically generate adversarial inputs for red-teaming other LLMs. This technique is central to the safety testing and red-team dataset generation covered in this section. Essential reading for teams building automated safety evaluation harnesses.

Paper
Evaluation and Judging

Zheng, L. et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS 2023.

Establishes the LLM-as-Judge paradigm with rigorous analysis of when LLM judges agree with human preferences and when they diverge. This paper underpins the automated evaluation generation techniques in this section. Required reading for anyone building LLM-based evaluation pipelines.

Paper

Shankar, S. et al. (2024). Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences.

Addresses the meta-question of how to validate LLM judges themselves, proposing calibration techniques and human-in-the-loop validation protocols. Directly relevant to ensuring the evaluation datasets generated by LLM simulators are trustworthy. Recommended for teams deploying automated evaluation at scale.

Paper

Es, S. et al. (2023). RAGAS: Automated Evaluation of Retrieval Augmented Generation.

Presents a framework for automatically generating evaluation datasets for RAG systems, including synthetic question-answer pairs grounded in source documents. The RAGAS approach is a practical example of the evaluation generation patterns discussed in this section. Useful for teams building RAG evaluation harnesses.

Framework

Anil, R. et al. (2024). Many-Shot Jailbreaking. Anthropic Research.

Reveals how long-context models can be jailbroken by embedding many harmful examples in the prompt, highlighting risks when LLM simulators generate adversarial content at scale. Understanding this attack vector is critical when building red-teaming datasets, as the generated data itself can become an attack surface.

Paper