"The best judge of a conversation is someone who actually has one."
Arena-Style Evaluation
Static benchmarks saturate, leak into training data, and fail to capture what real users care about. Arena-style evaluation solves these problems by collecting live, open-ended pairwise comparisons from real users and converting them into robust model rankings via statistical models like Elo and Bradley-Terry. Chatbot Arena (LMSYS) pioneered this approach and has become the most trusted public leaderboard for LLM quality. This section explains how arena evaluation works, the mathematics behind pairwise ranking, how to build your own evaluation arena, and the tradeoffs between crowdsourced and expert evaluation. These techniques complement the static benchmarks and LLM-as-Judge methods covered in Section 29.1 and the experimental design principles from Section 29.2.
Prerequisites
Before starting, make sure you are familiar with the material in Section 29.1: LLM Evaluation Fundamentals (including its LLM-as-Judge methods) and the experimental design principles from Section 29.2. Understanding the limitations of static benchmarks discussed in Section 07.1 provides useful context for why arena evaluation is needed.
1. Why Static Benchmarks Fail
Model A scores 92% on MMLU. Model B scores 89%. You deploy Model A, and users overwhelmingly prefer Model B. How is that possible? Because MMLU measures factual recall through multiple-choice questions, while your users care about helpfulness, conversational tone, and the ability to handle ambiguous requests. The benchmark measured the wrong thing. Worse, Model A's score may have been inflated by benchmark contamination: its training data (the pretraining process from Chapter 06) likely included MMLU questions, so it was partially memorizing answers rather than demonstrating genuine capability.
Arena-style evaluation solves all three problems that plague static benchmarks: contamination (real users submit novel prompts), saturation (open-ended tasks have no performance ceiling), and construct validity (the judges are the actual users you care about). By the end of this section, you will understand how Chatbot Arena (LMSYS) works, the Elo and Bradley-Terry mathematics behind pairwise ranking, and how to build your own evaluation arena for internal model selection. We start with the structural failures of static benchmarks, then build toward the arena alternative.
Mental Model: The Blind Taste Test. Arena evaluation is the blind taste test of AI. Just as Pepsi and Coke look identical in unmarked cups, two LLM responses appear side by side with no brand labels. Users judge purely on quality, not reputation. This eliminates the "halo effect" where people prefer the response they think came from a famous model. The Elo rating system then converts thousands of these pairwise comparisons into a global ranking, exactly as chess uses individual match results to rank players worldwide. The analogy holds well, though arena evaluations have a known limitation: users tend to prefer longer, more verbose responses even when a concise answer would be more useful.
Static benchmarks measure what a model knows; arena evaluations measure what a model does for people. Both are valuable, but for production model selection, arena rankings tend to correlate more strongly with user satisfaction than any single benchmark score.
2. Chatbot Arena and the LMSYS Methodology
Chatbot Arena, developed by the LMSYS Org at UC Berkeley, is the most influential arena-style evaluation platform for LLMs. The methodology is elegantly simple. A user visits the platform and types a prompt. Two anonymous models generate responses side by side. The user reads both responses and votes for the better one (or declares a tie). The user never knows which models are being compared until after voting. This blind evaluation eliminates brand bias entirely.
The platform has collected millions of pairwise votes across hundreds of models since its launch in 2023. Each vote becomes a data point in a statistical model that produces a global ranking. The key design decisions that make the arena trustworthy include:
- Anonymity: Model identities are hidden during evaluation, preventing users from voting based on reputation rather than output quality.
- Diversity of prompts: Users bring their own questions, covering everything from creative writing to technical debugging, producing a distribution of tasks that reflects real usage.
- No cherry-picking: Models cannot selectively show their best outputs; every response to every query is eligible for comparison.
- Continuous data collection: New votes arrive constantly, allowing rankings to update as models improve and new models enter the arena.
- Category breakdowns: The platform segments results by task type (coding, math, creative writing, instruction following), revealing that models often have different strengths.
The arena approach highlights a deeper truth about LLM evaluation: the most meaningful evaluation data comes from the people who actually use these systems, not from curated benchmark datasets. Benchmark authors try to anticipate what matters, but real users bring the full diversity of needs, phrasings, and edge cases that no benchmark can capture. This is also why combining arena results with your own domain-specific evaluation (see the testing strategies from Section 29.4) gives a more complete picture than either approach alone.
3. Elo Ratings and Bradley-Terry Models
The mathematical backbone of arena evaluation is the Bradley-Terry model, which estimates the probability that one model will be preferred over another based on latent "strength" parameters. The closely related Elo rating system (originally designed for chess) provides an intuitive scoring framework that maps directly onto the Bradley-Terry model.
The Bradley-Terry Model
Given two models i and j with strength parameters γi and γj, the Bradley-Terry model defines the probability that model i beats model j as:

P(i beats j) = γi / (γi + γj)

By taking the log of the strength parameters (defining λi = log γi), this becomes a logistic model:

P(i beats j) = 1 / (1 + exp(λj − λi))
The parameters λ are estimated via maximum likelihood on the observed pairwise comparison data. The resulting scores can be scaled to an Elo-like rating where a difference of 400 points corresponds to a 10:1 win ratio. Code Fragment 29.8.2 below puts this into practice.
```python
import numpy as np
from scipy.optimize import minimize

def fit_bradley_terry(matchups: list[tuple], n_models: int) -> np.ndarray:
    """Fit a Bradley-Terry model to pairwise comparison data.

    Args:
        matchups: list of (winner_id, loser_id) tuples
        n_models: total number of models in the arena

    Returns:
        Array of Elo-scaled ratings for each model
    """
    def neg_log_likelihood(params):
        # Negative log-likelihood of the Bradley-Terry model
        nll = 0.0
        for winner, loser in matchups:
            # Log probability that winner beats loser:
            # log P = diff - log(1 + exp(diff))
            diff = params[winner] - params[loser]
            nll -= diff - np.logaddexp(0.0, diff)  # numerically stable form
        # L2 regularization to prevent unbounded parameters
        nll += 0.01 * np.sum(params ** 2)
        return nll

    # Initialize all models at equal strength
    init_params = np.zeros(n_models)
    result = minimize(neg_log_likelihood, init_params, method="L-BFGS-B")
    # Convert to Elo scale: 400 points = 10x win probability
    elo_ratings = result.x * (400 / np.log(10)) + 1500
    return elo_ratings

# Example: 4 models with simulated pairwise outcomes
# Model 0 is strongest, model 3 is weakest
matchups = [
    (0, 1), (0, 1), (0, 2), (0, 3), (0, 3),  # Model 0 wins
    (1, 2), (1, 2), (1, 3), (1, 3),          # Model 1 wins
    (2, 3), (2, 3), (2, 3),                  # Model 2 wins
    (1, 0), (2, 1), (3, 2),                  # Upsets (noise)
]
ratings = fit_bradley_terry(matchups, n_models=4)
model_names = ["GPT-4o", "Claude-3.5", "Llama-3-70B", "Mistral-7B"]
for name, rating in sorted(zip(model_names, ratings), key=lambda x: -x[1]):
    print(f"  {name:<15} Elo: {rating:.0f}")
```
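The Elo scaling used above can be sanity-checked directly. A minimal sketch (the `win_prob` helper is illustrative, not part of the fragment above): a 400-point rating gap should correspond to 10:1 odds, i.e. a win probability of 10/11.

```python
def win_prob(elo_a: float, elo_b: float) -> float:
    """Probability that A beats B under the Elo/Bradley-Terry logistic model."""
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400))

# A 400-point gap corresponds to 10:1 odds, i.e. a win probability of 10/11
p = win_prob(1900, 1500)
print(f"P(win | +400 Elo) = {p:.4f}")  # 0.9091
# Equal ratings give a coin flip
print(f"P(win | +0 Elo)   = {win_prob(1500, 1500):.2f}")  # 0.50
```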
Elo Rating Updates
While the full Bradley-Terry fit (Code Fragment 29.8.2) produces the most accurate ratings, many arena systems use online Elo updates for computational efficiency. After each match, the winner's rating increases and the loser's rating decreases by an amount proportional to how surprising the outcome was. If a highly rated model loses to a much lower-rated one, the rating change is large; if the favorite wins as expected, the change is small.
```python
def elo_update(
    rating_a: float,
    rating_b: float,
    outcome: float,   # 1.0 = A wins, 0.0 = B wins, 0.5 = tie
    k: float = 32.0,  # K-factor controls update magnitude
) -> tuple[float, float]:
    """Compute updated Elo ratings after a single match.

    Returns updated ratings for both players.
    """
    # Expected score for player A (logistic curve)
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
    expected_b = 1.0 - expected_a
    # Update ratings in proportion to the surprise factor
    new_a = rating_a + k * (outcome - expected_a)
    new_b = rating_b + k * ((1 - outcome) - expected_b)
    return round(new_a, 1), round(new_b, 1)

# Scenario 1: Expected outcome (strong model wins)
a1, b1 = elo_update(1600, 1400, outcome=1.0)
print(f"Expected win: 1600 vs 1400 -> {a1} vs {b1}")
# Scenario 2: Upset (weak model wins)
a2, b2 = elo_update(1600, 1400, outcome=0.0)
print(f"Upset: 1600 vs 1400 -> {a2} vs {b2}")
# Scenario 3: Tie between equal models
a3, b3 = elo_update(1500, 1500, outcome=0.5)
print(f"Tie (equal): 1500 vs 1500 -> {a3} vs {b3}")
```
Online Elo updates and maximum-likelihood Bradley-Terry estimation are mathematically related but not identical. Elo updates are order-dependent (processing the same matches in a different sequence produces slightly different ratings), while the full Bradley-Terry fit is order-independent. For leaderboards with thousands of votes, LMSYS uses the full Bradley-Terry fit with bootstrap confidence intervals, reserving online Elo for real-time display.
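The order dependence is easy to demonstrate. A minimal sketch (the update rule is inlined here, without rounding, so the fragment runs standalone): the same multiset of match results, processed in two different orders, yields different final ratings.

```python
def elo_update(rating_a, rating_b, outcome, k=32.0):
    # Same logistic update rule as above, without rounding
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
    new_a = rating_a + k * (outcome - expected_a)
    new_b = rating_b + k * ((1 - outcome) - (1 - expected_a))
    return new_a, new_b

def run_sequence(outcomes):
    """Play the same A-vs-B matches in the given order; return A's final rating."""
    ra, rb = 1500.0, 1500.0
    for o in outcomes:
        ra, rb = elo_update(ra, rb, o)
    return ra

# Same results -- two A wins and one A loss -- in two different orders
mid_loss = run_sequence([1.0, 0.0, 1.0])
late_loss = run_sequence([1.0, 1.0, 0.0])
print(f"Loss in the middle: A = {mid_loss:.1f}")
print(f"Loss at the end:    A = {late_loss:.1f}")
# The two final ratings differ; a full Bradley-Terry fit is order-independent
```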
When running pairwise model comparisons (arena-style), randomize which model appears on the left versus right side of the screen. Human evaluators exhibit a measurable position bias, preferring the first response they read by 3 to 5 percentage points. Randomizing position and tracking position-corrected win rates eliminates this artifact from your results.
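One way to verify that randomization is working is to audit the vote log itself. A small sketch, assuming each vote records whether the left or right response won (a hypothetical schema): compute the left-side win rate and compare it against the 50% expected from position-blind voters.

```python
def position_win_rate(votes: list[str]) -> float:
    """Fraction of non-tie votes that went to the left-hand response.

    votes: list of "left", "right", or "tie" outcomes (hypothetical schema).
    """
    decisive = [v for v in votes if v != "tie"]
    if not decisive:
        return 0.5
    return sum(v == "left" for v in decisive) / len(decisive)

# Simulated vote log: 54 left wins, 46 right wins, 10 ties
votes = ["left"] * 54 + ["right"] * 46 + ["tie"] * 10
rate = position_win_rate(votes)
print(f"Left-side win rate: {rate:.0%}")  # 54%
# A rate persistently above 50% across many randomized matches suggests
# residual position bias; report position-corrected win rates.
```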
4. Building Your Own Evaluation Arena
While Chatbot Arena provides a public leaderboard, many organizations need a private arena for comparing models on domain-specific tasks. Building one requires four components: a comparison interface, a matchmaking system, a vote collection pipeline, and a ranking engine. The following code shows a minimal but functional arena backend.
```python
import random
import hashlib
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ArenaMatch:
    """A single pairwise comparison in the arena."""
    match_id: str
    prompt: str
    model_a: str  # Internal model name (hidden from voter)
    model_b: str
    response_a: str
    response_b: str
    winner: str = ""  # "A", "B", or "tie"
    timestamp: str = ""
    voter_id: str = ""

class EvaluationArena:
    """Lightweight arena for pairwise model comparison.

    Handles matchmaking, vote collection, and ranking.
    """

    def __init__(self, models: dict[str, callable]):
        # models: mapping from model_name to callable(prompt) -> response
        self.models = models
        self.matches: list[ArenaMatch] = []
        self.ratings = {name: 1500.0 for name in models}

    def create_match(self, prompt: str) -> ArenaMatch:
        """Select two random models and generate responses."""
        model_a, model_b = random.sample(list(self.models.keys()), 2)
        # Randomize display order to prevent position bias
        if random.random() > 0.5:
            model_a, model_b = model_b, model_a
        return ArenaMatch(
            match_id=hashlib.md5(
                f"{prompt}{datetime.now()}".encode()
            ).hexdigest()[:12],
            prompt=prompt,
            model_a=model_a,
            model_b=model_b,
            response_a=self.models[model_a](prompt),
            response_b=self.models[model_b](prompt),
        )

    def record_vote(self, match: ArenaMatch, winner: str, voter_id: str):
        """Record a human vote and update Elo ratings."""
        match.winner = winner
        match.voter_id = voter_id
        match.timestamp = datetime.now().isoformat()
        self.matches.append(match)
        # Convert vote to outcome for the Elo update
        if winner == "A":
            outcome = 1.0
        elif winner == "B":
            outcome = 0.0
        else:
            outcome = 0.5  # tie
        # Update ratings for both models
        ra, rb = elo_update(
            self.ratings[match.model_a],
            self.ratings[match.model_b],
            outcome,
        )
        self.ratings[match.model_a] = ra
        self.ratings[match.model_b] = rb

    def leaderboard(self) -> list[tuple[str, float, int]]:
        """Return models ranked by Elo rating with match counts."""
        counts = {name: 0 for name in self.models}
        for m in self.matches:
            counts[m.model_a] += 1
            counts[m.model_b] += 1
        board = [(name, self.ratings[name], counts[name])
                 for name in self.models]
        return sorted(board, key=lambda x: -x[1])
```
Who: A legal technology company evaluating three LLMs for contract review.
Situation: Standard benchmarks showed all three models scoring within 2% of each other on MMLU legal subcategories. The team needed a more discriminating evaluation method.
Problem: Legal professionals cared about nuanced qualities (citing relevant clauses, flagging ambiguous language, maintaining appropriate hedging) that no benchmark measured.
Decision: They built an internal arena using the pattern from Code Fragment 29.8.4, populated with 200 real contract excerpts, and had 12 lawyers evaluate pairs over two weeks.
How: Each lawyer completed 30 comparisons per day. The arena randomized model pairs and display order. After 1,400 total votes, the Bradley-Terry fit produced clear separation: the top model had an Elo of 1587 while the other two scored 1492 and 1421.
Result: The winning model was not the one with the highest MMLU score. Its advantage came from better hedging language and more precise clause references, qualities invisible to static benchmarks.
Lesson: Domain-specific arenas with expert evaluators reveal quality differences that general benchmarks cannot detect, especially for specialized professional tasks.
5. Crowdsourced vs. Expert Evaluation
Arena-style evaluation can use either crowd workers (general users, Amazon Mechanical Turk workers) or domain experts (lawyers, doctors, engineers). Each approach has distinct tradeoffs that affect the reliability and applicability of the resulting rankings.
| Dimension | Crowdsourced | Expert |
|---|---|---|
| Cost per vote | Low ($0.10 to $0.50) | High ($5 to $50+) |
| Throughput | Thousands of votes per day | Tens to hundreds per day |
| Task coverage | Broad, general knowledge | Deep, domain-specific |
| Noise level | Higher (inconsistent quality) | Lower (calibrated judgment) |
| Gaming risk | Higher (spam, random clicks) | Lower (reputation at stake) |
| Factual accuracy | Cannot verify specialized claims | Can catch subtle errors |
| Best for | General chat, creative tasks | Medical, legal, technical tasks |
The LMSYS Chatbot Arena uses open crowdsourcing, which works well for general-purpose evaluation because most prompts involve common knowledge, creative writing, or general reasoning. For domain-specific applications, however, crowd evaluators may prefer the more fluent or confident response even when it contains factual errors that only an expert would catch. A medical chatbot arena evaluated by non-medical crowd workers, for example, could rank a confidently wrong model above a cautiously correct one.
The optimal evaluation strategy often combines both approaches: use crowdsourced evaluation for high-volume, general-purpose ranking, then validate the top candidates with expert evaluation on domain-specific criteria. This two-stage approach gets the breadth of crowd evaluation and the depth of expert judgment without paying expert rates for every comparison.
6. Contamination and Gaming Concerns
While arena-style evaluation is more robust than static benchmarks, it is not immune to manipulation. Several attack vectors deserve attention when designing or interpreting arena results.
Style Over Substance
Users tend to prefer responses that are longer, more detailed, and more confidently stated. This creates an incentive for model providers to optimize for stylistic appeal rather than factual correctness. Research has shown that formatting alone (bullet points, headers, bold text) significantly influences human preference even when the underlying content is identical. The LMSYS team has documented this phenomenon and publishes "style-controlled" rankings that attempt to separate substance from presentation.
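This failure mode can be simulated end to end. A condensed sketch (the stand-in "models" are canned string functions, the voter is a caricature who always prefers the longer response, and the Elo helper is inlined so the fragment runs standalone; all names are illustrative):

```python
import random

# Canned stand-in "models": name -> response function (illustrative only)
models = {
    "terse":   lambda p: f"Answer: {p[:10]}",
    "medium":  lambda p: f"The answer to '{p}' is short.",
    "verbose": lambda p: f"Let me explain in detail. The answer to '{p}' is as follows...",
}
ratings = {name: 1500.0 for name in models}

def elo_update(ra, rb, outcome, k=32.0):
    # Standard logistic Elo update
    ea = 1.0 / (1.0 + 10 ** ((rb - ra) / 400))
    return ra + k * (outcome - ea), rb + k * ((1 - outcome) - (1 - ea))

random.seed(0)
for _ in range(200):
    a, b = random.sample(list(models), 2)
    resp_a, resp_b = models[a]("test prompt"), models[b]("test prompt")
    # Simulated voter with a pure verbosity bias: always picks the longer reply
    outcome = 1.0 if len(resp_a) > len(resp_b) else 0.0
    ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], outcome)

for name, r in sorted(ratings.items(), key=lambda x: -x[1]):
    print(f"{name:<8} {r:.0f}")
# "verbose" ends up on top purely on length -- the style-over-substance failure mode
```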
Sybil Attacks and Vote Manipulation
In an open arena, a motivated actor could create multiple accounts and systematically vote for a particular model. Defenses include rate limiting, CAPTCHAs, vote consistency analysis (flagging voters whose preferences are statistically implausible), and requiring users to submit genuine prompts before voting.
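Vote consistency analysis can be sketched simply. Assuming a vote log of `(voter_id, winning_model)` pairs (a hypothetical schema), flag voters whose picks concentrate almost entirely on a single model across many votes:

```python
from collections import Counter, defaultdict

def flag_suspicious_voters(votes, min_votes=20, threshold=0.95):
    """Flag voters who picked the same model in >= threshold of their votes.

    votes: iterable of (voter_id, winning_model) pairs (hypothetical schema).
    """
    by_voter = defaultdict(list)
    for voter, model in votes:
        by_voter[voter].append(model)
    flagged = []
    for voter, picks in by_voter.items():
        if len(picks) < min_votes:
            continue  # too few votes to judge
        _, top_count = Counter(picks).most_common(1)[0]
        if top_count / len(picks) >= threshold:
            flagged.append(voter)
    return flagged

# A ballot-stuffer votes for "model-x" 30 times; an honest voter mixes picks
votes = [("sybil-1", "model-x")] * 30 + \
        [("honest-1", m) for m in ["model-x", "model-y", "model-z"] * 10]
print(flag_suspicious_voters(votes))  # ['sybil-1']
```

In practice this would be one signal among several (rate limiting, CAPTCHA results, prompt quality), not a standalone verdict.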
Prompt Steering
If a model provider knows which prompts will be tested, they can specifically optimize for those cases. Open arenas mitigate this by drawing prompts from users rather than a fixed set, but closed evaluations with small prompt pools remain vulnerable. The solution is to maintain a large, continuously growing prompt corpus and never publish the full set.
Model providers have been accused of optimizing specifically for arena-style evaluation by detecting when their model is in a pairwise comparison (through prompt patterns or API metadata) and switching to a higher-quality but more expensive inference mode. Always verify that models serve the same quality in production as they do in evaluation settings.
7. Using Arena Results for Model Selection
Arena rankings provide a powerful signal for model selection, but interpreting them correctly requires understanding confidence intervals, category breakdowns, and the limitations of aggregate scores. A model ranked third overall might be the best choice for your specific use case if it leads in the relevant category.
The following code demonstrates how to compute bootstrap confidence intervals on arena ratings and use them to make statistically grounded model selection decisions.
```python
import numpy as np
from collections import defaultdict

def bootstrap_arena_ratings(
    matches: list[tuple[str, str, str]],  # (model_a, model_b, winner)
    n_bootstrap: int = 1000,
) -> dict[str, dict]:
    """Compute arena ratings with bootstrap confidence intervals.

    Returns a dict mapping model name to rating statistics.
    """
    all_models = set()
    for a, b, _ in matches:
        all_models.add(a)
        all_models.add(b)
    model_list = sorted(all_models)
    bootstrap_ratings = defaultdict(list)
    for _ in range(n_bootstrap):
        # Resample matches with replacement
        sample = [matches[i] for i in
                  np.random.randint(0, len(matches), len(matches))]
        # Compute Elo ratings for this bootstrap sample
        ratings = {m: 1500.0 for m in model_list}
        for a, b, winner in sample:
            outcome = 1.0 if winner == a else (0.0 if winner == b else 0.5)
            ratings[a], ratings[b] = elo_update(
                ratings[a], ratings[b], outcome, k=4)
        for m in model_list:
            bootstrap_ratings[m].append(ratings[m])
    # Compute summary statistics
    results = {}
    for m in model_list:
        vals = bootstrap_ratings[m]
        results[m] = {
            "median": round(np.median(vals), 1),
            "ci_lower": round(np.percentile(vals, 2.5), 1),
            "ci_upper": round(np.percentile(vals, 97.5), 1),
        }
    return results

# Print results with confidence intervals:
# results = bootstrap_arena_ratings(matches)
# for model, stats in sorted(results.items(), key=lambda x: -x[1]["median"]):
#     print(f"  {model:<15} {stats['median']:.0f} "
#           f"[{stats['ci_lower']:.0f}, {stats['ci_upper']:.0f}]")
```
When selecting a model based on arena results, always check the confidence intervals. If two models' 95% intervals overlap, their performance difference may not be meaningful. In that case, make your decision based on secondary criteria: cost, latency, licensing terms, or performance on your specific task category rather than the overall ranking.
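The overlap check can be automated against the statistics dict produced by `bootstrap_arena_ratings`. A small helper (the example rating values below are made up for illustration):

```python
def intervals_overlap(stats_a: dict, stats_b: dict) -> bool:
    """True if the two models' 95% confidence intervals overlap."""
    return (stats_a["ci_lower"] <= stats_b["ci_upper"]
            and stats_b["ci_lower"] <= stats_a["ci_upper"])

# Example statistics in the shape returned by bootstrap_arena_ratings
model_1 = {"median": 1580.0, "ci_lower": 1555.0, "ci_upper": 1605.0}
model_2 = {"median": 1560.0, "ci_lower": 1538.0, "ci_upper": 1582.0}
model_3 = {"median": 1450.0, "ci_lower": 1430.0, "ci_upper": 1470.0}

print(intervals_overlap(model_1, model_2))  # True: statistically indistinguishable
print(intervals_overlap(model_1, model_3))  # False: a real gap
```

When the check returns True for the top contenders, fall back to the secondary criteria above (cost, latency, category performance) rather than the point estimates.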
8. Open-Source Arena Frameworks
Several open-source projects make it possible to deploy your own evaluation arena without building everything from scratch. These range from full-featured platforms to lightweight libraries.
- FastChat (LMSYS): The open-source codebase behind Chatbot Arena. Includes the Gradio-based comparison interface, model serving infrastructure, and Elo computation scripts. Best suited for organizations that want to replicate the full LMSYS experience.
- AlpacaEval: An automated arena that uses an LLM judge (GPT-4) instead of human evaluators. Produces a "win rate" against a reference model. Useful for fast iteration when human evaluation is too slow, though it inherits the biases of the judge model.
- Open LLM Leaderboard (Hugging Face): While primarily a static benchmark aggregator, it provides the infrastructure for standardized model evaluation and comparison. The community has built arena-style extensions on top of it.
- MT-Bench: A curated set of 80 multi-turn questions with LLM-as-Judge evaluation, designed to complement arena evaluation with reproducible comparisons on a fixed prompt set.
- LM Evaluation Harness (EleutherAI): A framework for running static benchmarks in a standardized way. While not an arena, it integrates well with arena-style workflows as a complementary evaluation layer.
The choice of framework depends on your evaluation needs. For teams that already have a web application, adding a simple A/B comparison page (using the patterns from Code Fragment 29.8.4) may be faster than deploying a full arena platform. For organizations evaluating many models at scale, FastChat provides the most complete solution.
The Chatbot Arena leaderboard has become so influential that some researchers call Elo ratings "the new MMLU." Model release announcements now routinely cite their Arena ranking alongside (and sometimes instead of) traditional benchmark scores, reflecting a broader shift toward preference-based evaluation in the LLM community.
1. Why does arena-style evaluation resist data contamination better than static benchmarks?
2. In the Bradley-Terry model, what does it mean when two models have a rating difference of 400 Elo points?
3. When should you prefer expert evaluation over crowdsourced evaluation in an arena?
4. Why are bootstrap confidence intervals important when interpreting arena rankings?
5. What is the "style over substance" problem in arena evaluation, and how does LMSYS address it?
- Arena evaluation captures what benchmarks miss. By using real users with real prompts and blind pairwise comparison, arenas measure the qualities that actually matter for production deployment: helpfulness, nuance, and user preference.
- Bradley-Terry and Elo provide principled ranking. These statistical models convert noisy pairwise votes into stable, interpretable ratings with well-defined confidence intervals.
- Always check confidence intervals. A model ranked #3 with overlapping confidence intervals to #1 may be statistically indistinguishable. Use secondary criteria (cost, latency, domain performance) to break ties.
- Match your evaluators to your domain. Crowdsourced evaluation works for general tasks; expert evaluation is essential for specialized domains where factual correctness requires domain knowledge to verify.
- Arena results are not immune to gaming. Style bias, vote manipulation, and selective optimization are real threats. Design your arena with randomization, anonymization, and statistical controls to mitigate them.
- You can build your own arena. The core components (matchmaking, blind display, vote collection, Elo computation) are straightforward to implement, and open-source frameworks like FastChat provide production-ready scaffolding.
Open Questions:
- Can arena-style evaluations scale beyond conversational tasks to code generation, reasoning, and multi-modal outputs? Extending pairwise comparison to complex outputs is logistically challenging.
- How can arena evaluations resist manipulation (e.g., providers optimizing specifically for arena-style prompts)?
Recent Developments (2024-2025):
- LMSYS Chatbot Arena expanded to 2 million+ votes by early 2025, and arena-style evaluation methods were adopted for specialized domains including code (BigCodeBench), vision, and multi-turn conversations.
- Automated red-teaming tools (2024-2025) like HarmBench and StrongReject systematized adversarial evaluation, enabling scalable safety testing that complements crowd-sourced evaluation.
Explore Further: Set up a small private arena (using open-source tools like FastChat) among 3-4 models for a specific use case. Collect 50+ pairwise comparisons and compute Elo ratings. Compare your rankings against public leaderboards.
Exercises
Explain three structural problems with static benchmarks (contamination, saturation, construct validity) and how arena-style evaluation addresses each one.
Answer Sketch
Contamination: benchmark questions leak into training data, inflating scores. Arena: users submit novel prompts that cannot be pre-trained on. Saturation: models approach 100% on benchmarks, making differentiation impossible. Arena: open-ended tasks have no performance ceiling. Construct validity: benchmarks may not measure what users care about. Arena: real users judge on their actual use cases, directly measuring user satisfaction.
Model A has an Elo rating of 1200 and Model B has 1100. Calculate the expected win probability for each model. If Model B wins, calculate the new ratings using K=32.
Answer Sketch
Expected score for A: E_A = 1 / (1 + 10^((1100-1200)/400)) = 1 / (1 + 10^(-0.25)) = approximately 0.64. E_B = 1 - 0.64 = 0.36. After B wins: A's new rating = 1200 + 32*(0 - 0.64) = 1200 - 20.5 = 1179.5. B's new rating = 1100 + 32*(1 - 0.36) = 1100 + 20.5 = 1120.5. The upset causes a larger rating change because B was the underdog.
Outline the architecture of an internal evaluation arena for comparing 4 LLM models. Include the randomization logic, the user interface flow, the vote storage schema, and the Elo update mechanism.
Answer Sketch
Architecture: (1) User submits a prompt. (2) Backend randomly selects 2 of 4 models, randomly assigns left/right positions. (3) Both models generate responses in parallel. (4) UI shows responses side-by-side without model names. (5) User votes A/B/Tie. (6) Vote stored: {prompt_id, model_a, model_b, position_a (left/right), winner, timestamp, user_id}. (7) Elo update runs after each vote using the formula from Exercise 29.8.2. (8) Dashboard shows current ratings with confidence intervals.
Compare crowdsourced evaluation (like Chatbot Arena) with expert evaluation for a medical Q&A system. What are the strengths and weaknesses of each approach? Which would you recommend and why?
Answer Sketch
Crowdsourced: high volume, diverse prompts, low cost per judgment, but evaluators lack medical expertise and may prefer confident-sounding but incorrect answers. Expert: medically accurate judgments, can assess clinical safety, but expensive, slow, and limited prompt diversity. For medical Q&A, expert evaluation is essential because factual correctness requires domain knowledge. Use crowdsourced for usability/helpfulness and expert evaluation for accuracy/safety. Combine both in a two-stage process.
Arena evaluations show a known verbosity bias where users prefer longer responses. Design an experiment to measure the magnitude of this bias in your arena and propose a correction method.
Answer Sketch
Experiment: take a set of 100 prompts where you have both a concise correct answer and a verbose correct answer. Present both versions in arena format and measure win rates. The difference from 50/50 quantifies the verbosity bias. Correction methods: (1) include response length as a covariate in the Bradley-Terry model, (2) instruct evaluators to penalize unnecessary verbosity, (3) show a "conciseness" sub-rating alongside the overall preference, (4) stratify results by response length ratio and report bias-adjusted rankings.
What Comes Next
In the next chapter, Chapter 31: Production Engineering, we shift from evaluating models to deploying them in production. You will learn how to build application architectures, deploy services, implement guardrails, and operate LLM systems at scale.
Bibliography
Chiang, W.L., Zheng, L., Sheng, Y., et al. (2024). "Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference." arXiv:2403.04132
Zheng, L., Chiang, W.L., Sheng, Y., et al. (2023). "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." arXiv:2306.05685
Bradley, R.A. & Terry, M.E. (1952). "Rank Analysis of Incomplete Block Designs: I. The Method of Paired Comparisons." Biometrika, 39(3/4), 324-345.
Elo, A.E. (1978). The Rating of Chessplayers, Past and Present. Arco Publishing.
Li, X., Zhang, T., Dubois, Y., et al. (2023). "AlpacaEval: An Automatic Evaluator of Instruction-Following Models." GitHub
Zheng, L., et al. (2023). "FastChat: An Open Platform for Training, Serving, and Evaluating Large Language Model based Chatbots." GitHub
Oren, Y., Meister, N., Chatterji, N., et al. (2024). "Proving Test Set Contamination in Black Box Language Models." arXiv:2310.17623
