Section 46.2: Judge Reliability and Common Biases

"Argmax over '1' through '5' throws away the uncertainty the judge was kind enough to give you. Read the log-probabilities; the judge meant 3.5, not 4."
Eval, Probability-Mass-Conserving AI Agent

Big Picture: Chain-of-Thought Scoring

Asking a judge model "rate this summary from 1 to 5" and parsing the integer it emits throws away most of the signal. Two summaries that are clearly different in quality can both round to "4", and a judge that is genuinely uncertain between 3 and 4 collapses that uncertainty into a single hard choice. G-Eval fixes both problems with two ideas. First, the judge is forced to reason step-by-step before emitting a score, which both improves alignment with human ratings and gives a chain of thought for diagnosing disagreements. Second, instead of taking the argmax token, G-Eval reads the log-probability distribution over the digit tokens "1"-"5" and computes a probability-weighted continuous score. The result is a finer-grained scalar that captures judge uncertainty, correlates better with human ratings, and is more stable across reruns. The rest of this section walks through the algorithm and the practical considerations for non-OpenAI models.

G-Eval, introduced by Liu et al. (2023), applies chain-of-thought (CoT) reasoning to NLG evaluation. Rather than asking a judge model to produce a single score, G-Eval prompts the model to first generate a detailed evaluation chain of thought, then produce a numeric score. The key innovation is using the token probabilities of the score tokens to compute a weighted score, which produces finer-grained and more stable evaluations than simply taking the argmax score. This probability-weighted approach reduces the discretization noise inherent in integer scoring scales.

Comparison of argmax integer scoring versus probability-weighted G-Eval continuous scoring on a 5-point judge distribution — **Figure 46.2.1:** Argmax scoring vs G-Eval. Argmax picks the most-probable digit and reports 3, throwing away the judge's near-equal preference for 4. G-Eval takes a probability-weighted expectation over all five digits and reports 3.18, preserving the gradient that distinguishes "definitely a 3" from "borderline between 3 and 4."

G-Eval operates on four dimensions commonly used in NLG evaluation: coherence, consistency, fluency, and relevance. For each dimension, a task-specific prompt instructs the judge model to evaluate the output step by step. The prompt includes the evaluation criteria, a detailed description of what each score level means, and the source document and generated summary (or question and response for QA tasks). Code Fragment 46.2.2 implements the G-Eval scoring pipeline.
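As a rough illustration of how the prompt ingredients listed above fit together, the sketch below assembles one judge prompt per dimension. The helper name `build_geval_prompt` mirrors the pseudocode in Algorithm 46.2.1, but its signature, the section labels, and the wording of the closing instruction are assumptions made for this example, not the exact prompts from Liu et al.

```python
def build_geval_prompt(criterion: str, definition: str, steps: list[str],
                       source: str, output: str,
                       low: int = 1, high: int = 5) -> str:
    """Assemble a judge prompt: criterion, rubric, steps, then the texts to grade.

    Hypothetical layout for illustration; adapt the wording to your task.
    """
    numbered_steps = "\n".join(
        f"{i}. {step}" for i, step in enumerate(steps, start=1)
    )
    parts = [
        "You will be given a source document and one summary written for it.",
        f"Your task is to rate the summary on {criterion} ({low}-{high}).",
        "",
        "Evaluation Criteria:",
        f"{criterion.capitalize()} ({low}-{high}): {definition}",
        "",
        "Evaluation Steps:",
        numbered_steps,
        "",
        f"Source document:\n{source}",
        "",
        f"Summary:\n{output}",
        "",
        # Ask for reasoning first and the score last, so the score token
        # is conditioned on the chain of thought.
        f"Think step by step, then end with a single score from {low} to {high}.",
    ]
    return "\n".join(parts)


prompt = build_geval_prompt(
    criterion="consistency",
    definition="Every factual claim in the summary is supported by the source.",
    steps=["Read the source document.", "Check each claim in the summary against it."],
    source="The council approved the budget on Tuesday.",
    output="The budget was approved.",
)
print(prompt)
```

Keeping the criterion, rubric, and steps as separate arguments makes it easy to reuse one builder for all four dimensions while changing only the definition and steps.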

Algorithm 46.2.1: G-Eval Probability-Weighted Scoring

Algorithm: G-Eval expectation score
Input:  judge model M with token logprobs, evaluation prompt (rubric + CoT instruction),
        source document s, candidate output y, score range {1, .., K}
Output: continuous score s_hat in [1, K]

  // 1. Construct the judge prompt with chain-of-thought instruction
  prompt := build_geval_prompt(rubric, s, y)
            // "First think step by step, then output a score from 1 to K."

  // 2. Run the judge; capture logprobs of the score-token slot
  cot, score_logprobs := M.generate_with_logprobs(prompt)

  // 3. Project onto the K candidate digits, exponentiate, renormalize
  For k = 1..K:
      p_k := exp( score_logprobs[ token(str(k)) ] )
  Z := sum_{k = 1..K} p_k
  For k = 1..K:
      p_k := p_k / Z                                    // digit-only renormalization

  // 4. Expectation under the (renormalized) score distribution
  s_hat := sum_{k = 1..K} k * p_k                       // E_{k ~ p}[k] in [1, K]
  Return s_hat

Fallback when logprobs are not exposed (Anthropic Claude, some Gemini configs):
  s_hat := (1 / N) * sum_{n = 1..N} parse_int( M.generate(prompt; temperature = tau) )
  with N ~ 10..50 and tau ~ 0.5..1.0. By the law of large numbers this Monte Carlo
  average converges to the expected score under the distribution being sampled. At
  tau = 1 that is the judge's own score distribution; tau < 1 sharpens it toward the
  argmax digit, so the estimate drifts away from E_{k ~ p}[k] as tau decreases.

Code Fragment 46.2.1a: Pseudocode for the G-Eval expectation score. The judge emits a chain-of-thought trace followed by a digit slot whose logprobs are projected onto the K candidate scores, renormalized over digit-only mass, and combined into an expectation. The fallback path Monte-Carlos the same expectation by sampling the judge multiple times when logprobs are unavailable.
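Steps 3 and 4 of the algorithm can be sketched in a few lines of Python. The function below is a minimal illustration that assumes the score-slot logprobs have already been collected into a plain `{token: logprob}` dictionary; the name `expected_score` and that input shape are choices made for this example. It also sums surface variants such as `"4"` and `" 4"`, which some tokenizers emit as separate tokens.

```python
import math
from typing import Optional


def expected_score(top_logprobs: dict[str, float], k_max: int = 5) -> Optional[float]:
    """Digit-renormalized expectation from {token: logprob} at the score slot.

    Returns None when no candidate digit appears among the tokens.
    """
    valid = {str(k): k for k in range(1, k_max + 1)}
    mass = {k: 0.0 for k in valid.values()}

    # Step 3a: project onto the K candidate digits and exponentiate.
    for token, logprob in top_logprobs.items():
        text = token.strip()
        if text in valid:
            mass[valid[text]] += math.exp(logprob)

    # Step 3b: renormalize over digit-only mass.
    total = sum(mass.values())
    if total == 0.0:
        return None

    # Step 4: expectation under the renormalized distribution.
    return sum(k * p / total for k, p in mass.items())


# Hypothetical score-slot logprobs: 45% on "3", 45% on "4" (split across
# two surface forms), and 10% on a non-digit token that gets dropped.
slot = {
    "3": math.log(0.45),
    "4": math.log(0.25),
    " 4": math.log(0.20),
    "The": math.log(0.10),
}
print(expected_score(slot))
```

Returning `None` rather than a default score keeps a missing digit visible to the caller, who can then decide whether to retry or fall back to parsing the text.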

For an integer scoring scale $S = \{1, 2, \ldots, K\}$ (typical $K = 5$), let $p_k$ be the model's predicted probability that the next token is the digit $k$, conditioned on the chain-of-thought prefix. The naive argmax estimator is $\hat{s}_{\text{argmax}} = \arg\max_{k \in S} p_k$, which is integer-valued and discards the full distribution. G-Eval instead computes the probability-weighted expectation:

$$\hat{s}_{\text{G-Eval}} \;=\; \sum_{k=1}^{K} k \cdot \frac{p_k}{\sum_{j=1}^{K} p_j}.$$

The denominator renormalizes over only the digit tokens (ignoring probability mass on non-numeric tokens). The result $\hat{s}_{\text{G-Eval}} \in [1, K]$ is continuous: a judge that splits its mass evenly between "3" and "4" returns $3.5$, whereas argmax would return whichever digit had a marginally higher probability. Continuous outputs preserve rank ordering across small quality differences that argmax collapses to ties, which is the source of G-Eval's correlation improvement with human ratings. Source: Liu et al., "G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment," EMNLP 2023 (arXiv:2303.16634).
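The claim about ties can be checked with a small sketch. The three distributions below are made-up illustrations (not outputs of any real judge): each puts the most mass on "4", so argmax cannot separate them, while the renormalized expectation orders them.

```python
def argmax_score(probs: dict[int, float]) -> int:
    """Most probable digit: the naive integer score."""
    return max(probs, key=probs.get)


def weighted_score(probs: dict[int, float]) -> float:
    """Expectation over digits, renormalized over the digit mass only."""
    total = sum(probs.values())
    return sum(k * p for k, p in probs.items()) / total


# Hypothetical score-token distributions; the mass missing from each
# dictionary is assumed to sit on non-digit tokens.
candidates = {
    "leans_low": {3: 0.40, 4: 0.45, 5: 0.05},
    "solid_four": {3: 0.10, 4: 0.75, 5: 0.10},
    "leans_high": {3: 0.05, 4: 0.45, 5: 0.40},
}

for name, probs in candidates.items():
    print(name, argmax_score(probs), round(weighted_score(probs), 2))
```

All three candidates receive the same argmax score, so a ranking built on it is arbitrary among them; the weighted score separates the one that leans toward 3 from the one that leans toward 5.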

Numeric Example: Argmax vs Probability-Weighted G-Eval

Suppose a judge produces the following score-token distribution for two competing summaries on a 1-5 coherence scale:

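Summary A: $p_3 = 0.42$, $p_4 = 0.40$, $p_5 = 0.10$, with $p_1 = p_2 = 0$ and the remaining $0.08$ on non-digit tokens.
Summary B: $p_3 = 0.10$, $p_4 = 0.55$, $p_5 = 0.25$, with $p_1 = p_2 = 0$ and the remaining $0.10$ on non-digit tokens.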

Argmax assigns A $\to$ 3 and B $\to$ 4. Renormalizing the digit mass (A: $1 - 0.08 = 0.92$; B: $1 - 0.10 = 0.90$) and computing the weighted score:

$\hat{s}_A = (3 \cdot 0.42 + 4 \cdot 0.40 + 5 \cdot 0.10) / 0.92 = 3.65$
$\hat{s}_B = (3 \cdot 0.10 + 4 \cdot 0.55 + 5 \cdot 0.25) / 0.90 = 4.17$

Argmax reports a full-point gap (3 versus 4). G-Eval reports $3.65$ and $4.17$: B still wins, but the gap shrinks to about $0.5$ because A's mass is nearly tied between 3 and 4. Liu et al. (2023) report that this kind of finer-grained signal improves Spearman correlation with human ratings on SummEval by roughly $0.05$-$0.10$ points (e.g., $0.42 \to 0.51$ on coherence), which is the difference between a borderline-useful and a clearly-useful judge.

# G-Eval: chain-of-thought scoring with probability weighting
from openai import OpenAI
import numpy as np

client = OpenAI()

GEVAL_COHERENCE_PROMPT = """You will be given a summary of a news article.
Your task is to rate the summary on coherence (1-5).
Evaluation Criteria:
Coherence (1-5): The collective quality of all sentences. The summary
should be well-structured and well-organized. It should not just be
a heap of related information, but should build from sentence to
sentence to a coherent body of information about a topic.
Evaluation Steps:
1. Read the summary carefully.
2. Check if the sentences logically follow each other.
3. Check if there is a clear topic progression.
4. Check if the summary has a clear beginning, middle, and end.
5. Assign a score from 1 to 5.
Summary: {summary}
After providing your evaluation steps, output your score (1-5):"""

def geval_score(
    summary: str,
    prompt_template: str = GEVAL_COHERENCE_PROMPT,
    model: str = "gpt-4o",
) -> dict:
    """Compute G-Eval score with probability weighting."""
    prompt = prompt_template.format(summary=summary)
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,
        temperature=0.0,
        logprobs=True,
        top_logprobs=5,
    )
    content = response.choices[0].message.content or ""
    logprobs = response.choices[0].logprobs
    # Find the score token. It is not always the final token (the judge may
    # end with "." or a newline), so scan backwards for the last emitted digit.
    valid = {"1", "2", "3", "4", "5"}
    score_probs = {}
    if logprobs and logprobs.content:
        for entry in reversed(logprobs.content):
            if entry.token.strip() not in valid:
                continue
            # Sum mass over surface variants such as "4" and " 4"
            for lp in entry.top_logprobs:
                tok = lp.token.strip()
                if tok in valid:
                    prob = float(np.exp(lp.logprob))
                    score_probs[int(tok)] = score_probs.get(int(tok), 0.0) + prob
            break
    # Compute probability-weighted score, renormalized over digit tokens only
    if score_probs:
        total_prob = sum(score_probs.values())
        weighted_score = sum(
            score * (prob / total_prob)
            for score, prob in score_probs.items()
        )
    else:
        # Fallback: last digit 1-5 in the text, or NaN if none was emitted
        digits = [ch for ch in content if ch in valid]
        weighted_score = float(digits[-1]) if digits else float("nan")
    return {
        "chain_of_thought": content,
        "score_distribution": score_probs,
        "weighted_score": round(weighted_score, 3),
    }

Code Fragment 46.2.2: G-Eval chain-of-thought scoring with probability weighting. The pipeline prompts the judge to reason step by step, then uses token-level log probabilities to compute a weighted score that is finer-grained and more stable than argmax scoring.
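One detail that is easy to check offline is where the score token sits in the response. A judge often ends with punctuation or a newline after the digit, so the final token's alternatives may contain no digits at all. The sketch below uses hand-built stand-ins shaped like the `token` and `top_logprobs` fields read in the fragment above; the data and the helper names `score_slot_distribution` and `tok` are hypothetical. A backward scan like this can still be fooled by outputs such as "4/5", where the last digit token is the scale maximum, so constraining the output format in the prompt remains advisable.

```python
import math
from types import SimpleNamespace as NS

DIGITS = {"1", "2", "3", "4", "5"}


def score_slot_distribution(content) -> dict[int, float]:
    """Scan backwards for the last emitted digit token and return
    {score: probability} built from that position's alternatives."""
    for entry in reversed(content):
        if entry.token.strip() not in DIGITS:
            continue
        dist: dict[int, float] = {}
        for alt in entry.top_logprobs:
            text = alt.token.strip()
            if text in DIGITS:
                dist[int(text)] = dist.get(int(text), 0.0) + math.exp(alt.logprob)
        return dist
    return {}


def tok(token, alternatives):
    """Build a fake logprob entry from (token, probability) pairs."""
    return NS(
        token=token,
        top_logprobs=[NS(token=t, logprob=math.log(p)) for t, p in alternatives],
    )


# Hypothetical tail of a judge response: "Score: 4."
content = [
    tok("Score", [("Score", 0.90)]),
    tok(":", [(":", 0.99)]),
    tok(" 4", [(" 4", 0.60), (" 3", 0.30), (" 5", 0.05)]),
    tok(".", [(".", 0.80), ("\n", 0.10)]),
]

last_only = [a.token for a in content[-1].top_logprobs if a.token.strip() in DIGITS]
print("digits at final token:", last_only)
print("digits at score slot:", score_slot_distribution(content))
```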

46.2.1 G-Eval: Chain-of-Thought Scoring

Fun Fact

LLM judges have a documented preference for verbose answers, answers that appear first in the prompt, and answers that share their own writing style. Teams who filter their SFT data with a single judge model often discover, six months later, that they have trained the student to write like the teacher, including the teacher's grammatical tics.

Prerequisites

This section assumes familiarity with why LLM-as-judge matters from Section 46.1 and with experimental design and inter-rater agreement from Section 42.2. Familiarity with decoding and logprob outputs from Section 4.2 helps when reading the G-Eval algorithm.

Library Shortcut: deepeval for G-Eval as a one-liner metric

The 60-line implementation above is the teaching version. In production, deepeval (Confident AI, 2023+) wraps the G-Eval recipe behind a pytest-style metric class that accepts OpenAI, Anthropic, and open-weights judges served behind LiteLLM. Probability weighting only applies when the judge exposes logprobs; Claude does not, so check the deepeval documentation for your version to see whether a given judge yields a weighted score or a plain parsed one, and use the temperature-averaging strategy described later in this section if you need the expectation.

# Install first: pip install deepeval
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

coherence = GEval(
    name="Coherence",
    evaluation_steps=[
        "Check if sentences logically follow each other",
        "Check for clear topic progression",
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
)
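summary = "The council approved the budget. Bridge repairs begin in May."  # placeholder candidate summary
case = LLMTestCase(input="Summarize the article.", actual_output=summary)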
coherence.measure(case)
print(coherence.score, coherence.reason)

Code Fragment 46.2.3a: deepeval replaces the manual logprob extraction in Code Fragment 46.2.2 with a single measure() call; pair it with deepeval test run tests/ to run G-Eval inside CI alongside unit tests.

Tip

G-Eval requires logprobs access. The probability-weighted scoring that makes G-Eval effective depends on access to token-level log probabilities. As of 2025, the OpenAI API provides logprobs for GPT-4o and GPT-4o-mini. For models that do not expose logprobs (such as Claude), you can approximate G-Eval by running multiple evaluations with temperature > 0 and averaging the scores, though this is slower and less precise. Alternatively, use the open-source judge models discussed in the Prometheus section below, which provide full logprob access via local inference.

Figure 46.2.2 dramatizes why the same judging recipe behaves so differently across providers: only some APIs hand you the log-probability distribution that G-Eval reads, while others leave the judge squinting at a single argmax token.

Two cartoon judges side by side. The GPT-4 judge wears fancy logprobs glasses that overlay the numbers 0.42, 0.31, and 0.18 on a scoreboard, while the Claude judge squints at the same answer without glasses, holding a magnifying glass. — **Figure 46.2.2**: Logprobs access decides which judge can run G-Eval directly. Providers that expose token-level log-probabilities let the judge read a continuous score off the digit distribution; providers that do not force the temperature-averaging fallback to recover the same expectation.

Library Shortcut

G-Eval on Claude (Temperature-Averaging Fallback)

Anthropic's Claude API does not expose token-level logprobs, so the canonical G-Eval recipe is unavailable. The fallback is Monte Carlo score averaging: query the judge $N$ times at temperature $\tau > 0$ and average the parsed integer scores. The empirical mean approximates the same expectation $\mathbb{E}_{k \sim p}[k]$ that G-Eval reads off the logprobs analytically.

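from anthropic import Anthropic

client = Anthropic()

# GEVAL_COHERENCE_PROMPT is the template defined in Code Fragment 46.2.2.
def geval_claude(summary: str, n_samples: int = 20) -> float:
    """Monte Carlo estimate of the G-Eval score for a judge without logprobs."""
    prompt = GEVAL_COHERENCE_PROMPT.format(summary=summary)
    scores = []
    for _ in range(n_samples):
        msg = client.messages.create(
            model="claude-opus-4-5",
            max_tokens=512,  # room for the reasoning plus the final score
            temperature=1.0,  # non-zero so we get a distribution
            messages=[{"role": "user", "content": prompt}],
        )
        # Parse the last digit 1-5 from the response text;
        # replies with no such digit are skipped
        for ch in reversed(msg.content[0].text):
            if ch in "12345":
                scores.append(int(ch))
                break
    if not scores:
        raise ValueError("no parseable score in any sample")
    return sum(scores) / len(scores)  # continuous score in [1, 5]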

Twenty samples at $\tau = 1.0$ typically recover ${\sim}80\%$ of the precision gain G-Eval gets from analytic logprobs at $\sim 20\times$ the cost. The standard error of the average shrinks as $1/\sqrt{N}$, so quadrupling $N$ only halves the noise. If your eval budget is tight, drop $N$ to 5-10 and accept noisier scores; if precision matters more than budget, push to 50. The same recipe works for any logprob-less judge (older open-weights Llama deployments behind chat-only APIs, Gemini in some configurations, etc.).
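The trade-off between sample count, temperature, and precision can be explored without calling any API. The sketch below is a simplified model, not a measurement: it treats the score slot in isolation (ignoring that each sample also draws a fresh chain of thought), assumes a made-up judge distribution, and uses hypothetical helper names. It shows two effects: sampling at $\tau < 1$ draws from a sharpened distribution proportional to $p_k^{1/\tau}$, which pulls the average toward the argmax digit, and the noise of the $N$-sample average is the per-sample standard deviation divided by $\sqrt{N}$.

```python
import math
import random


def with_temperature(probs: dict[int, float], tau: float) -> dict[int, float]:
    """Score distribution actually sampled at temperature tau:
    p_k ** (1 / tau), renormalized over the digits."""
    scaled = {k: p ** (1.0 / tau) for k, p in probs.items()}
    z = sum(scaled.values())
    return {k: v / z for k, v in scaled.items()}


def mean_and_std(probs: dict[int, float]) -> tuple[float, float]:
    """Exact mean and per-sample standard deviation of the score."""
    mean = sum(k * p for k, p in probs.items())
    var = sum(p * (k - mean) ** 2 for k, p in probs.items())
    return mean, math.sqrt(var)


def monte_carlo_mean(probs: dict[int, float], n: int, seed: int = 0) -> float:
    """Average of n sampled integer scores (seeded for repeatability)."""
    rng = random.Random(seed)
    digits = list(probs)
    draws = rng.choices(digits, weights=[probs[k] for k in digits], k=n)
    return sum(draws) / n


# Hypothetical judge: 60% on "3", 40% on "4" after digit renormalization.
judge = {3: 0.6, 4: 0.4}

for tau in (1.0, 0.5):
    sampled = with_temperature(judge, tau)
    mean, std = mean_and_std(sampled)
    for n in (5, 20, 50):
        print(f"tau={tau} N={n} target={mean:.3f} "
              f"std_err={std / math.sqrt(n):.3f} "
              f"one_run={monte_carlo_mean(sampled, n):.3f}")
```

Under these assumptions, lowering the temperature reduces run-to-run noise but moves the target away from the logprob-based expectation, which is one reason to prefer $\tau = 1.0$ with more samples when the budget allows.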

Key Takeaways

Argmax-over-digits throws away the judge's uncertainty: two summaries that round to "4" can differ in real quality, and a judge split between 3 and 4 collapses to one digit unless you read the distribution.
G-Eval's two innovations are CoT plus probability weighting: forcing step-by-step reasoning improves human-rating alignment, and computing the digit-renormalized expectation under the score-token distribution returns a continuous [1, K] score that preserves small quality differences.
The expectation formula renormalizes only over digit tokens: probability mass on non-numeric tokens is dropped before computing sum_k k * p_k, so a 3.65 versus 4.17 gap survives where argmax would have reported 3 versus 4.
Logprob-less judges use Monte Carlo averaging: 10-50 samples at temperature 0.5-1.0 on Claude or Gemini approximate the same E[k] by the law of large numbers, recovering ~80 percent of the precision gain at ~20x the API cost.
deepeval ships the production version: G-Eval is a one-liner metric class that works with OpenAI, Anthropic, and LiteLLM-backed open-weights judges and applies probability weighting only where the judge exposes logprobs, suitable for pytest-style runs in CI.
G-Eval lifts SummEval coherence Spearman by 0.05-0.10: the Liu et al. (2023) empirical gain (roughly 0.42 to 0.51 on coherence) is the difference between a borderline-useful and a clearly-useful judge.

What's Next

Once you can measure judge bias, the next step is to neutralize it. Continue to Section 46.3: Debiasing Techniques: Position, Length, and Verbosity.

Further Reading

Judge Bias

Wang, P., Li, L., Chen, L., et al. (2023). "Large Language Models are not Fair Evaluators." ACL 2024. arXiv:2305.17926. Reference paper on position and self-preference bias in LLM judges.

Panickssery, A., Bowman, S. R., & Feng, S. (2024). "LLM Evaluators Recognize and Favor Their Own Generations." arXiv:2404.13076. Reference paper on self-preference bias.

Saito, K., Wachi, A., Wataoka, K., & Akimoto, Y. (2023). "Verbosity Bias in Preference Labeling by Large Language Models." NeurIPS 2023 IFM Workshop. arXiv:2310.10076. Reference on verbosity bias.

Reliability Methods

Bavaresco, A., Bernardi, R., Bertolazzi, L., et al. (2024). "LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks." arXiv:2406.18403. Large-scale reliability study; the standard reference for judge-vs-human correlation.