Section 46.3: Debiasing Techniques: Position, Length, and Verbosity

"Swap the order, truncate the length, anchor the rubric, and suddenly I am almost fair. Three tweaks; that is the whole debiasing protocol."
Guard, Three-Axis Judge-Debiasing AI Agent

Big Picture: Three Debiasing Axes and the Open-Source Judge

Section 46.2 quantified the noise floor of LLM-as-Judge; this section gives you the toolkit for raising the signal. Three axes of debiasing dominate practice. (1) Position: in pairwise comparison the order of the candidates shifts the verdict; the fix is to run each comparison in both orders and either require the two verdicts to agree or aggregate them. (2) Length: longer responses tend to win pairwise comparisons by default; the fix is either to truncate outputs to a common budget before judging or to regress out the length effect post-hoc (the AlpacaEval LC recipe of Section 46.5). (3) Verbosity / formatting: structured, bulleted, confident-sounding text tends to win regardless of substance; the fix is rubric-anchored prompts that force the judge to score on specific axes (factuality, faithfulness, coverage) rather than gestalt preference. The second half of the section introduces Prometheus, a family of open-source judge models fine-tuned specifically for rubric-based evaluation (Prometheus 2 ships in 7B and 8x7B sizes); it is the canonical open-weights alternative to GPT-4 as a judge, and the architectural cousin of JudgeLM (Section 46.4).

**Figure 46.3.1:** Three independent debiasing axes. Position swap cancels left-right preference, length control removes the long-answers-win effect, and rubric anchoring replaces gestalt "better?" judgments with structured per-axis scores. Use all three; they target different biases. (Illustration: position swap, length truncation, and rubric anchoring applied to an LLM-as-judge, with a noisy signal becoming cleaner.)
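The first two axes are mechanical enough to sketch in a few lines. The block below is a minimal illustration, not a reference implementation: the judge is a hypothetical callable that takes an instruction and two responses and returns "A" (first-shown response wins), "B", or "tie"; the names `truncate_to_budget` and `swap_consistent_verdict` are invented for this example; and truncation counts whitespace-separated words, where a real pipeline would more likely count tokens.

```python
from typing import Callable, Literal, Optional

Verdict = Literal["A", "B", "tie"]
# Hypothetical judge signature: (instruction, first_shown, second_shown) -> verdict,
# where "A" means the first-shown response wins and "B" the second-shown one.
PairwiseJudge = Callable[[str, str, str], Verdict]


def truncate_to_budget(text: str, max_words: int) -> str:
    """Crude length control: keep at most max_words whitespace-separated words."""
    return " ".join(text.split()[:max_words])


def swap_consistent_verdict(
    judge: PairwiseJudge,
    instruction: str,
    resp_1: str,
    resp_2: str,
    max_words: Optional[int] = None,
) -> Verdict:
    """Judge a pair in both orders; return "A" if resp_1 wins, "B" if resp_2 wins.

    If the two orderings disagree, the pair is treated as a tie.
    """
    if max_words is not None:
        resp_1 = truncate_to_budget(resp_1, max_words)
        resp_2 = truncate_to_budget(resp_2, max_words)

    forward = judge(instruction, resp_1, resp_2)   # resp_1 shown first
    backward = judge(instruction, resp_2, resp_1)  # resp_2 shown first

    # Translate the backward verdict into resp_1/resp_2 terms.
    backward_mapped = {"A": "B", "B": "A", "tie": "tie"}[backward]
    return forward if forward == backward_mapped else "tie"
```

A judge that always prefers whichever response is shown first yields "tie" on every pair under this scheme, which is the intended behavior: pure position preference carries no information about quality.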

Prometheus (Kim et al., 2023) and Prometheus 2 (Kim et al., 2024) are open-source language models specifically trained to serve as evaluation judges. Unlike a general-purpose model such as GPT-4 used as a judge, Prometheus models are fine-tuned on rubric-based evaluation data in which each training example includes a detailed scoring rubric, a model output, a reference answer, and a score with a written justification (in the original Feedback Collection dataset these scores and justifications were generated with GPT-4 rather than assigned by human annotators). This training process is designed to produce judges that adhere more closely to evaluation rubrics and are less susceptible to the stylistic biases that affect general-purpose judges.

Prometheus 2 extends the original with two evaluation modes: direct assessment (scoring a single output on a rubric) and pairwise ranking (selecting the better of two outputs). The model accepts a structured input containing the evaluation criteria, the rubric with score-level descriptions, and the output(s) to evaluate. It produces a chain-of-thought justification followed by a score or preference verdict. Code Fragment 46.3.1 shows how to use Prometheus 2 for rubric-based evaluation.

# Prometheus 2: rubric-based evaluation with an open-source judge
# Install: pip install transformers torch accelerate
import re

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "prometheus-eval/prometheus-7b-v2.0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # requires the accelerate package
)

DIRECT_ASSESSMENT_TEMPLATE = """###Task Description:
An instruction (might include an Input inside it), a response to evaluate,
a reference answer that gets a score of 5, and a score rubric representing
the evaluation criteria are given.
1. Write a detailed feedback that assesses the quality of the response
strictly based on the given score rubric.
2. After writing the feedback, write a score that is an integer between
1 and 5. Refer to the score rubric.
3. The output format should look as follows: "Feedback: (write a feedback
for criteria) [RESULT] (an integer number between 1 and 5)"
###Instruction:
{instruction}
###Response to evaluate:
{response}
###Reference Answer (Score 5):
{reference}
###Score Rubric:
[{criteria}]
Score 1: {score_1_desc}
Score 2: {score_2_desc}
Score 3: {score_3_desc}
Score 4: {score_4_desc}
Score 5: {score_5_desc}
###Feedback:"""


def prometheus_evaluate(
    instruction: str,
    response: str,
    reference: str,
    criteria: str,
    rubric: dict[int, str],
) -> dict:
    """Run Prometheus 2 direct assessment and return feedback plus a 1-5 score."""
    prompt = DIRECT_ASSESSMENT_TEMPLATE.format(
        instruction=instruction,
        response=response,
        reference=reference,
        criteria=criteria,
        score_1_desc=rubric[1],
        score_2_desc=rubric[2],
        score_3_desc=rubric[3],
        score_4_desc=rubric[4],
        score_5_desc=rubric[5],
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        # Greedy decoding (do_sample=False) keeps the verdict deterministic.
        outputs = model.generate(
            **inputs,
            max_new_tokens=512,
            do_sample=False,
        )
    # Decode only the newly generated tokens, not the echoed prompt.
    prompt_len = inputs.input_ids.shape[1]
    generated = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True)

    # The judge ends its feedback with "[RESULT] <score>"; score stays None
    # if the marker is missing (for example, when generation was cut off).
    feedback = generated.strip()
    match = re.search(r"\[RESULT\]\s*([1-5])", feedback)
    score = int(match.group(1)) if match else None
    return {"feedback": feedback, "score": score}

Code Fragment 46.3.1: Using Prometheus 2 for rubric-based evaluation. The open-source judge model accepts a structured prompt with evaluation criteria and score-level descriptions, generates chain-of-thought feedback, and ends with an integer score from 1 to 5.

46.3.1 Prometheus and Prometheus 2: Open-Source Judge Models

Fun Fact

Prometheus (Kim et al., 2023) was named after the Greek titan who gave fire to humanity; the implicit joke was that the model gives away frontier-class evaluation capability to anyone with a consumer GPU. Prometheus 2, published at EMNLP 2024, reported the highest agreement with human and proprietary-model judgments among the open evaluator models its authors tested, across both direct-assessment and pairwise benchmarks. The model is hosted on Hugging Face under prometheus-eval, where it has been downloaded over 500,000 times.

Prerequisites

This section assumes familiarity with judge reliability and biases from Section 46.2 and with LLM-as-judge fundamentals from Section 46.1. Familiarity with instruction tuning from Section 13.2 helps when reading the Prometheus training-recipe discussion.

Library Shortcut: Prometheus 2 Direct Assessment

A concrete invocation makes the rubric format tangible. Imagine you are scoring a customer-support response on empathy and accuracy:

result = prometheus_evaluate(
    instruction="A user writes: 'My order is two weeks late and I'm furious.' "
                "Reply with a tone-appropriate, accurate update.",
    response="Sorry for the delay. Your package shipped on 2026-05-15 and "
             "is now in transit; tracking link below.",
    reference="I'm really sorry your order is late, that's frustrating. "
              "It shipped 2026-05-15 and tracking shows it now in transit ...",
    criteria="Does the response acknowledge the user's frustration "
             "and provide accurate, actionable shipping information?",
    rubric={
        1: "Ignores user emotion and gives no useful information.",
        2: "Acknowledges emotion OR gives information, not both.",
        3: "Acknowledges both but one is weak or generic.",
        4: "Acknowledges emotion explicitly and provides correct info.",
        5: "Empathetic, specific, accurate, and offers next steps.",
    },
)
# Illustrative output; the exact feedback text varies with model version:
# result -> {"score": 3, "feedback": "The response is factually correct
#  but the apology is perfunctory and does not explicitly acknowledge
#  the user's stated frustration ... [RESULT] 3"}

Code Fragment 46.3.2: A worked Prometheus 2 invocation that scores a customer-support response on empathy and accuracy. The rubric is filled in completely: criterion sentence, score-level descriptions for 1 through 5, and a score-5 reference answer. The returned score-and-feedback pair shows how the judge anchors a "3" by citing the missing emotional acknowledgement.

Note the three rubric ingredients: a criterion sentence (one axis being scored), score-level descriptions (anchored definitions of what each integer means), and a reference answer at score 5 (concrete exemplar). Prometheus is explicitly trained on this format and degrades gracefully when fields are missing, but the more the rubric is filled in, the more the judge's verdict tracks human raters.

The primary advantage of Prometheus over general-purpose judge models is rubric adherence. When given a detailed scoring rubric, Prometheus produces scores that correlate more strongly with human judgments than GPT-4 on rubric-based evaluation tasks. The model is also fully open-source (Apache 2.0), enabling local deployment without API costs or data privacy concerns. For organizations evaluating sensitive content that cannot be sent to external APIs, Prometheus provides a viable alternative to proprietary judges.

Key Insight

Rubric-Trained Judges Outperform GPT-4 on Rubric-Following

The counter-intuitive finding from Kim et al. (2024) is that a 7B-parameter Prometheus 2 judge outperforms GPT-4 on rubric-based evaluation, despite GPT-4 being a far larger model (its parameter count is undisclosed) and stronger on almost every other benchmark. The reported Pearson correlation with human rubric scores on Feedback Bench reaches ${\sim}0.87$ for Prometheus 2 7B vs ${\sim}0.85$ for GPT-4. The reason: GPT-4 is a generalist trained on everything; Prometheus is a specialist that has seen $\sim$100K rubric-and-feedback examples in supervised fine-tuning, so it has internalized the structural priors of rubric-following (anchor each integer to its definition, justify before scoring, treat the reference answer as a literal score-5 exemplar). The takeaway generalizes beyond judges: for narrow, well-specified evaluation tasks, a small fine-tuned specialist can match or beat a frontier generalist at a fraction of the cost.

Figure 46.3.2 puts that David-and-Goliath result in cartoon form: the small open-source specialist arrives carrying the one thing the giant generalist was never trained to carry.

**Figure 46.3.2**: A 7B rubric specialist out-judging a frontier generalist. Because Prometheus 2 was fine-tuned on rubric-and-feedback examples, it follows score-level definitions more faithfully than GPT-4 does, which is why a far smaller model wins on rubric-based evaluation. (Illustration: two cartoon judges. On the left, a small bookish open-source judge labeled Prometheus 2 carries a thick rubric labeled SCORE 1-5; on the right, a much larger judge labeled GPT-4 looks impressed and slightly threatened.)

Library Shortcut

prometheus-eval as the open-source judge wrapper

The raw-transformers recipe above shows how the Prometheus 2 model works; in production reach for the official prometheus-eval package. It wraps prompt formatting, output parsing, batched vLLM inference, the canonical rubric templates, and both direct-assessment and pairwise modes. Pick it when you need a fully self-hosted judge with no external API dependency and predictable per-judgment cost.


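# Install: pip install prometheus-eval
from prometheus_eval.vllm import VLLM
from prometheus_eval import PrometheusEval
from prometheus_eval.prompts import ABSOLUTE_PROMPT, SCORE_RUBRIC_TEMPLATE

model = VLLM(model="prometheus-eval/prometheus-7b-v2.0")
judge = PrometheusEval(model=model, absolute_grade_template=ABSOLUTE_PROMPT)

# instruction, response, reference, criteria, and rubric (the {1..5: text} dict)
# are the same values passed to prometheus_evaluate in Code Fragment 46.3.2.
# SCORE_RUBRIC_TEMPLATE expects placeholders named score1_description, etc.
level_descs = {f"score{k}_description": rubric[k] for k in range(1, 6)}

feedback, score = judge.single_absolute_grade(
    instruction=instruction,
    response=response,
    rubric=SCORE_RUBRIC_TEMPLATE.format(criteria=criteria, **level_descs),
    reference_answer=reference,
)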

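Code Fragment 46.3.3: Rubric-based grading through the prometheus-eval wrapper. The vLLM backend batches inference, so throughput is typically much higher than the raw-transformers loop, though the actual rate depends on hardware, prompt length, and feedback length. To switch to pairwise mode, call single_relative_grade with two responses (and configure a relative-grade prompt template on the judge); the rest of the pipeline stays the same.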

See Also

Prometheus and the JudgeLM model covered in Section 46.4 are two points on the same design axis: both are open-source judge models fine-tuned from an open base model (Llama 2 for the original Prometheus, Mistral and Mixtral for Prometheus 2, Vicuna for JudgeLM) on evaluation-formatted data. The main difference is the shape of the training data: Prometheus uses rubric-and-feedback examples, while JudgeLM uses pairwise comparison data distilled from GPT-4. As a result, Prometheus is stronger on rubric-anchored direct assessment and JudgeLM is stronger on pairwise ranking without a rubric. In practice you pick the model whose training format matches your eval format, or run both and ensemble. Section 46.4 walks through JudgeLM's training pipeline and the swap-augmentation trick that targets position bias.

Production Pattern

Production Example: LLM-as-Judge Stacks at Major Labs

Several organizations have described their judge stacks publicly. LMSYS ranks models on Chatbot Arena (the live leaderboard) from anonymized human preference votes and uses GPT-4 as the judge for its offline MT-Bench scoring, with the human votes serving as the calibration anchor. Hugging Face's Open LLM Leaderboard runs on lm-evaluation-harness and added IFEval in its 2024 revision; note that IFEval's format checks are programmatic rather than judge-based. Inside companies, Vercel AI and Cohere have reportedly described G-Eval-style chain-of-thought judges in production, and Cursor reportedly uses LLM judges to grade code diffs during eval runs of new model versions. The recurring pattern: the judge is a frontier model (GPT-4o or Claude Sonnet) used for rubric scoring, paired with a periodically refreshed human-rated calibration set (typically 200 to 1000 examples) to monitor for judge drift.
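The drift-monitoring half of that pattern can be sketched with nothing beyond the standard library. The example below is one plausible way to do it, assuming the calibration set is stored as two parallel lists of integer rubric scores (human and judge); the function names, the stored baseline, and the 0.05 tolerance are all hypothetical choices for illustration. It uses unweighted Cohen's kappa; for ordinal 1-to-5 scores a weighted kappa or a rank correlation may be a better fit.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(human: Sequence[int], judge: Sequence[int]) -> float:
    """Unweighted Cohen's kappa between two equal-length label sequences."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(human)
    # Fraction of items where judge and human give the same score.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Agreement expected by chance, from each rater's marginal label frequencies.
    h_counts, j_counts = Counter(human), Counter(judge)
    labels = set(h_counts) | set(j_counts)
    expected = sum(h_counts[c] * j_counts[c] for c in labels) / (n * n)
    if expected == 1.0:  # both raters used one identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


def judge_has_drifted(
    human: Sequence[int],
    judge: Sequence[int],
    baseline_kappa: float,
    tolerance: float = 0.05,  # hypothetical threshold; tune per project
) -> bool:
    """Flag drift when agreement falls more than `tolerance` below the baseline."""
    return cohen_kappa(human, judge) < baseline_kappa - tolerance
```

Re-running this check whenever the judge model, its prompt, or the provider's model version changes gives an early warning before drift contaminates headline metrics.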

Self-Check: Position, Verbosity, and Self-Preference Bias

Before moving on, test your grasp of judge biases on three short scenarios.

Q1: Position bias. You run a pairwise judge with the same model twice on the same pair: once as (A, B) and once as (B, A). Roughly 12 percent of the time the verdict flips. Is this judge unsafe to deploy without modification? Name two cheap mitigations and one expensive one.

Answer:

Yes, a 12 percent flip rate is unsafe for production: any single judgment is essentially a coin flip on close calls, and downstream metrics will be biased toward whichever order you chose. Cheap mitigations: (1) always run each pair twice (A,B and B,A) and only accept agreeing verdicts as decisive, treating disagreements as ties; (2) randomize the order across the eval set so position bias averages out at the aggregate level. Expensive mitigation: fine-tune or distill a judge model on order-swapped pairs so the bias is reduced by construction.

Q2: Verbosity bias. Your judge tends to pick the longer response when answers are close in quality. You suspect the prompt is leaking a "be helpful and thorough" instruction. Sketch a two-line prompt change you would test first and a metric you would track to confirm the bias is reduced.

Answer:

Add an explicit "judge on correctness, not length; favor concise correct answers over long verbose ones" instruction; and remove any "be thorough" wording from the judge system prompt. Track the Pearson correlation between (response length) and (judge-assigned score) on a calibration set; if it drops from, say, 0.5 to under 0.1 the bias is materially reduced. Pair this with length-controlled win rates (LC AlpacaEval style) for a stricter check.
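The tracking metric in this answer is simple to compute. The following is a minimal sketch, assuming responses are plain strings, length is measured in whitespace-separated words (tokens would work equally well), and scores are numeric; `length_score_correlation` is a name invented for this example.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)


def length_score_correlation(responses: Sequence[str], scores: Sequence[float]) -> float:
    """Correlation between response length (in words) and judge-assigned score."""
    lengths = [len(r.split()) for r in responses]
    return pearson(lengths, scores)
```

Compute the value once with the old judge prompt and once with the revised prompt on the same calibration set; a meaningful drop is evidence that the prompt change reduced the length effect. Some residual correlation is expected whenever longer answers are genuinely better, so compare against the same statistic computed on human scores where available.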

Q3: Self-preference. Your team picks GPT-4o as a judge for evaluating responses from GPT-4o, Claude, and Llama. A reviewer flags that this design will favor GPT-4o. Why is that concern justified, and what experimental design would let you measure and bound the effect?

Answer:

Self-preference is justified because judges have been shown to prefer outputs that match their own writing style, formatting conventions, and reasoning patterns; GPT-4o judging GPT-4o is the textbook setup. To measure: assemble a small set of head-to-head pairs with human ground truth, then compute the win rate gap between (GPT-4o judge on GPT-4o vs Claude) and (Claude judge on GPT-4o vs Claude). The gap is your self-preference bias. To bound: ensemble multiple judges (GPT-4o + Claude + Gemini), use the majority verdict, and report disagreement rate alongside the metric.
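The measurement and the ensemble in this answer can both be expressed in a few lines. The sketch below assumes each judge's output has already been reduced to a list of winner labels (for example "gpt-4o" or "claude"), one per head-to-head pair, in the same pair order for every judge; all function names are hypothetical.

```python
from collections import Counter
from typing import Sequence


def win_rate(verdicts: Sequence[str], model: str) -> float:
    """Fraction of pairs in which `model` was declared the winner."""
    return sum(v == model for v in verdicts) / len(verdicts)


def self_preference_gap(
    own_judge_verdicts: Sequence[str],
    other_judge_verdicts: Sequence[str],
    model: str,
) -> float:
    """How much more often `model` wins when it judges itself than under another judge."""
    return win_rate(own_judge_verdicts, model) - win_rate(other_judge_verdicts, model)


def majority_verdict(votes: Sequence[str]) -> str:
    """Winner backed by a strict majority of judges, else "tie"."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else "tie"


def disagreement_rate(per_pair_votes: Sequence[Sequence[str]]) -> float:
    """Fraction of pairs on which the judges did not all agree."""
    return sum(len(set(votes)) > 1 for votes in per_pair_votes) / len(per_pair_votes)
```

Applying `win_rate` to the human ground-truth labels as well shows which judge sits closer to the human win rate, which distinguishes self-preference in one judge from bias against the model in the other.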

If you can answer all three, you are ready for Section 46.4 (training judges) and Section 46.5 (calibrating against humans).

What's Next

Off-the-shelf debiasing only gets you so far; the next step is to train a judge model that has the biases reduced by construction. Continue to Section 46.4: Training Judge Models.

Further Reading

Debiasing Techniques

Dubois, Y., Galambosi, B., Liang, P., & Hashimoto, T. B. (2024). "Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators." arXiv:2404.04475. The canonical reference for length-controlled judge debiasing.

Wang, P., et al. (2023). "Large Language Models are not Fair Evaluators." arXiv:2305.17926. Reference for position-swap debiasing.

Zheng, L., Chiang, W.-L., Sheng, Y., et al. (2023). "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." NeurIPS 2023. arXiv:2306.05685. The foundational paper that established LLM-as-judge methodology and documented position and verbosity bias; essential prerequisite reading for any judge-debiasing study.

Judge Models

Kim, S., Shin, J., Cho, Y., et al. (2023). "Prometheus: Inducing Fine-grained Evaluation Capability in Language Models." ICLR 2024. arXiv:2310.08491. The original Prometheus judge model paper; the reference for open-source rubric-conditioned scoring.

Kim, S., Suk, J., Longpre, S., et al. (2024). "Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models." arXiv:2405.01535. The second-generation Prometheus model that adds pairwise judging; the most-recent open judge baseline referenced in this section.

Production Patterns

Zheng, C., Zhou, H., Meng, F., Zhou, J., & Huang, M. (2024). "Large Language Models Are Not Robust Multiple Choice Selectors." ICLR 2024. arXiv:2309.03882. Reference for option-position bias in multiple-choice judge evaluation.