Section 45.4: Models

Two model categories matter for Part VIII: judge models (used for LLM-as-judge eval) and the production-serving models themselves.

45.4.1 Judge models

The LLM-as-judge pattern uses one model to score outputs of another. The judge needs to be at least as capable as the model under test and ideally more so.
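The pattern can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the prompt template, the 1-to-5 scale, the "Score: N" reply format, and the `call_model` callable (a stand-in for whatever client sends a prompt to your judge model) are all assumptions made for the example.

```python
import re
from typing import Callable, Optional

# Hypothetical rubric-conditioned template; adapt the wording and scale to your eval.
JUDGE_TEMPLATE = (
    "You are grading a model response.\n"
    "Rubric: {rubric}\n"
    "Question: {question}\n"
    "Response: {response}\n"
    'Reply with a line of the form "Score: N", where N is an integer from 1 to 5.'
)


def build_judge_prompt(rubric: str, question: str, response: str) -> str:
    """Fill the template with the rubric and the output under test."""
    return JUDGE_TEMPLATE.format(rubric=rubric, question=question, response=response)


def parse_score(judge_output: str, lo: int = 1, hi: int = 5) -> Optional[int]:
    """Extract the integer score; return None if it is missing or out of range."""
    match = re.search(r"Score:\s*(\d+)", judge_output)
    if match is None:
        return None
    score = int(match.group(1))
    return score if lo <= score <= hi else None


def judge(call_model: Callable[[str], str], rubric: str, question: str, response: str) -> Optional[int]:
    """Score one response. call_model is an assumed prompt-in, text-out wrapper."""
    return parse_score(call_model(build_judge_prompt(rubric, question, response)))
```

Returning `None` for unparseable or out-of-range replies, rather than guessing a score, keeps malformed judge outputs visible so they can be counted and retried.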

Claude Opus 4.5 (Anthropic, 2025-2026) is the most-used LLM-as-judge in 2026 because it follows instructions closely and adheres to long, complex rubrics. As a judge, its job is to apply your scoring criteria faithfully even when the rubric is long; this matters because a judge that ignores parts of the rubric corrupts every downstream eval number. Pick Claude Opus 4.5 as your default judge, and periodically cross-validate with a judge from a different model family to detect self-preference and other family-specific bias.
GPT-5 / o3 (OpenAI, 2024-2025) are OpenAI's frontier and reasoning models, useful as alternative judges when you want family diversity. In a judge role, their job is to cross-validate Claude Opus 4.5 so that model-family bias is exposed; this matters because LLMs tend to rate outputs from their own family higher (self-preference bias). Pick GPT-5 or o3 as a secondary judge for ensembling or cross-checking.
Prometheus 2 (KAIST AI, 2024) is an open-weights judge model trained specifically for evaluation, using roughly 100K direct-assessment examples (the Feedback Collection) and roughly 200K pairwise-ranking examples (the Preference Collection). Its objective is to provide an open, self-hostable judge that does not leak your eval prompts to a third party, which matters when eval prompts are sensitive (security audits, internal red-team). The core concept is support for both direct assessment (an absolute score against a rubric) and pairwise ranking, with the scoring rubric supplied in the prompt; the training data is largely synthetic. Pick Prometheus 2 when self-hosting is required; for raw judge quality, closed frontier models remain stronger.
Skywork-Reward (Skywork AI, 2024) is an open reward model trained specifically to score (prompt, response) pairs for RLHF. Its objective is to provide an open model that returns a single scalar reward for downstream RL or preference filtering, which matters when you need to score thousands of generations cheaply. Pick Skywork-Reward when you need a fast scalar judge for RL or candidate filtering; for natural-language judge rubrics, Prometheus 2 fits better.
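The cross-validation between a primary and a secondary judge described above can be summarized with a few simple statistics. The sketch below assumes both judges have already scored the same items on the same integer scale; the function name, the `tolerance` parameter, and the choice of statistics are illustrative assumptions, not a standard protocol.

```python
from statistics import mean
from typing import Dict, List


def compare_judges(primary: List[int], secondary: List[int], tolerance: int = 1) -> Dict[str, object]:
    """Compare two judges' scores on the same items.

    primary and secondary hold per-item scores in the same order and on the
    same scale. Items whose scores differ by more than `tolerance` are
    flagged for human review.
    """
    if not primary or len(primary) != len(secondary):
        raise ValueError("score lists must be non-empty and the same length")
    diffs = [p - s for p, s in zip(primary, secondary)]
    return {
        # Fraction of items where both judges gave the identical score.
        "exact_agreement": mean(1.0 if d == 0 else 0.0 for d in diffs),
        # Positive means the primary judge scores higher on average.
        "mean_signed_gap": mean(diffs),
        # Indices where the judges disagree by more than the tolerance.
        "disputed_indices": [i for i, d in enumerate(diffs) if abs(d) > tolerance],
    }
```

A persistent signed gap that appears only on outputs from one judge's own model family is the kind of pattern that suggests self-preference; the disputed items are a natural sample to send to human raters.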

45.4.2 Production-serving model picks

The choice of model to actually serve depends on cost, latency, quality, and license. The trade-off table from Section 14.4 remains the right reference; we restate the production-relevant rows here.

For closed APIs: GPT-4o-mini, Claude Sonnet, or Gemini Flash usually win on cost-per-task.
For self-hosted at scale: Llama 3.3 70B or Qwen2.5 72B served on vLLM.
For edge or single-GPU: Llama 3.1 8B, Qwen2.5 7B, or Gemma 2 9B.
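Cost-per-task is simple arithmetic once you know token counts and prices. The sketch below is illustrative only: the `Pricing` class, the function name, and every number in the usage lines are hypothetical placeholders, not real vendor prices. Substitute your provider's current rates and your own measured token counts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Token prices in USD per million tokens (hypothetical values go here)."""
    input_usd_per_mtok: float
    output_usd_per_mtok: float


def cost_per_task(pricing: Pricing, input_tokens: int, output_tokens: int, calls_per_task: int = 1) -> float:
    """Estimated USD cost of one task, given average tokens per call."""
    per_call = (
        input_tokens * pricing.input_usd_per_mtok
        + output_tokens * pricing.output_usd_per_mtok
    ) / 1_000_000
    return per_call * calls_per_task


# Placeholder prices and token counts, for illustration only.
example = Pricing(input_usd_per_mtok=1.0, output_usd_per_mtok=4.0)
single_call = cost_per_task(example, input_tokens=2000, output_tokens=500)
agent_loop = cost_per_task(example, input_tokens=2000, output_tokens=500, calls_per_task=3)
```

Including `calls_per_task` matters because a cheaper model that needs retries or extra tool-use turns can end up costing more per completed task than a pricier model that succeeds in one call.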

Key Insight: LLM-as-judge has known biases

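Judges show position bias (favoring a response because of where it appears in the prompt, often the first slot), verbosity bias (preferring longer answers regardless of quality), and self-preference (LLMs rate their own outputs higher). Mitigate by randomizing or swapping the order in which responses are presented, controlling for length (for example, comparing responses of similar length or instructing the judge to ignore length), and using a judge from a different model family than the model under test.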
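One way to handle position bias in pairwise judging is sketched below. Note that it uses order swapping rather than randomization: each pair is judged twice with the positions exchanged, and a win counts only if both orderings agree. The `judge_pair` callable is an assumption (a wrapper around your judge that returns "A" or "B" for the first or second response shown), as are the function name and the return labels.

```python
from typing import Callable

# Assumed judge interface: (question, response_shown_first, response_shown_second) -> "A" or "B",
# where "A" means the first response shown is preferred.
PairJudge = Callable[[str, str, str], str]


def debiased_pairwise(judge_pair: PairJudge, question: str, resp_1: str, resp_2: str) -> str:
    """Judge a pair in both orders; return "resp_1", "resp_2", or "tie".

    A verdict that flips when the order is swapped is treated as a tie,
    since it reflects position rather than quality.
    """
    first = judge_pair(question, resp_1, resp_2)   # resp_1 shown in slot A
    second = judge_pair(question, resp_2, resp_1)  # resp_2 shown in slot A

    winner_first = "resp_1" if first == "A" else "resp_2"
    winner_second = "resp_2" if second == "A" else "resp_1"
    return winner_first if winner_first == winner_second else "tie"
```

This doubles the number of judge calls. Randomizing the order once per pair is cheaper and removes the systematic bias on average, but it cannot tell you which individual verdicts were position-driven.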

What's Next?

In the next section, Section 45.5: External Reading & Communities, we build on the material covered here.

Further Reading

Judge Models

Kim, S., Shin, J., Cho, Y., et al. (2023). "Prometheus: Inducing Fine-grained Evaluation Capability in Language Models." ICLR 2024. arXiv:2310.08491. The original Prometheus judge model paper; the foundational reference for open-source rubric-conditioned scoring.

Kim, S., Suk, J., Longpre, S., et al. (2024). "Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models." EMNLP 2024. arXiv:2405.01535. The second-generation Prometheus, which adds pairwise judging alongside direct assessment; a standard open judge baseline.

Vu, T., Krishna, K., Alzubi, S., et al. (2024). "Foundational Autoraters: Taming Large Language Models for Better Automatic Evaluation." arXiv:2407.10817. Google's foundational autorater work; the reference for treating evaluation as a core LLM capability.

Zheng, L., Chiang, W.-L., Sheng, Y., et al. (2023). "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." NeurIPS 2023. arXiv:2306.05685. The paper that established LLM-as-judge as a methodology and documented its biases; essential prerequisite reading for any judge-model deployment.

Lambert, N., Pyatkin, V., Morrison, J., et al. (2024). "RewardBench: Evaluating Reward Models for Language Modeling." arXiv:2403.13787. The standard benchmark for reward models and pairwise judges; the reference for comparing judge models against frontier closed models like GPT-4.