Two model categories matter for Part VIII: judge models (used for LLM-as-judge eval) and the production-serving models themselves.
45.4.1 Judge models
The LLM-as-judge pattern uses one model to score outputs of another. The judge needs to be at least as capable as the model under test and ideally more so.
- Claude Opus 4.5 (Anthropic, 2025-2026) is a common default LLM-as-judge in 2026 because of its strong instruction-following on long, multi-criterion rubrics. Its objective as a judge is to apply your scoring criteria faithfully even when the rubric is long, which matters because a judge that ignores part of the rubric silently corrupts every downstream eval result. Pick Claude Opus 4.5 as your default judge; cross-validate periodically with a judge from a different family to detect self-preference and other family-specific bias.
- GPT-5 / o3 (OpenAI, 2024-2025) are OpenAI's frontier and reasoning models respectively, useful as alternative judges when you want family diversity. Their objective in a judge role is to cross-validate Claude Opus 4.5 so that model-family bias is exposed, which matters because LLMs tend to rate their own family's outputs higher (self-preference bias). Pick GPT-5 or o3 as a secondary judge for an ensemble or a cross-check.
- Prometheus 2 (KAIST AI, 2024) is an open-weights judge model trained specifically for evaluation tasks on roughly 100K rubric-based feedback examples plus pairwise preference data. Its objective is to provide an open, self-hostable judge that does not leak your eval prompts to a third party, which matters when eval prompts are sensitive (security audits, internal red-team). The core concept is support for both direct assessment (an absolute score against a rubric) and pairwise ranking (which of two responses is better). Pick Prometheus 2 when self-hosting is required; for raw judge quality, closed frontier models remain stronger.
- Skywork-Reward (Skywork AI, 2024) is an open reward model, trained specifically to score (prompt, response) pairs for RLHF. Its objective is to provide an open scalar-reward model for downstream RL or preference filtering, which matters when you need to score thousands of generations cheaply. Pick Skywork-Reward when you need a fast scalar judge for RL or candidate filtering; for natural-language judge rubrics, Prometheus 2 fits better.
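The cross-check between a default judge and a second-family judge can be made concrete with a small calculation. The sketch below is one possible approach, not a standard metric: it assumes you already have numeric scores from both judges on the same outputs, and the record layout (`candidate_family`, `score`), the function names, and the toy numbers are all hypothetical. It compares how much the primary judge favors its own family against how much an outside judge favors that same family; the difference is the part of the gap that real quality differences do not obviously explain.

```python
from statistics import mean

def family_gap(records, family):
    """Mean score given to `family`'s outputs minus mean score given to all others."""
    own = [r["score"] for r in records if r["candidate_family"] == family]
    other = [r["score"] for r in records if r["candidate_family"] != family]
    if not own or not other:
        raise ValueError("need scores for both the family and at least one other family")
    return mean(own) - mean(other)

def excess_self_preference(primary, secondary, primary_family):
    """Gap under the primary judge minus the gap an outside judge sees.

    Both record lists must score the same set of outputs. A value near zero
    suggests the primary judge's preference reflects quality; a large positive
    value suggests self-preference worth investigating.
    """
    return family_gap(primary, primary_family) - family_gap(secondary, primary_family)

# Toy scores, invented for illustration only (1-10 scale).
primary_scores = [
    {"candidate_family": "family_a", "score": 9},
    {"candidate_family": "family_a", "score": 8},
    {"candidate_family": "family_b", "score": 6},
    {"candidate_family": "family_b", "score": 7},
]
secondary_scores = [
    {"candidate_family": "family_a", "score": 7},
    {"candidate_family": "family_a", "score": 8},
    {"candidate_family": "family_b", "score": 7},
    {"candidate_family": "family_b", "score": 6},
]

excess = excess_self_preference(primary_scores, secondary_scores, "family_a")
print(f"excess self-preference: {excess:+.2f}")
```

With only a handful of items the number is noisy, so in practice you would run this over a few hundred shared eval items before drawing conclusions.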
45.4.2 Production-serving model picks
The choice of model to actually serve depends on cost, latency, quality, and license. The trade-off table from Section 14.4 remains the right reference; we restate the production-relevant rows here.
- For closed APIs: GPT-4o-mini, Claude Sonnet, or Gemini Flash usually win on cost-per-task.
- For self-hosted at scale: a 70B-class open-weights model such as Llama 3.3 70B or Qwen2.5 72B, served on vLLM.
- For edge or single-GPU: a 7B-9B-class model such as Llama 3.1 8B, Qwen3 8B, or Gemma 2 9B.
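For the closed-API row, cost-per-task is simple arithmetic over token counts and per-token prices. The sketch below shows the calculation; the model names and prices are placeholders invented for illustration, not real list prices, so substitute the current numbers from each provider's pricing page.

```python
def cost_per_task(input_tokens, output_tokens, price_in_per_mtok, price_out_per_mtok):
    """Dollar cost of one task, given prices quoted per million tokens."""
    return (input_tokens * price_in_per_mtok
            + output_tokens * price_out_per_mtok) / 1_000_000

# Hypothetical prices in dollars per million tokens: (input, output).
# These are NOT real list prices.
hypothetical_prices = {
    "small-model": (0.50, 2.00),
    "large-model": (5.00, 20.00),
}

# Assumed workload shape: 2,000 prompt tokens and 500 completion tokens per task.
costs = {
    name: cost_per_task(2_000, 500, p_in, p_out)
    for name, (p_in, p_out) in hypothetical_prices.items()
}

for name, dollars in costs.items():
    print(f"{name}: ${dollars:.4f} per task")
```

Cost alone does not settle the choice: a cheaper model that fails more tasks may cost more per successful task, which is why the pick also depends on quality and latency.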
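Whichever judge you pick from Section 45.4.1, expect three biases: position bias (preferring a response because of where it appears, often the first), verbosity bias (preferring longer answers), and self-preference (rating outputs from the judge's own model or family higher). Mitigate by randomizing the order in which responses are shown, controlling for response length, and using a judge from a different family than the model under test.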
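The position-bias mitigation can be sketched as follows. Instead of randomizing a single presentation order, this variant judges both orders and keeps a verdict only when the two agree, which costs twice the judge calls but flags order-dependent verdicts explicitly. The `judge` callable and its `"first"`/`"second"` return convention are assumptions made for illustration; in practice it would wrap a call to whichever judge model you use.

```python
from typing import Callable

# Hypothetical judge interface: (prompt, first_response, second_response)
# returns "first" or "second".
Judge = Callable[[str, str, str], str]

def debiased_preference(judge: Judge, prompt: str, resp_a: str, resp_b: str) -> str:
    """Return "a", "b", or "tie" after judging both presentation orders."""
    forward = judge(prompt, resp_a, resp_b)    # A is shown first
    backward = judge(prompt, resp_b, resp_a)   # B is shown first
    a_wins_forward = forward == "first"
    a_wins_backward = backward == "second"
    if a_wins_forward and a_wins_backward:
        return "a"
    if not a_wins_forward and not a_wins_backward:
        return "b"
    # The verdict flipped when the order flipped: treat it as position bias.
    return "tie"

# Stub judges standing in for a real model call.
def always_first(prompt: str, first: str, second: str) -> str:
    """A maximally position-biased judge."""
    return "first"

def keyword_judge(prompt: str, first: str, second: str) -> str:
    """An order-independent judge that prefers the response containing 'Paris'."""
    return "first" if "Paris" in first else "second"

question = "What is the capital of France?"
print(debiased_preference(always_first, question, "Paris", "Lyon"))
print(debiased_preference(keyword_judge, question, "Lyon", "Paris"))
```

Swapping order does nothing for verbosity bias: a judge that always prefers the longer answer is perfectly consistent across both orders, so length still has to be controlled separately.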
What's Next?
In the next section, Section 45.5: External Reading & Communities, we build on the material covered here.