Models

Section 45.4

Two model categories matter for Part VIII: judge models, which score other models' outputs in LLM-as-judge evaluation, and the models served in production.

45.4.1 Judge models

The LLM-as-judge pattern uses one model to score outputs of another. The judge needs to be at least as capable as the model under test and ideally more so.
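A minimal sketch of the pattern, assuming a pointwise 1-to-5 rubric score. The prompt template, the `call_judge` callable, and the "Score: N" reply format are illustrative assumptions, not a specific library's API; in practice `call_judge` would wrap whatever client sends a prompt to the judge model and returns its text.

```python
import re
from typing import Callable, Optional

# Hypothetical prompt template; the wording and the 1-5 scale are assumptions.
JUDGE_TEMPLATE = """You are grading a model response.
Rubric: {rubric}
Question: {question}
Response: {response}
Reply with a line of the form "Score: N" where N is an integer from 1 to 5."""


def parse_score(judge_reply: str) -> Optional[int]:
    """Extract the 1-5 score from the judge's reply, or None if it is missing."""
    match = re.search(r"Score:\s*([1-5])\b", judge_reply)
    return int(match.group(1)) if match else None


def judge_pointwise(
    call_judge: Callable[[str], str],  # assumed: prompt text in, reply text out
    question: str,
    response: str,
    rubric: str,
) -> Optional[int]:
    """Ask the judge model to score one response against a rubric."""
    prompt = JUDGE_TEMPLATE.format(rubric=rubric, question=question, response=response)
    return parse_score(call_judge(prompt))


# Stub standing in for a real judge model, so the sketch runs offline.
def fake_judge(prompt: str) -> str:
    return "The answer is correct and concise.\nScore: 4"


score = judge_pointwise(fake_judge, "What is 2 + 2?", "4", "Correctness")
```

Returning `None` for an unparseable reply, rather than a default score, keeps malformed judge output from silently skewing aggregate results.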

45.4.2 Production-serving model picks

The choice of model to actually serve depends on cost, latency, quality, and license. The trade-off table from Section 14.4 remains the right reference; we restate the production-relevant rows here.

Key Insight: LLM-as-judge has known biases

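Judges show position bias (favoring a response because of where it appears in the prompt, often the first slot), verbosity bias (preferring longer answers regardless of quality), and self-preference (rating their own outputs higher than those of other models). Mitigate these by randomizing or swapping response order, controlling for length, and choosing a judge from a different model family than the model under test.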
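One way to handle position bias in pairwise judging is to query the judge in both orders and accept a verdict only when the two orders agree. The sketch below uses that swap-and-compare variant in place of random ordering. The judge signature (returning "first" or "second") and the stub judges are hypothetical, chosen only to make the example self-contained.

```python
from typing import Callable

# Hypothetical judge signature:
# (question, first_response, second_response) -> "first" or "second"
PairwiseJudge = Callable[[str, str, str], str]


def debiased_preference(
    judge: PairwiseJudge, question: str, response_a: str, response_b: str
) -> str:
    """Return "A", "B", or "tie" after judging both presentation orders."""
    forward = judge(question, response_a, response_b)   # A shown first
    backward = judge(question, response_b, response_a)  # B shown first
    a_wins_forward = forward == "first"
    a_wins_backward = backward == "second"
    if a_wins_forward and a_wins_backward:
        return "A"
    if not a_wins_forward and not a_wins_backward:
        return "B"
    # The verdict flipped with the order, so treat it as position-driven.
    return "tie"


# Stub judges standing in for real model calls.
def content_judge(question: str, first: str, second: str) -> str:
    """Prefers whichever response mentions Paris, regardless of position."""
    return "first" if "Paris" in first else "second"


def position_biased_judge(question: str, first: str, second: str) -> str:
    """Always prefers the response in the first slot."""
    return "first"


consistent = debiased_preference(content_judge, "Capital of France?", "Paris", "Lyon")
flipped = debiased_preference(position_biased_judge, "Capital of France?", "Paris", "Lyon")
```

Here `consistent` is "A" because the content-driven judge picks the same response in both orders, while `flipped` is "tie" because the position-biased judge contradicts itself once the order is swapped. The cost is two judge calls per comparison.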

What's Next?

In the next section, Section 45.5: External Reading & Communities, we build on the material covered here.

Further Reading

Judge Models

Kim, S., Shin, J., Cho, Y., et al. (2023). "Prometheus: Inducing Fine-grained Evaluation Capability in Language Models." ICLR 2024. arXiv:2310.08491. The original Prometheus judge model paper; the foundational reference for open-source rubric-conditioned scoring.
Kim, S., Suk, J., Longpre, S., et al. (2024). "Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models." EMNLP 2024. arXiv:2405.01535. The second-generation Prometheus, which adds pairwise judging alongside direct scoring; a widely used open judge baseline.
Vu, T., Krishna, K., Alzubi, S., et al. (2024). "Foundational Autoraters: Taming Large Language Models for Better Automatic Evaluation." arXiv:2407.10817. Google's foundational autorater work; the reference for treating evaluation as a core LLM capability.
Zheng, L., Chiang, W.-L., Sheng, Y., et al. (2023). "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." NeurIPS 2023. arXiv:2306.05685. The paper that established LLM-as-judge as a methodology and documented its biases; essential prerequisite reading for any judge-model deployment.
Lambert, N., Pyatkin, V., Morrison, J., et al. (2024). "RewardBench: Evaluating Reward Models for Language Modeling." arXiv:2403.13787. The standard benchmark for reward models and pairwise judges; the reference for comparing judge models against frontier closed models like GPT-4.