Part IX's literature is split between the academic eval community and the industrial MLOps community.
45.5.1 Foundational papers
- Liang et al., "Holistic Evaluation of Language Models (HELM)" (2022).
- Zheng et al., "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena" (2023).
- Kwon et al., "Efficient Memory Management for Large Language Model Serving with PagedAttention" (2023), the vLLM paper.
45.5.2 Blogs and resources
- Eugene Yan, "Evals for production LLMs".
- Hamel Husain, "Your AI Product Needs Evals".
- UK AI Safety Institute: publications on agent and safety evals.
- Stanford CRFM: HELM updates and related research.
45.5.3 Communities
- MLOps Community Slack.
- r/mlops.
- EleutherAI Discord (eval-harness channel).
Tip: Eval-driven development
Treat your eval set as code: version it, review changes, track scores in CI. The most reliable signal that an LLM product is on the right path is a steadily improving curve on a stable, version-controlled eval.
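As a minimal sketch of what "track scores in CI" could look like, the snippet below fingerprints the eval set and fails the check when either the set changes without a new baseline or the score regresses. All names here (`eval_set_fingerprint`, `Baseline`, `exact_match_score`, `ci_gate`), the exact-match metric, and the 0.01 tolerance are illustrative assumptions, not part of any particular eval framework.

```python
import hashlib
import json
from dataclasses import dataclass


def eval_set_fingerprint(examples: list[dict]) -> str:
    """Short content hash of the eval set, so any edit is visible in review."""
    canonical = json.dumps(examples, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


@dataclass(frozen=True)
class Baseline:
    """Hypothetical record committed alongside the eval set."""
    fingerprint: str  # hash of the eval set the baseline was measured on
    score: float      # exact-match accuracy in [0, 1]


def exact_match_score(examples: list[dict], predict) -> float:
    """Fraction of examples where predict(input) equals the expected output."""
    if not examples:
        raise ValueError("eval set is empty")
    correct = sum(predict(ex["input"]) == ex["expected"] for ex in examples)
    return correct / len(examples)


def ci_gate(examples, predict, baseline: Baseline, tolerance: float = 0.01):
    """Return (passed, message) for use as a CI check."""
    fingerprint = eval_set_fingerprint(examples)
    if fingerprint != baseline.fingerprint:
        # Scores are only comparable on the same eval set version.
        return False, f"eval set changed ({baseline.fingerprint} -> {fingerprint})"
    score = exact_match_score(examples, predict)
    if score < baseline.score - tolerance:
        return False, f"regression: {score:.3f} < baseline {baseline.score:.3f}"
    return True, f"ok: {score:.3f} (baseline {baseline.score:.3f})"
```

In a CI job, `predict` would wrap the model or pipeline under test, and a failed gate would exit non-zero. Refusing to compare scores across different fingerprints is the design choice that keeps the curve meaningful: a change to the eval set has to arrive together with a reviewed update to the baseline.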
What's Next?
The next chapter, Chapter 46: Why LLM-as-Judge Matters, builds on the material from this chapter.
Further Reading
External Reading
OpenAI (2023). "GPT-4 Technical Report." arXiv:2303.08774. Reference for the evaluation methodology used in flagship LLM releases.
Stanford CRFM (2024). "HELM: Holistic Evaluation of Language Models." crfm.stanford.edu/helm. Reference framework for holistic LLM evaluation.