Section 45.5: External Reading & Communities

Part IX's literature is split between the academic eval community and the industrial MLOps community.

45.5.1 Foundational papers

45.5.2 Blogs and resources

Eugene Yan, "Evals for production LLMs".
Hamel Husain, "Your AI Product Needs Evals".
UK AI Safety Institute: publications on agent and safety evals.
Stanford CRFM: HELM updates and related research.

45.5.3 Communities

MLOps Community Slack.
r/mlops.
EleutherAI Discord (eval-harness channel).

Tip: Eval-driven development

Treat your eval set as code: version it, review changes to it, and track scores in CI. Scores are only comparable across runs when the eval set itself is unchanged, so a steadily improving curve on a stable, version-controlled eval is among the most reliable signals that an LLM product is on the right path.
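As an illustration of the tip, here is a minimal sketch of a CI check that pins the eval set by content hash and fails on a score regression. Everything in it is an assumption made for the example: the names EvalCase, eval_set_version, score, and ci_gate are hypothetical, scoring is plain exact match, the model is any callable from prompt to answer, and the baseline is a small dict that would normally live in the repository next to the eval set.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    """One eval example (hypothetical schema for this sketch)."""
    prompt: str
    expected: str


def eval_set_version(cases):
    """Short content hash of the eval set; changes whenever any case changes."""
    payload = json.dumps(
        [[c.prompt, c.expected] for c in cases], ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def score(cases, model):
    """Exact-match accuracy; `model` is any callable mapping prompt -> answer."""
    if not cases:
        raise ValueError("eval set is empty")
    hits = sum(model(c.prompt).strip() == c.expected for c in cases)
    return hits / len(cases)


def ci_gate(cases, model, baseline, tolerance=0.01):
    """Return (passed, message) for a CI run against a stored baseline."""
    version = eval_set_version(cases)
    if version != baseline["eval_version"]:
        # Scores on different eval sets are not comparable, so stop here.
        return False, (
            f"eval set changed ({baseline['eval_version']} -> {version}); "
            "re-baseline in a reviewed change"
        )
    current = score(cases, model)
    if current < baseline["score"] - tolerance:
        return False, f"regression: {current:.3f} < baseline {baseline['score']:.3f}"
    return True, f"ok: {current:.3f} on eval set {version}"


# Toy data and a stand-in model, purely for demonstration.
CASES = [EvalCase("2+2=", "4"), EvalCase("Capital of France?", "Paris")]


def toy_model(prompt):
    return {"2+2=": "4", "Capital of France?": "Paris"}.get(prompt, "")


BASELINE = {"eval_version": eval_set_version(CASES), "score": 1.0}
passed, message = ci_gate(CASES, toy_model, BASELINE)
print(passed, message)
```

In a real pipeline the baseline would be committed alongside the eval set and updated only through code review, so that any change to the cases shows up as an explicit, reviewable diff rather than a silent shift in the score curve.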

What's Next?

The next chapter, Chapter 46: Why LLM-as-Judge Matters, builds on the evaluation material from this chapter.

Further Reading

External Reading

OpenAI (2023). "GPT-4 Technical Report." arXiv:2303.08774. Reference for the evaluation methodology used in flagship LLM releases.

Stanford CRFM (2024). "HELM: Holistic Evaluation of Language Models." crfm.stanford.edu/helm. Reference framework for holistic LLM evaluation.