External Reading & Communities

Section 45.5

Part IX's literature is split between the academic eval community and the industrial MLOps community.

45.5.1 Foundational papers

45.5.2 Blogs and resources

45.5.3 Communities

Tip: Eval-driven development

Treat your eval set as code: version it, review changes, track scores in CI. The most reliable signal that an LLM product is on the right path is a steadily improving curve on a stable, version-controlled eval.
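One way to picture "track scores in CI" is a small gate that fingerprints the eval set, scores the model, and fails the build on a regression. The sketch below is a minimal illustration, not a prescribed implementation: the names EvalCase, eval_set_fingerprint, exact_match_score, and check_regression are hypothetical, exact-match scoring stands in for whatever metric your product actually needs, and the 0.01 tolerance is an arbitrary assumption.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class EvalCase:
    """One eval example (hypothetical schema: a prompt and its expected answer)."""
    prompt: str
    expected: str


def eval_set_fingerprint(cases: Sequence[EvalCase]) -> str:
    """Short content hash of the eval set, so each logged score can be
    tied to the exact version of the data it was measured on."""
    payload = json.dumps([[c.prompt, c.expected] for c in cases], ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def exact_match_score(cases: Sequence[EvalCase], predict: Callable[[str], str]) -> float:
    """Fraction of cases where the stripped model output equals the expected answer."""
    if not cases:
        raise ValueError("eval set is empty")
    hits = sum(1 for c in cases if predict(c.prompt).strip() == c.expected)
    return hits / len(cases)


def check_regression(score: float, baseline: float, tolerance: float = 0.01) -> bool:
    """Return True when the score has not dropped more than `tolerance`
    below the stored baseline (the tolerance value is an assumption)."""
    return score >= baseline - tolerance
```

In a CI job, you would load the cases from a file kept under version control, call exact_match_score with your model's prediction function, log the score next to the fingerprint, and exit non-zero when check_regression returns False. Keying the score history on the fingerprint makes it obvious when a score changed because the eval set changed rather than because the model did.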

What's Next?

The next chapter, Chapter 46: Why LLM-as-Judge Matters, builds on the material from this chapter.

Further Reading

External Reading

OpenAI (2024). "GPT-4 Technical Report." arXiv:2303.08774. Reference for the evaluation methodology used in flagship LLM releases.
Stanford HAI (2024). "HELM: Holistic Evaluation of Language Models." crfm.stanford.edu/helm. A reference framework for holistic LLM evaluation.