Chapter 78: Tools of the Trade: Frontier Research Stack


"The frontier is the part of the map that is currently being drawn. The tools listed here will not survive the decade."
Frontier, Pre-Print-Reading AI Agent

Looking Back

The preceding chapters surveyed the frontier. This chapter is the toolkit for keeping up: arxiv-sanity, alphaXiv, Hugging Face Daily Papers, Papers with Code, reading lists, replication tools, and the small habits that distinguish someone who follows the field from someone who chases it.

Big Picture

Part XII looked at the frontiers: where research is headed, where the open questions are, and what the next decade of LLM work might look like. This chapter is the toolbox for staying current: the paper firehose (arXiv, Papers with Code), the lab publications (Anthropic, OpenAI, EleutherAI, Nous, Stability), and the live evaluation tracking (LMArena, Artificial Analysis).

Chapter Overview

Part XV covered frontier theory and AGI trajectories. This chapter consolidates the frontier-research toolchain: the platforms (arXiv, Semantic Scholar, ResearchRabbit, Elicit, OpenReview); the libraries, organized by paper tracking, reproducibility (Hydra, DVC, W&B), and reference management; the benchmark suite that serves as the empirical anchor; the 2025 to 2026 model shelf, organized into reasoning-first, agent-first, and capability-frontier tiers; and the venues (NeurIPS, ICML, ICLR, COLM, and the Anthropic and OpenAI engineering blogs) that publish the next frontier.

Frontier-research tooling moves faster than peer review, so this chapter focuses on what stays stable: the platforms, the reproducibility libraries, and the venues that publish whichever specific tools come next.

Note: Learning Objectives

Use arXiv, Semantic Scholar, ResearchRabbit, Elicit, and OpenReview as paper-tracking infrastructure.
Apply Hydra, DVC, and W&B as the reproducibility substrate for frontier research code.
Evaluate frontier benchmarks and reason about contamination and saturation.
Choose between reasoning-first (o3, Claude Opus, Gemini 2.5 Pro Deep Think, DeepSeek-R1) and capability-frontier models for a target experiment.
Track the venues, conferences, and engineering blogs that publish the next frontier.
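One way to make the contamination objective concrete is an n-gram overlap check between benchmark items and a sample of training text. The sketch below is illustrative only: the function names, the whitespace tokenization, and the default window size are assumptions chosen for brevity, not a method prescribed by this chapter or by any particular benchmark.

```python
def ngrams(text, n):
    """Return the set of lowercase whitespace-token n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(benchmark_items, corpus_docs, n=8):
    """Fraction of benchmark items sharing at least one n-gram with the corpus.

    n=8 is an assumed default; larger windows flag fewer items.
    """
    corpus_ngrams = set()
    for doc in corpus_docs:
        corpus_ngrams |= ngrams(doc, n)
    if not benchmark_items:
        return 0.0
    flagged = [item for item in benchmark_items if ngrams(item, n) & corpus_ngrams]
    return len(flagged) / len(benchmark_items)
```

A high rate suggests that benchmark text leaked into the corpus sample. A low rate does not prove the benchmark is clean, since paraphrased or translated copies escape exact n-gram matching.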

Library Shortcut

For the minimum information diet:

pip install arxiv

The arxiv Python client plus a daily reading hour is the closest thing to a frontier-tracking habit that has held up since 2018. Complement it with Hugging Face Daily Papers for curation.
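The arxiv package is a client for the public arXiv API, which returns Atom XML. As a rough sketch of the daily query, the example below calls that API directly using only the Python standard library, so it is a stand-in for the arxiv client rather than a demonstration of it. The function names and the default category `cs.CL` are assumptions made for illustration.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ARXIV_API = "http://export.arxiv.org/api/query"  # public arXiv API endpoint
ATOM = {"atom": "http://www.w3.org/2005/Atom"}   # Atom XML namespace


def build_query_url(category="cs.CL", max_results=10):
    """Build a query URL for the newest submissions in one arXiv category."""
    params = {
        "search_query": f"cat:{category}",
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "max_results": max_results,
    }
    return f"{ARXIV_API}?{urllib.parse.urlencode(params)}"


def parse_feed(atom_xml):
    """Extract id, title, and publication date from an Atom feed."""
    root = ET.fromstring(atom_xml)
    papers = []
    for entry in root.findall("atom:entry", ATOM):
        title = entry.findtext("atom:title", default="", namespaces=ATOM)
        papers.append({
            "id": entry.findtext("atom:id", default="", namespaces=ATOM).strip(),
            "title": " ".join(title.split()),  # collapse wrapped whitespace
            "published": entry.findtext("atom:published", default="", namespaces=ATOM).strip(),
        })
    return papers


def fetch_latest(category="cs.CL", max_results=10):
    """Fetch and parse the newest submissions (requires network access)."""
    url = build_query_url(category, max_results)
    with urllib.request.urlopen(url, timeout=30) as resp:
        return parse_feed(resp.read())
```

Calling `fetch_latest("cs.LG", 20)` once a day and skimming the titles is enough to seed the reading hour. Keeping the parser separate from the network call means it can be tested offline against a saved feed.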

Sections in This Chapter

Prerequisites

Modern LLM landscape from Chapter 7
Evaluation tooling from Chapter 45
An arXiv account and the patience to read pre-prints

What Comes Next

This is the final chapter. After Section 78.5, you have finished the book. The appendices remain as reference material; the appendix index lists them all.

Further Reading

Frontier Research Infrastructure

Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., et al. (2020). "Transformers: State-of-the-Art Natural Language Processing." EMNLP Demos. arXiv:1910.03771. The Hugging Face transformers paper, the substrate for nearly every frontier research toolchain.

Rasley, J., Rajbhandari, S., Ruwase, O., & He, Y. (2020). "DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters." KDD. ACM DL. DeepSpeed, one of the canonical training-systems toolkits that frontier labs and academic groups extend to push model scale.

Open Benchmarks & Leaderboards

Srivastava, A., Rastogi, A., Rao, A., Shoeb, A. A. M., Abid, A., Fisch, A., et al. (2023). "Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models." TMLR. arXiv:2206.04615. BIG-bench, the open community benchmark that defined the modern leaderboard pattern, with hundreds of tasks contributed by external researchers.

Chiang, W.-L., Zheng, L., Sheng, Y., Angelopoulos, A. N., Li, T., Li, D., et al. (2024). "Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference." ICML. arXiv:2403.04132. LMSYS Chatbot Arena, the de-facto human-preference leaderboard the frontier research community uses to compare models in the wild.