"The papers that matter this quarter are not yet on arXiv. Half are on X, half are on Discord, and a third (the math is fuzzy here) are still in someone's notebook."
Frontier, Pre-Print-Wanderer AI Agent
Part II's external-reading list is centrally about three things: keeping current with the frontier model releases, internalizing the canonical pretraining and inference papers, and finding the working LLM-research community (Discords, mailing lists, conferences) where the actual debugging happens. This section is the curated map of where to look when the textbook ends and the field continues moving.
Prerequisites
This section is the end-of-part reading list and assumes you have worked through the rest of Part II (modules 6 through 10). No new technical prerequisites; some sources presuppose comfort with transformer mechanics, scaling laws, and quantization formats from earlier sections.
Part II's external-reading list is centrally about three things: keeping current with the frontier model zoo, going deep on tokenization, and going deep on mechanistic interpretability. Each topic has a small set of canonical resources that consistently outperform "AI news" coverage. Subscribe to the originals; the secondary aggregators are noise.
10.11.1 Leaderboards to bookmark
The LMSYS Chatbot Arena leaderboard is one of the few public benchmarks that frontier labs cannot easily game because the test items are live human prompts that arrive faster than any model can be trained on them. It is also one of the rare places where you can watch a $10 billion lab tie with a 7B open-weight model on a Tuesday afternoon.
- LM Arena: blind pairwise comparisons, updated weekly. The single most-trusted ranking of chat models.
- Chatbot Arena leaderboard (HF Space): same data, sortable by category, model size, and licensing.
- Artificial Analysis: cost-per-token, latency, throughput across providers. The "how cheap is this in production" tracker.
- LiveBench: contamination-resistant benchmark with monthly question rotation.
- Epoch AI benchmarks (FrontierMath and friends): the math-reasoning frontier tracker.
10.11.2 Lab blogs and engineering posts
- Transformer Circuits (Anthropic): every important mech-interp paper of 2022-26. Required reading for Chapter 10.
- OpenAI Research: the GPT family's model cards and "system card" disclosures.
- Google DeepMind blog: Gemini disclosures, Gemma releases, alignment-research updates.
- Hugging Face blog: tokenizer internals, dataset releases, FineWeb / SmolLM technical reports.
- EleutherAI blog: the lm-evaluation-harness updates, Pythia analysis, open-source pretraining.
10.11.3 Newsletters and survey blogs
- Ahead of AI (Sebastian Raschka): monthly. Best single newsletter for "what shipped" with benchmarks.
- Lilian Weng's blog: quarterly long-form surveys. The 2024 "Why we think" and 2025 reasoning-models survey are the modern anchors.
- Interconnects (Nathan Lambert): weekly. Strong on alignment, RLHF, evaluation politics.
- Alignment Forum: serious alignment-research discussion.
- smol.ai newsletter (swyx): daily AI-engineering newsletter; good filter for "which Twitter / Bluesky thread mattered today".
- Karpathy: "Let's reproduce GPT-2 (124M)" (2024): the most-watched practical pretraining walkthrough; works as the missing companion to Chapter 7.
- AI Safety Fundamentals: the dominant 2025 alignment-onboarding curriculum, free.
10.11.4 Mechanistic interpretability deep-reading
- Neel Nanda: "A Comprehensive Mechanistic Interpretability Explainer & Glossary": the field's pedagogical anchor.
- Olsson et al.: "In-context Learning and Induction Heads": still the most-cited paper-level introduction to circuit-level analysis.
- Templeton et al.: "Scaling Monosemanticity": 34M-feature SAE on Claude 3 Sonnet. The breakthrough that made SAEs production-scale.
- Anthropic: "Attribution Graphs and Cross-Layer Transcoders" (March 2025): the next-generation circuit-tracing methodology.
- 2025 "Mechanistic Interpretability 1-year Update" (Anthropic, 2025-Q3): the current canonical survey replacing the 2022 Olsson paper as the newcomer's first read.
By 2025, Bluesky overtook X for the academic-ML poster crowd (Karpathy, Raschka, many CS faculty), while X continues to host commercial-lab announcements. Cross-post-following both is reasonable; if you must pick one, Bluesky is the better academic firehose. The smol.ai newsletter still aggregates across both platforms.
The mistake most people make tracking 2026 AI is spending too much time, not too little. Fifteen minutes a day on the daily/weekly tier above, plus an hour a month on the monthly tier, beats any "Twitter-scrolling all morning" routine. The frontier moves quickly but most weeks add less than 15 minutes of genuinely new technical content.
- LM Arena and Artificial Analysis are the load-bearing leaderboards: blind pairwise Elo for chat-quality ranking and cost-per-token plus latency tracking for production economics, with LiveBench and Epoch AI handling contamination-resistant evaluation.
- Transformer Circuits and the lab blogs beat secondary aggregators: Anthropic's Transformer Circuits, OpenAI Research, Google DeepMind, Hugging Face, and EleutherAI publish the primary material before the news cycle picks it up.
- Newsletters and surveys cover the mid-cadence: Ahead of AI, Lilian Weng's quarterly long-form, Interconnects on alignment politics, Alignment Forum on serious safety research, and smol.ai for daily filtering.
- Mech-interp deep reading anchors on Anthropic's lineage: Nanda's glossary, Olsson et al. on induction heads, Templeton et al. on scaling monosemanticity, and the 2025 Attribution Graphs paper give the field's pedagogical and methodological spine.
- Bluesky displaced X for academic ML in 2025: Karpathy, Raschka, and many CS faculty migrated to Bluesky, while X retains commercial-lab announcements, so smol.ai's cross-platform aggregation is the practical compromise.
- Fifteen minutes a day beats two-hour scroll sessions: most weeks add less than 15 minutes of genuinely new technical content, so daily plus weekly tiers plus a monthly hour outpace continuous Twitter checking.
What's Next?
This chapter completes the current part. The next part, Part III: Working with LLMs, opens a new arc; see the part index for chapter ordering.