External Reading & Communities

Section 10.11

"The papers that matter this quarter are not yet on arXiv. Half are on X, half are on Discord, and a third (the math is fuzzy here) are still in someone's notebook."

FrontierFrontier, Pre-Print-Wanderer AI Agent
Big Picture

Part II's external-reading list is centrally about three things: keeping current with frontier model releases, internalizing the canonical pretraining and inference papers, and finding the working LLM-research community (Discords, mailing lists, conferences) where the actual debugging happens. This section is the curated map of where to look when the textbook ends and the field keeps moving.

Prerequisites

This section is the end-of-part reading list and assumes you have worked through the rest of Part II (modules 6 through 10). No new technical prerequisites; some sources presuppose comfort with transformer mechanics, scaling laws, and quantization formats from earlier sections.

Within that list, three topics reward sustained attention: the frontier model zoo, tokenization, and mechanistic interpretability. Each has a small set of canonical resources that are consistently more reliable than general "AI news" coverage. Subscribe to the originals; the secondary aggregators are mostly noise.

External reading map: leaderboards, lab blogs, newsletters, deep readings
Figure 10.11.1: Where to look when Part II's reading ends. Leaderboards (LM Arena, HF Chatbot Arena, Artificial Analysis cost-and-latency dashboards, LiveBench, Epoch AI's FrontierMath) answer "which model should I call?" with weekly-refreshed data. Lab blogs (Anthropic's Transformer Circuits, OpenAI Research, Google DeepMind, Hugging Face's FineWeb and SmolLM reports, EleutherAI) are authoritative on each lab's own work. Newsletters (Sebastian Raschka's monthly Ahead of AI, Lilian Weng's quarterly long-form surveys, Nathan Lambert's weekly Interconnects, swyx's daily smol.ai news) filter the firehose. Communities (EleutherAI Discord, Alignment Forum, AI Safety Fundamentals curriculum, NeurIPS and ICLR proceedings) plus Karpathy's "Let's reproduce GPT-2" video are where the working debugging actually happens.

10.11.1 Leaderboards to bookmark

Fun Fact

The LMSYS Chatbot Arena leaderboard is one of the few public benchmarks that frontier labs cannot easily game because the test items are live human prompts that arrive faster than any model can be trained on them. It is also one of the rare places where you can watch a $10 billion lab tie with a 7B open-weight model on a Tuesday afternoon.

10.11.2 Lab blogs and engineering posts

10.11.3 Newsletters and survey blogs

10.11.4 Mechanistic interpretability deep-reading

Tip: Where the academic ML conversation moved in 2025

By 2025, Bluesky had overtaken X for the academic-ML poster crowd (Karpathy, Raschka, many CS faculty), while X continues to host commercial-lab announcements. Following both is reasonable; if you must pick one, Bluesky is the better academic firehose. The smol.ai newsletter still aggregates across both platforms.

Tip: Spend 15 minutes a day, not 2 hours

The mistake most people make when tracking AI in 2026 is spending too much time, not too little. Fifteen minutes a day on the daily/weekly tier above, plus an hour a month on the monthly tier, beats any "Twitter-scrolling all morning" routine. The frontier moves quickly, but most weeks add less than 15 minutes of genuinely new technical content.
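As a rough illustration of that budget, the sketch below compares monthly hours under the two routines. The 30-day month and the helper name `monthly_hours` are assumptions made for the example, not figures from this section.

```python
def monthly_hours(daily_minutes: float,
                  monthly_extra_hours: float = 0.0,
                  days: int = 30) -> float:
    """Total hours per month for a daily habit plus an optional monthly block.

    `days` defaults to 30, an assumed month length for illustration only.
    """
    return daily_minutes * days / 60 + monthly_extra_hours


# 15 minutes a day plus one hour a month (the routine recommended in the tip)
focused = monthly_hours(15, monthly_extra_hours=1)

# 2 hours a day (the routine named in the tip's title)
scrolling = monthly_hours(120)

print(f"focused: {focused:.1f} h/month, scrolling: {scrolling:.1f} h/month")
```

Under these assumptions the focused routine costs 8.5 hours a month against 60 for the two-hour habit, roughly a sevenfold difference.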

Key Takeaways

What's Next?

This chapter completes the current part. The next part, Part III: Working with LLMs, opens a new arc; see the part index for chapter ordering.

Further Reading

Foundational Mechanistic Interpretability

Olah, C., Cammarata, N., Schubert, L., et al. (2020). "Zoom In: An Introduction to Circuits." Distill. distill.pub/2020/circuits/zoom-in
The Distill essay that named the field; the canonical entry point for the circuit-level program of mechanistic interpretability.
Elhage, N., Nanda, N., Olsson, C., et al. (2021). "A Mathematical Framework for Transformer Circuits." Anthropic / Transformer Circuits Thread. transformer-circuits.pub/2021/framework
The formal decomposition of attention-only transformers into circuits; the technical foundation underlying nearly all later mech-interp work.
Olsson, C., Elhage, N., Nanda, N., et al. (2022). "In-Context Learning and Induction Heads." Anthropic / Transformer Circuits Thread. transformer-circuits.pub/2022/in-context-learning-and-induction-heads
The induction-head paper; the most-cited mechanistic-interpretability result of 2022 and still the textbook example of a "discovered" circuit.
Elhage, N., Hume, T., Olsson, C., et al. (2022). "Toy Models of Superposition." Anthropic / Transformer Circuits Thread. transformer-circuits.pub/2022/toy_model
The paper that put "superposition" on the mech-interp map; the conceptual basis for the SAE / dictionary-learning research program that dominated 2023-2025.

Sparse Autoencoders and Feature Dictionaries

Bricken, T., Templeton, A., Batson, J., et al. (2023). "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning." Anthropic / Transformer Circuits Thread. transformer-circuits.pub/2023/monosemantic-features
The Anthropic paper that demonstrated SAEs on a small transformer; the work that turned "dictionary learning on activations" into a serious research direction.
Templeton, A., Conerly, T., Marcus, J., et al. (2024). "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet." Anthropic / Transformer Circuits Thread. transformer-circuits.pub/2024/scaling-monosemanticity
The Claude 3 Sonnet feature-extraction paper; the first demonstration of SAE-discovered features on a frontier production model, and a major proof of concept for interpretability at scale.
Cunningham, H., Ewart, A., Riggs, L., et al. (2023). "Sparse Autoencoders Find Highly Interpretable Features in Language Models." ICLR 2024. arXiv:2309.08600
The independent academic SAE result published alongside the Anthropic work; widely used as the standard external citation for SAE feature discovery.
Gao, L., la Tour, T.D., Tillman, H., et al. (2024). "Scaling and Evaluating Sparse Autoencoders." OpenAI. arXiv:2406.04093
The OpenAI SAE-on-GPT-4 paper; demonstrated SAE scaling laws and pushed dictionary learning into the frontier-model regime.

Analysis Tools and Libraries

Nanda, N. & Bloom, J. (2022-2026). "TransformerLens: A Library for Mechanistic Interpretability of Transformer Language Models." TransformerLens Documentation. transformerlensorg.github.io/TransformerLens
The de-facto standard library for hooking and probing transformer internals; the entry-point tool for almost every mech-interp tutorial.
Lin, J., Bloom, J., et al. (2024-2026). "Neuronpedia: An Interactive Database of Sparse Autoencoder Features." Neuronpedia. neuronpedia.org
The community feature-explorer; the most-used interface for browsing SAE features on open models (Gemma-2, Llama-3, GPT-2).
Sarti, G., Feldhus, N., Sickert, L., et al. (2023). "Inseq: An Interpretability Toolkit for Sequence Generation Models." ACL 2023 System Demonstrations. arXiv:2302.13942
The most-cited open-source library for feature-attribution and saliency analysis on generative LLMs.
Marks, S., Rager, C., Michaud, E.J., et al. (2024). "Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models." ICLR 2025. arXiv:2403.19647
The paper bridging SAEs and circuit-level analysis; the basis for the "sparse feature circuits" pipeline now widely used to audit and edit LLM behavior.

Benchmarks and Reasoning Probes

Wang, Y., Ma, X., Zhang, G., et al. (2024). "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark." NeurIPS 2024. arXiv:2406.01574
The 2024 reasoning-focused upgrade of MMLU; widely used as a coarse-grained probe of compositional reasoning behavior to triangulate against mech-interp findings.
Suzgun, M., Scales, N., Schärli, N., et al. (2022). "Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them." Findings of ACL 2023. arXiv:2210.09261
The BIG-Bench Hard subset; the canonical reasoning-probe harness used to stress-test circuit-level hypotheses about chain-of-thought behavior.
Wang, K., Variengien, A., Conmy, A., et al. (2022). "Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small." ICLR 2023. arXiv:2211.00593
The IOI circuit paper; the canonical end-to-end circuit-discovery case study and the de-facto worked example for any mech-interp course.