Section 61.5: External Reading and Communities

"SC for the foundations, MLSys for the methods, Hugging Face for the playbook, r/LocalLLaMA for the ground truth. One stream a week and the field cannot lose you."
Tensor, Reading-List-Curating Scale AI Agent

Big Picture

Staying current on LLM systems at scale requires sampling four streams. First, the systems and HPC venues (SC / Supercomputing, ISC, MLSys, OSDI, SOSP, EuroSys, NSDI) where the foundational training-systems papers are published. Second, the foundational papers themselves: Megatron-LM, the DeepSpeed series, GPT-3 / PaLM / OPT / BLOOM training reports, the DeepSeek-V3 paper, the Llama-3 herd paper, the GShard / Mixture-of-Experts papers. Third, the engineering blogs that translate research into practice: Hugging Face's scale series, PyTorch Engineering, the DeepSpeed blog, Yi Tay (Reka), Nathan Lambert (Interconnects), the various lab tech blogs. Fourth, the communities where practitioners exchange tactical know-how: r/LocalLLaMA infrastructure threads, EleutherAI Discord, MLPerf and MLCommons working groups, the open-source library issue trackers themselves. A practitioner who samples one of each stream weekly stays within a week of the field; one who samples only Twitter / X stays in a noisy filter bubble.

The systems-at-scale field is unusual in that the most important results are often published not in academic venues but in vendor technical reports and engineering blogs. The Llama-3 herd paper, the DeepSeek-V3 paper, the OPT-175B logbook, the BLOOM book, and the various Anthropic / Google / OpenAI technical reports are arguably more practically influential than the median MLSys paper. The reading list below therefore mixes academic and industrial sources roughly evenly, weighted by what practitioners actually cite in production work.

61.5.1 Conferences and academic venues

The systems-side conferences are where the foundational training-infrastructure papers appear. They differ from the ML conferences (NeurIPS, ICML, ICLR), which focus on model and method advances rather than systems.

SC (International Conference for High Performance Computing, Networking, Storage and Analysis; annual, US-based, since 1988) is the premier HPC conference and the venue where ZeRO (SC20), the large-scale Megatron-LM training paper (SC21), and many other foundational distributed-training results appeared. It is the systems-side home of high-performance computing, which matters because much of the field's understanding of GPU collective communication, fault tolerance, and large-scale checkpointing comes from SC papers. Pick SC as the canonical conference to follow if you work on training systems infrastructure; the November proceedings each year are essential reading.
ISC High Performance (annual, European-based, since 1986) is the European HPC conference, complementary to SC with a stronger European systems-research focus. Pick ISC for European HPC research and the TOP500 listing updates; the SC / ISC pair gives full-year coverage.
MLSys (Conference on Machine Learning and Systems) (annual, since 2018) is the conference at the intersection of ML and systems, where many distributed-training, inference-system, and accelerator-design papers appear. Pick MLSys as the primary venue for ML systems research; the proceedings each spring are the canonical reading.
OSDI (USENIX Symposium on Operating Systems Design and Implementation) and SOSP (ACM Symposium on Operating Systems Principles) are the two top-tier operating systems / systems-software conferences. They historically alternated years and have both since moved to an annual schedule. Both are increasingly home to ML systems work (Alpa and Orca at OSDI 2022, vLLM's PagedAttention at SOSP 2023, etc.). Pick OSDI / SOSP for systems-flavored ML papers with deep technical content; the proceedings are essential for serious systems work.
NSDI (USENIX Symposium on Networked Systems Design and Implementation) is the networked-systems conference, increasingly home to AI-cluster networking papers (RoCE-versus-InfiniBand work, in-network aggregation). Pick NSDI for networking-focused training-systems research.
EuroSys is the European top-tier systems conference; complementary to OSDI / SOSP.
NeurIPS, ICML, ICLR are the canonical ML conferences. They are less systems-focused than MLSys but home to foundational papers (the original Transformer, GPT-3, Chinchilla, FlashAttention, etc.). Pick NeurIPS / ICML / ICLR for model and method advances; for systems-side training infrastructure, MLSys / SC are more directly relevant.
MLPerf and MLCommons working groups (ongoing) run the benchmark submission rounds where the canonical systems-comparison results emerge. Pick MLPerf / MLCommons participation when you specifically want to influence or follow the standardized training and inference benchmarks.

61.5.2 Foundational papers to read

A 20-paper canon for LLM systems at scale work, ordered roughly from architectural foundations through training systems through specific frontier-scale writeups.

"Attention Is All You Need" (Vaswani et al., 2017): the original Transformer paper. Foundational reading even though most practitioners know it. Pick when starting from first principles.
"Scaling Laws for Neural Language Models" (Kaplan et al., 2020) and "Training Compute-Optimal Large Language Models" (Chinchilla) (Hoffmann et al., 2022): the canonical scaling-law papers. Pick to understand why "20 tokens per parameter" became the Chinchilla rule of thumb and why later models train past that on saturating-but-still-useful data.
"Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism" (Shoeybi et al., 2019): the original tensor-parallelism paper. Pick for the foundational tensor parallelism technique.
"Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM" (Narayanan et al., 2021): the 3D parallelism (data + tensor + pipeline) paper. Pick for the canonical reference on combining parallelism dimensions.
"ZeRO: Memory Optimizations Toward Training Trillion Parameter Models" (Rajbhandari et al., 2020): the original ZeRO paper. Pick for the foundational sharded-data-parallel technique.
"ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning" (Rajbhandari et al., 2021): the ZeRO-3 paper with CPU and NVMe offloading. Pick for the technique behind memory-constrained large-model training.
"GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding" (Lepikhin et al., 2020): the foundational MoE paper. Pick for the technique behind every modern sparse MoE.
"Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity" (Fedus et al., 2021): the Switch Transformer paper. Pick for the simplified top-1 routing that influenced subsequent MoE work.
"Pathways: Asynchronous Distributed Dataflow for ML" (Barham et al., 2022): the Google Pathways paper. Pick for the multi-pod TPU training abstraction behind Gemini-class models.
"FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness" (Dao et al., 2022): the original FlashAttention paper. Foundational reading.
"FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-Precision" (Dao, 2024): the FA3 paper extending to FP8 and H100-specific patterns. Pick for the current state-of-the-art.
"PaLM: Scaling Language Modeling with Pathways" (Chowdhery et al., 2022): the PaLM 540B training writeup. Pick for the early TPU-pod frontier-training reference.
"OPT: Open Pretrained Transformer Language Models" (Zhang et al., 2022) and the accompanying OPT-175B Logbook: the OPT paper and the famous logbook documenting the day-by-day pretraining experience. The logbook is mandatory reading for anyone considering frontier training; it documents the kinds of operational failures that occur during multi-month runs.
"BLOOM: A 176B-Parameter Open-Access Multilingual Language Model" (BigScience, 2022) and the accompanying book chapters: the BLOOM technical report. Pick for the canonical open multilingual pretraining reference and for the operational lessons learned during the run.
"The Llama-3 Herd of Models" (Dubey et al., 2024): the 92-page Llama-3 technical report. Pick as the canonical "how a 2024 frontier dense model is trained" reference.
"DeepSeek-V3 Technical Report" (DeepSeek, 2024): the canonical reference for production-scale MoE training, with detailed FP8 recipe and the auxiliary-loss-free load-balancing technique. Pick as the 2024-25 reference for serious MoE work.
"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning" (DeepSeek, 2025): the R1 paper documenting GRPO-based reasoning training. Pick as the 2025 reference for RL-based reasoning.
"PaLM 2 Technical Report" (Anil et al., 2023): Google's PaLM 2 writeup. Pick for the Google-flavored frontier reference.
"Gemma: Open Models Based on Gemini Research and Technology" (Gemma Team, 2024): the Gemma technical report. Pick for the Google-trained open model approach.
"torchtitan: A Native PyTorch Library for Large Model Training" (Meta, 2024): the canonical 2024 PyTorch-native pretraining reference. Pick for the in-tree PyTorch 4D-parallelism design.

61.5.3 Engineering blogs and technical writeups

The engineering blogs are where the practical "how to actually do this" knowledge lives, often closer to production reality than academic papers.

Hugging Face Blog (Hugging Face, ongoing): the most active applied-ML engineering blog, with extensive "scale" series (Llama-3 fine-tuning at scale, FineWeb buildout, training large MoE, the SmolLM2 documentation, the Ultra-Scale Playbook by the Nanotron team). Pick as the single highest-volume source of applied-ML engineering content; the Ultra-Scale Playbook is mandatory reading for anyone training at scale.
PyTorch Engineering Blog (Meta, ongoing): the official PyTorch blog with deep-dive engineering posts on FSDP2, torch.compile, DTensor, async checkpointing, and the various torchtitan tutorials. Pick for in-tree PyTorch features and the canonical reference for PyTorch-native distributed training.
DeepSpeed Blog (Microsoft, ongoing): the DeepSpeed project blog with announcements and technical posts (ZeRO-Infinity, DeepSpeed-MoE, DeepSpeed-Ulysses for long-context). Pick for DeepSpeed-specific work.
Yi Tay's blog (Reka AI; formerly Google Brain): high-signal posts on training large models, lessons from PaLM / U-PaLM / Reka work, and the experience of running serious frontier training. Pick as one of the highest-density signals from a practitioner in the field.
Interconnects (Nathan Lambert, Allen Institute for AI): weekly newsletter and blog covering RLHF, post-training, and open-model developments with strong analytical depth. Pick as the single highest-quality regular newsletter for the alignment / post-training side of the field.
Databricks (Mosaic) Engineering Blog: extensive coverage of training large open models (DBRX, the Mosaic LLM training recipes), Composer framework, MLflow features. Pick for Mosaic / MosaicML / Databricks-flavored work.
NVIDIA Developer Blog (Generative AI section): NVIDIA-specific engineering content on Megatron, Transformer Engine, TensorRT-LLM, NCCL tuning. Pick when CUDA / NVIDIA-specific optimization matters.
Together AI Research Blog: the RedPajama series writeups, training-throughput optimizations, and inference-cost analyses.
Hazy Research (Stanford): the blog of Chris Ré's group, home to the FlashAttention writeups by Tri Dao and collaborators and to adjacent work on long-context attention, state-space models, etc.
LessWrong and Alignment Forum: where the alignment / safety side of frontier-scale work is discussed, including a substantial systems flavor (compute governance, training-run monitoring, etc.).
SemiAnalysis (Dylan Patel, paid): detailed (occasionally speculative) analysis of GPU supply chains, datacenter buildouts, frontier-lab cluster sizes, and the economics of training. Pick when you specifically need supply-chain / capacity-pricing analysis; the speculative parts should be treated as such.
Ahead of AI (Sebastian Raschka): high-quality long-form posts walking through training and fine-tuning techniques with code; especially valuable for practitioners coming up to speed.
Eugene Yan's blog: applied-ML systems writing with strong production focus, covering both LLM systems and broader ML engineering.

61.5.4 Communities and forums

Communities are where the tacit knowledge lives: which library breaks on which GPU driver, which dataset has known bugs, which inference engine actually handles MoE well at production scale. The 2026 high-signal communities:

r/LocalLLaMA: the largest open community for self-hosted LLM work, with active infrastructure threads on multi-GPU setups, quantization comparisons, inference engine benchmarking, and the practical experience of running open models. Pick for ground-truth on what is currently working for self-hosted deployment; the noise-to-signal ratio is high but the high-signal threads are very high-signal.
EleutherAI Discord: where much of the open-source training-research community gathers, with active channels for distributed training, eval harness, scaling laws, mech interp, and the various EleutherAI projects. Pick as a higher-signal community for serious training research; many of the lm-evaluation-harness maintainers and frontier-research practitioners are active there.
MLCommons working groups: the working groups that produce MLPerf results, with formal participation paths for systems teams. Pick when you specifically want to influence benchmark design.
Library issue trackers: the GitHub issue trackers for vLLM, SGLang, TensorRT-LLM, Megatron-LM, DeepSpeed, PyTorch, and Hugging Face transformers / accelerate are themselves communities. Bug reports and feature discussions on these trackers are often the canonical place to learn about known issues and workarounds.
r/MachineLearning: the broader ML subreddit, with a more academic-paper-discussion focus than r/LocalLLaMA.
Hugging Face Hub discussions: model-specific community threads on the Hub, often where users surface bugs, share fine-tunes, and discuss specific checkpoints.
X / Twitter: the field's de facto discussion forum, with a strong AI / ML community. The signal-to-noise ratio depends entirely on who you follow; the recommended core list includes the authors of the major training-systems papers, the maintainers of the major libraries, and the engineering leads at frontier labs. Pick X for fast-breaking developments; treat all claims as needing verification.
Slack: various private Slacks (Databricks, NVIDIA enterprise customer Slack, vendor-customer Slacks) where vendor-specific technical conversations happen. Pick when you have vendor relationships that grant access.

61.5.5 Newsletters and podcasts

Curated weekly or biweekly reading reduces the firehose to a tractable signal.

Interconnects (Nathan Lambert, weekly): as above; the highest-signal regular newsletter for post-training and open-model news.
The Algorithmic Bridge (Alberto Romero): broader AI commentary with regular systems content.
Ahead of AI (Sebastian Raschka, monthly): as above; long-form pedagogical content.
Language Models & Co (Jay Alammar): visualizations and conceptual explanations of LLM internals.
Latent Space podcast and newsletter (Swyx and Alessio): weekly podcast with practitioners from the AI engineering side, often surfacing tactical engineering knowledge that does not appear in papers.
The Gradient (paywall + free articles): long-form essays on AI / ML with stronger systems content than most academic-leaning outlets.
Import AI (Jack Clark, weekly): the longest-running serious AI policy and capability newsletter, with strong coverage of frontier-lab developments.
Dwarkesh Podcast: long-form interviews with researchers and engineers from frontier labs, often including significant discussion of training-systems and infrastructure topics.

61.5.6 Books on LLM systems and scale

The book-length references for serious systems work; relatively few specifically target LLM training scale, but several adjacent texts are essential.

"Deep Learning" (Goodfellow, Bengio, Courville, 2016, MIT Press): the canonical textbook. Foundational but pre-Transformer-era; pick for the mathematical foundations.
"Designing Machine Learning Systems" (Chip Huyen, 2022, O'Reilly): production-ML systems reference, with strong coverage of the ML lifecycle, training pipelines, and infrastructure choices.
Online textbooks and lecture notes: various university courses (Stanford CS336, MIT 6.S965, CMU 11-868, etc.) post lecture notes that function as an up-to-date alternative to traditional textbooks. Pick when you want structured coverage and the formal-textbook publishing lag is a problem.
The "AI Engineer" books (various, 2024-2026 publishing wave): a number of books targeting the practitioner audience are now in print, including titles on inference engines, fine-tuning, and production deployment.

61.5.7 A weekly reading cadence

A practitioner running production LLM systems at scale typically allocates 2 to 5 hours per week to reading. A reasonable allocation:

One paper per week from the canonical list (rotating through the 20-paper canon plus new releases). Read in depth: notes, reproduce key claims if possible.
One engineering blog per day from the high-signal feed (Hugging Face, PyTorch, NVIDIA, Together, Yi Tay, Interconnects). Skim; deep-read selectively.
Daily check of r/LocalLLaMA and EleutherAI Discord for tactical "what is breaking now" knowledge. 15 minutes is enough.
Weekly newsletter scan (Interconnects, Import AI, Ahead of AI): the curators have already filtered for you.
Quarterly: a survey paper or industry analysis (SemiAnalysis quarterly summaries, State of AI Report, etc.) for the broader landscape view.

Key Insight

The field changes fast enough that any reading list is stale; the meta-skill is curation

Any specific reading list in this section will be partly stale within six months. The 2024 list would have included papers and blogs that are now superseded; a mid-2024 list would have omitted DeepSeek-V3 (December 2024) and DeepSeek-R1 (January 2025), both of which became canonical reading within a quarter of release. The meta-skill is not "read this list" but "build a curation pipeline." Subscribe to the high-signal sources, follow the right people on X / GitHub / Discord, and refresh the list quarterly. The five-year-out reading list will look nothing like this one; the curation muscle is what carries over.
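
One way to make "build a curation pipeline" concrete is a small feed merger. The sketch below is a minimal illustration, not a recommended tool: it assumes each source exposes an RSS 2.0 feed with timezone-aware pubDate values, the names Item, parse_rss, and weekly_digest are hypothetical, and fetching the feed text (for example with urllib.request) is deliberately left out.

```python
import xml.etree.ElementTree as ET
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import NamedTuple


class Item(NamedTuple):
    source: str
    title: str
    link: str
    published: datetime


def parse_rss(source: str, xml_text: str) -> list[Item]:
    """Extract title, link, and pubDate from an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    items = []
    for node in root.iter("item"):
        title = node.findtext("title", default="").strip()
        link = node.findtext("link", default="").strip()
        pub = node.findtext("pubDate")
        if not (title and link and pub):
            continue  # skip malformed entries rather than guess
        items.append(Item(source, title, link, parsedate_to_datetime(pub)))
    return items


def weekly_digest(feeds: dict[str, str], per_source: int = 3) -> list[Item]:
    """Merge feeds newest-first, capping each source so none dominates."""
    merged = []
    for source, xml_text in feeds.items():
        newest = sorted(parse_rss(source, xml_text),
                        key=lambda item: item.published, reverse=True)
        merged.extend(newest[:per_source])
    return sorted(merged, key=lambda item: item.published, reverse=True)
```

The per-source cap is the design choice that matters: it keeps one high-volume blog from crowding out a low-volume, high-signal one, which is the same filtering the quarterly refresh does by hand.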

61.5.8 Engaging with the community

Beyond consumption, contribution to the community is itself a way to stay current. The most accessible paths:

Open-source contributions: file useful issue reports, submit PRs to vLLM / SGLang / Hugging Face transformers / Axolotl / etc. Even small documentation improvements have outsize impact.
Public writeups: a blog post about your training-scale lessons reaches more people than an internal document and improves your understanding. The community has rewarded specific, technical writeups (especially with failure-mode documentation) for years.
Conference submission: a workshop paper at MLSys, NeurIPS, or ICLR with even one novel systems insight is a strong way to engage with the academic-leaning community.
Public benchmark contributions: MLPerf submissions, lm-evaluation-harness benchmark additions, Open LLM Leaderboard submissions all engage with the formal benchmark-community process.
Library maintenance: becoming a co-maintainer of an open-source library is the deepest community engagement; even infrequent code review and triage is valuable.

Library Shortcut

huggingface_hub for shipping a checkpoint to the world

Once your fine-tune or distillation produces a checkpoint worth sharing, the huggingface_hub client (v0.26+, 2024 to 2026) is the canonical upload path. HfApi().upload_folder(...) uploads every file in a folder as a single commit, hashes large files so that content already on the Hub is not uploaded again, and returns commit information that includes the URL the community will cite. It does not write a model card for you; build one with ModelCard and save it into the folder as README.md before uploading. The same client downloads with snapshot_download(...) for license-checking and reproducibility audits before deployment.


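# Install first (shell): pip install -U huggingface_hub
import os

from huggingface_hub import HfApi, ModelCard, ModelCardData, login

# Read the access token from the environment instead of hard-coding it.
login(token=os.environ["HF_TOKEN"])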

# 1. Create the repo (private until you flip the switch).
api = HfApi()
api.create_repo("your-org/llama3-finetune-r1", private=True, exist_ok=True)

# 2. Write a model card with license, base model, and eval results.
card = ModelCard.from_template(
    card_data=ModelCardData(
        language="en", license="llama3", base_model="meta-llama/Meta-Llama-3-8B",
        tags=["fine-tuned", "instruction-tuned"],
    ),
    model_description="Llama-3-8B fine-tuned on OpenHermes-2.5.",
)
card.save("./out/README.md")

# 3. Push all checkpoint shards in one streaming upload.
api.upload_folder(
    folder_path="./out",
    repo_id="your-org/llama3-finetune-r1",
    repo_type="model",
    commit_message="v1.0: SFT on OpenHermes-2.5",
)

Code Fragment 61.5.8.1: One upload_folder call ships a multi-shard checkpoint, model card, and license metadata.
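
The download side mentioned above, snapshot_download(...) for license checks, can be sketched in the same spirit. This is a minimal sketch under stated assumptions, not a vetted audit tool: ALLOWED_LICENSES is a hypothetical policy, license_allowed and audit_repo are illustrative names, and the check assumes the model card declares a single license string in its metadata.

```python
from pathlib import Path

# Hypothetical allow-list; replace with your organization's actual policy.
ALLOWED_LICENSES = {"apache-2.0", "mit"}


def license_allowed(license_id, allowed=ALLOWED_LICENSES) -> bool:
    """True when a license is declared and it is on the allow-list."""
    return license_id is not None and str(license_id).lower() in allowed


def audit_repo(repo_id: str, revision: str) -> bool:
    """Fetch only metadata files at a pinned revision, then check the license."""
    # Imported here so the pure helper above works without the library.
    from huggingface_hub import ModelCard, snapshot_download

    local_dir = snapshot_download(
        repo_id=repo_id,
        revision=revision,  # pin a commit hash or tag for reproducibility
        allow_patterns=["README.md", "*.json"],  # skip the weight shards
    )
    card = ModelCard.load(Path(local_dir) / "README.md")
    return license_allowed(card.data.license)
```

Pinning revision to a commit hash is what makes the audit reproducible, since a branch name such as main can move after the check has passed. Restricting allow_patterns keeps the audit cheap by downloading the card and config files but not the weights.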

Real-World Scenario

How an open-weight team kept current in 2024-2026

A mid-sized open-weight research team in 2024-2026 reported the following reading and engagement pattern as their actual operational practice: every Monday, the team's tech lead reviewed the previous week's r/LocalLLaMA top posts (15 minutes), the Interconnects newsletter (15 minutes), and one chosen paper from arxiv-sanity (1 hour); every Thursday, the team's training-systems engineer reviewed the PyTorch and Hugging Face blogs and any new releases on the major training-framework GitHubs (30 minutes); the team filed at least one upstream issue or PR to a relevant open-source project every month; and the team gave one external talk per quarter (workshop, meetup, or conference). The senior team members credited this pattern with consistently staying ahead of competitor open-weight teams that did not maintain a similar discipline. The cost was approximately 3 hours per person per week, on the low end for serious technical fields. The lesson generalizes: a small, sustained reading-and-contribution practice substantially compounds, while sporadic catch-up reading does not.

61.5.9 Mapping the reading and community landscape

Figure 61.5.1: LLM scale reading and community map. The 2026 reading and community landscape for LLM systems at scale: papers, model hubs, leaderboards, blogs, conferences, and chat communities that keep practitioners current.

61.5.10 Vendor technical reports and frontier disclosures

Beyond peer-reviewed papers, the 2024-2026 frontier-lab technical reports are essential reading. These are released alongside model launches and provide the closest-to-ground-truth information about how frontier-scale work is actually done.

Anthropic model cards and addenda: each Claude release ships with a model card and capability addendum that documents (selectively) the capability and safety evaluations. Pick when you need the Anthropic-specific assertions about a model's training and evaluation.
OpenAI system cards and research releases: GPT-4, GPT-4o, o1, o3, and GPT-5 each have published system cards. The level of training-detail disclosure has decreased over time but the safety and red-teaming sections remain valuable.
Google DeepMind Gemini technical reports: the Gemini 1.0, 1.5, and 2.0 / 2.5 technical reports document the multi-pod TPU training stack, the long-context engineering, and the multimodal-native approach.
xAI Grok technical writeups: the Grok-1, Grok-1.5, Grok-2, Grok-3, and Grok-4 announcement posts and capability disclosures.
Meta Llama technical reports: Llama 1, 2, and 3 / 3.1 have papers, while Llama 3.2 and 4 were documented mainly through model cards and launch posts. The Llama 3 herd paper is the most detailed.
DeepSeek technical reports: DeepSeek-V2, V3, R1, and the various coder / math variants each have technical reports, generally with more architectural detail than the Western-lab equivalents.
Mistral, Qwen, Yi, and other open-weight technical reports: most open-weight releases ship a technical report or model card with substantial architectural and training detail.

These reports are typically the canonical reference for the model's architecture, training data scale (sometimes), training recipe (often), and evaluation results. Read the ones for the models you actually use; treat the safety / capability claims as marketing-flavored.

61.5.11 Survey papers and state-of-the-field reports

For a broad understanding of the field's trajectory, periodic survey papers and industry reports help:

State of AI Report (Nathan Benaich and Ian Hogarth, annual): the canonical annual state-of-the-field report covering research, industry, politics, and predictions. Pick as the yearly catch-up if you only have time for one comprehensive read.
Stanford AI Index Report (Stanford HAI, annual): the academic counterpart with extensive quantitative data on compute, talent, publications, and applications.
Periodic survey papers on arXiv: search for "survey" plus your topic of interest; the field publishes high-quality survey papers regularly on topics like alignment, long-context, MoE, RLHF, evaluation. Pick when entering a new sub-area.
Awesome-LLM and similar GitHub lists: community-curated reading lists. Pick for breadth; expect uneven curation quality.
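
The "search for survey plus your topic" advice above can be scripted against the public arXiv API. The sketch below is illustrative: survey_query_url and entry_titles are hypothetical helper names, and the query fields used (search_query with ti: and all: prefixes, sortBy, sortOrder, start, max_results) reflect the arXiv API as commonly documented, so check the current API manual before relying on them.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"
ATOM = {"atom": "http://www.w3.org/2005/Atom"}


def survey_query_url(topic: str, max_results: int = 10) -> str:
    """Build an arXiv API URL for recent papers with 'survey' in the title."""
    params = {
        "search_query": f'ti:survey AND all:"{topic}"',
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "start": 0,
        "max_results": max_results,
    }
    return f"{ARXIV_API}?{urlencode(params)}"


def entry_titles(atom_xml: str) -> list[str]:
    """Pull paper titles out of the Atom feed that the API returns."""
    root = ET.fromstring(atom_xml)
    return [
        # Titles can wrap across lines; collapse runs of whitespace.
        " ".join(entry.findtext("atom:title", default="", namespaces=ATOM).split())
        for entry in root.findall("atom:entry", ATOM)
    ]
```

To use it, fetch survey_query_url("mixture of experts") with any HTTP client and pass the response body to entry_titles. Sorting by submission date surfaces the newest surveys first, which matters in a field where a two-year-old survey may already be missing the current techniques.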

61.5.12 Courses and structured learning

For practitioners new to the field or refreshing fundamentals, several courses have become canonical references:

Stanford CS336 (Language Modeling from Scratch): Stanford's course covering the systems side of building LLMs, with strong coverage of distributed training, tokenization, and evaluation. Lecture videos and notes are publicly available.
Stanford CS224N (NLP with Deep Learning): the long-running Stanford NLP course, regularly updated for LLMs.
Hugging Face Courses: free, regularly updated courses on the Hugging Face ecosystem (transformers, datasets, accelerate, RLHF, evaluation).
Full Stack Deep Learning: the longest-running practitioner-oriented course covering ML systems engineering and production deployment.
Karpathy's nanoGPT and follow-on materials: the canonical "build a GPT from scratch" reference, with accompanying YouTube videos that have become essential pedagogy.

Looking Back

The platform / library / dataset / model decomposition of Chapter 61

Chapter 61 organized the scale tools-of-the-trade catalog along four axes that re-appear in every chapter of Parts VII-XII: platforms (61.1: hyperscalers vs specialized GPU clouds vs in-house datacenters, plus HPC schedulers, parallel storage, and training observability); libraries and frameworks (61.2: Megatron / DeepSpeed / FSDP2 / Colossal-AI as the foundation distributed-training layer, plus the high-level recipes, optimization kernels, communication libraries, orchestrators, and compilers); datasets and benchmarks (61.3: open pretraining corpora like FineWeb / RedPajama-v2, alignment / instruction datasets, multimodal corpora, MLPerf Training, lm-eval-harness, the canary-and-decontamination methodology); and models (61.4: the open-weight frontier from Llama through DeepSeek, the closed-weight frontier from Claude / GPT / Gemini, and the scaling-law-derived choice of which size to actually train). The fifth section (61.5) was the reading list and community map that lets you keep all four axes current as the field moves. The pattern generalizes: every "tools of the trade" chapter in this book is organized as platforms / libraries / datasets / models / external reading. Master this decomposition once and you have a template for navigating any of them.

What's Next: Part XIII closes the loop with LLMOps

Continue to Section 62.1: Scaling, Performance & Production Guardrails. Part XII (Chapters 56-61) covered LLM systems at scale: the platforms, hardware, training systems, edge deployment, and tools-of-the-trade catalog. Part XIII (LLMOps, Chapters 62-66) now turns from "build the system" to "operate the system in production for years." The transition is direct: Chapter 62 picks up exactly where Chapter 61 leaves off, covering the production engineering core (deployment, scaling, performance guardrails, the SRE practices that turn a 50-day pretraining run into a 5-year production assistant). Chapters 63-66 then cover the MLOps lifecycle (CI/CD for LLMs, model registries, drift detection), observability and monitoring at production cadence (the cluster-side observability of 59.5 generalized to per-request telemetry), incident response and continuous improvement, and the LLMOps tools-of-the-trade catalog. The same platform / library / dataset / model decomposition recurs in Chapter 66. The conceptual thread is that training is a finite project but operation is forever: the scale chapter taught you to spend $3M efficiently over two months; the LLMOps chapters teach you to spend $50K per month well, indefinitely.

61.5.13 Research labs and groups to follow

Beyond following individual papers, following specific labs gives you advance notice of upcoming work:

Frontier labs: Anthropic, OpenAI, Google DeepMind, Meta AI, xAI: the closed frontier. Follow their blog feeds and key researchers on X.
Major open-weight labs: DeepSeek, Mistral, Alibaba (Qwen), 01.AI (Yi), TII (Falcon), Snowflake, Databricks (Mosaic), Hugging Face research: the open frontier. Follow their model card releases and any associated papers.
Academic systems labs: Stanford (Hazy Research, CRFM), CMU Catalyst, MIT-IBM Watson AI Lab, Berkeley Sky Computing Lab, UW (SAMPL historically; now broader): where systems-side innovation often originates.
Independent research collectives: EleutherAI, BigScience, LAION, Nous Research, Allen AI (AI2): open-research collectives that have produced many of the foundational open datasets and small models.
Vendor research: NVIDIA Research, Microsoft Research (DeepSpeed team), AWS AI, Together Research: vendor research groups whose work often becomes foundational systems software.

Further Reading

Vaswani, A. et al. (2017). "Attention Is All You Need." NeurIPS 2017. arxiv.org/abs/1706.03762. The original Transformer paper; the foundational reading that all of LLM systems work descends from.

Hoffmann, J. et al. (2022). "Training Compute-Optimal Large Language Models." arXiv preprint arXiv:2203.15556. arxiv.org/abs/2203.15556. The Chinchilla paper; the canonical reference for compute-optimal scaling that defines the "20 tokens per parameter" rule.

BigScience Workshop (2022). "BLOOM: A 176B-Parameter Open-Access Multilingual Language Model." arXiv preprint arXiv:2211.05100. arxiv.org/abs/2211.05100. The BLOOM technical report; the canonical open multilingual pretraining reference and a foundational logbook of what frontier training operations look like.

Zhang, S. et al. (2022). "OPT-175B Logbook." Meta AI Research. github.com/facebookresearch/metaseq/projects/OPT/chronicles. The OPT-175B logbook; mandatory reading on the operational reality of frontier-scale pretraining failures and recovery.

Huyen, C. (2022). Designing Machine Learning Systems. O'Reilly Media. oreilly.com/library/view/designing-machine-learning. The production-ML-systems reference book; broader than just LLM scale but the canonical book-length reference on ML systems engineering.

Lambert, N. (2024). "Interconnects: AI, policy, and post-training." Substack newsletter. interconnects.ai. The highest-signal regular newsletter for post-training, RLHF, and open-model developments in 2024-2026.

Frontier-Lab Model Disclosures

Dubey, A., et al. (2024). "The Llama 3 Herd of Models." Meta AI. arXiv:2407.21783

DeepSeek-AI (2024). "DeepSeek-V3 Technical Report." DeepSeek. arXiv:2412.19437

Gemini Team, Google (2024). "Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context." Google DeepMind. arXiv:2403.05530

Anthropic (2024). "Claude 3.5 Sonnet Model Card Addendum." Anthropic. anthropic.com Model Card

Qwen Team (2024). "Qwen2.5 Technical Report." Alibaba Cloud. arXiv:2412.15115

Pretraining Frameworks

Shoeybi, M., et al. (2019). "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism." NVIDIA. arXiv:1909.08053

Rajbhandari, S., et al. (2020). "ZeRO: Memory Optimizations Toward Training Trillion Parameter Models." SC20. arXiv:1910.02054

Zhao, Y., et al. (2023). "PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel." VLDB 2023. arXiv:2304.11277

Liang, W., et al. (2024). "torchtitan: One-stop PyTorch native solution for production ready LLM pretraining." Meta AI. arXiv:2410.06511

Hugging Face (2024). "nanotron: Minimalistic large language model 3D-parallelism training." GitHub. github.com/huggingface/nanotron

Compute Economics and Scaling Laws

Kaplan, J., et al. (2020). "Scaling Laws for Neural Language Models." OpenAI. arXiv:2001.08361

Hoffmann, J., et al. (2022). "Training Compute-Optimal Large Language Models (Chinchilla)." DeepMind. arXiv:2203.15556

Sevilla, J., et al. (2022). "Compute Trends Across Three Eras of Machine Learning." IJCNN 2022. arXiv:2202.05924

Community Resources and Newsletters

Epoch AI (2024). "Tracking Large-Scale AI Models: Methodology and Database." Epoch AI. epoch.ai/data/large-scale-ai-models

Stanford CRFM (2024). "HELM: Holistic Evaluation of Language Models." Stanford Center for Research on Foundation Models. crfm.stanford.edu/helm