Prerequisites
Python with pandas and NumPy. SQL and data pipeline experience. Chapters 0 through 5 as a foundations review. Basic cloud infrastructure concepts.
Data Engineering Track
Building and curating datasets for LLM training, fine-tuning, and evaluation.
Learning Sequence
Follow the numbered steps in order. Each step builds on the previous one to give you a coherent understanding of this topic area.
1. Section 06.4: Data Curation at Scale (how pre-training corpora like FineWeb and Dolma are assembled)
2. Chapter 13: Synthetic Data Generation (full chapter on Evol-Instruct, self-play, quality filtering)
3. Section 19.4: Document Processing and Chunking (turning raw documents into structured inputs)
4. Section 14.6: Fine-Tuning for Classification (data quality requirements for supervised fine-tuning)
5. Chapter 29: Evaluation, Experiment Design and Observability (benchmarking datasets and measuring model performance)
6. Chapter 34: Emerging Architectures and Scaling Frontiers (scaling laws, state-space models, and data requirements for new architectures)
7. Chapter 35: AI, Society and Open Problems (open-weight debate, data governance, and societal implications)
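To preview the track's central theme of curating datasets, here is a minimal sketch of the kind of heuristic quality filtering covered in the steps above, using pandas (listed in the prerequisites). The toy corpus, the five-word threshold, and the column names are illustrative assumptions, not examples from the book.

```python
import pandas as pd

# Toy corpus: raw documents with varying quality signals (illustrative data).
docs = pd.DataFrame({
    "text": [
        "A well-formed paragraph about data pipelines and curation.",
        "short",
        "A well-formed paragraph about data pipelines and curation.",  # exact duplicate
        "Another substantive document describing evaluation datasets.",
    ]
})

# Two simple heuristics of the kind large-scale curation pipelines apply:
# 1. drop near-empty documents (here: fewer than 5 words),
# 2. drop exact duplicates.
docs["n_words"] = docs["text"].str.split().str.len()
curated = (
    docs[docs["n_words"] >= 5]
    .drop_duplicates(subset="text")
    .reset_index(drop=True)
)

print(len(curated))  # 2 documents survive the filters
```

Production pipelines such as those behind FineWeb and Dolma apply far richer filters (language identification, perplexity scoring, near-duplicate detection), but they compose in the same way: each stage narrows the corpus before the next runs.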
Recommended Appendices
- Appendix K: HuggingFace: Transformers, Datasets, and Hub – access models, datasets, and pipelines on HuggingFace
- Appendix O: LlamaIndex – build retrieval pipelines with LlamaIndex
- Appendix D: Environment Setup – set up your data engineering environment
What Comes Next
Return to the Course Syllabi overview to explore other tracks and courses, or proceed to FM.4: How to Use This Book for a quick orientation on conventions and callout types.