Prerequisites
Python with pandas and NumPy. SQL and data pipeline experience. Chapters 0 through 5 as a foundations review. Basic cloud infrastructure concepts.
Data Engineering Track
Building and curating datasets for LLM training, fine-tuning, and evaluation.
Learning Sequence
Follow the numbered steps in order. Each step builds on the previous one to give you a coherent understanding of this topic area.
1. Section 06.4: Data Curation at Scale (how pre-training corpora like FineWeb and Dolma are assembled)
2. Chapter 13: Synthetic Data Generation (full chapter on Evol-Instruct, self-play, quality filtering)
3. Section 19.4: Document Processing and Chunking (turning raw documents into structured inputs)
4. Section 14.6: Fine-Tuning for Classification (data quality requirements for supervised fine-tuning)
5. Chapter 29: Evaluation, Experiment Design and Observability (benchmarking datasets and measuring model performance)
6. Chapter 34: Emerging Architectures and Scaling Frontiers (scaling laws, state-space models, and data requirements for new architectures)
7. Chapter 35: AI, Society and Open Problems (open-weight debate, data governance, and societal implications)
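To preview the track's central theme of curating datasets, here is a minimal sketch of the kind of heuristic quality filtering covered in the steps above, using pandas (listed in the prerequisites). The toy corpus, the five-word threshold, and the column names are illustrative assumptions, not examples from the book.

```python
import pandas as pd

# Toy corpus: raw documents with varying quality signals (illustrative data).
docs = pd.DataFrame({
    "text": [
        "A well-formed paragraph about data pipelines and curation.",
        "short",
        "A well-formed paragraph about data pipelines and curation.",  # exact duplicate
        "Another substantive document describing evaluation datasets.",
    ]
})

# Two simple heuristics of the kind large-scale curation pipelines apply:
# 1. drop near-empty documents (here: fewer than 5 words),
# 2. drop exact duplicates.
docs["n_words"] = docs["text"].str.split().str.len()
curated = (
    docs[docs["n_words"] >= 5]
    .drop_duplicates(subset="text")
    .reset_index(drop=True)
)

print(len(curated))  # 2 documents survive the filters
```

Production pipelines such as those behind FineWeb and Dolma apply far richer filters (language identification, perplexity scoring, near-duplicate detection), but they compose in the same way: each stage narrows the corpus before the next runs.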
Recommended Appendices
- Appendix K: HuggingFace: Transformers, Datasets, and Hub – access models, datasets, and pipelines on HuggingFace
- Appendix O: LlamaIndex – build retrieval pipelines with LlamaIndex
- Appendix D: Environment Setup – set up your data engineering environment
What Comes Next
Return to the Course Syllabi overview to explore other tracks and courses, or proceed to FM.4: How to Use This Book for a quick orientation on conventions and callout types.