Front Matter
FM.3: Course Syllabi

Data Engineering Track

Prerequisites

Python with pandas and NumPy; SQL and data pipeline experience; Chapters 0 through 5 as a foundations review; basic cloud infrastructure concepts.

This track covers building and curating datasets for LLM training, fine-tuning, and evaluation.

Learning Sequence

Follow the numbered steps in order; each step builds on the previous one.

  1. Section 06.4: Data Curation at Scale (how pre-training corpora like FineWeb and Dolma are assembled)
  2. Chapter 13: Synthetic Data Generation (full chapter on Evol-Instruct, self-play, quality filtering)
  3. Section 19.4: Document Processing and Chunking (turning raw documents into structured inputs)
  4. Section 14.6: Fine-Tuning for Classification (data quality requirements for supervised fine-tuning)
  5. Chapter 29: Evaluation, Experiment Design and Observability (benchmarking datasets and measuring model performance)
  6. Chapter 34: Emerging Architectures and Scaling Frontiers (scaling laws, state-space models, and data requirements for new architectures)
  7. Chapter 35: AI, Society and Open Problems (open-weight debate, data governance, and societal implications)

Recommended Appendices

What Comes Next

Return to the Course Syllabi overview to explore other tracks and courses, or proceed to FM.4: How to Use This Book for a quick orientation on conventions and callout types.