Pathway 7: "I'm a Data Scientist Adding LLMs to My Toolkit" (Data Scientist / Analyst)
Target audience: Data scientists and analysts who already know Python, pandas, scikit-learn, and basic deep learning
Goal: Learn where LLMs complement (not replace) your existing ML models, how to build hybrid pipelines, and how to use LLMs for feature engineering, classification, extraction, and analytics.
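To make "hybrid pipeline" concrete before diving into the chapters: a minimal sketch of combining classical scikit-learn features with an LLM-derived feature column. The `llm_score` helper here is a hypothetical stand-in for a real LLM API call (covered in Ch 10); in practice it would return, say, a model-judged sentiment or relevance score.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in for a real LLM API call: in a real pipeline this
# would query an LLM for a per-document score (sentiment, relevance, etc.).
def llm_score(text: str) -> float:
    return min(len(text) / 100.0, 1.0)  # crude length-based proxy

texts = [
    "great product, works perfectly",
    "terrible, broke after one day",
    "excellent value and fast shipping",
    "awful experience, do not buy",
]
labels = [1, 0, 1, 0]

# Classical sparse features from scikit-learn...
tfidf = TfidfVectorizer().fit(texts)
X_classical = tfidf.transform(texts).toarray()

# ...concatenated with an LLM-derived feature column.
X_llm = np.array([[llm_score(t)] for t in texts])
X = np.hstack([X_classical, X_llm])

clf = LogisticRegression().fit(X, labels)
preds = clf.predict(X)
```

The design point is that the LLM output becomes just another feature column, so everything downstream (cross-validation, metrics, model comparison) stays in your existing scikit-learn workflow.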
Chapter Guide
- Skim Ch 00: ML and PyTorch Foundations – refresh PyTorch if needed
- Skim Ch 01: NLP and Text Representation – refresh NLP basics if needed
- Skim Ch 02: Tokenization – refresh tokenization if needed
- Focus Ch 03: Sequence Models and Attention – bridge from classical sequence models to attention
- Skim Ch 04: The Transformer Architecture – optional deep dive into transformer internals
- Focus Ch 05: Decoding and Text Generation – understand how generation differs from prediction
- Skim Ch 06: Pre-training and Scaling Laws – especially Section 06.4 on data curation for LLM training
- Skim Ch 07: The Modern LLM Landscape – choose the right model for each analytical task
- Focus Ch 10: Working with LLM APIs – call LLMs from your data pipelines
- Focus Ch 11: Prompt Engineering – prompt LLMs for extraction and classification
- Focus Ch 12: Hybrid ML+LLM Architectures – combine scikit-learn and LLMs in one pipeline
- Skim Ch 13: Synthetic Data Generation – generate labeled data when labels are scarce
- Skim Ch 14: Fine-Tuning Fundamentals – especially Section 14.6 on fine-tuning for classification and extraction
- Focus Ch 19: Embeddings and Vector Databases – embed your datasets for similarity search
- Focus Ch 20: RAG – build retrieval over your knowledge bases
- Focus Ch 28: LLM Applications – applied patterns for analytics and reporting
- Focus Ch 29: Evaluation and Experiment Design – evaluate LLM outputs with your existing metrics
- Skim Ch 34: Emerging Architectures – scaling laws and new model designs to watch
- Optional Ch 35: AI and Society – responsible AI context for data practitioners
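The embeddings and RAG chapters above (Ch 19–20) rest on one core operation: ranking documents by vector similarity to a query. A minimal sketch with plain NumPy, using random vectors as hypothetical placeholders for real embeddings (which would come from an embedding model or API, as covered in Ch 19):

```python
import numpy as np

# Hypothetical embeddings: random vectors stand in for the output of a
# real embedding model. 5 documents, 8-dimensional embeddings.
rng = np.random.default_rng(0)
doc_embeddings = rng.normal(size=(5, 8))
query = rng.normal(size=8)

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Score every document against the query, then take the top 3.
scores = np.array([cosine_sim(query, d) for d in doc_embeddings])
top = np.argsort(scores)[::-1][:3]  # indices of the 3 most similar docs
```

A vector database performs the same nearest-neighbor lookup, just with indexing structures that keep it fast at millions of documents instead of five.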
Recommended Appendices
- Appendix K: HuggingFace: Transformers, Datasets, and Hub – access pretrained models and datasets on HuggingFace
- Appendix R: Experiment Tracking – track experiments and compare model runs
- Appendix J: Datasets and Benchmarks – explore benchmark datasets for evaluation
What Comes Next
Return to the Reading Pathways overview to explore other pathways, or proceed to FM.4: How to Use This Book for a quick orientation on conventions and callout types, then start reading.