Pathway 7: "I'm a Data Scientist Adding LLMs to My Toolkit" (Data Scientist / Analyst)
Target audience: Data scientists and analysts who already know Python, pandas, scikit-learn, and basic deep learning
Goal: Learn where LLMs complement (not replace) your existing ML models, how to build hybrid pipelines, and how to use LLMs for feature engineering, classification, extraction, and analytics.
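To make "hybrid pipeline" concrete before diving into the chapters: a minimal sketch of combining classical scikit-learn features with an LLM-derived feature column. The `llm_score` helper here is a hypothetical stand-in for a real LLM API call (covered in Ch 10); in practice it would return, say, a model-judged sentiment or relevance score.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in for a real LLM API call: in a real pipeline this
# would query an LLM for a per-document score (sentiment, relevance, etc.).
def llm_score(text: str) -> float:
    return min(len(text) / 100.0, 1.0)  # crude length-based proxy

texts = [
    "great product, works perfectly",
    "terrible, broke after one day",
    "excellent value and fast shipping",
    "awful experience, do not buy",
]
labels = [1, 0, 1, 0]

# Classical sparse features from scikit-learn...
tfidf = TfidfVectorizer().fit(texts)
X_classical = tfidf.transform(texts).toarray()

# ...concatenated with an LLM-derived feature column.
X_llm = np.array([[llm_score(t)] for t in texts])
X = np.hstack([X_classical, X_llm])

clf = LogisticRegression().fit(X, labels)
preds = clf.predict(X)
```

The design point is that the LLM output becomes just another feature column, so everything downstream (cross-validation, metrics, model comparison) stays in your existing scikit-learn workflow.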
Chapter Guide
- Skim Ch 00: ML and PyTorch Foundations – refresh PyTorch if needed
- Skim Ch 01: NLP and Text Representation – refresh NLP basics if needed
- Skim Ch 02: Tokenization – refresh tokenization if needed
- Focus Ch 03: Sequence Models and Attention – bridge from classical sequence models to attention
- Skim Ch 04: The Transformer Architecture – optional deep dive into transformer internals
- Focus Ch 05: Decoding and Text Generation – understand how generation differs from prediction
- Skim Ch 06: Pre-training and Scaling Laws – especially Section 06.4 on data curation for LLM training
- Skim Ch 07: The Modern LLM Landscape – choose the right model for each analytical task
- Focus Ch 10: Working with LLM APIs – call LLMs from your data pipelines
- Focus Ch 11: Prompt Engineering – prompt LLMs for extraction and classification
- Focus Ch 12: Hybrid ML+LLM Architectures – combine scikit-learn and LLMs in one pipeline
- Skim Ch 13: Synthetic Data Generation – generate labeled data when labels are scarce
- Skim Ch 14: Fine-Tuning Fundamentals – especially Section 14.6 on fine-tuning for classification and extraction
- Focus Ch 19: Embeddings and Vector Databases – embed your datasets for similarity search
- Focus Ch 20: RAG – build retrieval over your knowledge bases
- Focus Ch 28: LLM Applications – applied patterns for analytics and reporting
- Focus Ch 29: Evaluation and Experiment Design – evaluate LLM outputs with your existing metrics
- Skim Ch 34: Emerging Architectures – scaling laws and new model designs to watch
- Optional Ch 35: AI and Society – responsible AI context for data practitioners
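The embeddings and RAG chapters above (Ch 19–20) rest on one core operation: ranking documents by vector similarity to a query. A minimal sketch with plain NumPy, using random vectors as hypothetical placeholders for real embeddings (which would come from an embedding model or API, as covered in Ch 19):

```python
import numpy as np

# Hypothetical embeddings: random vectors stand in for the output of a
# real embedding model. 5 documents, 8-dimensional embeddings.
rng = np.random.default_rng(0)
doc_embeddings = rng.normal(size=(5, 8))
query = rng.normal(size=8)

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Score every document against the query, then take the top 3.
scores = np.array([cosine_sim(query, d) for d in doc_embeddings])
top = np.argsort(scores)[::-1][:3]  # indices of the 3 most similar docs
```

A vector database performs the same nearest-neighbor lookup, just with indexing structures that keep it fast at millions of documents instead of five.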
Recommended Appendices
- Appendix K: HuggingFace: Transformers, Datasets, and Hub – access pretrained models and datasets on HuggingFace
- Appendix R: Experiment Tracking – track experiments and compare model runs
- Appendix J: Datasets and Benchmarks – explore benchmark datasets for evaluation
What Comes Next
Return to the Reading Pathways overview to explore other pathways, or proceed to FM.4: How to Use This Book for a quick orientation on conventions and callout types, then start reading.