Certain code patterns appear so frequently in LLM work that they are worth memorizing. This section collects the most important ones. Code Fragment C.4.1 below puts this into practice.
Pattern 1: Loading a Model with Automatic Device Mapping
This snippet loads a causal language model with automatic device placement and 4-bit quantization using Hugging Face Transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-Instruct-v0.3"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",  # automatically choose float16/bfloat16
    device_map="auto",   # spread across available GPUs
    load_in_4bit=True,   # 4-bit quantization to save memory
)
The key arguments are device_map="auto", which spreads layers across available GPUs, and load_in_4bit=True, which compresses weights from 16-bit to 4-bit, reducing memory from roughly 14 GB to roughly 4 GB for a 7B model.

Pattern 2: Chat-Style Inference with Templates
This snippet applies a chat template to a multi-turn conversation and generates a response from the model.
# Most modern models use chat templates
messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a Python function to compute Fibonacci numbers."},
]
# Apply the model's chat template
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Generate
outputs = model.generate(**inputs, max_new_tokens=256, temperature=0.7, do_sample=True)
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
The call to apply_chat_template() formats a multi-turn conversation into the model's expected token layout. The slice outputs[0][inputs["input_ids"].shape[1]:] strips the prompt tokens from the generated output, leaving only the model's reply.

Pattern 3: Batch Processing with a DataLoader
This snippet tokenizes a dataset in batches and wraps it in a PyTorch DataLoader for efficient iteration.
from torch.utils.data import DataLoader
from datasets import load_dataset
# Load a dataset
dataset = load_dataset("imdb", split="test")
# Tokenize in batches
def tokenize_fn(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=512)
tokenized = dataset.map(tokenize_fn, batched=True, batch_size=64)
tokenized.set_format("torch", columns=["input_ids", "attention_mask", "label"])
# Create a DataLoader for efficient batched iteration
loader = DataLoader(tokenized, batch_size=16, shuffle=False)
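The loader is then consumed batch by batch. A minimal sketch of the iteration step, using synthetic token IDs in place of the tokenized IMDB data so it runs without any downloads:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-ins for input_ids, attention_mask, and label
input_ids = torch.randint(0, 1000, (64, 32))
attention_mask = torch.ones_like(input_ids)
labels = torch.randint(0, 2, (64,))

loader = DataLoader(TensorDataset(input_ids, attention_mask, labels),
                    batch_size=16, shuffle=False)

num_batches = 0
for ids, mask, y in loader:
    # In real code, batch tensors would be moved to model.device here
    assert ids.shape == (16, 32)
    num_batches += 1
print(num_batches)  # 64 examples / batch size 16 = 4 batches
```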
Pattern 4: API Calls with Retry Logic
This snippet wraps an OpenAI API call with exponential backoff to handle rate limits gracefully.
import time
from openai import OpenAI
client = OpenAI()
def call_with_retry(prompt, max_retries=3, base_delay=1.0):
    """Call the API with exponential backoff on rate limits."""
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="gpt-4o",
                messages=[{"role": "user", "content": prompt}],
                temperature=0.7,
            )
            return response.choices[0].message.content
        except Exception as e:
            if "rate_limit" in str(e).lower() and attempt < max_retries - 1:
                delay = base_delay * (2 ** attempt)
                print(f"Rate limited. Waiting {delay}s...")
                time.sleep(delay)
            else:
                raise
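The backoff logic itself can be verified without touching the network. A self-contained sketch with a deliberately flaky stand-in function (retry_with_backoff and fake_api are invented here for illustration; they mirror the structure of call_with_retry above):

```python
import time

def retry_with_backoff(fn, max_retries=3, base_delay=0.01):
    """Same retry skeleton as call_with_retry, generic over any callable."""
    for attempt in range(max_retries):
        try:
            return fn()
        except RuntimeError as e:
            if "rate_limit" in str(e) and attempt < max_retries - 1:
                time.sleep(base_delay * (2 ** attempt))  # 0.01s, 0.02s, ...
            else:
                raise

state = {"calls": 0}
def fake_api():
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("rate_limit")  # fail the first two attempts
    return "ok"

print(retry_with_backoff(fake_api))  # succeeds on the third attempt: "ok"
```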
Pattern 5: Saving and Loading Checkpoints
This snippet saves a fine-tuned model and tokenizer to disk, reloads them, and pushes them to the Hugging Face Hub.
# Save a fine-tuned model
model.save_pretrained("./my-finetuned-model")
tokenizer.save_pretrained("./my-finetuned-model")
# Load it back later
model = AutoModelForCausalLM.from_pretrained("./my-finetuned-model")
tokenizer = AutoTokenizer.from_pretrained("./my-finetuned-model")
# Push to Hugging Face Hub
model.push_to_hub("your-username/my-finetuned-model")
tokenizer.push_to_hub("your-username/my-finetuned-model")
Thanks to the pipeline API, you can run a language model in two lines of Python: one to create the pipeline, one to call it. The entire transformer architecture, tokenization, and decoding are handled behind the scenes. This is both a blessing (rapid prototyping) and a danger (it is easy to treat the model as a black box without understanding its behavior).
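As a concrete illustration of the two-line claim (gpt2 is used here only because it is small; any causal LM on the Hub works, and the weights are downloaded on first use):

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")      # line 1: build the pipeline
out = generator("The quick brown fox", max_new_tokens=8)   # line 2: call it
print(out[0]["generated_text"])
```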
Python Toolkit Summary
- transformers is your primary interface to pretrained models. Learn its Auto classes and pipeline API first.
- Use Conda for GPU environments (it manages CUDA) or uv for speed. Always use a virtual environment.
- Jupyter notebooks for exploration; Python scripts for production. Always validate notebooks by running all cells from scratch.
- Master the five patterns: model loading, chat templates, batch processing, API retry, and checkpoint management.
- The Hugging Face ecosystem (datasets, peft, trl, accelerate) covers the full LLM lifecycle from data to deployment.
What Comes Next
Continue to Appendix D: Development Environment Setup for the next reference appendix in this collection.