Certain code patterns appear so frequently in LLM work that they are worth memorizing. This section collects the most important ones. Code Fragment C.4.1 below puts this into practice.
Pattern 1: Loading a Model with Automatic Device Mapping
This snippet loads a causal language model with automatic device placement and 4-bit quantization using Hugging Face Transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-Instruct-v0.3"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",  # automatically choose float16/bfloat16
    device_map="auto",   # spread across available GPUs
    load_in_4bit=True,   # 4-bit quantization to save memory
)
The key arguments are device_map="auto", which spreads layers across available GPUs, and load_in_4bit=True, which compresses weights from 16-bit to 4-bit, reducing memory from roughly 14 GB to roughly 4 GB for a 7B model.

Pattern 2: Chat-Style Inference with Templates
This snippet applies a chat template to a multi-turn conversation and generates a response from the model.
# Most modern models use chat templates
messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a Python function to compute Fibonacci numbers."},
]
# Apply the model's chat template
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Generate
outputs = model.generate(**inputs, max_new_tokens=256, temperature=0.7, do_sample=True)
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
The call to apply_chat_template() formats a multi-turn conversation into the model's expected token layout. The slice outputs[0][inputs["input_ids"].shape[1]:] strips the prompt tokens from the generated output, leaving only the model's reply.

Pattern 3: Batch Processing with a DataLoader
This snippet tokenizes a dataset in batches and wraps it in a PyTorch DataLoader for efficient iteration.
from torch.utils.data import DataLoader
from datasets import load_dataset
# Load a dataset
dataset = load_dataset("imdb", split="test")
# Tokenize in batches
def tokenize_fn(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=512)
tokenized = dataset.map(tokenize_fn, batched=True, batch_size=64)
tokenized.set_format("torch", columns=["input_ids", "attention_mask", "label"])
# Create a DataLoader for efficient batched iteration
loader = DataLoader(tokenized, batch_size=16, shuffle=False)
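The loader is then consumed batch by batch. A minimal sketch of the iteration step, using synthetic token IDs in place of the tokenized IMDB data so it runs without any downloads:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-ins for input_ids, attention_mask, and label
input_ids = torch.randint(0, 1000, (64, 32))
attention_mask = torch.ones_like(input_ids)
labels = torch.randint(0, 2, (64,))

loader = DataLoader(TensorDataset(input_ids, attention_mask, labels),
                    batch_size=16, shuffle=False)

num_batches = 0
for ids, mask, y in loader:
    # In real code, batch tensors would be moved to model.device here
    assert ids.shape == (16, 32)
    num_batches += 1
print(num_batches)  # 64 examples / batch size 16 = 4 batches
```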
Pattern 4: API Calls with Retry Logic
This snippet wraps an OpenAI API call with exponential backoff to handle rate limits gracefully.
import time
from openai import OpenAI
client = OpenAI()
def call_with_retry(prompt, max_retries=3, base_delay=1.0):
    """Call the API with exponential backoff on rate limits."""
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="gpt-4o",
                messages=[{"role": "user", "content": prompt}],
                temperature=0.7,
            )
            return response.choices[0].message.content
        except Exception as e:
            if "rate_limit" in str(e).lower() and attempt < max_retries - 1:
                delay = base_delay * (2 ** attempt)
                print(f"Rate limited. Waiting {delay}s...")
                time.sleep(delay)
            else:
                raise
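The backoff logic itself can be verified without touching the network. A self-contained sketch with a deliberately flaky stand-in function (retry_with_backoff and fake_api are invented here for illustration; they mirror the structure of call_with_retry above):

```python
import time

def retry_with_backoff(fn, max_retries=3, base_delay=0.01):
    """Same retry skeleton as call_with_retry, generic over any callable."""
    for attempt in range(max_retries):
        try:
            return fn()
        except RuntimeError as e:
            if "rate_limit" in str(e) and attempt < max_retries - 1:
                time.sleep(base_delay * (2 ** attempt))  # 0.01s, 0.02s, ...
            else:
                raise

state = {"calls": 0}
def fake_api():
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("rate_limit")  # fail the first two attempts
    return "ok"

print(retry_with_backoff(fake_api))  # succeeds on the third attempt: "ok"
```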
Pattern 5: Saving and Loading Checkpoints
This snippet saves a fine-tuned model and tokenizer to disk, reloads them, and pushes them to the Hugging Face Hub.
# Save a fine-tuned model
model.save_pretrained("./my-finetuned-model")
tokenizer.save_pretrained("./my-finetuned-model")
# Load it back later
model = AutoModelForCausalLM.from_pretrained("./my-finetuned-model")
tokenizer = AutoTokenizer.from_pretrained("./my-finetuned-model")
# Push to Hugging Face Hub
model.push_to_hub("your-username/my-finetuned-model")
tokenizer.push_to_hub("your-username/my-finetuned-model")
Thanks to the pipeline API, you can run a language model in two lines of Python: one to create the pipeline, one to call it. The entire transformer architecture, tokenization, and decoding are handled behind the scenes. This is both a blessing (rapid prototyping) and a danger (it is easy to treat the model as a black box without understanding its behavior).
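As a concrete illustration of the two-line claim (gpt2 is used here only because it is small; any causal LM on the Hub works, and the weights are downloaded on first use):

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")      # line 1: build the pipeline
out = generator("The quick brown fox", max_new_tokens=8)   # line 2: call it
print(out[0]["generated_text"])
```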
Python Toolkit Summary
- transformers is your primary interface to pretrained models. Learn its Auto classes and pipeline API first.
- Use Conda for GPU environments (it manages CUDA) or uv for speed. Always use a virtual environment.
- Jupyter notebooks for exploration; Python scripts for production. Always validate notebooks by running all cells from scratch.
- Master the five patterns: model loading, chat templates, batch processing, API retry, and checkpoint management.
- The Hugging Face ecosystem (datasets, peft, trl, accelerate) covers the full LLM lifecycle from data to deployment.
What Comes Next
Continue to Appendix D: Development Environment Setup for the next reference appendix in this collection.