"The best training framework is the one that lets you stop configuring YAML and start looking at loss curves."
LoRA, YAML-Weary AI Agent
The fine-tuning tool landscape is evolving rapidly. While you can always write a training loop from scratch using PyTorch and the PEFT library, specialized platforms can dramatically reduce setup time, optimize GPU utilization, and provide production-tested configurations out of the box. This section surveys the most important tools in the ecosystem: Unsloth for raw speed, Axolotl for configuration-driven workflows, LLaMA-Factory for a visual interface, torchtune for PyTorch-native composability, and TRL for alignment training. We also cover the cloud compute landscape to help you choose the right GPU infrastructure for your budget. The SFT workflow from Section 14.3 and the LoRA/QLoRA techniques from Section 15.1 are the foundations these tools build upon.
Prerequisites
This section builds on the LoRA and QLoRA techniques from Section 15.1 and the advanced PEFT methods covered in Section 15.2. You should be comfortable with the supervised fine-tuning workflow from Section 14.3, as these tools wrap that workflow with convenience layers and optimizations. Familiarity with Section 09.2 will help you understand how QLoRA integrates with these platforms.
1. Unsloth: 2x Faster Fine-Tuning
Unsloth is an open-source library that achieves roughly 2x training speedup and 50% memory reduction compared to standard Hugging Face training, with zero accuracy loss. It accomplishes this through hand-written Triton kernels for attention, RoPE, cross-entropy loss, and other operations, bypassing the overhead of PyTorch's autograd in performance-critical paths.
Unsloth integrates seamlessly with the Hugging Face ecosystem: you load models through Unsloth's optimized loader, and then use standard SFTTrainer or DPOTrainer for the actual training. The output is a standard PEFT adapter that can be loaded by any tool. Figure 15.3.1 compares the performance gains.
Unsloth's name is a playful jab at the perceived slowness of standard training frameworks. The irony is that "slow" training with Hugging Face Transformers was already considered fast by pre-2020 standards. In the LLM era, expectations have shifted so dramatically that a training run finishing in 4 hours instead of 2 hours now feels unacceptable. Unsloth exists because engineers would rather optimize Triton kernels than wait an extra two hours.
Think of training platforms as auto-tuning workshops for cars. Unsloth is the speed shop that optimizes your engine (training kernels) to run twice as fast on the same hardware. Axolotl is the all-in-one garage with pre-built configurations for common jobs. Hugging Face TRL is the standard toolkit that every mechanic knows, with parts (trainers, callbacks) that work across any model. The choice depends on whether you need speed, convenience, or flexibility.
The following implementation (Code Fragment 15.3.2) shows this approach in practice.
```python
# Load a model with Unsloth for 2x faster LoRA fine-tuning
# Unsloth fuses kernels and optimizes memory layout automatically
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# 1. Load model with Unsloth (handles quantization + LoRA setup)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B",
    max_seq_length=2048,
    dtype=None,          # Auto-detect (BF16 on Ampere+)
    load_in_4bit=True,   # QLoRA mode
)

# 2. Add LoRA adapters (Unsloth optimized)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,      # Unsloth recommends 0 for speed
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    use_gradient_checkpointing="unsloth",  # 30% less VRAM
)

# 3. Standard SFTTrainer workflow
dataset = load_dataset("tatsu-lab/alpaca", split="train")
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        num_train_epochs=1,
        learning_rate=2e-4,
        bf16=True,
        logging_steps=1,
        optim="adamw_8bit",
        output_dir="outputs",
    ),
)
trainer.train()

# 4. Export to various formats
model.save_pretrained("lora_model")   # Save LoRA adapter
model.save_pretrained_merged(         # Merged FP16 weights
    "merged_model", tokenizer, save_method="merged_16bit"
)
model.save_pretrained_gguf(           # GGUF for llama.cpp
    "gguf_model", tokenizer, quantization_method="q4_k_m"
)
```
Unsloth's save_pretrained_gguf method directly exports to GGUF format, eliminating the separate llama.cpp conversion step. This makes the workflow from training to local deployment (via Ollama or llama.cpp) a single pipeline. For production vLLM deployments, use save_pretrained_merged instead.
Who: Solo developer building a legal document summarization product
Situation: The developer needed to fine-tune Llama 3 8B on 5,000 legal summaries and deploy the result locally via Ollama for a privacy-sensitive law firm client.
Problem: The standard Hugging Face training pipeline took 8 hours on a single RTX 4090 and required a separate GGUF conversion step for local deployment.
Dilemma: Faster iteration was essential (the client wanted weekly model updates), but switching to a smaller model would degrade summarization quality.
Decision: They adopted Unsloth, which provides optimized kernels for LoRA training and direct GGUF export in one pipeline.
How: Using FastLanguageModel.from_pretrained with 4-bit quantization, they trained with Unsloth's fused kernels. After training, save_pretrained_gguf exported directly to Q4_K_M format.
Result: Training time dropped from 8 hours to 2.5 hours (a 3.2x speedup). The entire train-to-deploy pipeline (including GGUF export and Ollama import) completed in under 3 hours, enabling weekly update cycles.
Lesson: When your deployment target is local inference (Ollama, llama.cpp), Unsloth's integrated GGUF export eliminates a fragile conversion step and dramatically accelerates iteration.
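The last mile of this pipeline is the Ollama import. A minimal sketch, assuming the GGUF file exported above (the exact file name is an assumption; check what `save_pretrained_gguf` actually wrote into the `gguf_model` directory, and `legal-summarizer` is a made-up model name):

```shell
# Write a minimal Ollama Modelfile pointing at the exported GGUF
cat > Modelfile <<'EOF'
FROM ./gguf_model/unsloth.Q4_K_M.gguf
PARAMETER temperature 0.2
EOF

# Register the model with Ollama and try it locally
ollama create legal-summarizer -f Modelfile
ollama run legal-summarizer "Summarize the following clause: ..."
```

Because the weights never leave the machine, this satisfies the privacy constraint without any extra conversion tooling.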
2. Axolotl: Configuration-Driven Training
Axolotl takes a different approach: instead of writing Python code, you define your entire training run in a YAML configuration file. This makes experiments reproducible, shareable, and easy to iterate on. Axolotl supports all major model architectures, PEFT methods, dataset formats, and training features (DeepSpeed, FSDP, multi-GPU) through configuration alone. Code Fragment 15.3.4 shows this in practice.
```yaml
# axolotl_config.yml
base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer

# Dataset configuration
datasets:
  - path: tatsu-lab/alpaca
    type: alpaca
  - path: ./my_custom_data.jsonl
    type: sharegpt

# QLoRA configuration
load_in_4bit: true
adapter: qlora
lora_r: 32
lora_alpha: 64
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - gate_proj
  - up_proj
  - down_proj
lora_target_linear: true

# Training parameters
sequence_len: 4096
sample_packing: true       # Pack multiple samples per sequence
pad_to_sequence_len: true
micro_batch_size: 2
gradient_accumulation_steps: 4
num_epochs: 3
learning_rate: 0.0002
lr_scheduler: cosine
warmup_ratio: 0.05
optimizer: paged_adamw_8bit
bf16: auto
gradient_checkpointing: true
flash_attention: true

# Evaluation and logging
val_set_size: 0.05
eval_steps: 100
logging_steps: 10
save_strategy: steps
save_steps: 200
output_dir: ./outputs/llama3-qlora
```
Axolotl's sample_packing feature concatenates multiple short training examples into a single sequence, significantly improving GPU utilization when your dataset contains many short examples. This can speed up training by 2-5x for datasets with average sequence lengths well below the maximum. Axolotl handles the attention masking automatically so that packed samples do not attend to each other.
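To see why packing pays off, here is a toy calculation (our own illustration, not Axolotl's implementation, which packs token IDs and builds block-diagonal attention masks) comparing token utilization with per-sample padding versus greedy first-fit packing:

```python
# Toy model of sample packing: compare the fraction of useful
# (non-padding) tokens in a batch with and without packing.

def padded_utilization(lengths, max_len):
    """Fraction of useful tokens when each sample is padded to max_len."""
    return sum(lengths) / (len(lengths) * max_len)

def packed_utilization(lengths, max_len):
    """Fraction of useful tokens under greedy first-fit packing."""
    bins = []  # running token count per packed sequence
    for n in sorted(lengths, reverse=True):
        for i, used in enumerate(bins):
            if used + n <= max_len:
                bins[i] += n
                break
        else:
            bins.append(n)
    return sum(lengths) / (len(bins) * max_len)

# 100 short samples (~300 tokens each) at sequence_len=4096
lengths = [300] * 100
print(f"padded: {padded_utilization(lengths, 4096):.1%}")  # ~7.3%
print(f"packed: {packed_utilization(lengths, 4096):.1%}")  # ~91.6%
```

In this toy case the same 100 samples fit into 8 packed sequences instead of 100 padded ones, which is where the 2-5x wall-clock speedups come from.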
3. LLaMA-Factory: Web UI for Fine-Tuning
LLaMA-Factory provides a graphical web interface (LLaMA Board) for configuring and launching fine-tuning runs. It is particularly valuable for teams where not everyone is comfortable writing YAML or Python configurations. The web UI lets you select models, datasets, PEFT methods, and hyperparameters through dropdown menus and sliders, then generates and executes the corresponding training code. Code Fragment 15.3.5 shows this approach in practice.
```python
# Install LLaMA-Factory:  pip install llamafactory
# Launch the web UI:      llamafactory-cli webui
# Or use the CLI for scriptable workflows:
import json

# LLaMA-Factory uses a JSON config (similar to Axolotl's YAML)
config = {
    "model_name_or_path": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "stage": "sft",
    "finetuning_type": "lora",
    "lora_rank": 16,
    "lora_alpha": 32,
    "lora_target": "all",
    "dataset": "alpaca_en",
    "template": "llama3",
    "quantization_bit": 4,
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 4,
    "num_train_epochs": 3.0,
    "learning_rate": 2e-4,
    "output_dir": "./llama_factory_output",
}

# Save and run via the CLI
with open("train_config.json", "w") as f:
    json.dump(config, f, indent=2)
# llamafactory-cli train train_config.json
```
4. torchtune: PyTorch-Native Fine-Tuning
torchtune is PyTorch's official library for fine-tuning LLMs. Its philosophy is transparency and composability: rather than hiding complexity behind abstractions, it provides well-documented, hackable recipes that you can read, understand, and modify. Each recipe is a self-contained Python script, not a framework that manages your training loop.
torchtune is the best choice when you need full control over the training process, want to implement custom training logic, or are integrating fine-tuning into an existing PyTorch codebase. Code Fragment 15.3.6 shows this approach in practice.
```python
# torchtune uses YAML configs and CLI recipes
# Install: pip install torchtune

# Download a model:
#   tune download meta-llama/Meta-Llama-3.1-8B-Instruct \
#     --output-dir ./models/llama3-8b

# Run a built-in recipe (LoRA, single GPU):
#   tune run lora_finetune_single_device \
#     --config llama3_1/8B_lora_single_device

# Override config values from the command line:
#   tune run lora_finetune_single_device \
#     --config llama3_1/8B_lora_single_device \
#     batch_size=4 epochs=3 lora_rank=32

# torchtune YAML config excerpt:
#   model:
#     _component_: torchtune.models.llama3_1.lora_llama3_1_8b
#     lora_attn_modules: ['q_proj', 'v_proj', 'k_proj', 'output_proj']
#     apply_lora_to_mlp: True
#     lora_rank: 16
#     lora_alpha: 32

# torchtune is also great for programmatic use:
import torch
from torchtune.models.llama3_1 import lora_llama3_1_8b
from torchtune.modules.peft import get_adapter_params

# Build a LoRA model with full control
model = lora_llama3_1_8b(
    lora_attn_modules=["q_proj", "v_proj"],
    apply_lora_to_mlp=True,
    lora_rank=16,
    lora_alpha=32,
)

# Pass only the adapter parameters to the optimizer
# (get_adapter_params returns a name -> parameter dict)
adapter_params = get_adapter_params(model)
optimizer = torch.optim.AdamW(adapter_params.values(), lr=2e-4)
```
5. TRL: Transformer Reinforcement Learning
TRL (Transformer Reinforcement Learning) from Hugging Face is the standard library for alignment training, including SFT, RLHF, DPO, and other preference optimization methods. While its scope extends beyond PEFT, TRL integrates deeply with the PEFT library, making it the natural choice when your fine-tuning involves alignment stages. Code Fragment 15.3.7 shows this approach in practice.
```python
# Fine-tune with TRL's SFTTrainer for supervised instruction tuning
# SFTTrainer handles chat template formatting and packing automatically
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig, DPOTrainer, DPOConfig
from peft import LoraConfig

dataset = load_dataset("tatsu-lab/alpaca", split="train")

# SFT with LoRA (most common PEFT + TRL pattern)
sft_config = SFTConfig(
    output_dir="./sft_output",
    max_seq_length=2048,
    per_device_train_batch_size=4,
    num_train_epochs=3,
    learning_rate=2e-4,
    packing=True,  # Sample packing for efficiency
)

# Configure LoRA: rank, scaling factor, and target modules
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

# Initialize the trainer with model, data, and config
trainer = SFTTrainer(
    model="meta-llama/Meta-Llama-3.1-8B",
    args=sft_config,
    train_dataset=dataset,
    peft_config=peft_config,  # TRL handles PEFT setup automatically
)

# Launch the training loop
trainer.train()
sft_model = trainer.model  # SFT checkpoint for the DPO stage

# DPO with LoRA (preference optimization after SFT)
# The dataset must contain "prompt", "chosen", and "rejected" columns
preference_dataset = load_dataset(
    "trl-lib/ultrafeedback_binarized", split="train"
)
dpo_config = DPOConfig(
    output_dir="./dpo_output",
    per_device_train_batch_size=2,
    num_train_epochs=1,
    learning_rate=5e-5,
    beta=0.1,  # DPO temperature
)
dpo_trainer = DPOTrainer(
    model=sft_model,  # Start from the SFT checkpoint
    args=dpo_config,
    train_dataset=preference_dataset,
    peft_config=peft_config,
)
dpo_trainer.train()
```
Why this matters: The choice of training platform has more practical impact than the choice of PEFT method. A misconfigured training run wastes hours and GPU dollars, while a well-integrated platform handles mixed-precision training, gradient checkpointing, and dataset preprocessing automatically. For most practitioners, the recommendation is clear: use Unsloth for single-GPU experiments (fastest iteration), Axolotl for reproducible multi-GPU training (best configuration management), and TRL when you need RLHF or DPO integration. All three connect to the same Hugging Face ecosystem, so switching between them is straightforward.
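That recommendation can be condensed into a toy decision helper. The function and its rules are simply our encoding of the paragraph above, not part of any library:

```python
def pick_platform(needs_alignment: bool, multi_gpu: bool) -> str:
    """One rule of thumb for choosing a training platform."""
    if needs_alignment:
        return "TRL"      # RLHF/DPO support is a core feature
    if multi_gpu:
        return "Axolotl"  # YAML configs + DeepSpeed/FSDP built in
    return "Unsloth"      # fastest single-GPU iteration

print(pick_platform(needs_alignment=False, multi_gpu=False))  # Unsloth
```

Because all three produce standard Hugging Face artifacts, starting with the "wrong" one is a cheap mistake to correct.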
6. Tool Comparison Matrix
| Feature | Unsloth | Axolotl | LLaMA-Factory | torchtune | TRL |
|---|---|---|---|---|---|
| Interface | Python API | YAML config | Web UI + CLI | CLI + Python | Python API |
| Speed | 2x faster | 1x (standard) | 1x (standard) | 1x (standard) | 1x (standard) |
| Memory | 50% less | Standard | Standard | Standard | Standard |
| Multi-GPU | Limited | DeepSpeed, FSDP | DeepSpeed | FSDP native | Accelerate |
| RLHF/DPO | Via TRL | Via TRL | Built-in | Recipes | Core feature |
| Export | GGUF, vLLM | HF format | HF, GGUF | HF format | HF format |
| Best For | Speed, single GPU | Reproducibility | Beginners, teams | Custom research | Alignment |
Unsloth's speed advantage comes from custom CUDA/Triton kernels that may lag behind the latest model architectures. When a new model is released (for example, a new Qwen or Gemma variant), it can take days to weeks before Unsloth adds optimized support. Axolotl and TRL, which rely on standard Hugging Face Transformers, typically support new models within hours of their release. Plan accordingly if you need cutting-edge model support.
7. Cloud Compute Options
Choosing the right GPU infrastructure depends on your budget, scale, and workflow preferences. Here is a comparison of the major options available for LLM fine-tuning.
| Platform | GPU Options | Price Range | Best For |
|---|---|---|---|
| Google Colab | T4 (free), A100 (Pro+) | Free to $50/mo | Prototyping, learning, small models |
| Lambda Labs | A100, H100 | $1.10-$2.49/hr per GPU | On-demand training, reserved instances |
| RunPod | A100, H100, A6000 | $0.44-$3.89/hr per GPU | Serverless, spot pricing, community cloud |
| Modal | A100, H100, T4 | Pay-per-second | Serverless functions, burst training |
| Vast.ai | Various (marketplace) | $0.20-$2.00/hr | Cheapest option, community GPUs |
| AWS/GCP/Azure | Full range | $1.00-$30+/hr | Enterprise, compliance, multi-region |
Figure 15.3.2 maps GPU requirements and approximate costs by model size and fine-tuning method.
Hugging Face's PEFT library reduced the code needed to add LoRA to a model from hundreds of lines to roughly five. Democratizing access to advanced techniques is great, until you realize that "easy to use" also means "easy to misuse with default settings."
For beginners, start with Google Colab Pro ($10/month) to experiment with QLoRA on 7B models using a T4 or A100 GPU. Once you have a working pipeline, move to RunPod or Lambda Labs for longer training runs. Modal is excellent for teams that want serverless infrastructure where you pay only for the seconds of GPU time you actually use.
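To make the trade-offs concrete, here is a back-of-envelope cost sketch. The hourly rates are illustrative assumptions drawn from the table above; check current pricing before committing to a platform:

```python
# Rough per-run cost comparison for a fine-tuning job.
RATES_PER_GPU_HOUR = {  # USD per GPU-hour (approximate)
    "lambda_a100": 1.50,
    "runpod_a100": 1.20,
    "vast_rtx4090": 0.40,
}

def run_cost(platform: str, hours: float, gpus: int = 1) -> float:
    """Total cost of a run: hourly rate x hours x GPU count."""
    return RATES_PER_GPU_HOUR[platform] * hours * gpus

for platform in RATES_PER_GPU_HOUR:
    print(f"{platform}: ${run_cost(platform, hours=2.5):.2f} "
          f"for a 2.5 h QLoRA run")
```

Even at on-demand rates, a single QLoRA fine-tune of a 7B model costs a few dollars; the real expense is usually the iteration count, not any single run.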
8. Recommended Workflows
Here are recommended end-to-end workflows depending on your experience level and requirements.
Beginner: First Fine-Tune
- Use Google Colab with a free T4 GPU
- Install Unsloth for optimized training
- Fine-tune a 7B model with QLoRA (r=16)
- Export to GGUF and test with Ollama locally
Intermediate: Production Fine-Tune
- Use Axolotl for reproducible YAML-based configuration
- Train on RunPod or Lambda Labs with an A100
- Run evaluation suite before and after training
- Merge adapter and deploy via vLLM
Advanced: Multi-Stage Alignment
- SFT with TRL + LoRA on instruction data
- DPO with TRL + LoRA on preference pairs
- Merge both adapters sequentially
- Evaluate with custom benchmarks and human evaluation
- Deploy with vLLM or serve adapters via LoRAX
For the multi-stage workflow above, use SFTTrainer with a LoRA config for the instruction-tuning stage, then DPOTrainer starting from the SFT checkpoint for preference optimization. Optionally, use Unsloth as the model backend for a 2x speedup on both stages; TRL manages the LoRA adapters automatically across both training phases. After training, merge the LoRA weights into the base model with model.merge_and_unload(). This eliminates adapter overhead during inference, making your fine-tuned model exactly as fast as the base model with no additional memory cost.
- Unsloth delivers 2x speed and 50% memory savings through custom Triton kernels, making it the best choice for single-GPU fine-tuning when the model is supported.
- Axolotl provides reproducible, configuration-driven training via YAML files, with sample packing and multi-GPU support built in.
- LLaMA-Factory offers a web UI that makes fine-tuning accessible to teams without deep Python expertise.
- torchtune is PyTorch-native and transparent, ideal when you need full control over the training pipeline or are doing custom research.
- TRL is essential for alignment training (SFT, DPO, RLHF) and integrates seamlessly with PEFT for parameter-efficient alignment.
- For beginners, start with Colab + Unsloth on 7B models. For production, use Axolotl on RunPod or Lambda Labs. For alignment, use TRL.
- QLoRA on a T4 costs under $5 for a typical 7B fine-tuning run, making experimentation accessible to almost anyone.
Axolotl, one of the popular fine-tuning frameworks, was named after the adorable Mexican salamander known for its regenerative abilities. The metaphor is apt: these tools help you regenerate model capabilities without starting from scratch.
Training platforms are converging on declarative configuration formats (like Axolotl's YAML-based setup) that abstract away distributed training details and let practitioners focus on data and hyperparameters. Cloud-native fine-tuning services are integrating evaluation pipelines directly into the training loop, automatically running benchmark suites at checkpoints and selecting the best model without manual intervention.
An emerging frontier is on-device PEFT, where LoRA adapters are trained directly on edge devices (phones, laptops) using private user data, enabling personalization without cloud round-trips. Apple's on-device LoRA work (2024) demonstrates adapter training on iPhone hardware with less than 1 GB of additional memory.
Exercises
What does Unsloth do to achieve 2x training speedup and 50% memory reduction compared to standard HuggingFace training? Name two key techniques.
Answer Sketch
Unsloth uses: (1) hand-written Triton kernels for attention, RoPE, cross-entropy loss, and other operations that bypass PyTorch autograd overhead, and (2) intelligent memory management that avoids materializing large intermediate tensors. These custom kernels fuse operations that HuggingFace runs as separate steps, reducing GPU memory traffic and computation. The result is identical model quality at roughly half the time and memory.
Compare Unsloth, Axolotl, and LLaMA-Factory on three dimensions: ease of use, flexibility, and performance. When would you choose each?
Answer Sketch
Unsloth: best performance (2x speedup), code-first API, limited to single-GPU. Choose for fast iteration on a single GPU. Axolotl: config-driven (YAML), supports multi-GPU and complex training recipes (DPO, RLHF). Choose for production training pipelines that need reproducibility. LLaMA-Factory: web UI for non-programmers, supports many model families, good for experimentation. Choose when team members without deep ML experience need to fine-tune models.
Write a function that recommends a GPU configuration given the model size (in billions of parameters) and PEFT method (LoRA, QLoRA, full fine-tune). Consider VRAM requirements.
Answer Sketch
Rough estimates: full fine-tune needs ~4 bytes * params * 4 (fp32 weights + gradients + Adam optimizer states). LoRA: 2 bytes * params for the frozen fp16 base weights, plus activations and a small amount of adapter optimizer state. QLoRA: 0.5 bytes * params for the 4-bit base model, plus the LoRA parameters in fp16 and activation overhead. For 7B: full = ~112GB (2x A100-80GB), LoRA = ~20-28GB (1x A100-40GB), QLoRA = ~10GB (1x RTX 4090). For 70B: full = ~1.1TB (14x A100-80GB), QLoRA = ~48GB (1x A100-80GB). Return a GPU type and count recommendation.
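One way to code up the sketch. The byte-per-parameter rules, the flat 6 GB activation allowance, and the 80 GB-per-GPU packing are simplifying assumptions; real requirements also depend on sequence length and batch size:

```python
import math

def recommend_gpu(params_b: float, method: str) -> dict:
    """params_b: model size in billions; method: 'full', 'lora', 'qlora'."""
    bytes_per_param = {
        "full": 16.0,  # fp32 weights + grads + Adam states (~4 bytes x 4)
        "lora": 2.0,   # frozen fp16 base weights; adapters are negligible
        "qlora": 0.5,  # 4-bit base weights
    }[method]
    # +6 GB is a rough allowance for activations and CUDA overhead
    vram_gb = params_b * bytes_per_param + 6
    if vram_gb <= 18:
        gpu, count = "RTX 4090 (24 GB)", 1
    elif vram_gb <= 38:
        gpu, count = "A100-40GB", 1
    elif vram_gb <= 78:
        gpu, count = "A100-80GB", 1
    else:
        gpu, count = "A100-80GB", math.ceil(vram_gb / 80)
    return {"est_vram_gb": round(vram_gb, 1), "gpu": gpu, "count": count}

print(recommend_gpu(7, "qlora"))  # fits a single RTX 4090
print(recommend_gpu(7, "full"))   # needs 2x A100-80GB
```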
Write a complete Unsloth QLoRA training script: load a 4-bit quantized model, configure LoRA, prepare a dataset in chat format, train for 1 epoch, and save the adapter.
Answer Sketch
Load: model, tokenizer = FastLanguageModel.from_pretrained('unsloth/llama-3-8b-bnb-4bit', max_seq_length=2048). LoRA: model = FastLanguageModel.get_peft_model(model, r=16, target_modules=['q_proj','k_proj','v_proj','o_proj'], lora_alpha=16). Trainer: SFTTrainer(model=model, tokenizer=tokenizer, train_dataset=dataset, max_seq_length=2048, args=TrainingArguments(per_device_train_batch_size=2, num_train_epochs=1, output_dir='outputs')). Save: model.save_pretrained('lora_adapter').
Compare the hourly cost and time to fine-tune a 7B model using QLoRA on three cloud platforms: (a) Lambda Labs A100, (b) RunPod A100, (c) Google Colab Pro A100. Assume 2 hours of training.
Answer Sketch
Approximate costs (2025): Lambda Labs A100-80GB: ~$1.50/hr = $3.00. RunPod A100-80GB: ~$1.20/hr = $2.40. Google Colab Pro: ~$10/month flat (but limited GPU time and less reliable). For a one-off 2-hour job, RunPod is cheapest. For regular experimentation, Colab Pro offers the best value if you stay within usage limits. Lambda Labs is best for long running jobs due to persistent instances.
What Comes Next
In the next chapter, Chapter 16: Knowledge Distillation & Model Merging, we explore techniques for creating smaller, specialized models from larger ones.
Han, D. & Rao, D. (2024). Unsloth: Fast and Memory-Efficient LLM Fine-Tuning.
An open-source library that achieves 2x faster LoRA training with 60% less memory through custom Triton kernels and manual backpropagation. Practitioners training on consumer GPUs should try Unsloth first, as it supports Llama, Mistral, and Gemma out of the box.
Wing Lian et al. (2023). Axolotl: A Configuration-Driven Framework for LLM Fine-Tuning.
A YAML-driven training framework that wraps Hugging Face Transformers with support for multi-GPU, LoRA, and dozens of dataset formats. Teams who want a single config file to control their entire training pipeline will find Axolotl's design philosophy compelling.
Mangrulkar, S., Gugger, S., Debut, L., Belkada, Y., Paul, S., & Bossan, B. (2022). PEFT: State-of-the-art Parameter-Efficient Fine-Tuning methods. Hugging Face.
The official Hugging Face library implementing LoRA, QLoRA, prefix tuning, prompt tuning, and adapter methods with a unified API. This is the de facto standard for PEFT in the Hugging Face ecosystem and integrates seamlessly with the Trainer class.
Rasley, J., Rajbhandari, S., Ruwase, O., & He, Y. (2020). DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters. KDD 2020.
Introduces ZeRO (Zero Redundancy Optimizer) stages that partition optimizer states, gradients, and parameters across GPUs. Essential reading for anyone scaling beyond a single GPU, as DeepSpeed integration is built into most modern training frameworks.
Zhao, J., Zhang, Z., Chen, B., Wang, Z., Anandkumar, A., & Tian, Y. (2024). GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection. ICML 2024.
Projects gradients into a low-rank subspace during training, reducing optimizer memory without modifying model architecture. Researchers interested in alternatives to LoRA that achieve full-rank weight updates at reduced cost should explore this approach.
Dettmers, T., Lewis, M., Belkada, Y., & Zettlemoyer, L. (2022). LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. NeurIPS 2022.
Introduces mixed-precision decomposition that handles outlier features in 16-bit while keeping the rest in 8-bit, enabling inference and training of large models on consumer hardware. This paper laid the quantization groundwork that made QLoRA possible.
