Section 17.3a: Training Tool Comparison, Cloud Compute & Recommended Workflows

"The cheapest GPU is the one you don't accidentally leave running overnight."
LoRA, YAML-Weary AI Agent

Big Picture

This section continues from Section 17.3, which surveyed the major fine-tuning training platforms (Unsloth, Axolotl, LLaMA-Factory, torchtune, TRL). Here we put them side by side, walk through the cloud GPU landscape and its pricing tiers, and stitch the pieces into three recommended end-to-end workflows (beginner, intermediate production, advanced multi-stage alignment).

Prerequisites

This section assumes familiarity with the five tools surveyed in Section 17.3 (Unsloth, Axolotl, LLaMA-Factory, torchtune, TRL).

17.3.6 Tool Comparison Matrix

Table 17.3a.1: Tool Comparison Matrix (as of 2026).

Feature	Unsloth	Axolotl	LLaMA-Factory	torchtune	TRL
Interface	Python API	YAML config	Web UI + CLI	CLI + Python	Python API
Speed	2x faster	1x (standard)	1x (standard)	1x (standard)	1x (standard)
Memory	50% less	Standard	Standard	Standard	Standard
Multi-GPU	Limited	DeepSpeed, FSDP	DeepSpeed	FSDP native	Accelerate
RLHF/DPO (canonical home: Sec 18.1)	Via TRL	Via TRL	Built-in	Recipes	Core feature
Export	GGUF, vLLM	HF format	HF, GGUF	HF format	HF format
Best For	Speed, single GPU	Reproducibility	Beginners, teams	Custom research	Alignment

Warning

Unsloth's speed advantage comes from custom CUDA/Triton kernels that may lag behind the latest model architectures. When a new model is released (for example, a new Qwen or Gemma variant), it can take days to weeks before Unsloth adds optimized support. Axolotl and TRL, which rely on standard Hugging Face Transformers, typically support new models within hours of their release. Plan accordingly if you need cutting-edge model support.

Production Pattern: Who Uses Which Fine-Tuning Stack

The fine-tuning toolchain is well documented in the wild. Hugging Face's own TRL is the canonical reference; the Zephyr, StarCoder2, and Idefics3 releases describe their TRL recipes openly in their model cards. Axolotl powers many open-weights model releases: NousResearch (Hermes), the Dolphin series, and many TheBloke-era community fine-tunes shipped with Axolotl YAMLs. Unsloth's Discord features case studies from indie devs fine-tuning 7-8B Llama-class models on a single 24 GB consumer card (70B-class models need roughly 40 GB or more even with 4-bit QLoRA). On the enterprise side, Together AI, Predibase, and Anyscale offer hosted fine-tuning products in the same vein; OpenPipe (YC S23) wraps a similar stack in a one-click product targeted at startups replacing GPT-4 with a smaller LoRA-tuned model.

Library Shortcut: mergekit (SLERP, TIES, DARE)

To fuse multiple fine-tuned models into one, use mergekit from Arcee AI. A YAML config picks the merge method (slerp, ties, dare_ties, linear, model_stock), the source models and their weights, and the base model used as the reference. Mergekit is the engine behind most of the top-ranked merges on the Open LLM Leaderboard. It operates on full model weights: a LoRA adapter can be referenced in the config as base_model+adapter so that it is applied before merging, and the companion mergekit-extract-lora script goes the other way, extracting an adapter from a fine-tuned model.

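# Install first (shell): pip install mergekit
# config.yaml selects models, weights, and merge method (slerp/ties/dare_ties)
import subprocess

subprocess.run(
    ["mergekit-yaml", "config.yaml", "./merged",
     "--cuda",             # run the merge on the GPU
     "--copy-tokenizer",   # copy a tokenizer into ./merged
     "--allow-crimes"],    # permit mixing architectures; drop for ordinary merges
    check=True)            # raise if mergekit exits with an error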

Code Fragment 17.3a.1: Mergekit fuses adapters or full fine-tunes with one CLI call.
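
The CLI call above expects a config.yaml. As a minimal sketch of what such a file might contain, the snippet below writes a TIES merge config from Python. The model IDs are hypothetical placeholders, and the weight and density values are illustrative assumptions rather than recommended settings; the field names follow the layout used in mergekit's published example configs.

```python
from pathlib import Path

# ASSUMPTION: all model IDs are hypothetical placeholders, and the
# weight/density values are illustrative, not tuned recommendations.
TIES_CONFIG = """\
merge_method: ties
base_model: example-org/base-7b
models:
  - model: example-org/finetune-a-7b
    parameters:
      weight: 0.5
      density: 0.5
  - model: example-org/finetune-b-7b
    parameters:
      weight: 0.5
      density: 0.5
parameters:
  normalize: true
dtype: float16
"""


def write_merge_config(path="config.yaml"):
    """Write the example TIES config to disk and return its path."""
    target = Path(path)
    target.write_text(TIES_CONFIG, encoding="utf-8")
    return target
```

Calling write_merge_config() and then running the mergekit-yaml command would produce the merged model in ./merged, assuming the referenced models exist and share an architecture.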

17.3.7 Cloud Compute Options

Choosing the right GPU infrastructure depends on your budget, scale, and workflow preferences. Here is a comparison of the major options available for LLM fine-tuning.

Table 17.3a.2: Cloud Compute Options Comparison (as of 2026).

Platform	GPU Options	Price Range	Best For
Google Colab	T4 (free), A100 (Pro+)	Free to $50/mo	Prototyping, learning, small models
Lambda Labs	A100, H100	$1.10-$2.49/hr per GPU	On-demand training, reserved instances
RunPod	A100, H100, A6000	$0.44-$3.89/hr per GPU	Serverless, spot pricing, community cloud
Modal	A100, H100, T4	Pay-per-second	Serverless functions, burst training
Vast.ai	Various (marketplace)	$0.20-$2.00/hr	Cheapest option, community GPUs
AWS/GCP/Azure	Full range	$1.00-$30+/hr	Enterprise, compliance, multi-region

Figure 17.3a.1 maps GPU requirements and approximate costs by model size and fine-tuning method.

Fun Fact

Hugging Face's PEFT library reduced the code needed to add LoRA to a model from hundreds of lines to roughly five. Democratizing access to advanced techniques is great, until you realize that "easy to use" also means "easy to misuse with default settings."

Figure 17.3a.1: GPU requirements and approximate costs scale with model size and fine-tuning method.

Note

For beginners, start with Google Colab Pro ($10/month) to experiment with QLoRA on 7B models using a T4 or A100 GPU. Once you have a working pipeline, move to RunPod or Lambda Labs for longer training runs. Modal is excellent for teams that want serverless infrastructure where you pay only for the seconds of GPU time you actually use.

17.3.8 Recommended Workflows

Here are recommended end-to-end workflows depending on your experience level and requirements.

Beginner: First Fine-Tune

Use Google Colab with a free T4 GPU
Install Unsloth for optimized training
Fine-tune a 7B model with QLoRA (r=16)
Export to GGUF and test with Ollama locally
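
The load and export steps of the beginner workflow can be sketched roughly as follows. This is a sketch under stated assumptions, not a complete recipe: the model ID, sequence length, and every LoRA setting other than r=16 are illustrative, the helper names are hypothetical, and the training loop itself (typically a TRL SFTTrainer run between the two helpers) is omitted. The Unsloth import is deferred into the function because the library needs a CUDA GPU to load.

```python
# ASSUMPTION: illustrative model ID and hyperparameters; only r=16 comes
# from the workflow above.
MODEL_NAME = "unsloth/mistral-7b-bnb-4bit"
LORA_SETTINGS = dict(
    r=16,
    lora_alpha=16,
    lora_dropout=0.0,
    bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)


def load_qlora_model(max_seq_length=2048):
    """Load a 4-bit base model and attach LoRA adapters via Unsloth."""
    from unsloth import FastLanguageModel  # lazy import: requires a CUDA GPU

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=MODEL_NAME,
        max_seq_length=max_seq_length,
        load_in_4bit=True,  # QLoRA: frozen base weights stored in 4-bit
    )
    model = FastLanguageModel.get_peft_model(model, **LORA_SETTINGS)
    return model, tokenizer


def export_gguf(model, tokenizer, out_dir="gguf_model"):
    """Write a quantized GGUF file that Ollama or llama.cpp can load."""
    model.save_pretrained_gguf(out_dir, tokenizer,
                               quantization_method="q4_k_m")
```

On a Colab T4, the intended order would be load_qlora_model(), a training run, then export_gguf().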

Intermediate: Production Fine-Tune

Use Axolotl for reproducible YAML-based configuration
Train on RunPod or Lambda Labs with an A100
Run evaluation suite before and after training
Merge adapter and deploy via vLLM

Advanced: Multi-Stage Alignment

SFT with TRL + LoRA on instruction data
DPO with TRL + LoRA on preference pairs
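Merge the adapters sequentially: fold the SFT adapter into the base weights first, then the DPO adapter that was trained on top of it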
Evaluate with custom benchmarks and human evaluation
Deploy with vLLM or serve adapters via LoRAX

Tip: Merge Adapters Before Deployment

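After training, merge LoRA weights into the base model with model.merge_and_unload(). This eliminates adapter overhead during inference, making your fine-tuned model exactly as fast as the base model with no additional memory cost. If you trained with QLoRA, merge into a 16-bit copy of the base model rather than the 4-bit weights, since merging into quantized weights loses precision.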
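
As a minimal sketch of sequential merging, the helper below folds a list of adapters into a model one after another using PEFT's PeftModel.from_pretrained and merge_and_unload. The function name and the attach parameter are hypothetical conveniences (the parameter exists only so the loading step can be swapped out, for example in tests), and the paths in the usage comment are placeholders.

```python
def merge_adapters_sequentially(model, adapter_paths, attach=None):
    """Fold each LoRA adapter into the model weights, in the given order.

    `attach` wraps a model with an adapter; it defaults to
    peft.PeftModel.from_pretrained, imported lazily so this sketch can be
    loaded on a machine without peft installed.
    """
    if attach is None:
        from peft import PeftModel
        attach = PeftModel.from_pretrained

    for path in adapter_paths:
        # Wrap the current model with the adapter, then bake the adapter
        # weights into the base layers and drop the wrapper.
        model = attach(model, path).merge_and_unload()
    return model


# Usage (hypothetical paths; needs transformers and peft installed):
# from transformers import AutoModelForCausalLM
# base = AutoModelForCausalLM.from_pretrained("example-org/base-7b")
# merged = merge_adapters_sequentially(base, ["./sft-adapter", "./dpo-adapter"])
# merged.save_pretrained("./merged-model")
```

Order matters here: the DPO adapter was trained against the SFT-merged weights, so the SFT adapter has to be folded in first.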

Research Frontier

Training platforms are converging on declarative configuration formats (like Axolotl's YAML-based setup) that abstract away distributed training details and let practitioners focus on data and hyperparameters. Cloud-native fine-tuning services are integrating evaluation pipelines directly into the training loop, automatically running benchmark suites at checkpoints and selecting the best model without manual intervention.

An emerging frontier is on-device PEFT, where LoRA adapters are trained directly on edge devices (phones, laptops) using private user data, enabling personalization without cloud round-trips. Apple's 2024 on-device foundation model work is a step in this direction: it hot-swaps small task-specific LoRA adapters (tens of megabytes each) on iPhone hardware, although the adapters themselves are trained server-side rather than on the device.

Key Takeaways

Tool choice is a function of constraints: single-GPU speed (Unsloth), reproducibility (Axolotl), UI-driven teams (LLaMA-Factory), custom research (torchtune), alignment training (TRL).
Unsloth lags on cutting-edge model support. Stock HF-based stacks typically support new models within hours of release; Unsloth often takes days to weeks.
Spot/community GPUs (RunPod, Vast.ai) are 2-5x cheaper than reserved cloud, and adequate for most fine-tuning runs that tolerate occasional preemption.
QLoRA on a T4 costs under $5 for a typical 7B fine-tuning run, making experimentation accessible to almost anyone.
For beginners, start with Colab + Unsloth on 7B models. For production, use Axolotl on RunPod or Lambda Labs. For alignment, use TRL.

Self-Check

Q1: What GPU would you recommend for QLoRA fine-tuning of a 70B model, and approximately how much would a training run cost?

Answer

A 70B model with QLoRA needs about 35 GB for the 4-bit weights alone and roughly 40-48 GB once adapters, optimizer state, and activations are included, so a single A100 80GB GPU is the practical minimum (a 48 GB card leaves almost no headroom). A typical training run (3 epochs on 50K samples) would cost roughly $30-80 on spot/community GPU pricing (RunPod, Vast.ai), or 2-5x more on reserved instances or major cloud providers. An H100 would train faster but costs more per hour.

Q2: You need to fine-tune a model and then run DPO alignment. Which tool combination would you use, and why?

Answer

Use TRL (Transformer Reinforcement Learning) for both stages, as it natively supports SFT, DPO, and other alignment methods with built-in PEFT integration. For the SFT stage, use SFTTrainer with a LoRA config. For the DPO stage, use DPOTrainer starting from the SFT checkpoint. Optionally, use Unsloth as the model backend for 2x speed improvement on both stages. TRL handles the LoRA adapter management automatically across both training phases.

Exercises

Exercise 17.3a.1: GPU selection (Coding)

Write a function that recommends a GPU configuration given the model size (in billions of parameters) and PEFT method (LoRA, QLoRA, full fine-tune). Consider VRAM requirements.

Answer Sketch

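Rough estimates: a full fine-tune needs ~16 bytes per parameter (4 bytes * 4, covering weights, gradients, and the two Adam optimizer states). LoRA keeps the frozen base in fp16 (2 bytes * params) and QLoRA keeps it in 4-bit (0.5 bytes * params); both add the fp16 adapter, its optimizer state, and activation memory on top, so the totals below sit well above the weights-only figure. For 7B: full = ~112GB (2x A100-80GB), LoRA = ~28GB (14GB of weights plus overhead; 1x A100-40GB), QLoRA = ~10GB (3.5GB of weights plus overhead; 1x RTX 4090). For 70B: full = ~1.1TB (14x A100-80GB), QLoRA = ~48GB (35GB of weights plus overhead; 1x A100-80GB). Return GPU type and count recommendation.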
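
One possible shape for the function is sketched below. The bytes-per-parameter values come from the estimates above; the overhead terms are assumptions hand-picked so that the output lands near the quoted figures, and the GPU catalog is a small illustrative list rather than a complete one. All names are hypothetical.

```python
import math

# Bytes per parameter for the base weights (plus gradients and optimizer
# state in the full fine-tune case), per the rough estimates above.
BYTES_PER_PARAM = {"full": 16.0, "lora": 2.0, "qlora": 0.5}

# ASSUMPTION: (scale, fixed_gb) overhead for adapters, their optimizer
# state, and activations. Hand-picked to reproduce the quoted figures;
# calibrate against real runs before relying on it.
OVERHEAD = {"full": (1.0, 0.0), "lora": (1.5, 7.0), "qlora": (1.2, 6.0)}

# ASSUMPTION: a small illustrative catalog, ordered by VRAM.
GPUS = [("RTX 4090", 24), ("A100-40GB", 40), ("A100-80GB", 80)]


def estimate_vram_gb(params_billions, method):
    """Estimate training VRAM in GB for a model of the given size."""
    key = method.lower()
    scale, fixed = OVERHEAD[key]
    return params_billions * BYTES_PER_PARAM[key] * scale + fixed


def recommend_gpu(params_billions, method):
    """Return (gpu_name, count, estimated_vram_gb)."""
    need = estimate_vram_gb(params_billions, method)
    # Prefer the smallest single GPU that fits the whole job.
    for name, vram in GPUS:
        if need <= vram:
            return name, 1, need
    # Otherwise shard across the largest card; this naive division
    # ignores communication and sharding overhead.
    name, vram = GPUS[-1]
    return name, math.ceil(need / vram), need
```

For example, recommend_gpu(7, "qlora") selects a single RTX 4090 and recommend_gpu(70, "full") selects 14 A100-80GB cards, matching the estimates above.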

Exercise 17.3a.2: Cloud compute costs (Analysis)

Compare the hourly cost and time to fine-tune a 7B model using QLoRA on three cloud platforms: (a) Lambda Labs A100, (b) RunPod A100, (c) Google Colab Pro A100. Assume 2 hours of training.

Answer Sketch

Approximate costs: Lambda Labs A100-80GB: ~$1.50/hr = $3.00. RunPod A100-80GB: ~$1.20/hr = $2.40. Google Colab Pro: ~$10/month flat (but limited GPU time and less reliable). For a one-off 2-hour job, RunPod is cheapest. For regular experimentation, Colab Pro offers the best value if you stay within usage limits.
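
The arithmetic behind this comparison can be checked with a few lines of Python. The rates are the approximate figures quoted above, not live prices, and the break-even calculation treats the Colab Pro subscription as a flat fee while ignoring its usage limits, which is a simplifying assumption.

```python
# Approximate hourly rates from the answer sketch (not live prices).
HOURLY_RATES = {
    "Lambda Labs A100-80GB": 1.50,
    "RunPod A100-80GB": 1.20,
}
COLAB_PRO_MONTHLY = 10.00  # flat fee; usage limits are ignored here


def job_cost(rate_per_hour, hours):
    """Cost in dollars of a single job billed by the hour."""
    return round(rate_per_hour * hours, 2)


def break_even_hours(flat_fee, rate_per_hour):
    """Monthly hours at which a flat fee matches hourly billing."""
    return flat_fee / rate_per_hour


costs = {name: job_cost(rate, 2) for name, rate in HOURLY_RATES.items()}
cheapest = min(costs, key=costs.get)

print(costs)     # cost of the 2-hour job on each hourly platform
print(cheapest)  # the cheapest hourly option for a one-off job
print(round(break_even_hours(COLAB_PRO_MONTHLY, HOURLY_RATES[cheapest]), 2))
```

At the quoted RunPod rate, the flat subscription only pays for itself after a little over eight GPU-hours per month, which is why it favors regular experimentation over one-off jobs.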

What Comes Next

In the next section, Section 17.4: Soft Prompts: Prompt Tuning, Prefix Tuning, and P-Tuning, we cover soft-prompt-based PEFT methods.

System Optimization & Compute

Rasley, J., Rajbhandari, S., Ruwase, O., & He, Y. (2020). DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters. KDD 2020. Introduces ZeRO (Zero Redundancy Optimizer) stages that partition optimizer states, gradients, and parameters across GPUs.

Zhao, J., Zhang, Z., Chen, B., Wang, Z., Anandkumar, A., & Tian, Y. (2024). GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection. ICML 2024. Projects gradients into a low-rank subspace during training, reducing optimizer memory without modifying model architecture.

Dettmers, T., Lewis, M., Belkada, Y., & Zettlemoyer, L. (2022). LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. NeurIPS 2022. Introduces mixed-precision decomposition that handles outlier features in 16-bit while keeping the rest in 8-bit, enabling inference and training of large models on consumer hardware.