"Upload JSONL. Click train. Wait. Download model. It really is that simple, until you need to debug why it is not working."
Fine-Tuning APIs: Deceptively Simple
Not every team needs to manage their own GPU cluster. Provider APIs offer a managed fine-tuning experience where you upload your data, configure a few parameters, and receive a fine-tuned model endpoint. This approach trades control and flexibility for simplicity and speed. Building on the API patterns from Section 10.1, this section covers the two most widely used provider APIs (OpenAI and Google Vertex AI), walks through complete workflows for each, and provides a framework for deciding when managed fine-tuning is the right choice versus self-hosted training.
Prerequisites
Before starting, make sure you are familiar with fine-tuning concepts as covered in Section 14.1: When and Why to Fine-Tune.
1. OpenAI Fine-Tuning API
OpenAI's fine-tuning API (building on the API foundations from Chapter 10) is the most accessible entry point for teams new to fine-tuning. It supports GPT-4o, GPT-4o-mini, and GPT-3.5-turbo models with a straightforward workflow: prepare data in JSONL format, upload the file, create a fine-tuning job, and use the resulting model through the standard chat completions API.
Provider fine-tuning APIs have made the process so simple that the hardest part is no longer "how do I train?" but "should I train?" Many teams upload 50 examples, click "Start Training," get a model back in 30 minutes, and declare victory. Then they discover that their fine-tuned model is worse than the base model with a good prompt, because 50 examples is rarely enough to beat a well-engineered system prompt.
1.1 Data Preparation for OpenAI
OpenAI requires training data in JSONL format with the ChatML messages structure. Each line contains a JSON object with a messages array. The system message is optional but recommended for consistent behavior.
Mental Model: The Cloud Workshop. Think of provider fine-tuning APIs as a cloud workshop where you send raw materials (your dataset) and receive a finished product (a fine-tuned model). You do not need to own the machinery (GPUs) or understand every step of the manufacturing process (training loop details). The trade-off is control: you cannot inspect every step, adjust obscure hyperparameters, or see exactly what happened during training. For many practical applications, this convenience outweighs the loss of control, much like ordering custom furniture instead of building it yourself. Code Fragment 14.4.2 shows this approach in practice.
# Format training data as JSONL for the OpenAI fine-tuning API
# Each line contains a messages array with system, user, and assistant turns
import json
from typing import List, Dict

def prepare_openai_training_file(
    examples: List[Dict],
    output_path: str,
    validate: bool = True,
) -> dict:
    """Prepare and validate a JSONL file for OpenAI fine-tuning."""
    stats = {"total": 0, "valid": 0, "errors": [], "token_estimates": []}

    with open(output_path, "w") as f:
        for i, example in enumerate(examples):
            stats["total"] += 1
            messages = example.get("messages", [])

            if validate:
                # Validate message structure
                if not messages:
                    stats["errors"].append(f"Example {i}: empty messages")
                    continue
                roles = [m["role"] for m in messages]
                # Must end with an assistant message
                if roles[-1] != "assistant":
                    stats["errors"].append(
                        f"Example {i}: last message must be 'assistant'"
                    )
                    continue
                # Check for valid roles
                valid_roles = {"system", "user", "assistant"}
                invalid = set(roles) - valid_roles
                if invalid:
                    stats["errors"].append(
                        f"Example {i}: invalid roles {invalid}"
                    )
                    continue

            # Rough token estimate (4 chars per token)
            total_chars = sum(len(m["content"]) for m in messages)
            estimated_tokens = total_chars // 4
            stats["token_estimates"].append(estimated_tokens)

            f.write(json.dumps(example) + "\n")
            stats["valid"] += 1

    if stats["token_estimates"]:
        import numpy as np
        tokens = np.array(stats["token_estimates"])
        stats["token_summary"] = {
            "mean": int(tokens.mean()),
            "median": int(np.median(tokens)),
            "p95": int(np.percentile(tokens, 95)),
            "total_training_tokens": int(tokens.sum()),
        }

    print(f"Prepared {stats['valid']}/{stats['total']} examples")
    if stats["errors"]:
        print(f"Errors: {len(stats['errors'])}")
        for err in stats["errors"][:5]:
            print(f"  {err}")
    return stats

# Example usage
examples = [
    {
        "messages": [
            {"role": "system", "content": "You are a helpful customer support agent."},
            {"role": "user", "content": "How do I reset my password?"},
            {"role": "assistant", "content": "To reset your password, go to Settings, "
             "then Security, and click 'Reset Password'. You will receive an email "
             "with a reset link within 5 minutes."}
        ]
    },
    # ... more examples
]

stats = prepare_openai_training_file(examples, "train.jsonl")
1.2 Creating and Monitoring a Fine-Tuning Job
Code Fragment 14.4.8 demonstrates the full fine-tuning job lifecycle: uploading files, creating a job, and monitoring progress.
# Upload files, create a fine-tuning job, and monitor training progress
# Uses the OpenAI SDK; requires OPENAI_API_KEY in the environment
from openai import OpenAI
import time

client = OpenAI()  # Uses OPENAI_API_KEY env var

# Step 1: Upload training file
with open("train.jsonl", "rb") as f:
    training_file = client.files.create(file=f, purpose="fine-tune")
print(f"Uploaded file: {training_file.id}")

# Step 2: (Optional) Upload validation file
with open("val.jsonl", "rb") as f:
    validation_file = client.files.create(file=f, purpose="fine-tune")

# Step 3: Create fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    validation_file=validation_file.id,
    model="gpt-4o-mini-2024-07-18",
    hyperparameters={
        "n_epochs": 3,                    # Number of epochs
        "learning_rate_multiplier": 1.8,  # Relative to default
        "batch_size": 4,                  # Auto if not specified
    },
    suffix="customer-support-v1",  # Custom model name suffix
)
print(f"Job created: {job.id}")
print(f"Status: {job.status}")

# Step 4: Monitor training progress
def monitor_fine_tuning(job_id: str, poll_interval: int = 60):
    """Poll a fine-tuning job until completion."""
    while True:
        job = client.fine_tuning.jobs.retrieve(job_id)
        print(f"Status: {job.status}")
        # Check for events (training metrics)
        events = client.fine_tuning.jobs.list_events(
            fine_tuning_job_id=job_id, limit=5
        )
        for event in events.data:
            print(f"  [{event.created_at}] {event.message}")
        if job.status in ("succeeded", "failed", "cancelled"):
            break
        time.sleep(poll_interval)

    if job.status == "succeeded":
        print(f"\nFine-tuned model: {job.fine_tuned_model}")
        return job.fine_tuned_model
    else:
        print(f"\nJob {job.status}: {job.error}")
        return None

model_name = monitor_fine_tuning(job.id)
1.3 Using the Fine-Tuned Model
Code Fragment 14.4.3 shows how to call the fine-tuned model using the standard chat completions API.
# Step 5: Use the fine-tuned model (identical to standard API calls)
response = client.chat.completions.create(
    model="ft:gpt-4o-mini-2024-07-18:my-org:customer-support-v1:9abc123",
    messages=[
        {"role": "system", "content": "You are a helpful customer support agent."},
        {"role": "user", "content": "I can't find my order confirmation email."},
    ],
    temperature=0.7,
    max_tokens=200,
)
# Extract the generated message from the API response
print(response.choices[0].message.content)
OpenAI fine-tuning pricing (as of early 2025; check current rates before planning). You pay for training tokens (the number of tokens in your dataset multiplied by the number of epochs) and for inference on the fine-tuned model (which is more expensive per token than the base model). For GPT-4o-mini, training costs approximately $3.00 per million tokens and inference costs $0.30/$1.20 per million input/output tokens. LLM API prices drop frequently, so always check the provider's pricing page and estimate total cost before starting a job, especially with large datasets.
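As a quick sanity check on these numbers, the training-token arithmetic can be scripted. This is a minimal sketch; the $3.00 per million tokens default is the illustrative GPT-4o-mini figure quoted above, not a guaranteed rate.

```python
# Back-of-envelope training cost for per-token fine-tuning pricing.
# The default rate is illustrative; always check current provider pricing.
def estimate_training_cost(num_examples: int, avg_tokens: int,
                           epochs: int, price_per_million: float = 3.00) -> float:
    """Training cost = dataset tokens x epochs x per-million-token rate."""
    total_tokens = num_examples * avg_tokens * epochs
    return total_tokens / 1_000_000 * price_per_million

# 10,000 examples x 500 tokens x 3 epochs = 15M training tokens -> $45 at $3/M
print(f"${estimate_training_cost(10_000, 500, 3):.2f}")
```

Running this before submitting a job takes seconds and catches order-of-magnitude surprises, especially when the epoch count or average example length is larger than you assumed.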
2. Google Vertex AI Fine-Tuning
Google Vertex AI provides fine-tuning for Gemini models with a similar managed experience. The workflow uses the Google Cloud SDK and supports both supervised fine-tuning and RLHF-style tuning. Vertex AI gives you slightly more control over hyperparameters compared to OpenAI. Code Fragment 14.4.4 shows this in practice.
2.1 Vertex AI Workflow
This snippet launches a fine-tuning job on Google Cloud Vertex AI using the Gemini tuning API.
# Upload training data to GCS and launch a Vertex AI SFT job
# Vertex AI fine-tunes Gemini models with LoRA adapter configuration
import time

import vertexai
from vertexai.tuning import sft as vertex_sft
from google.cloud import storage

# Initialize Vertex AI
vertexai.init(project="my-project-id", location="us-central1")

# Step 1: Upload training data to GCS
# Vertex AI expects data in GCS (Google Cloud Storage)
# Format: JSONL with the same messages structure as OpenAI
def upload_to_gcs(local_path: str, bucket_name: str, blob_name: str) -> str:
    """Upload training data to Google Cloud Storage."""
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    blob = bucket.blob(blob_name)
    blob.upload_from_filename(local_path)
    gcs_uri = f"gs://{bucket_name}/{blob_name}"
    print(f"Uploaded to {gcs_uri}")
    return gcs_uri

train_uri = upload_to_gcs(
    "train.jsonl",
    "my-training-bucket",
    "fine-tuning/medical-qa/train.jsonl"
)
val_uri = upload_to_gcs(
    "val.jsonl",
    "my-training-bucket",
    "fine-tuning/medical-qa/val.jsonl"
)

# Step 2: Create supervised fine-tuning job
sft_tuning_job = vertex_sft.train(
    source_model="gemini-1.5-flash-002",
    train_dataset=train_uri,
    validation_dataset=val_uri,
    epochs=3,
    adapter_size=4,  # LoRA rank (1, 4, 8, or 16)
    learning_rate_multiplier=1.0,
    tuned_model_display_name="medical-qa-gemini-v1",
)

# Step 3: Monitor the job by polling for completion
print(f"Job resource: {sft_tuning_job.resource_name}")
while not sft_tuning_job.has_ended:
    time.sleep(60)
    sft_tuning_job.refresh()
    print(f"State: {sft_tuning_job.state}")

# Step 4: Get the tuned model endpoint
tuned_model = sft_tuning_job.tuned_model_endpoint_name
print(f"Tuned model endpoint: {tuned_model}")
2.2 Using the Vertex AI Fine-Tuned Model
Code Fragment 14.4.5 demonstrates using the tuned Gemini model for inference.
# Use the fine-tuned Gemini model for inference via Vertex AI
# The tuned model endpoint is returned by the completed SFT job
from vertexai.generative_models import GenerativeModel

# Load the fine-tuned model
model = GenerativeModel(
    model_name=tuned_model,  # Endpoint from training
)

# Generate responses
response = model.generate_content(
    "Patient presents with recurring headaches and blurred vision. "
    "Suggest differential diagnoses.",
    generation_config={
        "temperature": 0.3,
        "max_output_tokens": 500,
    }
)
print(response.text)
3. Provider Comparison
Table 14.4.1 compares the key decision factors across OpenAI, Google Vertex AI, and self-hosted fine-tuning.
| Aspect | OpenAI | Google Vertex AI | Self-Hosted (TRL) |
|---|---|---|---|
| Available models | GPT-4o, GPT-4o-mini, GPT-3.5 | Gemini 1.5 Flash, Gemini 1.5 Pro | Any open-weight model |
| Data format | JSONL (ChatML) | JSONL (ChatML) | Any (ChatML, Alpaca, ShareGPT) |
| Training data limit | 50 million tokens | 10,000 examples | Unlimited |
| Hyperparameter control | Epochs, LR multiplier, batch size | Epochs, LR multiplier, adapter size | Full control over all parameters |
| Training cost (10K examples) | ~$15 to $50 (GPT-4o-mini) | ~$10 to $40 (Gemini Flash) | $5 to $20 (cloud GPU rental) |
| Time to first result | 30 min to 2 hours | 1 to 3 hours | Hours to days (setup + training) |
| Data privacy | Data processed by OpenAI | Data processed by Google | Data stays on your servers |
| Model weights access | No (API only) | No (API only) | Full access |
| Serving | Included (pay per token) | Included (pay per token) | Self-managed (vLLM, TGI) |
4. Anthropic Claude Fine-Tuning
Anthropic offers fine-tuning for Claude models, enabling teams to customize Claude's behavior on domain-specific tasks. The workflow follows a similar pattern to OpenAI and Vertex AI: prepare data in JSONL format, submit a fine-tuning job, and use the resulting model through the Messages API. Claude fine-tuning has supported models including Claude 3 Haiku, with data formatted as conversational message pairs; availability has shifted over time (at the time of writing, Claude fine-tuning has been offered primarily through Amazon Bedrock), so check Anthropic's current documentation for supported models and access paths.
4.1 Data Format and Job Submission
Anthropic's fine-tuning data format mirrors the Messages API structure. Each training example is a JSON object with a messages array (alternating user and assistant turns) and an optional system field. The training data must be uploaded as a JSONL file. Code Fragment 14.4.3 shows this approach in practice.
# Fine-tune a Claude model via the Anthropic API
# Anthropic uses a different message format: system is a top-level field
# NOTE: The job-submission calls below are illustrative of the workflow
# shape; check Anthropic's current docs for the exact SDK surface, since
# Claude fine-tuning has been offered through Amazon Bedrock at times.
import json
import time

import anthropic

# Step 1: Prepare training data in Anthropic's format
training_examples = [
    {
        "system": "You are a medical triage assistant.",
        "messages": [
            {"role": "user", "content": "Patient reports chest pain radiating to left arm."},
            {"role": "assistant", "content": "Priority: URGENT. Recommend immediate ECG and "
             "troponin levels. Differential includes acute coronary syndrome, "
             "musculoskeletal pain, and anxiety. Escalate to attending physician."}
        ]
    },
    # ... more examples in the same format
]

# Write the JSONL file
with open("anthropic_train.jsonl", "w") as f:
    for ex in training_examples:
        f.write(json.dumps(ex) + "\n")

# Step 2: Create the fine-tuning job via the Anthropic SDK
client = anthropic.Anthropic()  # Uses ANTHROPIC_API_KEY env var

# Upload the training file
with open("anthropic_train.jsonl", "rb") as f:
    training_file = client.files.create(
        file=f,
        purpose="fine-tune"
    )

# Create the fine-tuning job
job = client.fine_tuning.jobs.create(
    model="claude-3-5-haiku-20241022",
    training_file=training_file.id,
    hyperparameters={
        "n_epochs": 4,
        "learning_rate": 1e-5,
        "batch_size": 8,
    },
)
print(f"Fine-tuning job created: {job.id}")
print(f"Status: {job.status}")

# Step 3: Poll for completion
while job.status not in ("succeeded", "failed", "cancelled"):
    time.sleep(60)
    job = client.fine_tuning.jobs.retrieve(job.id)
    print(f"Status: {job.status}")

# Step 4: Use the fine-tuned model
if job.status == "succeeded":
    response = client.messages.create(
        model=job.fine_tuned_model,
        max_tokens=300,
        system="You are a medical triage assistant.",
        messages=[
            {"role": "user", "content": "Patient has sudden onset severe headache."}
        ],
    )
    print(response.content[0].text)
Key differences from OpenAI fine-tuning. Anthropic's format places the system prompt as a top-level field rather than inside the messages array, and it enforces strict alternation between user and assistant turns within the messages array. Check the latest Anthropic documentation for supported models and pricing, as the fine-tuning offering continues to evolve.
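The strict-alternation requirement is easy to violate when converting data from other formats, so it is worth checking before upload. The validator below is an assumed helper written for this section, not part of any SDK:

```python
# Validate an Anthropic-style example: a top-level "system" field plus a
# messages array that starts with "user" and strictly alternates turns.
# This is an illustrative helper, not part of the anthropic SDK.
def validate_anthropic_example(example: dict) -> list:
    """Return a list of problems; an empty list means the example is valid."""
    errors = []
    messages = example.get("messages", [])
    if not messages:
        return ["empty messages"]
    roles = [m.get("role") for m in messages]
    if "system" in roles:
        errors.append("system must be a top-level field, not a message role")
    for i, role in enumerate(roles):
        expected = "user" if i % 2 == 0 else "assistant"
        if role != expected:
            errors.append(f"message {i}: expected role '{expected}', got '{role}'")
    if roles[-1] != "assistant":
        errors.append("last message must be 'assistant'")
    return errors
```

Run this over every line of the JSONL file before submission; catching a single out-of-order turn locally is far cheaper than a rejected or silently degraded training job.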
Why this matters: Provider API fine-tuning represents a fundamental tradeoff: you sacrifice control and customization in exchange for dramatically lower operational complexity. For teams without dedicated ML infrastructure engineers, API fine-tuning eliminates GPU procurement, distributed training configuration, and model deployment entirely. The practical implication is that your first fine-tuning experiment should almost always start with a provider API (as covered in Chapter 9), and you should only move to self-hosted training once you have validated that fine-tuning actually improves your use case and you need capabilities the API does not expose.
5. Third-Party Fine-Tuning Platforms
Beyond the major model providers, several third-party platforms offer managed fine-tuning with additional flexibility. These platforms are particularly valuable when you need to fine-tune open-weight models (Llama, Mistral, Qwen) with more hyperparameter control than provider APIs allow, but without managing your own GPU infrastructure.
| Platform | Supported Models | Key Differentiators | Hyperparameter Control |
|---|---|---|---|
| Together AI | Llama 3.x, Mistral, Qwen, Code Llama, custom uploads | Broad open-model catalog; integrated serverless inference; competitive training pricing | Full: LR, epochs, batch size, warmup, LoRA rank, weight decay |
| Fireworks AI | Llama 3.x, Mistral, Mixtral, Gemma, custom uploads | Optimized inference engine (FireAttention); sub-second cold starts; model composition (combining base + adapters at serving time) | Full: LR, epochs, batch size, LoRA config, sequence length |
| Anyscale | Llama 3.x, Mistral, custom models via Ray | Built on Ray for distributed training; seamless scale from fine-tuning to multi-node training; enterprise MLOps integration | Full: all training parameters, distributed training config, custom training scripts |
These platforms fill an important gap in the fine-tuning ecosystem. Provider APIs (OpenAI, Google, Anthropic) offer simplicity but limit you to their proprietary models with restricted hyperparameter access. Self-hosted training (using TRL or Axolotl) gives full control but requires GPU procurement and ML engineering expertise. Third-party platforms provide a middle path: you get access to open-weight models with extensive hyperparameter control, while the platform handles GPU provisioning, distributed training orchestration, and serving infrastructure.
5.1 Together AI: Fine-Tuning Open-Weight Models
Together AI provides a Python SDK and REST API for fine-tuning open-weight models on managed infrastructure. The workflow resembles the OpenAI pattern (upload data, create job, poll for completion), but you choose from a catalog of open-weight base models and have access to a richer set of hyperparameters including LoRA rank, warmup ratio, and weight decay. Once training completes, the fine-tuned model is automatically available for inference through Together's serverless endpoint, or you can download the weights for self-hosted deployment. Code Fragment 14.4.7 shows this in practice.
# Fine-tune an open-weight model on Together AI with full LoRA control
# Together supports Llama, Mistral, Qwen, and custom model uploads
# NOTE: Method names below follow an older together-python interface and
# are illustrative; check Together's current docs for the exact calls.
import time

import together

# Step 1: Upload training data (same JSONL/ChatML format)
train_file = together.Files.upload(file="train.jsonl")
print(f"Uploaded: {train_file.id}")

# Step 2: Create a fine-tuning job with full hyperparameter control
job = together.Fine_tuning.create(
    training_file=train_file.id,
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    n_epochs=3,
    learning_rate=2e-5,
    batch_size=8,
    lora=True,
    lora_r=16,          # LoRA rank
    lora_alpha=32,      # LoRA scaling factor
    lora_dropout=0.05,
    warmup_ratio=0.1,
    suffix="customer-support-v1",
)
print(f"Job: {job.id}, Status: {job.status}")

# Step 3: Poll for completion
while job.status not in ("completed", "failed", "cancelled"):
    time.sleep(30)
    job = together.Fine_tuning.retrieve(job.id)
    print(f"Status: {job.status}")

# Step 4: Inference via Together's serverless endpoint
if job.status == "completed":
    response = together.Complete.create(
        model=job.output_name,
        prompt="<|user|>\nHow do I reset my password?\n<|assistant|>\n",
        max_tokens=200,
        temperature=0.7,
    )
    print(response.choices[0].text)
5.2 Fireworks AI: Adapter Composition and Fast Inference
Fireworks AI differentiates itself through its optimized inference engine (FireAttention) and its support for adapter composition: you can train multiple LoRA adapters against the same base model and combine them at serving time without retraining. This is particularly useful when you need different behavioral specializations (tone, domain expertise, format) that can be mixed and matched per request. Fine-tuning through Fireworks follows a similar API pattern, with the addition of serving configuration that specifies how the adapter should be loaded. Code Fragment 14.4.8 shows this in practice.
# Fine-tune a LoRA adapter on Fireworks AI with composable deployment
# Fireworks supports adapter composition: mix adapters at serving time
# NOTE: The client methods shown are illustrative of the workflow shape;
# check Fireworks' current documentation for the exact SDK surface.
import fireworks.client as fw

# Configure the client
fw.api_key = "your-api-key"

# Step 1: Upload dataset
dataset = fw.Datasets.create(
    file=open("train.jsonl", "rb"),
    name="support-training-data",
)

# Step 2: Create fine-tuning job
job = fw.FineTuning.create(
    model="accounts/fireworks/models/llama-v3p1-8b-instruct",
    dataset=dataset.id,
    epochs=3,
    learning_rate=1e-4,
    lora_rank=16,
    max_seq_length=2048,
    output_model_name="support-adapter-v1",
)

# Step 3: Deploy the adapter (composable with base model)
# Fireworks serves the adapter on top of the base model at inference time
deployment = fw.Deployments.create(
    model=job.output_model,
    min_replicas=1,
    max_replicas=4,  # auto-scales based on traffic
)

# Step 4: Inference (adapter applied at serving time)
response = fw.ChatCompletion.create(
    model="accounts/my-org/models/support-adapter-v1",
    messages=[
        {"role": "user", "content": "I can't find my order confirmation."}
    ],
    temperature=0.7,
)
print(response.choices[0].message.content)
Adapter composition is a powerful pattern. With Fireworks (and increasingly with other platforms), you can train separate LoRA adapters for different capabilities: one for medical terminology, one for formal tone, one for JSON output formatting. At serving time, you select which adapters to apply per request. This avoids the combinatorial explosion of training separate models for every combination of behaviors and allows rapid iteration on individual capabilities without retraining the others.
OpenAI's fine-tuning API distills the entire training process into a single API call with a JSONL file. What used to require a GPU cluster and a PhD now requires a credit card and a JSON formatter.
Choose your fine-tuning platform based on your model requirements, not just convenience. If you need a proprietary model (GPT-4o, Claude, Gemini), use the provider's API. If you need an open-weight model with managed infrastructure, use Together AI, Fireworks, or Anyscale. If you need maximum control or have strict data residency requirements, self-host with TRL. Many production systems use a combination: fine-tune on a third-party platform, then deploy the resulting weights on your own infrastructure.
6. Cost Analysis Framework
The true cost of API fine-tuning depends on your training data size, the number of epochs, and your expected inference volume. The following calculator helps you estimate and compare costs across providers and approaches.
Code Fragment 14.4.9 provides a cost comparison calculator that estimates expenses across providers.
# Compare fine-tuning costs across API providers and self-hosted options
# Estimates training cost, monthly inference cost, and annual totals
from dataclasses import dataclass

@dataclass
class FineTuningCostEstimate:
    """Compare fine-tuning costs across providers."""

    # Dataset parameters
    num_examples: int = 10_000
    avg_tokens_per_example: int = 500
    num_epochs: int = 3

    # Inference parameters (monthly)
    monthly_requests: int = 100_000
    avg_input_tokens: int = 300
    avg_output_tokens: int = 150

    def openai_cost(self, model: str = "gpt-4o-mini") -> dict:
        """Estimate OpenAI fine-tuning + inference costs."""
        # Training pricing (per 1M tokens)
        training_prices = {
            "gpt-4o-mini": {"train": 3.00},
            "gpt-4o": {"train": 25.00},
        }
        # Inference pricing (per 1M tokens, fine-tuned models)
        inference_prices = {
            "gpt-4o-mini": {"input": 0.30, "output": 1.20},
            "gpt-4o": {"input": 3.75, "output": 15.00},
        }
        train_price = training_prices[model]
        infer_price = inference_prices[model]

        # Training cost
        total_training_tokens = (
            self.num_examples * self.avg_tokens_per_example * self.num_epochs
        )
        training_cost = (total_training_tokens / 1_000_000) * train_price["train"]

        # Monthly inference cost
        monthly_input_tokens = self.monthly_requests * self.avg_input_tokens
        monthly_output_tokens = self.monthly_requests * self.avg_output_tokens
        monthly_inference = (
            (monthly_input_tokens / 1_000_000) * infer_price["input"] +
            (monthly_output_tokens / 1_000_000) * infer_price["output"]
        )

        return {
            "provider": f"OpenAI ({model})",
            "training_cost": f"${training_cost:.2f}",
            "monthly_inference": f"${monthly_inference:.2f}",
            "annual_total": f"${training_cost + monthly_inference * 12:.2f}",
            "training_tokens": f"{total_training_tokens:,}",
        }

    def self_hosted_cost(self, gpu_hourly: float = 2.50) -> dict:
        """Estimate self-hosted fine-tuning costs."""
        # Rough estimate: ~10K tokens/second on an A100
        total_training_tokens = (
            self.num_examples * self.avg_tokens_per_example * self.num_epochs
        )
        training_hours = total_training_tokens / (10_000 * 3600)
        training_cost = training_hours * gpu_hourly

        # Serving: dedicated GPU instance
        serving_monthly = gpu_hourly * 24 * 30  # Always-on

        return {
            "provider": "Self-hosted (A100)",
            "training_cost": f"${training_cost:.2f}",
            "monthly_inference": f"${serving_monthly:.2f}",
            "annual_total": f"${training_cost + serving_monthly * 12:.2f}",
            "training_tokens": f"{total_training_tokens:,}",
        }

# Compare costs
estimator = FineTuningCostEstimate(
    num_examples=10_000,
    monthly_requests=100_000,
)
for result in [
    estimator.openai_cost("gpt-4o-mini"),
    estimator.openai_cost("gpt-4o"),
    estimator.self_hosted_cost(),
]:
    print(f"\n{result['provider']}:")
    for k, v in result.items():
        if k != "provider":
            print(f"  {k}: {v}")
The breakeven point is about volume. API fine-tuning is cheaper at low to moderate inference volumes (under 500K requests per month for GPT-4o-mini). Self-hosted becomes cheaper at high volumes because you pay a fixed infrastructure cost regardless of how many requests you serve. For most startups and early-stage projects, API fine-tuning is the right starting point. Transition to self-hosted when your monthly API bill consistently exceeds the cost of a dedicated GPU instance.
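The breakeven logic can be made concrete with a small calculator. This is a sketch under stated assumptions: the per-token rates and the GPU cost are illustrative placeholders, and the breakeven volume moves substantially with request size and model choice.

```python
# Breakeven sketch: at what monthly volume does a fixed-cost GPU undercut
# per-token API pricing? All rates below are illustrative placeholders.
def api_monthly_cost(requests: int, in_tokens: int = 300, out_tokens: int = 150,
                     in_price: float = 0.30, out_price: float = 1.20) -> float:
    """Monthly API bill in USD, with prices quoted per 1M tokens."""
    return requests * (in_tokens * in_price + out_tokens * out_price) / 1_000_000

def breakeven_requests(gpu_monthly_cost: float, **price_kwargs) -> int:
    """Monthly request volume at which the API bill matches a dedicated GPU."""
    cost_per_request = api_monthly_cost(1, **price_kwargs)
    return round(gpu_monthly_cost / cost_per_request)

# e.g., a $2.50/hr always-on GPU costs roughly $1,800/month
print(breakeven_requests(1800.0))
```

Rerun the calculation with your own token counts and current prices: heavier requests or pricier models pull the breakeven volume down sharply, which is why the crossover point differs so much between teams.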
Data privacy is non-negotiable for some industries. If you work in healthcare (HIPAA), finance (SOC 2), or government (FedRAMP), sending training data to a third-party API may violate compliance requirements. Always verify that your provider's data handling policies meet your regulatory obligations before uploading any data. When in doubt, use self-hosted fine-tuning to keep data within your controlled environment.
7. Best Practices for API Fine-Tuning
7.1 Iterative Refinement Workflow
Figure 14.4.3 outlines the iterative workflow: start small, evaluate, identify failure patterns, and refine.
Start with 100 to 500 examples. Many teams over-invest in data collection before validating that fine-tuning will work for their use case. Begin with a small, high-quality dataset and run a quick fine-tuning job. If the results are promising, scale up the data. If the model does not improve, the problem may be with your task framing, data quality, or prompt design rather than data quantity.
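To put this into practice, draw a reproducible pilot subset from whatever data you already have before committing to full-scale labeling. The helper below is a small assumed utility, not part of any SDK:

```python
# Draw a reproducible pilot subset (e.g., 300 examples) for a first
# fine-tuning run before scaling up data collection.
import random
from typing import List, Dict

def pilot_subset(examples: List[Dict], n: int = 300, seed: int = 42) -> List[Dict]:
    """Sample up to n examples with a fixed seed so repeat runs are comparable."""
    rng = random.Random(seed)
    return rng.sample(examples, min(n, len(examples)))
```

Fixing the seed matters: when you later compare a 300-example run against a 1,000-example run, you want the smaller set to be a stable baseline rather than a fresh random draw each time.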
- API fine-tuning is the fastest path from data to a deployed fine-tuned model, requiring no GPU infrastructure or ML engineering expertise beyond data preparation.
- OpenAI and Vertex AI both use the ChatML/messages JSONL format, making it easy to prepare data that works across providers.
- Start small (100 to 500 examples) and iterate. Do not invest weeks in data collection before validating that fine-tuning improves your specific task.
- API fine-tuning is cost-effective at low to moderate volumes (under 500K requests/month); self-hosted becomes cheaper at higher volumes.
- Data privacy requirements may mandate self-hosted fine-tuning in regulated industries (healthcare, finance, government).
- You trade control for convenience: API fine-tuning limits hyperparameter access and locks you into the provider's serving infrastructure.
Who: A two-person data team at a real estate analytics company that needed to generate standardized property valuation reports from raw listing data.
Situation: They used GPT-4 with detailed prompting to generate reports, achieving 88% format compliance. Each report cost $0.12 in API fees, and they processed 5,000 reports per month ($600/month).
Problem: The remaining 12% of reports had formatting issues (missing sections, inconsistent currency formatting, incomplete comparable property analysis) that required manual correction, consuming 15 hours per month of analyst time.
Dilemma: They could add more examples to the prompt (approaching context window limits), build a post-processing pipeline to fix formatting (complex, fragile), or fine-tune a smaller model to consistently produce the exact report format.
Decision: They fine-tuned GPT-4o-mini through the OpenAI API using 400 manually verified report examples (input: raw listing data, output: correctly formatted report). The fine-tuning job cost $45 and completed in 2 hours.
How: They exported 400 of their best reports as JSONL in the ChatML format, validated the file with OpenAI's data preparation tool, submitted the fine-tuning job with default hyperparameters (3 epochs, auto batch size), and evaluated on a 50-report holdout set.
Result: Format compliance rose from 88% to 99.2%. Per-report cost dropped from $0.12 (GPT-4) to $0.008 (fine-tuned GPT-4o-mini), reducing monthly API costs from $600 to $40. Manual correction time fell from 15 hours to 1 hour per month. Total ROI was achieved in the first month.
Lesson: API fine-tuning is the fastest path from problem to solution when your goal is format consistency and style adaptation; even 300 to 500 high-quality examples can produce dramatic improvements, especially when moving from a large model to a fine-tuned smaller one.
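The case-study arithmetic checks out in a few lines (all figures taken directly from the account above):

```python
# Verify the case-study ROI: 5,000 reports/month, $0.12 -> $0.008 per
# report after fine-tuning, plus a one-time $45 training job.
reports_per_month = 5_000
old_monthly = reports_per_month * 0.12     # GPT-4 prompting: ~$600/month
new_monthly = reports_per_month * 0.008    # fine-tuned GPT-4o-mini: ~$40/month
training_cost = 45.0

monthly_savings = old_monthly - new_monthly
payback_months = training_cost / monthly_savings
print(f"${old_monthly:.0f} -> ${new_monthly:.0f}/month; "
      f"payback in {payback_months:.2f} months")
```

The API-fee savings alone repay the training cost in well under a month, before counting the 14 hours of analyst time recovered each month.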
Provider fine-tuning APIs are evolving toward continuous fine-tuning workflows where models are incrementally updated as new data arrives, rather than retrained from scratch each time. Emerging research on federated fine-tuning enables organizations to collaboratively train shared adapters without exposing their private data to the API provider.
A key open challenge is developing standardized benchmarks for comparing fine-tuning quality across different providers, since model architectures, base weights, and hyperparameter ranges differ significantly.
Exercises
Compare OpenAI and Google Vertex AI fine-tuning APIs on three dimensions: supported models, data format requirements, and hyperparameter control.
Answer Sketch
OpenAI: supports GPT-4o-mini and GPT-4o fine-tuning; requires JSONL with messages format; limited hyperparameters (epochs, learning rate multiplier, batch size). Google Vertex AI: supports Gemini models; accepts JSONL or BigQuery; offers more hyperparameter control including adapter size and learning rate schedule. Both handle infrastructure automatically but OpenAI provides less visibility into training progress.
Write the complete Python code to fine-tune a GPT-4o-mini model via the OpenAI API: upload the training file, create the fine-tuning job, and poll for completion.
Answer Sketch
Upload: file = client.files.create(file=open('train.jsonl','rb'), purpose='fine-tune'). Create job: job = client.fine_tuning.jobs.create(training_file=file.id, model='gpt-4o-mini-2024-07-18', hyperparameters={'n_epochs':2}). Poll: while job.status not in ['succeeded','failed']: time.sleep(60); job = client.fine_tuning.jobs.retrieve(job.id). Use the resulting job.fine_tuned_model name for inference.
A healthcare company needs to fine-tune an LLM on patient data that cannot leave their infrastructure. Can they use OpenAI's fine-tuning API? What alternatives should they consider?
Answer Sketch
No, OpenAI's fine-tuning API sends data to OpenAI's servers, which violates data residency requirements for protected health information. Alternatives: (1) Self-hosted fine-tuning using an open-source model (Llama, Mistral) on their own GPU infrastructure. (2) Azure OpenAI with a private endpoint in their own Azure subscription (data stays in their tenant). (3) Google Vertex AI with VPC Service Controls. The key requirement is that training data never leaves their controlled environment.
Write a Python evaluation script that compares a base model and a fine-tuned model on a held-out test set. Compute accuracy, response length, and latency for each. Use the same prompts for both.
Answer Sketch
For each test example: call both client.chat.completions.create(model=base_model, ...) and client.chat.completions.create(model=ft_model, ...). Measure: (1) accuracy by comparing to ground truth (exact match or LLM-judge), (2) response length in tokens, (3) wall-clock latency. Aggregate into a comparison table. Plot accuracy and latency distributions as histograms. A good fine-tune should improve accuracy without significantly increasing latency.
OpenAI charges $25 per million training tokens for GPT-4o fine-tuning. You have 2,000 training examples averaging 500 tokens each. Estimate the fine-tuning cost for 3 epochs.
Answer Sketch
Total tokens per epoch: 2,000 * 500 = 1,000,000 tokens. Over 3 epochs: 3,000,000 tokens. Cost: 3 * $25 = $75. Note: this is the training cost only. Inference on the fine-tuned model is also more expensive than the base model (typically 2 to 3x the base per-token price). Factor in the higher inference cost when computing the total ROI of fine-tuning.
What Comes Next
In the next section, Section 14.5: Fine-Tuning for Representation Learning, we examine fine-tuning for representation learning, adapting models to produce better embeddings for downstream tasks. The structured output formats discussed here connect to the API parameter handling in Section 10.2.
Provider fine-tuning APIs deliberately hide most hyperparameters from you, which feels limiting until you realize that 90% of fine-tuning failures come from bad data, not bad hyperparameters. They are protecting you from yourself.
OpenAI. (2024). Fine-Tuning Guide.
The official OpenAI fine-tuning documentation covering data format requirements, model selection, hyperparameter options, and cost estimation. This is the primary reference for the OpenAI workflow covered in this section. Essential for anyone planning to use OpenAI's managed fine-tuning service.
Google Cloud. (2024). Vertex AI Model Tuning.
Google's guide to fine-tuning Gemini and PaLM models through Vertex AI, including supervised tuning and RLHF options. Covers the Google-specific workflow discussed in this section, including data formatting differences from OpenAI. Recommended for teams in the Google Cloud ecosystem.
Anthropic. (2024). Fine-Tuning Claude.
Anthropic's documentation for Claude fine-tuning, covering eligibility requirements, data preparation, and the managed training process. Useful as a comparison point for teams evaluating multiple providers, and for understanding Anthropic's approach to safety during fine-tuning.
Together AI. (2024). Fine-Tuning API Documentation.
Together AI's fine-tuning documentation, which supports open-source models like Llama and Mistral through a managed API. Together AI offers more model choice and lower prices than closed-model providers at the cost of some convenience. Useful for teams wanting API simplicity with open-source model flexibility.
Anyscale. (2024). Fine-Tuning LLMs: A Comprehensive Guide.
A practical guide from Anyscale covering fine-tuning best practices, common pitfalls, and performance optimization tips. Includes benchmarks comparing managed fine-tuning across providers. Helpful for teams evaluating the cost-quality tradeoffs between different managed fine-tuning options.
Zhou, C. et al. (2023). LIMA: Less Is More for Alignment. NeurIPS 2023.
Demonstrates that fine-tuning on just 1,000 carefully curated examples can produce a highly capable model, challenging the assumption that more data is always better. LIMA's findings directly support the "quality over quantity" principle emphasized in the data preparation guidance for API fine-tuning.
