Platforms

Section 14.1

"An API key is a six-month commitment to a vendor's roadmap. Choose the vendor, not the model."

CompassCompass, Vendor-Locked-In-But-Honest AI Agent
Big Picture

"Platform" in Part III means something different from Parts I and II: in those parts the platform was a GPU cluster or an inference server, but here it is the LLM-as-a-service vendor whose API your agent code lives against (OpenAI, Anthropic, Google, Mistral, Cohere, Together, Groq, Replicate, Fireworks, Bedrock, Vertex). This section maps that vendor shelf and tells you when each platform earns its monthly invoice.

Prerequisites

This section assumes you have built simple LLM API calls from Section 11.1, understand auth and retry patterns from Section 11.4, and have a working mental model of the closed-versus-open split from Section 10.6.

"Platform" in Part III means something different from Parts I and II. Where Part II's question was "where do I run a 70-billion-parameter open-weights model when I do not own an H100", Part III's question is the inverse: "where do I call a frontier model I will never run myself, and what stack sits between my code and that inference endpoint". The answer is a small ecosystem of managed-inference providers, and the choice among them has gotten more interesting (not less) as the market matured through 2025.

This section maps that ecosystem the way Sections 6.1 and 12.1 mapped theirs: vendor by vendor, what they sell, when to pick them, when not to. The shape of the 2026 market is roughly two tiers: first-party model providers (OpenAI, Anthropic, Google) who own the underlying models, and aggregator / model-router platforms (AWS Bedrock, Azure OpenAI, Together, Replicate, Fireworks) who resell those plus their own. The tier you pick shapes pricing, latency, and most importantly which models you have access to.

Three-tier 2026 LLM API platform map with named vendors
Figure 14.1.1: The 2026 LLM API platform shelf divides into three tiers. First-party providers (OpenAI with GPT-5 and the o-series, Anthropic with Claude 4.5 family plus 1M context and 90% prompt-cache discount, Google AI Studio and Vertex with Gemini 2.5 Pro's 2M context, xAI Grok 4) win on model quality and feature freshness. Hyperscaler resellers (AWS Bedrock as a multi-vendor marketplace with VPC and FedRAMP, Azure OpenAI with PTU pricing) win on compliance and existing-cloud integration but cost 10 to 20% more and add 50 to 150 ms latency. Open-weights hosts (Together AI, Fireworks AI with industry-leading 150-to-300 ms P50, Replicate for multimodal) win on per-token cost; LM Studio and Ollama provide local dev fallbacks. Always route through LiteLLM or OpenRouter so a single provider outage does not take you down.

14.1.1 First-party model providers

OpenAI: the default for many production teams. USP: GPT-5 family (released 2025) plus the multimodal o-series reasoning models; the broadest tool-use ecosystem (Assistants, structured outputs, file search); the most mature batch API. Pick OpenAI when you want the broadest feature surface and you can tolerate vendor-lock to their tool-use formats. Anti-recommendation: do not pick OpenAI as your only provider; the 2023 and 2024 service incidents made multi-provider abstraction layers mandatory.

Anthropic: the default for long-context and tool-heavy workloads. USP: the Claude 4 family (Opus 4.5, Sonnet 4.5, Haiku 4.5; the Opus class powers this book's authoring stack) plus extended-thinking on Claude 3.7+; the 1M context window; prompt caching with the most aggressive pricing in the market (90% discount on cache hits, see Anthropic prompt caching docs). Pick Anthropic when you have long-context workloads (codebases, document review) or when cache hit rates dominate cost. Anti-recommendation: skip if your workload is high-frequency low-token chat where Haiku is overkill and a smaller open-weights model would be cheaper.

Google AI Studio and Vertex AI: Google's two front doors to Gemini. AI Studio is the developer-friendly free-tier-included entry; Vertex is the GCP-integrated enterprise version. USP: Gemini 2.5 Pro's native multimodality (image, video, audio in one model) and the cheapest frontier-class long-context tokens by a wide margin in 2025. Pick when you do video understanding, when you are already on GCP, or when long-context cost dominates. Anti-recommendation: avoid Vertex if you are not on GCP; the integration surface assumes you are.

14.1.2 The hyperscaler resellers

AWS Bedrock: a model marketplace inside AWS. USP: every major closed-weights provider (Anthropic, Cohere, Meta, AI21, Mistral, plus Amazon Nova) behind one IAM-controlled API, with VPC-private endpoints and full audit logging. Pick Bedrock when you are an AWS-shop enterprise where data residency, SOC2, or HIPAA controls are non-negotiable. Anti-recommendation: Bedrock pricing is a 10 to 20% markup over direct provider pricing; do not use it for cost-optimization, only for compliance.

Azure OpenAI: OpenAI's models hosted in Azure with private endpoints and enterprise SLAs. USP: same models as OpenAI direct but with Azure's compliance and data-residency story; provisioned-throughput-unit (PTU) pricing offers predictable latency for high-volume workloads. Pick if you are a Microsoft-shop with regulatory constraints. Anti-recommendation: do not use Azure OpenAI for experimentation; the deployment-creation flow is slow and the model catalog lags OpenAI direct by weeks.

14.1.3 The independent inference platforms

Together AI: the largest open-weights inference host. USP: serverless inference for hundreds of open-weights models (Llama 4 family, Qwen3, DeepSeek, Mixtral) at competitive per-token pricing, plus dedicated endpoints for production traffic. Pick Together when you want to serve open-weights models without operating GPUs. Anti-recommendation: not the cheapest for very high volume; at that point dedicated GPUs on RunPod or self-hosted vLLM are better.

Replicate: per-request serverless for any model published to its registry. USP: the easiest way to call image, audio, and video models alongside LLMs from one API; the registry is the most diverse in the industry. Pick Replicate for multi-modal applications where you mix-and-match models and accept warm-start latency. Anti-recommendation: avoid for latency-sensitive chat where the cold-start overhead is unacceptable.

Fireworks AI: open-weights specialist focused on throughput and latency optimization. USP: the fastest open-weights inference benchmarks in 2025 (often 2-3x throughput of Together on Llama-class models) and a quality-controlled fine-tuning service. Pick Fireworks when latency or tokens-per-second matter more than model breadth. Anti-recommendation: smaller model catalog than Together; if you need a very long tail of open-weights options, look elsewhere.

Beyond hosted APIs, three desktop / local-inference clients matter in 2026 even for API-first developers, because they make "let me try the open-weights equivalent" cheap. LM Studio (2024) is the dominant local Mac / Windows GUI, with a one-click MLX backend on Apple Silicon and an OpenAI-compatible local server. Ollama is the terminal-friendly equivalent. Msty (2024-25) is the emerging multi-model desktop client with side-by-side chat and provider routing built in. The pattern in 2026 is to ship API-first but keep a local fallback path for development and offline demos.

14.1.4 The platforms compared

Table 14.1.1a: 16.1.1 2026 LLM API platforms at a glance.
PlatformTierApprox pricing (input/output per 1M, frontier tier)Latency profileModel coverageEnterprise features
OpenAIFirst-party$5-$15 / $20-$60P50 ~400-800 ms first-tokenOpenAI onlySOC2, ZDR, Enterprise tier
AnthropicFirst-party$3-$15 / $15-$75P50 ~500-900 ms; aggressive prompt cacheClaude onlySOC2, HIPAA add-on, ZDR
Google AI StudioFirst-party$1.25-$10 / $5-$30P50 ~600-1000 ms; native multimodalGemini onlyVertex tier for enterprise
AWS BedrockReseller10-20% markup over direct+50-150 ms vs direct providerAnthropic, Meta, Cohere, Amazon, MistralVPC, IAM, full audit, SOC2/HIPAA/FedRAMP
Azure OpenAIResellerSame as OpenAI directSimilar to direct; PTU for guaranteedOpenAI catalog (delayed)Azure compliance suite, private endpoints
Together AIOpen-weights host$0.20-$1.20 / $0.20-$1.20 (Llama-class)P50 ~200-400 msHundreds of open-weightsSOC2, dedicated endpoints
ReplicateMulti-modal hostPer-second compute pricingVariable; cold starts possibleLLMs + image + audio + videoCustom deployments
Fireworks AIOpen-weights host$0.15-$1.00 / $0.15-$1.00P50 ~150-300 ms (industry-leading)Curated open-weights + customFine-tuning, dedicated endpoints, HIPAA
Tip: route through an abstraction layer

In 2026, hard-coding to one provider's SDK is a 2022 mistake. Use LiteLLM, OpenRouter, or the OpenAI-compatible endpoint everyone now ships. The cost of staying provider-agnostic is small (an extra HTTP hop); the value when a provider has an outage, deprecates a model, or quietly slows down is enormous.

Warning: pricing tiers move; latency budgets do not

Every number in the table above is roughly correct as of late 2025 and will be wrong by the time you read this. Pricing for frontier models drops by 20-50% year-over-year and the latency story shifts with each new generation of inference hardware. What stays stable: your application's latency budget. If your end-to-end UX requires sub-300 ms first-token latency, you have ~3-5 options regardless of pricing. If your UX is batch-async (overnight report generation), pricing matters and latency does not. Pick the dimension your product actually depends on.

Key Insight: Two Tiers, Three Reasons to Choose

Two tiers, three reasons to choose. Tier 1 (first-party providers: OpenAI, Anthropic, Google) wins on model quality and feature freshness. Tier 2 (hyperscaler resellers: Bedrock, Azure OpenAI) wins on compliance and integration. Tier 3 (independent inference hosts: Together, Replicate, Fireworks) wins on open-weights access and unit cost. Pick by which dimension your product is most constrained on, and always route through an abstraction layer.

14.1.5 Default recommendations

For Part III's exercises: start with the free tiers of OpenAI, Anthropic, and Google AI Studio in parallel, hidden behind LiteLLM. Switch to AWS Bedrock or Azure OpenAI only when compliance forces it. Reach for Together, Replicate, or Fireworks when an open-weights model is the right tool. Never call only one provider in production code; the abstraction layer is mandatory.

Section 14.2 next turns from the inference endpoint to the client library that calls it: SDKs, agent frameworks, prompt-management tools, and the observability surface that makes any of this debuggable in production.

API Keys and Secrets Management

Almost every LLM workflow needs a credential: a Hugging Face token for gated models, an OpenAI or Anthropic key for inference, a Weights & Biases token for experiments, or a cloud-provider key for object storage. Where you put them determines whether you become a data-leak news story. This is the canonical home for "where do I put my keys?" and "how do I keep them out of Git?"

The threat model is not theoretical: GitGuardian's 2024 "State of Secrets Sprawl" counted 23+ million secrets exposed in public GitHub commits in 2023 alone, with OpenAI, Anthropic, and Hugging Face API keys in the top-ten leaked credential types. The fix is straightforward: never commit a secret, pick the right backend for the context, rotate when something slips, use IAM roles when possible. The Git-side discipline that catches leaks after they happen is in Section 5.2 (Libraries & Frameworks); this is the prevention counterpart.

Warning: Never Commit a Key

The rule that prevents 99% of credential leaks: every secret lives in an environment variable, a keyring entry, or a managed secret store. Never a hard-coded string in a Python file, notebook output, or Git-tracked config. If a key ends up in Git history, rotate it immediately: rewriting history does not erase the value from anyone who already cloned, from forks, or from bots that scrape new commits within minutes.

1. The .env File and python-dotenv

The default starting point for local development: a .env file plus python-dotenv. Keep keys in .env, add .env to .gitignore, load at process start.

# .env  (NEVER commit this; add to .gitignore)
OPENAI_API_KEY=sk-proj-abc123...
ANTHROPIC_API_KEY=sk-ant-...
HF_TOKEN=hf_...
WANDB_API_KEY=...
Code Fragment 14.1.2: The default starting point for local development: a .env file plus python-dotenv.
# app.py  (this is fine to commit)
import os
from dotenv import load_dotenv

# Load .env into os.environ; safe to call multiple times.
# By default looks for .env in cwd and walks upward.
load_dotenv()

# Then use the standard environment-variable pattern; most SDKs
# auto-pick these up without explicit configuration.
from openai import OpenAI
from anthropic import Anthropic

openai_client = OpenAI()      # reads OPENAI_API_KEY
anthropic_client = Anthropic() # reads ANTHROPIC_API_KEY

# For HuggingFace, the standard env var is HF_TOKEN (also HUGGING_FACE_HUB_TOKEN);
# huggingface_hub picks it up automatically:
from huggingface_hub import login
# login() with no args uses the env var
Code Fragment 14.1.3: Load .env into os.environ; safe to call multiple times.

Pair every .env with a committed .env.example that documents variables without values so new contributors know what to fill in. Add .env, .env.local, and *.env to .gitignore on day one of every repo.

Library Shortcut

Most LLM SDKs in 2025-2026 auto-detect their standard environment variables: OPENAI_API_KEY, ANTHROPIC_API_KEY, OPENROUTER_API_KEY, HF_TOKEN, COHERE_API_KEY, GOOGLE_API_KEY, WANDB_API_KEY. You almost never need to pass the key explicitly in code; load_dotenv() followed by client construction is enough.

Show code
# Standard auto-detect pattern: load .env, then construct clients with no args.
from dotenv import load_dotenv
from openai import OpenAI
from anthropic import Anthropic

load_dotenv()                       # reads .env into os.environ
openai_client = OpenAI()            # picks up OPENAI_API_KEY
anthropic_client = Anthropic()      # picks up ANTHROPIC_API_KEY

resp = openai_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "ping"}],
)
print(resp.choices[0].message.content)
Code Fragment 14.1.1b: Standard auto-detect pattern: load .env, then construct clients with no args.

2. OS Keyrings: keyring for Personal Machines

.env is plaintext on disk. The personal-machine upgrade: the OS-native credential store via the keyring Python library, which uses macOS Keychain, Windows Credential Manager, or Linux Secret Service. Keys are encrypted at rest by the OS, unlocked by your session, and never sit in plaintext.

import keyring

# Set once (e.g. from an interactive shell or a setup script)
# keyring.set_password("openai", "default", "sk-proj-abc123...")

# Read at runtime
api_key = keyring.get_password("openai", "default")

import os
os.environ["OPENAI_API_KEY"] = api_key  # then standard SDK auto-pickup works
Code Fragment 14.1.4: .env is plaintext on disk.

The keyring CLI lets you set values without shell history (keyring set openai default prompts). Trade-off: entries do not travel with the repo, so CI and new machines need their own setup.

3. Team and Production: Managed Secret Stores

For teams and production, a managed secret store replaces .env. Five options dominate.

Table 14.1.2: Managed Secret Stores Comparison (as of 2026).
ToolHostingBest ForPricing ModelRotation Support
HashiCorp VaultSelf-host (OSS) or HCP managedPolyglot teams, on-prem, hybrid cloudFree OSS; HCP managed by namespaceNative dynamic secrets
AWS Secrets ManagerAWS-managedAWS-native workloadsPer secret per month + API callsNative scheduled rotation (Lambda)
GCP Secret ManagerGCP-managedGCP-native workloadsPer active secret version + accessVersioning + manual / external rotation
Azure Key VaultAzure-managedAzure-native workloadsPer transaction + premium tierNative rotation policies (2024+)
DopplerSaaSPolyglot startups; cross-env syncPer seat / per workspaceIntegration-based rotation

HashiCorp Vault is most flexible and the standard for self-hosting, polyglot teams, or multi-cloud. Vault's dynamic-secrets feature issues short-lived credentials on demand, the strongest defense against exfiltration. AWS Secrets Manager, GCP Secret Manager, and Azure Key Vault are cloud-native choices with first-class IAM, audit, and rotation. Doppler is the SaaS option startups pick because it syncs secrets across .env, GitHub Actions, Vercel, AWS, and similar targets.

4. IAM Roles: The Cloud Workload Pattern

For workloads on a cloud you control (EC2, ECS, Fargate, GKE, AKS, Cloud Run), the best secret-management answer is no secrets at all. Attach an IAM role to the instance; the SDK fetches short-lived credentials via the metadata service. The provider rotates automatically (typically every few hours); your code never sees a long-lived key.

import boto3

# No access key needed; boto3 picks up the IAM role attached to the
# EC2 instance, ECS task, or Lambda function via the metadata service.
s3 = boto3.client("s3")
response = s3.list_buckets()

# Same pattern works for cross-service: an IAM role can have permission
# to read secrets from Secrets Manager, so you fetch an OpenAI key
# without ever embedding an AWS access key.
secrets = boto3.client("secretsmanager")
openai_key = secrets.get_secret_value(SecretId="openai/prod")["SecretString"]
os.environ["OPENAI_API_KEY"] = openai_key
Code Fragment 14.1.5: For workloads on a cloud you control (EC2, ECS, Fargate, GKE, AKS, Cloud Run).

GCP equivalent: Workload Identity (GKE) or default service accounts (Cloud Run, Cloud Functions). Azure equivalent: Managed Identity. Same pattern: bind workload to identity, grant access to needed secrets, never embed a long-lived credential. This is the 2024-2026 production consensus and the explicit recommendation in every cloud's security guidance.

Key Insight

Quality hierarchy, worst to best: hard-coded, .env tracked in Git, .env in .gitignore, OS keyring, managed secret store, IAM role with no long-lived secret. Each step removes one leak class. Personal projects stop at .env; cloud services should reach IAM roles. Production with managed store and no long-lived keys is the gold standard.

5. CI/CD: Repository Secrets

For CI/CD (GitHub Actions, GitLab CI, CircleCI), use the platform's native repository secrets. They are encrypted at rest, decrypted into env vars at job runtime, and redacted from logs by default. The GitHub Actions pattern:

# .github/workflows/eval.yml
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
      - run: pip install -r requirements.txt
      - run: python run_eval.py
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          HF_TOKEN: ${{ secrets.HF_TOKEN }}
Code Fragment 14.1.6: For CI/CD (GitHub Actions, GitLab CI, CircleCI), use the platform's native repository secrets.

Never echo a secret (redaction is best-effort), never pass one as a CLI arg (it shows in process listings), prefer env vars over files. For OIDC-supporting platforms, GitHub Actions can assume an AWS or GCP IAM role directly, eliminating long-lived cloud keys from the secrets list entirely.

Practical Example
OIDC role assumption replaces long-lived keys

A team's nightly evaluation job needed AWS access to read a calibration dataset from S3. The legacy setup stored an AWS_ACCESS_KEY_ID as a GitHub repo secret, which rotated quarterly via a Slack reminder that was missed twice in 2024. The fix took 30 minutes: create an IAM role with a trust policy that names the repo's OIDC issuer (https://token.actions.githubusercontent.com), drop the secrets, and replace the env block with permissions: id-token: write plus the aws-actions/configure-aws-credentials@v4 action. The job now exchanges a short-lived OIDC token for a 1-hour AWS session at every run. Total long-lived secrets removed: 4. Incidents avoided since: 0 (which was the whole point).

6. Rotation: It Has to Be Routine

Rotation is what teams skip until they have an incident. Defensible policy: production keys rotate quarterly (90 days), staging twice yearly, personal keys on developer departure or annually. Managed stores support scheduled rotation (AWS via Lambda, GCP via external scheduler, Vault via dynamic secrets). Hugging Face, OpenAI, and Anthropic require manual rotation: generate new, update store, verify, revoke old.

7. Leak Detection: Pre-Commit and Continuous Scanning

The complement to good hygiene is automated detection. Three open-source tools dominate: detect-secrets (Yelp), gitleaks (Zach Rice), trufflehog (Truffle Security). All three run as pre-commit hooks and in CI.

# .pre-commit-config.yaml
repos:
  - repo: https://github.com/Yelp/detect-secrets
    rev: v1.5.0
    hooks:
      - id: detect-secrets
        args: [--baseline, .secrets.baseline]
  - repo: https://github.com/gitleaks/gitleaks
    rev: v8.21.0
    hooks:
      - id: gitleaks
Code Fragment 14.1.7: The complement to good hygiene is automated detection.

Run detect-secrets scan > .secrets.baseline once; subsequent commits fail on new secrets not in the baseline. The 2024-2025 GitHub-native answer is Push Protection, which scans on every push and blocks commits containing recognized secrets across hundreds of providers; enable on every repo with credentials.

Postmortem: Notebooks Are a Leak Vector

The most common 2024-2025 OpenAI / Hugging Face leak in publicly-disclosed incidents was not .env committed by mistake; it was a Jupyter notebook with os.environ["OPENAI_API_KEY"] = "sk-..." hard-coded in a cell. Notebooks save source and outputs; even if you clear the cell, the key may live in the JSON. Mitigations: (1) never hard-code in a notebook; load from .env; (2) use nbstripout as a pre-commit hook; (3) enable GitHub Push Protection. Cost of cleanup: hours. Cost of prevention: one pre-commit hook.

See Also
Key Takeaways
  • The hierarchy: hard-coded (worst), .env in .gitignore (prototype default), OS keyring (personal), managed secret store (teams), IAM role with no long-lived secret (gold standard).
  • For local development, .env + python-dotenv + .env.example. Most LLM SDKs auto-detect standard env vars, so load_dotenv() at process start is usually all the code you write.
  • For team production, pick a managed secret store: Vault (self-host), AWS / GCP / Azure (cloud-native), Doppler (SaaS sync). All support scheduled rotation; use it.
  • For cloud workloads, IAM roles eliminate long-lived credentials entirely. Bind workload identity, grant access, never embed long-lived keys in container images.
  • Layer leak detection over hygiene: detect-secrets and gitleaks pre-commit, GitHub Push Protection on every repo, nbstripout for notebooks. Hard-coded notebook keys are the dominant 2024-2025 leak pattern.
  • Rotate routinely: 90 days for production, twice yearly for staging, on-event for personal. The difference between "near miss" and "customer notifications."

What Comes Next

Continue to Section 5.2 (Libraries & Frameworks) to run the smoke test that confirms your CUDA stack, library versions, and credential plumbing are all functional.

Further Reading
GitGuardian (2024). "State of Secrets Sprawl 2024." GitGuardian Annual Report. gitguardian.com/state-of-secrets-sprawl. The 23+ million exposed secrets number; AI / LLM API keys in the top-ten leaked credential types.
HashiCorp (2024). "Vault 1.18: Dynamic Secrets and Workload Identity." HashiCorp Documentation. developer.hashicorp.com/vault/docs. Canonical reference for dynamic secrets and the workload-identity pattern in self-hosted environments.
AWS (2024). "Use IAM Roles Instead of Long-Lived Credentials." AWS Security Best Practices. docs.aws.amazon.com/IAM/best-practices. Official AWS guidance establishing IAM roles as the recommended pattern for cloud workloads.
GitHub (2024). "Secret Scanning and Push Protection." GitHub Documentation. docs.github.com/secret-scanning. 2024 expansion of Push Protection to cover hundreds of providers including OpenAI, Anthropic, Hugging Face.
Yelp (2024). "detect-secrets: An Enterprise Friendly Way of Detecting and Preventing Secrets in Code." github.com/Yelp/detect-secrets. github.com/Yelp/detect-secrets. Pre-commit hook for secret detection with baseline-aware diffs.
Gitleaks (2025). "Gitleaks v8.21: Open-Source Secret Scanning." github.com/gitleaks/gitleaks. github.com/gitleaks/gitleaks. The most widely-adopted open-source alternative to GitGuardian's commercial scanner.
OWASP (2025). "Top 10 LLM Applications: LLM02 - Sensitive Information Disclosure." OWASP GenAI Top 10. genai.owasp.org/llmrisk/llm02. 2025 OWASP guidance treating API keys in prompts and logs as a first-class LLM application risk.