Eval-as-Product: Braintrust, Latitude, Laminar

Section 44.7

"Observability tells you what already broke. Eval-as-product tells you what will break tomorrow if you ship the new prompt today."

EvalEval, Replay-Native AI Agent
Big Picture

Between 2023 and 2026, evaluation moved out of CI and into a new product category. The CI-gated eval pipeline of Section 42.1 is still essential, but it leaves three jobs undone: scoring live production traffic as it happens, replaying historical traffic against new prompts and models, and comparing dozens of prompt-and-model variants in a UI that PMs and domain experts can actually use. A new generation of tools (Braintrust, Latitude, Laminar, LangSmith, Helicone, and others) has emerged to fill that gap, and the category has a name: eval-as-product. This section is about when to adopt one of these platforms, which one fits which team, and what the integration looks like in code. It is also about when not to adopt one: for fully automated, CI-gated evaluation, the platform fees and integration cost rarely pay back. The decision rests on whether human review and replay loops are central to your iteration cycle.

Prerequisites

This section assumes familiarity with the eval-as-CI pattern from Section 42.1 and basic observability concepts (traces, spans, metrics) from Section 44.4. Some familiarity with promptfoo or DeepEval from Section 45.2 helps with the comparison.

44.7.1 From Eval-as-CI to Eval-as-Product

Production Pattern: The Replay Loop

When: any team iterating on prompts for a production LLM application with non-trivial traffic (more than a few hundred user-facing requests per day).

How: sample 200-2000 representative production traces per week, store them in an eval platform's dataset, and re-run them against every prompt/model candidate before merging. Score each replay with the same LLM-as-judge or rule-based scorers, and surface side-by-side diffs in the platform UI.

Watch for: sample-bias drift (production traffic shifts, your dataset stays stale), PII leakage (production traces contain user data, the eval dataset inherits it), and judge-rubric rot (the judge prompt that worked six months ago no longer matches what reviewers actually care about).

Result: prompt regressions are caught in a UI review before they ship, not by users.
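The replay loop can be sketched in plain Python, independent of any platform SDK. Everything here is a stand-in chosen for illustration: `redact`, `sample_traces`, `replay`, the two candidate functions, and the `non_empty` scorer are hypothetical names, and the email-only redaction is far narrower than a real PII pass would need to be.

```python
import random
import re
from typing import Callable

# Matches simple email addresses only; real PII redaction needs a broader pass.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact(text: str) -> str:
    """Mask email addresses before a trace enters the eval dataset."""
    return EMAIL_RE.sub("[EMAIL]", text)


def sample_traces(traces: list[dict], k: int, seed: int = 0) -> list[dict]:
    """Draw up to k traces uniformly at random and redact their inputs."""
    rng = random.Random(seed)
    picked = rng.sample(traces, min(k, len(traces)))
    return [{**t, "input": redact(t["input"])} for t in picked]


def replay(dataset: list[dict],
           candidates: dict[str, Callable[[str], str]],
           scorer: Callable[[str, str], float]) -> dict[str, float]:
    """Re-run every candidate over the dataset and return its mean score."""
    results = {}
    for name, run in candidates.items():
        scores = [scorer(row["input"], run(row["input"])) for row in dataset]
        results[name] = sum(scores) / len(scores) if scores else 0.0
    return results


def non_empty(input_text: str, output: str) -> float:
    """Toy rule-based scorer: 1.0 when the candidate produced any text."""
    return 1.0 if output.strip() else 0.0


traces = [
    {"input": "Refund order 17, email ann@example.com"},
    {"input": "Where is my parcel?"},
    {"input": "Cancel my plan"},
]
# Stand-ins for "prompt + model" candidates; a real one would call an LLM.
candidates = {
    "prompt_v1": lambda text: text.upper(),
    "prompt_v2": lambda text: "",
}

dataset = sample_traces(traces, k=200)
scores = replay(dataset, candidates, non_empty)
print(scores)  # {'prompt_v1': 1.0, 'prompt_v2': 0.0}
```

Redacting at sampling time, rather than at scoring time, keeps raw user data out of the stored dataset, which is the PII-leakage risk named above.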

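The eval-as-CI pattern from Section 42.1 is necessary but incomplete. It runs the same fixed golden set on every PR, blocks merges on metric thresholds, and reports a pass/fail. That works well for regression prevention: it catches cases where a new prompt or model degrades a well-understood behavior. It does not handle the three jobs that emerged as LLM apps got more complex: scoring live production traffic as it happens, replaying historical traffic against new prompts and models, and comparing many prompt-and-model variants in a UI that PMs and domain experts can use.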

Eval-as-product is the category of tools that do those three jobs in one UI. The first generation (Promptfoo, OpenAI Evals) emphasized the static-eval job and had limited replay and live-traffic features. The second generation (Braintrust, Latitude, Laminar, LangSmith, Helicone, plus offerings from the major model providers) treats live-scoring, replay, and variant-comparison as first-class operations alongside static eval.

Figure 44.7.1: Eval-as-CI is a one-shot pass/fail gate; eval-as-product is a continuous experiment-score-compare-iterate loop. The two compose: CI still gates merges, while eval-as-product platforms (Braintrust, Latitude, Laminar) let PMs sweep prompt and model variants in a UI and replay historical traffic against each candidate.
Note: Eval-as-Product Is Not the Same as Observability

It is easy to confuse eval-as-product with LLM observability platforms (Langfuse, Arize Phoenix, Honeycomb's LLM features). Observability is read-only: it captures traces, lets you query them, and computes aggregate metrics. Eval-as-product adds the active components: you can replay a captured trace against a new prompt, score the historical run against new judges, and A/B compare variants. The line is blurry, since modern platforms increasingly do both, but the core distinction is read-only vs replay-capable.

44.7.2 The Eval-First Workflow: Experiment, Score, Compare, Iterate

The workflow that eval-as-product platforms optimize for has four steps, and the shape of the workflow is what distinguishes good platforms from bad ones.

Step 1: Experiment. Define a hypothesis (a new prompt, a new model, a new tool, a new retrieval strategy), and a dataset (production traces, a curated golden set, a synthetic eval set). The platform runs the hypothesis against the dataset and produces a result table: one row per dataset item, one column per metric.

Step 2: Score. Each row's output is scored along multiple axes: rule-based metrics (exact-match, JSON validity), LLM-as-judge scores (helpfulness, faithfulness), and custom metrics (latency, cost, token count). The good platforms make adding a custom metric a 5-minute job: write a function that takes (input, output, expected) and returns a score, register it as a scorer.

Step 3: Compare. The platform diffs your experiment against a baseline (the production prompt, the last shipped version). Per-row diffs highlight which items improved, which regressed, and by how much. Aggregate diffs show overall direction. This is the step where most teams catch the cases that CI gates miss: a 0.5% aggregate improvement that masks a 20% regression on one specific user segment.

Step 4: Iterate. Based on the comparison, refine the prompt, retry, score, compare. The fastest platforms make this loop tight (sub-minute for small datasets). The slow ones make it slow enough that teams stop using them after a week.

The crucial insight is that this is the same loop that ML researchers run for fine-tuning experiments, but with a UI tailored for prompt-engineering iteration cycles instead of training-run cycles. The platforms that have succeeded in 2025-2026 are the ones that nailed the UI ergonomics, the ones that treated this as an ML-platform problem in disguise.
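Steps 2 and 3 can be sketched without a platform. The block below is a minimal illustration, not any vendor's API: `json_validity` follows the (input, output, expected) scorer shape described in Step 2, and `compare` is a hypothetical helper that reports the aggregate delta next to per-segment deltas. The row fields (`id`, `segment`, `score`) and the numbers are assumptions made up for the example.

```python
import json
from collections import defaultdict


def json_validity(input, output, expected=None) -> float:
    """Rule-based scorer: 1.0 if the output parses as JSON, else 0.0."""
    try:
        json.loads(output)
        return 1.0
    except (TypeError, ValueError):
        return 0.0


def mean(values):
    return sum(values) / len(values)


def compare(baseline, candidate, key="segment"):
    """Return (aggregate_delta, per_segment_delta) for rows matched by id."""
    base_by_id = {row["id"]: row for row in baseline}
    deltas = defaultdict(list)
    for row in candidate:
        base = base_by_id.get(row["id"])
        if base is None:
            continue  # row missing from the baseline: nothing to diff against
        deltas[row[key]].append(row["score"] - base["score"])
    all_deltas = [d for ds in deltas.values() for d in ds]
    return mean(all_deltas), {seg: mean(ds) for seg, ds in deltas.items()}


# Illustrative data: 8 "general" rows improve slightly, 2 "billing" rows regress.
baseline = (
    [{"id": i, "segment": "general", "score": 0.5} for i in range(8)]
    + [{"id": i, "segment": "billing", "score": 0.9} for i in range(8, 10)]
)
candidate = (
    [{"id": i, "segment": "general", "score": 0.6} for i in range(8)]
    + [{"id": i, "segment": "billing", "score": 0.7} for i in range(8, 10)]
)

aggregate, by_segment = compare(baseline, candidate)
print(round(aggregate, 3))
print({seg: round(delta, 3) for seg, delta in by_segment.items()})
```

In this toy data the aggregate delta is positive while the billing segment regresses, which is the pattern a single pass/fail threshold on the aggregate would not surface.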

44.7.3 Braintrust: The Polished Default

Braintrust (braintrust.dev) is the most polished of the three platforms covered here, and it has captured a large share of the venture-backed AI-product market. Its design choices reflect that audience: rich playground for prompt iteration, deep integration with OpenAI and Anthropic SDKs, automatic regression detection against the previous experiment, and a Python/TypeScript SDK that maps cleanly onto existing eval patterns.

The core abstractions are datasets (collections of input/expected-output pairs), experiments (a run of your task against a dataset), and scorers (functions that judge each row of an experiment). The Braintrust scorer library includes pre-built scorers for many common metrics: Factuality (LLM-as-judge for factual accuracy), Faithfulness (RAG-style faithfulness scoring), AnswerRelevancy, JSONDiff (structural diff for JSON outputs), Levenshtein, NumericDiff, Sql (SQL-equivalence checking), and others. You can add custom scorers by writing a Python or TypeScript function.
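To make the three abstractions concrete, here is a platform-independent sketch in plain Python. It does not use the Braintrust SDK: `run_experiment`, `levenshtein_score`, and the tiny dataset are hypothetical, and the normalized edit-distance scorer is only an assumption about what a Levenshtein-style scorer typically computes.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def levenshtein_score(output: str, expected: str) -> float:
    """Scorer: 1.0 for identical strings, falling toward 0.0 as they diverge."""
    longest = max(len(output), len(expected))
    return 1.0 if longest == 0 else 1 - levenshtein(output, expected) / longest


def exact_match(output: str, expected: str) -> float:
    return 1.0 if output == expected else 0.0


def run_experiment(dataset, task, scorers):
    """Run the task on each dataset row; return one result row per item."""
    table = []
    for row in dataset:
        output = task(row["input"])
        scores = {name: fn(output, row["expected"]) for name, fn in scorers.items()}
        table.append({"input": row["input"], "output": output, **scores})
    return table


# Dataset: input/expected pairs. Task: a stand-in for the LLM call.
dataset = [
    {"input": "paris", "expected": "PARIS"},
    {"input": "rome", "expected": "ROMA"},
]
table = run_experiment(
    dataset,
    task=str.upper,
    scorers={"exact": exact_match, "levenshtein": levenshtein_score},
)
for row in table:
    print(row)
```

The result has one row per dataset item and one column per scorer, which is the table shape the experiment view presents.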

Strengths:

Weak spots:

44.7.4 Latitude: The Open-Source Alternative

Latitude (latitude.so) takes the opposite design stance: open-source under an MIT-style license, self-hostable, prompt-versioning as the central abstraction. The hosted SaaS version exists, but Latitude's pitch is the on-premises option. For regulated industries (finance, healthcare, defense) and for teams that already run Postgres-and-Docker infrastructure, the self-hostable property dominates other considerations.

Latitude's primary abstraction is a prompt, with versioning, parameters, and metadata as first-class properties. Datasets and experiments build on top of prompts. The LLM-as-judge functionality is built in: you write a judge prompt in the same DSL as a regular prompt, and Latitude scores experiments using it.

Strengths:

Weak spots:

44.7.5 Laminar: Observability Plus Eval, Span-Based

Laminar (lmnr.ai) sits in a third position: it merges observability and eval into a single span-based system, treating every LLM call (and every tool call, every retrieval, every span in your trace) as both an observability event and a potential eval target. The unit of work is the span, and you can attach scorers to spans the same way you attach metrics to traces.

Laminar's design choices favor agent-style applications where the trace is the primary object of analysis: long traces, many tool calls, multi-turn conversations, hierarchical span structure. For teams whose evaluation question is "did this 30-step agent trajectory succeed?" rather than "did this single LLM call produce the right answer?", Laminar's span-centric model maps more cleanly.
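The span-centric idea can be illustrated with a generic data structure. This is not Laminar's SDK or schema: the `Span` dataclass, its `kind` and `ok` fields, and `trajectory_score` are hypothetical, chosen only to show how a scorer can walk a hierarchical trace and judge the trajectory as a whole.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    kind: str                      # e.g. "agent", "llm", "tool", "retrieval"
    ok: bool = True                # whether this step completed without error
    children: list["Span"] = field(default_factory=list)


def walk(span: Span):
    """Yield the span and all of its descendants in pre-order."""
    yield span
    for child in span.children:
        yield from walk(child)


def trajectory_score(root: Span) -> dict:
    """Score a whole trajectory instead of a single LLM call."""
    spans = list(walk(root))
    tools = [s for s in spans if s.kind == "tool"]
    return {
        "steps": len(spans),
        "tool_success_rate": sum(s.ok for s in tools) / len(tools) if tools else 1.0,
        "failed_spans": [s.name for s in spans if not s.ok],
        "succeeded": root.ok,
    }


root = Span("agent_run", "agent", ok=False, children=[
    Span("plan", "llm"),
    Span("search_docs", "tool"),
    Span("call_api", "tool", ok=False),
    Span("answer", "llm", children=[Span("lookup", "retrieval")]),
])
report = trajectory_score(root)
print(report)
```

The same walk could attach a scorer to any individual span, for example only retrieval spans, which is the flexibility the span-as-unit model is meant to give.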

Strengths:

Weak spots:

Comparison table

The decision matrix below summarizes the three platforms (and includes LangSmith and OpenAI Evals as anchors) across the dimensions that typically drive selection.

Property | Braintrust | Latitude | Laminar | LangSmith | OpenAI Evals
Self-hostable | No | Yes (MIT) | Partial (open-core) | Enterprise only | Yes (Apache-2)
Free tier | Generous | Cloud free; OSS free | Free up to N spans | Limited | N/A (self-run)
SOC-2 | Type II | Type II (cloud) | In progress | Type II | N/A
Python SDK | Mature | Good | Good | Mature | Basic
TypeScript SDK | Mature | Good | Good | Mature | Community
Pre-built scorers | ~25 | ~10 | ~15 | ~12 | ~10
Multi-judge support | Yes | Yes | Yes | Yes | Manual
Live-traffic scoring | Yes | Yes | Yes (span-based) | Yes | No
Replay UX | Best-in-class | Good | Span-based | Good | CLI only
Best for | Polished default | Regulated / OSS | Agents / spans | LangChain stacks | Static CI eval
Warning: Comparison Table Half-Life

This category is moving fast. The features, pricing, and SOC-2 status of each platform have shifted at least once per quarter through 2024-2026. Treat the table above as a starting point for a 2026 selection, not a permanent reference. Verify the specifics from each vendor's docs before committing to a contract; a feature that wasn't available six months ago may be GA by the time you read this section.

44.7.6 Wiring Braintrust Into a Production Agent Loop

The simplest useful Braintrust integration: instrument your production agent to log every run to Braintrust, then run a nightly job that scores the previous day's runs and surfaces the results in the Braintrust UI.

# Input: prod traces logged to Braintrust via braintrust.wrap_openai()
# Output: nightly scored experiment in the Braintrust UI
import braintrust
from braintrust import init_dataset, Eval
from autoevals import Factuality, AnswerRelevancy, JSONDiff
from openai import OpenAI

# 1) at app startup, initialize a logger for the project, then wrap OpenAI
#    so every prod call is auto-logged
braintrust.init_logger(project="prod-agent")
client = braintrust.wrap_openai(OpenAI())      # traces -> project "prod-agent"

# 2) nightly: pull yesterday's traces as a dataset, score against scorers
def nightly_score():
    # NOTE: the `filter` argument is shown for illustration; confirm in the SDK
    # reference how your version selects records by creation time.
    dataset = init_dataset(project="prod-agent", name="yesterday",
                           filter="created > now() - interval '1 day'")
    Eval(
        name="prod-agent-nightly",
        data=lambda: dataset,
        # replay-mode: return the stored output instead of calling the model;
        # assumes the value passed to the task carries the logged "output"
        task=lambda row: row["output"],
        scores=[Factuality(), AnswerRelevancy(), JSONDiff()],
    )

if __name__ == "__main__":
    nightly_score()                                # cron: 0 3 * * *
Output: Eval prod-agent-nightly created. View: https://www.braintrust.dev/app/.../experiments/prod-agent-nightly. 1284 rows scored. Factuality mean: 4.21/5. AnswerRelevancy mean: 0.87. JSONDiff exact-match: 96.3%.
Code Fragment 44.7.1a: A nightly Braintrust scoring job for a production agent. The wrap_openai() call at app startup captures every LLM call without code changes; the nightly cron rescores the captured traces against your scorer library.

The same pattern works in TypeScript, in Latitude (with a different SDK shape), and in Laminar (with a span-filter instead of a trace-filter). The interesting part is not the SDK but the workflow: scoring happens after the fact, on real traffic, with the latest version of your scorer library. If you change a scorer's prompt, you can rescore the last month of traffic with one command and see whether the new scorer agrees with the old one on the same data.
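Checking whether a revised scorer agrees with its predecessor reduces to a small computation once both have scored the same rows. The sketch below is platform-independent and entirely illustrative: `agreement`, the two toy scorers, the 0.5 pass threshold, and the row fields are assumptions, not part of any SDK.

```python
def agreement(rows, old_scorer, new_scorer, threshold=0.5):
    """Return (agreement_rate, ids where the pass/fail verdict differs)."""
    disagreements = []
    for row in rows:
        old_pass = old_scorer(row["input"], row["output"]) >= threshold
        new_pass = new_scorer(row["input"], row["output"]) >= threshold
        if old_pass != new_pass:
            disagreements.append(row["id"])
    rate = 1 - len(disagreements) / len(rows) if rows else 1.0
    return rate, disagreements


# Toy scorers standing in for two versions of a judge.
def old_scorer(input_text, output):
    return 1.0 if output.strip() else 0.0          # passes any non-empty answer


def new_scorer(input_text, output):
    return 1.0 if output.endswith(".") else 0.0    # requires a full sentence


rows = [
    {"id": 1, "input": "Capital of France?", "output": "Paris."},
    {"id": 2, "input": "Capital of Italy?", "output": "Rome"},
    {"id": 3, "input": "Capital of Spain?", "output": ""},
    {"id": 4, "input": "Capital of Germany?", "output": "Berlin."},
]
rate, changed = agreement(rows, old_scorer, new_scorer)
print(rate, changed)  # 0.75 [2]
```

The disagreement list is usually more useful than the rate: those are the rows a reviewer should read before trusting the new scorer.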

44.7.7 When Not to Use Eval-as-Product

Eval-as-product is not a universal good. The platforms have real costs (subscription fees, integration effort, vendor lock-in), and there are workflows where those costs do not pay back.

Fully automated, CI-gated eval. If your eval is 100% rule-based or unit-test-style (exact-match against expected outputs, JSON-schema validation, latency thresholds), and there is no human review loop, then a simple promptfoo or pytest harness running in CI is sufficient. The eval-as-product features (replay UX, side-by-side comparison, live scoring) add little value to a workflow that does not use them.

Tiny traffic, slow iteration. A team shipping a low-volume internal tool that updates its prompts once a quarter does not need a replay loop. The fixed cost of adopting a platform exceeds the savings.

Regulated environments without self-hosting. If you cannot send production data to a third-party SaaS, Braintrust and Laminar's hosted tiers are off the table. Latitude's self-host option or a custom-built solution becomes the relevant choice.

When your eval workflow is fundamentally upstream of LLMs. If your team's evaluation is mostly about classical ML (a tabular classifier, an embedding retriever, a search-quality A/B test), MLflow or Weights & Biases is a better fit. Eval-as-product platforms are optimized for LLM-shaped evaluation; using them for non-LLM workloads is forcing a square peg into a round hole.

Postmortem: Adopting Eval-as-Product Too Early

An early-stage startup integrated Braintrust during their first three weeks, before they had any production traffic. They built scorer rubrics, set up nightly jobs, and configured dashboards. Over the next six weeks, the eval setup consumed about 30% of one engineer's time, and the dashboards showed nothing useful because there was no traffic. When traffic eventually arrived, the prompts had changed enough that the original scorers no longer matched the task. Lesson: eval-as-product platforms shine when there is enough live traffic to replay and enough prompt iteration to compare. Before that, a simple pytest harness is fine, and a paid platform is premature. The right time to adopt is usually somewhere between the first 1,000 production requests per week and the first major prompt-revision cycle.

44.7.8 The Platform vs. Build Decision

The recurring question: should we adopt a platform or build the equivalent in-house? For most teams in 2026, the answer is platform. The integration effort is hours to days, not weeks; the feature surface (replay, side-by-side, scorer library, live scoring, multi-judge orchestration) is hard to replicate quickly; and the platforms genuinely move faster than internal eval teams can.

The cases where building in-house pays back are narrow but real:

For everyone else, the platform-or-build calculus comes out heavily in favor of adopting one of the three platforms covered here, ideally after running a 2-week trial with at least two of them on your real data.

Research Frontier: Where Eval-as-Product Is Heading

Three trends to watch in 2026-2027:

Explore Further: Run a side-by-side trial of Braintrust and Latitude on the same 200 production traces. Define three scorers, two automatic and one LLM-as-judge, and run the same experiment in both platforms. Compare ingestion latency, UI ergonomics, and scorer agreement. The result is your team's data-driven choice between the two.

Fun Note: The "Stripe of Evals" Pitch Deck

In late 2024 Braintrust's seed-stage pitch deck reportedly opened with a slide that just read "Stripe of Evals". The line caught fire: by Q2 2025 at least three competitors (Latitude, Laminar, and Helicone's eval module) used variations of "Stripe for X" to describe their own positioning to investors. Bessemer Venture Partners' "State of the Cloud" recap that year had to add a footnote acknowledging that the LLM-eval segment had quietly invented a new sub-genre of pitch-deck cliche. The pattern persists: in 2026, calling a startup "the Stripe of evals" is both a compliment and a recruiter's red flag, depending on which side of the screen you are on.

Key Takeaways

Exercises

Exercise 44.7.1: Platform Selection (Conceptual)

You are choosing an eval platform for a healthcare-AI startup that processes patient-derived text and is subject to HIPAA. Production traffic is ~10,000 LLM calls per day, the team has 3 engineers, and the prompt iteration rate is weekly. Pick a platform, justify the choice, and identify the top three risks of your selection.

Answer Sketch

Latitude (self-hosted) is the strongest fit: HIPAA-compliance on a SaaS requires either a BAA (some platforms offer one, some do not) or self-hosting; the 10K/day traffic and weekly iteration rate justify a real platform over a custom harness; the team size is small enough that the OSS option is manageable. Risks: (1) UI maturity lags Braintrust, slowing the iteration loop; (2) smaller pre-built scorer library means more in-house scorer work; (3) Latitude OSS deployment maintenance is your team's responsibility, including security patches and upgrades. Mitigation: budget for 0.25 FTE on eval-platform maintenance.

Exercise 44.7.2: Replay Bias (Analysis)

You set up a replay loop on production traces, sampling 500 traces per week. After three months, your replay-based regression detection misses an important regression that users complain about. Hypothesize three reasons the replay sample missed it, and propose mitigations.

Answer Sketch

(1) Sampling bias: random sampling under-represents rare failure modes. Mitigation: stratified sampling that oversamples segments where past failures have occurred. (2) Distribution drift: production traffic changed faster than the sample refreshed (e.g., a new user segment arrived). Mitigation: shorten the replay window and add a freshness check that compares the current sample's distribution to the previous sample. (3) Scorer-rubric drift: the regression manifests on a dimension the current scorers do not measure (e.g., tone, brand voice). Mitigation: periodically review user-complaint themes and add scorers for dimensions that complaints surface.
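Mitigation (1) can be sketched as a sampler that guarantees a floor per segment before filling the rest of the quota at random. This is one simple way to oversample rare segments, not the only one; `stratified_sample`, the `segment` field, and the 95/5 traffic split are assumptions made up for the example.

```python
import random
from collections import defaultdict


def stratified_sample(traces, k, min_per_segment, key="segment", seed=0):
    """Take min_per_segment traces from every segment, then fill up to k at random."""
    rng = random.Random(seed)
    by_segment = defaultdict(list)
    for trace in traces:
        by_segment[trace[key]].append(trace)

    picked = []
    for items in by_segment.values():
        # Rare segments contribute everything they have if below the floor.
        picked.extend(rng.sample(items, min(min_per_segment, len(items))))

    picked_ids = {id(t) for t in picked}
    rest = [t for t in traces if id(t) not in picked_ids]
    remaining = max(0, k - len(picked))
    picked.extend(rng.sample(rest, min(remaining, len(rest))))
    return picked


# Illustrative traffic: refunds are 5% of traces but a known failure hotspot.
traces = (
    [{"id": i, "segment": "general"} for i in range(95)]
    + [{"id": i, "segment": "refund"} for i in range(95, 100)]
)
sample = stratified_sample(traces, k=20, min_per_segment=5)
refunds = sum(t["segment"] == "refund" for t in sample)
print(len(sample), refunds)  # 20 5
```

A uniform sample of 20 from this traffic would contain about one refund trace on average; the floor makes the rare segment large enough to show a regression.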

Exercise 44.7.3: Side-by-Side Diff Workflow (Coding)

Implement a script that takes two experiment IDs from Braintrust (or Latitude), pulls the per-row scores from each, and reports the rows where experiment B regressed more than 20% relative to experiment A on any scorer. Sort the regressions by magnitude.

Answer Sketch

Use the platform's Python SDK to fetch both experiments' rows. Match rows by input ID (each row has a unique input identifier). For each matched pair, compute the per-scorer delta (B - A) and identify rows where any scorer's relative delta is below -20%. Sort by absolute relative delta, descending. Print a table of (input_id, scorer_name, score_A, score_B, relative_delta). Optionally, fetch the input text and outputs for the top 10 regressions and write them to a CSV for manual review.
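A minimal version of the comparison logic, with the SDK fetch left out: the rows below are hand-written stand-ins for what a platform export might return, and the `input_id` and `scores` field names are assumptions. Rows whose baseline score is missing or zero are skipped because a relative delta is undefined there.

```python
def find_regressions(rows_a, rows_b, threshold=0.20):
    """Return (input_id, scorer, score_a, score_b, relative_delta) tuples where
    experiment B dropped by more than `threshold` relative to experiment A."""
    scores_a_by_id = {row["input_id"]: row["scores"] for row in rows_a}
    regressions = []
    for row in rows_b:
        scores_a = scores_a_by_id.get(row["input_id"])
        if scores_a is None:
            continue  # no matching row in experiment A
        for scorer, score_b in row["scores"].items():
            score_a = scores_a.get(scorer)
            if not score_a:
                continue  # missing or zero baseline: relative delta undefined
            relative = (score_b - score_a) / score_a
            if relative < -threshold:
                regressions.append((row["input_id"], scorer, score_a, score_b, relative))
    regressions.sort(key=lambda r: r[4])  # largest drop first
    return regressions


rows_a = [
    {"input_id": "t1", "scores": {"factuality": 0.8, "relevancy": 0.9}},
    {"input_id": "t2", "scores": {"factuality": 1.0}},
]
rows_b = [
    {"input_id": "t1", "scores": {"factuality": 0.4, "relevancy": 0.85}},
    {"input_id": "t2", "scores": {"factuality": 0.7}},
]

for input_id, scorer, a, b, rel in find_regressions(rows_a, rows_b):
    print(f"{input_id:4} {scorer:12} {a:.2f} -> {b:.2f} ({rel:+.0%})")
```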

Exercise 44.7.4: Multi-Judge Configuration (Analysis)

You want to use three different LLMs as judges (GPT-4o, Claude Opus, Gemini 2.5 Pro) and aggregate their scores. Describe (a) the aggregation rule, (b) how to handle the case where one judge consistently disagrees with the other two, and (c) the cost-benefit trade-off of multi-judge versus single-judge.

Answer Sketch

(a) For categorical judgments (win/lose/tie), use majority vote with ties handled explicitly; for numeric scores, use median (robust to single-judge drift). (b) Track per-judge agreement rate against the consensus; if one judge consistently disagrees, audit its prompt or replace it. Persistent disagreement may indicate genuine ambiguity in the task, not judge failure. (c) Multi-judge typically reduces noise and bias but costs 3x the API spend; the break-even depends on the decision stakes: high-stakes release gates justify multi-judge, day-to-day prompt iteration usually does not.
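Parts (a) and (b) of the sketch translate into a few lines of standard-library Python. The judge names and verdicts below are invented for illustration, and `majority_vote` and `judge_agreement` are hypothetical helpers rather than any platform's API.

```python
from collections import Counter
from statistics import median


def majority_vote(labels, tie="tie"):
    """Most common label; an exact tie for first place is reported explicitly."""
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return tie
    return counts[0][0]


def judge_agreement(verdicts):
    """Fraction of items on which each judge matches the majority verdict."""
    judges = list(verdicts)
    n_items = len(verdicts[judges[0]])
    consensus = [majority_vote([verdicts[j][i] for j in judges])
                 for i in range(n_items)]
    return {
        judge: sum(v == c for v, c in zip(verdicts[judge], consensus)) / n_items
        for judge in judges
    }


# Categorical verdicts from three judges on the same four items.
verdicts = {
    "judge_a": ["win", "win", "lose", "win"],
    "judge_b": ["win", "win", "lose", "lose"],
    "judge_c": ["lose", "lose", "lose", "win"],
}
print(judge_agreement(verdicts))  # {'judge_a': 1.0, 'judge_b': 0.75, 'judge_c': 0.5}

# Numeric scores: the median ignores a single outlying judge.
print(median([3, 4, 9]))  # 4
```

A judge whose agreement rate stays well below the others is the one to audit first, keeping in mind that persistent disagreement can also signal a genuinely ambiguous task.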

Exercise 44.7.5: Eval Platform vs. Custom Build (Coding)

You have a team of two ML engineers and a budget of one engineer-month to set up evaluation infrastructure. List the features you would build in-house first (rank them) and identify the point at which adopting a platform becomes the right move.

Answer Sketch

Priority order: (1) a static eval harness around pytest with a fixed golden set and rule-based scorers; (2) trace logging to Postgres with PII redaction; (3) a simple replay script that runs prompts against the trace store; (4) a basic LLM-as-judge integration for one or two judges. The point at which a platform pays back: when you start needing side-by-side UIs that PMs and domain experts can use, when you have more than three concurrent prompt experiments to track, or when your eval-set curation becomes a multi-engineer effort. In dollar terms, when the engineering time spent on internal eval tooling exceeds the platform's annual fee, switch.

What Comes Next

The next section covers self-hosted eval orchestration: when the off-the-shelf platforms don't fit, how to build a credible in-house eval pipeline using open-source primitives (Postgres for trace storage, Temporal for replay orchestration, Prefect for nightly jobs). Continue to Section 45.1: Platforms.

Further Reading

Platform Documentation

Braintrust Documentation. (2026). "Braintrust Eval Platform Reference." braintrust.dev/docs
Latitude Documentation. (2026). "Latitude: Open-Source Prompt Engineering and Evaluation." docs.latitude.so
Laminar Documentation. (2026). "Laminar: Observability and Evaluation for LLM Agents." docs.lmnr.ai

Comparative Resources

LangSmith Documentation. (2026). "LangSmith Evaluation Guide." docs.smith.langchain.com
OpenAI Evals. (2024-2026). "OpenAI Evals: A Framework for Evaluating LLMs and LLM Systems." github.com/openai/evals

Background Papers and Posts

Zheng, L., Chiang, W.L., Sheng, Y., et al. (2023). "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." arXiv:2306.05685
Hamel Husain. (2024). "Your AI Product Needs Evals." hamel.dev/blog/posts/evals
Eugene Yan. (2024). "Evaluating LLM Applications: A Practical Guide." eugeneyan.com/writing/evals

Observability Platforms

Langfuse Documentation. (2026). "Open-source LLM Engineering Platform." Langfuse. langfuse.com/docs
Arize Phoenix Documentation. (2026). "Phoenix: ML Observability for LLM Applications." Arize AI. docs.arize.com/phoenix
Arize AI Documentation. (2026). "Arize AX: Production LLM Observability and Evaluation." Arize AI. docs.arize.com
Helicone Documentation. (2026). "Helicone: Open-Source LLM Observability and Cost Tracking." Helicone. docs.helicone.ai
Weights & Biases. (2026). "W&B Weave: Lightweight Toolkit for Tracking and Evaluating LLM Applications." Weights & Biases Documentation. weave-docs.wandb.ai

Telemetry Standards and GenAI Semantic Conventions

OpenTelemetry Authors. (2024-2026). "Semantic Conventions for Generative AI Systems." OpenTelemetry Specification. github.com/open-telemetry/semantic-conventions
The CNCF-incubated cross-vendor schema for LLM spans and metrics; the path to vendor-neutral observability and the basis for most platform exporters in 2026.
Traceloop. (2024-2026). "OpenLLMetry: Open-source observability for LLMs, built on OpenTelemetry." OpenLLMetry Specification & SDK. github.com/traceloop/openllmetry
The reference open-source implementation of the OTel GenAI semconv; the bridge that lets you export LangSmith/Langfuse-style traces to any OTel backend.
Sheng, S. (2025). "Monitoring AI in Production: Drift, Degradation, and Production Eval Loops." Personal Essay / Conference Talk. sashasheng.com
Practitioner-focused treatment of LLM production drift detection and the production-eval-loop pattern, drawn from real deployment postmortems.