OpenTelemetry for LLM Applications

Section 42.9

You cannot improve what you cannot observe. And in distributed systems, you cannot observe what you have not instrumented.

EvalEval, Trace-Sniffing AI Agent
Big Picture

OpenTelemetry provides the standardized observability backbone that makes LLM applications debuggable in production. Without structured tracing, a single user request that touches an embedding model, a vector database, multiple LLM calls, and several tool invocations becomes a black box when something goes wrong. The GenAI Semantic Conventions give every team a common vocabulary for LLM telemetry, enabling unified dashboards across providers and frameworks. This section shows how to instrument LLM applications with OpenTelemetry, from auto-instrumentation libraries to custom span attributes, building on the observability foundations from Section 44.4.

Prerequisites

This section builds on the observability foundations from Section 44.4 and the monitoring patterns in Section 44.6. Familiarity with LLM API calls and tool use patterns is recommended.

42.9.1 Why OpenTelemetry for LLM Systems

OpenTelemetry (OTel) has become the industry standard for distributed tracing, metrics, and logging across microservices. LLM applications pose unique observability challenges that generic application monitoring tools cannot address. A single user request may traverse an embedding model, a vector database, a retrieval pipeline, one or more LLM calls, and several tool invocations. Each step adds latency, cost, and potential failure points. Without structured tracing that captures LLM-specific metadata (token counts, model identifiers, prompt templates, temperature settings), debugging production issues becomes guesswork.

The OpenTelemetry Semantic Conventions for Generative AI (formerly LLM Semantic Conventions) define a standardized set of span attributes for model calls. These conventions ensure that traces from different LLM providers, frameworks, and custom code share a common vocabulary. Attributes like gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, and gen_ai.usage.output_tokens allow you to build unified dashboards regardless of whether you call OpenAI, Anthropic, or a self-hosted model. The deployment patterns in Section 70.5 benefit directly from this standardized telemetry.

The key benefit of OTel over proprietary tracing solutions (LangSmith, Langfuse) is vendor neutrality. OTel traces can be exported to any compatible backend: Jaeger, Grafana Tempo, Datadog, Honeycomb, or your own collector. This avoids lock-in and allows teams to integrate LLM observability into their existing monitoring stack rather than maintaining a separate system. That said, proprietary tools covered in Section 44.4 often provide richer LLM-specific UIs and are easier to set up for teams without existing observability infrastructure.

OpenTelemetry architecture for LLM applications: app SDK auto-instrumentors emit spans to a collector that fans out to multiple backends
Figure 42.9.1: The OpenTelemetry architecture for LLM systems. Auto-instrumentors emit GenAI-convention spans; an OTel collector fans them out to whichever backends the team chooses. Switching from Phoenix to Datadog (or running both) requires no SDK changes.
Key Insight

OpenTelemetry is not a replacement for LLM-specific observability platforms like LangSmith or Langfuse. It is the transport and data model layer that feeds into those platforms (or any other backend). Many LLM observability tools now accept OTel data natively, meaning you instrument once with OTel and send traces to multiple destinations. The choice is not "OTel or LangSmith" but rather "OTel as the instrumentation standard, with LangSmith (or Langfuse, or Datadog) as the visualization and analysis layer."

Fun Fact

The first version of the OpenTelemetry GenAI semantic conventions was drafted during a hackathon where engineers from six different LLM observability startups realized they had each invented their own incompatible span attribute names for "number of tokens used." The field name gen_ai.usage.input_tokens exists because seven companies spent a weekend arguing about whether to call it "prompt_tokens," "input_tokens," or "request_tokens."

42.9.2 Instrumenting LLM API Calls

The foundation of LLM observability is instrumenting every model call with a span that captures the request parameters, response metadata, and timing information. The OpenTelemetry Python SDK supplies the primitives; the opentelemetry-instrumentation-openai and similar auto-instrumentation packages handle the most common providers automatically.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource

# Initialize the tracer with service metadata
resource = Resource.create({
 "service.name": "my-llm-app",
 "service.version": "1.2.0",
 "deployment.environment": "production",
})

provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(
 OTLPSpanExporter(endpoint="http://otel-collector:4317")
)
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("llm-app.tracer")
Output: Logged run run_1712300400: accuracy=0.85 Logged run run_1712300401: accuracy=0.82 Run ID Accuracy Params ------------------------------------------------------------ run_1712300400 0.8500 {'model': 'logistic_regression', 'C': 1.0} run_1712300401 0.8200 {'model': 'logistic_regression', 'C': 0.1}
Code Fragment 42.9.1a: Using opentelemetry, trace, TracerProvider

With the tracer initialized, you can instrument individual LLM calls using the GenAI semantic conventions. Each span should capture the model name, provider, token usage, and any relevant request parameters. This manual instrumentation gives you full control over what is captured, which is important for sensitive applications where you may need to redact prompt content.

import openai
from opentelemetry import trace
from opentelemetry.semconv.attributes import gen_ai_attributes as GenAI

tracer = trace.get_tracer("llm-app.tracer")

async def traced_chat_completion(messages, model="gpt-4o", **kwargs):
    """Wrap an OpenAI chat completion with OTel tracing."""
    with tracer.start_as_current_span(
        "gen_ai.chat",
        attributes={
        GenAI.GEN_AI_SYSTEM: "openai",
        GenAI.GEN_AI_REQUEST_MODEL: model,
        GenAI.GEN_AI_REQUEST_TEMPERATURE: kwargs.get("temperature", 1.0),
        GenAI.GEN_AI_REQUEST_MAX_TOKENS: kwargs.get("max_tokens", 4096),
        }
        ) as span:
        try:
            client = openai.AsyncOpenAI()
            response = await client.chat.completions.create(
                model=model, messages=messages, **kwargs
                )

            # Record response metadata
            usage = response.usage
            span.set_attribute(GenAI.GEN_AI_RESPONSE_MODEL, response.model)
            span.set_attribute(GenAI.GEN_AI_USAGE_INPUT_TOKENS, usage.prompt_tokens)
            span.set_attribute(GenAI.GEN_AI_USAGE_OUTPUT_TOKENS, usage.completion_tokens)
            span.set_attribute("gen_ai.response.finish_reason",
                response.choices[0].finish_reason)

            # Record cost estimate (custom attribute)
            cost = estimate_cost(model, usage.prompt_tokens, usage.completion_tokens)
            span.set_attribute("gen_ai.usage.cost_usd", cost)

            return response

        except Exception as e:
            span.set_status(trace.StatusCode.ERROR, str(e))
            span.record_exception(e)
            raise
Code Fragment 42.9.2: Manual instrumentation of an OpenAI chat completion call using GenAI semantic conventions. The span captures request parameters (model, temperature, max_tokens), response metadata (actual model used, finish reason), token usage, and an estimated cost. Exceptions are recorded on the span for error correlation.
Library Shortcut: OpenLLMetry for LLM Observability

The same result in 3 lines with OpenLLMetry (auto-instruments all LLM calls):

Show code
# pip install traceloop-sdk
from traceloop.sdk import Traceloop

Traceloop.init(app_name="my-llm-app")
# All OpenAI, Anthropic, and LangChain calls are now automatically
# traced with GenAI semantic conventions. No manual spans needed.
Code Fragment 42.9.11: Pip install traceloop-sdk.

42.9.3 Trace Propagation Through Agent Chains

Single LLM calls are straightforward to trace. The real complexity emerges in agentic systems where a single user request triggers a chain of LLM calls, tool invocations, and sub-agent delegations. Consider a RAG pipeline: the user's query is first embedded, then used for vector search, then the retrieved documents are assembled into a prompt, and finally the LLM generates a response. Each step should be a child span under a single parent trace, preserving the causal chain.

For multi-agent systems, trace context must propagate across agent boundaries. When an orchestrator agent delegates a task to a specialist agent, the specialist's spans should appear as children of the orchestrator's delegation span. If agents communicate via message queues or HTTP, the OTel context propagation headers (traceparent, tracestate) must be injected into the message metadata and extracted on the receiving side.

from opentelemetry import trace, context
from opentelemetry.context.contextvars_context import ContextVarsRuntimeContext

tracer = trace.get_tracer("llm-app.agent-chain")

async def rag_pipeline(query: str):
    """Full RAG pipeline with hierarchical tracing."""
    with tracer.start_as_current_span("rag.pipeline") as pipeline_span:
        pipeline_span.set_attribute("rag.query", query)

        # Step 1: Embed the query
        with tracer.start_as_current_span("rag.embed_query") as embed_span:
            embedding = await embed_text(query)
            embed_span.set_attribute("rag.embedding_model", "text-embedding-3-small")
            embed_span.set_attribute("rag.embedding_dim", len(embedding))

            # Step 2: Vector search
            with tracer.start_as_current_span("rag.vector_search") as search_span:
                results = await vector_store.search(embedding, top_k=10)
                search_span.set_attribute("rag.results_count", len(results))
                search_span.set_attribute("rag.top_score", results[0].score)

                # Step 3: Rerank
                with tracer.start_as_current_span("rag.rerank") as rerank_span:
                    reranked = await reranker.rerank(query, results, top_k=5)
                    rerank_span.set_attribute("rag.reranked_count", len(reranked))

                    # Step 4: Generate response (this creates its own child span)
                    context_text = "\n\n".join(r.text for r in reranked)
                    response = await traced_chat_completion(
                        messages=[
                        {"role": "system", "content": f"Context:\n{context_text}"},
                        {"role": "user", "content": query}
                        ],
                        model="gpt-4o"
                        )

                    pipeline_span.set_attribute("rag.total_tokens",
                        response.usage.prompt_tokens + response.usage.completion_tokens)
                    return response.choices[0].message.content
Code Fragment 42.9.3: Hierarchical tracing for a RAG pipeline. Each step (embedding, vector search, reranking, generation) is a child span under the parent rag.pipeline span. The traced_chat_completion function from Code Fragment 42.9.6 automatically creates a nested child span for the LLM call. This structure allows you to see exactly where time is spent in the pipeline.
Real-World Scenario: Agent Tool-Call Tracing

Who: A platform engineer at a legal technology company operating a tool-using agent that researched case law and drafted legal summaries.

Situation: The agent used five tools (case search, statute lookup, citation validator, summarizer, and document formatter) and processed roughly 2,000 queries per day.

Problem: Users reported that some queries took over 45 seconds, but the team had no visibility into which tool calls were responsible. Aggregate latency metrics showed only the total request duration, not the per-tool breakdown.

Decision: The engineer instrumented each iteration of the agent loop as a child span under the agent's main OTel span, recording tool name, arguments, execution time, and result size on each tool span.

Result: Tracing revealed that the citation validator was responsible for 60% of total latency on slow queries because it made synchronous HTTP calls to an external API with no timeout. Adding a 3-second timeout and a local cache for frequently cited cases reduced P99 agent latency from 45 seconds to 12 seconds. Token cost attribution also showed that tool result parsing consumed 40% of the agent's total token budget, prompting the team to truncate verbose tool outputs. The cost control patterns in Section 49.3 rely on exactly this granularity of data.

Lesson: Without per-tool-call tracing, agent performance debugging is guesswork. Structured spans for each agent loop iteration make it trivial to identify which tool is the bottleneck and where tokens are being consumed.

42.9.4 Token Tracking and Cost Attribution

Token usage is the primary cost driver for LLM applications. OTel tracing provides the infrastructure to track tokens at every level of granularity: per call, per pipeline step, per user session, per feature, and per tenant. By recording gen_ai.usage.input_tokens and gen_ai.usage.output_tokens on every LLM span, you can aggregate costs across any dimension using your tracing backend's query language.

Cost attribution becomes critical in multi-tenant applications where different customers or features consume different amounts of LLM resources. By adding custom span attributes for tenant ID, feature name, and user tier, you can build per-tenant cost dashboards. This data feeds directly into billing systems and helps identify features or users that generate disproportionate costs. The cost control strategies in Section 49.3 depend on this level of cost visibility.

from opentelemetry import trace
from dataclasses import dataclass
from typing import Optional

@dataclass
class CostConfig:
    """Per-model pricing (USD per 1K tokens)."""
    input_cost_per_1k: float
    output_cost_per_1k: float

    PRICING = {
        "gpt-4o": CostConfig(0.0025, 0.01),
        "gpt-4o-mini": CostConfig(0.00015, 0.0006),
        "claude-sonnet-4-20250514": CostConfig(0.003, 0.015),
        "claude-haiku-4-20250414": CostConfig(0.0008, 0.004),
        }

class TokenTracker:
    """Track token usage and costs across the application."""

    def __init__(self):
        self.tracer = trace.get_tracer("llm-app.cost-tracker")

    def record_usage(
        self,
        model: str,
        input_tokens: int,
        output_tokens: int,
        tenant_id: Optional[str] = None,
        feature: Optional[str] = None,
        ):
        """Record token usage on the current span with cost attribution."""
        span = trace.get_current_span()
        pricing = PRICING.get(model, CostConfig(0.01, 0.03))

        input_cost = (input_tokens / 1000) * pricing.input_cost_per_1k
        output_cost = (output_tokens / 1000) * pricing.output_cost_per_1k
        total_cost = input_cost + output_cost

        span.set_attribute("gen_ai.usage.input_tokens", input_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", output_tokens)
        span.set_attribute("gen_ai.usage.cost_usd", round(total_cost, 6))

        if tenant_id:
            span.set_attribute("app.tenant_id", tenant_id)
            if feature:
                span.set_attribute("app.feature", feature)

                # Also emit as a metric for real-time dashboards
                from opentelemetry import metrics
                meter = metrics.get_meter("llm-app.cost-meter")
                token_counter = meter.create_counter(
                    "gen_ai.tokens.total",
                    description="Total tokens consumed",
                    )
                cost_counter = meter.create_counter(
                    "gen_ai.cost.usd",
                    description="Estimated cost in USD",
                    )
                token_counter.add(input_tokens + output_tokens, {
                    "model": model, "token_type": "total",
                    "tenant_id": tenant_id or "unknown",
                    })
                cost_counter.add(total_cost, {
                    "model": model,
                    "tenant_id": tenant_id or "unknown",
                    })
Code Fragment 42.9.4: Token tracking with cost attribution. The TokenTracker records token counts and estimated costs both as span attributes (for per-request analysis) and as OTel metrics (for real-time dashboards and alerting). Custom attributes for tenant and feature enable cost allocation across business dimensions.

42.9.5 Auto-Instrumentation with OpenLLMetry

Manual instrumentation provides maximum control but requires modifying every LLM call site. For faster adoption, auto-instrumentation libraries can patch LLM client libraries at import time to automatically generate spans. OpenLLMetry (by Traceloop) is the most mature auto-instrumentation package for LLM applications. It supports OpenAI, Anthropic, Cohere, LangChain, LlamaIndex, ChromaDB, Pinecone, and many other libraries.

# Auto-instrument all supported LLM libraries with one call
from traceloop.sdk import Traceloop

Traceloop.init(
 app_name="my-llm-app",
 # Export to any OTel-compatible backend
 exporter_endpoint="http://otel-collector:4317",
 # Control what gets captured
 tracing_enabled=True,
 metrics_enabled=True,
 # Redact prompt/completion content for privacy
 should_enrich_metrics=True,
)

# Now all OpenAI, Anthropic, LangChain calls are automatically traced
import openai

client = openai.OpenAI()
# This call is automatically instrumented with GenAI semantic conventions
response = client.chat.completions.create(
 model="gpt-4o-mini",
 messages=[{"role": "user", "content": "Explain observability in one sentence."}]
)
# A span is automatically created with model, tokens, latency, and cost
Code Fragment 42.9.5: OpenLLMetry auto-instrumentation with a single initialization call. Once initialized, every call to supported LLM libraries (OpenAI, Anthropic, LangChain, LlamaIndex, vector databases) is automatically traced with GenAI semantic conventions. The should_enrich_metrics flag controls whether prompt and completion content is included in traces.

Auto-instrumentation is ideal for getting started quickly, but it captures everything indiscriminately. In production, you typically want a hybrid approach: auto-instrumentation for baseline coverage, plus manual spans for business-critical paths where you need custom attributes (tenant ID, feature flags, A/B test variants). The manual @workflow and @task decorators from Traceloop can annotate specific functions with semantic meaning.

Warning

Recording full prompt and completion content in traces creates significant privacy and compliance risks. User messages may contain personal information, health data, or other sensitive content subject to GDPR, HIPAA, or similar regulations. Always configure content redaction in production. Most auto-instrumentation libraries support content filtering; use it. Store full prompt/completion logs only in systems with appropriate access controls and retention policies. The privacy and compliance topics in Chapter 47 apply directly here.

Key Takeaways

What Comes Next

With OTel instrumentation in place, the next step is to visualize and alert on the resulting telemetry. Continue in Section 42.9a: OTel Dashboards for LLM Operations.

Further Reading

OpenTelemetry for LLMs

OpenTelemetry Project. (2024). "OpenTelemetry GenAI Semantic Conventions." opentelemetry.io
Traceloop. (2024). "OpenLLMetry: Open-source observability for LLMs." github.com/traceloop