Series overview: Production AI Agents — From Notebook to Production

Concepts, Signals & the OTel Standard covered the OTel GenAI standard and the four signals — traces, metrics, logs, events. This part covers the practical side: how to instrument agents, what tools exist, and what to alert on.

Correlation

A single user message triggers planning → parallel tool calls → internal services → vector DB → external API → final LLM response → guardrails. All must tie back to one trace_id.

Example: correlation ID propagation through an agent (web app)

Browser                Backend API           Agent Framework        LLM Provider
  │                       │                      │                     │
  │ GET /api/chat         │                      │                     │
  │ traceparent: 00-abc…  │                      │                     │
  │──────────────────────▶│                      │                     │
  │                       │ OTel extracts       │                     │
  │                       │ traceparent → sets  │                     │
  │                       │ Context.current()   │                     │
  │                       │ {trace_id: abc…}    │                     │
  │                       │                      │                     │
  │                       │ agent.run()          │                     │
  │                       │─────────────────────▶│                     │
  │                       │                      │ start_as_current_   │
  │                       │                      │ span() inherits     │
  │                       │                      │ trace_id from       │
  │                       │                      │ Context.current()   │
  │                       │                      │                     │
  │                       │                      │ POST /v1/chat       │
  │                       │                      │ traceparent: 00-abc…│
  │                       │                      │────────────────────▶│

The agent framework doesn’t “get” the trace_id from an HTTP header — it inherits it implicitly. The backend’s OTel HTTP instrumentation already extracted the traceparent header and stored {trace_id: abc…} in the current execution context. When the agent framework’s instrumentation code calls tracer.start_as_current_span(), it auto-detects the active parent from Context.current() and sets the new span’s parent_span_id — same trace_id, new span_id. The framework author wrote this once in their instrumentation library; your application code never touches trace propagation.

Propagation mechanisms:

  • HTTP headers: W3C traceparent — extracted by server instrumentation, injected by client instrumentation
  • Message queue metadata: inject trace context into message headers for async workflows
  • LLM provider metadata: if the provider doesn’t support traceparent, pass trace_id via user / metadata fields as a fallback

When propagation breaks — frameworks without OTel instrumentation produce no spans at all; async boundaries that don’t preserve OTel context orphan child spans; proxies see LLM calls but not agent-level reasoning.

Other starting points (no browser header to extract):

  • Desktop / mobile app: the client OTel SDK creates the root span locally — trace_id originates on the device, propagated via traceparent to the backend
  • CLI / background job: your code creates the root span with tracer.start_as_current_span() — no incoming header exists, the trace starts here
  • Message queue consumer: OTel instrumentation extracts trace context from the message envelope metadata — same mechanism as HTTP, different carrier

Architecture

Deployment patterns

Three ways to deploy instrumentation:

1. In-Process SDK

Agent code → OpenTelemetry SDK → OTLP exporter → Collector

Full control over span creation and attribute population. Couples instrumentation to agent code. Best for custom agent frameworks and small teams that need maximum flexibility.

2. Proxy / Sidecar

Agent code → localhost:4000 (LiteLLM) → OpenAI API

Sits between agent and LLM provider, captures all LLM calls transparently — no code changes needed. Best for brownfield systems and third-party frameworks you can’t modify.

Caveat: proxy sees LLM calls but not tool calls or internal agent logic. Combine with in-process spans or framework callbacks for full trace shape.

3. Framework-Native Callbacks

Agent framework hooks — LangChain callbacks, Semantic Kernel filters, AutoGen middleware. Register observers that fire on each LLM call, tool invocation, and agent turn. Best for framework-aware instrumentation (chain type capture, agent-to-agent handoff).

Instrumentation source: baked-in vs OTel library

Separate dimension — who provides the instrumentation code, the framework maintainer or the OTel community:

Approach Who provides it Examples Trade-off
Baked-in Framework maintainers CrewAI (native OTel) Simplest adoption; framework bloat, OTel dependency lag
OTel package Community/vendor OTel library OpenInference, microsoft-opentelemetry, Langtrace Decoupled obs from framework; fragmentation risk

Where to start

If your framework ships baked-in OTel instrumentation, turn it on — you get agent-level spans with zero work. If you need visibility into LLM calls outside the framework (raw SDK calls, multi-provider setups), add a proxy. If you migrate between frameworks or go custom, standardize on OTel GenAI conventions as the data format so traces remain comparable across stacks.

Tooling Landscape

Category Tools Best for
Open-source platforms Langfuse (29.9k ★, MIT), MLflow (26.7k ★, Apache 2.0), Opik (20k ★, Apache 2.0), Agenta (4.2k ★, MIT) Full obs + eval + prompt mgmt, self-hosted
Specialized tools Phoenix (10.3k ★, RAG/drift), Helicone (5.9k ★, proxy), AgentOps (5.7k ★, sessions), TruLens (3.4k ★, eval-first) Niche: retrieval quality, quick setup, session tracking
Full-stack + LLM SigNoz (27.5k ★, OTel-native), OpenObserve (19.5k ★, Apache 2.0), Datadog, Grafana Unified APM + LLM obs
Cloud-native platforms AWS: Bedrock AgentCore + CloudWatch GenAI Obs (GA Oct ‘25). Azure: AI Foundry Observability. GCP: Gemini Enterprise Agent Platform (Agent Engine) Obs integrated into the AI platform itself
Gateways + obs Portkey Gateway (12.2k ★), LiteLLM (51.8k ★) Multi-provider routing + unified logging
Instrumentation libs OpenLLMetry (7.2k ★), OpenLIT (2.6k ★), Langtrace (1.2k ★) Zero-code OTel instrumentation
SaaS platforms LangSmith, Braintrust, W&B Weave, Galileo, Pydantic Logfire (4.3k ★) Managed, opinionated, framework-specific

Tool Deep Dives

Langfuse

  • 29.9k GitHub stars, MIT, 10B+ obs/month, 19 of Fortune 50
  • Acquired by ClickHouse (early 2026)
  • Full trace viewer with LLM-specific rendering, built-in eval pipeline, prompt management
  • 100+ integrations across frameworks, model providers, and gateways
  • SDKs: Python, JS/TS native; any OTel-instrumented language (Go, Java, .NET, Ruby, PHP, Swift) can send traces via Langfuse’s OTLP endpoint
  • Self-hosting: Docker Compose, K8s (Helm), AWS/GCP/Azure (Terraform)
  • Enterprise: SOC 2 Type II, ISO 27001, GDPR, HIPAA eligible, EU & US data regions
  • MCP servers, CLI, coding agent skills for Claude Code, Cursor, Codex
  • Free tier: 50k observations/month

Agenta

  • OTel-native, MIT, SOC 2 Type II, self-hostable
  • Links prompt versions to traces, online + offline evals, experiment comparison
  • UI designed for engineers, PMs, and domain experts
  • Free tier: 5k traces/month

Phoenix by Arize

  • ELv2 (Elastic License 2.0), Jupyter-native
  • Embedding drift detection, retrieval evaluation (NDCG, MRR), trace+span viewer
  • Strong for RAG-heavy agents; notebook-first workflow

OpenLIT

  • Zero-code OTel auto-instrumentation for OpenAI, Anthropic, Cohere, HuggingFace, Ollama, etc.
  • Captures token counts, latency, costs, model parameters → standard OTel → any OTLP backend
  • Featured in official OTel blog

MLflow

  • Apache 2.0, extends classical ML platform into GenAI
  • LLM tracing: prompt versioning, trace replay (reproduce failure sequences), LLM-as-Judge eval
  • Self-hosted or Databricks-managed
  • Native LangChain, LlamaIndex, OpenAI, Anthropic integrations

Helicone

  • Proxy-based: change API base URL + auth header → instant logging
  • 100+ models, no code changes
  • Built-in caching, rate limiting, automatic failover
  • Cost tracking by model/user/feature
  • Not designed for deep agent reasoning — sees LLM calls only

SigNoz

  • OTel-native full-stack platform, open-source
  • Correlates LLM traces with infra metrics, K8s pod data, DB queries, microservice traces in one view
  • Custom dashboards + alerts on any telemetry
  • MCP server for AI-assisted troubleshooting
  • Self-hosted community edition or cloud (free trial)

Microsoft Foundry (Azure)

  • Built into Azure, powered by Azure Monitor Application Insights
  • Zero-code auto-instrumentation for Microsoft Agent Framework, Semantic Kernel
  • One-line setup for LangChain, LangGraph, OpenAI Agents SDK via microsoft-opentelemetry distro
  • Multi-agent conventions (co-developed with Cisco Outshift): execute_task, agent_to_agent_interaction, agent.state.management, agent_planning, agent orchestration
  • Pre-built evaluators: general quality, RAG quality, safety/security, agent quality
  • Content Safety APIs with severity scoring (0-7)
  • Security: content recording disabled via OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT
  • Dynatrace and Arize integrations

AWS: Bedrock AgentCore + CloudWatch GenAI

  • ADOT auto-instruments Strands, LangChain, LangGraph, CrewAI
  • Two pre-built dashboards: Model Invocations (token usage, latency P90/P99, errors, cost) and AgentCore Agents (Agents, Memory, Built-in Tools, Gateways, Identity)
  • AgentCore Evaluations (preview, Dec ‘25): auto quality assessment
  • End-to-end prompt tracing: models → knowledge bases → guardrails → tools; X-Ray + W3C traceparent
  • Session propagation via session.id in OTEL baggage
  • Third-party routing: DISABLE_ADOT_OBSERVABILITY=true → Langfuse, Datadog, or any OTel backend

DIY with OpenTelemetry

Agent → OTel SDK + gen_ai conventions → OTel Collector → Grafana Tempo/Mimir/Loki
                                                     → Datadog
                                                     → Honeycomb / Jaeger

Use OTel GenAI Semantic Conventions (v1.37+) as data vocabulary. Add lightweight instrumentation (OpenLIT, OpenLLMetry) for auto-capture. Governance policies (redaction, sampling) enforced at OTel Collector level before data leaves network.

Local development: Aspire Dashboard — free OTLP viewer, single Docker container:

docker run --rm -p 18888:18888 -p 4317:18889 -p 4318:18890 -d --name aspire-dashboard \
    -e ASPIRE_DASHBOARD_UNSECURED_ALLOW_ANONYMOUS=true \
    mcr.microsoft.com/dotnet/aspire-dashboard:latest

Alerting

Not everything that can be measured should trigger a page. Focus on actionable degradation. These are examples — thresholds depend on your agent’s own baseline:

Alert Threshold (example) Action
Guardrail trigger rate spike >10% of requests trigger output guardrail (vs 2% baseline) Roll back prompt/model change
Tool failure rate >5% error rate on any tool for >5 min Check downstream service health
Token explosion p99 token count >3× rolling average Agent may be in reasoning loop
Tool calls per request spike avg tool calls per request >2× rolling average for >10 min Agent may be in reasoning loop or flailing
Cost anomaly Cost per 1K requests >2× daily baseline Audit for prompt bloat, routing issues

Alert on deviation from the agent’s own baseline, not raw values — agent requests are inherently variable, and what’s normal for one agent is broken for another.

References