Series overview: Production AI Agents — From Notebook to Production

Observability for AI agents is traditional application performance monitoring (APM) plus new signal types, a new industry standard (OTel GenAI conventions), and tooling that understands LLM calls, tool invocations, and agent reasoning. Without it, an agent that fails silently, cost-explodes, or loops indefinitely is indistinguishable from one that works correctly.

Try it now: VS Code Copilot, OpenAI Codex, and Claude Code all emit OpenTelemetry traces using GenAI semantic conventions. Enable github.copilot.chat.otel.enabled, point at a local OTLP endpoint (e.g., Aspire Dashboard), and see invoke_agentchatexecute_tool span trees from daily development work.

Map

Observability for AI Agents
│
├── CAPABILITIES — taxonomy: metrics, logs, traces, monitoring, traceability, diagnosability
│
├── STANDARD: OTel GenAI Semantic Conventions (v1.37+)
│   ├── Model spans (gen_ai.*) — LLM calls, widely adopted
│   ├── Agent spans (gen_ai.agent.*) — workflows, drafted
│   └── Agent framework conventions — in progress
│
├── SIGNALS — Traces, Metrics, Logs, Events
│   ├── Traces — LLM calls, tool invocations, plan phases
│   ├── Metrics — token usage, latency, cost, guardrail rate
│   ├── Logs — structured records with trace_id linkage
│   └── Events — discrete agent actions (API calls, handoffs)
│
├── CORRELATION — trace_id propagated through every downstream call
│
├── ARCHITECTURE
│   ├── In-process SDK — direct OTel instrumentation
│   ├── Proxy/Sidecar — transparent LLM call capture
│   ├── Framework callbacks — LangChain, Semantic Kernel hooks
│   └── Instrumentation source — baked-in (CrewAI) vs OTel library (OpenInference)
│
├── TOOLING
│   ├── Open-source platforms — Langfuse, MLflow, Agenta
│   ├── Specialized — Phoenix (RAG), Helicone (proxy)
│   ├── Full-stack — SigNoz, Datadog, Grafana
│   ├── Cloud-native — AWS Bedrock + CloudWatch, Azure Foundry
│   └── DIY — OTel SDK → Collector → existing stack
│
└── ALERTING — actionable thresholds for guardrails, cost, latency

Capability Taxonomy

Observability decomposes into distinct capabilities, each answering a different question:

Capability Answers For AI agents
Metrics What changed? Token counts, latency distributions, error rates, cost per interaction
Logs What happened? Prompt/completion pairs, tool inputs/outputs, guardrail decisions
Traces Where did it happen? LLM call → tool invocation → re-planning → response, with causal links
Monitoring Are known failures happening? Alerts on guardrail spikes, tool failures, cost anomalies
Observability Can we explore unknown failures? Ask novel questions without predefined dashboards
Traceability What path did this request take? Correlation IDs tying a user message to every downstream LLM and tool call
Diagnosability How fast can we find root cause? The faster you trace symptom to source, the better the obs design

Auditability — reconstructing who did what and when — sits in the governance layer. It consumes traceability data but adds identity, policy context, immutability, and retention.

The Standard: OTel GenAI Conventions

The industry is converging on OpenTelemetry GenAI Semantic Conventions (v1.37+) via OTel’s GenAI SIG. If you instrument with these conventions, telemetry is portable across Datadog, Langfuse, Phoenix, SigNoz, Grafana — any OTLP-speaking backend.

Convention What it covers Status
Model spans (gen_ai.*) LLM calls — model name, token counts, temperature, provider Development (widely adopted)
Agent application (gen_ai.agent.*) Agent workflows — planning, tool calls, task execution Draft (Google AI Agent whitepaper)
Agent framework Common convention across frameworks (LangGraph, CrewAI, AutoGen, IBM Bee, PydanticAI) In progress (#1530)

Application vs Framework: The GenAI SIG distinguishes an agent application (individual AI entity performing tasks autonomously) from an agent framework (infrastructure for building agents). The application convention is drafted; the framework convention is next priority. Source: OTel blog on evolving agent standards.

Adoption

Major providers already ship OTel GenAI:

  • Datadog LLM Observability — native OTel GenAI spans (v1.37+) since Dec 2025
  • AWS Bedrock AgentCore — OTel-compatible traces via ADOT
  • Microsoft Foundrymicrosoft-opentelemetry distro with GenAI conventions; co-developed multi-agent conventions with Cisco Outshift
  • Traceloop (OpenLLMetry) — donating instrumentation to OTel

OpenInference (Arize) adds llm.span_kind (LLM, TOOL, CHAIN, AGENT, RETRIEVER) on top of OTel. Industry direction is OTel-native gen_ai.* as the base, vendor extensions layered on top.

Signals

Traces

A trace for an AI agent is a tree where the agent decides the shape at runtime — how many LLM calls, which tools, how many re-planning iterations. The OTel spec defines the span types:

User Request (span)
├── Guardrail: input validation (span)
├── Plan: task decomposition (plan span)
│   └── LLM Call: generates plan (chat span)
│       ├── gen_ai.usage.input_tokens: 1240
│       ├── gen_ai.usage.output_tokens: 87
│       ├── gen_ai.request.model: "gpt-4o-mini"
│       └── latency_ms: 1432
├── Tool Call: search_docs("refund policy") (execute_tool span)
│   ├── input: {query: "refund policy", top_k: 5}
│   ├── output: {results: 3, total_ms: 89}
│   └── latency_ms: 91
├── LLM Call: final response (chat span)
│   ├── gen_ai.usage.input_tokens: 2890
│   ├── gen_ai.usage.output_tokens: 412
│   ├── gen_ai.request.model: "gpt-4o"
│   └── latency_ms: 3210
├── Guardrail: output validation (span)
└── Response to user

Spans are dynamic — tracing infrastructure must handle variable-depth, variable-width traces. The agent decides at runtime.

Span Types

The gen-ai-agent-spans.md spec defines four span types:

Span type (Weaver registry key) gen_ai.operation.name Span kind When to use
Invoke agent (local)gen_ai.invoke_agent.internal invoke_agent INTERNAL In-process agent invocation (LangChain, CrewAI). Span name: invoke_agent {agent.name}
Invoke agent (remote)gen_ai.invoke_agent.client invoke_agent CLIENT Remote agent services (OpenAI Assistants, Bedrock Agents). Span name: invoke_agent {agent.name}
Invoke workflowgen_ai.invoke_workflow.internal invoke_workflow INTERNAL Multi-agent orchestration (e.g., CrewAI crew). Span name: invoke_workflow {workflow.name}
Plangen_ai.plan.internal plan INTERNAL Agent planning/task decomposition. LLM call that produces the plan is a child span (chat); tool spans are siblings under invoke_agent. Span name: plan {agent.name}

Frameworks that can distinguish workflow from agent (CrewAI) SHOULD report invoke_workflow. Frameworks that can’t (Google ADK) SHOULD NOT report invoke_workflow — they report invoke_agent for all agent types.

Plan vs standard inference: plan spans SHOULD be reported only when instrumentation can reliably determine the operation is planning/decomposition, and SHOULD NOT be reported when it can’t distinguish planning from generic reasoning. If you’re intercepting raw HTTP calls, emit chat spans, not plan.

Span Attributes

LLM spansgen_ai.operation.name is set to one of: chat, text_completion, embeddings, generate_content, retrieval.

Attributes:

Attribute Type Purpose Opt-In?
gen_ai.request.model string Model identifier (gpt-4o-2024-08-06) No
gen_ai.usage.input_tokens / gen_ai.usage.output_tokens int Token counts No
gen_ai.usage.cache_read.input_tokens / gen_ai.usage.cache_creation.input_tokens int Cached-prompt cost attribution No
gen_ai.request.temperature / gen_ai.request.top_p / gen_ai.request.max_tokens double/int Request parameters No
gen_ai.provider.name string Cloud/provider (openai, aws.bedrock, anthropic) No
gen_ai.request.seed / gen_ai.request.stop_sequences / gen_ai.request.frequency_penalty int/string[]/double Advanced parameters No
gen_ai.request.stream boolean Whether the request was streaming No
gen_ai.response.finish_reasons string[] Stop reasons (stop, length, tool_calls) No
gen_ai.system.instructions any System prompt Yes
gen_ai.input.messages / gen_ai.output.messages any Full prompt/completion Yes
gen_ai.tool.definitions any Tool schemas passed to model Yes

Agent spans (invoke_agent, invoke_workflow, plan):

Attribute Type Purpose
gen_ai.agent.name string Human-readable name (Math Tutor)
gen_ai.agent.description string Free-form description
gen_ai.agent.version string Semver or date (1.0.0, 2025-05-01)
gen_ai.conversation.id string Session/thread identifier
gen_ai.workflow.name string Multi-agent workflow name
gen_ai.data_source.id string RAG/grounding data source
gen_ai.output.type string Output type: text, json, image, speech

Additional gen_ai.operation.name values for memory and persistence — not backed by dedicated span type definitions yet, but part of the same enumeration: create_memory_store, delete_memory_store, create_memory, upsert_memory, update_memory, delete_memory, search_memory (spec).

Metrics

OTel GenAI defines two official metrics (blog):

  • gen_ai.client.operation.duration — histogram of LLM call latencies (filterable by gen_ai.request.model)
  • gen_ai.client.token.usage — histogram of token consumption (filterable by gen_ai.token.type: input / output)

These two official metrics feed dashboards for latency and token distributions. Beyond them, track these operational signals — most require span-level queries, not just metric aggregations:

Metric What it measures How to calculate Why it matters
Tokens per request Prompt + completion tokens per agent turn Sum gen_ai.usage.input_tokens + output_tokens across all LLM spans in a trace Cost attribution, anomaly detection
LLM calls per user request Distribution of LLM rounds Count spans with gen_ai.operation.name in (chat, text_completion, generate_content) grouped by trace_id Spot agents in infinite planning loops
Tool success rate Per-tool error rate, latency distribution Filter spans where gen_ai.operation.name=execute_tool; error rate = count(status=ERROR) / total per tool Flaky/slow tools
Cache hit rate Prompt/embedding/cache hit rate gen_ai.usage.cache_read.input_tokens / (gen_ai.usage.cache_read.input_tokens + gen_ai.usage.input_tokens) ROI of caching infrastructure
Time to first token (TTFT) Request → first response token gen_ai.response.time_to_first_chunk — span attribute, set for streaming requests Perceived responsiveness
Rate limit hits 429 / rate-limit count, queue depth Provider-specific: check HTTP 429 status on LLM spans or provider error response fields. Not yet covered by gen_ai.response.finish_reasons. Group by provider/model Capacity planning, fallback models
Guardrail trigger rate Input/output guardrail fire rate Count spans tagged with guardrail trigger attributes (vendor-specific convention) vs total requests Content safety, prompt injection trends
Cost per successful interaction Total cost per completed request Sum (input_tokens × input_price + output_tokens × output_price + cache_tokens × cache_price) per trace_id, divided by count of traces with success status Example: GPT-4o at 5K in + 2K out ≈ $0.03. Complex agent with re-planning (20K in + 5K out) ≈ $0.10 (GPT-4o) to $0.26 (o3-pro). Pricing from models.dev

Logs

From the framework: when instrumented with OTel, agent frameworks emit spans (traces) following GenAI semantic conventions — each LLM call, tool invocation, and planning step produces a span with gen_ai.operation.name, token counts, model name, and other attributes. Log records are a separate OTel signal — frameworks may or may not emit them depending on their instrumentation depth.

From your code: integrate OTel logging to produce structured log records. When a span is active, the OTel SDK automatically attaches trace_id and span_id to every log record — no manual correlation code. Add application-specific fields as structured attributes (agent turn number, decision context, eval results).

Capturing prompt/completion content: frameworks can populate gen_ai.input.messages and gen_ai.output.messages span attributes with full prompt and response text — but this is opt-in (disabled by default, since prompts can contain sensitive data). Turn it on via instrumentation config (e.g., github.copilot.chat.otel.captureContent=true) rather than writing your own capture code. When enabled, the raw text is large — strategies to control storage cost:

Strategy How it works Best for
Sampling Log full prompt/completion for 5% of requests; for the other 95%, log only token counts and model name Production at scale — gives you enough samples for eval without the storage cost
Separate store Write full prompts to cheap blob storage (S3, Azure Blob). Store only a blob pointer + metadata in your log aggregator When you need 100% capture for compliance or audit trails
Eval pipeline Stream sampled traces to an evaluation pipeline that scores them (relevance, groundedness, safety) and discards raw text. Store only the scores Continuous quality monitoring — you watch scores, not raw text

Events (The Fourth Signal)

Discrete, semantically meaningful agent actions — structured records, not raw log lines. IBM’s observability framework calls this the MELT model, where Events is the fourth signal alongside Metrics, Logs, and Traces:

Event type Example Why it matters
API call Agent calls search API Track tool usage and cost
LLM call Agent sends prompt to GPT-4o Quality analysis
Failed tool call DB query returns connection error Alert, root cause
Human handoff Agent escalates refund dispute Autonomy rate, capability gaps
Guardrail trigger Output filter blocks response Safety system effectiveness

OTel gen-ai-events.md defines structured input/output message schemas (JSON). Events are less granular than spans but more structured than free-text logs — useful for audit trails and compliance.

The companion article Architecture, Tooling & Alerting covers correlation, evaluation, instrumentation patterns, the tooling landscape, and alerting thresholds.

References