This software is an Alpha preview only. It may be discontinued, may include breaking changes, and may require code changes to adopt future versions.
Provide a stable, extensible core abstraction (GenAI Types + TelemetryHandler + CompositeEmitter + Evaluator hooks) separating instrumentation capture from telemetry flavor emission so that:
- Instrumentation authors create neutral GenAI data objects once.
- Different telemetry flavors (semantic conventions, vendor enrichments, events vs attributes, aggregated evaluation results, cost / agent metrics) are produced by pluggable emitters without touching instrumentation code.
- Evaluations (LLM-as-a-judge, quality metrics) run asynchronously and re-emit results through the same handler/emitter pipeline.
- Third parties can add / replace / augment emitters in well-defined category chains.
- Configuration is primarily environment-variable driven; complexity is opt-in.
Non-goal: Replace the OpenTelemetry SDK pipeline. Emitters sit above the SDK using public Span / Metrics / Logs / Events APIs.
Implemented dataclasses (in `types.py`):
- `GenAI` (base class)
- `LLMInvocation`
- `EmbeddingInvocation`
- `RetrievalInvocation`
- `Workflow`
- `AgentInvocation`
- `Step`
- `ToolCall`
- `EvaluationResult`
- `ErrorClassification` (enum: `REAL_ERROR`, `INTERRUPT`, `CANCELLATION`) governing span status behavior
Base dataclass fields include timing (`start_time`, `end_time`), identity (`run_id`, `parent_run_id`), context (`provider`, `framework`, `agent_*`, `system`, `conversation_id`, `data_source_id`), plus `attributes: dict[str, Any]` for free-form metadata.
Semantic attributes: fields tagged with `metadata={"semconv": <attr name>}` feed `semantic_convention_attributes()`, which returns only populated values; emitters rely on this reflective approach (no hard-coded attribute lists).
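The reflective pattern can be sketched with a stand-in dataclass. The field names and attribute keys below are illustrative, not the real `types.py` definitions:

```python
from dataclasses import dataclass, field, fields
from typing import Any, Optional


@dataclass
class DemoInvocation:
    # Fields tagged with "semconv" metadata map to semantic-convention attributes.
    request_model: Optional[str] = field(
        default=None, metadata={"semconv": "gen_ai.request.model"}
    )
    provider: Optional[str] = field(
        default=None, metadata={"semconv": "gen_ai.provider.name"}
    )
    # Free-form metadata carries no semconv tag and is excluded from reflection.
    attributes: dict = field(default_factory=dict)

    def semantic_convention_attributes(self) -> dict:
        # Return only populated semconv-tagged fields; no hard-coded list.
        out: dict[str, Any] = {}
        for f in fields(self):
            key = f.metadata.get("semconv")
            if key is None:
                continue
            value = getattr(self, f.name)
            if value is not None:
                out[key] = value
        return out


inv = DemoInvocation(request_model="gpt-4o-mini")
print(inv.semantic_convention_attributes())
# {'gen_ai.request.model': 'gpt-4o-mini'}
```

Because emitters read the metadata tags at dispatch time, new tagged fields become emitted attributes without any emitter changes.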
Messages: InputMessage / OutputMessage each hold role and parts (which may be Text, ToolCall, ToolCallResponse, or arbitrary parts). Output messages have an optional finish_reason (meaningful for LLM responses, omitted for agent/workflow outputs).
EvaluationResult fields: metric_name, optional score (float), label (categorical outcome), explanation, error (contains type, message), attributes (additional evaluator-specific key/values). No aggregate wrapper class yet.
`TelemetryHandler` provides the external APIs for the GenAI type lifecycle.
Capabilities:
- Type-specific lifecycle: `start_llm`, `stop_llm`, `fail_llm`, plus `start`/`stop`/`fail` for embedding, tool call, workflow, agent, and step.
- Generic dispatchers: `start(obj)`, `finish(obj)`, `fail(obj, error)`.
- Dynamic content-capture refresh (`_refresh_capture_content`) on each LLM / agentic start (re-reads env + experimental gating).
- Delegation to `CompositeEmitter` (`on_start`, `on_end`, `on_error`, `on_evaluation_results`).
- Completion callback registry (`CompletionCallback`); the Evaluation Manager auto-registers if evaluators are present.
- Evaluation emission via `evaluation_results(invocation, list[EvaluationResult])`.
Invocation objects hold a span reference.
`EmitterProtocol` offers: `on_start(obj)`, `on_end(obj)`, `on_error(error, obj)`, `on_evaluation_results(results, obj=None)`.
`EmitterMeta` supplies `role`, `name`, optional `override`, and a default `handles(obj)` returning `True`. Role names are informational and may not match category names (e.g., `MetricsEmitter.role == "metric"`).
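A third-party emitter only needs to match this protocol shape. The class below is a minimal sketch with assumed names (it captures calls in memory rather than emitting real telemetry):

```python
from typing import Any, Optional


class CapturingEmitter:
    # EmitterMeta-style attributes; role is informational and need not
    # match a category name.
    role = "custom"
    name = "capturing"
    override = False

    def __init__(self) -> None:
        self.events: list[tuple[str, Any]] = []

    def handles(self, obj: Any) -> bool:
        return True  # default: accept every invocation type

    def on_start(self, obj: Any) -> None:
        self.events.append(("start", obj))

    def on_end(self, obj: Any) -> None:
        self.events.append(("end", obj))

    def on_error(self, error: Any, obj: Any) -> None:
        self.events.append(("error", obj))

    def on_evaluation_results(self, results, obj: Optional[Any] = None) -> None:
        self.events.append(("evaluation_results", results))


emitter = CapturingEmitter()
emitter.on_start({"op": "chat"})
emitter.on_end({"op": "chat"})
print([kind for kind, _ in emitter.events])  # ['start', 'end']
```

An emitter like this can be appended to a category chain at runtime (e.g. via `add_emitter`) without touching instrumentation code.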
Defines ordered category dispatch with explicit sequences:
- Start order: `span`, `metrics`, `content_events`
- End/error order: `evaluation`, `metrics`, `content_events`, `span` (the span ends last so other emitters can enrich attributes first; evaluation emitters appear first in the end sequence to allow flush behavior).
Public API (current): `iter_emitters(categories)`, `emitters_for(category)`, `add_emitter(category, emitter)`. A richer `register_emitter(..., position, mode)` API is not yet implemented.
Entry point group: opentelemetry_util_genai_emitters (vendor packages contribute specs).
`EmitterSpec` fields:
- `name`
- `category` (`span`, `metrics`, `content_events`, `evaluation`)
- `factory(context)`
- `mode` (`append`, `prepend`, `replace-category`, `replace-same-name`)
- `after`, `before` (ordering hints; currently unused / inert)
- `invocation_types` (allow-list; implemented via dynamic `handles` wrapping)
Ordering hints will either gain a resolver or be removed (open item).
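A vendor package's contribution might look roughly like the following. This is a sketch only: the spec is shown as a plain dict, and the factory signature and registration mechanics are assumptions, not the published API:

```python
def _build_vendor_metrics(context):
    # In the real pipeline, context would carry provider handles
    # (meter/tracer/logger); here it is unused.
    class VendorMetricsEmitter:
        role = "metric"
        name = "VendorMetrics"

        def handles(self, obj):
            return True

        def on_start(self, obj): ...
        def on_end(self, obj): ...
        def on_error(self, error, obj): ...
        def on_evaluation_results(self, results, obj=None): ...

    return VendorMetricsEmitter()


# Field names mirror the EmitterSpec description above.
VENDOR_SPEC = {
    "name": "VendorMetrics",
    "category": "metrics",
    "factory": _build_vendor_metrics,
    "mode": "append",  # append | prepend | replace-category | replace-same-name
    "invocation_types": ["LLMInvocation"],  # allow-list, applied via handles() wrapping
}

emitter = VENDOR_SPEC["factory"](None)
print(emitter.name)  # VendorMetrics
```

A spec like this would be exposed through the `opentelemetry_util_genai_emitters` entry point group so the configuration layer can discover and register it.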
Baseline selection: `OTEL_INSTRUMENTATION_GENAI_EMITTERS` (comma-separated tokens):
- `span` (default), `span_metric`, `span_metric_event`
- Additional tokens select extra emitters (e.g. `traceloop_compat`). If the only token is `traceloop_compat`, the semconv span is suppressed (`only_traceloop_compat`).
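The token semantics can be sketched as a small parser (hypothetical helper; the real parsing lives in the configuration layer):

```python
BASELINES = {"span", "span_metric", "span_metric_event"}


def parse_emitter_tokens(raw: str):
    """Split OTEL_INSTRUMENTATION_GENAI_EMITTERS into (baseline, extras, only_traceloop)."""
    tokens = [t.strip() for t in raw.split(",") if t.strip()]
    baseline = next((t for t in tokens if t in BASELINES), None)
    extras = [t for t in tokens if t not in BASELINES]
    # Special case: a lone traceloop_compat suppresses the semconv span.
    only_traceloop = tokens == ["traceloop_compat"]
    if baseline is None and not only_traceloop:
        baseline = "span"  # default baseline
    return baseline, extras, only_traceloop


print(parse_emitter_tokens("span,traceloop_compat"))
# ('span', ['traceloop_compat'], False)
print(parse_emitter_tokens("traceloop_compat"))
# (None, ['traceloop_compat'], True)
```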
Category overrides (`OTEL_INSTRUMENTATION_GENAI_EMITTERS_<CATEGORY>` with `<CATEGORY>` = `SPAN|METRICS|CONTENT_EVENTS|EVALUATION`) support the directives `append:`, `prepend:`, `replace:` (alias for `replace-category:`), `replace-category:`, and `replace-same-name:`.
Implemented through EmitterSpec.invocation_types; configuration layer replaces/augments each emitter’s handles method to short‑circuit dispatch cheaply. No explicit positional insertion API yet; runtime additions can call add_emitter (append only).
Supported modes: append, prepend, replace-category (alias replace), replace-same-name. Ordering hints (after / before) are present but inactive.
CompositeEmitter wraps all emitter calls; failures are debug‑logged. Error metrics hook (genai.emitter.errors) is not yet implemented (planned enhancement).
The Error dataclass includes a classification field (ErrorClassification enum) that controls how the span emitter sets span status:
| Classification | Span Status | Use Case |
|---|---|---|
| `REAL_ERROR` (default) | `ERROR` with description | Genuine failures |
| `INTERRUPT` | `UNSET` (default) + `gen_ai.interrupt=true` | Framework-level interrupts (e.g., LangGraph `GraphInterrupt`) requiring human input |
| `CANCELLATION` | `UNSET` (default) | Task cancellations (`asyncio.CancelledError`) |
For INTERRUPT and CANCELLATION, set_status() is intentionally not called — the span retains its default UNSET status. Per the OTel Trace Spec, UNSET means "no error" without the stronger assertion of OK ("validated as successfully completed"). Most backends treat both UNSET and OK as non-error for alerting purposes.
Instrumentation libraries classify errors by inspecting exception type hierarchies. For example, the LangChain instrumentation recognizes GraphInterrupt, NodeInterrupt, and Interrupt as interrupt types, and CancelledError / TaskCancelledError as cancellation types — without importing LangGraph (uses type name string matching).
Step spans additionally set gen_ai.step.status to interrupted or cancelled for non-error classifications.
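The type-name matching approach can be sketched as follows. The recognized names match those listed above; walking the MRO is an assumption about how subclass matching might work:

```python
import asyncio
from enum import Enum


class ErrorClassification(Enum):
    REAL_ERROR = "real_error"
    INTERRUPT = "interrupt"
    CANCELLATION = "cancellation"


# Matching by class name means LangGraph never has to be imported.
INTERRUPT_TYPE_NAMES = {"GraphInterrupt", "NodeInterrupt", "Interrupt"}
CANCELLATION_TYPE_NAMES = {"CancelledError", "TaskCancelledError"}


def classify_exception(exc: BaseException) -> ErrorClassification:
    # Walk the MRO so subclasses of interrupt/cancellation types match too.
    for cls in type(exc).__mro__:
        if cls.__name__ in INTERRUPT_TYPE_NAMES:
            return ErrorClassification.INTERRUPT
        if cls.__name__ in CANCELLATION_TYPE_NAMES:
            return ErrorClassification.CANCELLATION
    return ErrorClassification.REAL_ERROR


print(classify_exception(asyncio.CancelledError()))  # ErrorClassification.CANCELLATION
print(classify_exception(ValueError("boom")))        # ErrorClassification.REAL_ERROR
```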
The span emitter emits semantic attributes, optional input/output message content, system instructions, function definitions, token usage, and agent context. The finalization order ensures attributes are set before span closure.
The metrics emitter records durations and token usage to histograms: `gen_ai.client.operation.duration`, `gen_ai.client.token.usage`, plus agentic histograms (`gen_ai.workflow.duration`, `gen_ai.agent.duration`, `gen_ai.step.duration`). The role string is `metric` (singular) and may diverge from the category name `metrics`.
The content-events emitter emits one structured log record summarizing an entire LLM invocation (inputs, outputs, system instructions), a deliberate deviation from the earlier message-per-event concept to reduce event volume. Agent/workflow/step event emission is commented out (future option).
Always present:
- `EvaluationMetricsEmitter`: emits evaluation scores to histograms. Behavior depends on `OTEL_INSTRUMENTATION_GENAI_EVALS_USE_SINGLE_METRIC`:
  - Single metric mode (default, when unset or `true`): all evaluation scores are emitted to a single histogram `gen_ai.evaluation.score`, with the evaluation type distinguished by the `gen_ai.evaluation.name` attribute.
  - Multiple metric mode (when `OTEL_INSTRUMENTATION_GENAI_EVALS_USE_SINGLE_METRIC=false`): separate histograms per evaluation type: `gen_ai.evaluation.relevance`, `gen_ai.evaluation.hallucination`, `gen_ai.evaluation.sentiment`, `gen_ai.evaluation.toxicity`, `gen_ai.evaluation.bias`. (Legacy dynamic `gen_ai.evaluation.score.<metric>` instruments removed.)
- `EvaluationEventsEmitter`: event per `EvaluationResult`; optional legacy variant via `OTEL_GENAI_EVALUATION_EVENT_LEGACY`.
Aggregation flag affects batching only (emitters remain active either way).
Emitted attributes (core):
- `gen_ai.evaluation.name`: metric name (always present; distinguishes evaluation type in single metric mode)
- `gen_ai.evaluation.score.value`: numeric score (events only; the histogram carries values)
- `gen_ai.evaluation.score.label`: categorical label (pass/fail/neutral/etc.)
- `gen_ai.evaluation.score.units`: units of the numeric score (currently `score`)
- `gen_ai.evaluation.passed`: boolean derived when the label clearly indicates pass/fail (e.g. `pass`, `success`, `fail`); the numeric-only heuristic is currently disabled to prevent ambiguous semantics
- Agent/workflow identity: `gen_ai.agent.name`, `gen_ai.workflow.id` when available
- Provider/model context: `gen_ai.provider.name`, `gen_ai.request.model` when available
- Server context: `server.address`, `server.port` when available
- `gen_ai.operation.name`: set to `"evaluation"` only in multiple metric mode (not set in single metric mode)
An example of a third-party emitter package:
- Splunk evaluation aggregation / extra metrics (`opentelemetry-util-genai-emitters-splunk`)
| Variable | Purpose | Notes |
|---|---|---|
| `OTEL_INSTRUMENTATION_GENAI_EMITTERS` | Baseline + extras selection | Values: `span`, `span_metric`, `span_metric_event`, plus extras |
| `OTEL_INSTRUMENTATION_GENAI_EMITTERS_<CATEGORY>` | Category overrides | Directives: `append` / `prepend` / `replace` / `replace-category` / `replace-same-name` |
| `OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT` | Enable/disable message capture | Truthy enables capture; default disabled |
| `OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT_MODE` | `SPAN_ONLY`, `EVENT_ONLY`, `SPAN_AND_EVENT`, or `NONE` | Defaults to `SPAN_AND_EVENT` when capture enabled |
| `OTEL_INSTRUMENTATION_GENAI_EVALS_EVALUATORS` | Evaluator config grammar | `Evaluator(Type(metric(opt=val)))` syntax supported |
| `OTEL_INSTRUMENTATION_GENAI_EVALS_RESULTS_AGGREGATION` | Aggregate vs per-evaluator emission | Boolean |
| `OTEL_INSTRUMENTATION_GENAI_EVALS_INTERVAL` | Eval worker poll interval | Default 5.0 seconds |
| `OTEL_INSTRUMENTATION_GENAI_EVALUATION_SAMPLE_RATE` | Trace-id ratio sampling | Float (0–1], default 1.0 |
| `OTEL_INSTRUMENTATION_GENAI_EVALUATION_RATE_LIMIT_ENABLE` | Enable evaluation rate limiting | Boolean (default: `true`). Set to `false` to disable rate limiting |
| `OTEL_INSTRUMENTATION_GENAI_EVALUATION_RATE_LIMIT_RPS` | Evaluation request rate limit (requests per second) | int (default: 0, disabled). Example: `1` = 1 request per second |
| `OTEL_INSTRUMENTATION_GENAI_EVALUATION_RATE_LIMIT_BURST` | Maximum burst size for rate limiting | int (default: 4). Allows short bursts beyond the base rate |
| `OTEL_GENAI_EVALUATION_EVENT_LEGACY` | Emit legacy evaluation event shape | Adds a second event per result |
| `OTEL_INSTRUMENTATION_GENAI_EVALS_USE_SINGLE_METRIC` | Use a single `gen_ai.evaluation.score` histogram vs separate histograms per evaluation type | Boolean (default: `true`) |
| `OTEL_INSTRUMENTATION_GENAI_EVALUATION_QUEUE_SIZE` | Evaluation queue size | int (default: 100) |
| `OTEL_INSTRUMENTATION_GENAI_CONTEXT_INCLUDE_IN_METRICS` | Context attributes as metric dimensions | Default empty (none included). Set to `all` or comma-separated keys |
| `OTEL_INSTRUMENTATION_GENAI_CONTEXT_PROPAGATION` | Enable/disable context propagation to child spans | `true` (enabled by default) |
Adds gen_ai.conversation.id and custom association properties that auto-propagate to all GenAI spans within scope.
```python
from opentelemetry.util.genai import (
    genai_context,
    set_genai_context,
    get_genai_context,
    clear_genai_context,
)

# Context manager (recommended)
with genai_context(
    conversation_id="conv-123",
    properties={"user.id": "alice", "customer.id": "acme"},
):
    result = chain.invoke({"input": "Hello"})
    # All spans get:
    #   gen_ai.conversation.id = "conv-123"
    #   gen_ai.association.properties.user.id = "alice"
    #   gen_ai.association.properties.customer.id = "acme"

# Imperative API
set_genai_context(conversation_id="conv-123", properties={"user.id": "alice"})

# Read / clear
ctx = get_genai_context()
clear_genai_context()
```

For LangGraph applications, `gen_ai.conversation.id` is automatically inferred from `configurable.thread_id`; no manual wrapping is needed:
```python
# thread_id is automatically mapped to gen_ai.conversation.id
config = {"configurable": {"thread_id": "session-123"}}
app.stream(state, config)
# All spans get: gen_ai.conversation.id = "session-123"
```

The instrumentation checks metadata for `conversation_id` first, then `thread_id`. Explicit `genai_context()` always takes priority:
```python
# Explicit context overrides inferred thread_id
config = {"configurable": {"thread_id": "session-123"}}
with genai_context(conversation_id="custom-id"):
    app.stream(state, config)
# gen_ai.conversation.id = "custom-id" (explicit wins)
```

Context attributes are resolved in priority order (highest to lowest):
1. Explicit value on invocation: set directly on the GenAI type object
2. ContextVars: set via `set_genai_context()` or `genai_context()`
3. Framework inference: e.g. LangGraph `thread_id` from metadata
Association properties from context and invocation are merged: context properties applied first, invocation-level properties override same keys.
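That merge order is plain dictionary-override semantics, which this small sketch makes concrete:

```python
# Context-level association properties are applied first; invocation-level
# properties override matching keys (customer.id here is hypothetical data).
context_properties = {"user.id": "alice", "customer.id": "acme"}
invocation_properties = {"customer.id": "globex"}  # set on the GenAI object

merged = {**context_properties, **invocation_properties}
print(merged)
# {'user.id': 'alice', 'customer.id': 'globex'}
```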
By default, context attributes propagate to all child GenAI spans. To disable:

```shell
export OTEL_INSTRUMENTATION_GENAI_CONTEXT_PROPAGATION=false
```

When disabled, only values explicitly set on each invocation object are emitted.
| Attribute | Source | Example |
|---|---|---|
| `gen_ai.conversation.id` | `conversation_id` param | `"conv-123"` |
| `gen_ai.association.properties.<key>` | `properties` dict | `"alice"` |
By default, no context attributes are added to metrics (they are high-cardinality). To opt in, set OTEL_INSTRUMENTATION_GENAI_CONTEXT_INCLUDE_IN_METRICS:
```shell
# Include all context attributes (conversation_id + all association properties)
export OTEL_INSTRUMENTATION_GENAI_CONTEXT_INCLUDE_IN_METRICS=all

# Include only specific attributes (comma-separated keys)
# Use the property key (without prefix) or the full attribute name
export OTEL_INSTRUMENTATION_GENAI_CONTEXT_INCLUDE_IN_METRICS=user.id,customer.id

# Include only conversation_id in metrics
export OTEL_INSTRUMENTATION_GENAI_CONTEXT_INCLUDE_IN_METRICS=gen_ai.conversation.id
```

When a key matches, the corresponding attribute is added as a metric dimension to all GenAI metrics (duration histograms, token histograms). For association properties, either the short key (`user.id`) or the full prefixed key (`gen_ai.association.properties.user.id`) can be used.
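The short-key / full-key matching rule could be implemented along these lines (hypothetical helper, not the shipped code):

```python
PREFIX = "gen_ai.association.properties."


def included_in_metrics(attr_name: str, include: set[str]) -> bool:
    """Decide whether attr_name becomes a metric dimension, given the include set."""
    if "all" in include:
        return True
    # An association-property attribute matches on either its full prefixed
    # name or its short property key.
    short = attr_name[len(PREFIX):] if attr_name.startswith(PREFIX) else attr_name
    return attr_name in include or short in include


include = {"user.id", "gen_ai.conversation.id"}
print(included_in_metrics("gen_ai.association.properties.user.id", include))      # True
print(included_in_metrics("gen_ai.conversation.id", include))                     # True
print(included_in_metrics("gen_ai.association.properties.customer.id", include))  # False
```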
- `GenAIContext(conversation_id=None, properties={})`: dataclass holding context state
- `genai_context(conversation_id=None, properties=None)`: context manager with auto-restore
- `set_genai_context(conversation_id=None, properties=None)`: set context imperatively
- `get_genai_context() -> GenAIContext`: read current context
- `clear_genai_context()`: reset to empty
See API reference for full details and examples.
- Parse baseline & extras.
- Register built-ins (span/metrics/content/evaluation).
- Load entry point emitter specs & register.
- Apply category overrides.
- Instantiate `CompositeEmitter` with resolved category lists.
EmitterSpec.invocation_types drives dynamic handles wrapper (fast pre-dispatch predicate). Evaluation emitters see results independently of invocation type filtering.
Note: evaluators depend on `opentelemetry-util-genai-evals` being installed as a completion callback.
Evaluator package entry point groups:
- `opentelemetry_util_genai_completion_callbacks` (completion callback plug-ins; the evaluation manager registers here)
- `opentelemetry_util_genai_evaluators` (per-evaluator factories/registrations discovered by the evaluation manager)
Default loading honours two environment variables:
- `OTEL_INSTRUMENTATION_GENAI_COMPLETION_CALLBACKS`: optional comma-separated filter applied before instantiation.
- `OTEL_INSTRUMENTATION_GENAI_DISABLE_DEFAULT_COMPLETION_CALLBACKS`: when truthy, skips loading built-in callbacks (e.g., the evaluation manager).
Evaluation Manager behaviour (shipped from opentelemetry-util-genai-evals):
- Instantiated lazily when the evaluation completion callback binds to `TelemetryHandler`.
- Trace-id ratio sampling via `OTEL_INSTRUMENTATION_GENAI_EVALUATION_SAMPLE_RATE` (falls back to enqueue if span context is missing).
- Parses the evaluator grammar into per-type plans (metric + options) sourced from registered evaluators.
- The aggregation flag (`OTEL_INSTRUMENTATION_GENAI_EVALS_RESULTS_AGGREGATION`) merges buckets into a single list when true.
- Emits lists of `EvaluationResult` to `handler.evaluation_results`.
- Marks the invocation with `attributes["gen_ai.evaluation.executed"] = True` after emission.
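For intuition, the evaluator grammar (e.g. `Deepeval(LLMInvocation(bias,toxicity))`) could be parsed roughly as below. This regex-based sketch handles only the simple metric-list form; the real grammar also supports per-metric options like `metric(opt=val)`:

```python
import re


def parse_evaluators(raw: str) -> dict:
    """Parse 'Evaluator(Type(metric,metric))' strings into per-type plans."""
    plans: dict = {}
    for m in re.finditer(r"(\w+)\((\w+)\(([^()]*)\)\)", raw):
        evaluator, gen_ai_type, metrics = m.groups()
        plans.setdefault(evaluator, {})[gen_ai_type] = [
            s.strip() for s in metrics.split(",") if s.strip()
        ]
    return plans


print(parse_evaluators("Deepeval(LLMInvocation(bias,toxicity))"))
# {'Deepeval': {'LLMInvocation': ['bias', 'toxicity']}}
```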
```
start_*  -> CompositeEmitter.on_start(span, metrics, content_events)
finish_* -> CompositeEmitter.on_end(evaluation, metrics, content_events, span)
         -> completion callbacks (Evaluation Manager enqueues)
Evaluation worker -> evaluate -> handler.evaluation_results(list) -> CompositeEmitter.on_evaluation_results(evaluation)
```
| Scenario | Configuration | Outcome |
|---|---|---|
| Add Traceloop compat span | `OTEL_INSTRUMENTATION_GENAI_EMITTERS=span,traceloop_compat` | Semconv + compat span |
| Only Traceloop compat span | `OTEL_INSTRUMENTATION_GENAI_EMITTERS=traceloop_compat` | Compat span only |
| Replace evaluation emitters | `OTEL_INSTRUMENTATION_GENAI_EMITTERS_EVALUATION=replace:SplunkEvaluationAggregator` | Only Splunk evaluation emission |
| Prepend custom metrics | `OTEL_INSTRUMENTATION_GENAI_EMITTERS_METRICS=prepend:MyMetrics` | Custom metrics run first |
| Replace content events | `OTEL_INSTRUMENTATION_GENAI_EMITTERS_CONTENT_EVENTS=replace:VendorContent` | Vendor events only |
| Agent-only cost metrics | (future) programmatic add with `invocation_types` filter | Metrics limited to agent invocations |
- Emitters are sandboxed (exceptions suppressed & debug logged).
- No error metric yet (planned: `genai.emitter.errors`).
- Content capture is gated by experimental opt-in to prevent accidental large data egress.
- A single content event per invocation reduces volume.
- Invocation-type filtering occurs before heavy serialization.
- Error classification (`INTERRUPT`, `CANCELLATION`) prevents false-positive error alerts on expected control-flow exceptions.
`emitters/utils.py` includes: semantic attribute filtering, message serialization, enumeration builders (prompt/completion), function definition mapping, and finish-time token usage application. Truncation/hashing helpers and PII redaction are not yet implemented (privacy work deferred).
- Implement an ordering resolver for `after`/`before` hints.
- Programmatic rich registration API (mode + position) & removal.
- Error metrics instrumentation.
- Aggregated `EvaluationResults` wrapper (with evaluator latency, counts).
- Privacy redaction & size-limiting/truncation helpers.
- Async emitters & dynamic hot-reload (deferred).
- Backpressure strategies for high-volume content events.
Get the packages installed:
Set up a virtual environment (note: this erases any existing `.venv` in the current folder):

```shell
deactivate ; rm -rf .venv
python --version
python -m venv .venv && . .venv/bin/activate
python -m ensurepip && python -m pip install --upgrade pip
python -m pip install pre-commit -c dev-requirements.txt && pre-commit install
python -m pip install rstcheck

pip install -e util/opentelemetry-util-genai --no-deps
pip install -e util/opentelemetry-util-genai-evals --no-deps
pip install -e util/opentelemetry-util-genai-evals-deepeval --no-deps
pip install -e util/opentelemetry-util-genai-emitters-splunk --no-deps
pip install -e util/opentelemetry-util-genai-traceloop-translator --no-deps
pip install -e instrumentation-genai/opentelemetry-instrumentation-langchain --no-deps
pip install -r dev-genai-requirements.txt
pip install -r instrumentation-genai/opentelemetry-instrumentation-langchain/examples/manual/requirements.txt
```
```shell
export OTEL_SEMCONV_STABILITY_OPT_IN=gen_ai_latest_experimental
export OTEL_INSTRUMENTATION_GENAI_EMITTERS=span_metric_event,splunk
export OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT=true
export OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT_MODE=SPAN_AND_EVENT
export OTEL_INSTRUMENTATION_GENAI_EVALS_EVALUATORS="Deepeval(LLMInvocation(bias,toxicity))"
export OTEL_INSTRUMENTATION_GENAI_EVALS_RESULTS_AGGREGATION=true
```

Pseudo-code to create an `LLMInvocation` around your own LLM-invoking code:
```python
from opentelemetry.util.genai.handler import get_telemetry_handler
from opentelemetry.util.genai.types import LLMInvocation, InputMessage, OutputMessage, Text

handler = get_telemetry_handler()
user_input = "Hello"
inv = LLMInvocation(
    request_model="gpt-5-nano",
    input_messages=[InputMessage(role="user", parts=[Text(user_input)])],
    provider="openai",
)
handler.start_llm(inv)

# your code which actually invokes the LLM here
# response = client.chat.completions.create(...)
# ...

inv.output_messages = [OutputMessage(role="assistant", parts=[Text("Hi!")], finish_reason="stop")]
handler.stop_llm(inv)
```

Additionally, for aidefense:
```shell
pip install -e instrumentation-genai/opentelemetry-instrumentation-aidefense
export AI_DEFENSE_API_KEY="your-ai-defense-key"
python instrumentation-genai/opentelemetry-instrumentation-aidefense/examples/multi_agent_travel_planner/main.py
```

This project uses pre-commit hooks to automatically check and fix linting and formatting issues before committing.
Install and configure pre-commit hooks (recommended to run in a virtual environment):

```shell
pip install pre-commit
pre-commit install
```

Once installed, the hooks will automatically run on every `git commit` and will:
- Fix linting issues with ruff
- Format code with ruff
- Check RST documentation files
- Update dependency locks
To run pre-commit checks on all files (not just staged files):

```shell
pre-commit run --all-files
```

This is useful for:
- Fixing existing lint failures in CI
- Checking the entire codebase before pushing
- Running checks without committing
If the CI lint job fails on your PR:
Some instrumentation packages include a Makefile with a lint recipe that automatically fixes all linting and formatting issues.
Note: It's recommended to run this in a virtual environment to avoid conflicts with system packages.
```shell
cd instrumentation-genai/opentelemetry-instrumentation-weaviate
make lint
```

This will:
- Install the correct version of ruff
- Fix all linting issues with `ruff check --fix`
- Format all code with `ruff format`
- Verify that all fixes pass CI checks
Then commit and push the changes:

```shell
git add .
git commit -m "fix: auto-fix linting issues"
git push
```

Alternatively, without the Makefile:

1. Run pre-commit on all files:
   ```shell
   pre-commit run --all-files
   ```
2. Review and stage the fixes:
   ```shell
   git add .
   ```
3. Commit and push:
   ```shell
   git commit -m "fix: auto-fix linting issues"
   git push
   ```
The CI lint job checks:
- Linting: `ruff check .` (code quality issues such as unused imports and undefined names)
- Formatting: `ruff format --check .` (code formatting consistency)
Pre-commit hooks use the same ruff version and configuration as CI, ensuring local checks match CI requirements.
The splunk-otel-genai-emitters-test package provides tools for testing and validating the evaluation framework:
- Test Emitter: Captures all telemetry in memory for testing and validation
- Evaluation Performance Test: CLI tool for validating evaluation metrics against known test samples
For detailed usage instructions, see util/opentelemetry-util-genai-emitters-test/README.md.
Quick example:
```shell
# Install the test emitter (development only, not published to PyPI)
pip install -e ./util/opentelemetry-util-genai-emitters-test
pip install -e ./util/opentelemetry-util-genai-evals-deepeval

# Run evaluation performance test
python -m opentelemetry.util.genai.emitters.eval_perf_test \
    --samples 120 --concurrent --workers 4 --output results.json
```

- Unit tests: env parsing, category overrides, evaluator grammar, sampling, content capture gating.
- Future: ordering hints tests once implemented.
- Smoke: vendor emitters (Traceloop + Splunk) side-by-side replacement/append semantics.