Skip to content

fix(openai-agents): fix realtime session event handling for prompts, completions, and usage#3688

Merged
galkleinman merged 2 commits intotraceloop:mainfrom
EliJaghab:fix/realtime-event-handling
Feb 19, 2026
Merged

fix(openai-agents): fix realtime session event handling for prompts, completions, and usage#3688
galkleinman merged 2 commits intotraceloop:mainfrom
EliJaghab:fix/realtime-event-handling

Conversation

@EliJaghab
Copy link
Copy Markdown
Contributor

@EliJaghab EliJaghab commented Feb 15, 2026

Summary

  • Add history_updated event handler to capture assistant transcript updates
  • Fix dict-based data access in response.done handler (getattr on dicts silently returned None instead of using .get())
  • Fix dict-case event unwrapping where the data variable was not updated to the nested level
  • Remove dead response event handler (no SDK session event has type="response")

Why

The realtime instrumentation had several event type mismatches with the openai-agents SDK (v0.6.0+):

  1. history_updated events were completely ignored, so assistant transcript updates from multi-turn conversations were never captured as completions.
  2. In the response.done handler, getattr() was used on dict objects instead of .get(), so completion extraction silently failed (usage was captured correctly via the dict path, but output/content was not).
  3. The dict-case unwrapping in raw_model_event did not update the data reference after finding nested data, so subsequent dict-path code looked at the wrong nesting level.
  4. The response event handler was dead code -- no SDK session event has type="response".

Closes #3685

Test plan

  • All 46 package tests pass
  • Added test for history_updated assistant completion capture via transcript
  • Added test for dict-based response.done usage and completion extraction
  • Ruff lint passes

Important

Fix event handling in OpenAI Agents SDK by adding history_updated handler, correcting response.done data extraction, and removing dead code.

  • Behavior:
    • Add history_updated event handler in _realtime_wrappers.py to capture assistant transcript updates.
    • Fix response.done handler to use .get() instead of getattr() for dicts, ensuring completion extraction.
    • Correct dict-case event unwrapping in raw_model_event to update data to the nested level.
    • Remove dead response event handler.
  • Tests:
    • Add test for history_updated event handling in test_realtime_session.py.
    • Add test for dict-based response.done handling in test_realtime_session.py.

This description was created by Ellipsis for ba639de. You can customize this summary. It will automatically update as commits are pushed.

Summary by CodeRabbit

  • New Features

    • Improved realtime tracing for OpenAI agent interactions: more reliable capture of assistant completions and usage metrics across varied event types and payload formats.
  • Tests

    • Added tests validating history-based completion extraction and response payload parsing (with and without usage) to ensure tracing accuracy and robustness.

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Feb 15, 2026

📝 Walkthrough

Walkthrough

Unifies and refactors realtime event parsing in the OpenAI Agents instrumentation: adds helpers to extract content, unwrap nested raw events, and extract response/usage; fixes handling of history_updated, raw_model_event, and response.done payload shapes; and adds tests covering assistant completions and usage extraction.

Changes

Cohort / File(s) Summary
Realtime event parsing and helpers
packages/opentelemetry-instrumentation-openai-agents/opentelemetry/instrumentation/openai_agents/_realtime_wrappers.py
Introduces internal helpers (_extract_content_text, _unwrap_raw_event_data, _extract_response_and_usage); replaces duplicated inline extraction logic; properly unwraps raw_model_event/raw_server_event nesting; adds history_updated handling and uniform handling for dict- or object-shaped items and response.done payloads.
Realtime session tests
packages/opentelemetry-instrumentation-openai-agents/tests/test_realtime_session.py
Adds TestTracedPutEventHandlers with tests for: capturing assistant completion from history_updated; extracting usage and assistant completion from dict-shaped response.done; and capturing completion when usage is absent.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I hopped through events, peeled each layer thin,
Found transcripts tucked where signals had been,
Unwrapped the raw, fetched usage and text,
Now spans tell the story—no step is perplexed. 🥕✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately and concisely describes the main fix: addressing event handling for prompts, completions, and usage in OpenAI realtime sessions.
Linked Issues check ✅ Passed The PR fully addresses all coding requirements from issue #3685: fixed response.done detection via raw_server_event nesting, added history_updated handler, corrected dict-based data access, and removed dead response handler.
Out of Scope Changes check ✅ Passed All changes are directly scoped to fixing the realtime session event handling issues. The implementation adds internal helpers and tests without introducing unrelated functionality.
Docstring Coverage ✅ Passed Docstring coverage is 83.33% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Tip

Issue Planner is now in beta. Read the docs and try it out! Share your feedback on Discord.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@EliJaghab
Copy link
Copy Markdown
Contributor Author

Tested locally in our production voice relay (OpenAI Agents SDK realtime mode) with Jaeger trace collection.

With this fix applied, openai.realtime LLM spans now correctly capture:

  • gen_ai.prompt.0.content -- e.g. "I want the iPhone 17 Pro Max"
  • gen_ai.completion.0.content -- e.g. "Got it. The iPhone 17 Pro Max is selected. Which color would you like..."
  • gen_ai.usage.input_tokens -- e.g. 9192
  • gen_ai.usage.output_tokens -- e.g. 27

Before this fix, all of these attributes were missing on realtime spans. Confirmed across a 3-message automated conversation with multiple tool calls, producing 8 openai.realtime spans total.

@EliJaghab EliJaghab marked this pull request as ready for review February 16, 2026 15:33
Copy link
Copy Markdown
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important

Looks good to me! 👍

Reviewed everything up to ba639de in 28 seconds. Click for details.
  • Reviewed 231 lines of code in 2 files
  • Skipped 0 files when reviewing.
  • Skipped posting 0 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.

Workflow ID: wflow_6VG6uIcIdAzkFRtY

You can customize Ellipsis by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In
`@packages/opentelemetry-instrumentation-openai-agents/opentelemetry/instrumentation/openai_agents/_realtime_wrappers.py`:
- Around line 610-621: In the response.done handler, dict-based parts only check
part.get("text") (and the object path only checks getattr(part, "text")), so
audio parts with a "transcript" field are missed; update that block to mirror
history_added/history_updated/item_updated by falling back to
part.get("transcript") for dicts and getattr(part, "transcript", None) for
objects before calling state.record_completion(role, text), ensuring transcript
is used when text is absent.
- Around line 587-621: The completion extraction logic is incorrectly nested
under the `if usage:` guard so completions are skipped when `usage` is absent;
separate concerns by keeping `state.record_usage(usage)` inside the `if usage:`
block but move the entire output/completion extraction code (the logic that
inspects `response`/`output`/`item`/`item_content` and calls
`state.record_completion(role, text)`) out one level so it runs regardless of
`usage`. Locate the block using the symbols `state.record_usage`, `response`,
`output`, `item_type`, `role`, `item_content`, and `state.record_completion` in
`_realtime_wrappers.py` and de-indent that output-iteration block so usage
recording remains conditional but completion extraction always executes.
🧹 Nitpick comments (2)
packages/opentelemetry-instrumentation-openai-agents/tests/test_realtime_session.py (2)

480-529: Tests replicate handler logic inline instead of exercising traced_put_event — coverage gap.

Both test_history_updated_* and test_response_done_dict_* manually reproduce the extraction logic from traced_put_event (lines 509–521 and 562–579) rather than dispatching a mock event through the actual handler. This means any future drift between the handler and these inline copies won't be caught.

Consider creating a thin helper that constructs a mock event (with .type and relevant attributes), then calls traced_put_event on a mock session wired with a RealtimeTracingState. This would turn these into true integration tests of the handler dispatch path. That said, the current tests still validate RealtimeTracingState correctness, so this can be deferred.


531-591: Test doesn't cover the if usage: gating issue — no test for response.done without usage.

As flagged in the wrapper review, completion extraction is currently gated on usage being present. Consider adding a test with a response.done payload that has output but no usage to document the expected behavior (and to catch the regression once the gating is fixed).

Copy link
Copy Markdown
Contributor

@galkleinman galkleinman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be great if we could take the opportunity to extract some of the nested event types handling into dedicated helper methods. up to you tho.

LGTM

Comment on lines +539 to +546
if item_content and isinstance(item_content, list):
for part in item_content:
text = getattr(part, "text", None) or getattr(
part, "transcript", None
)
if text:
state.record_completion(role, text)
break
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this capture all assistant messages or just the most recent one?

Copy link
Copy Markdown
Contributor Author

@EliJaghab EliJaghab Feb 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just the most recent one, intentionally.

There are three event paths that can capture an assistant completion:

  1. history_added -- fires when a single new item is added to history
  2. history_updated -- fires when the full cumulative history changes (contains ALL messages from the entire session)
  3. response.done (inside raw_model_event) -- fires when the model finishes responding

Whichever fires first creates the LLM span via record_completion -> create_llm_span. The seen_completions hash set in record_completion deduplicates, so subsequent handlers that try to record the same text become no-ops.

For history_updated specifically: the history list is cumulative -- on turn N it contains all messages from turns 1 through N. Using reversed() + break grabs only the newest assistant message (the one relevant to the current turn). Without this, we'd iterate through all previous assistant messages on every turn. The dedup set would ultimately prevent duplicate spans, but it's unnecessary work.

Here is the actual event ordering from an E2E test against the agents SDK relay (one-turn conversation, "Tell me about the iPhone 17"):

# Session starts -- history_updated fires with system message only, no assistant content yet
[OTEL DEBUG] traced_put_event CALLED, event_type=history_updated

# ... ~750 raw_model_event events (audio deltas, transcription deltas, etc.) ...

# Model finishes responding -- history_updated fires with full history including the new assistant message
[OTEL DEBUG] traced_put_event CALLED, event_type=history_updated
[OTEL DEBUG] record_completion called: role=assistant, content=The iPhone 17 is Apple's latest flagship phone...

# response.done fires next -- record_usage captures tokens, record_completion is deduplicated
[OTEL DEBUG] traced_put_event CALLED, event_type=raw_model_event
[OTEL DEBUG] record_usage called: usage type=dict
[OTEL DEBUG] record_completion called: role=assistant, content=The iPhone 17 is Apple's latest flagship phone...
[OTEL DEBUG] record_completion: duplicate, skipping

# Agent ends
[OTEL DEBUG] traced_put_event CALLED, event_type=agent_end

This shows: history_updated captures the completion first, then response.done captures usage and correctly skips the duplicate completion.

…completions, and usage

Handle history_updated events to capture assistant transcript updates.
Fix dict-based data access in response.done handler where getattr was
used on dicts instead of .get(), silently returning None. Fix dict-case
event unwrapping where the data variable was not updated to the nested
level. Remove dead response event handler that could never match.

Closes traceloop#3685
…ript fallback

Address PR review feedback:
- Extract _extract_content_text, _unwrap_raw_event_data, and
  _extract_response_and_usage helpers to reduce nesting
- De-nest completion extraction from if-usage guard so completions
  are captured even when usage is absent from response.done
- Add transcript fallback in response.done handler to match
  history_updated and item_updated handlers
- Add test for response.done without usage
@EliJaghab EliJaghab force-pushed the fix/realtime-event-handling branch from ba639de to ba7d1cc Compare February 17, 2026 19:00
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
packages/opentelemetry-instrumentation-openai-agents/tests/test_realtime_session.py (2)

628-637: Inline extraction logic diverges from the helper and the actual handler.

test_response_done_dict_captures_usage_and_completion manually extracts text via part.get("text") (Line 634) without the transcript fallback, while test_response_done_without_usage_still_captures_completion correctly delegates to _extract_content_text (Line 687). The production handler at _realtime_wrappers.py Line 627 uses _extract_content_text everywhere now, so this test wouldn't catch a regression on transcript-only content parts.

Consider using _extract_content_text consistently in all three tests (it's already imported).

Proposed fix
                         if item.get("type") == "message" and item.get("role") == "assistant":
                             item_content = item.get("content")
-                            if item_content and isinstance(item_content, list):
-                                for part in item_content:
-                                    text = part.get("text") if isinstance(part, dict) else None
-                                    if text:
-                                        state.record_completion("assistant", text)
-                                        break
+                            text = _extract_content_text(item_content)
+                            if text:
+                                state.record_completion("assistant", text)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@packages/opentelemetry-instrumentation-openai-agents/tests/test_realtime_session.py`
around lines 628 - 637, The test manually extracts text using part.get("text")
in test_response_done_dict_captures_usage_and_completion; change that extraction
to call the already-imported helper _extract_content_text so it matches
test_response_done_without_usage_still_captures_completion and the production
handler in _realtime_wrappers.py; locate the loop in
tests/test_realtime_session.py that iterates output items (the block using
isinstance(item, dict) and item.get("type") == "message") and replace the direct
part.get("text") usage with a call to _extract_content_text(part) before passing
the result to state.record_completion("assistant", text).

538-698: Tests replicate handler logic inline rather than exercising traced_put_event.

All three tests manually reconstruct the extraction logic (iterating history, unwrapping dicts, etc.) instead of invoking the actual event handler path. This means the tests validate RealtimeTracingState.record_* methods but do not cover the wiring inside traced_put_event — the very code this PR is fixing. A regression in traced_put_event (e.g., a typo in an event_type check or a wrong nesting level) would not be caught.

Consider adding at least one integration-style test that constructs a mock event object with the appropriate .type and .data attributes, then calls the handler logic (or a thin wrapper around it) to verify end-to-end span creation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@packages/opentelemetry-instrumentation-openai-agents/tests/test_realtime_session.py`
around lines 538 - 698, Add an integration-style test that exercises the
traced_put_event handler end-to-end instead of reproducing its logic inline:
create a mock event object with .type (e.g., "history.updated" or
"response.done") and .data shaped like the real payload, call the
traced_put_event function (or a small wrapper that routes to it) while using
RealtimeTracingState and the tracer fixture, then assert
exporter.get_finished_spans() contains the expected "openai.realtime" span
attributes (gen_ai.completion.* and gen_ai.usage.*). Ensure the test references
traced_put_event and RealtimeTracingState so it will fail if event-type checks
or nesting handling inside traced_put_event regress.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@packages/opentelemetry-instrumentation-openai-agents/opentelemetry/instrumentation/openai_agents/_realtime_wrappers.py`:
- Around line 632-640: The item_updated branch currently uses getattr(data,
"item", None) which skips dict-shaped events; change the handler for data_type
== "item_updated" to mirror the response.done logic by checking isinstance(data,
dict) and pulling item = data.get("item") (and otherwise using getattr as
before), then extract role/content from the item via item.get("role") /
item.get("content") when item is a dict, and finally pass the content through
_extract_content_text and call state.record_completion(role, text) when role ==
"assistant" and text is truthy so dict-based events are handled consistently.

---

Nitpick comments:
In
`@packages/opentelemetry-instrumentation-openai-agents/tests/test_realtime_session.py`:
- Around line 628-637: The test manually extracts text using part.get("text") in
test_response_done_dict_captures_usage_and_completion; change that extraction to
call the already-imported helper _extract_content_text so it matches
test_response_done_without_usage_still_captures_completion and the production
handler in _realtime_wrappers.py; locate the loop in
tests/test_realtime_session.py that iterates output items (the block using
isinstance(item, dict) and item.get("type") == "message") and replace the direct
part.get("text") usage with a call to _extract_content_text(part) before passing
the result to state.record_completion("assistant", text).
- Around line 538-698: Add an integration-style test that exercises the
traced_put_event handler end-to-end instead of reproducing its logic inline:
create a mock event object with .type (e.g., "history.updated" or
"response.done") and .data shaped like the real payload, call the
traced_put_event function (or a small wrapper that routes to it) while using
RealtimeTracingState and the tracer fixture, then assert
exporter.get_finished_spans() contains the expected "openai.realtime" span
attributes (gen_ai.completion.* and gen_ai.usage.*). Ensure the test references
traced_put_event and RealtimeTracingState so it will fail if event-type checks
or nesting handling inside traced_put_event regress.

Comment on lines 632 to +640
elif data_type == "item_updated":
item = getattr(data, "item", None)
if item:
role = getattr(item, "role", None)
item_content = getattr(item, "content", None)
if (
role == "assistant"
and item_content
and isinstance(item_content, list)
):
for part in item_content:
text = getattr(part, "text", None) or getattr(
part, "transcript", None
)
if text:
state.record_completion(role, text)
break
if role == "assistant":
text = _extract_content_text(item_content)
if text:
state.record_completion(role, text)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

item_updated handler doesn't handle dict-based data — inconsistent with response.done.

After _unwrap_raw_event_data, data may be a dict (Lines 59–62 return a nested dict). Here, getattr(data, "item", None) on a dict returns None, so dict-shaped item_updated events would be silently skipped. The response.done block (Lines 610–629) correctly branches on isinstance(…, dict).

Proposed fix
                         elif data_type == "item_updated":
-                            item = getattr(data, "item", None)
+                            if isinstance(data, dict):
+                                item = data.get("item")
+                            else:
+                                item = getattr(data, "item", None)
                             if item:
-                                role = getattr(item, "role", None)
-                                item_content = getattr(item, "content", None)
+                                if isinstance(item, dict):
+                                    role = item.get("role")
+                                    item_content = item.get("content")
+                                else:
+                                    role = getattr(item, "role", None)
+                                    item_content = getattr(item, "content", None)
                                 if role == "assistant":
                                     text = _extract_content_text(item_content)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@packages/opentelemetry-instrumentation-openai-agents/opentelemetry/instrumentation/openai_agents/_realtime_wrappers.py`
around lines 632 - 640, The item_updated branch currently uses getattr(data,
"item", None) which skips dict-shaped events; change the handler for data_type
== "item_updated" to mirror the response.done logic by checking isinstance(data,
dict) and pulling item = data.get("item") (and otherwise using getattr as
before), then extract role/content from the item via item.get("role") /
item.get("content") when item is a dict, and finally pass the content through
_extract_content_text and call state.record_completion(role, text) when role ==
"assistant" and text is truthy so dict-based events are handled consistently.

@galkleinman galkleinman merged commit 02ae790 into traceloop:main Feb 19, 2026
9 of 10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Realtime session spans missing prompt, completion, and token usage attributes

2 participants