fix(openai-agents): fix realtime session event handling for prompts, completions, and usage by EliJaghab · Pull Request #3688 · traceloop/openllmetry

EliJaghab · 2026-02-15T22:40:58Z

Summary

Add history_updated event handler to capture assistant transcript updates
Fix dict-based data access in response.done handler (getattr on dicts silently returned None instead of using .get())
Fix dict-case event unwrapping where the data variable was not updated to the nested level
Remove dead response event handler (no SDK session event has type="response")

Why

The realtime instrumentation had several event type mismatches with the openai-agents SDK (v0.6.0+):

history_updated events were completely ignored, so assistant transcript updates from multi-turn conversations were never captured as completions.
In the response.done handler, getattr() was used on dict objects instead of .get(), so completion extraction silently failed (usage was captured correctly via the dict path, but output/content was not).
The dict-case unwrapping in raw_model_event did not update the data reference after finding nested data, so subsequent dict-path code looked at the wrong nesting level.
The response event handler was dead code -- no SDK session event has type="response".

Closes #3685

Test plan

All 46 package tests pass
Added test for history_updated assistant completion capture via transcript
Added test for dict-based response.done usage and completion extraction
Ruff lint passes

Important

Fix event handling in OpenAI Agents SDK by adding history_updated handler, correcting response.done data extraction, and removing dead code.

Behavior:
- Add history_updated event handler in _realtime_wrappers.py to capture assistant transcript updates.
- Fix response.done handler to use .get() instead of getattr() for dicts, ensuring completion extraction.
- Correct dict-case event unwrapping in raw_model_event to update data to the nested level.
- Remove dead response event handler.
Tests:
- Add test for history_updated event handling in test_realtime_session.py.
- Add test for dict-based response.done handling in test_realtime_session.py.

^{This description was created by}^{for ba639de. You can customize this summary. It will automatically update as commits are pushed.}

Summary by CodeRabbit

New Features
- Improved realtime tracing for OpenAI agent interactions: more reliable capture of assistant completions and usage metrics across varied event types and payload formats.
Tests
- Added tests validating history-based completion extraction and response payload parsing (with and without usage) to ensure tracing accuracy and robustness.

coderabbitai · 2026-02-15T22:41:05Z

📝 Walkthrough

Walkthrough

Unifies and refactors realtime event parsing in the OpenAI Agents instrumentation: adds helpers to extract content, unwrap nested raw events, and extract response/usage; fixes handling of history_updated, raw_model_event, and response.done payload shapes; and adds tests covering assistant completions and usage extraction.

Changes

Cohort / File(s)	Summary
Realtime event parsing and helpers `packages/opentelemetry-instrumentation-openai-agents/opentelemetry/instrumentation/openai_agents/_realtime_wrappers.py`	Introduces internal helpers (`_extract_content_text`, `_unwrap_raw_event_data`, `_extract_response_and_usage`); replaces duplicated inline extraction logic; properly unwraps `raw_model_event`/`raw_server_event` nesting; adds `history_updated` handling and uniform handling for dict- or object-shaped items and `response.done` payloads.
Realtime session tests `packages/opentelemetry-instrumentation-openai-agents/tests/test_realtime_session.py`	Adds `TestTracedPutEventHandlers` with tests for: capturing assistant completion from `history_updated`; extracting usage and assistant completion from dict-shaped `response.done`; and capturing completion when usage is absent.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I hopped through events, peeled each layer thin,
Found transcripts tucked where signals had been,
Unwrapped the raw, fetched usage and text,
Now spans tell the story—no step is perplexed. 🥕✨

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately and concisely describes the main fix: addressing event handling for prompts, completions, and usage in OpenAI realtime sessions.
Linked Issues check	✅ Passed	The PR fully addresses all coding requirements from issue `#3685`: fixed response.done detection via raw_server_event nesting, added history_updated handler, corrected dict-based data access, and removed dead response handler.
Out of Scope Changes check	✅ Passed	All changes are directly scoped to fixing the realtime session event handling issues. The implementation adds internal helpers and tests without introducing unrelated functionality.
Docstring Coverage	✅ Passed	Docstring coverage is 83.33% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Tip

Issue Planner is now in beta. Read the docs and try it out! Share your feedback on Discord.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

EliJaghab · 2026-02-16T01:47:05Z

Tested locally in our production voice relay (OpenAI Agents SDK realtime mode) with Jaeger trace collection.

With this fix applied, openai.realtime LLM spans now correctly capture:

gen_ai.prompt.0.content -- e.g. "I want the iPhone 17 Pro Max"
gen_ai.completion.0.content -- e.g. "Got it. The iPhone 17 Pro Max is selected. Which color would you like..."
gen_ai.usage.input_tokens -- e.g. 9192
gen_ai.usage.output_tokens -- e.g. 27

Before this fix, all of these attributes were missing on realtime spans. Confirmed across a 3-message automated conversation with multiple tool calls, producing 8 openai.realtime spans total.

ellipsis-dev

Important

Looks good to me! 👍

Reviewed everything up to ba639de in 28 seconds. Click for details.

Reviewed 231 lines of code in 2 files
Skipped 0 files when reviewing.
Skipped posting 0 draft comments. View those below.
Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.

Workflow ID: wflow_6VG6uIcIdAzkFRtY

^{You can customize}^{by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.}

coderabbitai

Actionable comments posted: 2

🤖 Fix all issues with AI agents

In
`@packages/opentelemetry-instrumentation-openai-agents/opentelemetry/instrumentation/openai_agents/_realtime_wrappers.py`:
- Around line 610-621: In the response.done handler, dict-based parts only check
part.get("text") (and the object path only checks getattr(part, "text")), so
audio parts with a "transcript" field are missed; update that block to mirror
history_added/history_updated/item_updated by falling back to
part.get("transcript") for dicts and getattr(part, "transcript", None) for
objects before calling state.record_completion(role, text), ensuring transcript
is used when text is absent.
- Around line 587-621: The completion extraction logic is incorrectly nested
under the `if usage:` guard so completions are skipped when `usage` is absent;
separate concerns by keeping `state.record_usage(usage)` inside the `if usage:`
block but move the entire output/completion extraction code (the logic that
inspects `response`/`output`/`item`/`item_content` and calls
`state.record_completion(role, text)`) out one level so it runs regardless of
`usage`. Locate the block using the symbols `state.record_usage`, `response`,
`output`, `item_type`, `role`, `item_content`, and `state.record_completion` in
`_realtime_wrappers.py` and de-indent that output-iteration block so usage
recording remains conditional but completion extraction always executes.

🧹 Nitpick comments (2)

packages/opentelemetry-instrumentation-openai-agents/tests/test_realtime_session.py (2)

480-529: Tests replicate handler logic inline instead of exercising traced_put_event — coverage gap.

Both test_history_updated_* and test_response_done_dict_* manually reproduce the extraction logic from traced_put_event (lines 509–521 and 562–579) rather than dispatching a mock event through the actual handler. This means any future drift between the handler and these inline copies won't be caught.

Consider creating a thin helper that constructs a mock event (with .type and relevant attributes), then calls traced_put_event on a mock session wired with a RealtimeTracingState. This would turn these into true integration tests of the handler dispatch path. That said, the current tests still validate RealtimeTracingState correctness, so this can be deferred.

531-591: Test doesn't cover the if usage: gating issue — no test for response.done without usage.

As flagged in the wrapper review, completion extraction is currently gated on usage being present. Consider adding a test with a response.done payload that has output but no usage to document the expected behavior (and to catch the regression once the gating is fixed).

...trumentation-openai-agents/opentelemetry/instrumentation/openai_agents/_realtime_wrappers.py

galkleinman

Would be great if we could take the opportunity to extract some of the nested event types handling into dedicated helper methods. up to you tho.

LGTM

galkleinman · 2026-02-17T09:14:17Z

...trumentation-openai-agents/opentelemetry/instrumentation/openai_agents/_realtime_wrappers.py

+                                if item_content and isinstance(item_content, list):
+                                    for part in item_content:
+                                        text = getattr(part, "text", None) or getattr(
+                                            part, "transcript", None
+                                        )
+                                        if text:
+                                            state.record_completion(role, text)
+                                            break


Should this capture all assistant messages or just the most recent one?

Just the most recent one, intentionally.

There are three event paths that can capture an assistant completion:

history_added -- fires when a single new item is added to history

history_updated -- fires when the full cumulative history changes (contains ALL messages from the entire session)

response.done (inside raw_model_event) -- fires when the model finishes responding

Whichever fires first creates the LLM span via record_completion -> create_llm_span. The seen_completions hash set in record_completion deduplicates, so subsequent handlers that try to record the same text become no-ops.

For history_updated specifically: the history list is cumulative -- on turn N it contains all messages from turns 1 through N. Using reversed() + break grabs only the newest assistant message (the one relevant to the current turn). Without this, we'd iterate through all previous assistant messages on every turn. The dedup set would ultimately prevent duplicate spans, but it's unnecessary work.

Here is the actual event ordering from an E2E test against the agents SDK relay (one-turn conversation, "Tell me about the iPhone 17"):

# Session starts -- history_updated fires with system message only, no assistant content yet [OTEL DEBUG] traced_put_event CALLED, event_type=history_updated # ... ~750 raw_model_event events (audio deltas, transcription deltas, etc.) ... # Model finishes responding -- history_updated fires with full history including the new assistant message [OTEL DEBUG] traced_put_event CALLED, event_type=history_updated [OTEL DEBUG] record_completion called: role=assistant, content=The iPhone 17 is Apple's latest flagship phone... # response.done fires next -- record_usage captures tokens, record_completion is deduplicated [OTEL DEBUG] traced_put_event CALLED, event_type=raw_model_event [OTEL DEBUG] record_usage called: usage type=dict [OTEL DEBUG] record_completion called: role=assistant, content=The iPhone 17 is Apple's latest flagship phone... [OTEL DEBUG] record_completion: duplicate, skipping # Agent ends [OTEL DEBUG] traced_put_event CALLED, event_type=agent_end

This shows: history_updated captures the completion first, then response.done captures usage and correctly skips the duplicate completion.

…completions, and usage Handle history_updated events to capture assistant transcript updates. Fix dict-based data access in response.done handler where getattr was used on dicts instead of .get(), silently returning None. Fix dict-case event unwrapping where the data variable was not updated to the nested level. Remove dead response event handler that could never match. Closes traceloop#3685

…ript fallback Address PR review feedback: - Extract _extract_content_text, _unwrap_raw_event_data, and _extract_response_and_usage helpers to reduce nesting - De-nest completion extraction from if-usage guard so completions are captured even when usage is absent from response.done - Add transcript fallback in response.done handler to match history_updated and item_updated handlers - Add test for response.done without usage

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

packages/opentelemetry-instrumentation-openai-agents/tests/test_realtime_session.py (2)
628-637: Inline extraction logic diverges from the helper and the actual handler.

test_response_done_dict_captures_usage_and_completion manually extracts text via part.get("text") (Line 634) without the transcript fallback, while test_response_done_without_usage_still_captures_completion correctly delegates to _extract_content_text (Line 687). The production handler at _realtime_wrappers.py Line 627 uses _extract_content_text everywhere now, so this test wouldn't catch a regression on transcript-only content parts.

Consider using _extract_content_text consistently in all three tests (it's already imported).
Proposed fix
                         if item.get("type") == "message" and item.get("role") == "assistant":
                             item_content = item.get("content")
-                            if item_content and isinstance(item_content, list):
-                                for part in item_content:
-                                    text = part.get("text") if isinstance(part, dict) else None
-                                    if text:
-                                        state.record_completion("assistant", text)
-                                        break
+                            text = _extract_content_text(item_content)
+                            if text:
+                                state.record_completion("assistant", text)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@packages/opentelemetry-instrumentation-openai-agents/tests/test_realtime_session.py`
around lines 628 - 637, The test manually extracts text using part.get("text")
in test_response_done_dict_captures_usage_and_completion; change that extraction
to call the already-imported helper _extract_content_text so it matches
test_response_done_without_usage_still_captures_completion and the production
handler in _realtime_wrappers.py; locate the loop in
tests/test_realtime_session.py that iterates output items (the block using
isinstance(item, dict) and item.get("type") == "message") and replace the direct
part.get("text") usage with a call to _extract_content_text(part) before passing
the result to state.record_completion("assistant", text).
538-698: Tests replicate handler logic inline rather than exercising traced_put_event.

All three tests manually reconstruct the extraction logic (iterating history, unwrapping dicts, etc.) instead of invoking the actual event handler path. This means the tests validate RealtimeTracingState.record_* methods but do not cover the wiring inside traced_put_event — the very code this PR is fixing. A regression in traced_put_event (e.g., a typo in an event_type check or a wrong nesting level) would not be caught.

Consider adding at least one integration-style test that constructs a mock event object with the appropriate .type and .data attributes, then calls the handler logic (or a thin wrapper around it) to verify end-to-end span creation.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@packages/opentelemetry-instrumentation-openai-agents/tests/test_realtime_session.py`
around lines 538 - 698, Add an integration-style test that exercises the
traced_put_event handler end-to-end instead of reproducing its logic inline:
create a mock event object with .type (e.g., "history.updated" or
"response.done") and .data shaped like the real payload, call the
traced_put_event function (or a small wrapper that routes to it) while using
RealtimeTracingState and the tracer fixture, then assert
exporter.get_finished_spans() contains the expected "openai.realtime" span
attributes (gen_ai.completion.* and gen_ai.usage.*). Ensure the test references
traced_put_event and RealtimeTracingState so it will fail if event-type checks
or nesting handling inside traced_put_event regress.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@packages/opentelemetry-instrumentation-openai-agents/opentelemetry/instrumentation/openai_agents/_realtime_wrappers.py`:
- Around line 632-640: The item_updated branch currently uses getattr(data,
"item", None) which skips dict-shaped events; change the handler for data_type
== "item_updated" to mirror the response.done logic by checking isinstance(data,
dict) and pulling item = data.get("item") (and otherwise using getattr as
before), then extract role/content from the item via item.get("role") /
item.get("content") when item is a dict, and finally pass the content through
_extract_content_text and call state.record_completion(role, text) when role ==
"assistant" and text is truthy so dict-based events are handled consistently.

---

Nitpick comments:
In
`@packages/opentelemetry-instrumentation-openai-agents/tests/test_realtime_session.py`:
- Around line 628-637: The test manually extracts text using part.get("text") in
test_response_done_dict_captures_usage_and_completion; change that extraction to
call the already-imported helper _extract_content_text so it matches
test_response_done_without_usage_still_captures_completion and the production
handler in _realtime_wrappers.py; locate the loop in
tests/test_realtime_session.py that iterates output items (the block using
isinstance(item, dict) and item.get("type") == "message") and replace the direct
part.get("text") usage with a call to _extract_content_text(part) before passing
the result to state.record_completion("assistant", text).
- Around line 538-698: Add an integration-style test that exercises the
traced_put_event handler end-to-end instead of reproducing its logic inline:
create a mock event object with .type (e.g., "history.updated" or
"response.done") and .data shaped like the real payload, call the
traced_put_event function (or a small wrapper that routes to it) while using
RealtimeTracingState and the tracer fixture, then assert
exporter.get_finished_spans() contains the expected "openai.realtime" span
attributes (gen_ai.completion.* and gen_ai.usage.*). Ensure the test references
traced_put_event and RealtimeTracingState so it will fail if event-type checks
or nesting handling inside traced_put_event regress.

coderabbitai · 2026-02-17T19:04:18Z

...trumentation-openai-agents/opentelemetry/instrumentation/openai_agents/_realtime_wrappers.py

                        elif data_type == "item_updated":
                            item = getattr(data, "item", None)
                            if item:
                                role = getattr(item, "role", None)
                                item_content = getattr(item, "content", None)
-                                if (
-                                    role == "assistant"
-                                    and item_content
-                                    and isinstance(item_content, list)
-                                ):
-                                    for part in item_content:
-                                        text = getattr(part, "text", None) or getattr(
-                                            part, "transcript", None
-                                        )
-                                        if text:
-                                            state.record_completion(role, text)
-                                            break
+                                if role == "assistant":
+                                    text = _extract_content_text(item_content)
+                                    if text:
+                                        state.record_completion(role, text)


⚠️ Potential issue | 🟡 Minor

item_updated handler doesn't handle dict-based data — inconsistent with response.done.

After _unwrap_raw_event_data, data may be a dict (Lines 59–62 return a nested dict). Here, getattr(data, "item", None) on a dict returns None, so dict-shaped item_updated events would be silently skipped. The response.done block (Lines 610–629) correctly branches on isinstance(…, dict).

Proposed fix

elif data_type == "item_updated": - item = getattr(data, "item", None) + if isinstance(data, dict): + item = data.get("item") + else: + item = getattr(data, "item", None) if item: - role = getattr(item, "role", None) - item_content = getattr(item, "content", None) + if isinstance(item, dict): + role = item.get("role") + item_content = item.get("content") + else: + role = getattr(item, "role", None) + item_content = getattr(item, "content", None) if role == "assistant": text = _extract_content_text(item_content)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@packages/opentelemetry-instrumentation-openai-agents/opentelemetry/instrumentation/openai_agents/_realtime_wrappers.py` around lines 632 - 640, The item_updated branch currently uses getattr(data, "item", None) which skips dict-shaped events; change the handler for data_type == "item_updated" to mirror the response.done logic by checking isinstance(data, dict) and pulling item = data.get("item") (and otherwise using getattr as before), then extract role/content from the item via item.get("role") / item.get("content") when item is a dict, and finally pass the content through _extract_content_text and call state.record_completion(role, text) when role == "assistant" and text is truthy so dict-based events are handled consistently.

EliJaghab mentioned this pull request Feb 16, 2026

fix(openai-agents): add functools.wraps to dont_throw decorator #3687

Merged

2 tasks

EliJaghab marked this pull request as ready for review February 16, 2026 15:33

ellipsis-dev bot reviewed Feb 16, 2026

View reviewed changes

coderabbitai bot reviewed Feb 16, 2026

View reviewed changes

...trumentation-openai-agents/opentelemetry/instrumentation/openai_agents/_realtime_wrappers.py Outdated Show resolved Hide resolved

...trumentation-openai-agents/opentelemetry/instrumentation/openai_agents/_realtime_wrappers.py Outdated Show resolved Hide resolved

galkleinman approved these changes Feb 17, 2026

View reviewed changes

EliJaghab added 2 commits February 17, 2026 13:53

EliJaghab force-pushed the fix/realtime-event-handling branch from ba639de to ba7d1cc Compare February 17, 2026 19:00

coderabbitai bot reviewed Feb 17, 2026

View reviewed changes

galkleinman merged commit 02ae790 into traceloop:main Feb 19, 2026
9 of 10 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(openai-agents): fix realtime session event handling for prompts, completions, and usage#3688

fix(openai-agents): fix realtime session event handling for prompts, completions, and usage#3688
galkleinman merged 2 commits intotraceloop:mainfrom
EliJaghab:fix/realtime-event-handling

EliJaghab commented Feb 15, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Feb 15, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

EliJaghab commented Feb 16, 2026

Uh oh!

ellipsis-dev bot left a comment

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Uh oh!

galkleinman left a comment

Uh oh!

galkleinman Feb 17, 2026

Uh oh!

EliJaghab Feb 17, 2026 •

edited

Loading

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Feb 17, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

EliJaghab commented Feb 15, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Feb 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

EliJaghab commented Feb 16, 2026

Uh oh!

ellipsis-dev bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

galkleinman left a comment

Choose a reason for hiding this comment

Uh oh!

galkleinman Feb 17, 2026

Choose a reason for hiding this comment

Uh oh!

EliJaghab Feb 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Feb 17, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

EliJaghab commented Feb 15, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Feb 15, 2026 •

edited

Loading

EliJaghab Feb 17, 2026 •

edited

Loading