fix(openai-agents): fix realtime session event handling for prompts, completions, and usage#3688
Conversation
📝 WalkthroughWalkthroughUnifies and refactors realtime event parsing in the OpenAI Agents instrumentation: adds helpers to extract content, unwrap nested raw events, and extract response/usage; fixes handling of Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
Tip Issue Planner is now in beta. Read the docs and try it out! Share your feedback on Discord. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Tested locally in our production voice relay (OpenAI Agents SDK realtime mode) with Jaeger trace collection. With this fix applied,
Before this fix, all of these attributes were missing on realtime spans. Confirmed across a 3-message automated conversation with multiple tool calls, producing 8 |
There was a problem hiding this comment.
Important
Looks good to me! 👍
Reviewed everything up to ba639de in 28 seconds. Click for details.
- Reviewed
231lines of code in2files - Skipped
0files when reviewing. - Skipped posting
0draft comments. View those below. - Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
Workflow ID: wflow_6VG6uIcIdAzkFRtY
You can customize by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In
`@packages/opentelemetry-instrumentation-openai-agents/opentelemetry/instrumentation/openai_agents/_realtime_wrappers.py`:
- Around line 610-621: In the response.done handler, dict-based parts only check
part.get("text") (and the object path only checks getattr(part, "text")), so
audio parts with a "transcript" field are missed; update that block to mirror
history_added/history_updated/item_updated by falling back to
part.get("transcript") for dicts and getattr(part, "transcript", None) for
objects before calling state.record_completion(role, text), ensuring transcript
is used when text is absent.
- Around line 587-621: The completion extraction logic is incorrectly nested
under the `if usage:` guard so completions are skipped when `usage` is absent;
separate concerns by keeping `state.record_usage(usage)` inside the `if usage:`
block but move the entire output/completion extraction code (the logic that
inspects `response`/`output`/`item`/`item_content` and calls
`state.record_completion(role, text)`) out one level so it runs regardless of
`usage`. Locate the block using the symbols `state.record_usage`, `response`,
`output`, `item_type`, `role`, `item_content`, and `state.record_completion` in
`_realtime_wrappers.py` and de-indent that output-iteration block so usage
recording remains conditional but completion extraction always executes.
🧹 Nitpick comments (2)
packages/opentelemetry-instrumentation-openai-agents/tests/test_realtime_session.py (2)
480-529: Tests replicate handler logic inline instead of exercisingtraced_put_event— coverage gap.Both
test_history_updated_*andtest_response_done_dict_*manually reproduce the extraction logic fromtraced_put_event(lines 509–521 and 562–579) rather than dispatching a mock event through the actual handler. This means any future drift between the handler and these inline copies won't be caught.Consider creating a thin helper that constructs a mock event (with
.typeand relevant attributes), then callstraced_put_eventon a mock session wired with aRealtimeTracingState. This would turn these into true integration tests of the handler dispatch path. That said, the current tests still validateRealtimeTracingStatecorrectness, so this can be deferred.
531-591: Test doesn't cover theif usage:gating issue — no test for response.done without usage.As flagged in the wrapper review, completion extraction is currently gated on usage being present. Consider adding a test with a
response.donepayload that hasoutputbut nousageto document the expected behavior (and to catch the regression once the gating is fixed).
...trumentation-openai-agents/opentelemetry/instrumentation/openai_agents/_realtime_wrappers.py
Outdated
Show resolved
Hide resolved
...trumentation-openai-agents/opentelemetry/instrumentation/openai_agents/_realtime_wrappers.py
Outdated
Show resolved
Hide resolved
galkleinman
left a comment
There was a problem hiding this comment.
Would be great if we could take the opportunity to extract some of the nested event types handling into dedicated helper methods. up to you tho.
LGTM
| if item_content and isinstance(item_content, list): | ||
| for part in item_content: | ||
| text = getattr(part, "text", None) or getattr( | ||
| part, "transcript", None | ||
| ) | ||
| if text: | ||
| state.record_completion(role, text) | ||
| break |
There was a problem hiding this comment.
Should this capture all assistant messages or just the most recent one?
There was a problem hiding this comment.
Just the most recent one, intentionally.
There are three event paths that can capture an assistant completion:
history_added-- fires when a single new item is added to historyhistory_updated-- fires when the full cumulative history changes (contains ALL messages from the entire session)response.done(insideraw_model_event) -- fires when the model finishes responding
Whichever fires first creates the LLM span via record_completion -> create_llm_span. The seen_completions hash set in record_completion deduplicates, so subsequent handlers that try to record the same text become no-ops.
For history_updated specifically: the history list is cumulative -- on turn N it contains all messages from turns 1 through N. Using reversed() + break grabs only the newest assistant message (the one relevant to the current turn). Without this, we'd iterate through all previous assistant messages on every turn. The dedup set would ultimately prevent duplicate spans, but it's unnecessary work.
Here is the actual event ordering from an E2E test against the agents SDK relay (one-turn conversation, "Tell me about the iPhone 17"):
# Session starts -- history_updated fires with system message only, no assistant content yet
[OTEL DEBUG] traced_put_event CALLED, event_type=history_updated
# ... ~750 raw_model_event events (audio deltas, transcription deltas, etc.) ...
# Model finishes responding -- history_updated fires with full history including the new assistant message
[OTEL DEBUG] traced_put_event CALLED, event_type=history_updated
[OTEL DEBUG] record_completion called: role=assistant, content=The iPhone 17 is Apple's latest flagship phone...
# response.done fires next -- record_usage captures tokens, record_completion is deduplicated
[OTEL DEBUG] traced_put_event CALLED, event_type=raw_model_event
[OTEL DEBUG] record_usage called: usage type=dict
[OTEL DEBUG] record_completion called: role=assistant, content=The iPhone 17 is Apple's latest flagship phone...
[OTEL DEBUG] record_completion: duplicate, skipping
# Agent ends
[OTEL DEBUG] traced_put_event CALLED, event_type=agent_end
This shows: history_updated captures the completion first, then response.done captures usage and correctly skips the duplicate completion.
…completions, and usage Handle history_updated events to capture assistant transcript updates. Fix dict-based data access in response.done handler where getattr was used on dicts instead of .get(), silently returning None. Fix dict-case event unwrapping where the data variable was not updated to the nested level. Remove dead response event handler that could never match. Closes traceloop#3685
…ript fallback Address PR review feedback: - Extract _extract_content_text, _unwrap_raw_event_data, and _extract_response_and_usage helpers to reduce nesting - De-nest completion extraction from if-usage guard so completions are captured even when usage is absent from response.done - Add transcript fallback in response.done handler to match history_updated and item_updated handlers - Add test for response.done without usage
ba639de to
ba7d1cc
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
packages/opentelemetry-instrumentation-openai-agents/tests/test_realtime_session.py (2)
628-637: Inline extraction logic diverges from the helper and the actual handler.
test_response_done_dict_captures_usage_and_completionmanually extracts text viapart.get("text")(Line 634) without thetranscriptfallback, whiletest_response_done_without_usage_still_captures_completioncorrectly delegates to_extract_content_text(Line 687). The production handler at_realtime_wrappers.pyLine 627 uses_extract_content_texteverywhere now, so this test wouldn't catch a regression on transcript-only content parts.Consider using
_extract_content_textconsistently in all three tests (it's already imported).Proposed fix
if item.get("type") == "message" and item.get("role") == "assistant": item_content = item.get("content") - if item_content and isinstance(item_content, list): - for part in item_content: - text = part.get("text") if isinstance(part, dict) else None - if text: - state.record_completion("assistant", text) - break + text = _extract_content_text(item_content) + if text: + state.record_completion("assistant", text)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/opentelemetry-instrumentation-openai-agents/tests/test_realtime_session.py` around lines 628 - 637, The test manually extracts text using part.get("text") in test_response_done_dict_captures_usage_and_completion; change that extraction to call the already-imported helper _extract_content_text so it matches test_response_done_without_usage_still_captures_completion and the production handler in _realtime_wrappers.py; locate the loop in tests/test_realtime_session.py that iterates output items (the block using isinstance(item, dict) and item.get("type") == "message") and replace the direct part.get("text") usage with a call to _extract_content_text(part) before passing the result to state.record_completion("assistant", text).
538-698: Tests replicate handler logic inline rather than exercisingtraced_put_event.All three tests manually reconstruct the extraction logic (iterating history, unwrapping dicts, etc.) instead of invoking the actual event handler path. This means the tests validate
RealtimeTracingState.record_*methods but do not cover the wiring insidetraced_put_event— the very code this PR is fixing. A regression intraced_put_event(e.g., a typo in anevent_typecheck or a wrong nesting level) would not be caught.Consider adding at least one integration-style test that constructs a mock
eventobject with the appropriate.typeand.dataattributes, then calls the handler logic (or a thin wrapper around it) to verify end-to-end span creation.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/opentelemetry-instrumentation-openai-agents/tests/test_realtime_session.py` around lines 538 - 698, Add an integration-style test that exercises the traced_put_event handler end-to-end instead of reproducing its logic inline: create a mock event object with .type (e.g., "history.updated" or "response.done") and .data shaped like the real payload, call the traced_put_event function (or a small wrapper that routes to it) while using RealtimeTracingState and the tracer fixture, then assert exporter.get_finished_spans() contains the expected "openai.realtime" span attributes (gen_ai.completion.* and gen_ai.usage.*). Ensure the test references traced_put_event and RealtimeTracingState so it will fail if event-type checks or nesting handling inside traced_put_event regress.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@packages/opentelemetry-instrumentation-openai-agents/opentelemetry/instrumentation/openai_agents/_realtime_wrappers.py`:
- Around line 632-640: The item_updated branch currently uses getattr(data,
"item", None) which skips dict-shaped events; change the handler for data_type
== "item_updated" to mirror the response.done logic by checking isinstance(data,
dict) and pulling item = data.get("item") (and otherwise using getattr as
before), then extract role/content from the item via item.get("role") /
item.get("content") when item is a dict, and finally pass the content through
_extract_content_text and call state.record_completion(role, text) when role ==
"assistant" and text is truthy so dict-based events are handled consistently.
---
Nitpick comments:
In
`@packages/opentelemetry-instrumentation-openai-agents/tests/test_realtime_session.py`:
- Around line 628-637: The test manually extracts text using part.get("text") in
test_response_done_dict_captures_usage_and_completion; change that extraction to
call the already-imported helper _extract_content_text so it matches
test_response_done_without_usage_still_captures_completion and the production
handler in _realtime_wrappers.py; locate the loop in
tests/test_realtime_session.py that iterates output items (the block using
isinstance(item, dict) and item.get("type") == "message") and replace the direct
part.get("text") usage with a call to _extract_content_text(part) before passing
the result to state.record_completion("assistant", text).
- Around line 538-698: Add an integration-style test that exercises the
traced_put_event handler end-to-end instead of reproducing its logic inline:
create a mock event object with .type (e.g., "history.updated" or
"response.done") and .data shaped like the real payload, call the
traced_put_event function (or a small wrapper that routes to it) while using
RealtimeTracingState and the tracer fixture, then assert
exporter.get_finished_spans() contains the expected "openai.realtime" span
attributes (gen_ai.completion.* and gen_ai.usage.*). Ensure the test references
traced_put_event and RealtimeTracingState so it will fail if event-type checks
or nesting handling inside traced_put_event regress.
| elif data_type == "item_updated": | ||
| item = getattr(data, "item", None) | ||
| if item: | ||
| role = getattr(item, "role", None) | ||
| item_content = getattr(item, "content", None) | ||
| if ( | ||
| role == "assistant" | ||
| and item_content | ||
| and isinstance(item_content, list) | ||
| ): | ||
| for part in item_content: | ||
| text = getattr(part, "text", None) or getattr( | ||
| part, "transcript", None | ||
| ) | ||
| if text: | ||
| state.record_completion(role, text) | ||
| break | ||
| if role == "assistant": | ||
| text = _extract_content_text(item_content) | ||
| if text: | ||
| state.record_completion(role, text) |
There was a problem hiding this comment.
item_updated handler doesn't handle dict-based data — inconsistent with response.done.
After _unwrap_raw_event_data, data may be a dict (Lines 59–62 return a nested dict). Here, getattr(data, "item", None) on a dict returns None, so dict-shaped item_updated events would be silently skipped. The response.done block (Lines 610–629) correctly branches on isinstance(…, dict).
Proposed fix
elif data_type == "item_updated":
- item = getattr(data, "item", None)
+ if isinstance(data, dict):
+ item = data.get("item")
+ else:
+ item = getattr(data, "item", None)
if item:
- role = getattr(item, "role", None)
- item_content = getattr(item, "content", None)
+ if isinstance(item, dict):
+ role = item.get("role")
+ item_content = item.get("content")
+ else:
+ role = getattr(item, "role", None)
+ item_content = getattr(item, "content", None)
if role == "assistant":
text = _extract_content_text(item_content)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@packages/opentelemetry-instrumentation-openai-agents/opentelemetry/instrumentation/openai_agents/_realtime_wrappers.py`
around lines 632 - 640, The item_updated branch currently uses getattr(data,
"item", None) which skips dict-shaped events; change the handler for data_type
== "item_updated" to mirror the response.done logic by checking isinstance(data,
dict) and pulling item = data.get("item") (and otherwise using getattr as
before), then extract role/content from the item via item.get("role") /
item.get("content") when item is a dict, and finally pass the content through
_extract_content_text and call state.record_completion(role, text) when role ==
"assistant" and text is truthy so dict-based events are handled consistently.
Summary
history_updatedevent handler to capture assistant transcript updatesresponse.donehandler (getattron dicts silently returnedNoneinstead of using.get())datavariable was not updated to the nested levelresponseevent handler (no SDK session event hastype="response")Why
The realtime instrumentation had several event type mismatches with the openai-agents SDK (v0.6.0+):
history_updatedevents were completely ignored, so assistant transcript updates from multi-turn conversations were never captured as completions.response.donehandler,getattr()was used on dict objects instead of.get(), so completion extraction silently failed (usage was captured correctly via the dict path, but output/content was not).raw_model_eventdid not update thedatareference after finding nested data, so subsequent dict-path code looked at the wrong nesting level.responseevent handler was dead code -- no SDK session event hastype="response".Closes #3685
Test plan
history_updatedassistant completion capture via transcriptresponse.doneusage and completion extractionImportant
Fix event handling in OpenAI Agents SDK by adding
history_updatedhandler, correctingresponse.donedata extraction, and removing dead code.history_updatedevent handler in_realtime_wrappers.pyto capture assistant transcript updates.response.donehandler to use.get()instead ofgetattr()for dicts, ensuring completion extraction.raw_model_eventto updatedatato the nested level.responseevent handler.history_updatedevent handling intest_realtime_session.py.response.donehandling intest_realtime_session.py.This description was created by
for ba639de. You can customize this summary. It will automatically update as commits are pushed.
Summary by CodeRabbit
New Features
Tests