.NET: Fix intermittent checkpoint-restore race in in-process workflow runs #5134
Pull request overview
Fixes an intermittent in-process checkpoint restore race where pre-restore queued external deliveries could be applied after checkpoint state import, leading to flaky replay behavior.
Changes:
- Clear queued external deliveries during `ImportStateAsync` so stale responses/messages from a superseded timeline can’t be applied post-restore.
- Add a regression unit test to ensure a restored run remains `PendingRequests` until a fresh post-restore response is provided.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| dotnet/src/Microsoft.Agents.AI.Workflows/InProc/InProcessRunnerContext.cs | Clears queued external deliveries during checkpoint state import to prevent stale delivery application after restore. |
| dotnet/tests/Microsoft.Agents.AI.Workflows.UnitTests/CheckpointResumeTests.cs | Adds a regression test validating queued responses from the superseded timeline don’t complete the restored run. |

    // Discard queued external deliveries from the superseded timeline so a runtime
    // restore cannot apply stale responses after importing the checkpoint state.
    while (this._queuedExternalDeliveries.TryDequeue(out _))

The dequeue loop clears the queue in-place, which can inadvertently discard new external deliveries enqueued concurrently with the restore (e.g., a response arriving just after restore starts). It can also spin longer than intended if enqueues continue. Consider swapping the queue atomically at the start of import (e.g., Interlocked.Exchange to a new ConcurrentQueue, then drain the old instance) so only pre-restore deliveries are dropped and post-restore deliveries are preserved.

Suggested change:

    - while (this._queuedExternalDeliveries.TryDequeue(out _))
    + // Atomically swap the queue so only deliveries that were already queued before
    + // restore started are discarded; deliveries arriving concurrently are preserved
    + // on the new queue.
    + var queuedExternalDeliveries = Interlocked.Exchange(ref this._queuedExternalDeliveries, new());
    + while (queuedExternalDeliveries.TryDequeue(out _))

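
The swap-then-drain idea from the review comment can be sketched in isolation. This is a minimal illustration, not the actual `InProcessRunnerContext` code: the class `DeliveryQueueSketch`, its members, and the `string` payload type are hypothetical stand-ins. The sketch also assumes the queue field is not `readonly`, since `Interlocked.Exchange` needs to write to it.

```csharp
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical stand-in for the runner context; names and payload type are assumptions.
public sealed class DeliveryQueueSketch
{
    // Must not be readonly: the restore path replaces the whole queue instance.
    private ConcurrentQueue<string> _queuedExternalDeliveries = new ConcurrentQueue<string>();

    // Producers always enqueue onto whichever queue instance is current.
    public void Enqueue(string delivery) =>
        Volatile.Read(ref this._queuedExternalDeliveries).Enqueue(delivery);

    // Number of deliveries waiting on the current queue.
    public int PendingCount => Volatile.Read(ref this._queuedExternalDeliveries).Count;

    // Swaps in a fresh queue, then drains the old one. Returns how many
    // pre-restore deliveries were discarded.
    public int DiscardPreRestoreDeliveries()
    {
        ConcurrentQueue<string> superseded = Interlocked.Exchange(
            ref this._queuedExternalDeliveries, new ConcurrentQueue<string>());

        int discarded = 0;
        while (superseded.TryDequeue(out _))
        {
            discarded++;
        }

        return discarded;
    }
}
```

Because the drain loop runs over the detached instance, it is bounded by what was queued before the swap (plus any enqueue already in flight), and deliveries enqueued afterwards land on the new queue. One caveat of this sketch: a producer that read the old queue reference just before the swap can still enqueue onto the old instance, so that delivery is either discarded by the drain or left stranded on the detached queue. Whether that matters depends on how the real runner synchronizes producers with restore.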
Description
During live `RestoreCheckpointAsync`, queued external deliveries from the superseded timeline could survive restore and be applied after checkpoint state was imported. This caused flaky replay behavior in unit test sample execution, including incorrect prompt/order after restore.

The change clears queued external deliveries during checkpoint import and adds a regression test to verify restored runs remain pending until a fresh post-restore response is sent.