.NET: Fix intermittent checkpoint-restore race in in-process workflow runs #5134

Open
peibekwe wants to merge 7 commits into main from peibekwe/workflow-unit-tests

Conversation

@peibekwe peibekwe commented Apr 7, 2026

Description

During a live RestoreCheckpointAsync call, external deliveries queued on the superseded timeline could survive the restore and be applied after the checkpoint state was imported. This caused flaky replay behavior when unit tests executed the samples, including an incorrect prompt/order after restore.
This change clears queued external deliveries during checkpoint import and adds a regression test verifying that a restored run remains pending until a fresh post-restore response is sent.

Contribution Checklist

  • The code builds clean without any errors or warnings
  • The PR follows the Contribution Guidelines
  • All unit tests pass, and I have added new tests where possible
  • Is this a breaking change? If yes, add "[BREAKING]" prefix to the title of the PR.

@moonbox3 added the .NET and workflows (Related to Workflows in agent-framework) labels Apr 7, 2026
@github-actions bot changed the title from "Fix intermittent checkpoint-restore race in in-process workflow runs" to ".NET: Fix intermittent checkpoint-restore race in in-process workflow runs" Apr 7, 2026
@peibekwe peibekwe marked this pull request as ready for review April 7, 2026 15:00
Copilot AI review requested due to automatic review settings April 7, 2026 15:00
Copilot AI left a comment

Pull request overview

Fixes an intermittent in-process checkpoint restore race where pre-restore queued external deliveries could be applied after checkpoint state import, leading to flaky replay behavior.

Changes:

  • Clear queued external deliveries during ImportStateAsync so stale responses/messages from a superseded timeline can’t be applied post-restore.
  • Add a regression unit test to ensure a restored run remains PendingRequests until a fresh post-restore response is provided.
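
The two changes above can be illustrated with a minimal, self-contained sketch. It is not the framework's implementation: `SketchRunnerContext`, `ImportState`, `EnqueueResponse`, `ApplyQueuedDeliveries`, and `RestoreSketch` are hypothetical stand-ins, and only the field name `_queuedExternalDeliveries` is taken from the PR. The sketch assumes that a queued response completes a pending request when its id matches.

```csharp
#nullable enable
using System.Collections.Concurrent;

// Hypothetical stand-in for the runner context; only the field name
// _queuedExternalDeliveries comes from the PR.
public sealed class SketchRunnerContext
{
    private readonly ConcurrentQueue<string> _queuedExternalDeliveries = new();

    // Request id that is still waiting for an external response, if any.
    public string? PendingRequestId { get; private set; }

    public bool HasPendingRequests => this.PendingRequestId is not null;

    // External callers queue responses; they are applied on a later step.
    public void EnqueueResponse(string requestId) =>
        this._queuedExternalDeliveries.Enqueue(requestId);

    // Simplified import: drop deliveries queued on the superseded timeline,
    // then install the checkpointed state.
    public void ImportState(string pendingRequestId)
    {
        while (this._queuedExternalDeliveries.TryDequeue(out _))
        {
        }

        this.PendingRequestId = pendingRequestId;
    }

    // Apply queued deliveries; a matching response completes the request.
    public void ApplyQueuedDeliveries()
    {
        while (this._queuedExternalDeliveries.TryDequeue(out string? requestId))
        {
            if (requestId == this.PendingRequestId)
            {
                this.PendingRequestId = null;
            }
        }
    }
}

public static class RestoreSketch
{
    public static (bool PendingAfterRestore, bool PendingAfterFreshResponse) Run()
    {
        var context = new SketchRunnerContext();
        context.ImportState("request-1");      // original timeline
        context.EnqueueResponse("request-1");  // response queued before restore

        context.ImportState("request-1");      // restore discards the stale response
        context.ApplyQueuedDeliveries();
        bool pendingAfterRestore = context.HasPendingRequests;

        context.EnqueueResponse("request-1");  // fresh post-restore response
        context.ApplyQueuedDeliveries();
        bool pendingAfterFreshResponse = context.HasPendingRequests;

        return (pendingAfterRestore, pendingAfterFreshResponse);
    }
}
```

In this sketch `RestoreSketch.Run()` reports the run as still pending after the restore and as no longer pending once the fresh response is applied. Removing the drain loop from `ImportState` would let the stale response complete the restored run, which is the symptom the PR describes.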

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File | Description
dotnet/src/Microsoft.Agents.AI.Workflows/InProc/InProcessRunnerContext.cs | Clears queued external deliveries during checkpoint state import to prevent stale delivery application after restore.
dotnet/tests/Microsoft.Agents.AI.Workflows.UnitTests/CheckpointResumeTests.cs | Adds a regression test validating queued responses from the superseded timeline don’t complete the restored run.


// Discard queued external deliveries from the superseded timeline so a runtime
// restore cannot apply stale responses after importing the checkpoint state.
while (this._queuedExternalDeliveries.TryDequeue(out _))

Copilot AI Apr 7, 2026

The dequeue loop clears the queue in-place, which can inadvertently discard new external deliveries enqueued concurrently with the restore (e.g., a response arriving just after restore starts). It can also spin longer than intended if enqueues continue. Consider swapping the queue atomically at the start of import (e.g., Interlocked.Exchange to a new ConcurrentQueue, then drain the old instance) so only pre-restore deliveries are dropped and post-restore deliveries are preserved.

Suggested change
- while (this._queuedExternalDeliveries.TryDequeue(out _))
+ // Atomically swap the queue so only deliveries that were already queued before
+ // restore started are discarded; deliveries arriving concurrently are preserved
+ // on the new queue.
+ var queuedExternalDeliveries = Interlocked.Exchange(ref this._queuedExternalDeliveries, new());
+ while (queuedExternalDeliveries.TryDequeue(out _))
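
The suggested swap can be tried in isolation with the sketch below. `SwappableDeliveryQueue`, `DiscardQueuedDeliveries`, and `SwapSketch` are hypothetical names used only for illustration, and the element type `string` is an assumption; the PR does not show the real delivery type. Note that `Interlocked.Exchange` takes the field by `ref`, so the field cannot be declared `readonly` for this approach to compile.

```csharp
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical wrapper illustrating the swap-then-drain pattern.
public sealed class SwappableDeliveryQueue
{
    // Not readonly: Interlocked.Exchange needs a writable ref to the field.
    private ConcurrentQueue<string> _queuedExternalDeliveries = new();

    // Producers always enqueue onto whichever queue is current.
    public void Enqueue(string delivery) =>
        Volatile.Read(ref this._queuedExternalDeliveries).Enqueue(delivery);

    public int Count => Volatile.Read(ref this._queuedExternalDeliveries).Count;

    // Swaps in an empty queue, then drains the superseded instance.
    // Returns the number of pre-swap deliveries that were discarded.
    public int DiscardQueuedDeliveries()
    {
        ConcurrentQueue<string> superseded = Interlocked.Exchange(
            ref this._queuedExternalDeliveries,
            new ConcurrentQueue<string>());

        int discarded = 0;
        while (superseded.TryDequeue(out _))
        {
            discarded++;
        }

        return discarded;
    }
}

public static class SwapSketch
{
    public static (int Discarded, int Remaining) Run()
    {
        var queue = new SwappableDeliveryQueue();
        queue.Enqueue("stale-1");
        queue.Enqueue("stale-2");

        int discarded = queue.DiscardQueuedDeliveries();

        queue.Enqueue("fresh-1"); // arrives after the swap, so it is kept
        return (discarded, queue.Count);
    }
}
```

Because the drain runs against the superseded instance, it is bounded by what was queued before the swap and cannot be prolonged by later enqueues. One caveat: a producer that read the old queue reference just before the swap can still enqueue onto the superseded instance, and that delivery is then lost with it. Whether that matters depends on how the framework defines the boundary between pre-restore and post-restore deliveries.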



3 participants