Fix flaky test_max_pending_count by preventing premature GC #857

Open
xyaz1313 wants to merge 1 commit into NVIDIA:main from xyaz1313:fix/flaky-test-max-pending-count

Conversation

@xyaz1313

Problem

TestDeallocation.test_max_pending_count is flaky: it fails intermittently.

Root Cause

The test creates DeviceNDArray objects without storing references:

for i in range(config.CUDA_DEALLOCS_COUNT):
    cuda.to_device(np.arange(1))  # no reference stored
    self.assertEqual(len(deallocs), i + 1)

Each cuda.to_device() call returns a DeviceNDArray → OwnedPointer → weakref.finalize. When Python's GC non-deterministically collects these temporaries between loop iterations, their finalizers fire early and add deallocations to the pending queue, so len(deallocs) exceeds the expected i + 1.

Fix

  1. Store all device arrays in a list during creation
  2. Delete them one by one in a controlled loop
  3. Call gc.collect() after each deletion to ensure finalizers fire

This makes deallocation timing fully deterministic.
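
The three steps above can be sketched without a GPU. This is a minimal illustration rather than the actual test change: `FakeDeviceArray`, `pending`, and `run_controlled` are hypothetical names standing in for DeviceNDArray, the pending-deallocation queue, and the test loop.

```python
import gc
import weakref

# Hypothetical stand-in for the pending-deallocation queue.
pending = []


class FakeDeviceArray:
    """Hypothetical stand-in for DeviceNDArray (not the real class)."""

    def __init__(self, ident):
        # Record this object's id in `pending` when it is reclaimed.
        weakref.finalize(self, pending.append, ident)


def run_controlled(count):
    """Create `count` arrays, release them one at a time, and return
    the length of `pending` observed after each release."""
    arrays = [FakeDeviceArray(i) for i in range(count)]  # step 1: keep references
    observed = []
    while arrays:
        del arrays[0]   # step 2: drop exactly one reference
        gc.collect()    # step 3: also finalize anything held only by a cycle
        observed.append(len(pending))
    return observed
```

With these stand-ins, `run_controlled(4)` yields one additional pending entry per release, which is the property the assertions in the test rely on.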

Reproduction

The flakiness depends on GC timing: it occurs more often under high memory pressure or when running the full test suite, where previous tests have accumulated GC-tracked objects.

Fixes #856

…mature GC

The test creates DeviceNDArray objects without storing references, allowing
Python's GC to non-deterministically collect them between loop iterations.
This causes extra deallocations to appear in the pending queue, making
assertions like 'len(deallocs) == i + 1' fail intermittently.

Fix: store all arrays in a list, then explicitly delete them one by one.
Added gc.collect() calls to ensure finalizers fire before assertions.

Fixes NVIDIA#856
@copy-pr-bot

copy-pr-bot Bot commented Apr 14, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@gmarkall
Contributor

I don't understand the root cause analysis here. My understanding is that by storing no reference to the DeviceNDArray returned by cuda.to_device(), its refcount would immediately drop to 0 and cause the finalizer to add its pointer to the list of deallocations. I would not have expected the GC to be involved here, because my understanding is that it's for cleaning up lost objects that participate in reference cycles.
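
The distinction drawn here can be checked in isolation. The sketch below assumes CPython (reference counting plus a cyclic collector) and uses a hypothetical `Node` class and helper `finalized_without_gc`; it involves no CUDA objects.

```python
import gc
import weakref


class Node:
    """Plain object used only for illustration."""


def finalized_without_gc(make_cycle):
    """Return True if the finalizer ran as soon as the last external
    reference was dropped, with the cyclic collector switched off."""
    fired = []
    was_enabled = gc.isenabled()
    gc.disable()  # rule out the cyclic collector entirely
    try:
        obj = Node()
        if make_cycle:
            obj.self_ref = obj  # cycle: the refcount cannot reach zero
        weakref.finalize(obj, fired.append, True)
        del obj  # drop the only external reference
        result = bool(fired)
    finally:
        if was_enabled:
            gc.enable()
    gc.collect()  # reclaim the cyclic object, if one was created
    return result
```

Under these assumptions, `finalized_without_gc(False)` reports that the finalizer fired immediately, while `finalized_without_gc(True)` reports that it did not fire until the collector ran.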

My concern is that the real underlying issue could be that with all the changes we've made to leverage cuda-core / cuda-python more, we've accidentally introduced reference cycles that keep device memory alive even when nothing "downstream" of the call to cuda.to_device() is holding a reference to it anymore. If that were the case, then these test changes will make the test no longer flaky, but completely mask the underlying issue.
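
One way to look for such cycles is to ask the collector which objects it had to reclaim. The helper below is a rough diagnostic sketch, not part of this PR: `cyclic_garbage_types`, `Plain`, and `Cyclic` are hypothetical names. On a machine with a GPU, the action passed in could be something like `lambda: cuda.to_device(np.arange(1))`.

```python
import gc


def cyclic_garbage_types(action):
    """Run `action`, then return the type names of objects that only
    the cyclic collector could reclaim."""
    gc.collect()  # start from a clean state
    old_flags = gc.get_debug()
    gc.set_debug(gc.DEBUG_SAVEALL)  # keep unreachable objects in gc.garbage
    try:
        action()  # the return value is discarded immediately
        gc.collect()
        found = [type(o).__name__ for o in gc.garbage]
    finally:
        gc.set_debug(old_flags)
        gc.garbage.clear()
    return found


class Plain:
    """No cycle: freed by reference counting alone."""


class Cyclic:
    """Refers to itself, so only the cyclic collector can free it."""

    def __init__(self):
        self.me = self
```

If the type of a device array or of its memory pointer showed up in the returned list, that would suggest a reference cycle is keeping device memory alive, which is the scenario described above.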

Perhaps I have some misunderstanding about the Python GC or how the deallocation behaviour is supposed to work - if so, could you help me understand how things fit together please?

gmarkall added the "4 - Waiting on author" (Waiting for author to respond to review) label Apr 15, 2026

Development

Successfully merging this pull request may close these issues.

Test: TestDeallocation.test_max_pending_count is flaky