Fix flaky test_max_pending_count by preventing premature GC by xyaz1313 · Pull Request #857 · NVIDIA/numba-cuda

xyaz1313 · 2026-04-14T21:12:44Z

Problem

TestDeallocation.test_max_pending_count is flaky: it fails intermittently.

Root Cause

The test creates DeviceNDArray objects without storing references:

for i in range(config.CUDA_DEALLOCS_COUNT):
    cuda.to_device(np.arange(1))  # no reference stored
    self.assertEqual(len(deallocs), i + 1)

Each cuda.to_device() returns a DeviceNDArray → OwnedPointer → weakref.finalize. When Python's GC non-deterministically collects these temporaries between loop iterations, their finalizers fire early and add deallocations to the pending queue. This makes len(deallocs) exceed the expected i + 1.

Fix

Store all device arrays in a list during creation
Delete them one by one in a controlled loop
Call gc.collect() after each deletion to ensure finalizers fire

This makes deallocation timing fully deterministic.
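The three steps above can be sketched without a GPU. This is a minimal illustration only: FakeDeviceArray and the pending list are hypothetical stand-ins for DeviceNDArray and the pending deallocation queue, not numba-cuda APIs, and the real test may be structured differently.

```python
import gc
import weakref


class FakeDeviceArray:
    """Hypothetical stand-in for a device array; not part of numba-cuda."""

    def __init__(self, pending):
        # Queue a "deallocation" when this object is finalized.
        weakref.finalize(self, pending.append, id(self))


def run_controlled_deallocation(count):
    pending = []  # stand-in for the pending deallocation queue
    # Step 1: keep every array alive while all of them are created.
    arrays = [FakeDeviceArray(pending) for _ in range(count)]
    assert len(pending) == 0

    sizes = []
    for _ in range(count):
        # Step 2: drop exactly one array per iteration.
        arrays.pop()
        # Step 3: force a collection before inspecting the queue.
        gc.collect()
        sizes.append(len(pending))
    return sizes


sizes = run_controlled_deallocation(5)
print(sizes)  # [1, 2, 3, 4, 5]
```

With this shape the queue grows by exactly one entry per iteration, which is the property the assertion len(deallocs) == i + 1 relies on.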

Reproduction

The flakiness depends on GC timing — it happens more frequently with high memory pressure or when running the full test suite (where previous tests accumulate GC-tracked objects).

Fixes #856

…mature GC

The test creates DeviceNDArray objects without storing references, allowing Python's GC to non-deterministically collect them between loop iterations. This causes extra deallocations to appear in the pending queue, making assertions like 'len(deallocs) == i + 1' fail intermittently.

Fix: store all arrays in a list, then explicitly delete them one by one. Added gc.collect() calls to ensure finalizers fire before assertions.

Fixes NVIDIA#856

copy-pr-bot · 2026-04-14T21:12:48Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

gmarkall · 2026-04-15T08:36:42Z

I don't understand the root cause analysis here. My understanding is that by storing no reference to the DeviceNDArray returned by cuda.to_device(), its refcount would immediately drop to 0 and cause the finalizer to add its pointer to the list of deallocations. I would not have expected the GC to be involved here, because my understanding is that it's for cleaning up lost objects that participate in reference cycles.
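The refcounting behaviour described here can be checked in isolation. The following is a small sketch, assuming CPython (other interpreters may defer finalization); Resource and make_temporary are hypothetical names used only for illustration. The cyclic collector is disabled, so any finalizer that runs must have been triggered by the reference count reaching zero.

```python
import gc
import weakref


class Resource:
    """Hypothetical plain object with no reference cycles."""


def make_temporary(log):
    obj = Resource()
    weakref.finalize(obj, log.append, "finalized")
    # obj goes out of scope on return; nothing else refers to it.


was_enabled = gc.isenabled()
gc.disable()  # rule out the cyclic collector
try:
    log = []
    make_temporary(log)
    ran_without_gc = log == ["finalized"]
finally:
    if was_enabled:
        gc.enable()

print(ran_without_gc)  # True on CPython
```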

My concern is that the real underlying issue could be that with all the changes we've made to leverage cuda-core / cuda-python more, we've accidentally introduced reference cycles that keep device memory alive even when nothing "downstream" of the call to cuda.to_device() is holding a reference to it anymore. If that were the case, then these test changes will make the test no longer flaky, but completely mask the underlying issue.
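One way to probe for this kind of cycle is to drop the last external reference with the cyclic collector disabled and see whether the finalizer has run. The sketch below assumes CPython and uses a hypothetical Node class rather than any numba-cuda object; whether the same check is practical against the object returned by cuda.to_device() is an assumption.

```python
import gc
import weakref


class Node:
    """Hypothetical object that can hold a reference to itself."""

    def __init__(self):
        self.ref = None


def finalized_without_gc(make_cycle):
    """Return True if the object is finalized by refcounting alone."""
    log = []
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        node = Node()
        if make_cycle:
            node.ref = node  # self-reference: refcount never reaches zero
        weakref.finalize(node, log.append, "finalized")
        del node
        return bool(log)
    finally:
        if was_enabled:
            gc.enable()
        gc.collect()  # clean up any cycle left behind


acyclic = finalized_without_gc(make_cycle=False)
cyclic = finalized_without_gc(make_cycle=True)
print(acyclic, cyclic)  # True False on CPython
```

A False result for an object that is expected to be acyclic would point to a reference cycle keeping it alive until the collector runs.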

Perhaps I have some misunderstanding about the Python GC or how the deallocation behaviour is supposed to work. If so, could you help me understand how things fit together, please?

xyaz1313 mentioned this pull request Apr 14, 2026

Test: TestDeallocation.test_max_pending_count is flaky #856

Open

gmarkall added the "4 - Waiting on author" (Waiting for author to respond to review) label Apr 15, 2026

xyaz1313 wants to merge 1 commit into NVIDIA:main from xyaz1313:fix/flaky-test-max-pending-count
