@Andy-Jost
Contributor

@Andy-Jost Andy-Jost commented Jan 15, 2026

Summary

Add support for multi-GPU CI testing by introducing GPU_COUNT field to the test matrix and adding t4 and h100 2-GPU configurations.

Changes

  • ci/test-matrix.yml:

    • Added GPU_COUNT field to all entries for consistency
    • Added two new multi-GPU entries: t4 and h100 with GPU_COUNT: '2'
    • Removed special_runners section - entries now integrated directly into pull-request matrix
    • Aligned columns for readability (can be reverted if needed)
  • .github/workflows/test-wheel-linux.yml:

    • Updated runs-on to use ${{ matrix.GPU_COUNT }} instead of hardcoded -1
    • Updated job name to show (x2) suffix for multi-GPU tests (e.g., t4(x2))
    • Removed special_runners handling logic (no longer needed)
  • .github/workflows/test-wheel-windows.yml:

    • Updated runs-on to use ${{ matrix.GPU_COUNT }} for consistency
  • cuda_core/examples/simple_multi_gpu_example.py:

    • Switched from old CuPy RNG (cp.random.random()) to new RNG (cp.random.default_rng()) to avoid requiring libcurand.so
  • cuda_core/tests/test_launcher.py:

    • Switched to new CuPy RNG to avoid libcurand dependency

Test Coverage

  • Multi-GPU runners are now included in the standard PR test matrix
  • Job names clearly indicate GPU count: py3.13, 13.1.0, local, t4(x2)
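
A minimal Python sketch of how a matrix entry could map to the job name shown above. The real logic lives in GitHub Actions expressions inside the workflow files, so the helper names here are hypothetical. The "wheels" wording for LOCAL_CTK: '0' and the runner label layout are assumptions for illustration, not taken from this PR.

```python
def gpu_label(entry):
    """Return the GPU part of the job name, e.g. 't4' or 't4(x2)'."""
    count = int(entry.get("GPU_COUNT", "1"))
    gpu = entry["GPU"]
    # Only multi-GPU entries get the (xN) suffix.
    return f"{gpu}(x{count})" if count > 1 else gpu


def job_name(entry):
    """Build a display name such as 'py3.13, 13.1.0, local, t4(x2)'."""
    # Assumption: LOCAL_CTK '1' is shown as 'local'; 'wheels' is a guess
    # for the other case.
    ctk = "local" if entry["LOCAL_CTK"] == "1" else "wheels"
    return f"py{entry['PY_VER']}, {entry['CUDA_VER']}, {ctk}, {gpu_label(entry)}"


def runner_label(entry):
    """Build a runner label whose last field is the GPU count."""
    # Hypothetical label layout; the point is only that GPU_COUNT
    # replaces a hardcoded trailing '1'.
    return (
        f"linux-{entry['ARCH']}-gpu-{entry['GPU']}"
        f"-{entry['DRIVER']}-{entry['GPU_COUNT']}"
    )


entry = {
    "ARCH": "amd64", "PY_VER": "3.13", "CUDA_VER": "13.1.0",
    "LOCAL_CTK": "1", "GPU": "t4", "GPU_COUNT": "2", "DRIVER": "latest",
}
print(job_name(entry))  # py3.13, 13.1.0, local, t4(x2)
```

Single-GPU entries keep their existing names because the suffix is only added when GPU_COUNT is greater than 1.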

Closes #1501

@Andy-Jost Andy-Jost self-assigned this Jan 15, 2026
@Andy-Jost
Contributor Author

/ok to test f97ef26

@Andy-Jost
Contributor Author

/ok to test 9163804

Add GPU_COUNT field to test matrix to support multi-GPU configurations.
This enables rigorous testing of peer access, device switching, and
other multi-GPU functionality in CI.

Closes NVIDIA#1501

Simplifies CI configuration by moving special runner entries directly
into the pull-request matrix rather than handling them separately.

Replace cp.random.default_rng() with cp.arange() to avoid requiring
the cuRAND library, which may not match the installed CUDA version.

@Andy-Jost
Contributor Author

/ok to test 074a6ca

@Andy-Jost
Contributor Author

Andy-Jost commented Jan 15, 2026

Notes on test-matrix.yml changes:

  • Added GPU_COUNT field to all entries for consistency and to support multi-GPU runners (t4 and h100 with 2 GPUs).
  • Aligned columns for readability. This is optional and can be reverted if anyone objects.
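
Since every entry is now expected to carry GPU_COUNT, a small consistency check can catch entries that omit it. The sketch below is illustrative only: check_matrix is a hypothetical helper, not part of this PR, and it assumes the matrix has already been loaded into a list of dicts with GPU_COUNT stored as a quoted string such as '2'.

```python
def check_matrix(entries):
    """Return a list of problems; an empty list means the matrix is consistent."""
    problems = []
    for index, entry in enumerate(entries):
        count = entry.get("GPU_COUNT")
        if count is None:
            problems.append(f"entry {index}: missing GPU_COUNT")
        elif not (isinstance(count, str) and count.isdecimal() and int(count) >= 1):
            # The matrix stores values as quoted strings, e.g. GPU_COUNT: '2'.
            problems.append(
                f"entry {index}: GPU_COUNT must be a quoted positive integer, "
                f"got {count!r}"
            )
    return problems


entries = [
    {"GPU": "t4", "GPU_COUNT": "2"},
    {"GPU": "h100"},                   # missing field
    {"GPU": "t4", "GPU_COUNT": 2},     # unquoted integer
]
for problem in check_matrix(entries):
    print(problem)
```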

@Andy-Jost
Contributor Author

/ok to test 3184316

Switch from cp.random.random() (old RNG requiring libcurand.so) to
cp.random.default_rng().random() (new RNG with pre-compiled curand
device libs bundled in CuPy).

@Andy-Jost
Contributor Author

/ok to test 4214c2c

@Andy-Jost Andy-Jost changed the title from "[WIP] Test multi-GPU CI runners" to "Add multi-GPU CI runners (t4 and h100 with 2 GPUs)" Jan 15, 2026
@Andy-Jost Andy-Jost marked this pull request as ready for review January 15, 2026 23:50
@kkraus14
Collaborator

Do we want to run all of our CI jobs on these multi-GPU runners or can we only run multi-GPU specific tests / examples / etc. as needed?

Member

@leofang leofang left a comment

> Do we want to run all of our CI jobs on these multi-GPU runners or can we only run multi-GPU specific tests / examples / etc. as needed?

I suggested offline that we add 2 dual-GPU jobs on a per-PR basis for now, and monitor the usage over the next few days. We don't have the infra for as-needed tests yet (#299 is a good start).

Comment on lines -53 to -56
special_runners:
  amd64:
    - { ARCH: 'amd64', PY_VER: '3.13', CUDA_VER: '13.0.2', LOCAL_CTK: '1', GPU: 'H100', DRIVER: 'latest' }
    - { ARCH: 'amd64', PY_VER: '3.13', CUDA_VER: '13.1.0', LOCAL_CTK: '1', GPU: 'H100', DRIVER: 'latest' }
Member

My only nitpick is that it'd be nice to have a comment or code block that separates out "special runners" (including the 2-GPU ones introduced in this PR) from the regular matrix. It's easier to eyeball and update.

Comment on lines -91 to +93
- a = cp.random.random(size, dtype=dtype)
- b = cp.random.random(size, dtype=dtype)
+ rng = cp.random.default_rng()
+ a = rng.random(size, dtype=dtype)
+ b = rng.random(size, dtype=dtype)
Member

@leofang leofang Jan 16, 2026

Keep a note on this change for posterity:

  1. We want to encourage end users to use the new NumPy/CuPy RNG interface
  2. But the real reason that we must do it in this PR is: The new RNG does not require libcurand to be installed. CuPy is self-contained -- for all of our use cases we just need NVRTC and driver. Our CI relies on this assumption (to save resources).

@leofang leofang added this to the cuda.core beta 12 milestone Jan 16, 2026
@leofang leofang added the P0 (High priority - Must do!), CI/CD (CI/CD infrastructure), and enhancement (Any code-related improvements) labels Jan 16, 2026
@mdboom
Contributor

mdboom commented Jan 16, 2026

🎉 There are a few APIs about to land in cuda.core.system related to multiple GPUs that could benefit from this testing.

@leofang leofang merged commit 53c8d4a into NVIDIA:main Jan 16, 2026
88 checks passed