Add multi-GPU CI runners (t4 and h100 with 2 GPUs) #1505
/ok to test f97ef26 |
f97ef26 to 9163804
|
/ok to test 9163804 |
Add GPU_COUNT field to test matrix to support multi-GPU configurations. This enables rigorous testing of peer access, device switching, and other multi-GPU functionality in CI. Closes NVIDIA#1501
Simplifies CI configuration by moving special runner entries directly into the pull-request matrix rather than handling them separately.
Replace cp.random.default_rng() with cp.arange() to avoid requiring cuRAND library which may not match the installed CUDA version.
9163804 to 074a6ca
|
/ok to test 074a6ca |
8c0c7f4 to 3184316
|
Notes on test-matrix.yml changes:
|
|
/ok to test 3184316 |
Switch from cp.random.random() (old RNG requiring libcurand.so) to cp.random.default_rng().random() (new RNG with pre-compiled curand device libs bundled in CuPy).
|
/ok to test 4214c2c |
|
Do we want to run all of our CI jobs on these multi-GPU runners or can we only run multi-GPU specific tests / examples / etc. as needed? |
leofang left a comment
> Do we want to run all of our CI jobs on these multi-GPU runners or can we only run multi-GPU specific tests / examples / etc. as needed?
I suggested offline that we add 2 dual-GPU jobs on a per-PR basis for now and monitor the usage over the next few days. We don't have the infra for as-needed tests yet (#299 is a good start).
| special_runners: | ||
| amd64: | ||
| - { ARCH: 'amd64', PY_VER: '3.13', CUDA_VER: '13.0.2', LOCAL_CTK: '1', GPU: 'H100', DRIVER: 'latest' } | ||
| - { ARCH: 'amd64', PY_VER: '3.13', CUDA_VER: '13.1.0', LOCAL_CTK: '1', GPU: 'H100', DRIVER: 'latest' } |
My only nitpick is that it'd be nice to have a comment or code block that separates out "special runners" (including the 2-GPU ones introduced in this PR) from the regular matrix. It's easier to eyeball and update.
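To make the suggested separation concrete, here is a minimal Python sketch of how matrix entries could be partitioned by `GPU_COUNT`. The in-memory `MATRIX` list and the `split_by_gpu_count` helper are hypothetical and not part of this PR; the entries are illustrative values modeled on the fields shown in the diff.

```python
# Hypothetical in-memory copy of a few test-matrix entries.
# Values are illustrative; GPU_COUNT is a string, as in the YAML.
MATRIX = [
    {"ARCH": "amd64", "PY_VER": "3.13", "CUDA_VER": "13.1.0",
     "LOCAL_CTK": "1", "GPU": "H100", "GPU_COUNT": "1", "DRIVER": "latest"},
    {"ARCH": "amd64", "PY_VER": "3.13", "CUDA_VER": "13.1.0",
     "LOCAL_CTK": "1", "GPU": "t4", "GPU_COUNT": "2", "DRIVER": "latest"},
    {"ARCH": "amd64", "PY_VER": "3.13", "CUDA_VER": "13.1.0",
     "LOCAL_CTK": "1", "GPU": "h100", "GPU_COUNT": "2", "DRIVER": "latest"},
]


def split_by_gpu_count(entries):
    """Return (regular, multi_gpu) lists; entries without GPU_COUNT count as 1."""
    regular, multi_gpu = [], []
    for entry in entries:
        count = int(entry.get("GPU_COUNT", "1"))
        (multi_gpu if count > 1 else regular).append(entry)
    return regular, multi_gpu


regular, multi_gpu = split_by_gpu_count(MATRIX)
print(len(regular), "regular entries;", len(multi_gpu), "multi-GPU entries")
```

Grouping on the field itself, rather than on position in the file, is one way to keep the special entries easy to spot even when they live inside the pull-request matrix.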
| a = cp.random.random(size, dtype=dtype) | ||
| b = cp.random.random(size, dtype=dtype) | ||
| rng = cp.random.default_rng() | ||
| a = rng.random(size, dtype=dtype) | ||
| b = rng.random(size, dtype=dtype) |
Keep a note on this change for posterity:
- We want to encourage end users to use the new NumPy/CuPy RNG interface
- But the real reason we must do it in this PR is that the new RNG does not require libcurand to be installed. CuPy is self-contained: for all of our use cases we just need NVRTC and the driver. Our CI relies on this assumption (to save resources).
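For readers unfamiliar with the two interfaces, the sketch below contrasts them using NumPy so that it runs without a GPU. It assumes, as the note above says, that CuPy mirrors the same interface; swapping `np` for `cp` yields the lines in the diff. The `size` and `dtype` values are placeholders, and nothing here demonstrates the libcurand behavior, which is specific to CuPy.

```python
import numpy as np

size = 1024          # placeholder problem size
dtype = np.float32   # placeholder element type

# Legacy interface: module-level function backed by global RNG state.
a_legacy = np.random.random(size).astype(dtype)

# New interface: an explicit Generator object; dtype is passed to random().
rng = np.random.default_rng()
a = rng.random(size, dtype=dtype)
b = rng.random(size, dtype=dtype)

print(a.shape, a.dtype, b.shape, b.dtype)
```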
|
🎉 There are a few APIs about to land in |
|
Summary
Add support for multi-GPU CI testing by introducing a GPU_COUNT field to the test matrix and adding t4 and h100 2-GPU configurations.

Changes
ci/test-matrix.yml:
- Add GPU_COUNT field to all entries for consistency
- Add t4 and h100 with GPU_COUNT: '2'
- Remove special_runners section - entries now integrated directly into pull-request matrix

.github/workflows/test-wheel-linux.yml:
- Update runs-on to use ${{ matrix.GPU_COUNT }} instead of hardcoded -1
- Add (x2) suffix for multi-GPU tests (e.g., t4(x2))

.github/workflows/test-wheel-windows.yml:
- Update runs-on to use ${{ matrix.GPU_COUNT }} for consistency

cuda_core/examples/simple_multi_gpu_example.py:
- Switch from old RNG (cp.random.random()) to new RNG (cp.random.default_rng()) to avoid requiring libcurand.so

cuda_core/tests/test_launcher.py:

Test Coverage
- py3.13, 13.1.0, local, t4(x2)

Closes #1501
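As a rough illustration of the `runs-on` and `(x2)` changes described in the summary, the Python sketch below derives a runner label and a display name from a matrix entry. Both helpers are hypothetical, and the label layout in `runner_label` is an assumption made for illustration; only the `t4(x2)` display form and the replacement of a hardcoded `-1` by the GPU count come from the PR description.

```python
def runner_label(entry, os_name="linux"):
    """Build a runner label ending in the GPU count.

    The overall layout is an assumed format, not the one used by the
    actual workflow files.
    """
    gpu_count = entry.get("GPU_COUNT", "1")  # previously a hardcoded "1"
    return (f"{os_name}-{entry['ARCH']}-gpu-"
            f"{entry['GPU'].lower()}-{entry['DRIVER']}-{gpu_count}")


def display_name(entry):
    """Append an (xN) suffix only for multi-GPU entries, e.g. t4(x2)."""
    count = int(entry.get("GPU_COUNT", "1"))
    return f"{entry['GPU']}(x{count})" if count > 1 else entry["GPU"]


entry = {"ARCH": "amd64", "GPU": "t4", "DRIVER": "latest", "GPU_COUNT": "2"}
print(runner_label(entry))
print(display_name(entry))
```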