Use cuda.bindings.path_finder in cuda.parallel wheel #4735
Conversation
… from `cuda.parallel`
Proof of concept commit 3e7a419, which simply
Tested locally: Ubuntu 24.04 with:
#!/bin/bash
set -euo pipefail
if [[ "$(realpath .)" != */cccl/python ]]; then
echo "Please cd cccl/python"
exit 1
fi
CCCL_PYTHON="$(pwd)"
rm -rf venvs/scratch/cpwheels
/usr/bin/python -m venv venvs/scratch/cpwheels
. venvs/scratch/cpwheels/bin/activate
pip install --upgrade pip
pip install pytest typing_extensions
cd "$CCCL_PYTHON/cuda_cccl"
rm -f cuda_cccl-*.whl
pip wheel -v .
pip install cuda_cccl-*.whl
cd "$CCCL_PYTHON/cuda_parallel"
rm -f cuda_parallel-*.whl
# nvcc → PATH
PATH="/usr/local/cuda/bin:$PATH" pip wheel -v .
pip install cuda_parallel-*.whl
find "$CCCL_PYTHON"/venvs/scratch/cpwheels/lib/python3.*/site-packages/cuda/parallel -name '*.so*' -print -exec ldd {} \;
# nvdisasm → PATH
(cd tests && PATH="/usr/local/cuda/bin:$PATH" python -m pytest --log-cli-level=INFO -s -vv test_reduce_api.py)
pip install nvidia-cuda-runtime-cu12 nvidia-cuda-nvrtc-cu12 nvidia-nvjitlink-cu12
# nvdisasm → PATH
(cd tests && PATH="/usr/local/cuda/bin:$PATH" python -m pytest --log-cli-level=INFO -s -vv test_reduce_api.py)
Full output: build_test_cuda_parallel_wheel_log_2025-05-18+2037.txt
Excerpts:
This shows that the two …
Note: Symbols provided by …
But loading those libraries at runtime before those symbols are needed is sufficient:
It works with libraries loaded from …
It also works with libraries loaded from wheels:
Action (failed): https://github.com/NVIDIA/cccl/actions/runs/15103908290/job/42449191939
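(Aside: to make the "loaded from system CTK" vs. "loaded from wheels" observations above easy to reproduce, a small helper along the lines of the sketch below can be dropped into a test. It is not part of this PR; the function name and filter pattern are placeholders. It scans /proc/self/maps on Linux to show which files back the mapped CUDA libraries.)

# Hypothetical debugging helper (not part of this PR): print the files that
# back the currently mapped CUDA libraries, to confirm whether they came
# from the system CTK or from site-packages wheels. Linux-only.
import re

def report_loaded_cuda_libs(pattern=r"lib(nvrtc|nvJitLink|cudart)[^/]*\.so"):
    seen = set()
    with open("/proc/self/maps") as maps:
        for line in maps:
            fields = line.split()
            # File-backed mappings have the path as the sixth field.
            path = fields[-1] if len(fields) >= 6 else ""
            if re.search(pattern, path) and path not in seen:
                seen.add(path)
                print(path)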
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here.
🟨 CI finished in 39m 20s: Pass: 85%/14 | Total: 1h 44m | Avg: 7m 26s | Max: 20m 14s
| Project | |
|---|---|
| CCCL Infrastructure | |
| libcu++ | |
| CUB | |
| Thrust | |
| CUDA Experimental | |
| stdpar | |
| +/- | python |
| +/- | CCCL C Parallel Library |
| Catch2Helper |
Modifications in project or dependencies?
| Project | |
|---|---|
| CCCL Infrastructure | |
| libcu++ | |
| CUB | |
| Thrust | |
| CUDA Experimental | |
| stdpar | |
| +/- | python |
| +/- | CCCL C Parallel Library |
| Catch2Helper |
🏃 Runner counts (total jobs: 14)
| # | Runner |
|---|---|
| 7 | linux-amd64-cpu16 |
| 6 | linux-amd64-gpu-rtxa6000-latest-1 |
| 1 | linux-amd64-gpu-rtx2080-latest-1 |
…es inside the container are not silently ignored.
/ok to test
Action (failure): https://github.com/NVIDIA/cccl/actions/runs/15118789339/job/42496209585?pr=4735
Issue: We need to add back …
…the changes in c/parallel/CMakeLists.txt)
I hadn't hit save on the issue; I just did, and then saw this. I think the best course of action is to add them back in; they stay in as needed, and then …
cryos left a comment:
Some suggestions
Co-authored-by: Marcus D. Hanwell <[email protected]>
Does it not work for some reason if we did this directly in (the existing) _bindings.pyx?
Yes, I think it cannot work without introducing this layer of indirection.
These are the ldd outputs with a wheel before I made the _bindings_impl change:
$ ldd cuda/parallel/experimental/_bindings.cpython-312-x86_64-linux-gnu.so
linux-vdso.so.1 (0x00007927a2d5d000)
libcccl.c.parallel.so => /home/rgrossekunst/junk/cuda/parallel/experimental/cccl/libcccl.c.parallel.so (0x00007927a2b51000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007927a2800000)
libnvrtc.so.12 => /usr/local/cuda/targets/x86_64-linux/lib/libnvrtc.so.12 (0x000079279be00000)
libnvJitLink.so.12 => /usr/local/cuda/targets/x86_64-linux/lib/libnvJitLink.so.12 (0x0000792796000000)
libcuda.so.1 => /lib/x86_64-linux-gnu/libcuda.so.1 (0x0000792790600000)
libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x0000792790200000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007927a2a4e000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007927a2a20000)
/lib64/ld-linux-x86-64.so.2 (0x00007927a2d5f000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007927a2a1b000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007927a2a16000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007927a27fb000)
$ ldd cuda/parallel/experimental/cccl/libcccl.c.parallel.so
linux-vdso.so.1 (0x00007b2d8cc0a000)
libnvrtc.so.12 => /usr/local/cuda/targets/x86_64-linux/lib/libnvrtc.so.12 (0x00007b2d86000000)
libnvJitLink.so.12 => /usr/local/cuda/targets/x86_64-linux/lib/libnvJitLink.so.12 (0x00007b2d80200000)
libcuda.so.1 => /lib/x86_64-linux-gnu/libcuda.so.1 (0x00007b2d7a800000)
libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007b2d7a400000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007b2d8c990000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007b2d8c960000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007b2d7a000000)
/lib64/ld-linux-x86-64.so.2 (0x00007b2d8cc0c000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007b2d8c95b000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007b2d8c956000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007b2d8c951000)
We have a number of _bindings imports:
$ git grep -e 'import.*_bindings'
cuda/parallel/experimental/algorithms/_merge_sort.py:from .. import _bindings
cuda/parallel/experimental/algorithms/_radix_sort.py:from .. import _bindings
cuda/parallel/experimental/algorithms/_reduce.py:from .. import _bindings
cuda/parallel/experimental/algorithms/_scan.py:from .. import _bindings
cuda/parallel/experimental/algorithms/_segmented_reduce.py:from .. import _bindings
cuda/parallel/experimental/algorithms/_transform.py:from .. import _bindings
cuda/parallel/experimental/algorithms/_unique_by_key.py:from .. import _bindings
tests/test_bindings.py:import cuda.parallel.experimental._bindings as bindings
When any of those run, we need to ensure that path_finder loads nvrtc and nvJitLink before the cython extension is imported.
(That's basically what I need to explain in the comment you requested.)
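For context, the indirection has roughly this shape; this is a minimal sketch, not the verbatim file. The exact import of the path_finder helper is an assumption on my part; the loop and the star import mirror the review hunk quoted further down.

# cuda/parallel/experimental/_bindings.py (illustrative sketch, not the verbatim file)
from cuda.bindings.path_finder import _load_nvidia_dynamic_library  # assumed import path

# Preload the libraries that libcccl.c.parallel.so (and therefore the Cython
# extension) needs, so that importing _bindings_impl can always resolve them.
for libname in ("nvrtc", "nvJitLink"):
    _load_nvidia_dynamic_library(libname)

# Only now pull in the compiled extension.
from ._bindings_impl import *  # noqa: E402 F403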
Thanks - I think that would also be useful to include in your comments (why we need the indirection)
Hmmmm I wonder why this linking happens
$ ldd cuda/parallel/experimental/_bindings.cpython-312-x86_64-linux-gnu.so
...
libnvrtc.so.12 => /usr/local/cuda/targets/x86_64-linux/lib/libnvrtc.so.12 (0x000079279be00000)
libnvJitLink.so.12 => /usr/local/cuda/targets/x86_64-linux/lib/libnvJitLink.so.12 (0x0000792796000000)
...
I don't think the binding extension module needs to link to NVRTC/nvJitLink? @oleksandr-pavlyk, I can't tell how they sneaked in by inspecting CMakeLists.txt.
Good question, it was like that before this PR†. I figure it won't change anything for this PR, because the cython extension needs to depend on libcccl.c.parallel.so, which needs to depend on nvrtc and nvJitLink.
†I double-checked to be sure what I'm writing is correct, inspecting the same wheel I used for the size comparison:
$ ldd cuda/parallel/experimental/_bindings.cpython-313-x86_64-linux-gnu.so
linux-vdso.so.1 (0x00007a0bef2d2000)
libcccl.c.parallel.so => /home/rgrossekunst/junk/before/cuda/parallel/experimental/cccl/libcccl.c.parallel.so (0x00007a0bef11d000)
libcuda.so.1 => /lib/x86_64-linux-gnu/libcuda.so.1 (0x00007a0be9800000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007a0bef100000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007a0be9400000)
libcudart-381c0faa.so.12.9.37 => /home/rgrossekunst/junk/before/cuda/parallel/experimental/cccl/../../../../cuda_parallel.libs/libcudart-381c0faa.so.12.9.37 (0x00007a0be9000000)
libnvrtc-b0064dcb.so.12.9.41 => /home/rgrossekunst/junk/before/cuda/parallel/experimental/cccl/../../../../cuda_parallel.libs/libnvrtc-b0064dcb.so.12.9.41 (0x00007a0be2600000)
libnvJitLink-21ab891f.so.12.9.41 => /home/rgrossekunst/junk/before/cuda/parallel/experimental/cccl/../../../../cuda_parallel.libs/libnvJitLink-21ab891f.so.12.9.41 (0x00007a0bdc800000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007a0bef0f9000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007a0bef0f4000)
libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007a0bdc400000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007a0be9717000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007a0bef0c4000)
/lib64/ld-linux-x86-64.so.2 (0x00007a0bef2d4000)
$ ldd cuda/parallel/experimental/cccl/libcccl.c.parallel.so
linux-vdso.so.1 (0x000073e7b4dc5000)
libcudart-381c0faa.so.12.9.37 => /home/rgrossekunst/junk/before/cuda/parallel/experimental/cccl/../../../../cuda_parallel.libs/libcudart-381c0faa.so.12.9.37 (0x000073e7b4800000)
libnvrtc-b0064dcb.so.12.9.41 => /home/rgrossekunst/junk/before/cuda/parallel/experimental/cccl/../../../../cuda_parallel.libs/libnvrtc-b0064dcb.so.12.9.41 (0x000073e7ade00000)
libnvJitLink-21ab891f.so.12.9.41 => /home/rgrossekunst/junk/before/cuda/parallel/experimental/cccl/../../../../cuda_parallel.libs/libnvJitLink-21ab891f.so.12.9.41 (0x000073e7a8000000)
libcuda.so.1 => /lib/x86_64-linux-gnu/libcuda.so.1 (0x000073e7a2600000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x000073e7b4c8e000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x000073e7b4c87000)
libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x000073e7a2200000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x000073e7b4b9e000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x000073e7b4b70000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x000073e7b4b6b000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x000073e7a1e00000)
/lib64/ld-linux-x86-64.so.2 (0x000073e7b4dc7000)
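(Side note, not from this PR: one way to distinguish a direct dependency from a transitive one that ldd also resolves is to read the DT_NEEDED entries, e.g. with pyelftools, which the wheel-repair environment already installs per the auditwheel log further down. Sketch below; the file path is a placeholder.)

from elftools.elf.elffile import ELFFile  # pip install pyelftools

def direct_needed(path):
    # Return only the shared object's own DT_NEEDED entries; ldd, by
    # contrast, resolves and prints the full transitive closure.
    with open(path, "rb") as f:
        dynamic = ELFFile(f).get_section_by_name(".dynamic")
        return [tag.needed for tag in dynamic.iter_tags() if tag.entry.d_tag == "DT_NEEDED"]

# Placeholder path; point this at the actual extension module file.
print(direct_needed("cuda/parallel/experimental/_bindings.cpython-312-x86_64-linux-gnu.so"))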
(corrected my autospell checker mistake, sorry)
Sorry, I marked this thread resolved but I am reopening it...
To be pedantic, the old treatment (pre-loading in root __init__.py) is actually slightly safer than the new/current treatment, which relies on _bindings being imported first (by other modules or by the developer). But what if someone decides to import _bindings_impl directly without importing _bindings? Then I don't think preloading would happen!
Either we revert back to the old treatment, or we move _bindings_impl one level down, something like
- _bindings/
- _bindings/__init__.py # run preloading here
- _bindings/_bindings_impl.pyx
then we can guarantee the preloading always happens regardless how import happens, as desired.
But what if someone decides to import _bindings_impl directly without importing _bindings?
Everything here is private (leading underscores), so: That's simply invalid. Nobody will be surprised if that doesn't work.
I think it's best to have the preloading only if it is actually needed (current implementation).
Previously it would trigger even for unrelated imports, literally anything under cuda.parallel.experimental. That was great and easy for the proof-of-concept stage, but poor factoring from a more purist viewpoint.
Honestly, I think making the code organization more complicated by introducing a _bindings/ directory is a bit too elaborate.
OK as long as our current binding maintainer @oleksandr-pavlyk is cool I am cool. As I said this is pedantic, I just feel obligated to call it out 🙂
It crossed my mind last night: Automatic doc generation could try to import _bindings_impl
But moving this to _bindings/__init__.py would only make a difference if all dependencies are fully installed. AFAIK we don't have a setup like that. If we want to work on that in the future, and we actually run into a problem, it'll probably only be a tiny extra chore to introduce _bindings/__init__.py then.
Awesome, thanks for the investigation, and the great report!
Yes, that's something we discussed briefly but left for later. There is a hint about that in the existing README here. (I was thinking this applies to Windows, too.)
I tried to simply use …
🟩 CI finished in 1h 57m: Pass: 100%/187 | Total: 2d 10h | Avg: 18m 43s | Max: 1h 29m | Hits: 84%/289419
| Project | |
|---|---|
| +/- | CCCL Infrastructure |
| libcu++ | |
| CUB | |
| Thrust | |
| CUDA Experimental | |
| stdpar | |
| +/- | python |
| +/- | CCCL C Parallel Library |
| Catch2Helper |
Modifications in project or dependencies?
| Project | |
|---|---|
| +/- | CCCL Infrastructure |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| +/- | CUDA Experimental |
| +/- | stdpar |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |
🏃 Runner counts (total jobs: 187)
| # | Runner |
|---|---|
| 129 | linux-amd64-cpu16 |
| 15 | windows-amd64-cpu16 |
| 12 | linux-arm64-cpu16 |
| 12 | linux-amd64-gpu-rtxa6000-latest-1 |
| 11 | linux-amd64-gpu-rtx2080-latest-1 |
| 5 | linux-amd64-gpu-h100-latest-1 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
It might be useful for debugging to look at the wheel name before and after the call to …

for libname in ("nvrtc", "nvJitLink"):
    _load_nvidia_dynamic_library(libname)

from ._bindings_impl import *  # noqa: E402 F403
I would bite the bullet and explicitly enumerate imported symbols here. All exported symbols are documented in _bindings.pyi.
from ._bindings_impl import (
IntEnumerationMember,
TypeEnum,
OpKind,
IteratorKind,
SortOrder,
is_TypeEnum,
is_OpKind,
is_IteratorKind,
is_SortOrder,
Op,
TypeInfo,
Value,
Pointer,
make_pointer_object,
IteratorState,
Iterator,
CommonData,
DeviceReduceBuildResult,
DeviceScanBuildResult,
DeviceSegmentedReduceBuildResult,
DeviceMergeSortBuildResult,
DeviceUniqueByKeyBuildResult,
DeviceRadixSortBuildResult,
DeviceUnaryTransform,
DeviceBinaryTransform,
)
Also, was renaming of _bindings.pyi necessary?
I would bite the bullet and explicitly enumerate imported symbols here. All exported symbols are documented in _bindings.pyi.
Generally I prefer DRY-ness over investing human energy (now and on-going) into working around a lack of smartness in tooling. But I'm happy to go with the flow: @shwina, @leofang what's your take on this?
Also, was renaming of _bindings.pyi necessary?
I don't know, will find out (should be quick).
Don't worry about explicit import. I still have it on my plate to auto-generate the C bindings, and by the time we sort it out this would be moot. We've conquered so many C libraries, there's no reason I can't autogenerate for CCCL C 🙂
Also, was renaming of _bindings.pyi necessary?
It turns out "no" (thanks): commit 04c8b0a
I think this is one of the few cases where import * is actually the most idiomatic solution. We can define __all__ in the module being imported from to define exactly what symbols get imported by import *.
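For illustration, the __all__ side could look like the sketch below (placeholder subset of names; the authoritative list lives in _bindings.pyi); the existing star import in _bindings.py then re-exports exactly these names.

# In _bindings_impl (sketch): with __all__ defined, the existing
# `from ._bindings_impl import *` in _bindings.py imports only these names.
__all__ = [
    "TypeEnum",
    "OpKind",
    "IteratorKind",
    "SortOrder",
    "Iterator",
    # ... remaining public symbols, mirroring _bindings.pyi
]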
Copy-pasting from CI log file below. The original and modified wheel names lined up for easy comparison: …
This looks like a smoking gun: …
For easy reference, the command leading to that message is:
python -m auditwheel repair \
--exclude 'libcuda.so*' \
--exclude 'libnvrtc.so*' \
--exclude 'libnvJitLink.so*' \
cuda_parallel-*.whl
How is that requesting …
https://github.com/NVIDIA/cccl/actions/runs/15166879093/job/42647012005?pr=4735
logs_38920744335.zip
For comparison: https://github.com/NVIDIA/cccl/actions/runs/15167229298/job/42648241197
And the diff of selected parts (I removed the …):
Successfully installed auditwheel-6.3.0 packaging-25.0 patchelf-0.17.2.2 pyelftools-0.32
-[notice] A new release of pip is available: 25.0.1 -> 25.1.1
INFO:auditwheel.main_repair:Repairing cuda_parallel-0.1.3.1.0-cp313-cp313-linux_x86_64.whl
+INFO:auditwheel.main_repair:Wheel is eligible for a higher priority tag. You requested manylinux_2_28_x86_64 but I have found this wheel is eligible for manylinux_2_27_x86_64.
INFO:auditwheel.wheeltools:Previous filename tags: linux_x86_64
-INFO:auditwheel.wheeltools:New filename tags: manylinux_2_28_x86_64
+INFO:auditwheel.wheeltools:New filename tags: manylinux_2_27_x86_64, manylinux_2_28_x86_64
INFO:auditwheel.wheeltools:Previous WHEEL info tags: cp313-cp313-linux_x86_64
-INFO:auditwheel.wheeltools:New WHEEL info tags: cp313-cp313-manylinux_2_28_x86_64
+INFO:auditwheel.wheeltools:New WHEEL info tags: cp313-cp313-manylinux_2_27_x86_64, cp313-cp313-manylinux_2_28_x86_64
INFO:auditwheel.main_repair:
-Fixed-up wheel written to /workspace/python/cuda_parallel/wheelhouse/cuda_parallel-0.1.3.1.0-cp313-cp313-manylinux_2_28_x86_64.whl
+Fixed-up wheel written to /workspace/python/cuda_parallel/wheelhouse/cuda_parallel-0.1.3.1.0-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl
I'm still mystified ...
Looks like the wheel name we have is OK. It's called "compressed tag sets": pypa/auditwheel#314. I checked locally (we can add this to the CI later) that …
If we're really unhappy about the name, I think it can be changed manually:
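(If we ever want to collapse the compressed tag set into a single platform tag, one option that I believe would work is the wheel package's tags subcommand, sketched below via subprocess. The wheel filename is a placeholder, and the flag spelling assumes a reasonably recent wheel release.)

# Hypothetical sketch: rewrite the wheel's platform tag to a single value,
# which also updates the WHEEL metadata (unlike a bare file rename).
import subprocess
import sys

subprocess.run(
    [
        sys.executable, "-m", "wheel", "tags",
        "--platform-tag", "manylinux_2_28_x86_64",
        "cuda_parallel-0.1.3.1.0-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl",  # placeholder
    ],
    check=True,
)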
(I still don't know why multiple platform tags were generated, though. Was only saying that the name we have seems to be permitted by the standard.)
btw @rwgk there's a conflict that we need to resolve.
According to this ChatGPT conversation: ✅ TL;DR
I'll take care of it before |
Thanks both - sounds like we're good to merge!
Another round: two commits I have already, plus I'm just about to work on the comment for the preloading. For the …
If you have a different preference, please let me know. Hopefully only one more CI run will get us there.
Sorry, I saw this too late (only just now).
Can we fix it?
…e name)" This reverts commit 46cac6c.
Done: commit d5d5c97
🟨 CI finished in 2h 14m: Pass: 95%/187 | Total: 2d 18h | Avg: 21m 11s | Max: 1h 23m | Hits: 87%/289419
| Project | |
|---|---|
| +/- | CCCL Infrastructure |
| libcu++ | |
| CUB | |
| Thrust | |
| CUDA Experimental | |
| stdpar | |
| +/- | python |
| +/- | CCCL C Parallel Library |
| Catch2Helper |
Modifications in project or dependencies?
| Project | |
|---|---|
| +/- | CCCL Infrastructure |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| +/- | CUDA Experimental |
| +/- | stdpar |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |
🏃 Runner counts (total jobs: 187)
| # | Runner |
|---|---|
| 129 | linux-amd64-cpu16 |
| 15 | windows-amd64-cpu16 |
| 12 | linux-arm64-cpu16 |
| 12 | linux-amd64-gpu-rtxa6000-latest-1 |
| 11 | linux-amd64-gpu-rtx2080-latest-1 |
| 5 | linux-amd64-gpu-h100-latest-1 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
🟩 CI finished in 5h 16m: Pass: 100%/187 | Total: 2d 18h | Avg: 21m 27s | Max: 1h 23m | Hits: 87%/289419
| Project | |
|---|---|
| +/- | CCCL Infrastructure |
| libcu++ | |
| CUB | |
| Thrust | |
| CUDA Experimental | |
| stdpar | |
| +/- | python |
| +/- | CCCL C Parallel Library |
| Catch2Helper |
Modifications in project or dependencies?
| Project | |
|---|---|
| +/- | CCCL Infrastructure |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| +/- | CUDA Experimental |
| +/- | stdpar |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |
🏃 Runner counts (total jobs: 187)
| # | Runner |
|---|---|
| 129 | linux-amd64-cpu16 |
| 15 | windows-amd64-cpu16 |
| 12 | linux-arm64-cpu16 |
| 12 | linux-amd64-gpu-rtxa6000-latest-1 |
| 11 | linux-amd64-gpu-rtx2080-latest-1 |
| 5 | linux-amd64-gpu-h100-latest-1 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
I'll go ahead and merge this PR, thanks for all the feedback! Happy to do follow-on work, while we're already enjoying the vastly smaller wheel sizes.
Thanks, @rwgk and all! Great to see everything pieced together!
Description
Closes #3979