How is this issue impacting you?
Lower performance than expected
Reproduction Steps
# Set NCCL algorithm to Ring
export NCCL_ALGO=Ring
# Run all_reduce_perf test for fp32 sum operation
./build/all_reduce_perf -b 8 -e 8g -f 2 -g 2 -d float -o sum -w 20
Note: The analysis focuses on the all_reduce_ring_simple_f32_sum operation as a representative case. This does not imply that other operations/configurations are unaffected.
Parameters:
-b 8: minbytes = 8 bytes (minimum transfer size)
-e 8g: maxbytes = 8GB (maximum transfer size)
-f 2: stepfactor = 2 (size increment factor)
-g 2: ngpus = 2 (GPUs per thread)
-d float: datatype = float32
-o sum: op = sum (reduction operation)
-w 20: warmup_iters = 20 (warmup iterations)
Environment
- GPU: NVIDIA H200
- NCCL Version Tested: 2.22, 2.24, 2.29
- NCCL Test Version: v2.17.2 (commit abc46770a98777a9fd1b072adcf8becb76bfe125)
- CUDA Version: 13.1
- Driver Version: 590.44.01
Expected Behavior
NCCL 2.24/2.29 should have similar or better performance compared to NCCL 2.22.
Actual Behavior
After upgrading from NCCL 2.22 to 2.24/2.29:
- Local memory instructions (ldl/stl) increased significantly:
  - NCCL 2.22: 138 stl + 273 ldl = 411 total
  - NCCL 2.29: 178 stl + 420 ldl = 598 total (+45%)
- Performance degradation: measured in the all_reduce_sum_f32 operation.
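For anyone who wants to reproduce the ldl/stl counts above, here is a minimal sketch of how such a count could be taken from a SASS text dump. It assumes the dump was produced with something like `cuobjdump -sass` on the NCCL library and already narrowed down to the kernel of interest. The file names and the helper functions `count_local_mem` and `pct_increase` are hypothetical, and this is not necessarily the method used for the attached analysis.

```shell
#!/usr/bin/env bash
# Sketch: count local-memory loads/stores (LDL/STL) in a SASS text dump.
# Assumption: the dump came from something like
#   cuobjdump -sass libnccl.so > nccl.sass
# and has one instruction per line, with the opcode preceded by whitespace.

count_local_mem() {
  local sass_file="$1"
  local ldl stl
  # grep -c counts matching lines; modifier suffixes such as .LU or .64 are allowed.
  ldl=$(grep -cE '[[:space:]]LDL(\.[A-Z0-9]+)*[[:space:]]' "$sass_file")
  stl=$(grep -cE '[[:space:]]STL(\.[A-Z0-9]+)*[[:space:]]' "$sass_file")
  echo "ldl=$ldl stl=$stl total=$((ldl + stl))"
}

# Integer percentage increase from $1 (old) to $2 (new), rounded down.
pct_increase() {
  echo $(( ($2 - $1) * 100 / $1 ))
}
```

Example usage, with hypothetical dump file names:

```shell
count_local_mem nccl-2.22.sass
count_local_mem nccl-2.29.sass
pct_increase 411 598   # prints 45
```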
Root Cause Analysis
Comparing the NCCL 2.22 and 2.24 sources, we found that a large number of if-else
branches were added to the waitPeer function in NCCL 2.24. We suspect this additional
control flow increases register pressure, which would explain both the extra ldl/stl
instructions and the performance regression.
Suspected Area
- Function: waitPeer
- Change: Significant if-else branches added in NCCL 2.24
- Impact: May cause more register spilling and local memory access
Attachments
SASS analysis between NCCL 2.22 and 2.29
ncclsass.zip