[Issue]: Upgrading NCCL from 2.22 to 2.24/2.29 causes significant performance regression in collective operations due to increased local memory instructions (ldl/stl). #1997

@Shengqi-Pan

How is this issue impacting you?

Lower performance than expected

Reproduction Steps

# Set NCCL algorithm to Ring
export NCCL_ALGO=Ring

# Run all_reduce_perf test for fp32 sum operation
./build/all_reduce_perf -b 8 -e 8g -f 2 -g 2 -d float -o sum -w 20

Note: The analysis focuses on the all_reduce_ring_simple_f32_sum operation as a representative case. This does not imply that other operations/configurations are unaffected.

Parameters:

  • -b 8: minbytes = 8 bytes (minimum transfer size)
  • -e 8g: maxbytes = 8GB (maximum transfer size)
  • -f 2: stepfactor = 2 (size increment factor)
  • -g 2: ngpus = 2 (GPUs per thread)
  • -d float: datatype = float32
  • -o sum: op = sum (reduction operation)
  • -w 20: warmup_iters = 20 (warmup iterations)
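As a rough check of how many message sizes this sweep covers, the sketch below enumerates them. It assumes that the `g` suffix means 1024^3 bytes and that the upper bound is inclusive; `parse_size` and `sweep_sizes` are hypothetical helpers written for this illustration, not part of nccl-tests.

```python
def parse_size(text: str) -> int:
    """Parse sizes such as '8' or '8g', assuming 1024-based suffixes."""
    units = {"k": 1024, "m": 1024**2, "g": 1024**3}
    suffix = text[-1].lower()
    if suffix in units:
        return int(text[:-1]) * units[suffix]
    return int(text)


def sweep_sizes(min_bytes: int, max_bytes: int, factor: int) -> list[int]:
    """List every size from min_bytes up to max_bytes (inclusive), multiplying by factor."""
    sizes = []
    size = min_bytes
    while size <= max_bytes:
        sizes.append(size)
        size *= factor
    return sizes


# Mirrors -b 8 -e 8g -f 2 from the reproduction command.
sizes = sweep_sizes(parse_size("8"), parse_size("8g"), 2)
print(len(sizes), sizes[0], sizes[-1])
```

Under these assumptions the sweep runs from 2^3 to 2^33 bytes, one data point per power of two.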

Environment

  • GPU: NVIDIA H200
  • NCCL Version Tested: 2.22, 2.24, 2.29
  • NCCL Test Version: v2.17.2 (commit abc46770a98777a9fd1b072adcf8becb76bfe125)
  • CUDA Version: 13.1
  • Driver Version: 590.44.01

Expected Behavior

NCCL 2.24/2.29 should have similar or better performance compared to NCCL 2.22.

Actual Behavior

After upgrading from NCCL 2.22 to 2.24/2.29:

  1. Local memory instructions (ldl/stl) increased significantly:

    • NCCL 2.22: 138 stl + 273 ldl = 411 total
    • NCCL 2.29: 178 stl + 420 ldl = 598 total (+45%)
  2. Performance degradation: the fp32 sum all-reduce (all_reduce_ring_simple_f32_sum) runs measurably slower than on NCCL 2.22.
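For anyone who wants to reproduce the instruction counts, here is a minimal sketch of one way to tally LDL/STL opcodes in a SASS text dump (for example, the text produced by `cuobjdump -sass`). The helper names and the sample lines are made up for illustration; this is not necessarily how the numbers above were obtained.

```python
import re
from collections import Counter

# Matches LDL/STL opcodes, including suffixed forms such as LDL.64 or STL.128.
LOCAL_MEM_OP = re.compile(r"\b(LDL|STL)\b")


def count_local_mem_ops(sass_text: str) -> Counter:
    """Count lines containing a local-memory load (LDL) or store (STL)."""
    counts = Counter({"LDL": 0, "STL": 0})
    for line in sass_text.splitlines():
        match = LOCAL_MEM_OP.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def percent_increase(before: int, after: int) -> float:
    """Relative growth from before to after, in percent."""
    return (after - before) / before * 100.0


# Made-up SASS lines, for illustration only.
sample = """\
        /*0000*/  LDL.64 R2, [R1+0x8] ;
        /*0010*/  STL [R1+0x10], R4 ;
        /*0020*/  @P0 LDL R5, [R1] ;
        /*0030*/  LDG.E R6, [R2.64] ;
"""
print(count_local_mem_ops(sample))

# Totals reported in this issue: 138 + 273 for 2.22, 178 + 420 for 2.29.
print(round(percent_increase(138 + 273, 178 + 420), 1))
```

Running the same count on the dumps for each NCCL version, restricted to the same kernel, should give directly comparable totals.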

Image

Root Cause Analysis

Comparing the NCCL 2.22 and 2.24 sources, we found that NCCL 2.24 added a large number of
if-else branches to the waitPeer function. We suspect this additional control flow is what
increases the number of ldl/stl instructions and causes the performance regression.

Image

Suspected Area

  • Function: waitPeer
  • Change: A large number of if-else branches added in NCCL 2.24
  • Impact: May cause more register spilling and local memory access
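One way to test the spilling hypothesis would be to compare the per-kernel spill statistics that ptxas prints in verbose mode (nvcc's `-Xptxas -v`) for both versions. The sketch below parses that summary line; it assumes the kernels can be rebuilt with that flag, and the helper name and sample log are hypothetical.

```python
import re

# ptxas -v prints a summary line of this shape for each compiled function.
SPILL_LINE = re.compile(
    r"(\d+) bytes stack frame, (\d+) bytes spill stores, (\d+) bytes spill loads"
)


def parse_spills(ptxas_log: str) -> list[dict]:
    """Extract stack frame and spill byte counts from ptxas verbose output."""
    results = []
    for match in SPILL_LINE.finditer(ptxas_log):
        stack, stores, loads = map(int, match.groups())
        results.append(
            {"stack_frame": stack, "spill_stores": stores, "spill_loads": loads}
        )
    return results


# Made-up log excerpt, for illustration only; not measured on NCCL.
sample_log = """\
ptxas info    : Function properties for example_kernel
    128 bytes stack frame, 96 bytes spill stores, 100 bytes spill loads
ptxas info    : Used 96 registers
"""
print(parse_spills(sample_log))
```

If the spill store and load bytes for the affected kernel grow between 2.22 and 2.24, that would support the suspicion that the added branches in waitPeer raise register pressure.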

Attachments

SASS analysis between NCCL 2.22 and 2.29
ncclsass.zip
