[Issue]: Upgrading NCCL from 2.22 to 2.24/2.29 causes significant performance regression in collective operations due to increased local memory instructions (ldl/stl).

## How is this issue impacting you?

Lower performance than expected

## Reproduction Steps
```bash
# Set NCCL algorithm to Ring
export NCCL_ALGO=Ring

# Run all_reduce_perf test for fp32 sum operation
./build/all_reduce_perf -b 8 -e 8g -f 2 -g 2 -d float -o sum -w 20
```

**Note**: The analysis focuses on the `all_reduce_ring_simple_f32_sum` operation as a representative case. This does not imply that other operations/configurations are unaffected.

Parameters:
- `-b 8`: minbytes = 8 bytes (minimum transfer size)
- `-e 8g`: maxbytes = 8GB (maximum transfer size)
- `-f 2`: stepfactor = 2 (size increment factor)
- `-g 2`: ngpus = 2 (GPUs per thread)
- `-d float`: datatype = float32
- `-o sum`: op = sum (reduction operation)
- `-w 20`: warmup_iters = 20 (warmup iterations)
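
To compare versions with the same command, the benchmark can be wrapped in a small helper that switches the NCCL library per run. This is a minimal sketch, assuming `all_reduce_perf` is dynamically linked against `libnccl.so` and that each NCCL build lives in its own directory. The function name `run_allreduce`, the `ALL_REDUCE_PERF` override, and the `/opt/nccl-*` paths are hypothetical and not part of NCCL or nccl-tests.

```shell
# Run the reproduction command against one NCCL build.
# $1: directory containing the libnccl.so to load (hypothetical layout).
run_allreduce() {
  nccl_lib_dir=$1
  # ALL_REDUCE_PERF is a hypothetical override for the benchmark binary path.
  bench=${ALL_REDUCE_PERF:-./build/all_reduce_perf}
  # NCCL_DEBUG=VERSION makes NCCL print its version, which confirms
  # which library was actually loaded.
  NCCL_ALGO=Ring NCCL_DEBUG=VERSION \
  LD_LIBRARY_PATH="$nccl_lib_dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" \
    "$bench" -b 8 -e 8g -f 2 -g 2 -d float -o sum -w 20
}

# Example (paths are assumptions):
#   for v in 2.22 2.24 2.29; do
#     run_allreduce "/opt/nccl-$v/lib" > "allreduce-$v.log"
#   done
```

Keeping the arguments in one place ensures that every version is measured with identical sizes, datatype, and warmup count.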

## Environment
- **GPU**: NVIDIA H200
- **NCCL Version Tested**: 2.22, 2.24, 2.29
- **NCCL Test Version**: v2.17.2 (commit abc46770a98777a9fd1b072adcf8becb76bfe125)
- **CUDA Version**: 13.1
- **Driver Version**: 590.44.01

## Expected Behavior
NCCL 2.24/2.29 should have similar or better performance compared to NCCL 2.22.

## Actual Behavior
After upgrading from NCCL 2.22 to 2.24/2.29:

1. **Local memory instructions (ldl/stl) increased significantly**:
   - NCCL 2.22: 138 stl + 273 ldl = 411 total
   - NCCL 2.29: 178 stl + 420 ldl = 598 total (+45%)

2. **Performance degradation**: The fp32 sum all-reduce (`all_reduce_ring_simple_f32_sum`) is measurably slower; see the chart below.

<img width="2382" height="1780" alt="Image" src="https://github.com/user-attachments/assets/94724048-eb26-43b0-8ee3-e79b6e285a53" />
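
The ldl/stl counts above can be reproduced from a text SASS dump with a short script. This is a sketch, assuming the dump was produced with something like `cuobjdump -sass` on the NCCL library and already narrowed to the kernel of interest. The helper names `count_local_mem` and `pct_increase` are hypothetical.

```shell
# Count local-memory loads (LDL*) and stores (STL*) in a SASS text dump.
# $1: path to the dump file.
count_local_mem() {
  awk '
    {
      # Scan every whitespace-separated token so that predicated
      # instructions (e.g. "@P0 LDL ...") and suffixed forms
      # (e.g. "LDL.64") are counted too.
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^LDL(\.|$)/) ldl++
        else if ($i ~ /^STL(\.|$)/) stl++
      }
    }
    END { printf "ldl=%d stl=%d total=%d\n", ldl, stl, ldl + stl }
  ' "$1"
}

# Percentage increase from an old count to a new count.
pct_increase() {
  awk -v old="$1" -v new="$2" \
    'BEGIN { printf "%.1f\n", (new - old) * 100 / old }'
}

# Example with the totals reported in this issue:
#   pct_increase 411 598
```

Running `count_local_mem` on the dumps for both versions gives the two totals, and `pct_increase` turns them into the relative change.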

## Root Cause Analysis
Comparing the NCCL 2.22 and 2.24 sources, we found that many if-else branches were
added to the `waitPeer` function in NCCL 2.24. We suspect that this additional
control flow increases register pressure, which would explain both the extra ldl/stl
instructions (register spills to local memory) and the performance regression.

<img width="2380" height="1050" alt="Image" src="https://github.com/user-attachments/assets/3265bdef-b9d4-48af-8964-e7aac7fe3c4b" />

## Suspected Area
- Function: `waitPeer`
- Change: Many if-else branches added in NCCL 2.24
- Impact: May cause more register spilling and local memory access
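
One way to check the spilling hypothesis is to compare per-kernel register and stack usage between the two builds. This is a sketch, assuming the input is text saved from `cuobjdump -res-usage` and that it has the usual shape of a `Function <name>:` line followed by a line with `REG:<n> STACK:<n> ...` fields. The helper name `summarize_res_usage` is hypothetical, and the parsing may need adjusting if your toolkit prints a different layout.

```shell
# Print "<function> <registers> <stack bytes>" for each kernel found in
# a saved resource-usage listing.
# $1: path to the listing file.
summarize_res_usage() {
  awk '
    # Remember the most recent function name, without the trailing colon.
    $1 == "Function" { name = $2; sub(/:$/, "", name) }
    /REG:/ {
      regs = ""; stack = ""
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^REG:/)   regs  = substr($i, 5)
        if ($i ~ /^STACK:/) stack = substr($i, 7)
      }
      print name, regs, stack
    }
  ' "$1"
}

# Example (file names are assumptions):
#   summarize_res_usage nccl-2.22.res > usage-2.22.txt
#   summarize_res_usage nccl-2.29.res > usage-2.29.txt
#   diff usage-2.22.txt usage-2.29.txt
```

A larger stack value for the same kernel in the newer build would be consistent with more spilling, although local arrays also consume stack, so it is a hint rather than proof.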

## Attachments
SASS comparison of NCCL 2.22 and 2.29:
[ncclsass.zip](https://github.com/user-attachments/files/24815380/ncclsass.zip)

