Skip to content

apps/nccl: fix a bug in allreduce kernels for graph mode#502

Merged
chhwang merged 3 commits intomicrosoft:mainfrom
nusislam:allreduce7-fix
Apr 24, 2025
Merged

apps/nccl: fix a bug in allreduce kernels for graph mode#502
chhwang merged 3 commits intomicrosoft:mainfrom
nusislam:allreduce7-fix

Conversation

@nusislam
Copy link
Contributor

@nusislam nusislam commented Apr 15, 2025

allreduce7 and allreduceAllpairs kernels were updating the LL protocol flag on the host side. So, it was not properly captured in graph mode. This PR fixes the issue by updating the flag in the kernels.

@nusislam nusislam force-pushed the allreduce7-fix branch 2 times, most recently from 6a7a6c3 to 12b46eb Compare April 15, 2025 20:59
@Binyang2014 Binyang2014 requested a review from Copilot April 16, 2025 22:15
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot reviewed 1 out of 2 changed files in this pull request and generated no comments.

Files not reviewed (1)
  • apps/nccl/src/nccl.cu: Language not supported
Comments suppressed due to low confidence (2)

apps/nccl/src/allreduce.hpp:451

  • Casting the deviceFlag value from uint64_t to uint32_t may cause data truncation if the flag value exceeds 32 bits. Verify that this conversion is safe for all expected flag values.
uint32_t flag = (uint32_t) commFlag;

apps/nccl/src/allreduce.hpp:545

  • Casting the deviceFlag value from uint64_t to uint32_t here may lead to truncation issues. Ensure this conversion will not lose significant bits in scenarios where the flag value grows beyond 32 bits.
uint32_t flag = (uint32_t) commFlag;

@Binyang2014 Binyang2014 requested a review from chhwang April 22, 2025 21:53
Copy link
Contributor

@Binyang2014 Binyang2014 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@chhwang chhwang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@chhwang chhwang merged commit 9df2bdb into microsoft:main Apr 24, 2025
14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants