Skip to content

nccl/rccl integration#469

Merged
seagater merged 21 commits intomainfrom
qinghuazhou/nccl-rccl-integration-dlopen
Mar 20, 2025
Merged

nccl/rccl integration#469
seagater merged 21 commits intomainfrom
qinghuazhou/nccl-rccl-integration-dlopen

Conversation

@seagater
Copy link
Contributor

@seagater seagater commented Feb 25, 2025

Use dlopen to load nccl/rccl apis from shared library to replace Fallback code for Allgather, Allreduce, Broadcast, ReduceScatter.

Add two related environment variables
-x MSCCLPP_ENABLE_SHARED_LIB=TRUE -x MSCCLPP_NCCL_LIB_PATH=path_to_libnccl.so/librccl.so

@SreevatsaAnantharamu
Copy link
Contributor

SreevatsaAnantharamu commented Feb 25, 2025

image

Additional symbols that need to be loaded from libnccl.so:

  • ncclGetUniqueId
  • ncclCommInitRank

nccl_ops_t->ncclGetUniqueId(

Here, before returning, you have to call nccl_ops_t->ncclCommInitRank and create a new real NCCL's communicator. Inside MSCCL++'s ncclComm_t, you can have a void * or ncclComm_t nccl_comm.

nccl_ops_t->ncclCommInitRank(&commPtr->nccl_comm, ... )

@seagater
Copy link
Contributor Author

seagater commented Mar 5, 2025

Add two related environment variables:
-x MSCCLPP_ENABLE_SHARED_LIB=TRUE -x MSCCLPP_NCCL_LIB_PATH=path_to_libnccl.so/librccl.so

Support dlopen for following nccl apis:
ncclCommInitRank
ncclGetUniqueId
ncclCommDestroy
ncclCommUserRank
ncclAllGather
ncclAllReduce
ncclBroadcast

Pass following tests
nccl-test:
Allgather, Allreduce, Broadcast, ReduceScatter

rccl-test:
Allgather, Allreduce, Broadcast, ReduceScatter

Copy link
Contributor

@Binyang2014 Binyang2014 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this PR! A few comments

Copy link
Contributor

@SreevatsaAnantharamu SreevatsaAnantharamu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have left comments. Please take a look.

The most important is that we have to create both NCCL's communicator and our MSCCLPP communicator.

…from nccl/rccl; Keep both nccl communicator and mscclpp communicator
…llgather() to get the ncclUniqueId created on rank:0 so that each rank can get the same ncclUniqueId.
@seagater
Copy link
Contributor Author

Update the environment variable names:
-x MSCCLPP_ENABLE_NCCL_FALLBACK=TRUE
-x MSCCLPP_NCCL_LIB_PATH=$NCCL_LIB
-x MSCCLPP_FORCE_NCCL_FALLBACK_OPERATION="allreduce,allgather,broadcast,reducescatter" or "all"

Copy link
Contributor

@Binyang2014 Binyang2014 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good to me, leaves some comments. BTW maybe we can add CI for this

…ueId; Use new/delete pair for mscclppNcclComm; Pass the name of collective operations for mscclppNcclInFallbackList
Copy link
Contributor

@Binyang2014 Binyang2014 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Binyang2014
Copy link
Contributor

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 3 pipeline(s).

@seagater seagater merged commit a7c364b into main Mar 20, 2025
25 checks passed
@seagater seagater deleted the qinghuazhou/nccl-rccl-integration-dlopen branch March 20, 2025 18:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants