Additional symbols that need to be loaded from libnccl.so:
nccl_ops_t->ncclGetUniqueId(
Here, before returning, you have to call nccl_ops_t->ncclCommInitRank to create a real NCCL communicator. Inside MSCCL++'s ncclComm_t, you can keep a void * or an ncclComm_t nccl_comm member:
nccl_ops_t->ncclCommInitRank(&commPtr->nccl_comm, ... )
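The layout suggested in this comment could look roughly like the sketch below. It is an illustration only, not the PR's code: `NcclOps`, `MscclppNcclComm`, `initBothComms`, and the stand-in types `FakeUniqueId`, `FakeNcclComm`, and `FakeResult` are hypothetical names chosen so the example compiles without nccl.h. The real code would use NCCL's own `ncclUniqueId`, `ncclComm_t`, and `ncclResult_t`, and would fill the function pointers from libnccl.so.

```cpp
// Stand-ins for NCCL's types so the sketch compiles without nccl.h (assumption).
struct FakeUniqueId { char internal[128]; };
using FakeNcclComm = void*;  // the comment suggests a void* or ncclComm_t member
using FakeResult = int;      // 0 plays the role of ncclSuccess

// Hypothetical table of entry points resolved from libnccl.so / librccl.so.
struct NcclOps {
  FakeResult (*GetUniqueId)(FakeUniqueId* id);
  FakeResult (*CommInitRank)(FakeNcclComm* comm, int nranks, FakeUniqueId id, int rank);
};

// Hypothetical wrapper communicator: MSCCL++ state plus the real NCCL communicator.
struct MscclppNcclComm {
  int rank = -1;
  int nranks = 0;
  FakeNcclComm nccl_comm = nullptr;
};

// Create the real communicator through the loaded table before returning to the caller.
inline FakeResult initBothComms(const NcclOps& ops, MscclppNcclComm* comm, int nranks,
                                const FakeUniqueId& id, int rank) {
  if (comm == nullptr || ops.CommInitRank == nullptr) return 1;  // nothing loaded
  comm->rank = rank;
  comm->nranks = nranks;
  return ops.CommInitRank(&comm->nccl_comm, nranks, id, rank);
}
```

Keeping the NCCL communicator as a member means a fallback collective can be forwarded to the loaded library with the handle it expects, while the wrapper remains what callers see.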
Add two related environment variables:
Support dlopen for the following NCCL APIs:
Pass the following rccl-test tests:
Binyang2014
left a comment
Thanks for this PR! A few comments
SreevatsaAnantharamu
left a comment
I have left comments. Please take a look.
The most important point is that we have to create both an NCCL communicator and our MSCCLPP communicator.
…from nccl/rccl; Keep both nccl communicator and mscclpp communicator
…llgather() to get the ncclUniqueId created on rank:0 so that each rank can get the same ncclUniqueId.
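The idea in this commit message, where rank 0 creates the id and every rank obtains the same value through an allgather, can be illustrated with a small single-process simulation. `UniqueId`, `simulateAllGather`, and `pickRootId` are hypothetical names, and the simulated gather stands in for whatever allgather the bootstrap actually provides.

```cpp
#include <array>
#include <cstddef>
#include <vector>

using UniqueId = std::array<char, 128>;  // same size as NCCL's 128-byte ncclUniqueId

// Simulated allgather: every rank ends up with a copy of every rank's contribution.
inline std::vector<std::vector<UniqueId>> simulateAllGather(const std::vector<UniqueId>& perRank) {
  return std::vector<std::vector<UniqueId>>(perRank.size(), perRank);
}

// After the allgather, each rank keeps only the slot written by rank 0.
inline UniqueId pickRootId(const std::vector<UniqueId>& gathered) {
  return gathered.at(0);
}
```

Only rank 0's slot carries a meaningful id; the other ranks contribute placeholder bytes that are discarded after the gather.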
Update the environment variable names:
Binyang2014
left a comment
Overall looks good to me; I left some comments. BTW, maybe we can add CI for this.
…ueId; Use new/delete pair for mscclppNcclComm; Pass the name of collective operations for mscclppNcclInFallbackList
/azp run
Azure Pipelines successfully started running 3 pipeline(s).

Use dlopen to load the NCCL/RCCL APIs from a shared library, replacing the fallback code for AllGather, AllReduce, Broadcast, and ReduceScatter.
Add two related environment variables:
-x MSCCLPP_ENABLE_SHARED_LIB=TRUE -x MSCCLPP_NCCL_LIB_PATH=path_to_libnccl.so/librccl.so
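A minimal sketch of how the two variables might gate the dlopen path is shown below. It uses the variable names exactly as written in this description; the `LoadedLib` struct, the helper names `sharedLibEnabled` and `tryLoadNccl`, and the exact-match check against `TRUE` are assumptions for illustration rather than the PR's implementation. `ncclAllReduce` is used as the example symbol because both libnccl.so and librccl.so export it.

```cpp
#include <dlfcn.h>
#include <cstdlib>
#include <string>

// Handle to the user-supplied library plus one resolved symbol; nullptr means "not loaded".
struct LoadedLib {
  void* handle = nullptr;
  void* allReduce = nullptr;
};

// True when MSCCLPP_ENABLE_SHARED_LIB is set to TRUE (exact match is an assumption).
inline bool sharedLibEnabled() {
  const char* value = std::getenv("MSCCLPP_ENABLE_SHARED_LIB");
  return value != nullptr && std::string(value) == "TRUE";
}

// Try to open the library named by MSCCLPP_NCCL_LIB_PATH and resolve ncclAllReduce.
inline LoadedLib tryLoadNccl() {
  LoadedLib lib;
  if (!sharedLibEnabled()) return lib;
  const char* path = std::getenv("MSCCLPP_NCCL_LIB_PATH");
  if (path == nullptr) return lib;
  lib.handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
  if (lib.handle == nullptr) return lib;  // dlerror() would describe the failure
  lib.allReduce = dlsym(lib.handle, "ncclAllReduce");
  return lib;
}
```

A caller would treat a null `handle` or a null `allReduce` as "no fallback library available". On older glibc versions the program needs to be linked with `-ldl`.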