Improve signal/wait performance and fix barrier issue #499
Conversation
Hi @Binyang2014, the pull request looks good to me. However, I am unable to benchmark the code due to a compile issue. I am curious whether a memoryOrderRelaxed atomicLoad is faster than a memoryOrderAcquire one. If so, maybe it is more efficient to use a memoryOrderRelaxed atomicLoad in the poll loop.
Yes, it's faster. But I am not sure about the correctness in this case. If we get the flag via a memoryOrderRelaxed atomicLoad, can the system guarantee that the following memoryOrderAcquire atomicLoad will see the same value? I can't find any documentation for this. Also, I don't see any compile issue for pip install; are you using the mscclpp container image?
I am suggesting something similar to #497: we keep the memoryOrderRelaxed atomicStore and atomicLoad, but add fences before and after. For the CUDA documentation, please refer to #497 (comment), especially item 3 of 8.8. Basically, if one thread writes a value that is read by another thread and there are fences before the write and after the read, then the two threads have a release-acquire relation even if the read and write are relaxed. We should establish a release-acquire relation between any two threadblocks participating in the DeviceSyncer. I think we should probably also add fence2 to #497, like the following:

```cpp
fence_acq_rel_gpu();  // fence1
unsigned int tmp = preFlag_ ^ 1;
if (atomicInc(&count_, maxOldCnt) == maxOldCnt) {
  fence_acq_rel_gpu();  // fence2
  atomicStore(&flag_, tmp, memoryOrderRelaxed);
} else {
  POLL_MAYBE_JAILBREAK((atomicLoad(&flag_, memoryOrderRelaxed) != tmp), maxSpinCount);
}
preFlag_ = tmp;
fence_acq_rel_gpu();  // fence3
```

Suppose the threadblocks reach the atomicInc in order 1, ..., n. Then from any block x to block y, the release-acquire relation has two cases:
And of course, this only establishes the release-acquire relation between thread 0 of all blocks. The __syncthreads() before and after all of this will extend the release-acquire relation to any two threads of all blocks.
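To make the thread 0 gating concrete, here is a minimal sketch (my illustration, not the code in this PR) of how the fenced barrier body above could be wrapped, assuming the device-side helpers named in this thread (fence_acq_rel_gpu, atomicStore/atomicLoad, POLL_MAYBE_JAILBREAK) are in scope:

```cpp
// Sketch only: a DeviceSyncer-like struct showing where the __syncthreads() calls go
// relative to the fenced barrier body, so the ordering extends to every thread.
struct DeviceSyncerSketch {
  unsigned int count_;
  unsigned int flag_;
  unsigned int preFlag_;

  __forceinline__ __device__ void sync(unsigned int blockNum, int64_t maxSpinCount) {
    unsigned int maxOldCnt = blockNum - 1;
    __syncthreads();               // order every thread's prior work before thread 0's release
    if (threadIdx.x == 0) {
      fence_acq_rel_gpu();         // fence1
      unsigned int tmp = preFlag_ ^ 1;
      if (atomicInc(&count_, maxOldCnt) == maxOldCnt) {
        fence_acq_rel_gpu();       // fence2
        atomicStore(&flag_, tmp, memoryOrderRelaxed);
      } else {
        POLL_MAYBE_JAILBREAK((atomicLoad(&flag_, memoryOrderRelaxed) != tmp), maxSpinCount);
      }
      preFlag_ = tmp;
      fence_acq_rel_gpu();         // fence3
    }
    __syncthreads();               // all threads inherit the acquired ordering from thread 0
  }
};
```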
Looks good to me. The release atomicFetchAdd and the acquire atomicLoad, both on the same memory location, can establish a mutual release-acquire relation between threadblocks. Please try replacing the memoryOrderAcquire atomicLoad in the poll loop with a memoryOrderRelaxed one and adding an acquire fence after the poll loop ends, to see if that brings a performance improvement. The correctness is guaranteed by item 3 of the 8.8 acquire pattern.
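A minimal sketch of that suggestion (the helper name waitFlagSketch is mine; the atomic wrappers, POLL_MAYBE_JAILBREAK, and fence_acq_rel_gpu follow the names used in this thread and are assumed to be in scope):

```cpp
// Sketch only: poll with a relaxed load, then issue one acquire(-release) fence
// after the loop ends instead of paying for an acquire load on every iteration.
__forceinline__ __device__ void waitFlagSketch(unsigned int* flagPtr, unsigned int expected,
                                               int64_t maxSpinCount) {
  // Hot loop: the relaxed load keeps the per-iteration PTX short.
  POLL_MAYBE_JAILBREAK((atomicLoad(flagPtr, memoryOrderRelaxed) != expected), maxSpinCount);
  // One fence after the observed write re-establishes the release-acquire relation
  // with the threadblock that stored the flag (item 3 of the 8.8 pattern).
  fence_acq_rel_gpu();
}
```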
Thanks @liangyuRain, I tried with |
/azp run |
Azure Pipelines successfully started running 3 pipeline(s).
/azp run |
Azure Pipelines successfully started running 3 pipeline(s).
Remove __assert_fail for the release build. This reduces the number of PTX instructions inside the loop from 8 to 6, and also tries to resolve the issue reported in #497.
With 8 ranks, signal/wait latency drops from 3.2us to 2.8us on NDv5.
Also, the NDEBUG flag is confusing here; sometimes it is not set. Use a customized flag for the debug build instead.
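As a rough sketch of that idea (the flag and macro names here are hypothetical, not the ones used in the PR): gate the device-side assertion on a project-specific debug macro, so a release build emits no __assert_fail call inside the poll loop regardless of whether NDEBUG happened to be set.

```cpp
// Sketch only: MSCCLPP_DEVICE_DEBUG and DEVICE_ASSERT are hypothetical names.
#include <cassert>

#if defined(MSCCLPP_DEVICE_DEBUG)
// Debug build: keep the check, which may call __assert_fail in device code.
#define DEVICE_ASSERT(cond) assert(cond)
#else
// Release build: compile the check away so no extra PTX is emitted in the spin loop.
#define DEVICE_ASSERT(cond) ((void)0)
#endif
```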
Here is the current PTX:
If we change to:

```cpp
asm volatile("ld.global.acquire.sys.b64 %0, [%1];" : "=l"(flag) : "l"(flag_addr));
```

it will reduce to 4 instructions. We can get 2.1us for 8-rank signal/wait.
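For illustration, here is a hedged sketch of how that acquire load could sit inside a wait loop (the function name and the ">= target" completion condition are my assumptions, not necessarily the exact semaphore semantics in mscclpp):

```cpp
#include <cstdint>

// Sketch only: spin until the 64-bit flag reaches a target value, using a single
// ld.global.acquire.sys.b64 per poll iteration.
__forceinline__ __device__ void waitSignalSketch(uint64_t* flag_addr, uint64_t target) {
  uint64_t flag;
  do {
    asm volatile("ld.global.acquire.sys.b64 %0, [%1];" : "=l"(flag) : "l"(flag_addr));
  } while (flag < target);
}
```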