🐛 Bug
After pulling the latest code, I found that DDP training gets stuck within the first few epochs.
I tested several commits to find which one introduced the bug. Commit a3ecf0fd640465f9a7c009e81bcc5ecabf381004 (Mar 3) works well.
But after I git checkout commit e931b9da33f45551928059b8d61bddd50e401e48 (Mar 4), the bug appears.
The bug still exists in the latest commit.
To Reproduce (REQUIRED)
python3 -m torch.distributed.launch --nproc_per_node 4 train.py
The training process hangs forever unless it is terminated manually.
It also keeps occupying GPU memory until the process is killed with kill -9 xxxxx
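
In case it helps anyone testing other commits, here is a minimal sketch of a wrapper that runs the launch command with a time budget and force-kills the whole process group if it hangs, so GPU memory is not left occupied. The helper name run_with_timeout and the one-hour budget are my own assumptions, not part of the repo, and it assumes a POSIX system where the workers stay in the launcher's process group.

```python
import os
import signal
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd; return its exit code, or None if it was killed after timeout_s seconds."""
    # start_new_session=True gives the launcher its own process group (POSIX only),
    # so child workers that stay in that group can be signalled together.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # SIGKILL the whole group, the equivalent of `kill -9` on each process in it.
        os.killpg(proc.pid, signal.SIGKILL)
        proc.wait()  # reap the launcher so it does not linger as a zombie
        return None


if __name__ == "__main__":
    # Same launch command as in this report, started from the current interpreter.
    cmd = [sys.executable, "-m", "torch.distributed.launch",
           "--nproc_per_node", "4", "train.py"]
    code = run_with_timeout(cmd, timeout_s=3600)  # hypothetical one-hour budget
    print("hung and was killed" if code is None else f"exited with code {code}")
```

A None result only means the run exceeded the budget, so the timeout needs to be comfortably longer than a healthy run on the same hardware.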

Expected behavior
Rolling back to the older commit restores the expected behavior:
$ git checkout a3ecf0fd640465f9a7c009e81bcc5ecabf381004
$ python3 -m torch.distributed.launch --nproc_per_node 4 train.py

Environment
- OS: Ubuntu 20.04
- GPU: 1080 Ti * 4
- Python: 3.8
- PyTorch: 1.7.1
- CUDA: 11.1
- Driver: 455.32
Additional
The latest commit seems to work fine on 2 * 3090, but I'm not sure yet. I will run further tests on 3090s or other GPUs.