Can't train in DDP mode after recent update #2405

@wudashuo

🐛 Bug

After pulling the latest code, I found that DDP training gets stuck within the first few epochs.
I tested several commits to find out which one introduced the bug. Commit a3ecf0fd640465f9a7c009e81bcc5ecabf381004 (Mar 3) works well.
But after I git checkout commit e931b9da33f45551928059b8d61bddd50e401e48 (Mar 4), the bug appears.
The bug still exists in the latest commit.

To Reproduce (REQUIRED)

python3 -m torch.distributed.launch --nproc_per_node 4 train.py

The training process stays stuck forever unless it is terminated manually.
It also keeps occupying GPU memory until the process is killed with kill -9 xxxxx
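One way to find out where each process is blocked is Python's standard faulthandler module, which can dump the traceback of every thread after a timeout even while the main thread is waiting inside a native call. The sketch below is only an illustration: the helper names arm_hang_watchdog and disarm_hang_watchdog and the 600-second default are hypothetical, and where to call them in train.py is an assumption, not something taken from the repository.

```python
import faulthandler
import sys


def arm_hang_watchdog(timeout_s=600.0, out=sys.stderr):
    """Dump the traceback of all threads to `out` every `timeout_s` seconds.

    Hypothetical helper: `out` must be a real file object (it needs a file
    descriptor). The process is not terminated; it only reports where it is.
    """
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=out, exit=False)


def disarm_hang_watchdog():
    """Cancel the pending dump, e.g. once training has finished normally."""
    faulthandler.cancel_dump_traceback_later()
```

If something like arm_hang_watchdog() were called near the start of the training script, each of the 4 processes started by torch.distributed.launch would print its own stack to stderr once the timeout passes, which should show the Python line each rank is waiting on.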

[screenshot: stuck]

Expected behavior

After rolling back to the older commit, training runs as expected:

$ git checkout a3ecf0fd640465f9a7c009e81bcc5ecabf381004
$ python3 -m torch.distributed.launch --nproc_per_node 4 train.py

[screenshot: worked well]

Environment

  • OS: Ubuntu 20.04
  • GPU: 1080 Ti * 4
  • Python: 3.8
  • PyTorch: 1.7.1
  • CUDA: 11.1
  • Driver: 455.32
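For anyone comparing against another machine, most of the values above can be collected with a short script. This is a minimal sketch: collect_env is a hypothetical helper name, the PyTorch part is skipped if torch is not installed, and the driver version is not included here (nvidia-smi reports it).

```python
import platform


def collect_env():
    """Return a dict with OS, Python and (if available) PyTorch/CUDA/GPU info."""
    info = {"os": platform.platform(), "python": platform.python_version()}
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this interpreter; report that and stop.
        info["pytorch"] = None
        return info
    info["pytorch"] = torch.__version__
    info["cuda"] = torch.version.cuda  # CUDA version PyTorch was built with, or None
    info["gpus"] = [
        torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())
    ]
    return info


if __name__ == "__main__":
    for key, value in collect_env().items():
        print(f"{key}: {value}")
```

PyTorch also ships its own report, which can be run with python3 -m torch.utils.collect_env.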

Additional

The latest commit seems to work fine on 2 * 3090, but I'm not sure yet. I will run further tests on the 3090 and other GPUs.
