
Facing an issue with the 14B model #37

@Akshaysharma29

Description


Hi Team,

I am running the following command for the 14B model:
torchrun --nproc_per_node=8 generate.py --task s2v-14B --size 832*480 --ckpt_dir ../Wan2.1-T2V-14B --phantom_ckpt ../Phantom-Wan-Models/ --ref_image "examples/ref3.png,examples/ref4.png" --dit_fsdp --t5_fsdp --ulysses_size 8 --ring_size 1 --prompt "夕阳下,一位有着小麦色肌肤、留着乌黑长发的女人穿上有着大朵立体花朵装饰、肩袖处带有飘逸纱带的红色纱裙,漫步在金色的海滩上,海风轻拂她的长发,画面唯美动人。" --base_seed 42
(The Chinese prompt translates to: "At sunset, a woman with wheat-colored skin and long black hair puts on a red gauze dress decorated with large sculpted flowers and flowing ribbons at the shoulders, strolling along a golden beach as the sea breeze gently lifts her hair; the scene is beautiful and moving.")

I hit the error below on a machine with multiple L4 GPUs:

[2025-06-30 04:26:34,339] INFO: Creating WanModel from ../Phantom-Wan-Models/
W0630 04:28:17.164000 139757906253632 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 8919 closing signal SIGTERM
W0630 04:28:17.166000 139757906253632 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 8920 closing signal SIGTERM
W0630 04:28:17.167000 139757906253632 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 8921 closing signal SIGTERM
W0630 04:28:17.167000 139757906253632 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 8922 closing signal SIGTERM
W0630 04:28:17.168000 139757906253632 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 8923 closing signal SIGTERM
W0630 04:28:17.168000 139757906253632 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 8924 closing signal SIGTERM
W0630 04:28:17.169000 139757906253632 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 8925 closing signal SIGTERM
E0630 04:28:23.497000 139757906253632 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -9) local_rank: 0 (pid: 8918) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.4.0', 'console_scripts', 'torchrun')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/opt/conda/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/opt/conda/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=====================================================
generate.py FAILED
-----------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-06-30_04:28:17
  host      : 7ba479c0458d
  rank      : 0 (local_rank: 0)
  exitcode  : -9 (pid: 8918)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 8918
=====================================================
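For context (my own diagnosis, not part of the launcher output): an exit code of -9 means rank 0 received SIGKILL, which on Linux is most often the kernel OOM killer reclaiming host RAM while the checkpoint is being loaded. A rough back-of-the-envelope estimate of the weight footprint alone, assuming bf16 storage (an estimate, not a measurement of this repo's actual loading path):

```python
# Hypothetical sizing sketch: weights-only memory for a 14B-parameter
# model stored in bf16 (2 bytes per parameter). Activations, optimizer
# state, and any temporary fp32 copies made during loading come on top.
params = 14e9          # parameter count of the 14B checkpoint
bytes_per_param = 2    # bf16

weights_gib = params * bytes_per_param / 1024**3
print(f"weights alone: ~{weights_gib:.0f} GiB")

# If each of the 8 torchrun workers materializes the full checkpoint on
# the host before FSDP shards it, peak host-RAM demand multiplies:
workers = 8
print(f"worst case, all ranks loading at once: ~{weights_gib * workers:.0f} GiB")
```

If host RAM is the bottleneck, checking `dmesg` for OOM-killer messages right after the crash, or watching `free -h` while the model loads, would confirm it.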
