Facing issue with the 14B model #37
Hi Team,
Running the following command to launch the 14B model:

```shell
torchrun --nproc_per_node=8 generate.py --task s2v-14B --size 832*480 --ckpt_dir ../Wan2.1-T2V-14B --phantom_ckpt ../Phantom-Wan-Models/ --ref_image "examples/ref3.png,examples/ref4.png" --dit_fsdp --t5_fsdp --ulysses_size 8 --ring_size 1 --prompt "夕阳下,一位有着小麦色肌肤、留着乌黑长发的女人穿上有着大朵立体花朵装饰、肩袖处带有飘逸纱带的红色纱裙,漫步在金色的海滩上,海风轻拂她的长发,画面唯美动人。" --base_seed 42
```

(The Chinese `--prompt` describes a woman with wheat-toned skin and long black hair, wearing a red gauze dress decorated with large sculpted flowers and flowing ribbons at the shoulders, strolling along a golden beach at sunset as the sea breeze lifts her hair.)
It fails with the error below on a node with multiple NVIDIA L4 GPUs:
```
[2025-06-30 04:26:34,339] INFO: Creating WanModel from ../Phantom-Wan-Models/
W0630 04:28:17.164000 139757906253632 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 8919 closing signal SIGTERM
W0630 04:28:17.166000 139757906253632 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 8920 closing signal SIGTERM
W0630 04:28:17.167000 139757906253632 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 8921 closing signal SIGTERM
W0630 04:28:17.167000 139757906253632 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 8922 closing signal SIGTERM
W0630 04:28:17.168000 139757906253632 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 8923 closing signal SIGTERM
W0630 04:28:17.168000 139757906253632 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 8924 closing signal SIGTERM
W0630 04:28:17.169000 139757906253632 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 8925 closing signal SIGTERM
E0630 04:28:23.497000 139757906253632 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -9) local_rank: 0 (pid: 8918) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.4.0', 'console_scripts', 'torchrun')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/opt/conda/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/opt/conda/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=====================================================
generate.py FAILED
-----------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-06-30_04:28:17
  host      : 7ba479c0458d
  rank      : 0 (local_rank: 0)
  exitcode  : -9 (pid: 8918)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 8918
=====================================================
```
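For context on how to read the exit code: torchrun, like Python's `subprocess`, encodes "child killed by signal N" as exit code `-N`, so `exitcode: -9` means the rank-0 worker received SIGKILL. On Linux that most often points to the kernel OOM killer reclaiming memory, e.g. while the 14B checkpoint is being loaded into host RAM before FSDP sharding. A minimal sketch of the mapping:

```python
import signal

# torchrun reports a negative exit code when the child died from a signal:
# exit code -N corresponds to signal number N.
exitcode = -9
sig = signal.Signals(-exitcode)
print(sig.name)  # SIGKILL
```

If the OOM killer is indeed the cause here, `dmesg` on the host should show a "Killed process 8918" entry around 04:28; freeing host memory or running on a machine with more CPU RAM may avoid the kill.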