Hugging Face Dataset Streaming can fail causing NCCL to timeout #3338

@tchaton

Bug description

Hey,

I was trying out torchtitan, and streaming the dataset from Hugging Face seems somewhat flaky: a stalled shard download ends in an NCCL collective timeout.

torchrun --nnodes 1 --nproc_per_node 8 \
  --rdzv_backend c10d --rdzv_endpoint localhost:29500 \
  -m torchtitan.train \
  --module llama3 --config llama3_8b \
  --hf-assets-path ./assets/hf/Meta-Llama-3-8B \
  --dump-folder ./outputs/llama3_8b_dayrun \
  --training.local-batch-size 2 \
  --training.steps 500000 \
  --checkpoint.enable \
  --checkpoint.interval 2000 \
  --checkpoint.keep-latest-k 3 \
  --metrics.log-freq 10 \
  --profiler.profile-freq 1000000 \
  2>&1 | tee outputs/llama3_8b_dayrun.log

Relevant portion of the logs:

[titan] 2026-05-13 00:52:16,792 - root - INFO - step: 2690  loss:  3.76196  grad_norm:  0.2823  memory: 59.24GiB(74.81%)  tps: 2,059  tflops: 119.27  mfu: 12.06%
[titan] 2026-05-13 00:52:16,792 - root - INFO - step: 2690  loss:  3.76196  grad_norm:  0.2823  memory: 59.24GiB(74.81%)  tps: 2,059  tflops: 119.27  mfu: 12.06%
[titan] 2026-05-13 00:52:16,792 - root - INFO - step: 2690  loss:  3.76196  grad_norm:  0.2823  memory: 59.24GiB(74.82%)  tps: 2,059  tflops: 119.27  mfu: 12.06%
[titan] 2026-05-13 00:52:46,310 - httpx - INFO - HTTP Request: GET https://huggingface.co/datasets/allenai/c4/resolve/1588ec454efa1a09f29cd18ddd04fe05fc8653a2/en/c4-train.00004-of-01024.json.gz "HTTP/1.1 302 Found"
[titan] 2026-05-13 00:52:46,337 - httpx - INFO - HTTP Request: GET https://cas-bridge.xethub.hf.co/xet-bridge-us/621ffdd236468d709f182a80/732ea51359e83c755ff7a289603bc934a5d50b7220002769b99b42f17034a706?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=cas%2F20260513%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20260513T005244Z&X-Amz-Expires=3600&X-Amz-Signature=27ad0ca740a05012b4b0ec636310a1518a0123bfe87cd68ff94c87edaf70e226&X-Amz-SignedHeaders=host&X-Xet-Cas-Uid=public&response-content-disposition=inline%3B+filename*%3DUTF-8%27%27c4-train.00004-of-01024.json.gz%3B+filename%3D%22c4-train.00004-of-01024.json.gz%22%3B&response-content-type=application%2Fgzip&x-amz-checksum-mode=ENABLED&x-id=GetObject&Expires=1778637164&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTc3ODYzNzE2NH19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2FzLWJyaWRnZS54ZXRodWIuaGYuY28veGV0LWJyaWRnZS11cy82MjFmZmRkMjM2NDY4ZDcwOWYxODJhODAvNzMyZWE1MTM1OWU4M2M3NTVmZjdhMjg5NjAzYmM5MzRhNWQ1MGI3MjIwMDAyNzY5Yjk5YjQyZjE3MDM0YTcwNioifV19&Signature=MyVroYh5%7Ey8Oj4QOXxFBbDDO%7ERurPF-8e40ifqT-4waugwcS-DEr6I5g80y1KESxdY9ueQUX7Dk4JE%7EbMZj4MFyl-XcoPGpGaJ-safUBGxpOwp3m53MI4auKzaAbjWt5pRDNSE9gWy3z5zMsMiqpBkUmExPiN5FL%7E7dxMGiSYR6TU1d7DE9h-2O83QEHsafVFkL-qWQhyzA%7EuZ1pjx9EYd0LC8yEILfPYUUx-YxKiRMN98YY9bLAA2OsTmHn1PmhVAC5k9cvzBreK0eiYpEfiNH%7Eb-Jcpaws04ckRwQ33REHyoo9zYWPgoHSPX8iWV%7EfFFPVfN%7EgTvU0thlZsNtUTQ__&Key-Pair-Id=K2L8F4GPSG1IFC "HTTP/1.1 206 Partial Content"
[titan] 2026-05-13 00:52:54,343 - httpx - INFO - HTTP Request: GET https://huggingface.co/datasets/allenai/c4/resolve/1588ec454efa1a09f29cd18ddd04fe05fc8653a2/en/c4-train.00000-of-01024.json.gz "HTTP/1.1 302 Found"
[titan] 2026-05-13 00:52:54,363 - httpx - INFO - HTTP Request: GET https://cas-bridge.xethub.hf.co/xet-bridge-us/621ffdd236468d709f182a80/6656aac078668b0d5feb1cfb7b04def05bff73e7e375d0a66e252827336ac1a8?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=cas%2F20260513%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20260513T004925Z&X-Amz-Expires=3600&X-Amz-Signature=55f6fa3ad70854c3d377f1c9153c91d4450e766c3178ce2e3ef8f44a025007a0&X-Amz-SignedHeaders=host&X-Xet-Cas-Uid=public&response-content-disposition=inline%3B+filename*%3DUTF-8%27%27c4-train.00000-of-01024.json.gz%3B+filename%3D%22c4-train.00000-of-01024.json.gz%22%3B&response-content-type=application%2Fgzip&x-amz-checksum-mode=ENABLED&x-id=GetObject&Expires=1778636965&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTc3ODYzNjk2NX19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2FzLWJyaWRnZS54ZXRodWIuaGYuY28veGV0LWJyaWRnZS11cy82MjFmZmRkMjM2NDY4ZDcwOWYxODJhODAvNjY1NmFhYzA3ODY2OGIwZDVmZWIxY2ZiN2IwNGRlZjA1YmZmNzNlN2UzNzVkMGE2NmUyNTI4MjczMzZhYzFhOCoifV19&Signature=HI-kcMay-3YI3pLH55fnYAXLAKJtkwZTtWgd5fRIJjUQ8FLcWIN01WXR420dYdJYSBr1tyy2SCdRzPMhxcAr4hfSwbjKdSxMnzL7%7EukvSy7GQ51hW%7E4wHfOtiYRLK4yHzkcsg1Dhat2xxiVm6UpqO64b-emeKwBTagsug3-DfylVd2h5FThBUpeLeY8WoSTdbucv5lJWClzBKpmQVaiJLYwFH7HUzgN-AD0K82FRpbghs43HlaKYwZWfs09lDw2Arm9K0UNQJ-55TjVuvwYkKbawl97D4M1XE1WQUB4fmwoSFFiq2h2NGNzjuPKHHAfR1b6Z1r4E8jA0kAHiZ-0kKw__&Key-Pair-Id=K2L8F4GPSG1IFC "HTTP/1.1 206 Partial Content"
[titan] 2026-05-13 00:53:02,245 - httpx - INFO - HTTP Request: GET https://huggingface.co/datasets/allenai/c4/resolve/1588ec454efa1a09f29cd18ddd04fe05fc8653a2/en/c4-train.00002-of-01024.json.gz "HTTP/1.1 302 Found"
[titan] 2026-05-13 00:53:02,266 - httpx - INFO - HTTP Request: GET https://cas-bridge.xethub.hf.co/xet-bridge-us/621ffdd236468d709f182a80/96adb94376f42df2255c7ad2fa02604236304fc487f7bb89b8bdee36a126f298?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=cas%2F20260513%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20260513T005300Z&X-Amz-Expires=3600&X-Amz-Signature=6ddfb69b71e8133426f4de4db704764f89320cb83866b15f1d5ff1ffa5dc681a&X-Amz-SignedHeaders=host&X-Xet-Cas-Uid=public&response-content-disposition=inline%3B+filename*%3DUTF-8%27%27c4-train.00002-of-01024.json.gz%3B+filename%3D%22c4-train.00002-of-01024.json.gz%22%3B&response-content-type=application%2Fgzip&x-amz-checksum-mode=ENABLED&x-id=GetObject&Expires=1778637180&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTc3ODYzNzE4MH19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2FzLWJyaWRnZS54ZXRodWIuaGYuY28veGV0LWJyaWRnZS11cy82MjFmZmRkMjM2NDY4ZDcwOWYxODJhODAvOTZhZGI5NDM3NmY0MmRmMjI1NWM3YWQyZmEwMjYwNDIzNjMwNGZjNDg3ZjdiYjg5YjhiZGVlMzZhMTI2ZjI5OCoifV19&Signature=jPtGBvNU7LIuvkYZJEy%7E09TOoomUJ1DC%7ECDFV3sLuAuL9kSML%7E2l-9tDwU-J0pm2xWzfToi6VpMv5aXuS4k5bRQPsoLfmAi6aspirk-KK7mRZeIgzwQ0cPT-5WDaC19-ONkcKSLk8-8F7ed-jKJqDleeXY7ryINQh1CkKSVujHw5fve3guerQeIqc5bLNoMayVLGdrO91T5pz7fp%7EQ9xWy1njxrjl3fRmaUJm42IWQhjQCOtnJY9hqRi1j0KwX2p2IQ2ZlGRTPN26NF2wWzEmk8livOhdv0f2UvdK1tOFrYw6xtytmmBm3O7oORvpKriyXNKqNkf2shQXSYowhmebg__&Key-Pair-Id=K2L8F4GPSG1IFC "HTTP/1.1 206 Partial Content"
[titan] 2026-05-13 00:53:02,269 - httpx - INFO - HTTP Request: GET https://huggingface.co/datasets/allenai/c4/resolve/1588ec454efa1a09f29cd18ddd04fe05fc8653a2/en/c4-train.00003-of-01024.json.gz "HTTP/1.1 302 Found"
[titan] 2026-05-13 00:53:02,295 - httpx - INFO - HTTP Request: GET https://cas-bridge.xethub.hf.co/xet-bridge-us/621ffdd236468d709f182a80/a20f257af4ab58526d8f66b50d36299eec45b2e79f25d94c8cb8186322bfb888?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=cas%2F20260513%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20260513T005300Z&X-Amz-Expires=3600&X-Amz-Signature=58a8f2b96ac72cc6c0d324927a56fd581be13b158d1c9010cb6584eb0972ee72&X-Amz-SignedHeaders=host&X-Xet-Cas-Uid=public&response-content-disposition=inline%3B+filename*%3DUTF-8%27%27c4-train.00003-of-01024.json.gz%3B+filename%3D%22c4-train.00003-of-01024.json.gz%22%3B&response-content-type=application%2Fgzip&x-amz-checksum-mode=ENABLED&x-id=GetObject&Expires=1778637180&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTc3ODYzNzE4MH19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2FzLWJyaWRnZS54ZXRodWIuaGYuY28veGV0LWJyaWRnZS11cy82MjFmZmRkMjM2NDY4ZDcwOWYxODJhODAvYTIwZjI1N2FmNGFiNTg1MjZkOGY2NmI1MGQzNjI5OWVlYzQ1YjJlNzlmMjVkOTRjOGNiODE4NjMyMmJmYjg4OCoifV19&Signature=tGVkmZgdUBljaqSRRaSpE1JfKnwcHm-i7KzDO5tffytmqXfvsHEvTtgzXBhxo24EcveU3LW76Sa6504qwt5doDjcBJFgGY3DXEFMQNGS-y4hpzA-DhvinW-wtSRMy0u7bRtmQscCEW1PCybH8GSihaPHfFfJpAPRAZPq3Uw5RLTSEw8J8O4wvThicCWsUkPipYG81Zz62vzJftbxElbPD3nLqxrl204xm0Ct1NX8AAxoz-TFkIySk%7EJ9rQ7rFb1a%7Eo35gsQFaPcGtvlenuqecKd90C4raIKytsJMi0Y4OV4PbS5BXL0sfCAaeF3iXkg7nYUXKBCP-EcmgPTSalqEvg__&Key-Pair-Id=K2L8F4GPSG1IFC "HTTP/1.1 206 Partial Content"
[titan] 2026-05-13 00:53:18,433 - httpx - INFO - HTTP Request: GET https://huggingface.co/datasets/allenai/c4/resolve/1588ec454efa1a09f29cd18ddd04fe05fc8653a2/en/c4-train.00001-of-01024.json.gz "HTTP/1.1 302 Found"
[rank6]:[E513 00:55:00.284720923 ProcessGroupNCCL.cpp:757] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=286809, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=100000) ran for 100004 milliseconds before timing out.
[rank6]:[E513 00:55:00.284950840 ProcessGroupNCCL.cpp:2396] [PG ID 0 PG GUID 0(default_pg) Rank 6]  failure detected by watchdog at work sequence id: 286809 PG status: last enqueued work: 286809, last completed work: 286808
[titan] 2026-05-13 00:53:18,455 - httpx - INFO - HTTP Request: GET https://cas-bridge.xethub.hf.co/xet-bridge-us/621ffdd236468d709f182a80/9f36194e2bf0cd3bd27a023ec7519b36c0dcc5c236eb2695517969d070de9aac?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=cas%2F20260513%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20260513T005316Z&X-Amz-Expires=3600&X-Amz-Signature=032395145b906b87d306358ff7828d5524c43cae4d3a46530d5a809bf4b03517&X-Amz-SignedHeaders=host&X-Xet-Cas-Uid=public&response-content-disposition=inline%3B+filename*%3DUTF-8%27%27c4-train.00001-of-01024.json.gz%3B+filename%3D%22c4-train.00001-of-01024.json.gz%22%3B&response-content-type=application%2Fgzip&x-amz-checksum-mode=ENABLED&x-id=GetObject&Expires=1778637196&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTc3ODYzNzE5Nn19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2FzLWJyaWRnZS54ZXRodWIuaGYuY28veGV0LWJyaWRnZS11cy82MjFmZmRkMjM2NDY4ZDcwOWYxODJhODAvOWYzNjE5NGUyYmYwY2QzYmQyN2EwMjNlYzc1MTliMzZjMGRjYzVjMjM2ZWIyNjk1NTE3OTY5ZDA3MGRlOWFhYyoifV19&Signature=OmTjE1fZkM%7EmdyuZInzJzcRDuBWrwXRvZ7liAFGX%7ERbNJj5Jp%7ESNj-mQUjvm0S9ns2RjUP7zBj5HzAnd-IltmpwDGr1%7E0YQy04hSWf96aFJDNcSatna9e7PTHnqo4MeQ9ARuJd9dnTe9CRaptyH82l6xED4Rrqry-Yr2byMDE-6ZuJiUKL1Yc4KvQmaBPtSpXXXnbhbmqe2y84q8fOee5bPcJM0NkyQmiS0gC0QkZvEoHfftDdP3pHP%7EYqq38sAKx82y-8KkOQ5s4j4Yg%7EdHEpaT%7E1G%7Em7xXXmawDMWDRfg7JWln3m-q2yGEgdLVwWLZCAF7E36cM4fpSDeK-aHUjA__&Key-Pair-Id=K2L8F4GPSG1IFC "HTTP/1.1 206 Partial Content"
[rank6]:[E513 00:55:00.285704524 ProcessGroupNCCL.cpp:801] Stack trace of the failed collective: 
#0 redispatch from /home/zeus/miniconda3/envs/cloudspace/lib/python3.12/site-packages/torch/_ops.py:878
#1 forward_no_grad from /home/zeus/miniconda3/envs/cloudspace/lib/python3.12/site-packages/torch/_library/autograd.py:41
#2 autograd_impl from /home/zeus/miniconda3/envs/cloudspace/lib/python3.12/site-packages/torch/_library/autograd.py:112
#3 __call__ from /home/zeus/miniconda3/envs/cloudspace/lib/python3.12/site-packages/torch/_ops.py:1275
#4 all_reduce from /home/zeus/miniconda3/envs/cloudspace/lib/python3.12/site-packages/torch/distributed/_functional_collectives.py:179
#5 _dist_reduce from /teamspace/studios/this_studio/torchtitan/torchtitan/distributed/utils.py:72
#6 dist_sum from /teamspace/studios/this_studio/torchtitan/torchtitan/distributed/utils.py:91
#7 train_step from /teamspace/studios/this_studio/torchtitan/torchtitan/trainer.py:720
#8 train from /teamspace/studios/this_studio/torchtitan/torchtitan/trainer.py:831
#9 wrapper from /home/zeus/miniconda3/envs/cloudspace/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py:367
#10 main from /teamspace/studios/this_studio/torchtitan/torchtitan/train.py:65
#11 <module> from /teamspace/studios/this_studio/torchtitan/torchtitan/train.py:79
#12 _run_code from <frozen runpy>:88
#13 _run_module_as_main from <frozen runpy>:198
....
[titan] 07:13:30 - 'The read operation timed out' thrown while requesting 
GET .../c4-train.00001-of-01024.json.gz
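
From the logs above, it looks like one rank stalls on a shard download while the others wait in the all-reduce until the 100000 ms watchdog timeout fires. One possible mitigation, as a minimal sketch only and not torchtitan's actual dataloader code, is to wrap the streaming iterator so that a transient read error re-creates the stream and resumes from the number of samples already consumed. The names `resilient_iter` and `make_iterable` are hypothetical, and the exception types to retry on are an assumption that would need to match whatever the HTTP client really raises.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple, Type, TypeVar

T = TypeVar("T")


def resilient_iter(
    make_iterable: Callable[[int], Iterable[T]],
    max_retries: int = 5,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (OSError, TimeoutError),
) -> Iterator[T]:
    """Yield items from make_iterable(start), re-creating it after transient errors.

    make_iterable(n) is a hypothetical factory that must return an iterable
    positioned after the first n items, so nothing is yielded twice.
    """
    consumed = 0   # items successfully handed to the caller so far
    failures = 0   # consecutive failures since the last successful item
    while True:
        try:
            for item in make_iterable(consumed):
                consumed += 1
                failures = 0
                yield item
            return  # the underlying stream is exhausted
        except retry_on:
            failures += 1
            if failures > max_retries:
                raise
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            time.sleep(base_delay * 2 ** (failures - 1))
```

With a streaming dataset `ds`, the factory could be something like `lambda n: ds.skip(n)`, assuming that skipping is acceptable for the workload (it may re-read data from the start of the stream). The total retry time on one rank still has to stay below the collective timeout, otherwise the other ranks hit the same watchdog error.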

Here are the full logs

Versions

⚡ main ~/torchtitan python -V
Python 3.12.11
⚡ main ~/torchtitan python
Python 3.12.11 | packaged by Anaconda, Inc. | (main, Jun  5 2025, 13:09:17) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch; torch.__version__
'2.13.0.dev20260512+cu130'
>>> import torchtitan; torchtitan.__version__
'0.2.2'
>>> 
