Bug description
Hey,
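I was trying torchtitan, and streaming the dataset from Hugging Face seems somewhat flaky. The run below stopped logging after step 2690 while fetching C4 shards, and rank 6 then hit an NCCL all-reduce watchdog timeout.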
torchrun --nnodes 1 --nproc_per_node 8 \
--rdzv_backend c10d --rdzv_endpoint localhost:29500 \
-m torchtitan.train \
--module llama3 --config llama3_8b \
--hf-assets-path ./assets/hf/Meta-Llama-3-8B \
--dump-folder ./outputs/llama3_8b_dayrun \
--training.local-batch-size 2 \
--training.steps 500000 \
--checkpoint.enable \
--checkpoint.interval 2000 \
--checkpoint.keep-latest-k 3 \
--metrics.log-freq 10 \
--profiler.profile-freq 1000000 \
2>&1 | tee outputs/llama3_8b_dayrun.log
A portion of the logs:
[titan] 2026-05-13 00:52:16,792 - root - INFO - step: 2690 loss: 3.76196 grad_norm: 0.2823 memory: 59.24GiB(74.81%) tps: 2,059 tflops: 119.27 mfu: 12.06%
[titan] 2026-05-13 00:52:16,792 - root - INFO - step: 2690 loss: 3.76196 grad_norm: 0.2823 memory: 59.24GiB(74.81%) tps: 2,059 tflops: 119.27 mfu: 12.06%
[titan] 2026-05-13 00:52:16,792 - root - INFO - step: 2690 loss: 3.76196 grad_norm: 0.2823 memory: 59.24GiB(74.82%) tps: 2,059 tflops: 119.27 mfu: 12.06%
[titan] 2026-05-13 00:52:46,310 - httpx - INFO - HTTP Request: GET https://huggingface.co/datasets/allenai/c4/resolve/1588ec454efa1a09f29cd18ddd04fe05fc8653a2/en/c4-train.00004-of-01024.json.gz "HTTP/1.1 302 Found"
[titan] 2026-05-13 00:52:46,337 - httpx - INFO - HTTP Request: GET https://cas-bridge.xethub.hf.co/xet-bridge-us/621ffdd236468d709f182a80/732ea51359e83c755ff7a289603bc934a5d50b7220002769b99b42f17034a706?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=cas%2F20260513%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20260513T005244Z&X-Amz-Expires=3600&X-Amz-Signature=27ad0ca740a05012b4b0ec636310a1518a0123bfe87cd68ff94c87edaf70e226&X-Amz-SignedHeaders=host&X-Xet-Cas-Uid=public&response-content-disposition=inline%3B+filename*%3DUTF-8%27%27c4-train.00004-of-01024.json.gz%3B+filename%3D%22c4-train.00004-of-01024.json.gz%22%3B&response-content-type=application%2Fgzip&x-amz-checksum-mode=ENABLED&x-id=GetObject&Expires=1778637164&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTc3ODYzNzE2NH19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2FzLWJyaWRnZS54ZXRodWIuaGYuY28veGV0LWJyaWRnZS11cy82MjFmZmRkMjM2NDY4ZDcwOWYxODJhODAvNzMyZWE1MTM1OWU4M2M3NTVmZjdhMjg5NjAzYmM5MzRhNWQ1MGI3MjIwMDAyNzY5Yjk5YjQyZjE3MDM0YTcwNioifV19&Signature=MyVroYh5%7Ey8Oj4QOXxFBbDDO%7ERurPF-8e40ifqT-4waugwcS-DEr6I5g80y1KESxdY9ueQUX7Dk4JE%7EbMZj4MFyl-XcoPGpGaJ-safUBGxpOwp3m53MI4auKzaAbjWt5pRDNSE9gWy3z5zMsMiqpBkUmExPiN5FL%7E7dxMGiSYR6TU1d7DE9h-2O83QEHsafVFkL-qWQhyzA%7EuZ1pjx9EYd0LC8yEILfPYUUx-YxKiRMN98YY9bLAA2OsTmHn1PmhVAC5k9cvzBreK0eiYpEfiNH%7Eb-Jcpaws04ckRwQ33REHyoo9zYWPgoHSPX8iWV%7EfFFPVfN%7EgTvU0thlZsNtUTQ__&Key-Pair-Id=K2L8F4GPSG1IFC "HTTP/1.1 206 Partial Content"
[titan] 2026-05-13 00:52:54,343 - httpx - INFO - HTTP Request: GET https://huggingface.co/datasets/allenai/c4/resolve/1588ec454efa1a09f29cd18ddd04fe05fc8653a2/en/c4-train.00000-of-01024.json.gz "HTTP/1.1 302 Found"
[titan] 2026-05-13 00:52:54,363 - httpx - INFO - HTTP Request: GET https://cas-bridge.xethub.hf.co/xet-bridge-us/621ffdd236468d709f182a80/6656aac078668b0d5feb1cfb7b04def05bff73e7e375d0a66e252827336ac1a8?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=cas%2F20260513%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20260513T004925Z&X-Amz-Expires=3600&X-Amz-Signature=55f6fa3ad70854c3d377f1c9153c91d4450e766c3178ce2e3ef8f44a025007a0&X-Amz-SignedHeaders=host&X-Xet-Cas-Uid=public&response-content-disposition=inline%3B+filename*%3DUTF-8%27%27c4-train.00000-of-01024.json.gz%3B+filename%3D%22c4-train.00000-of-01024.json.gz%22%3B&response-content-type=application%2Fgzip&x-amz-checksum-mode=ENABLED&x-id=GetObject&Expires=1778636965&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTc3ODYzNjk2NX19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2FzLWJyaWRnZS54ZXRodWIuaGYuY28veGV0LWJyaWRnZS11cy82MjFmZmRkMjM2NDY4ZDcwOWYxODJhODAvNjY1NmFhYzA3ODY2OGIwZDVmZWIxY2ZiN2IwNGRlZjA1YmZmNzNlN2UzNzVkMGE2NmUyNTI4MjczMzZhYzFhOCoifV19&Signature=HI-kcMay-3YI3pLH55fnYAXLAKJtkwZTtWgd5fRIJjUQ8FLcWIN01WXR420dYdJYSBr1tyy2SCdRzPMhxcAr4hfSwbjKdSxMnzL7%7EukvSy7GQ51hW%7E4wHfOtiYRLK4yHzkcsg1Dhat2xxiVm6UpqO64b-emeKwBTagsug3-DfylVd2h5FThBUpeLeY8WoSTdbucv5lJWClzBKpmQVaiJLYwFH7HUzgN-AD0K82FRpbghs43HlaKYwZWfs09lDw2Arm9K0UNQJ-55TjVuvwYkKbawl97D4M1XE1WQUB4fmwoSFFiq2h2NGNzjuPKHHAfR1b6Z1r4E8jA0kAHiZ-0kKw__&Key-Pair-Id=K2L8F4GPSG1IFC "HTTP/1.1 206 Partial Content"
[titan] 2026-05-13 00:53:02,245 - httpx - INFO - HTTP Request: GET https://huggingface.co/datasets/allenai/c4/resolve/1588ec454efa1a09f29cd18ddd04fe05fc8653a2/en/c4-train.00002-of-01024.json.gz "HTTP/1.1 302 Found"
[titan] 2026-05-13 00:53:02,266 - httpx - INFO - HTTP Request: GET https://cas-bridge.xethub.hf.co/xet-bridge-us/621ffdd236468d709f182a80/96adb94376f42df2255c7ad2fa02604236304fc487f7bb89b8bdee36a126f298?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=cas%2F20260513%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20260513T005300Z&X-Amz-Expires=3600&X-Amz-Signature=6ddfb69b71e8133426f4de4db704764f89320cb83866b15f1d5ff1ffa5dc681a&X-Amz-SignedHeaders=host&X-Xet-Cas-Uid=public&response-content-disposition=inline%3B+filename*%3DUTF-8%27%27c4-train.00002-of-01024.json.gz%3B+filename%3D%22c4-train.00002-of-01024.json.gz%22%3B&response-content-type=application%2Fgzip&x-amz-checksum-mode=ENABLED&x-id=GetObject&Expires=1778637180&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTc3ODYzNzE4MH19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2FzLWJyaWRnZS54ZXRodWIuaGYuY28veGV0LWJyaWRnZS11cy82MjFmZmRkMjM2NDY4ZDcwOWYxODJhODAvOTZhZGI5NDM3NmY0MmRmMjI1NWM3YWQyZmEwMjYwNDIzNjMwNGZjNDg3ZjdiYjg5YjhiZGVlMzZhMTI2ZjI5OCoifV19&Signature=jPtGBvNU7LIuvkYZJEy%7E09TOoomUJ1DC%7ECDFV3sLuAuL9kSML%7E2l-9tDwU-J0pm2xWzfToi6VpMv5aXuS4k5bRQPsoLfmAi6aspirk-KK7mRZeIgzwQ0cPT-5WDaC19-ONkcKSLk8-8F7ed-jKJqDleeXY7ryINQh1CkKSVujHw5fve3guerQeIqc5bLNoMayVLGdrO91T5pz7fp%7EQ9xWy1njxrjl3fRmaUJm42IWQhjQCOtnJY9hqRi1j0KwX2p2IQ2ZlGRTPN26NF2wWzEmk8livOhdv0f2UvdK1tOFrYw6xtytmmBm3O7oORvpKriyXNKqNkf2shQXSYowhmebg__&Key-Pair-Id=K2L8F4GPSG1IFC "HTTP/1.1 206 Partial Content"
[titan] 2026-05-13 00:53:02,269 - httpx - INFO - HTTP Request: GET https://huggingface.co/datasets/allenai/c4/resolve/1588ec454efa1a09f29cd18ddd04fe05fc8653a2/en/c4-train.00003-of-01024.json.gz "HTTP/1.1 302 Found"
[titan] 2026-05-13 00:53:02,295 - httpx - INFO - HTTP Request: GET https://cas-bridge.xethub.hf.co/xet-bridge-us/621ffdd236468d709f182a80/a20f257af4ab58526d8f66b50d36299eec45b2e79f25d94c8cb8186322bfb888?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=cas%2F20260513%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20260513T005300Z&X-Amz-Expires=3600&X-Amz-Signature=58a8f2b96ac72cc6c0d324927a56fd581be13b158d1c9010cb6584eb0972ee72&X-Amz-SignedHeaders=host&X-Xet-Cas-Uid=public&response-content-disposition=inline%3B+filename*%3DUTF-8%27%27c4-train.00003-of-01024.json.gz%3B+filename%3D%22c4-train.00003-of-01024.json.gz%22%3B&response-content-type=application%2Fgzip&x-amz-checksum-mode=ENABLED&x-id=GetObject&Expires=1778637180&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTc3ODYzNzE4MH19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2FzLWJyaWRnZS54ZXRodWIuaGYuY28veGV0LWJyaWRnZS11cy82MjFmZmRkMjM2NDY4ZDcwOWYxODJhODAvYTIwZjI1N2FmNGFiNTg1MjZkOGY2NmI1MGQzNjI5OWVlYzQ1YjJlNzlmMjVkOTRjOGNiODE4NjMyMmJmYjg4OCoifV19&Signature=tGVkmZgdUBljaqSRRaSpE1JfKnwcHm-i7KzDO5tffytmqXfvsHEvTtgzXBhxo24EcveU3LW76Sa6504qwt5doDjcBJFgGY3DXEFMQNGS-y4hpzA-DhvinW-wtSRMy0u7bRtmQscCEW1PCybH8GSihaPHfFfJpAPRAZPq3Uw5RLTSEw8J8O4wvThicCWsUkPipYG81Zz62vzJftbxElbPD3nLqxrl204xm0Ct1NX8AAxoz-TFkIySk%7EJ9rQ7rFb1a%7Eo35gsQFaPcGtvlenuqecKd90C4raIKytsJMi0Y4OV4PbS5BXL0sfCAaeF3iXkg7nYUXKBCP-EcmgPTSalqEvg__&Key-Pair-Id=K2L8F4GPSG1IFC "HTTP/1.1 206 Partial Content"
[titan] 2026-05-13 00:53:18,433 - httpx - INFO - HTTP Request: GET https://huggingface.co/datasets/allenai/c4/resolve/1588ec454efa1a09f29cd18ddd04fe05fc8653a2/en/c4-train.00001-of-01024.json.gz "HTTP/1.1 302 Found"
[rank6]:[E513 00:55:00.284720923 ProcessGroupNCCL.cpp:757] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=286809, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=100000) ran for 100004 milliseconds before timing out.
[rank6]:[E513 00:55:00.284950840 ProcessGroupNCCL.cpp:2396] [PG ID 0 PG GUID 0(default_pg) Rank 6] failure detected by watchdog at work sequence id: 286809 PG status: last enqueued work: 286809, last completed work: 286808
[titan] 2026-05-13 00:53:18,455 - httpx - INFO - HTTP Request: GET https://cas-bridge.xethub.hf.co/xet-bridge-us/621ffdd236468d709f182a80/9f36194e2bf0cd3bd27a023ec7519b36c0dcc5c236eb2695517969d070de9aac?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=cas%2F20260513%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20260513T005316Z&X-Amz-Expires=3600&X-Amz-Signature=032395145b906b87d306358ff7828d5524c43cae4d3a46530d5a809bf4b03517&X-Amz-SignedHeaders=host&X-Xet-Cas-Uid=public&response-content-disposition=inline%3B+filename*%3DUTF-8%27%27c4-train.00001-of-01024.json.gz%3B+filename%3D%22c4-train.00001-of-01024.json.gz%22%3B&response-content-type=application%2Fgzip&x-amz-checksum-mode=ENABLED&x-id=GetObject&Expires=1778637196&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTc3ODYzNzE5Nn19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2FzLWJyaWRnZS54ZXRodWIuaGYuY28veGV0LWJyaWRnZS11cy82MjFmZmRkMjM2NDY4ZDcwOWYxODJhODAvOWYzNjE5NGUyYmYwY2QzYmQyN2EwMjNlYzc1MTliMzZjMGRjYzVjMjM2ZWIyNjk1NTE3OTY5ZDA3MGRlOWFhYyoifV19&Signature=OmTjE1fZkM%7EmdyuZInzJzcRDuBWrwXRvZ7liAFGX%7ERbNJj5Jp%7ESNj-mQUjvm0S9ns2RjUP7zBj5HzAnd-IltmpwDGr1%7E0YQy04hSWf96aFJDNcSatna9e7PTHnqo4MeQ9ARuJd9dnTe9CRaptyH82l6xED4Rrqry-Yr2byMDE-6ZuJiUKL1Yc4KvQmaBPtSpXXXnbhbmqe2y84q8fOee5bPcJM0NkyQmiS0gC0QkZvEoHfftDdP3pHP%7EYqq38sAKx82y-8KkOQ5s4j4Yg%7EdHEpaT%7E1G%7Em7xXXmawDMWDRfg7JWln3m-q2yGEgdLVwWLZCAF7E36cM4fpSDeK-aHUjA__&Key-Pair-Id=K2L8F4GPSG1IFC "HTTP/1.1 206 Partial Content"
[rank6]:[E513 00:55:00.285704524 ProcessGroupNCCL.cpp:801] Stack trace of the failed collective:
#0 redispatch from /home/zeus/miniconda3/envs/cloudspace/lib/python3.12/site-packages/torch/_ops.py:878
#1 forward_no_grad from /home/zeus/miniconda3/envs/cloudspace/lib/python3.12/site-packages/torch/_library/autograd.py:41
#2 autograd_impl from /home/zeus/miniconda3/envs/cloudspace/lib/python3.12/site-packages/torch/_library/autograd.py:112
#3 __call__ from /home/zeus/miniconda3/envs/cloudspace/lib/python3.12/site-packages/torch/_ops.py:1275
#4 all_reduce from /home/zeus/miniconda3/envs/cloudspace/lib/python3.12/site-packages/torch/distributed/_functional_collectives.py:179
#5 _dist_reduce from /teamspace/studios/this_studio/torchtitan/torchtitan/distributed/utils.py:72
#6 dist_sum from /teamspace/studios/this_studio/torchtitan/torchtitan/distributed/utils.py:91
#7 train_step from /teamspace/studios/this_studio/torchtitan/torchtitan/trainer.py:720
#8 train from /teamspace/studios/this_studio/torchtitan/torchtitan/trainer.py:831
#9 wrapper from /home/zeus/miniconda3/envs/cloudspace/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py:367
#10 main from /teamspace/studios/this_studio/torchtitan/torchtitan/train.py:65
#11 <module> from /teamspace/studios/this_studio/torchtitan/torchtitan/train.py:79
#12 _run_code from <frozen runpy>:88
#13 _run_module_as_main from <frozen runpy>:198
....
[titan] 07:13:30 - 'The read operation timed out' thrown while requesting
GET .../c4-train.00001-of-01024.json.gz
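A rough timing check on the excerpt above, as a sketch only. It assumes the httpx and NCCL watchdog lines share one clock and that "ran for 100004 milliseconds" is measured from when the all-reduce was enqueued; neither is confirmed here. Under those assumptions the collective was enqueued a little under two seconds after the last logged request for c4-train.00001, which is consistent with (but not proof of) one rank blocking on that shard.

```python
from datetime import datetime, timedelta

# Copied from the rank 6 watchdog line; nanoseconds truncated to microseconds.
fired_at = datetime.strptime("00:55:00.284720", "%H:%M:%S.%f")
ran_for = timedelta(milliseconds=100004)  # "ran for 100004 milliseconds"

# Last 302 redirect logged for c4-train.00001 before the watchdog fired.
last_fetch = datetime.strptime("00:53:18,433", "%H:%M:%S,%f")

# Assumed enqueue time of the all-reduce, and its distance from the last fetch.
enqueued_at = fired_at - ran_for
gap = enqueued_at - last_fetch

print("all-reduce enqueued at:", enqueued_at.time())
print("seconds after last shard request:", gap.total_seconds())
```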
Here are the full logs
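For discussion, here is a minimal sketch of the kind of retry wrapper that could keep a transient read timeout from stalling one rank. It is not torchtitan's or Hugging Face's actual code. `resilient_iter` and `make_iter` are hypothetical names: `make_iter(offset)` is assumed to reopen the stream starting at the given sample offset. With Hugging Face `datasets` something like `lambda n: ds.skip(n)` might fill that role, but I have not checked how expensive that is on a streamed, sharded dataset. The exception types to retry on are also an assumption and would need to match whatever the HTTP client actually raises.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple, Type, TypeVar

T = TypeVar("T")


def resilient_iter(
    make_iter: Callable[[int], Iterable[T]],
    max_retries: int = 5,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (TimeoutError, ConnectionError),
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[T]:
    """Yield items from make_iter(offset), reopening the stream on transient errors."""
    consumed = 0  # samples already handed to the caller
    failures = 0  # consecutive failures since the last successful sample
    while True:
        try:
            for item in make_iter(consumed):
                consumed += 1
                failures = 0
                yield item
            return  # stream finished normally
        except retry_on:
            failures += 1
            if failures > max_retries:
                raise
            # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** (failures - 1))
```

Usage would look like `for sample in resilient_iter(open_stream): ...`, where `open_stream` is whatever reopens the dataset at an offset. Note that backoff only helps if the total wait stays below the collective timeout (100000 ms in the log above); otherwise the other ranks still time out while this one is retrying.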
Versions
⚡ main ~/torchtitan python -V
Python 3.12.11
⚡ main ~/torchtitan python
Python 3.12.11 | packaged by Anaconda, Inc. | (main, Jun 5 2025, 13:09:17) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch; torch.__version__
'2.13.0.dev20260512+cu130'
>>> import torchtitan; torchtitan.__version__
'0.2.2'
>>>