
[BUG] download_file_with_lock crashes multi-GPU training on transient network failures (no retry/timeout) #554

@icenfly

Description

Bug Description

When running scripts.base_train on a multi-GPU setup, training crashes at step 2000 during the intermediate CORE evaluation. The download_file_with_lock function in nanochat/common.py downloads eval_bundle.zip from S3 with a bare urllib.request.urlopen(url) call (see common.py:83 in the traceback below); when a transient DNS resolution error occurs there is no retry logic and no timeout, so the exception propagates and kills the rank.

Because this happens inside a distributed training run (8 GPUs), the failure on a single rank (rank 6) leaves the remaining ranks blocked at the next collective; the NCCL watchdog eventually times out and aborts the whole job, destroying ~3 hours of training progress.

Note that download_single_file in nanochat/dataset.py already implements retry with exponential backoff and timeout for parquet shard downloads. The download_file_with_lock function should follow the same pattern for consistency and robustness.
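To illustrate, the retry-with-backoff pattern could be lifted into a small generic helper and wrapped around the existing urlopen call. A minimal sketch (the helper name with_retries and its parameters are illustrative, not the actual nanochat API):

```python
import time
import urllib.request

def with_retries(fn, max_attempts=5, base_delay=1.0, exceptions=(OSError,)):
    """Call fn(), retrying transient errors with exponential backoff.

    urllib.error.URLError subclasses OSError, so the default `exceptions`
    covers DNS failures like the socket.gaierror in the traceback above.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except exceptions as e:
            if attempt == max_attempts:
                raise  # out of retries: surface the original error
            delay = base_delay * (2 ** (attempt - 1))  # 1s, 2s, 4s, ...
            print(f"Attempt {attempt}/{max_attempts} failed ({e}); retrying in {delay:.0f}s")
            time.sleep(delay)

# In download_file_with_lock, the bare urlopen(url) call could then become
# (the timeout bounds each attempt so a stalled connection cannot hang a rank):
#   response = with_retries(lambda: urllib.request.urlopen(url, timeout=30))
```

A bounded per-attempt timeout matters as much as the retries here: without it, a half-open connection would still stall the rank past the NCCL watchdog limit even if DNS recovers.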

System Config

OS: Linux (CentOS-based)
GPU: 8x NVIDIA A100-SXM4-80GB
PyTorch: built with NCCL 2.27.5, CUDA 12.9

Reproduction Command

torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- \
  --depth=26 --target-param-data-ratio=8.25 --device-batch-size=16

Logs

step 01998/07226 (27.65%) | loss: 2.793130 | lrm: 1.00 | dt: 5537.57ms | tok/sec: 189,356 | bf16_mfu: 46.92 | epoch: 1 | total time: 184.03m | eta: 484.0m
step 01999/07226 (27.66%) | loss: 2.782072 | lrm: 1.00 | dt: 5534.80ms | tok/sec: 189,451 | bf16_mfu: 46.95 | epoch: 1 | total time: 184.12m | eta: 483.9m
Step 02000 | Validation bpb: 0.835379
Downloading https://karpathy-public.s3.us-west-2.amazonaws.com/eval_bundle.zip...
Downloading https://karpathy-public.s3.us-west-2.amazonaws.com/eval_bundle.zip...
[rank6]: Traceback (most recent call last):
[rank6]:   File "/opt/conda/envs/python3.10/lib/python3.10/urllib/request.py", line 1348, in do_open
[rank6]:     h.request(req.get_method(), req.selector, req.data, headers,
[rank6]:   File "/opt/conda/envs/python3.10/lib/python3.10/http/client.py", line 1283, in request
[rank6]:     self._send_request(method, url, body, headers, encode_chunked)
[rank6]:   File "/opt/conda/envs/python3.10/lib/python3.10/http/client.py", line 1329, in _send_request
[rank6]:     self.endheaders(body, encode_chunked=encode_chunked)
[rank6]:   File "/opt/conda/envs/python3.10/lib/python3.10/http/client.py", line 1278, in endheaders
[rank6]:     self._send_output(message_body, encode_chunked=encode_chunked)
[rank6]:   File "/opt/conda/envs/python3.10/lib/python3.10/http/client.py", line 1038, in _send_output
[rank6]:     self.send(msg)
[rank6]:   File "/opt/conda/envs/python3.10/lib/python3.10/http/client.py", line 976, in send
[rank6]:     self.connect()
[rank6]:   File "/opt/conda/envs/python3.10/lib/python3.10/http/client.py", line 1448, in connect
[rank6]:     super().connect()
[rank6]:   File "/opt/conda/envs/python3.10/lib/python3.10/http/client.py", line 942, in connect
[rank6]:     self.sock = self._create_connection(
[rank6]:   File "/opt/conda/envs/python3.10/lib/python3.10/socket.py", line 836, in create_connection
[rank6]:     for res in getaddrinfo(host, port, 0, SOCK_STREAM):
[rank6]:   File "/opt/conda/envs/python3.10/lib/python3.10/socket.py", line 967, in getaddrinfo
[rank6]:     for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
[rank6]: socket.gaierror: [Errno -2] Name or service not known

[rank6]: During handling of the above exception, another exception occurred:

[rank6]: Traceback (most recent call last):
[rank6]:   File "/opt/conda/envs/python3.10/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank6]:     return _run_code(code, main_globals, None,
[rank6]:   File "/opt/conda/envs/python3.10/lib/python3.10/runpy.py", line 86, in _run_code
[rank6]:     exec(code, run_globals)
[rank6]:   File "/home/admin/workspace/aop_lab/nanochat/scripts/base_train.py", line 428, in <module>
[rank6]:     results = evaluate_core(orig_model, tokenizer, device, max_per_task=args.core_metric_max_per_task)
[rank6]:   File "/home/admin/workspace/aop_lab/nanochat/scripts/base_eval.py", line 118, in evaluate_core
[rank6]:     download_file_with_lock(EVAL_BUNDLE_URL, "eval_bundle.zip", postprocess_fn=place_eval_bundle)
[rank6]:   File "/home/admin/workspace/aop_lab/nanochat/nanochat/common.py", line 83, in download_file_with_lock
[rank6]:     with urllib.request.urlopen(url) as response:
[rank6]:   File "/opt/conda/envs/python3.10/lib/python3.10/urllib/request.py", line 216, in urlopen
[rank6]:     return opener.open(url, data, timeout)
[rank6]:   File "/opt/conda/envs/python3.10/lib/python3.10/urllib/request.py", line 519, in open
[rank6]:     response = self._open(req, data)
[rank6]:   File "/opt/conda/envs/python3.10/lib/python3.10/urllib/request.py", line 536, in _open
[rank6]:     result = self._call_chain(self.handle_open, protocol, protocol +
[rank6]:   File "/opt/conda/envs/python3.10/lib/python3.10/urllib/request.py", line 496, in _call_chain
[rank6]:     result = func(*args)
[rank6]:   File "/opt/conda/envs/python3.10/lib/python3.10/urllib/request.py", line 1391, in https_open
[rank6]:     return self.do_open(http.client.HTTPSConnection, req,
[rank6]:   File "/opt/conda/envs/python3.10/lib/python3.10/urllib/request.py", line 1351, in do_open
[rank6]:     raise URLError(err)
[rank6]: urllib.error.URLError: <urlopen error [Errno -2] Name or service not known>
hippo-033099186168:804528:1095928 [6] NCCL INFO misc/socket.cc:64 -> 3
hippo-033099186168:804528:1095928 [6] NCCL INFO misc/socket.cc:81 -> 3
hippo-033099186168:804528:1095928 [6] NCCL INFO misc/socket.cc:863 -> 3
hippo-033099186168:804528:1095928 [6] NCCL INFO misc/socket.cc:64 -> 3
hippo-033099186168:804528:1095928 [6] NCCL INFO misc/socket.cc:81 -> 3
hippo-033099186168:804528:1095928 [6] NCCL INFO misc/socket.cc:863 -> 3
hippo-033099186168:804528:805225 [6] NCCL INFO misc/socket.cc:915 -> 3
hippo-033099186168:804528:1095928 [6] NCCL INFO misc/socket.cc:64 -> 3
hippo-033099186168:804528:1095928 [6] NCCL INFO misc/socket.cc:81 -> 3
hippo-033099186168:804528:1095928 [6] NCCL INFO misc/socket.cc:863 -> 3
[rank6]:[E222 18:40:45.627415948 ProcessGroupNCCL.cpp:1362] [PG ID 0 PG GUID 0(default_pg) Rank 6] Future for ProcessGroup abort timed out after 600000 ms
Downloaded to /home/admin/.cache/nanochat/eval_bundle.zip
Evaluating: hellaswag_zeroshot (0-shot, type: multiple_choice)... [rank6]:[E222 18:45:14.512389158 ProcessGroupNCCL.cpp:1858] [PG ID 0 PG GUID 0(default_pg) Rank 6] ProcessGroupNCCL's watchdog got stuck for 480 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API (e.g., CudaEventDestroy) hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api (for example, CudaEventDestroy), or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. 
[rank6]:[E222 18:45:14.512676039 ProcessGroupNCCL.cpp:1575] [PG ID 0 PG GUID 0(default_pg) Rank 6] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1
[rank6]:[F222 18:53:14.537423957 ProcessGroupNCCL.cpp:1600] [PG ID 0 PG GUID 0(default_pg) Rank 6] [PG ID 0 PG GUID 0(default_pg) Rank 6] Terminating the process after attempting to dump debug info, due to ProcessGroupNCCL watchdog hang.
W0222 18:53:44.141000 804364 .venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 804521 closing signal SIGTERM
W0222 18:53:44.148000 804364 .venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 804522 closing signal SIGTERM
W0222 18:53:44.149000 804364 .venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 804523 closing signal SIGTERM
W0222 18:53:44.177000 804364 .venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 804524 closing signal SIGTERM
W0222 18:53:44.178000 804364 .venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 804526 closing signal SIGTERM
W0222 18:53:44.179000 804364 .venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 804527 closing signal SIGTERM
W0222 18:53:44.201000 804364 .venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 804529 closing signal SIGTERM
E0222 18:53:49.627000 804364 .venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:882] failed (exitcode: -6) local_rank: 6 (pid: 804528) of binary: /home/admin/workspace/aop_lab/nanochat/.venv/bin/python3
Traceback (most recent call last):
  File "/home/admin/workspace/aop_lab/nanochat/.venv/bin/torchrun", line 10, in <module>
    sys.exit(main())
  File "/home/admin/workspace/aop_lab/nanochat/.venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
    return f(*args, **kwargs)
  File "/home/admin/workspace/aop_lab/nanochat/.venv/lib/python3.10/site-packages/torch/distributed/run.py", line 936, in main
    run(args)
  File "/home/admin/workspace/aop_lab/nanochat/.venv/lib/python3.10/site-packages/torch/distributed/run.py", line 927, in run
    elastic_launch(
  File "/home/admin/workspace/aop_lab/nanochat/.venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 156, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/admin/workspace/aop_lab/nanochat/.venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 293, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=======================================================
scripts.base_train FAILED
-------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2026-02-22_18:53:44
  host      : hippo-033099186168.na61
  rank      : 6 (local_rank: 6)
  exitcode  : -6 (pid: 804528)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 804528
=======================================================
