
Conversation

@lukestanley (Contributor) commented Oct 14, 2025

  • Updates speedrun.sh to select the Torch package index based on an NVIDIA GPU check:
    the Torch CPU package index is used if nvidia-smi does not indicate an NVIDIA GPU.

  • Device-conditional dtype changes.

  • Added a link in the README to the very useful DeepWiki page for the nanochat repo.
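The GPU check and index selection described above can be sketched in Python. This is a minimal sketch, not the PR's actual implementation (which lives in speedrun.sh); the function name and the index URLs here are illustrative assumptions:

```python
import shutil
import subprocess

def has_nvidia_gpu() -> bool:
    """Return True if nvidia-smi is on PATH and reports at least one GPU."""
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"],  # lists detected GPUs, one per line
            capture_output=True, text=True, timeout=10,
        )
    except (OSError, subprocess.SubprocessError):
        return False
    return result.returncode == 0 and "GPU" in result.stdout

# Fall back to the CPU-only wheel index when no NVIDIA GPU is detected.
# (Index URLs are illustrative; the real script may pin a different CUDA version.)
torch_index = (
    "https://download.pytorch.org/whl/cu121" if has_nvidia_gpu()
    else "https://download.pytorch.org/whl/cpu"
)
```

The same check works on machines without the NVIDIA driver installed, since a missing nvidia-smi binary simply selects the CPU index.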

$ python -m scripts.base_train --depth=2 --device_batch_size=1 --total_batch_size=2048 --num_iterations=1 --eval_every=1 --eval_tokens=2048 --core_metric_every=100000 --sample_every=100000

                                                   █████                 █████
                                                  ░░███                 ░░███
 ████████    ██████   ████████    ██████   ██████  ░███████    ██████   ███████
░░███░░███  ░░░░░███ ░░███░░███  ███░░███ ███░░███ ░███░░███  ░░░░░███ ░░░███░
 ░███ ░███   ███████  ░███ ░███ ░███ ░███░███ ░░░  ░███ ░███   ███████   ░███
 ░███ ░███  ███░░███  ░███ ░███ ░███ ░███░███  ███ ░███ ░███  ███░░███   ░███ ███
 ████ █████░░████████ ████ █████░░██████ ░░██████  ████ █████░░████████  ░░█████
░░░░ ░░░░░  ░░░░░░░░ ░░░░ ░░░░░  ░░░░░░   ░░░░░░  ░░░░ ░░░░░  ░░░░░░░░    ░░░░░

Overriding: depth = 2
Overriding: device_batch_size = 1
Overriding: total_batch_size = 2048
Overriding: num_iterations = 1
Overriding: eval_every = 1
Overriding: eval_tokens = 2048
Overriding: core_metric_every = 100000
Overriding: sample_every = 100000
2025-10-14 00:20:24,167 - nanochat.common - INFO - Distributed world size: 1
/workspaces/nanochat/.venv/lib/python3.10/site-packages/torch/amp/autocast_mode.py:283: UserWarning: In CPU autocast, but the target dtype is not supported. Disabling autocast.
CPU Autocast only supports dtype of torch.bfloat16, torch.float16 currently.
  warnings.warn(error_message)
Vocab size: 34,301
num_layers: 2
model_dim: 128
num_heads: 1
num_kv_heads: 1
Tokens / micro-batch / rank: 1 x 2048 = 2,048
Tokens / micro-batch: 2,048
Total batch size 2,048 => gradient accumulation steps: 1
Number of parameters: 9,174,272
Estimated FLOPs per token: 3.499392e+07
Using user-provided number of iterations: 1
Total number of training tokens: 2,048
Tokens : Params ratio: 0.00
Total training FLOPs estimate: 7.166755e+10
Scaling the LR for the AdamW parameters ∝1/√(128/768) = 2.449490
Step 00000 | Validation bpb: 3.2639
step 00000/00001 (0.00%) | loss: 10.442895 | lrm: 1.00 | dt: 2396.07ms | tok/sec: 854 | mfu: N/A | total time: 0.00m
Step 00001 | Validation bpb: 3.2430
Evaluating: hellaswag_zeroshot (0-shot, type: multiple_choice)... accuracy: 0.2480 | centered: -0.0027 | time: 56.06s
Evaluating: jeopardy (10-shot, type: language_modeling)... accuracy: 0.0000 | centered: 0.0000 | time: 53.14s
Evaluating: bigbench_qa_wikidata (10-shot, type: language_modeling)... accuracy: 0.0000 | centered: 0.0000 | time: 19.92s
Evaluating: arc_easy (10-shot, type: multiple_choice)... accuracy: 0.2660 | centered: 0.0213 | time: 200.33s
Evaluating: arc_challenge (10-shot, type: multiple_choice)... accuracy: 0.2220 | centered: -0.0373 | time: 218.14s
Evaluating: copa (0-shot, type: multiple_choice)... accuracy: 0.4400 | centered: -0.1200 | time: 1.25s
Evaluating: commonsense_qa (10-shot, type: multiple_choice)... accuracy: 0.2120 | centered: 0.0150 | time: 253.94s
Evaluating: piqa (10-shot, type: multiple_choice)... accuracy: 0.4940 | centered: -0.0120 | time: 110.09s
Evaluating: openbook_qa (0-shot, type: multiple_choice)... accuracy: 0.2260 | centered: -0.0320 | time: 10.72s
Evaluating: lambada_openai (0-shot, type: language_modeling)... accuracy: 0.0000 | centered: 0.0000 | time: 11.26s
Evaluating: hellaswag (10-shot, type: multiple_choice)... accuracy: 0.2480 | centered: -0.0027 | time: 459.59s
Evaluating: winograd (0-shot, type: schema)... accuracy: 0.4945 | centered: -0.0110 | time: 4.01s
Evaluating: winogrande (0-shot, type: schema)... accuracy: 0.5260 | centered: 0.0520 | time: 7.76s
Evaluating: bigbench_dyck_languages (10-shot, type: language_modeling)... accuracy: 0.0000 | centered: 0.0000 | time: 83.10s
Evaluating: agi_eval_lsat_ar (3-shot, type: multiple_choice)... accuracy: 0.2261 | centered: 0.0326 | time: 201.11s
Evaluating: bigbench_cs_algorithms (10-shot, type: language_modeling)... accuracy: 0.0000 | centered: 0.0000 | time: 69.72s
Evaluating: bigbench_operators (10-shot, type: language_modeling)... accuracy: 0.0000 | centered: 0.0000 | time: 22.28s
Evaluating: bigbench_repeat_copy_logic (10-shot, type: language_modeling)... accuracy: 0.0000 | centered: 0.0000 | time: 4.13s
Evaluating: squad (10-shot, type: language_modeling)... accuracy: 0.0000 | centered: 0.0000 | time: 276.78s
Evaluating: coqa (0-shot, type: language_modeling)... accuracy: 0.0000 | centered: 0.0000 | time: 71.01s
Evaluating: boolq (10-shot, type: multiple_choice)... accuracy: 0.3720 | centered: -0.6526 | time: 423.89s
Evaluating: bigbench_language_identification (10-shot, type: multiple_choice)... accuracy: 0.2380 | centered: 0.1617 | time: 844.53s
Step 00001 | CORE metric: -0.0267
W1014 01:17:32.378000 59893 .venv/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py:1016] [0/8] torch._dynamo hit config.recompile_limit (8)
W1014 01:17:32.378000 59893 .venv/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py:1016] [0/8]    function: 'forward' (/workspaces/nanochat/nanochat/gpt.py:261)
W1014 01:17:32.378000 59893 .venv/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py:1016] [0/8]    last reason: 0/7: kv_cache.pos == 10                                     
W1014 01:17:32.378000 59893 .venv/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py:1016] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
W1014 01:17:32.378000 59893 .venv/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py:1016] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
<|bos|>The capital of France is seen-mile than a seamless movement total traded shaped local s—’ moment,” recalls former
<|bos|>The chemical symbol of gold is seen-mile than a seamless movement total traded shaped local s—’ moment,” recalls former
<|bos|>If yesterday was Friday, then tomorrow will be subject new journeys tend to grow an important consideration when required their rivals taking only needs
<|bos|>The opposite of hot is seen-mile than a seamless movement total traded shaped local s—’ moment,” recalls former
<|bos|>The planets of the solar system are: at each end move via this article shall prevent km) access year them a seamless
<|bos|>My favorite color is seen-mile than a seamless movement total traded shaped local s—’ moment,” recalls former
<|bos|>If 5*x + 3 = 13, then x is seen-mile than a seamless movement total traded shaped local s—’ moment,” recalls former
2025-10-14 01:17:32,671 - nanochat.checkpoint_manager - INFO - Saved model file to: /home/codespace/.cache/nanochat/base_checkpoints/d2/model_000001.pt
2025-10-14 01:17:32,742 - nanochat.checkpoint_manager - INFO - Saved optimizer file to: /home/codespace/.cache/nanochat/base_checkpoints/d2/optim_000001.pt
2025-10-14 01:17:32,742 - nanochat.checkpoint_manager - INFO - Saved metadata file to: /home/codespace/.cache/nanochat/base_checkpoints/d2/meta_000001.json
Peak memory usage: N/A (CPU run)
Total training time: 0.00m
Minimum validation bpb: 3.2430
$ 
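As a side note, the batch-size arithmetic and the AdamW LR scaling shown in the log above can be reproduced directly. A sketch using the run's own values (variable names here are illustrative, not nanochat's):

```python
import math

# Values from the run above.
device_batch_size = 1     # sequences per micro-batch, per rank
seq_len = 2048            # tokens per sequence
world_size = 1            # single (CPU) process
total_batch_size = 2048   # target tokens per optimizer step

# "Tokens / micro-batch / rank: 1 x 2048 = 2,048" and
# "Total batch size 2,048 => gradient accumulation steps: 1"
tokens_per_micro_batch = device_batch_size * seq_len * world_size
grad_accum_steps = total_batch_size // tokens_per_micro_batch

# "Scaling the LR for the AdamW parameters ∝1/√(128/768) = 2.449490"
model_dim = 128
lr_scale = 1.0 / math.sqrt(model_dim / 768)
print(tokens_per_micro_batch, grad_accum_steps, f"{lr_scale:.6f}")
```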

Well, I'd better sleep now that it has finished!

Thanks for this great end-to-end project, @karpathy!

Makes the Torch index an extra in pyproject.toml; speedrun.sh selects the GPU index if supported, with a CPU fallback.
Coded with the help of the Devin DeepWiki planner and GPT-5 Codex execution in VS Code agent mode.

https://deepwiki.com/search/suggest-how-to-modify-the-code_80cebbfc-0ad0-4b92-addd-2b4210fa9f04
@karpathy (Owner)

I think... you're going to wait a long time. :D

@lukestanley (Contributor, Author)

Haha, the little test finished!
My favorite color is seen-mile than a seamless movement total traded shaped local s—’ moment,” recalls former
Thanks again for the great little end-to-end magic, @karpathy!

LokiMetaSmith referenced this pull request in LokiMetaSmith/nanochat Oct 14, 2025
This change incorporates the changes from pull requests #17 and #21 to add support for CPU-only and macOS environments. It introduces dynamic detection of hardware and data types, and updates the dependency installation process to select the appropriate PyTorch build.
@kbastani

Nice work!

@svlandeg (Collaborator) left a comment

Maintenance update: the edits in this PR seem to be mostly covered by #88, so I suggest closing this one.

@svlandeg closed this Nov 14, 2025


4 participants