
Conversation

@LokiMetaSmith

This change adds support for ROCm and makes the codebase device-agnostic, allowing it to run on different hardware backends including ROCm, CUDA, and CPU.

The key changes are:
- Modified `pyproject.toml` to use ROCm-compatible PyTorch wheels and added the `pytorch-triton-rocm` dependency.
- Refactored `nanochat/common.py` to dynamically detect the available hardware and set the device and distributed backend accordingly (a sketch of this kind of detection follows after this list).
- Updated all training, evaluation, and inference scripts to be device-agnostic, removing hardcoded CUDA references.
- Adapted `speedrun.sh` for single-device execution by replacing `torchrun` with `python`.
- Updated `nanochat/report.py` to provide more generic GPU information.
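
As a rough illustration of the detection described above, a minimal sketch is shown below. The function and variable names are illustrative rather than the PR's actual `common.py` code, and it assumes a PyTorch build where ROCm devices surface through the regular `torch.cuda` API:

```python
import torch

def detect_device_and_backend():
    """Pick a torch device string and a distributed backend for the current host.

    Illustrative sketch: ROCm builds of PyTorch expose devices through the
    regular torch.cuda API, so "cuda" covers both NVIDIA and AMD GPUs.
    """
    if torch.cuda.is_available():
        device_type = "cuda"          # covers CUDA and ROCm (HIP) builds
        backend = "nccl"              # PyTorch maps nccl -> RCCL on ROCm
        is_rocm = torch.version.hip is not None
    else:
        device_type = "cpu"
        backend = "gloo"              # CPU-only distributed backend
        is_rocm = False
    return device_type, backend, is_rocm
```

Note that the backend string stays `"nccl"` even on ROCm; this point comes up again later in the thread.
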
@LokiMetaSmith (Author)

So far it seems to be working.

Running it on a Strix Halo 128GB version.

@LokiMetaSmith (Author) commented Oct 14, 2025

nvm, more work needs to be done.
current issue: nanochat/.venv/lib/python3.10/site-packages/torch/_utils.py:101: UserWarning: expandable_segments not supported on this platform (Triggered internally at /pytorch/c10/hip/HIPAllocatorConfig.h:36.)

google-labs-jules bot and others added 7 commits October 14, 2025 05:47
This commit addresses several runtime errors encountered during the execution of the `speedrun.sh` script and improves the overall configuration of the project.

The key changes are:
- Patched `nanochat/configurator.py` to be more robust by handling flag-like arguments and ignoring unknown arguments. This resolves the `AssertionError`.
- Fixed the argument handling for `chat_eval.py` in `speedrun.sh` to prevent argument parsing errors.
- Updated `pyproject.toml` to correctly define optional dependencies for development.
This commit fixes a `torch.AcceleratorError: HIP error: invalid device function` that occurred during weight initialization on ROCm devices. It also improves the device detection logic to correctly identify and prioritize the ROCm backend.

The key changes are:
- Patched `nanochat/gpt.py` to initialize weights on the CPU before moving them to the target device, which avoids the HIP kernel error.
- Simplified and corrected the device detection logic in `nanochat/common.py` to ensure the ROCm backend is properly selected when available.
This commit adds the `HSA_OVERRIDE_GFX_VERSION` environment variable to the `speedrun.sh` script. This is a workaround to enable support for newer AMD GPU architectures (e.g., gfx1151) that are not yet officially supported in the pre-compiled PyTorch ROCm builds.

This change also includes an update to the `README.md` to explain this workaround to users.
This change adds the `PYTORCH_CUDA_ALLOC_CONF` environment variable to the main `speedrun.sh` execution script.

Setting `expandable_segments:True` is recommended by PyTorch to manage memory more efficiently and prevent fragmentation, addressing a `UserWarning` observed during execution.
Set PYTORCH_CUDA_ALLOC_CONF to prevent memory fragmentation
This commit re-adds the `PYTORCH_CUDA_ALLOC_CONF` environment variable to the training scripts. This setting helps prevent memory fragmentation and is beneficial for both CUDA and ROCm environments. This change was inadvertently removed during a previous refactoring.
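
The weight-initialization workaround described in the second commit above (initialize on the CPU, then move to the target device) can be sketched roughly as follows; the module here is a hypothetical stand-in, not nanochat's actual GPT:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in module; nanochat's real model lives in nanochat/gpt.py.
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))

# Run weight initialization while the parameters are still on the CPU ...
for m in model.modules():
    if isinstance(m, nn.Linear):
        nn.init.normal_(m.weight, mean=0.0, std=0.02)
        nn.init.zeros_(m.bias)

# ... and only then move the fully initialized model to the accelerator.
# This avoids init kernels that may be missing for some ROCm architectures
# ("HIP error: invalid device function").
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
```
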
@jon-hotaisle

Confirmed. Seeing the same issue on a box of MI355x in a recent container: docker.io/rocm/pytorch:rocm7.0.2_ubuntu24.04_py3.12_pytorch_release_2.8.0

[screenshot: 2025-10-15 at 01:58:40]

# For newer AMD GPUs that are not yet officially supported by PyTorch ROCm builds,
# we can override the detected GPU architecture to a compatible one.
# For example, for a gfx1151 GPU, we can use gfx1100 (11.0.0).
export HSA_OVERRIDE_GFX_VERSION=11.0.0

@jammm commented Oct 15, 2025

PyTorch should already support gfx1151 via TheRock. This shouldn't be enabled by default, IMO.
This could also cause issues when running on gfx9, which @jon-hotaisle was running on.

[tool.uv.sources]
torch = [
{ index = "pytorch-cu128" },
{ index = "pytorch-rocm63" },

@jammm commented Oct 15, 2025

ROCm 6.4 is the latest supported one, and there's also ROCm 7. I suggest switching to TheRock wheels for best compatibility https://github.com/ROCm/TheRock/blob/main/RELEASES.md#installing-pytorch-python-packages - these also install ROCm itself as pip wheels, so you don't need to worry about whether the system contains a specific version of ROCm.

EDIT: Currently TheRock wheels are device-specific, so maybe it's best to stick with 6.4/nightly 7.0 wheels from the official pytorch index until TheRock releases with multi-device wheels.

@jammm commented Oct 15, 2025

expandable_segments

It seems like it's not supported properly for ROCm. But this shouldn't be a blocker for initial support I'd say.
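
For what it's worth, one way to keep the allocator tweak without triggering that ROCm warning is to request `expandable_segments` only on non-HIP builds, before the first GPU allocation. A minimal sketch (where this check would live is an assumption, not part of the PR):

```python
import os
import torch

# expandable_segments is a CUDA caching-allocator feature; ROCm builds currently
# warn that it is unsupported, so only request it on non-HIP builds.
# This must run before the first GPU allocation to take effect.
if torch.version.hip is None:
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")
```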

@jammm commented Oct 15, 2025

@LokiMetaSmith I've pushed a fix for the issues that @jon-hotaisle faced in jammm@e3e21e2

  • Use ROCm 6.4
  • Use python 3.12 (maybe this isn't necessary but for some reason uv wasn't able to fully install the env)
  • Remove export HSA_OVERRIDE_GFX_VERSION=11.0.0 (this can be added again with arch detection to see if it's gfx11-based first, as in the sketch after this list, though I highly recommend users use TheRock wheels instead to avoid setting this env var altogether)
  • Use nccl as the distributed backend (see feat: Add ROCm and device-agnostic support #23 (comment))
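
A hedged sketch of that arch-gated override, querying `rocminfo` before torch initializes the HIP runtime (the gfx-string parsing and the 11.0.0 value are assumptions carried over from the speedrun.sh comment above; TheRock wheels remain the cleaner option):

```python
import os
import re
import subprocess

def maybe_override_gfx_version():
    """Set HSA_OVERRIDE_GFX_VERSION only for gfx11-family GPUs that the installed
    PyTorch ROCm wheels don't ship kernels for (e.g. gfx1151).

    Runs before torch/HIP initializes, which is why rocminfo is queried
    instead of torch.cuda.
    """
    try:
        out = subprocess.run(["rocminfo"], capture_output=True, text=True).stdout
    except FileNotFoundError:
        return  # no ROCm stack present, nothing to do
    archs = set(re.findall(r"gfx[0-9a-f]+", out))
    # gfx1151 (Strix Halo) can reuse gfx1100 (RDNA3) kernels via the override.
    if any(a.startswith("gfx115") for a in archs):
        os.environ.setdefault("HSA_OVERRIDE_GFX_VERSION", "11.0.0")

maybe_override_gfx_version()
import torch  # imported only after the env var is (possibly) set
```
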

```python
# Detect hardware
if hasattr(torch.version, 'hip') and torch.version.hip and torch.cuda.is_available():
    device_type = "cuda"  # ROCm uses cuda naming in torch
    backend = "rccl"
```

@jammm commented Oct 15, 2025

This will throw an error, because the backend string should still be nccl for ROCm; PyTorch automatically maps it to RCCL.
I just edited this to nccl on an MI300x node and it works fine.
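
In other words, the same backend string works on both vendors. A minimal sketch of what that corrected distributed init might look like (assuming the usual torchrun-provided environment variables; not the PR's actual code):

```python
import os
import torch
import torch.distributed as dist

# "nccl" is the right backend string on both NVIDIA and AMD builds of PyTorch;
# on ROCm it is transparently backed by RCCL. Passing "rccl" raises an error.
backend = "nccl" if torch.cuda.is_available() else "gloo"

if int(os.environ.get("WORLD_SIZE", "1")) > 1:
    dist.init_process_group(backend=backend)
    if torch.cuda.is_available():
        torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
```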

@LokiMetaSmith (Author) commented Oct 16, 2025

It looks like part of the memory issue is that the Linux kernel doesn't automatically let you assign the full 128GB. I ran this, and it cleared the out-of-memory issue, but there are still other problems.

This is Strix Halo specific.

https://www.reddit.com/r/LocalLLaMA/comments/1mib7l9/pytorch_on_rocm_v650rc_gfx1151_amd_strix_halo/

@jon-hotaisle

Would be good to either add the source to the docs or to the script...

[screenshot: 2025-10-17 at 00:08:04]

@jon-hotaisle

Also compare with the work done here by @indianspeedster: master...indianspeedster:nanochat:master

It runs on MI300x, but not on MI355x.

@jon-hotaisle

[screenshots: 2025-10-17 at 00:15:03 and 00:15:24]

@jammm commented Oct 17, 2025

It looks like part of the memory issue is that the Linux kernel doesn't automatically let you assign the full 128GB. I ran this, and it cleared the out-of-memory issue, but there are still other problems.

This is Strix Halo specific.

https://www.reddit.com/r/LocalLLaMA/comments/1mib7l9/pytorch_on_rocm_v650rc_gfx1151_amd_strix_halo/

I helped build those wheels in the link. They’re quite old. Please use the ones from https://github.com/ROCm/TheRock/blob/main/RELEASES.md#torch-for-gfx1151 instead.

@jammm commented Oct 17, 2025

@jon-hotaisle can you try my branch? https://github.com/jammm/nanochat/tree/rocm-support

The one you used from master...indianspeedster:nanochat:master has a hard-coded export HSA_OVERRIDE_GFX_VERSION=9.4.2, which is why it's not running on your MI355x. My branch removes that line from speedrun.sh altogether.

Make sure to do a fresh clone and also delete the ~/.cargo folder.

Also, can you confirm which ROCm version you have installed? If you're on 7.0, there's a good chance it won't work, because the changes here use 6.3/6.4 (my branch uses 6.4); it'll need a minor tweak to pyproject.toml to use the nightly wheels that support ROCm 7.

@jon-hotaisle

@jammm Thanks, yea, your branch doesn't work for me. I'm on MI355x, which pretty much requires ROCm 7 for everything.

There are a few things to fix:

  1. Auto-source the env.
  2. Make sure to install unzip.
  3. It seems it is installing the NVIDIA Python stuff, so I just hit Ctrl-C.

I'm trying to run in a recent container:

docker.io/rocm/pytorch:rocm7.0.2_ubuntu24.04_py3.12_pytorch_release_2.8.0

Since I'm in a container, I don't need to remove the ~/.cargo as it isn't kept between runs.

This is how I start the container, keeping the ~/.cache is definitely helpful.

podman run -it --privileged --network=host --ipc=host -v $HOME:/workdir -v \
/shareddata:/shareddata -v /shared/apps:/shared/apps -v $HOME/.cache:/root/.cache \
--workdir /workdir \
docker.io/rocm/pytorch:rocm7.0.2_ubuntu24.04_py3.12_pytorch_release_2.8.0 bash
[screenshot: 2025-10-17 at 04:21:05]

@jammm commented Oct 17, 2025

@jon-hotaisle I see. Can you try changing this line https://github.com/jammm/nanochat/blob/rocm-support/pyproject.toml#L37

url = "https://download.pytorch.org/whl/rocm6.4"

to
url = "https://download.pytorch.org/whl/nightly/rocm7.0"

Then delete the `.venv` folder in nanochat and try `speedrun.sh` again?

@jon-hotaisle

Ok, I screwed up, I wasn't actually on your branch when I cloned your repo. That was stupid. =)

Now it installs torch, but it should already have torch in my container.

But... best news... it looks like it is running now!

[screenshot: 2025-10-17 at 05:16:12]

@jon-hotaisle

Only one GPU is being used!?

[screenshot: 2025-10-17 at 05:37:00]

@jammm commented Oct 17, 2025

Only one GPU is being used!?

That's because of the changes in speedrun.sh, which replaced the torchrun --nproc-per-node=8 with just python to make it compatible with single consumer GPU setups. https://github.com/karpathy/nanochat/pull/23/files#diff-f01ca501612c5f260e12ac1171d6705e6887825cba06360c99429531072ef130

You could undo those changes to get it working with all 8 GPUs. I did that on the mi300x. Ideally there should be a flag to toggle this.
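
One way such a toggle could look, sketched as a small hypothetical Python launcher rather than an edit to speedrun.sh (the scripts.base_train module and --depth flag are taken from the thread; everything else is illustrative):

```python
import subprocess
import sys

import torch

# Launch the trainer under torchrun when several GPUs are visible,
# and as a plain single-process run otherwise.
ngpus = torch.cuda.device_count()  # counts ROCm (HIP) devices as well
train_args = ["-m", "scripts.base_train", "--", "--depth=20"]

if ngpus > 1:
    cmd = ["torchrun", "--standalone", f"--nproc_per_node={ngpus}"] + train_args
else:
    cmd = [sys.executable] + train_args

subprocess.run(cmd, check=True)
```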

@jon-hotaisle

8x cooking now!

[screenshot: 2025-10-17 at 06:12:20]

@jammm commented Oct 17, 2025

8x cooking now!

Awesome. Once that's done (takes a couple hours or so), you can `source .venv/bin/activate` then run `python -m scripts.chat_web` to run the web UI.

@Qubitium commented Oct 22, 2025

@jon-hotaisle Your VRAM usage is too low: ~84GB out of 290GB. You need to tweak values so it reaches max VRAM without OOM; otherwise you can't get close to max MFU. The easiest change is to switch the batch size from 32 (the nanochat default) to 64 or 128. If 64 works but 128 OOMs while VRAM is still not full, you can then try increasing the per-batch token count.

@jon-hotaisle

@Qubitium Helpful feedback, thank you! The first goal was just to get it running; now we can focus on tuning. Should this be something that is added to the PR? Have it set sane defaults for different architectures too?

@svlandeg (Collaborator) left a comment

Hi @LokiMetaSmith! It looks like a lot of the edits from this PR have already been addressed by #88. There are quite a few merge conflicts as well. I suggest closing this one and opening new PR(s) for any remaining issues you encounter with the latest version of master.

PS: discussions about different setups & recommendations etc. are nice, and it's good to see how this is being run on different platforms & setups. Please use the forum to continue such conversations!


```diff
 # pretrain the d20 model
-torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- --depth=20 --run=$WANDB_RUN
+python -m scripts.base_train -- --depth=20 --run=$WANDB_RUN
```
Collaborator

I don't think we'll want this edit either way; the assumption is that this will usually be run on more than one GPU, and those that have only a single GPU can just edit this on their own fork.

@svlandeg closed this Nov 14, 2025
@LokiMetaSmith deleted the rocm-support branch December 11, 2025 20:17