Disable mmap on Strix Halo to avoid OOM #44756

@woct0rdho

Description

System Info

  • transformers version: 5.3.0
  • Platform: Linux-6.19.0-9-generic-x86_64-with-glibc2.42
  • Python version: 3.13.12
  • Huggingface_hub version: 1.7.1
  • Safetensors version: 0.7.0
  • Accelerate version: 1.13.0
  • Accelerate config: not found
  • DeepSpeed version: not installed
  • PyTorch version (accelerator?): 2.11.0a0+rocm7.11.0a20260106 (CUDA)
  • Using distributed or parallel set-up in script?: No
  • Using GPU in script?: Yes
  • GPU type: Radeon 8060S Graphics

The Strix Halo APU has 128 GB of unified memory. I've configured 125 GB of TTM memory and a 1 GB UMA buffer.

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

When I run

model = AutoModel.from_pretrained("Qwen/Qwen3.5-35B-A3B")

it fails with an error like:

  ...
  File "/home/wd/venv_torch/lib/python3.13/site-packages/transformers/models/auto/auto_factory.py", line 374, in from_pretrained
    return model_class.from_pretrained(
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        pretrained_model_name_or_path, *model_args, config=config, **hub_kwargs, **kwargs
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/home/wd/venv_torch/lib/python3.13/site-packages/transformers/modeling_utils.py", line 4137, in from_pretrained
    loading_info, disk_offload_index = cls._load_pretrained_model(model, state_dict, checkpoint_files, load_config)
                                       ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/wd/venv_torch/lib/python3.13/site-packages/transformers/modeling_utils.py", line 4256, in _load_pretrained_model
    loading_info, disk_offload_index = convert_and_load_state_dict_in_model(
                                       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        model=model,
        ^^^^^^^^^^^^
    ...<3 lines>...
        disk_offload_index=disk_offload_index,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/home/wd/venv_torch/lib/python3.13/site-packages/transformers/core_model_loading.py", line 1212, in convert_and_load_state_dict_in_model
    realized_value = mapping.convert(
        first_param_name,
    ...<3 lines>...
        loading_info=loading_info,
    )
  File "/home/wd/venv_torch/lib/python3.13/site-packages/transformers/core_model_loading.py", line 678, in convert
    collected_tensors = self.materialize_tensors()
  File "/home/wd/venv_torch/lib/python3.13/site-packages/transformers/core_model_loading.py", line 657, in materialize_tensors
    tensors = [func() for func in tensors]
               ~~~~^^
  File "/home/wd/venv_torch/lib/python3.13/site-packages/transformers/core_model_loading.py", line 800, in _job
    return _materialize_copy(tensor, device, dtype)
  File "/home/wd/venv_torch/lib/python3.13/site-packages/transformers/core_model_loading.py", line 789, in _materialize_copy
    tensor = tensor.to(device=device, dtype=dtype)
torch.OutOfMemoryError: HIP out of memory. Tried to allocate 2.00 MiB. GPU 0 has a total capacity of 125.00 GiB of which 24.00 KiB is free. Of the allocated memory 58.06 GiB is allocated by PyTorch, and 4.94 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
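For reference, the allocator hint quoted in the error message can be set as below. It targets allocator fragmentation only, so it may not help with the mmap-related OOM described in this issue (the script name is a placeholder):

export PYTORCH_ALLOC_CONF=expandable_segments:True
python your_script.py  # hypothetical script that calls from_pretrained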

Expected behavior

The Qwen3.5 35B fp16 model takes about 67 GB of memory. With 125 GB of TTM memory it should not OOM.

As far as I know, safetensors loads tensors via mmap, and mmap currently does not work well on Strix Halo. A proper fix would belong in the amdgpu driver. For now I use the following trick to work around the issue:

import torch
from transformers import AutoModel

def patch_transformers_disable_mmap():
    import transformers.core_model_loading as _cml

    def _materialize_copy_no_mmap(t, device=None, dtype=None):
        # The indexing materializes the tensor from safetensors.
        t = t[...]

        # If safetensors returned an mmapped tensor on CPU,
        # force a copy on CPU before any device/dtype conversion.
        if isinstance(t, torch.Tensor) and t.device.type == "cpu":
            t = t.to(device="cpu", copy=True)

        if dtype is not None or device is not None:
            t = t.to(device=device, dtype=dtype)
        return t

    assert hasattr(_cml, "_materialize_copy")
    _cml._materialize_copy = _materialize_copy_no_mmap
    print("Patched transformers to disable mmap.")

def main():
    patch_transformers_disable_mmap()
    model = AutoModel.from_pretrained("Qwen/Qwen3.5-35B-A3B")
    ...

This is inspired by how ComfyUI disables mmap:
https://github.com/Comfy-Org/ComfyUI/blob/593be209a45a8a306c26de550e240a363de405a7/comfy/utils.py#L137
Disabling mmap is already known to improve model loading speed in ComfyUI on Strix Halo.
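The distinction the patch relies on can be illustrated with the standard library alone (no torch or safetensors needed): a memoryview over an mmapped file stays backed by file pages, while an explicit copy lives in anonymous process memory and survives unmapping. This is a minimal sketch; the file is a stand-in for a checkpoint shard:

```python
import mmap
import os
import tempfile

# Create a small file to map (stand-in for a safetensors shard).
fd, path = tempfile.mkstemp()
os.write(fd, b"\x01" * 4096)
os.close(fd)

with open(path, "rb") as f:
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    view = memoryview(mapped)   # file-backed, like an mmapped tensor
    copied = bytes(view)        # explicit copy into regular memory
    view.release()              # release the view before closing the mapping
    mapped.close()

os.unlink(path)
print(len(copied), copied[0])  # the copy remains valid: 4096 1
```

In the patch above, `t.to(device="cpu", copy=True)` plays the role of `bytes(view)`: it forces the data out of the file-backed mapping before any device transfer.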
