Disable mmap on Strix Halo to avoid OOM #44756

@woct0rdho

Description

System Info

  • transformers version: 5.3.0
  • Platform: Linux-6.19.0-9-generic-x86_64-with-glibc2.42
  • Python version: 3.13.12
  • Huggingface_hub version: 1.7.1
  • Safetensors version: 0.7.0
  • Accelerate version: 1.13.0
  • Accelerate config: not found
  • DeepSpeed version: not installed
  • PyTorch version (accelerator?): 2.11.0a0+rocm7.11.0a20260106 (CUDA)
  • Using distributed or parallel set-up in script?: No
  • Using GPU in script?: Yes
  • GPU type: Radeon 8060S Graphics

The Strix Halo APU has 128 GB of unified memory. I've configured 125 GB of TTM memory and a 1 GB UMA buffer.

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

When I run

model = AutoModel.from_pretrained("Qwen/Qwen3.5-35B-A3B")

it fails with an error like:

  ...
  File "/home/wd/venv_torch/lib/python3.13/site-packages/transformers/models/auto/auto_factory.py", line 374, in from_pretrained
    return model_class.from_pretrained(
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        pretrained_model_name_or_path, *model_args, config=config, **hub_kwargs, **kwargs
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/home/wd/venv_torch/lib/python3.13/site-packages/transformers/modeling_utils.py", line 4137, in from_pretrained
    loading_info, disk_offload_index = cls._load_pretrained_model(model, state_dict, checkpoint_files, load_config)
                                       ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/wd/venv_torch/lib/python3.13/site-packages/transformers/modeling_utils.py", line 4256, in _load_pretrained_model
    loading_info, disk_offload_index = convert_and_load_state_dict_in_model(
                                       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        model=model,
        ^^^^^^^^^^^^
    ...<3 lines>...
        disk_offload_index=disk_offload_index,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/home/wd/venv_torch/lib/python3.13/site-packages/transformers/core_model_loading.py", line 1212, in convert_and_load_state_dict_in_model
    realized_value = mapping.convert(
        first_param_name,
    ...<3 lines>...
        loading_info=loading_info,
    )
  File "/home/wd/venv_torch/lib/python3.13/site-packages/transformers/core_model_loading.py", line 678, in convert
    collected_tensors = self.materialize_tensors()
  File "/home/wd/venv_torch/lib/python3.13/site-packages/transformers/core_model_loading.py", line 657, in materialize_tensors
    tensors = [func() for func in tensors]
               ~~~~^^
  File "/home/wd/venv_torch/lib/python3.13/site-packages/transformers/core_model_loading.py", line 800, in _job
    return _materialize_copy(tensor, device, dtype)
  File "/home/wd/venv_torch/lib/python3.13/site-packages/transformers/core_model_loading.py", line 789, in _materialize_copy
    tensor = tensor.to(device=device, dtype=dtype)
torch.OutOfMemoryError: HIP out of memory. Tried to allocate 2.00 MiB. GPU 0 has a total capacity of 125.00 GiB of which 24.00 KiB is free. Of the allocated memory 58.06 GiB is allocated by PyTorch, and 4.94 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
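For reference, the allocator hint quoted in the error message can be set as below. It targets allocator fragmentation only, so it may not help with the mmap-related OOM described in this issue (the script name is a placeholder):

export PYTORCH_ALLOC_CONF=expandable_segments:True
python your_script.py  # hypothetical script that calls from_pretrained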

Expected behavior

The Qwen3.5 35B fp16 model takes about 67 GB of memory. With 125 GB of TTM memory it should not OOM.

As far as I know, safetensors loads tensors via mmap, and mmap currently does not work well on Strix Halo. A proper fix would belong in the amdgpu driver. For now I use the following trick to work around the issue:

import torch
from transformers import AutoModel

def patch_transformers_disable_mmap():
    import transformers.core_model_loading as _cml

    def _materialize_copy_no_mmap(t, device=None, dtype=None):
        # The indexing materializes the tensor from safetensors.
        t = t[...]

        # If safetensors returned an mmapped tensor on CPU,
        # force a copy on CPU before any device/dtype conversion.
        if isinstance(t, torch.Tensor) and t.device.type == "cpu":
            t = t.to(device="cpu", copy=True)

        if dtype is not None or device is not None:
            t = t.to(device=device, dtype=dtype)
        return t

    assert hasattr(_cml, "_materialize_copy")
    _cml._materialize_copy = _materialize_copy_no_mmap
    print("Patched transformers to disable mmap.")

def main():
    patch_transformers_disable_mmap()
    model = AutoModel.from_pretrained("Qwen/Qwen3.5-35B-A3B")
    ...

This is inspired by how ComfyUI disables mmap:
https://github.com/Comfy-Org/ComfyUI/blob/593be209a45a8a306c26de550e240a363de405a7/comfy/utils.py#L137
Disabling mmap is already known to improve model loading speed in ComfyUI on Strix Halo.
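The distinction the patch relies on can be illustrated with the standard library alone (no torch or safetensors needed): a memoryview over an mmapped file stays backed by file pages, while an explicit copy lives in anonymous process memory and survives unmapping. This is a minimal sketch; the file is a stand-in for a checkpoint shard:

```python
import mmap
import os
import tempfile

# Create a small file to map (stand-in for a safetensors shard).
fd, path = tempfile.mkstemp()
os.write(fd, b"\x01" * 4096)
os.close(fd)

with open(path, "rb") as f:
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    view = memoryview(mapped)   # file-backed, like an mmapped tensor
    copied = bytes(view)        # explicit copy into regular memory
    view.release()              # release the view before closing the mapping
    mapped.close()

os.unlink(path)
print(len(copied), copied[0])  # the copy remains valid: 4096 1
```

In the patch above, `t.to(device="cpu", copy=True)` plays the role of `bytes(view)`: it forces the data out of the file-backed mapping before any device transfer.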
