
convert : parse safetensors directly#15667

Merged
compilade merged 3 commits intomasterfrom
compilade/convert-safetensors-parse
Nov 9, 2025

Conversation

@compilade
Collaborator

@compilade compilade commented Aug 29, 2025

Should fix #15623
(originally targeted #14810, but was rebased)

This replaces the approach from #8482 to avoid using get_slice, because it turns out get_slice eagerly memmaps tensors, which on Windows uses a lot of memory, and on Linux inflates the resident set size.

Safetensors files are now parsed directly, since the format is simple enough. This will also eventually allow tracking the file ranges of tensors, to maybe use os.copy_file_range when possible and make conversion on copy-on-write (COW) filesystems very fast (in #15727).
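As a rough illustration of that idea (a sketch of mine, not the code in #15727; `copy_range` is a hypothetical helper), copying a tensor's byte range file-to-file could look like:

```python
import os

def copy_range(src_fd: int, dst_fd: int, count: int, src_off: int, dst_off: int) -> None:
    """Copy `count` bytes from src_off in src_fd to dst_off in dst_fd."""
    try:
        # Linux-only fast path: in-kernel copy, which CoW filesystems
        # (Btrfs, XFS) can satisfy by sharing extents instead of copying.
        remaining, so, do = count, src_off, dst_off
        while remaining > 0:
            n = os.copy_file_range(src_fd, dst_fd, remaining, so, do)
            if n == 0:
                break
            so, do, remaining = so + n, do + n, remaining - n
        if remaining == 0:
            return
    except (AttributeError, OSError):
        pass  # not Linux, or the filesystem refused; fall through
    # Portable fallback: positional read/write of the whole range.
    os.pwrite(dst_fd, os.pread(src_fd, count, src_off), dst_off)
```

Because the destination offsets in a GGUF file are known ahead of time, a converter that tracks source ranges could hand each (source offset, destination offset, length) triple to a helper like this and skip the userspace copy entirely on supporting filesystems.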

On Linux, when using memray (a memory profiler), this change reduces the peak heap memory usage by quite a lot, and with GNU time, it also reduces the peak resident set size memory usage.

When observed with memray, the previous behavior seems to be that safe_open puts the whole model on the heap (likely memmapped, since the resident set size stays smaller and grows gradually). The new behavior under memray is closer to what I expected in the first place: memory usage bumps up for each processed tensor, then drops back down between tensors.

Here's a table of the "Maximum resident set size (kbytes)" from time -v (when using GNU time) on a few models:

$ $(which time) -v python3 convert_hf_to_gguf.py /path/to/model_dir --outfile /path/to/model.gguf --outtype f16

| Model | Target type | master (kbytes) | This PR (kbytes) |
| --- | --- | ---: | ---: |
| https://huggingface.co/mistralai/Mistral-7B-v0.1 | F16 | 10 334 248 | 1 129 248 |
| https://huggingface.co/meta-llama/Llama-3.2-1B | F16 | 3 023 112 | 2 104 256 |
| https://huggingface.co/Qwen/Qwen3-Coder-30B-A3B-Instruct | F16 | 9 165 048 | 2 680 124 |
Safetensors are already directly parsed since #12820 for remote models. This is similar, but for local models.
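For reference, the core of direct parsing is small. A minimal sketch (my own, not the PR's actual implementation; `read_safetensors_meta` is a hypothetical helper) of reading a safetensors header:

```python
import json
import struct

def read_safetensors_meta(path: str) -> dict:
    """Return {name: (dtype, shape, abs_begin, abs_end)} for each tensor."""
    with open(path, "rb") as f:
        # The file starts with a little-endian u64 giving the JSON header
        # length; the tensor data section follows the header immediately.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    data_start = 8 + header_len
    tensors = {}
    for name, meta in header.items():
        if name == "__metadata__":  # optional free-form metadata entry
            continue
        begin, end = meta["data_offsets"]  # relative to data_start
        tensors[name] = (meta["dtype"], meta["shape"], data_start + begin, data_start + end)
    return tensors
```

With the absolute byte ranges in hand, each tensor can be read lazily with a seek + read (or a short-lived mmap) instead of keeping the whole file resident.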


TODO:

  • Handle byteswapping on big-endian platforms?
    • The safetensors library automatically byteswaps when running on a big-endian platform (since the format is always little-endian), but GGUFWriter byteswaps unconditionally when the target endianness is big, so this never really worked anyway (double-byteswapping in that case would produce little-endian tensors...), unless I'm misunderstanding something.
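The endianness concern above can be illustrated with NumPy (my own sketch, not llama.cpp code): safetensors payloads are always little-endian, so reading with an explicit `<` dtype is already correct on any host, and swapping twice just round-trips back to little-endian:

```python
import numpy as np

# On-disk safetensors data is always little-endian.
le_bytes = np.array([1.0, 2.0], dtype="<f4").tobytes()

# Reading with an explicit little-endian dtype gives correct values on
# any host, big- or little-endian; no manual byteswap is needed.
t = np.frombuffer(le_bytes, dtype="<f4")

# The double-byteswap scenario: a loader swapping to native order on a
# big-endian host, then GGUFWriter swapping again for a big-endian
# target -- the two swaps cancel and the bytes end up little-endian:
assert t.byteswap().byteswap().tobytes() == le_bytes
```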

Make sure to read the contributing guidelines before submitting a PR

@LostRuins
Collaborator

I can confirm that this helped me convert glm 4.5 air, whereas current main fails.

@compilade compilade force-pushed the compilade/convert-safetensors-parse branch from 786b32d to e582f1a Compare September 9, 2025 18:49
@LostRuins
Collaborator

Is there anything preventing this PR from being taken out of draft and merged?

It's impossible for me to convert GLM Air reliably without this PR so I think it's quite useful to have.

@whatever1983

This comment was marked as off-topic.

@LostRuins
Collaborator

@whatever1983 this has nothing to do with fp8 conversion. This is simply a more memory efficient way of performing the GGUF convert that prevents OOMs/crashing during the conversion process, which I need in order to convert GLM Air.

As for politics I can't advise on that. I just want to successfully convert my models hence me bumping the issue.

@compilade compilade force-pushed the compilade/convert-safetensors-parse branch from e582f1a to e996f3a Compare November 7, 2025 03:39
@compilade compilade changed the base branch from compilade/convert-prequant to master November 7, 2025 03:39
@compilade compilade marked this pull request as ready for review November 7, 2025 04:11
@compilade compilade requested a review from CISC as a code owner November 7, 2025 04:11
@compilade compilade merged commit 802cef4 into master Nov 9, 2025
5 checks passed
@fairydreaming
Collaborator

@compilade I have a question regarding this change: why do you brute-force alignment of the safetensors file byte buffer with this code:

```python
alignment = self.ALIGNMENT
if data_start_offset % alignment != 0:
    data_start_offset += alignment - (data_start_offset % alignment)
```

I see another occurrence of this here (added later):

```python
alignment = SafetensorRemote.ALIGNMENT
if data_start_offset % alignment != 0:
    data_start_offset += alignment - (data_start_offset % alignment)
```

I'm asking because I can't find any information indicating that safetensors file format requires the tensor byte buffer to be aligned to 8 bytes.

Moreover, the current code causes DeepSeek models conversion failures that manifest like this:

```
$ python convert_hf_to_gguf.py /mnt/md0/huggingface/hub/models--deepseek-ai--DeepSeek-V3.1-Terminus/snapshots/19510d6dc61f79dbd925bd51ee8a9081c509a4b6/ --outfile /dev/null --outtype q8_0
...
INFO:gguf.gguf_writer:Writing the following files:
INFO:gguf.gguf_writer:/dev/null: n_tensors = 1086, total_size = 713.3G
Writing:   0%|                                        | 0.00/713G [00:00<?, ?byte/s]
/home/phm/projects/llama.cpp-ds32/gguf-py/gguf/quants.py:383: RuntimeWarning: invalid value encountered in divide
  d = abs(blocks).max(axis=1, keepdims=True) / 127
/home/phm/projects/llama.cpp-ds32/gguf-py/gguf/quants.py:386: RuntimeWarning: invalid value encountered in multiply
  qs = np_roundf(blocks * id)
/home/phm/projects/llama.cpp-ds32/gguf-py/gguf/quants.py:389: RuntimeWarning: overflow encountered in cast
  d = d.astype(np.float16).view(np.uint8)
/home/phm/projects/llama.cpp-ds32/gguf-py/gguf/quants.py:391: RuntimeWarning: invalid value encountered in cast
  qs = qs.astype(np.int8).view(np.uint8)
```

On the other hand I find it hard to believe that nobody noticed this yet, so just wanted to make sure.

@CISC
Collaborator

CISC commented Dec 21, 2025

I'm asking because I can't find any information indicating that safetensors file format requires the tensor byte buffer to be aligned to 8 bytes.

https://github.com/huggingface/safetensors/blob/987b7380f85db6f92214d9ed0f0d36ef0e71d5f6/safetensors/src/tensor.rs#L256-L258

@CISC
Collaborator

CISC commented Dec 21, 2025

```
INFO:gguf.gguf_writer:Writing the following files:
INFO:gguf.gguf_writer:/dev/null: n_tensors = 1086, total_size = 713.3G
Writing:   0%|                                        | 0.00/713G [00:00<?, ?byte/s]
/home/phm/projects/llama.cpp-ds32/gguf-py/gguf/quants.py:383: RuntimeWarning: invalid value encountered in divide
  d = abs(blocks).max(axis=1, keepdims=True) / 127
/home/phm/projects/llama.cpp-ds32/gguf-py/gguf/quants.py:386: RuntimeWarning: invalid value encountered in multiply
  qs = np_roundf(blocks * id)
/home/phm/projects/llama.cpp-ds32/gguf-py/gguf/quants.py:389: RuntimeWarning: overflow encountered in cast
  d = d.astype(np.float16).view(np.uint8)
/home/phm/projects/llama.cpp-ds32/gguf-py/gguf/quants.py:391: RuntimeWarning: invalid value encountered in cast
  qs = qs.astype(np.int8).view(np.uint8)
```

Either something went wrong in FP8 dequant or the download is corrupt?

@fairydreaming
Collaborator

I'm asking because I can't find any information indicating that safetensors file format requires the tensor byte buffer to be aligned to 8 bytes.

https://github.com/huggingface/safetensors/blob/987b7380f85db6f92214d9ed0f0d36ef0e71d5f6/safetensors/src/tensor.rs#L256-L258

@CISC this is serialization code; I agree that it's a good idea to pad the safetensors header metadata to 8 bytes so the byte buffer is aligned when you CREATE safetensors files. But I can't find any info that it's a file-format requirement. On the contrary, there are reports of safetensors being inefficient due to possible alignment issues, for example here: https://arxiv.org/html/2505.23072v1

The safetensors format does not define data alignment and can cause alignment errors due to constraints of both GDS and CUDA kernels.

@fairydreaming
Collaborator

@CISC Also found this discussion where one of the HF engineers mentions that alignment is not required: huggingface/safetensors#254

Because alignment is not required. Alignment speeds things up that's it.
Files already exist without alignment being forced.
Also if alignment needs to be done differently based on different machines it's going to become tricky.

It's kind of old, but perhaps still relevant.

Also: https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B/discussions/18

Looks like DeepSeek has a history of creating unaligned safetensor files.

@CISC
Collaborator

CISC commented Dec 21, 2025

@CISC this is a serialization code, I agree that it's a good idea to pad safetensors header metadata to 8 bytes to make byte buffer aligned when you CREATE safetensor files. But I can't find any info that it's a file format requirement.

May not be a requirement, but as long as that code has been there (I don't know if it always was), 64-bit alignment was guaranteed (as long as you use this library, of course). Though it does store the aligned length, so doing the extra alignment on the reader side doesn't make sense.
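To illustrate the point (my own sketch, not code from either project): the stored u64 header length already includes any space padding the writer appended, so `8 + header_len` is the true data start; rounding it up again is a no-op for padded files, but for unpadded files it would point past the start of the real tensor data:

```python
import struct

def data_start(blob: bytes) -> int:
    # The stored header length already counts any writer-side padding,
    # so no extra rounding is needed to find the data section.
    (header_len,) = struct.unpack_from("<Q", blob, 0)
    return 8 + header_len

hdr = b'{"t":{"dtype":"F32","shape":[1],"data_offsets":[0,4]}}'  # 55 bytes
payload = struct.pack("<f", 1.5)

# Writer that pads the header with spaces (valid JSON whitespace) to 8 bytes:
padded = hdr + b" " * (-len(hdr) % 8)
aligned_file = struct.pack("<Q", len(padded)) + padded + payload

# Writer that doesn't pad -- still a valid file per the format:
unaligned_file = struct.pack("<Q", len(hdr)) + hdr + payload

assert data_start(aligned_file) % 8 == 0   # rounding up is a no-op here
assert data_start(unaligned_file) % 8 != 0 # rounding up would skip real data
```

Reading the payload at the rounded-up offset in the second file would yield garbage bytes, which matches the invalid-value warnings seen during the DeepSeek conversion.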

@CISC
Collaborator

CISC commented Dec 21, 2025

@fairydreaming If you could make a PR it would be great, seeing as you're the only one testing DeepSeek models apparently. :)

@fairydreaming
Collaborator

I found when header alignment was introduced in safetensors:

commit 0c5b3a6b9c9be653b91ce8355225b8cad074f8b0
Author: Nicolas Patry <patry.nicolas@protonmail.com>
Date:   Tue Feb 21 15:23:56 2023 +0100

    [Major Change] Enforcing tensor alignment (#148)
    
    * [Major Change] Enforcing tensor alignment
    
    - Now the header will automatically align itself to 8 bytes (f64) with
      appending extra spaces as necessary.
    - This will allow extra fast memory mapping by reinterpreting bytes as
      f32/f64 etc.. Unaligned bytes do not allow for this. https://www.reddit.com/r/rust/comments/tanaxm/mutating_a_buffer_of_u8s_as_f32s_in_place/
    - This does not change contiguousness of tensors
    - This does not change the actual spec (we're just putting extra valid bytes
      in the header and using a different serialization ordering)
    - Readers should still be able to read old files, they would just need
      to be copied before being cast as their final destination when using
      mmap
    - This has no effect for GPU since copy is already necessary (*I think*,
      depends on the cuda API actually if it allows filling f32 addresses
      from raw unaligned bytes).
    
    This change will only be interesting if things like https://github.com/Narsil/fast_gpt2
    actually pick up. And even with the copy, load times are still vastly
    superior to `pytorch`.
    
    We need to be able to read old files.

So it looks like unaligned files are valid. Readers have to copy the data to fix alignment issues (if needed).
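In NumPy terms (my own sketch): an unaligned view over such a file is still readable, and copying it produces an aligned array, which is exactly the "copy before casting" the commit message describes:

```python
import numpy as np

# Simulate tensor bytes starting at an odd offset within a buffer, as in
# a safetensors file whose header wasn't padded to 8 bytes:
raw = b"\x00" + np.arange(4, dtype="<f4").tobytes()

# frombuffer tolerates the misalignment; it just yields an unaligned view.
t = np.frombuffer(raw, dtype="<f4", offset=1, count=4)

# Copying is what restores alignment, as the commit message suggests
# readers of old/unpadded files should do before casting mmapped data.
fixed = t.copy()
assert fixed.flags.aligned and fixed.tolist() == [0.0, 1.0, 2.0, 3.0]
```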

Anico2 added a commit to Anico2/llama.cpp that referenced this pull request Jan 15, 2026
* convert : parse safetensors directly

* gguf-py : order safetensors tensors by name

Applies to both local and remote safetensors custom parsing.
This matches the behavior of the official safetensors implementation.

* convert : rename from_safetensors_meta to from_local_tensor

For consistency with from_remote_tensor

* convert : fix no-lazy dtypes from direct safetensors
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

python python script changes


Development

Successfully merging this pull request may close these issues.

Misc. bug: convert_hf_to_gguf.py runs out of memory

5 participants