
convert : parse safetensors directly#15667

Merged
compilade merged 3 commits intomasterfrom
compilade/convert-safetensors-parse
Nov 9, 2025

Conversation

@compilade
Collaborator

@compilade compilade commented Aug 29, 2025

Should fix #15623
(originally targeted #14810, but was rebased)

This replaces the approach from #8482 to avoid using get_slice, because it turns out get_slice eagerly memmaps tensors, which on Windows uses a lot of memory, and on Linux inflates the resident set size.

Safetensors files are now parsed directly, since the format is simple enough. This will also eventually allow tracking the file ranges of tensors, to maybe use os.copy_file_range when possible and make conversion on copy-on-write (COW) filesystems very fast (in #15727).
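As a rough illustration of that idea (a sketch of mine, not the code in #15727; `copy_range` is a hypothetical helper), copying a tensor's byte range file-to-file could look like:

```python
import os

def copy_range(src_fd: int, dst_fd: int, count: int, src_off: int, dst_off: int) -> None:
    """Copy `count` bytes from src_off in src_fd to dst_off in dst_fd."""
    try:
        # Linux-only fast path: in-kernel copy, which CoW filesystems
        # (Btrfs, XFS) can satisfy by sharing extents instead of copying.
        remaining, so, do = count, src_off, dst_off
        while remaining > 0:
            n = os.copy_file_range(src_fd, dst_fd, remaining, so, do)
            if n == 0:
                break
            so, do, remaining = so + n, do + n, remaining - n
        if remaining == 0:
            return
    except (AttributeError, OSError):
        pass  # not Linux, or the filesystem refused; fall through
    # Portable fallback: positional read/write of the whole range.
    os.pwrite(dst_fd, os.pread(src_fd, count, src_off), dst_off)
```

Because the destination offsets in a GGUF file are known ahead of time, a converter that tracks source ranges could hand each (source offset, destination offset, length) triple to a helper like this and skip the userspace copy entirely on supporting filesystems.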

On Linux, when using memray (a memory profiler), this change reduces the peak heap memory usage by quite a lot, and with GNU time, it also reduces the peak resident set size memory usage.

When observed with memray, the previous behavior seems to be that safe_open puts the whole model on the heap (likely memmapped, since the resident set size stays smaller and grows gradually). The new behavior under memray is closer to what I expected in the first place: memory usage bumps up for each processed tensor, then drops back down between tensors.

Here's a table of the "Maximum resident set size (kbytes)" from time -v (when using GNU time) on a few models:

$ $(which time) -v python3 convert_hf_to_gguf.py /path/to/model_dir --outfile /path/to/model.gguf --outtype f16

| Model | Target type | master (kbytes) | This PR (kbytes) |
| --- | --- | ---: | ---: |
| https://huggingface.co/mistralai/Mistral-7B-v0.1 | F16 | 10 334 248 | 1 129 248 |
| https://huggingface.co/meta-llama/Llama-3.2-1B | F16 | 3 023 112 | 2 104 256 |
| https://huggingface.co/Qwen/Qwen3-Coder-30B-A3B-Instruct | F16 | 9 165 048 | 2 680 124 |
Safetensors are already directly parsed since #12820 for remote models. This is similar, but for local models.
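For reference, the core of direct parsing is small. A minimal sketch (my own, not the PR's actual implementation; `read_safetensors_meta` is a hypothetical helper) of reading a safetensors header:

```python
import json
import struct

def read_safetensors_meta(path: str) -> dict:
    """Return {name: (dtype, shape, abs_begin, abs_end)} for each tensor."""
    with open(path, "rb") as f:
        # The file starts with a little-endian u64 giving the JSON header
        # length; the tensor data section follows the header immediately.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    data_start = 8 + header_len
    tensors = {}
    for name, meta in header.items():
        if name == "__metadata__":  # optional free-form metadata entry
            continue
        begin, end = meta["data_offsets"]  # relative to data_start
        tensors[name] = (meta["dtype"], meta["shape"], data_start + begin, data_start + end)
    return tensors
```

With the absolute byte ranges in hand, each tensor can be read lazily with a seek + read (or a short-lived mmap) instead of keeping the whole file resident.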


TODO:

  • Handle byteswapping on big-endian platforms?
    • The safetensors library automatically byteswaps when running on a big-endian platform (since the format is always little-endian), but GGUFWriter byteswaps unconditionally when the target endianness is big, so this never really worked anyway (double-byteswapping in that case would produce little-endian tensors...), unless I'm misunderstanding something.
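The endianness concern above can be illustrated with NumPy (my own sketch, not llama.cpp code): safetensors payloads are always little-endian, so reading with an explicit `<` dtype is already correct on any host, and swapping twice just round-trips back to little-endian:

```python
import numpy as np

# On-disk safetensors data is always little-endian.
le_bytes = np.array([1.0, 2.0], dtype="<f4").tobytes()

# Reading with an explicit little-endian dtype gives correct values on
# any host, big- or little-endian; no manual byteswap is needed.
t = np.frombuffer(le_bytes, dtype="<f4")

# The double-byteswap scenario: a loader swapping to native order on a
# big-endian host, then GGUFWriter swapping again for a big-endian
# target -- the two swaps cancel and the bytes end up little-endian:
assert t.byteswap().byteswap().tobytes() == le_bytes
```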

Make sure to read the contributing guidelines before submitting a PR

@LostRuins
Collaborator

I can confirm that this helped me convert glm 4.5 air, whereas current main fails.

@compilade compilade force-pushed the compilade/convert-safetensors-parse branch from 786b32d to e582f1a Compare September 9, 2025 18:49
@LostRuins
Collaborator

Is there anything preventing this PR from being taken out of draft and merged?

It's impossible for me to convert GLM Air reliably without this PR so I think it's quite useful to have.

@whatever1983

This comment was marked as off-topic.

@LostRuins
Collaborator

@whatever1983 this has nothing to do with fp8 conversion. This is simply a more memory efficient way of performing the GGUF convert that prevents OOMs/crashing during the conversion process, which I need in order to convert GLM Air.

As for politics I can't advise on that. I just want to successfully convert my models hence me bumping the issue.

@compilade compilade force-pushed the compilade/convert-safetensors-parse branch from e582f1a to e996f3a Compare November 7, 2025 03:39
@compilade compilade changed the base branch from compilade/convert-prequant to master November 7, 2025 03:39
@compilade compilade marked this pull request as ready for review November 7, 2025 04:11
@compilade compilade requested a review from CISC as a code owner November 7, 2025 04:11
@compilade compilade merged commit 802cef4 into master Nov 9, 2025
5 checks passed
@fairydreaming
Collaborator

@compilade I have a question regarding this change: why do you brute-force alignment of the safetensors file byte buffer with this code:

```python
alignment = self.ALIGNMENT
if data_start_offset % alignment != 0:
    data_start_offset += alignment - (data_start_offset % alignment)
```

I see another occurrence of this here (added later):

```python
alignment = SafetensorRemote.ALIGNMENT
if data_start_offset % alignment != 0:
    data_start_offset += alignment - (data_start_offset % alignment)
```

I'm asking because I can't find any information indicating that safetensors file format requires the tensor byte buffer to be aligned to 8 bytes.

Moreover, the current code causes DeepSeek models conversion failures that manifest like this:

```
$ python convert_hf_to_gguf.py /mnt/md0/huggingface/hub/models--deepseek-ai--DeepSeek-V3.1-Terminus/snapshots/19510d6dc61f79dbd925bd51ee8a9081c509a4b6/ --outfile /dev/null --outtype q8_0
...
INFO:gguf.gguf_writer:Writing the following files:
INFO:gguf.gguf_writer:/dev/null: n_tensors = 1086, total_size = 713.3G
Writing:   0%|                                        | 0.00/713G [00:00<?, ?byte/s]
/home/phm/projects/llama.cpp-ds32/gguf-py/gguf/quants.py:383: RuntimeWarning: invalid value encountered in divide
  d = abs(blocks).max(axis=1, keepdims=True) / 127
/home/phm/projects/llama.cpp-ds32/gguf-py/gguf/quants.py:386: RuntimeWarning: invalid value encountered in multiply
  qs = np_roundf(blocks * id)
/home/phm/projects/llama.cpp-ds32/gguf-py/gguf/quants.py:389: RuntimeWarning: overflow encountered in cast
  d = d.astype(np.float16).view(np.uint8)
/home/phm/projects/llama.cpp-ds32/gguf-py/gguf/quants.py:391: RuntimeWarning: invalid value encountered in cast
  qs = qs.astype(np.int8).view(np.uint8)
```

On the other hand I find it hard to believe that nobody noticed this yet, so just wanted to make sure.

@CISC
Collaborator

CISC commented Dec 21, 2025

I'm asking because I can't find any information indicating that safetensors file format requires the tensor byte buffer to be aligned to 8 bytes.

https://github.com/huggingface/safetensors/blob/987b7380f85db6f92214d9ed0f0d36ef0e71d5f6/safetensors/src/tensor.rs#L256-L258

@CISC
Collaborator

CISC commented Dec 21, 2025

```
INFO:gguf.gguf_writer:Writing the following files:
INFO:gguf.gguf_writer:/dev/null: n_tensors = 1086, total_size = 713.3G
Writing:   0%|                                        | 0.00/713G [00:00<?, ?byte/s]
/home/phm/projects/llama.cpp-ds32/gguf-py/gguf/quants.py:383: RuntimeWarning: invalid value encountered in divide
  d = abs(blocks).max(axis=1, keepdims=True) / 127
/home/phm/projects/llama.cpp-ds32/gguf-py/gguf/quants.py:386: RuntimeWarning: invalid value encountered in multiply
  qs = np_roundf(blocks * id)
/home/phm/projects/llama.cpp-ds32/gguf-py/gguf/quants.py:389: RuntimeWarning: overflow encountered in cast
  d = d.astype(np.float16).view(np.uint8)
/home/phm/projects/llama.cpp-ds32/gguf-py/gguf/quants.py:391: RuntimeWarning: invalid value encountered in cast
  qs = qs.astype(np.int8).view(np.uint8)
```

Either something went wrong in FP8 dequant or the download is corrupt?

@fairydreaming
Collaborator

I'm asking because I can't find any information indicating that safetensors file format requires the tensor byte buffer to be aligned to 8 bytes.

https://github.com/huggingface/safetensors/blob/987b7380f85db6f92214d9ed0f0d36ef0e71d5f6/safetensors/src/tensor.rs#L256-L258

@CISC this is serialization code; I agree that it's a good idea to pad the safetensors header metadata to 8 bytes so the byte buffer is aligned when you CREATE safetensors files. But I can't find any info that it's a file-format requirement. On the contrary, there are reports of safetensors being inefficient due to possible alignment issues, for example here: https://arxiv.org/html/2505.23072v1

The safetensors format does not define data alignment and can cause alignment errors due to constraints of both GDS and CUDA kernels.

@fairydreaming
Collaborator

@CISC Also found this discussion where one of the HF engineers mentions that alignment is not required: huggingface/safetensors#254

Because alignment is not required. Alignment speeds things up that's it.
Files already exist without alignment being forced.
Also if alignment needs to be done differently based on different machines it's going to become tricky.

It's kind of old, but perhaps still relevant.

Also: https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B/discussions/18

Looks like DeepSeek has a history of creating unaligned safetensor files.

@CISC
Collaborator

CISC commented Dec 21, 2025

@CISC this is a serialization code, I agree that it's a good idea to pad safetensors header metadata to 8 bytes to make byte buffer aligned when you CREATE safetensor files. But I can't find any info that it's a file format requirement.

May not be a requirement, but as long as that code has been there (I don't know if it always was), 64-bit alignment was guaranteed (as long as you use this library, of course). Though it does store the aligned length, so doing the extra alignment on the reader side doesn't make sense.
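To illustrate the point (my own sketch, not code from either project): the stored u64 header length already includes any space padding the writer appended, so `8 + header_len` is the true data start; rounding it up again is a no-op for padded files, but for unpadded files it would point past the start of the real tensor data:

```python
import struct

def data_start(blob: bytes) -> int:
    # The stored header length already counts any writer-side padding,
    # so no extra rounding is needed to find the data section.
    (header_len,) = struct.unpack_from("<Q", blob, 0)
    return 8 + header_len

hdr = b'{"t":{"dtype":"F32","shape":[1],"data_offsets":[0,4]}}'  # 55 bytes
payload = struct.pack("<f", 1.5)

# Writer that pads the header with spaces (valid JSON whitespace) to 8 bytes:
padded = hdr + b" " * (-len(hdr) % 8)
aligned_file = struct.pack("<Q", len(padded)) + padded + payload

# Writer that doesn't pad -- still a valid file per the format:
unaligned_file = struct.pack("<Q", len(hdr)) + hdr + payload

assert data_start(aligned_file) % 8 == 0   # rounding up is a no-op here
assert data_start(unaligned_file) % 8 != 0 # rounding up would skip real data
```

Reading the payload at the rounded-up offset in the second file would yield garbage bytes, which matches the invalid-value warnings seen during the DeepSeek conversion.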

@CISC
Collaborator

CISC commented Dec 21, 2025

@fairydreaming If you could make a PR it would be great, seeing as you're the only one testing DeepSeek models apparently. :)

@fairydreaming
Collaborator

I found when header alignment was introduced in safetensors:

commit 0c5b3a6b9c9be653b91ce8355225b8cad074f8b0
Author: Nicolas Patry <patry.nicolas@protonmail.com>
Date:   Tue Feb 21 15:23:56 2023 +0100

    [Major Change] Enforcing tensor alignment (#148)
    
    * [Major Change] Enforcing tensor alignment
    
    - Now the header will automatically align itself to 8 bytes (f64) with
      appending extra spaces as necessary.
    - This will allow extra fast memory mapping by reinterpreting bytes as
      f32/f64 etc.. Unaligned bytes do not allow for this. https://www.reddit.com/r/rust/comments/tanaxm/mutating_a_buffer_of_u8s_as_f32s_in_place/
    - This does not change contiguousness of tensors
    - This does not change the actual spec (we're just putting extra valid bytes
      in the header and using a different serialization ordering)
    - Readers should still be able to read old files, they would just need
      to be copied before being cast as their final destination when using
      mmap
    - This has no effect for GPU since copy is already necessary (*I think*,
      depends on the cuda API actually if it allows filling f32 addresses
      from raw unaligned bytes).
    
    This change will only be interesting if things like https://github.com/Narsil/fast_gpt2
    actually pick up. And even with the copy, load times are still vastly
    superior to `pytorch`.
    
    We need to be able to read old files.

So it looks like unaligned files are valid. Readers have to copy the data to fix alignment issues (if needed).
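In NumPy terms (my own sketch): an unaligned view over such a file is still readable, and copying it produces an aligned array, which is exactly the "copy before casting" the commit message describes:

```python
import numpy as np

# Simulate tensor bytes starting at an odd offset within a buffer, as in
# a safetensors file whose header wasn't padded to 8 bytes:
raw = b"\x00" + np.arange(4, dtype="<f4").tobytes()

# frombuffer tolerates the misalignment; it just yields an unaligned view.
t = np.frombuffer(raw, dtype="<f4", offset=1, count=4)

# Copying is what restores alignment, as the commit message suggests
# readers of old/unpadded files should do before casting mmapped data.
fixed = t.copy()
assert fixed.flags.aligned and fixed.tolist() == [0.0, 1.0, 2.0, 3.0]
```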

Anico2 added a commit to Anico2/llama.cpp that referenced this pull request Jan 15, 2026
* convert : parse safetensors directly

* gguf-py : order safetensors tensors by name

Applies to both local and remote safetensors custom parsing.
This matches the behavior of the official safetensors implementation.

* convert : rename from_safetensors_meta to from_local_tensor

For consistency with from_remote_tensor

* convert : fix no-lazy dtypes from direct safetensors
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

python python script changes


Development

Successfully merging this pull request may close these issues.

Misc. bug: convert_hf_to_gguf.py runs out of memory

5 participants