convert : parse safetensors directly #15667
Conversation
Force-pushed 85edafe to 786b32d
I can confirm that this helped me convert GLM 4.5 Air, whereas current
Force-pushed 786b32d to e582f1a
Is there anything preventing this PR from being merged or taken out of draft? It's impossible for me to convert GLM Air reliably without this PR, so I think it's quite useful to have.
This comment was marked as off-topic.
@whatever1983 this has nothing to do with FP8 conversion. This is simply a more memory-efficient way of performing the GGUF conversion that prevents OOMs/crashing during the conversion process, which I need in order to convert GLM Air. As for politics, I can't advise on that. I just want to successfully convert my models, hence me bumping the issue.
* gguf-py : order safetensors tensors by name
  Applies to both local and remote safetensors custom parsing. This matches the behavior of the official safetensors implementation.
* convert : rename from_safetensors_meta to from_local_tensor
  For consistency with from_remote_tensor
Force-pushed e582f1a to e996f3a
@compilade I have a question regarding this change: why do you brute-force alignment of the safetensors file byte buffer with this code:

llama.cpp/gguf-py/gguf/utility.py Lines 319 to 321 in fd05c51

I see another occurrence of this here (added later):

llama.cpp/gguf-py/gguf/utility.py Lines 207 to 209 in fd05c51

I'm asking because I can't find any information indicating that the safetensors file format requires the tensor byte buffer to be aligned to 8 bytes. Moreover, the current code causes DeepSeek model conversion failures that manifest like this:

On the other hand, I find it hard to believe that nobody noticed this yet, so I just wanted to make sure.
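For context on what "aligned" means here: a safetensors file starts with an 8-byte little-endian header length, followed by that many bytes of JSON metadata, then the raw tensor byte buffer. Whether the buffer starts on an 8-byte boundary depends entirely on the header length. A minimal diagnostic sketch (illustrative code, not the gguf-py implementation; function names are made up):

```python
import json
import struct

def data_region_offset(path: str) -> int:
    """Return the file offset where the tensor byte buffer begins."""
    with open(path, "rb") as f:
        # safetensors layout: u64 little-endian header length, then JSON header
        (header_len,) = struct.unpack("<Q", f.read(8))
        json.loads(f.read(header_len))  # parse to validate the header
        return 8 + header_len

def is_aligned(path: str, alignment: int = 8) -> bool:
    """True if the tensor data region starts on an `alignment`-byte boundary."""
    return data_region_offset(path) % alignment == 0
```

A writer that pads its JSON header with trailing spaces makes `is_aligned` always true; the discussion below is about whether readers may assume that.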
Either something went wrong in FP8 dequant or the download is corrupt?
@CISC this is serialization code. I agree that it's a good idea to pad the safetensors header metadata to 8 bytes to make the byte buffer aligned when you CREATE safetensors files, but I can't find any info that it's a file-format requirement. On the contrary, there are reports of safetensors being inefficient due to possible alignment issues, for example here: https://arxiv.org/html/2505.23072v1
@CISC I also found this discussion where one of the HF engineers mentions that alignment is not required: huggingface/safetensors#254
It's kind of old, but perhaps still relevant. Also: https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B/discussions/18 It looks like DeepSeek has a history of creating unaligned safetensors files.
It may not be a requirement, but as long as that code has been there (I don't know if it always was), 64-bit alignment was guaranteed (as long as you use this library, of course). Though it does store the aligned length, so it doesn't make sense to do the extra alignment.
@fairydreaming If you could make a PR it would be great, seeing as you're apparently the only one testing DeepSeek models. :)
I found when header alignment was introduced in safetensors. So it looks like unaligned files are valid, and readers have to copy the data to fix alignment issues (if needed).
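The read-side fix described above can be sketched as follows (an illustrative snippet, not the actual gguf-py code; `load_tensor` is a made-up name): if a tensor's byte range does not start at a multiple of the element size, copy the bytes into a freshly allocated buffer instead of viewing the mmapped file directly.

```python
import numpy as np

def load_tensor(data: memoryview, start: int, end: int, dtype: np.dtype, shape: tuple) -> np.ndarray:
    """View tensor bytes in-place when aligned, copying only when misaligned."""
    raw = data[start:end]
    if start % dtype.itemsize != 0:
        # A fresh bytes object is always suitably aligned for any dtype,
        # unlike an arbitrary offset into a page-aligned mmapped file.
        raw = bytes(raw)
    return np.frombuffer(raw, dtype=dtype).reshape(shape)
```

The copy only happens for the (rare) unaligned files, so well-formed files keep the zero-copy fast path.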
* convert : parse safetensors directly
* gguf-py : order safetensors tensors by name
  Applies to both local and remote safetensors custom parsing. This matches the behavior of the official safetensors implementation.
* convert : rename from_safetensors_meta to from_local_tensor
  For consistency with from_remote_tensor
* convert : fix no-lazy dtypes from direct safetensors
Should fix #15623
(originally targeted #14810, but was rebased)
This replaces the approach from #8482 to avoid using `get_slice`, because it turns out it eagerly memmaps tensors, which means on Windows this uses a lot of memory, and on Linux it inflates the resident set size.

Safetensors files are now parsed directly, since the format is simple enough. This will also eventually allow tracking the file ranges of tensors, to maybe use `os.copy_file_range` when possible to make conversion on COW filesystems very fast (in #15727).

On Linux, when using `memray` (a memory profiler), this change reduces the peak heap memory usage by quite a lot, and with GNU `time` it also reduces the peak resident set size.

The previous behavior observed with `memray` seems to be that `safe_open` puts all of the model into the heap (likely memmapped, though, since the resident set size is smaller and grows). The new behavior observed with `memray` is more similar to what I thought happened in the first place: bumps of memory usage at each processed tensor, which go back down between tensors.

Here's a table of the "Maximum resident set size (kbytes)" from `time -v` (when using GNU `time`) on a few models:

`$ $(which time) -v python3 convert_hf_to_gguf.py /path/to/model_dir --outfile /path/to/model.gguf --outtype f16`

`master` (kbytes)

Safetensors files are already directly parsed since #12820 for remote models. This is similar, but for local models.
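For readers unfamiliar with the format, the direct parsing described above boils down to something like the following (a simplified sketch, not the actual gguf-py code; names are illustrative): read the 8-byte little-endian header length, parse the JSON header, sort tensor entries by name to match the official implementation, and keep each tensor's absolute byte range for later lazy reads.

```python
import json
import struct

def parse_safetensors_header(path: str) -> dict[str, dict]:
    """Parse only the JSON header of a safetensors file; tensor bytes stay on disk."""
    with open(path, "rb") as f:
        (n,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(n))
    header.pop("__metadata__", None)  # optional free-form metadata entry
    data_start = 8 + n
    out = {}
    for name in sorted(header):  # iterate tensors ordered by name
        info = header[name]
        begin, end = info["data_offsets"]
        out[name] = {
            "dtype": info["dtype"],
            "shape": info["shape"],
            # absolute file range, usable with seek/read (or os.copy_file_range)
            "range": (data_start + begin, data_start + end),
        }
    return out
```

Since only the header is read eagerly, peak memory stays proportional to one tensor rather than the whole model.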
TODO:
- The `safetensors` library automatically byteswaps when running on a big-endian platform (since the format is always little-endian), but `GGUFWriter` byteswaps unconditionally when the target endianness is big, so this never really worked anyway? (Double-byteswapping in this case would produce little-endian tensors...) Unless I'm misunderstanding something.
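The double-byteswap concern in the checklist item can be illustrated with numpy (a toy demonstration, not llama.cpp code): swapping twice restores the original little-endian byte pattern, so a big-endian target would end up with little-endian tensor data.

```python
import numpy as np

le = np.arange(4, dtype=np.dtype("<u4"))  # tensor bytes as stored on disk (little-endian)
host_be = le.astype(np.dtype(">u4"))      # swap 1: a reader byteswapping to a big-endian host
double_swapped = host_be.byteswap()       # swap 2: an unconditional swap for a big-endian target
# The two swaps cancel: the output byte pattern is little-endian again.
assert double_swapped.tobytes() == le.tobytes()
```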