model: support GLM4V vision encoder #18042
Conversation
@tarruda just tested, it should work with the latest commit (feel free to give it a try)
Will do. Did you publish any GGUF weights?
```cpp
case LLM_ARCH_GLM4:
    return model->hparams.use_mrope() ? LLAMA_ROPE_TYPE_MROPE : LLAMA_ROPE_TYPE_NORM;
case LLM_ARCH_GLM4_MOE:
    return model->hparams.use_mrope() ? LLAMA_ROPE_TYPE_MROPE : LLAMA_ROPE_TYPE_NEOX;
```
The two models (vision and non-vision) are mostly the same except for the RoPE mode, so I was reluctant to duplicate this into a new arch (which would involve quite a lot of copy-paste code).
I hope we can de-duplicate some of this code via #18051
In the meantime, lmk if you're OK with keeping this hack, or if a new arch is still preferable @ggerganov @CISC
No, because there is a chance we will change the arch name
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Sorry for the interruption. I have quantized a Q4 model, but the current PR does not yet support the vision module.
Yes it does?
I tried https://huggingface.co/ggml-org/GLM-4.6V-GGUF with the annotate test page and it seems to work well. While it is not as good as Qwen3-VL 32B, it seems to have good bounding box capabilities:
Note that I used the Q8_0 mmproj; maybe it will improve with F16 or F32?
The BF16 or FP16 mmproj should definitely be used whenever possible: the mmproj degrades much faster under quantization than the regular LLM weights, and 2 GB is nothing when the regular model is already 70 GB.
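As a side note, here is a minimal sketch of how one could fetch a higher-precision mmproj from the repo mentioned above; the filename below is an assumption, so check the repo's file listing for the actual name:

```python
from huggingface_hub import hf_hub_download

# Hypothetical filename: verify against the repo's file listing.
mmproj_path = hf_hub_download(
    repo_id="ggml-org/GLM-4.6V-GGUF",      # repo linked earlier in this thread
    filename="mmproj-GLM-4.6V-f16.gguf",   # assumed name, not verified
)
print(mmproj_path)  # pass this path to llama-mtmd-cli / llama-server via --mmproj
```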
* convert ok
* no deepstack
* less new tensors
* cgraph ok
* add mrope for text model
* faster patch merger
* add GGML_ROPE_TYPE_MRNORM
* add support for metal
* move glm4v do dedicated graph
* convert: add norm_embd
* clip: add debugging fn
* working correctly
* fix style
* use bicubic
* fix mrope metal
* improve cpu
* convert to neox ordering on conversion
* revert backend changes
* force stop if using old weight
* support moe variant
* fix conversion
* fix convert (2)
* Update tools/mtmd/clip-graph.h (Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>)
* process mrope_section on TextModel base class
* resolve conflict merge

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>



At first look, it seems to be an easy model to support, as the HF implementation is pretty much the same as Qwen2.5-VL.
However, there are some very subtle differences that even some LLMs will miss (I tried both Grok and Gemini 3, and they both missed the first 2 points):
The embedding output was tested against HF transformers and confirmed to match.
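For illustration, a minimal sketch of the kind of numeric check such a comparison could use (this is not the actual test harness; dumping and loading the tensors is assumed to happen elsewhere):

```python
import numpy as np

def compare_embeddings(ref: np.ndarray, out: np.ndarray) -> None:
    """Compare a reference embedding (e.g. from HF transformers) against a
    test embedding (e.g. dumped from the llama.cpp / clip graph)."""
    ref = ref.astype(np.float64).ravel()
    out = out.astype(np.float64).ravel()
    cos = float(np.dot(ref, out) / (np.linalg.norm(ref) * np.linalg.norm(out)))
    print(f"cosine similarity: {cos:.6f}")
    print(f"max abs diff:      {np.max(np.abs(ref - out)):.6e}")
```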
Important
RoPE ordering is corrected during conversion, so there are no longer any backend changes in this PR.
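For reference, a minimal numpy sketch of the kind of per-head reordering such a conversion step could apply, assuming the HF weights store the rotary dimensions in interleaved pair order and that the permutation is applied to the rows of the Q/K projection weights (this is not the actual convert_hf_to_gguf.py code):

```python
import numpy as np

def interleaved_to_neox(w: np.ndarray, n_heads: int) -> np.ndarray:
    """Reorder rotary dimensions of a Q/K projection weight from interleaved
    pair order (x0, y0, x1, y1, ...) to NeoX half-split order
    (x0, x1, ..., y0, y1, ...), independently for each attention head.

    w is assumed to have shape (n_heads * head_dim, hidden_dim).
    """
    out_dim, hidden_dim = w.shape
    head_dim = out_dim // n_heads
    w = w.reshape(n_heads, head_dim // 2, 2, hidden_dim)  # (head, pair, 2, hidden)
    w = w.transpose(0, 2, 1, 3)                           # (head, 2, pair, hidden)
    return w.reshape(out_dim, hidden_dim)
```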
Testing
https://huggingface.co/zai-org/GLM-4.6V-Flash
I'm using the ./tools/mtmd/test-1.jpeg already included in this repo:
```
llama-mtmd-cli -m ..... -mm ..... --image ./tools/mtmd/test-1.jpeg -p "extract all texts from this image" --temp 0 -n 1024
```

Output: