model: support GLM4V vision encoder #18042
Conversation
@tarruda just tested, it should work with the latest commit (feel free to give it a try)
Will do. Did you publish any GGUF weights?
```cpp
case LLM_ARCH_GLM4:
    return model->hparams.use_mrope() ? LLAMA_ROPE_TYPE_MROPE : LLAMA_ROPE_TYPE_NORM;
case LLM_ARCH_GLM4_MOE:
    return model->hparams.use_mrope() ? LLAMA_ROPE_TYPE_MROPE : LLAMA_ROPE_TYPE_NEOX;
```
The two models (vision and non-vision) are mostly the same except for the RoPE mode, so I was reluctant to duplicate this into a new arch (which would involve quite a lot of copy-paste code).
I hope we can de-duplicate some of this code via #18051
In the meantime, lmk if you're OK with keeping this hack, or if a new arch is still preferable @ggerganov @CISC
No, because there is a chance we will change the arch name
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Sorry for the interruption. I have quantized a Q4 model, but the current PR does not yet support the vision module.
Yes it does?
I tried https://huggingface.co/ggml-org/GLM-4.6V-GGUF with the annotate test page and it seems to work well. While it is not as good as Qwen3-VL 32B, it seems to have good bounding box capabilities:
Note that I used the Q8_0 mmproj; maybe it will improve with F16 or F32?
The BF16 or FP16 mmproj should definitely be used whenever possible: the mmproj degrades much faster under quantization than the regular LLM weights, and 2 GB is nothing when the regular model is already 70 GB.
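As a side note, here is a minimal sketch of how one could fetch a higher-precision mmproj from the repo mentioned above; the filename below is an assumption, so check the repo's file listing for the actual name:

```python
from huggingface_hub import hf_hub_download

# Hypothetical filename: verify against the repo's file listing.
mmproj_path = hf_hub_download(
    repo_id="ggml-org/GLM-4.6V-GGUF",      # repo linked earlier in this thread
    filename="mmproj-GLM-4.6V-f16.gguf",   # assumed name, not verified
)
print(mmproj_path)  # pass this path to llama-mtmd-cli / llama-server via --mmproj
```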
* convert ok
* no deepstack
* less new tensors
* cgraph ok
* add mrope for text model
* faster patch merger
* add GGML_ROPE_TYPE_MRNORM
* add support for metal
* move glm4v do dedicated graph
* convert: add norm_embd
* clip: add debugging fn
* working correctly
* fix style
* use bicubic
* fix mrope metal
* improve cpu
* convert to neox ordering on conversion
* revert backend changes
* force stop if using old weight
* support moe variant
* fix conversion
* fix convert (2)
* Update tools/mtmd/clip-graph.h (Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>)
* process mrope_section on TextModel base class
* resolve conflict merge

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>



At first look, it seems to be an easy model to support, as the HF implementation is pretty much the same as Qwen2.5-VL.
However, there are some very subtle differences that even some LLMs will miss (I tried both Grok and Gemini 3, and they both missed the first 2 points):
The embedding output was tested against HF transformers and confirmed to match.
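For illustration, a minimal sketch of the kind of numeric check such a comparison could use (this is not the actual test harness; dumping and loading the tensors is assumed to happen elsewhere):

```python
import numpy as np

def compare_embeddings(ref: np.ndarray, out: np.ndarray) -> None:
    """Compare a reference embedding (e.g. from HF transformers) against a
    test embedding (e.g. dumped from the llama.cpp / clip graph)."""
    ref = ref.astype(np.float64).ravel()
    out = out.astype(np.float64).ravel()
    cos = float(np.dot(ref, out) / (np.linalg.norm(ref) * np.linalg.norm(out)))
    print(f"cosine similarity: {cos:.6f}")
    print(f"max abs diff:      {np.max(np.abs(ref - out)):.6e}")
```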
Important
RoPE ordering is corrected during conversion, so there are no longer any backend changes in this PR.
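For reference, a minimal numpy sketch of the kind of per-head reordering such a conversion step could apply, assuming the HF weights store the rotary dimensions in interleaved pair order and that the permutation is applied to the rows of the Q/K projection weights (this is not the actual convert_hf_to_gguf.py code):

```python
import numpy as np

def interleaved_to_neox(w: np.ndarray, n_heads: int) -> np.ndarray:
    """Reorder rotary dimensions of a Q/K projection weight from interleaved
    pair order (x0, y0, x1, y1, ...) to NeoX half-split order
    (x0, x1, ..., y0, y1, ...), independently for each attention head.

    w is assumed to have shape (n_heads * head_dim, hidden_dim).
    """
    out_dim, hidden_dim = w.shape
    head_dim = out_dim // n_heads
    w = w.reshape(n_heads, head_dim // 2, 2, hidden_dim)  # (head, pair, 2, hidden)
    w = w.transpose(0, 2, 1, 3)                           # (head, 2, pair, hidden)
    return w.reshape(out_dim, hidden_dim)
```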
Testing
https://huggingface.co/zai-org/GLM-4.6V-Flash
I'm using the ./tools/mtmd/test-1.jpeg already included in this repo:
```
llama-mtmd-cli -m ..... -mm ..... --image ./tools/mtmd/test-1.jpeg -p "extract all texts from this image" --temp 0 -n 1024
```

Output: