Eval bug: GLM4.6V Flash incoherent on Vulkan #18164

@LostRuins

Description

Name and Version

llama-b7446-bin-win-vulkan-x64

Operating systems

Windows

GGML backends

Vulkan

Hardware

Nvidia RTX 4090

Models

GLM-4.6V-Flash-Q4_K_M.gguf and mmproj-GLM-4.6V-Flash-Q8_0.gguf

Official GGML org quants from https://huggingface.co/ggml-org/GLM-4.6V-Flash-GGUF/tree/main

Problem description & steps to reproduce

The model works perfectly fine on CPU, but produces extremely degraded output on Vulkan with the official llama-b7446-bin-win-vulkan-x64 build. The degradation seems to be linked to the size/dimensions of the input image: with some images the output is decent, with others it is extremely poor.

I have attached an image for testing. Please note that the dimensions affect this behavior; this one is 768x768.

[Attached test image, 768x768]

If I simply add --no-mmproj-offload, everything runs fine.
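For quick reference, these are the two invocations used in the logs below (full Desktop paths shortened); the first reproduces the degraded output, the second is the workaround:

Case 1 (degraded output, mmproj offloaded to Vulkan):
llama-mtmd-cli.exe -m GLM-4.6V-Flash-Q4_K_M.gguf --mmproj mmproj-GLM-4.6V-Flash-Q8_0.gguf --image benchy.jpg -p "what do you see?"

Case 2 (correct output, mmproj kept on CPU):
llama-mtmd-cli.exe -m GLM-4.6V-Flash-Q4_K_M.gguf --mmproj mmproj-GLM-4.6V-Flash-Q8_0.gguf --image benchy.jpg -p "what do you see?" --no-mmproj-offload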

Thank you for your consideration @ngxson @jeffbolznv @0cc4m

First Bad Commit

Present since it was added in #18042.

Relevant log output

Case 1: mmproj offloaded to GPU (Vulkan)

C:\Users\user\Desktop\llama-b7446-bin-win-vulkan-x64>llama-mtmd-cli.exe -m c:\Users\user\Desktop\GLM-4.6V-Flash-Q4_K_M.gguf --mmproj c:\Users\user\Desktop\mmproj-GLM-4.6V-Flash-Q8_0.gguf --image c:\Users\user\Desktop\benchy.jpg -p "what do you see?"
load_backend: loaded RPC backend from C:\Users\user\Desktop\llama-b7446-bin-win-vulkan-x64\ggml-rpc.dll
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = Intel(R) RaptorLake-S Mobile Graphics Controller (Intel Corporation) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 32768 | int dot: 1 | matrix cores: none
ggml_vulkan: 1 = NVIDIA GeForce RTX 4090 Laptop GPU (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
load_backend: loaded Vulkan backend from C:\Users\user\Desktop\llama-b7446-bin-win-vulkan-x64\ggml-vulkan.dll
load_backend: loaded CPU backend from C:\Users\user\Desktop\llama-b7446-bin-win-vulkan-x64\ggml-cpu-haswell.dll
build: 7446 (5c0d18881) with Clang 19.1.5 for Windows x86_64
common_init_result: fitting params to device memory, to report bugs during this step use -fit off (or --verbose if you can't)
llama_params_fit_impl: projected to use 11068 MiB of device memory vs. 16050 MiB of free device memory
llama_params_fit_impl: will leave 4210 >= 1024 MiB of free device memory, no changes needed
llama_params_fit: successfully fit params to free device memory
llama_params_fit: fitting params to free memory took 0.71 seconds
llama_model_load_from_file_impl: using device Vulkan1 (NVIDIA GeForce RTX 4090 Laptop GPU) (0000:01:00.0) - 15278 MiB free
llama_model_loader: loaded meta data with 32 key-value pairs and 523 tensors from c:\Users\user\Desktop\GLM-4.6V-Flash-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = glm4
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                     general.sampling.top_k i32              = 2
llama_model_loader: - kv   3:                     general.sampling.top_p f32              = 0.600000
llama_model_loader: - kv   4:                      general.sampling.temp f32              = 0.800000
llama_model_loader: - kv   5:                         general.size_label str              = 9.4B
llama_model_loader: - kv   6:                            general.license str              = mit
llama_model_loader: - kv   7:                               general.tags arr[str,1]       = ["image-text-to-text"]
llama_model_loader: - kv   8:                          general.languages arr[str,2]       = ["zh", "en"]
llama_model_loader: - kv   9:                           glm4.block_count u32              = 40
llama_model_loader: - kv  10:                        glm4.context_length u32              = 131072
llama_model_loader: - kv  11:                      glm4.embedding_length u32              = 4096
llama_model_loader: - kv  12:                   glm4.feed_forward_length u32              = 13696
llama_model_loader: - kv  13:                  glm4.attention.head_count u32              = 32
llama_model_loader: - kv  14:               glm4.attention.head_count_kv u32              = 2
llama_model_loader: - kv  15:               glm4.rope.dimension_sections arr[i32,4]       = [8, 12, 12, 0]
llama_model_loader: - kv  16:                        glm4.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  17:      glm4.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  18:                  glm4.rope.dimension_count u32              = 64
llama_model_loader: - kv  19:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  20:                         tokenizer.ggml.pre str              = glm4
llama_model_loader: - kv  21:                      tokenizer.ggml.tokens arr[str,151552]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  22:                  tokenizer.ggml.token_type arr[i32,151552]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  23:                      tokenizer.ggml.merges arr[str,318088]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  24:                tokenizer.ggml.eos_token_id u32              = 151329
llama_model_loader: - kv  25:            tokenizer.ggml.padding_token_id u32              = 151329
llama_model_loader: - kv  26:                tokenizer.ggml.eot_token_id u32              = 151336
llama_model_loader: - kv  27:            tokenizer.ggml.unknown_token_id u32              = 151329
llama_model_loader: - kv  28:                tokenizer.ggml.bos_token_id u32              = 151329
llama_model_loader: - kv  29:                    tokenizer.chat_template str              = [gMASK]<sop>\n{%- if tools -%}\n<|syste...
llama_model_loader: - kv  30:               general.quantization_version u32              = 2
llama_model_loader: - kv  31:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  281 tensors
llama_model_loader: - type q5_0:   20 tensors
llama_model_loader: - type q8_0:   20 tensors
llama_model_loader: - type q4_K:  181 tensors
llama_model_loader: - type q6_K:   21 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 5.73 GiB (5.24 BPW)
load: special_eot_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load:   - 151329 ('<|endoftext|>')
load:   - 151336 ('<|user|>')
load: special tokens cache size = 36
load: token to piece cache size = 0.9713 MB
print_info: arch             = glm4
print_info: vocab_only       = 0
print_info: no_alloc         = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 4096
print_info: n_embd_inp       = 4096
print_info: n_layer          = 40
print_info: n_head           = 32
print_info: n_head_kv        = 2
print_info: n_rot            = 64
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 16
print_info: n_embd_k_gqa     = 256
print_info: n_embd_v_gqa     = 256
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 13696
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: n_expert_groups  = 0
print_info: n_group_used     = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 8
print_info: rope scaling     = linear
print_info: freq_base_train  = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_yarn_log_mul= 0.0000
print_info: rope_finetuned   = unknown
print_info: mrope sections   = [8, 12, 12, 0]
print_info: model type       = 9B
print_info: model params     = 9.40 B
print_info: general.name     = n/a
print_info: vocab type       = BPE
print_info: n_vocab          = 151552
print_info: n_merges         = 318088
print_info: BOS token        = 151329 '<|endoftext|>'
print_info: EOS token        = 151329 '<|endoftext|>'
print_info: EOT token        = 151336 '<|user|>'
print_info: UNK token        = 151329 '<|endoftext|>'
print_info: PAD token        = 151329 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151347 '<|code_prefix|>'
print_info: FIM SUF token    = 151349 '<|code_suffix|>'
print_info: FIM MID token    = 151348 '<|code_middle|>'
print_info: EOG token        = 151329 '<|endoftext|>'
print_info: EOG token        = 151336 '<|user|>'
print_info: max token length = 1024
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 40 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 41/41 layers to GPU
load_tensors:   CPU_Mapped model buffer size =   333.00 MiB
load_tensors:      Vulkan1 model buffer size =  5539.00 MiB
.........................................................................................
common_init_result: added <|endoftext|> logit bias = -inf
common_init_result: added <|user|> logit bias = -inf
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 131072
llama_context: n_ctx_seq     = 131072
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = auto
llama_context: kv_unified    = false
llama_context: freq_base     = 500000.0
llama_context: freq_scale    = 1
llama_context: Vulkan_Host  output buffer size =     0.58 MiB
llama_kv_cache:    Vulkan1 KV buffer size =  5120.00 MiB
llama_kv_cache: size = 5120.00 MiB (131072 cells,  40 layers,  1/1 seqs), K (f16): 2560.00 MiB, V (f16): 2560.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context:    Vulkan1 compute buffer size =   409.02 MiB
llama_context: Vulkan_Host compute buffer size =   264.02 MiB
llama_context: graph nodes  = 1487
llama_context: graph splits = 2
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
mtmd_cli_context: chat template example:
[gMASK]<sop><|system|>
You are a helpful assistant<|user|>
Hello<|assistant|>
Hi there<|user|>
How are you?<|assistant|>

clip_model_loader: model name:
clip_model_loader: description:
clip_model_loader: GGUF version: 3
clip_model_loader: alignment:    32
clip_model_loader: n_tensors:    182
clip_model_loader: n_kv:         24

clip_model_loader: has vision encoder
clip_ctx: CLIP using Vulkan1 backend
load_hparams: projector:          glm4v
load_hparams: n_embd:             1536
load_hparams: n_head:             12
load_hparams: n_ff:               13696
load_hparams: n_layer:            24
load_hparams: ffn_op:             silu
load_hparams: projection_dim:     4096

--- vision hparams ---
load_hparams: image_size:         336
load_hparams: patch_size:         14
load_hparams: has_llava_proj:     0
load_hparams: minicpmv_version:   0
load_hparams: n_merge:            2
load_hparams: n_wa_pattern:       0
load_hparams: image_min_pixels:   6272
load_hparams: image_max_pixels:   3211264

load_hparams: model size:         934.64 MiB
load_hparams: metadata size:      0.06 MiB
warmup: warmup with image size = 1288 x 1288
alloc_compute_meta:    Vulkan1 compute buffer size =   515.05 MiB
alloc_compute_meta:        CPU compute buffer size =    19.11 MiB
alloc_compute_meta: graph splits = 1, nodes = 632
warmup: flash attention is enabled
main: loading model: c:\Users\user\Desktop\GLM-4.6V-Flash-Q4_K_M.gguf
WARN: This is an experimental CLI for testing multimodal capability.
      For normal use cases, please use the standard llama-cli
encoding image slice...
image slice encoded in 128 ms
decoding image batch 1/1, n_tokens_batch = 729
image decoded (batch 1/1) in 157 ms

<think>Got it, let's see. The user is asking "what do you see?" and there's an image, but maybe the image is a simple one? Wait, looking at the input, the image might be a star or something? Wait, no, the user provided an image, but in the text, maybe it's a star-like shape? Wait, the image description—maybe it's a star or a snowflake? Let me think. The question is "what do you see?" So I need to describe what's in the image.

Looking at the image (assuming it's a star or a triangular shape with points, maybe a snowflake or a star). Wait, the image might be a star with six points, like a snowflake or a star symbol. So I should describe that.

So the answer would be something like "I see a star-shaped figure, possibly a snowflake or a six-pointed star, with symmetrical points." Or more precisely, looking at the image, it's a star (like a snowflake) with triangular sections, maybe a stylized star. Let me check the image again. The image is a star with six points, each point is a triangle, forming a star shape. So I can describe that.

So the response should be: I see a star - shaped (or snowflake - like) figure with multiple points, likely a stylized star or snowflake design.</think>
<|begin_of_box|>I see a star - shaped (or snowflake - like) figure with multiple points, which appears to be a stylized star or snowflake design.<|end_of_box|>


llama_perf_context_print:        load time =    5753.35 ms
llama_perf_context_print: prompt eval time =     526.60 ms /   742 tokens (    0.71 ms per token,  1409.03 tokens per second)
llama_perf_context_print:        eval time =    4463.51 ms /   326 runs   (   13.69 ms per token,    73.04 tokens per second)
llama_perf_context_print:       total time =    5855.37 ms /  1068 tokens
llama_perf_context_print:    graphs reused =          0

Case 2: mmproj kept on CPU (--no-mmproj-offload)

C:\Users\user\Desktop\llama-b7446-bin-win-vulkan-x64>llama-mtmd-cli.exe -m c:\Users\user\Desktop\GLM-4.6V-Flash-Q4_K_M.gguf --mmproj c:\Users\user\Desktop\mmproj-GLM-4.6V-Flash-Q8_0.gguf --image c:\Users\user\Desktop\benchy.jpg -p "what do you see?" --no-mmproj-offload
load_backend: loaded RPC backend from C:\Users\user\Desktop\llama-b7446-bin-win-vulkan-x64\ggml-rpc.dll
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = Intel(R) RaptorLake-S Mobile Graphics Controller (Intel Corporation) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 32768 | int dot: 1 | matrix cores: none
ggml_vulkan: 1 = NVIDIA GeForce RTX 4090 Laptop GPU (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
load_backend: loaded Vulkan backend from C:\Users\user\Desktop\llama-b7446-bin-win-vulkan-x64\ggml-vulkan.dll
load_backend: loaded CPU backend from C:\Users\user\Desktop\llama-b7446-bin-win-vulkan-x64\ggml-cpu-haswell.dll
build: 7446 (5c0d18881) with Clang 19.1.5 for Windows x86_64
common_init_result: fitting params to device memory, to report bugs during this step use -fit off (or --verbose if you can't)
llama_params_fit_impl: projected to use 11068 MiB of device memory vs. 16050 MiB of free device memory
llama_params_fit_impl: will leave 4210 >= 1024 MiB of free device memory, no changes needed
llama_params_fit: successfully fit params to free device memory
llama_params_fit: fitting params to free memory took 0.71 seconds
llama_model_load_from_file_impl: using device Vulkan1 (NVIDIA GeForce RTX 4090 Laptop GPU) (0000:01:00.0) - 15278 MiB free
llama_model_loader: loaded meta data with 32 key-value pairs and 523 tensors from c:\Users\user\Desktop\GLM-4.6V-Flash-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = glm4
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                     general.sampling.top_k i32              = 2
llama_model_loader: - kv   3:                     general.sampling.top_p f32              = 0.600000
llama_model_loader: - kv   4:                      general.sampling.temp f32              = 0.800000
llama_model_loader: - kv   5:                         general.size_label str              = 9.4B
llama_model_loader: - kv   6:                            general.license str              = mit
llama_model_loader: - kv   7:                               general.tags arr[str,1]       = ["image-text-to-text"]
llama_model_loader: - kv   8:                          general.languages arr[str,2]       = ["zh", "en"]
llama_model_loader: - kv   9:                           glm4.block_count u32              = 40
llama_model_loader: - kv  10:                        glm4.context_length u32              = 131072
llama_model_loader: - kv  11:                      glm4.embedding_length u32              = 4096
llama_model_loader: - kv  12:                   glm4.feed_forward_length u32              = 13696
llama_model_loader: - kv  13:                  glm4.attention.head_count u32              = 32
llama_model_loader: - kv  14:               glm4.attention.head_count_kv u32              = 2
llama_model_loader: - kv  15:               glm4.rope.dimension_sections arr[i32,4]       = [8, 12, 12, 0]
llama_model_loader: - kv  16:                        glm4.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  17:      glm4.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  18:                  glm4.rope.dimension_count u32              = 64
llama_model_loader: - kv  19:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  20:                         tokenizer.ggml.pre str              = glm4
llama_model_loader: - kv  21:                      tokenizer.ggml.tokens arr[str,151552]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  22:                  tokenizer.ggml.token_type arr[i32,151552]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  23:                      tokenizer.ggml.merges arr[str,318088]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  24:                tokenizer.ggml.eos_token_id u32              = 151329
llama_model_loader: - kv  25:            tokenizer.ggml.padding_token_id u32              = 151329
llama_model_loader: - kv  26:                tokenizer.ggml.eot_token_id u32              = 151336
llama_model_loader: - kv  27:            tokenizer.ggml.unknown_token_id u32              = 151329
llama_model_loader: - kv  28:                tokenizer.ggml.bos_token_id u32              = 151329
llama_model_loader: - kv  29:                    tokenizer.chat_template str              = [gMASK]<sop>\n{%- if tools -%}\n<|syste...
llama_model_loader: - kv  30:               general.quantization_version u32              = 2
llama_model_loader: - kv  31:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  281 tensors
llama_model_loader: - type q5_0:   20 tensors
llama_model_loader: - type q8_0:   20 tensors
llama_model_loader: - type q4_K:  181 tensors
llama_model_loader: - type q6_K:   21 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 5.73 GiB (5.24 BPW)
load: special_eot_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load:   - 151329 ('<|endoftext|>')
load:   - 151336 ('<|user|>')
load: special tokens cache size = 36
load: token to piece cache size = 0.9713 MB
print_info: arch             = glm4
print_info: vocab_only       = 0
print_info: no_alloc         = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 4096
print_info: n_embd_inp       = 4096
print_info: n_layer          = 40
print_info: n_head           = 32
print_info: n_head_kv        = 2
print_info: n_rot            = 64
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 16
print_info: n_embd_k_gqa     = 256
print_info: n_embd_v_gqa     = 256
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 13696
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: n_expert_groups  = 0
print_info: n_group_used     = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 8
print_info: rope scaling     = linear
print_info: freq_base_train  = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_yarn_log_mul= 0.0000
print_info: rope_finetuned   = unknown
print_info: mrope sections   = [8, 12, 12, 0]
print_info: model type       = 9B
print_info: model params     = 9.40 B
print_info: general.name     = n/a
print_info: vocab type       = BPE
print_info: n_vocab          = 151552
print_info: n_merges         = 318088
print_info: BOS token        = 151329 '<|endoftext|>'
print_info: EOS token        = 151329 '<|endoftext|>'
print_info: EOT token        = 151336 '<|user|>'
print_info: UNK token        = 151329 '<|endoftext|>'
print_info: PAD token        = 151329 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151347 '<|code_prefix|>'
print_info: FIM SUF token    = 151349 '<|code_suffix|>'
print_info: FIM MID token    = 151348 '<|code_middle|>'
print_info: EOG token        = 151329 '<|endoftext|>'
print_info: EOG token        = 151336 '<|user|>'
print_info: max token length = 1024
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 40 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 41/41 layers to GPU
load_tensors:   CPU_Mapped model buffer size =   333.00 MiB
load_tensors:      Vulkan1 model buffer size =  5539.00 MiB
.........................................................................................
common_init_result: added <|endoftext|> logit bias = -inf
common_init_result: added <|user|> logit bias = -inf
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 131072
llama_context: n_ctx_seq     = 131072
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = auto
llama_context: kv_unified    = false
llama_context: freq_base     = 500000.0
llama_context: freq_scale    = 1
llama_context: Vulkan_Host  output buffer size =     0.58 MiB
llama_kv_cache:    Vulkan1 KV buffer size =  5120.00 MiB
llama_kv_cache: size = 5120.00 MiB (131072 cells,  40 layers,  1/1 seqs), K (f16): 2560.00 MiB, V (f16): 2560.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context:    Vulkan1 compute buffer size =   409.02 MiB
llama_context: Vulkan_Host compute buffer size =   264.02 MiB
llama_context: graph nodes  = 1487
llama_context: graph splits = 2
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
mtmd_cli_context: chat template example:
[gMASK]<sop><|system|>
You are a helpful assistant<|user|>
Hello<|assistant|>
Hi there<|user|>
How are you?<|assistant|>

clip_model_loader: model name:
clip_model_loader: description:
clip_model_loader: GGUF version: 3
clip_model_loader: alignment:    32
clip_model_loader: n_tensors:    182
clip_model_loader: n_kv:         24

clip_model_loader: has vision encoder
clip_ctx: CLIP using CPU backend
load_hparams: projector:          glm4v
load_hparams: n_embd:             1536
load_hparams: n_head:             12
load_hparams: n_ff:               13696
load_hparams: n_layer:            24
load_hparams: ffn_op:             silu
load_hparams: projection_dim:     4096

--- vision hparams ---
load_hparams: image_size:         336
load_hparams: patch_size:         14
load_hparams: has_llava_proj:     0
load_hparams: minicpmv_version:   0
load_hparams: n_merge:            2
load_hparams: n_wa_pattern:       0
load_hparams: image_min_pixels:   6272
load_hparams: image_max_pixels:   3211264

load_hparams: model size:         934.64 MiB
load_hparams: metadata size:      0.06 MiB
warmup: warmup with image size = 1288 x 1288
alloc_compute_meta:        CPU compute buffer size =   515.05 MiB
alloc_compute_meta: graph splits = 1, nodes = 632
warmup: flash attention is enabled
main: loading model: c:\Users\user\Desktop\GLM-4.6V-Flash-Q4_K_M.gguf
WARN: This is an experimental CLI for testing multimodal capability.
      For normal use cases, please use the standard llama-cli
encoding image slice...
image slice encoded in 13919 ms
decoding image batch 1/1, n_tokens_batch = 729
image decoded (batch 1/1) in 401 ms

<think>Got it, let's see. The image has a creature that looks like a dragon or a lizard-like humanoid, wearing an orange robe, holding a staff and a glowing orb. There's text on the image: "not all who wander are lost" in white, and "the mind is a terrible thing to waste" in blue. So I need to describe what's in the image.

First, the character: a dragon-like humanoid (maybe a dragonborn from D&D) in an orange tunic, with a staff, a glowing object in one hand, a sword at the waist. The text is from The Lord of the Rings and a quote about the mind. So I'll describe the visual elements and the text.</think>
The image shows a dragon - like humanoid (with reptilian features, horns, and a tail) dressed in an orange tunic. It holds a glowing orb in one hand and a staff in the other, with a sword at its waist. Text on the image includes "not all who wander are lost" (in white) and "the mind is a terrible thing to waste" (in blue).


llama_perf_context_print:        load time =    5443.70 ms
llama_perf_context_print: prompt eval time =   14427.55 ms /   742 tokens (   19.44 ms per token,    51.43 tokens per second)
llama_perf_context_print:        eval time =    3168.82 ms /   231 runs   (   13.72 ms per token,    72.90 tokens per second)
llama_perf_context_print:       total time =   18338.98 ms /   973 tokens
llama_perf_context_print:    graphs reused =          0

Labels

Vulkan (Issues specific to the Vulkan backend), bug (Something isn't working), merge ready (indicates that this may be ready to merge soon and is just holding out in case of objections)
