Performance of llama.cpp on Nvidia CUDA #15013
-
Here are the results for my devices. Not sure how to get a "CUDA info string" though.

CUDA Scoreboard for Llama 2 7B, Q4_0 (no FA)
CUDA Scoreboard for Llama 2 7B, Q4_0 (with FA)
-
While technically not directly related, there may also be value in comparing AMD ROCm builds here too, as ROCm acts as a replacement (sometimes a directly compatible layer) for most CUDA calls. I admit there is a risk of confusing Nvidia users in the thread if this path is taken.
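For anyone wanting comparable ROCm numbers, a minimal build-and-run sketch (assuming a recent llama.cpp tree where the CMake flag is GGML_HIP — older trees used GGML_HIPBLAS — and an RDNA3 card, hence gfx1100; adjust AMDGPU_TARGETS for your GPU):

cmake -S . -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1100 -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j
./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1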
-
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes
build: 9c35706 (6060)

Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
build: 9c35706 (6060)
-
Device 0: NVIDIA GeForce RTX 4070 Ti SUPER, compute capability 8.9, VMM: yes
build: 9c35706 (6060)
-
Device 0: 3090, power-limited to 250 W
build: 9c35706 (6060)

Device 2: 5090, power-limited to 400 W
build: 9c35706 (6060)
-
Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes
build: 9c35706 (6060)
-
@olegshulyakov To help users quickly understand the approximate largest models that can run on each GPU, I suggest adding a VRAM column next to the GPU name on the main scoreboard. Example:
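Something along these lines (layout illustrative only; 24 GB is the RTX 4090's standard memory size, and the throughput cells are left as placeholders):

GPU | VRAM | pp512 t/s | tg128 t/s
RTX 4090 | 24 GB | ... | ...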
-
Device 0: NVIDIA GeForce RTX 2060 SUPER, compute capability 7.5, VMM: yes
build: 5c0eb5e (6075)
-
@olegshulyakov I see you grabbed some of my numbers from the Vulkan thread. However, I flooded that post with a bunch of data that probably came across as noise. While you quoted my correct non-FA numbers, the FA results you grabbed were actually from a run on two GPUs instead of one. To make things easier, here are the numbers from a single card: RTX 5060 Ti 16 GB

And here's another GPU for the collection: RTX 4060 Ti 8 GB
-
Device 0: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
build: 9c35706 (6060)
-
Yeah, I also saw numbers for my 4090 taken from the Vulkan thread. I re-ran the CUDA benchmarks so you can get the latest FA and non-FA results from the same build:

FA:
Non-FA:

nvidia-dkms 575.64.03-1
❯ nvcc --version
-
NVIDIA P106-100. I ran it twice, on two different builds, and took the best result:
build: 5fd160b (6106)
build: 860a9e4 (5688)

Sadly, Nvidia does not support this device in the Vulkan driver.
-
Would like to participate with a slightly exotic one from my cute server cube :-) (RTX 2000 Ada, 16 GB, 75 W). I did two runs:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
build: 756cfea (6105)

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
build: 1d72c84 (6109)

Seems to make no big difference... ^^
-
I finally got my hands on a card similar to the one above (P106) but with a display output: NVIDIA GTX 1060
build: 5fd160b (6106)
-
3080 Laptop, 8 GB, 150 W
^C — stopped after pp8192 (VRAM nearly full / spill suspected); did not run pp16384/pp32768.
build: e3b35dd (7509)
-
Hardware A (Windows):
Hardware B (Unraid / Docker full-cuda):
-
Jetson AGX Orin 64GB

/llama.cpp/build/bin/llama-bench -m /models/llama-2-7b.Q4_0.gguf -fa 0,1
build: 287a330 (7772)

I used the same commit as @TinyServal (which is from Nov 2025) as a comparison:
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
build: c1b1876 (6987)

Comparison
-
NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
Generation is almost on par with the regular RTX PRO 6000 reported in the table, but pp is much lower. The card was power-limited to 300 W (default setting).
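For reference, a power limit like this can be set with nvidia-smi (requires admin rights; -i selects the GPU index, this example assumes GPU 0):

nvidia-smi -i 0 -pl 300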
-
$ ./llama-bench -m ~/models/llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA RTX 4500 Ada Generation, compute capability 8.9, VMM: yes
-
AMD Strix Halo (AI Max+ 395)
-
Laptop ASUS FA507NV

c:\ai\llama_cpp>"./bin-win-cuda-13.1-x64/llama-bench.exe" -m "./models/llama-2-7b.Q4_0.gguf" -ngl 99 -fa 0,1
build: 2cce9fd (7993)

c:\ai\llama_cpp>"./bin-win-cuda-13.1-x64/llama-bench.exe" -m "./models/llama-2-7b.Q4_0.gguf" -ngl 99 -fa 0,1 -p 512,1024,2048,4096,8192
build: 2cce9fd (7993)

c:\ai\llama_cpp>"./bin-win-cuda-13.1-x64/llama-bench.exe" -m "./models/llama-2-7b.Q4_0.gguf" -ngl 99 -fa 0,1 -n 128,256,512,1024,2048
build: 2cce9fd (7993)
-
DGX Spark performance has increased:
build: d612901 (8076)
-
GTX 1070 (8 GB / GDDR5 / 256 bit)

llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1 -p 512,1024,2048,4096 -n 128,256,512,1024,2048
build: e48349a (8080)
-
(base) C:\bin\llama\llama-b8099-bin-win-cuda-13.1-x64> llama-bench -m D:\lmstudio_models\llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1
build: 3bb2fcc (8099)
-
RTX PRO 5000 Blackwell (48 GB / GDDR7 / 384 bit)
Driver Version: 580.105.08

ggml_cuda_init: found 1 CUDA devices:
build: d979f2b (8180)
-
Titan RTX (24 GB / GDDR6 / 384 bit)

ggml_cuda_init: found 1 CUDA devices:
build: d979f2b (8180)
-
Dual 5060 Ti 16GB, on PCI Express 4.0 x8, Ryzen 9 5900X

ggml_cuda_init: found 2 CUDA devices:
build: f7db3f3 (8214)
-
DGX Spark

Compile log:
CFLAGS="-O3 -mcpu=native -mtune=native -fomit-frame-pointer -pipe" \
CXXFLAGS="-O3 -mcpu=native -mtune=native -fomit-frame-pointer -pipe" \
cmake -S . -B build \
-DGGML_CUDA=ON \
-DGGML_CUDA_F16=ON \
-DGGML_CUDA_FORCE_MMQ=ON \
-DGGML_NATIVE=ON \
-DGGML_LTO=ON \
-DGGML_OPENMP=ON \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_CUDA_ARCHITECTURES=121a

src/llama-cpp/build/bin/llama-bench \
-m ~/.cache/huggingface/hub/models--TheBloke--Llama-2-7B-GGUF/snapshots/b4e04e128f421c93a5f1e34ac4d7ca9b0af47b80/llama-2-7b.Q4_0.gguf \
-ngl 100 -fa 0,1 -mmp 0

ggml_cuda_init: found 1 CUDA devices (Total VRAM: 124610 MiB):
build: 7c20367 (8580)
-
./llama-bench -m ../../../llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1 -p 512,1024,2048,4096,8192,16384,32768
build: dae2bf4 (8631)


-
This is similar to Performance of llama.cpp on Apple Silicon M-series, Performance of llama.cpp on AMD ROCm (HIP), and Performance of llama.cpp with Vulkan, but for CUDA! I think it's good to consolidate and discuss our results here.
We'll be testing the Llama 2 7B model like the other threads to keep things consistent, using Q4_0 as it's simple to compute and small enough to fit on a 4 GB GPU. You can download it here.
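If you prefer the command line, the same file can be fetched from Hugging Face; a sketch assuming the huggingface-cli tool and the TheBloke/Llama-2-7B-GGUF repo referenced in later comments:

huggingface-cli download TheBloke/Llama-2-7B-GGUF llama-2-7b.Q4_0.gguf --local-dir .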
Instructions
Either run the commands below or download one of our CUDA releases. If you have multiple GPUs, please run the test on a single GPU using -sm none -mg YOUR_GPU_NUMBER, unless the model is too big to fit in VRAM. Share your llama-bench results along with the git hash and CUDA info string in the comments. Feel free to try other models and compare backends, but only valid runs will be placed on the scoreboard.
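A typical invocation looks like this (model path assumed; the -sm none -mg 0 pair pins the run to GPU 0 and is only needed on multi-GPU systems):

./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1 -sm none -mg 0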
If multiple entries are posted for the same device, I'll prioritize newer commits with substantial CUDA updates; otherwise I'll pick the one with the highest overall score at my discretion. Performance may vary depending on driver, operating system, board manufacturer, etc., even if the chip is the same. For integrated graphics, note that your memory speed and number of channels will greatly affect your inference speed!
CUDA Scoreboard for Llama 2 7B, Q4_0 (no FA)
CUDA Scoreboard for Llama 2 7B, Q4_0 (with FA)
More detailed test
The main idea of this test is to show how performance decreases as the prompt and generation sizes increase.
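As an illustration, the longer sweeps posted in this thread follow this pattern (model path assumed):

./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1 -p 512,1024,2048,4096,8192 -n 128,256,512,1024,2048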