Performance of llama.cpp on Nvidia CUDA #15013
-
Here are the results for my devices. Not sure how to get a "CUDA info string" though.

CUDA Scoreboard for Llama 2 7B, Q4_0 (no FA)
CUDA Scoreboard for Llama 2 7B, Q4_0 (with FA)
-
While technically not directly related, there may also be value in comparing AMD ROCm builds here too, as ROCm acts as a replacement (sometimes a directly compatible layer) for most CUDA calls. I admit there is a risk of confusing Nvidia users in the thread if this path is taken.
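For anyone wanting comparable ROCm numbers, a minimal build-and-run sketch (assuming a recent llama.cpp tree where the CMake flag is GGML_HIP — older trees used GGML_HIPBLAS — and an RDNA3 card, hence gfx1100; adjust AMDGPU_TARGETS for your GPU):

cmake -S . -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1100 -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j
./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1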
-
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes
build: 9c35706 (6060)

Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
build: 9c35706 (6060)
-
Device 0: NVIDIA GeForce RTX 4070 Ti SUPER, compute capability 8.9, VMM: yes
build: 9c35706 (6060)
-
Device 0: 3090, power-limited to 250 W
build: 9c35706 (6060)

Device 2: 5090, power-limited to 400 W
build: 9c35706 (6060)
-
Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes
build: 9c35706 (6060)
-
@olegshulyakov To help users quickly understand the approximate largest models that can run on each GPU, I suggest adding a VRAM column next to the GPU name on the main scoreboard. Example:
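Something along these lines (layout illustrative only; 24 GB is the RTX 4090's standard memory size, and the throughput cells are left as placeholders):

GPU | VRAM | pp512 t/s | tg128 t/s
RTX 4090 | 24 GB | ... | ...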
-
Device 0: NVIDIA GeForce RTX 2060 SUPER, compute capability 7.5, VMM: yes
build: 5c0eb5e (6075)
-
@olegshulyakov I see you grabbed some of my numbers from the Vulkan thread. However, I flooded that post with a bunch of data that probably came across as noise. While you quoted my correct non-FA numbers, the FA results you grabbed were actually from a run on two GPUs instead of one. To make things easier, here are the numbers from a single card: RTX 5060 Ti 16 GB

And here's another GPU for the collection: RTX 4060 Ti 8 GB
-
Device 0: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
build: 9c35706 (6060)
-
Yeah, I also saw numbers for my 4090 taken from the Vulkan thread. I re-ran the CUDA benchmarks so you can get the latest FA and non-FA results from the same build:

FA:
Non-FA:

nvidia-dkms 575.64.03-1
❯ nvcc --version
-
NVIDIA P106-100. I ran it twice, on two different builds, and took the best result:
build: 5fd160b (6106)
build: 860a9e4 (5688)

Sadly, Nvidia does not support this device in the Vulkan driver.
-
Would like to participate with a slightly exotic one from my cute server cube :-) (RTX 2000 Ada, 16 GB, 75 W). I did two runs:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
build: 756cfea (6105)

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
build: 1d72c84 (6109)

Seems to make no big difference... ^^
-
I finally got my hands on a card similar to the one above (P106) but with a display output: NVIDIA GTX 1060
build: 5fd160b (6106)
-
3080 Laptop, 8 GB, 150 W
^C — stopped after pp8192 (VRAM nearly full / spill suspected); did not run pp16384/pp32768.
build: e3b35dd (7509)
-
Hardware A (Windows):
Hardware B (Unraid / Docker full-cuda):
-
Jetson AGX Orin 64GB

/llama.cpp/build/bin/llama-bench -m /models/llama-2-7b.Q4_0.gguf -fa 0,1
build: 287a330 (7772)

I used the same commit as @TinyServal (which is from Nov 2025) as a comparison:
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
build: c1b1876 (6987)

Comparison
-
NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
Generation is almost on par with the regular RTX PRO 6000 reported in the table, but pp is much lower. The card was power-limited to 300 W (default setting).
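For reference, a power limit like this can be set with nvidia-smi (requires admin rights; -i selects the GPU index, this example assumes GPU 0):

nvidia-smi -i 0 -pl 300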
-
$ ./llama-bench -m ~/models/llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA RTX 4500 Ada Generation, compute capability 8.9, VMM: yes
-
AMD Strix Halo (AI Max+ 395)
-
Laptop ASUS FA507NV

c:\ai\llama_cpp>"./bin-win-cuda-13.1-x64/llama-bench.exe" -m "./models/llama-2-7b.Q4_0.gguf" -ngl 99 -fa 0,1
build: 2cce9fd (7993)

c:\ai\llama_cpp>"./bin-win-cuda-13.1-x64/llama-bench.exe" -m "./models/llama-2-7b.Q4_0.gguf" -ngl 99 -fa 0,1 -p 512,1024,2048,4096,8192
build: 2cce9fd (7993)

c:\ai\llama_cpp>"./bin-win-cuda-13.1-x64/llama-bench.exe" -m "./models/llama-2-7b.Q4_0.gguf" -ngl 99 -fa 0,1 -n 128,256,512,1024,2048
build: 2cce9fd (7993)
-
DGX Spark performance has increased:
build: d612901 (8076)
-
GTX 1070 (8 GB / GDDR5 / 256 bit)

llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1 -p 512,1024,2048,4096 -n 128,256,512,1024,2048
build: e48349a (8080)
-
(base) C:\bin\llama\llama-b8099-bin-win-cuda-13.1-x64> llama-bench -m D:\lmstudio_models\llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1
build: 3bb2fcc (8099)
-
RTX PRO 5000 Blackwell (48 GB / GDDR7 / 384 bit)
Driver Version: 580.105.08

ggml_cuda_init: found 1 CUDA devices:
build: d979f2b (8180)
-
Titan RTX (24 GB / GDDR6 / 384 bit)

ggml_cuda_init: found 1 CUDA devices:
build: d979f2b (8180)
-
Dual 5060 Ti 16GB, on PCI Express 4.0 x8, Ryzen 9 5900X

ggml_cuda_init: found 2 CUDA devices:
build: f7db3f3 (8214)
-
DGX Spark

Compile log:
CFLAGS="-O3 -mcpu=native -mtune=native -fomit-frame-pointer -pipe" \
CXXFLAGS="-O3 -mcpu=native -mtune=native -fomit-frame-pointer -pipe" \
cmake -S . -B build \
-DGGML_CUDA=ON \
-DGGML_CUDA_F16=ON \
-DGGML_CUDA_FORCE_MMQ=ON \
-DGGML_NATIVE=ON \
-DGGML_LTO=ON \
-DGGML_OPENMP=ON \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_CUDA_ARCHITECTURES=121a

src/llama-cpp/build/bin/llama-bench \
-m ~/.cache/huggingface/hub/models--TheBloke--Llama-2-7B-GGUF/snapshots/b4e04e128f421c93a5f1e34ac4d7ca9b0af47b80/llama-2-7b.Q4_0.gguf \
-ngl 100 -fa 0,1 -mmp 0

ggml_cuda_init: found 1 CUDA devices (Total VRAM: 124610 MiB):
build: 7c20367 (8580)
-
./llama-bench -m ../../../llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1 -p 512,1024,2048,4096,8192,16384,32768
build: dae2bf4 (8631)


-
This is similar to Performance of llama.cpp on Apple Silicon M-series, Performance of llama.cpp on AMD ROCm (HIP), and Performance of llama.cpp with Vulkan, but for CUDA! I think it's good to consolidate and discuss our results here.
We'll be testing the Llama 2 7B model like the other threads to keep things consistent, using Q4_0 as it's simple to compute and small enough to fit on a 4 GB GPU. You can download it here.
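If you prefer the command line, the same file can be fetched from Hugging Face; a sketch assuming the huggingface-cli tool and the TheBloke/Llama-2-7B-GGUF repo referenced in later comments:

huggingface-cli download TheBloke/Llama-2-7B-GGUF llama-2-7b.Q4_0.gguf --local-dir .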
Instructions
Either run the commands below or download one of our CUDA releases. If you have multiple GPUs, please run the test on a single GPU using -sm none -mg YOUR_GPU_NUMBER, unless the model is too big to fit in VRAM. Share your llama-bench results along with the git hash and CUDA info string in the comments. Feel free to try other models and compare backends, but only valid runs will be placed on the scoreboard.
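A typical invocation looks like this (model path assumed; the -sm none -mg 0 pair pins the run to GPU 0 and is only needed on multi-GPU systems):

./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1 -sm none -mg 0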
If multiple entries are posted for the same device, I'll prioritize newer commits with substantial CUDA updates; otherwise I'll pick the one with the highest overall score at my discretion. Performance may vary depending on driver, operating system, board manufacturer, etc., even if the chip is the same. For integrated graphics, note that your memory speed and number of channels will greatly affect your inference speed!
CUDA Scoreboard for Llama 2 7B, Q4_0 (no FA)
CUDA Scoreboard for Llama 2 7B, Q4_0 (with FA)
More detailed test
The main idea of this test is to show how performance decreases as the prompt and generation sizes increase.
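As an illustration, the longer sweeps posted in this thread follow this pattern (model path assumed):

./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1 -p 512,1024,2048,4096,8192 -n 128,256,512,1024,2048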