Conversation
Make ggml-alloc work with concurrent dispatch.
9167c84 to 9b9905f
slaren left a comment:
I cannot test it, but looks good.
Based on what I'm seeing, it fails when BLAS / Accelerate is enabled.
NVM, without Metal it fails because of this:
However, with Metal enabled it should work, I think, yet I get this error:
$ LLAMA_METAL=1 make -j && ./perplexity -m models/70B-v2/ggml-model-q4_0.bin -f build/wikitext-2-raw/wiki.test.raw -ngl 1 -gqa 8
I llama.cpp build info:
I UNAME_S: Darwin
I UNAME_P: arm
I UNAME_M: arm64
I CFLAGS: -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL
I LDFLAGS: -framework Accelerate -framework Foundation -framework Metal -framework MetalKit
I CC: Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX: Apple clang version 14.0.3 (clang-1403.0.22.14.1)
make: Nothing to be done for `default'.
main: build = 992 (0919a0f)
main: seed = 1692218180
llama.cpp: loading model from models/70B-v2/ggml-model-q4_0.bin
llama_model_load_internal: warning: assuming 70B model based on GQA == 8
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 8192
llama_model_load_internal: n_mult = 4096
llama_model_load_internal: n_head = 64
llama_model_load_internal: n_head_kv = 8
llama_model_load_internal: n_layer = 80
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 8
llama_model_load_internal: rnorm_eps = 5.0e-06
llama_model_load_internal: n_ff = 28672
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: model size = 70B
llama_model_load_internal: ggml ctx size = 0.21 MB
llama_model_load_internal: mem required = 37070.96 MB (+ 160.00 MB per state)
llama_new_context_with_model: kv self size = 160.00 MB
ggml_metal_init: allocating
ggml_metal_init: loading '/Users/ggerganov/development/github/llama.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add 0x130006dd0
ggml_metal_init: loaded kernel_add_row 0x130007510
ggml_metal_init: loaded kernel_mul 0x130007a50
ggml_metal_init: loaded kernel_mul_row 0x1300080a0
ggml_metal_init: loaded kernel_scale 0x1300085e0
ggml_metal_init: loaded kernel_silu 0x130008b20
ggml_metal_init: loaded kernel_relu 0x130009060
ggml_metal_init: loaded kernel_gelu 0x1300095a0
ggml_metal_init: loaded kernel_soft_max 0x130009c70
ggml_metal_init: loaded kernel_diag_mask_inf 0x13000a2f0
ggml_metal_init: loaded kernel_get_rows_f16 0x13000a9c0
ggml_metal_init: loaded kernel_get_rows_q4_0 0x13000b200
ggml_metal_init: loaded kernel_get_rows_q4_1 0x13000b8d0
ggml_metal_init: loaded kernel_get_rows_q2_K 0x13000bfa0
ggml_metal_init: loaded kernel_get_rows_q3_K 0x13000c670
ggml_metal_init: loaded kernel_get_rows_q4_K 0x13000cd40
ggml_metal_init: loaded kernel_get_rows_q5_K 0x13000d410
ggml_metal_init: loaded kernel_get_rows_q6_K 0x13000dae0
ggml_metal_init: loaded kernel_rms_norm 0x13000e1c0
ggml_metal_init: loaded kernel_norm 0x13000ea00
ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x13000f2d0
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x13000fa50
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32 0x1300101d0
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32 0x130010ad0
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32 0x130011250
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32 0x1300119d0
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32 0x130012150
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32 0x130012ad0
ggml_metal_init: loaded kernel_mul_mm_f16_f32 0x1300134f0
ggml_metal_init: loaded kernel_mul_mm_q4_0_f32 0x130013cb0
ggml_metal_init: loaded kernel_mul_mm_q4_1_f32 0x130014470
ggml_metal_init: loaded kernel_mul_mm_q2_K_f32 0x130014c30
ggml_metal_init: loaded kernel_mul_mm_q3_K_f32 0x130015170
ggml_metal_init: loaded kernel_mul_mm_q4_K_f32 0x130015930
ggml_metal_init: loaded kernel_mul_mm_q5_K_f32 0x1300160f0
ggml_metal_init: loaded kernel_mul_mm_q6_K_f32 0x1300168b0
ggml_metal_init: loaded kernel_rope 0x130016df0
ggml_metal_init: loaded kernel_alibi_f32 0x1300176d0
ggml_metal_init: loaded kernel_cpy_f32_f16 0x130017f80
ggml_metal_init: loaded kernel_cpy_f32_f32 0x130018830
ggml_metal_init: loaded kernel_cpy_f16_f16 0x1300190e0
ggml_metal_init: recommendedMaxWorkingSetSize = 147456.00 MB
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: maxTransferRate = built-in GPU
llama_new_context_with_model: compute buffer total size = 145.35 MB
llama_new_context_with_model: max tensor size = 205.08 MB
ggml_metal_add_buffer: allocated 'data ' buffer, size = 37071.20 MB, (37071.64 / 147456.00)
ggml_metal_add_buffer: allocated 'eval ' buffer, size = 1.36 MB, (37073.00 / 147456.00)
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 162.00 MB, (37235.00 / 147456.00)
ggml_metal_add_buffer: allocated 'alloc ' buffer, size = 144.02 MB, (37379.02 / 147456.00)
system_info: n_threads = 16 / 24 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
perplexity: calculating perplexity over 655 chunks, batch_size=512
ggml_allocr_alloc: not enough space in the buffer (needed 58720256, largest block available 41943072)
GGML_ASSERT: ggml-alloc.c:133: !"not enough space in the buffer"
Abort trap: 6
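For context, the assertion comes from the allocator's search for a free block large enough to hold the requested tensor. A minimal sketch of that kind of first-fit check, assuming a simple free-block array (this is an illustration, not the actual ggml-alloc code):

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

// Hypothetical free-block descriptor, for illustration only.
struct free_block {
    void * addr;
    size_t size;
};

// Return the index of the first free block that can hold `size` bytes,
// aborting with a diagnostic when none is large enough -- the same class
// of failure as the "not enough space in the buffer" assert above.
static int find_block(struct free_block * blocks, int n_blocks, size_t size) {
    size_t largest = 0;
    for (int i = 0; i < n_blocks; i++) {
        if (blocks[i].size >= size) {
            return i;
        }
        if (blocks[i].size > largest) {
            largest = blocks[i].size;
        }
    }
    fprintf(stderr, "not enough space in the buffer (needed %zu, largest block available %zu)\n",
            size, largest);
    assert(!"not enough space in the buffer");
    return -1;
}
```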
This error is very surprising to me because the graph that the allocator uses to measure the buffer size should be identical to the graph used by the perplexity tool.
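For readers following along: the allocator is first run in "measure" mode over the graph to compute the required buffer size, and the same graph is then allocated into a real buffer of that size, which is why the two graphs are expected to match. A rough sketch of that flow, assuming the ggml_allocr API of this period; `build_graph()` is a placeholder for whatever builds the evaluation graph:

```c
#include "ggml.h"
#include "ggml-alloc.h"
#include <stdlib.h>

// Placeholder for the function that builds the evaluation graph
// (e.g. the graph built for each perplexity batch).
extern struct ggml_cgraph * build_graph(void);

void example_alloc_flow(void) {
    const size_t alignment = 32;

    // Measure pass: a dummy allocator walks the graph and records the
    // worst-case buffer size needed for the intermediate tensors.
    struct ggml_allocr * measure = ggml_allocr_new_measure(alignment);
    size_t buf_size = ggml_allocr_alloc_graph(measure, build_graph());
    ggml_allocr_free(measure);

    // Real pass: the same graph is allocated into a buffer of the measured
    // size. If the graph built here differs from the measured one, the
    // "not enough space in the buffer" assert can fire.
    void * buf = malloc(buf_size);
    struct ggml_allocr * alloc = ggml_allocr_new(buf, buf_size, alignment);
    ggml_allocr_alloc_graph(alloc, build_graph());

    ggml_allocr_free(alloc);
    free(buf);
}
```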
On second thought, I don't think this could cause this issue. This could be diagnosed by defining
There is a possibility that this could happen if the addresses of the weights are within this range:
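A guess at what this refers to, sketched under assumptions: in measure mode the allocator works against a fake base address, and ownership of a tensor is decided by a pointer-range check, so a weight whose real address happens to fall inside that fake range could be mistaken for an allocator-owned tensor. Field and function names below are illustrative, not the actual ggml-alloc internals:

```c
#include <stdbool.h>
#include <stddef.h>

// Simplified view of the allocator state.
struct allocr_view {
    char * data;     // base of the managed buffer (a fake address in measure mode)
    size_t size;     // size of the managed buffer
};

// A tensor is treated as "owned" by the allocator when its data pointer
// lies inside the managed range. In measure mode `data` is not a real
// allocation, so a weight loaded at an overlapping address could pass
// this check by accident and confuse the size measurement.
static bool is_own(const struct allocr_view * a, const void * tensor_data) {
    const char * p = (const char *) tensor_data;
    return p >= a->data && p < a->data + a->size;
}
```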
@ggerganov @slaren fixed in #2639. Strange that it didn't trigger before. Anyway, now our allocator is more robust.
Nice. Unrelated to that, I think that the
Removing this line has no effect on token generation. Btw, may I ask what kinds of structs are stored there?
This is the buffer used by the
Yes, it should be removed. I was initially thinking about using it for storing temporary results (e.g. quantized versions of F32 intermediate tensors), but it seems this will be obsoleted by ggml-org/ggml#455.
Ahh, thank you for your explanation!
Make ggml-alloc compatible with concurrent dispatch.
Tested on M1 Max with n_ctx=512, n_batch=512.
Memory usage of compute buffer:
Token generation:
The reduced memory usage comes from #2411, and the inference speedup comes from #2358.
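To make the idea concrete, here is a rough sketch of how the allocator can be wired to the Metal concurrency list so that nodes dispatched in the same concurrent wave never share a buffer region. The helper names and exact signatures (`ggml_metal_graph_find_concurrency`, `ggml_metal_get_concur_list`, `ggml_metal_if_optimized`, `ggml_allocr_set_parse_seq`) are assumptions based on the ggml-metal and ggml-alloc headers of this period, not a definitive usage:

```c
#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-metal.h"

// Hedged sketch: allocate graph memory in the order that Metal will
// dispatch it, so tensors running in the same concurrent wave do not
// reuse the same region of the compute buffer.
void example_concurrent_alloc(struct ggml_metal_context * ctx_metal,
                              struct ggml_allocr        * alloc,
                              struct ggml_cgraph        * gf) {
    // Reorder the graph into concurrent waves (from #2358); signature assumed.
    ggml_metal_graph_find_concurrency(ctx_metal, gf);

    // The concurrency list gives the node order the allocator should follow.
    // ggml_metal_if_optimized is assumed to return the list length
    // (0 when no reordering happened).
    const int * concur_list = ggml_metal_get_concur_list(ctx_metal);
    const int   concur_len  = ggml_metal_if_optimized(ctx_metal);

    if (concur_len > 0) {
        // Tell the allocator to parse nodes in concurrency order instead of
        // plain graph order, so concurrently dispatched nodes get disjoint memory.
        ggml_allocr_set_parse_seq(alloc, concur_list, concur_len);
    }
}
```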