Problem
CPU inference dequantizes ALL quantized tensors (Q4K/Q8/etc.) to F32 at model load time, requiring num_params × 4 bytes of RAM for the F32 working set.
For 32B models: 32B × 4 = 128 GB F32, which exceeds the 119 GB unified memory on Project DIGITS (GB10). The process is OOM-killed at ~103 GB RSS.
7B models work fine (7B × 4 = 28 GB).
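The memory math above can be sanity-checked with a tiny back-of-envelope helper (a sketch; `f32_working_set_gb` is an illustrative name, not part of realizar):

```rust
/// F32 working set for full dequantization: num_params × 4 bytes.
fn f32_working_set_gb(num_params: f64) -> f64 {
    num_params * 4.0 / 1e9
}

fn main() {
    // 32B params → 128 GB, over the 119 GB on GB10; 7B → 28 GB, which fits.
    println!("32B: {} GB", f32_working_set_gb(32e9));
    println!("7B:  {} GB", f32_working_set_gb(7e9));
}
```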
Proposed Fix
Implement per-layer dequantization: only hold one transformer layer's F32 tensors in memory at a time during the forward pass.
- At layer i: dequant Q4K→F32 for layer i's weights, run the forward pass, release the F32 buffer
- Peak memory: ~400 MB (single layer) instead of 128 GB (all layers)
- The Q4K weights stay mmap'd (~18 GB) throughout
This is how llama.cpp and other CPU inference engines handle large models.
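A minimal sketch of the per-layer strategy, assuming hypothetical types (`QuantLayer`, `dequant_layer`, and the placeholder "matmul" are illustrative, not realizar's actual API; real code would borrow Q4K blocks from the mmap rather than own a `Vec<u8>`):

```rust
/// Stand-in for one layer's Q4K blocks; the real engine keeps these mmap'd.
struct QuantLayer {
    q4k: Vec<u8>,
}

/// Placeholder dequant: real Q4K decoding unpacks 4-bit values plus scales.
fn dequant_layer(layer: &QuantLayer) -> Vec<f32> {
    layer.q4k.iter().map(|&b| b as f32).collect()
}

/// Forward pass that only ever holds one layer's F32 weights at a time.
fn forward(layers: &[QuantLayer], mut acts: Vec<f32>) -> Vec<f32> {
    for layer in layers {
        // Dequantize just this layer: peak F32 memory is one layer's size.
        let w = dequant_layer(layer);
        // Placeholder compute: scale activations by the layer's weight sum.
        let s: f32 = w.iter().sum();
        for a in acts.iter_mut() {
            *a += s;
        }
        // `w` is dropped here, releasing the F32 buffer before the next layer.
    }
    acts
}
```

The key point is the drop at the end of each loop iteration: the F32 buffer never outlives its layer, so peak RSS tracks one layer (~400 MB) plus the mmap'd Q4K weights.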
Context
- Hardware: NVIDIA Project DIGITS (GB10), 119 GB LPDDR5X unified memory, 20 ARM cores
- Model: Qwen2.5-Coder-32B-Instruct Q4_K_M (19 GB .apr file)
- OOM details: total-vm: 108 GB, anon-rss: 103 GB before kill
- 7B Q4K HumanEval result: 85.37% pass@1 on same hardware (CPU inference works for 7B)
- GPU blocked: sm_121 (Blackwell) parity gate failure, CPU-only for now
Files
- realizar/src/infer/ — CPU inference engine (forward-pass tensor loading and dequantization logic)
Impact
Unlocks 32B+ model inference on consumer hardware with 64-128 GB RAM.