
Per-layer F32 dequantization for CPU inference (32B OOM on 119GB) #478

@noahgift

Description


Problem

CPU inference currently dequantizes ALL quantized tensors (Q4K/Q8/etc.) to F32 at model load time, requiring num_params × 4 bytes of RAM for the F32 working set.

For a 32B model: 32B params × 4 bytes = 128 GB of F32, which exceeds the 119 GB unified memory on Project DIGITS (GB10). The process is OOM-killed at ~103 GB RSS.

7B models work fine (7B × 4 bytes = 28 GB).
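A back-of-envelope check of the numbers above (pure arithmetic, independent of any project code):

```rust
// F32 working-set size for a fully dequantized model.
// With params expressed in billions and GB = 10^9 bytes,
// the factors of 10^9 cancel: GB = params_billion × 4.
fn f32_working_set_gb(params_billion: f64) -> f64 {
    params_billion * 4.0
}

fn main() {
    println!("32B -> {} GB F32", f32_working_set_gb(32.0)); // 128 GB, over the 119 GB budget
    println!(" 7B -> {} GB F32", f32_working_set_gb(7.0));  //  28 GB, fits comfortably
}
```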

Proposed Fix

Implement per-layer dequantization: only hold one transformer layer's F32 tensors in memory at a time during the forward pass.

  • At layer i: dequantize layer i's Q4K weights to F32, run its forward pass, then release the F32 buffers
  • Peak F32 memory: ~400 MB (a single layer) instead of 128 GB (all layers)
  • The Q4K weights stay mmap'd (~18 GB) throughout

This is how llama.cpp and other CPU inference engines handle large models.
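The loop structure is roughly this. A minimal sketch only: `dequantize_q4k` and `layer_forward` are illustrative stand-ins, not the actual realizar API, and the "quantization" here is a toy per-layer scale rather than real Q4K block layout.

```rust
// Toy dequantization: quantized ints + scale -> F32.
// Real Q4K uses 4-bit blocks with per-block scales; the shape of the
// memory behavior (dequantize, use, drop) is the same.
fn dequantize_q4k(scale: f32, quants: &[i8]) -> Vec<f32> {
    quants.iter().map(|&q| q as f32 * scale).collect()
}

// Stand-in for one transformer layer's forward pass
// (here just an elementwise add of the weights).
fn layer_forward(weights: &[f32], hidden: &mut [f32]) {
    for (h, w) in hidden.iter_mut().zip(weights) {
        *h += w;
    }
}

// Per-layer dequantization: only one layer's F32 weights are alive
// at a time, so peak F32 memory is one layer, not the whole model.
fn run_forward(layers: &[(f32, Vec<i8>)], mut hidden: Vec<f32>) -> Vec<f32> {
    for (scale, quants) in layers {
        let f32_weights = dequantize_q4k(*scale, quants); // allocate F32 for this layer
        layer_forward(&f32_weights, &mut hidden);
        // f32_weights dropped at end of iteration: F32 working set released
    }
    hidden
}

fn main() {
    // Four "layers" of toy quantized weights standing in for the mmap'd file.
    let layers: Vec<(f32, Vec<i8>)> =
        (0..4).map(|i| (0.5_f32, vec![i as i8; 8])).collect();
    let out = run_forward(&layers, vec![0.0_f32; 8]);
    println!("{:?}", out);
}
```

The key point is ownership scope: because the F32 buffer is created and dropped inside the loop body, the allocator can reuse that memory for every layer, while the quantized source data stays mmap'd and is paged in on demand.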

Context

  • Hardware: NVIDIA Project DIGITS (GB10), 119 GB LPDDR5X unified memory, 20 ARM cores
  • Model: Qwen2.5-Coder-32B-Instruct Q4_K_M (19 GB .apr file)
  • OOM details: total-vm:108GB, anon-rss:103GB before kill
  • 7B Q4K HumanEval result: 85.37% pass@1 on same hardware (CPU inference works for 7B)
  • GPU blocked: sm_121 (Blackwell) parity gate failure, CPU-only for now

Files

  • realizar/src/infer/ — CPU inference engine
  • Forward pass tensor loading and dequantization logic

Impact

Unlocks 32B+ model inference on consumer hardware with 64-128 GB RAM.

Labels: bug (Something isn't working)
