Five-Whys
- Why doesn't batch mode use wgpu? → init_batch_model() in batch.rs only tries OwnedQuantizedModelCuda; when CUDA parity fails, it falls back to CPU.
- Why no wgpu in batch init? → The BatchModel struct only has gpu: Option<OwnedQuantizedModelCuda> and cpu: Option<OwnedQuantizedModel>. No wgpu variant.
- Why no wgpu variant? → Batch was designed before wgpu inference existed.
- Why does this matter? → 32B model in worker mode times out (316s per problem). Batch mode loads model once. Without wgpu batch, 32B eval is CPU-only (slow) or worker-mode (timeouts).
- Fix: Add wgpu to init_batch_model so that when CUDA parity fails, batch tries wgpu before falling back to CPU.
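The fix above can be sketched roughly as follows. This is a hypothetical shape, not the real implementation: OwnedQuantizedModelWgpu is an assumed new type (only the Cuda and CPU models appear in the source), the stub structs stand in for the real quantized models, and the parity checks are reduced to booleans for illustration.

```rust
// Stub types standing in for the real quantized models in batch.rs.
struct OwnedQuantizedModelCuda;
struct OwnedQuantizedModelWgpu; // assumed new wgpu-backed type
struct OwnedQuantizedModel; // CPU model

// BatchModel extended with a wgpu slot alongside the existing gpu/cpu fields.
struct BatchModel {
    gpu: Option<OwnedQuantizedModelCuda>,
    wgpu: Option<OwnedQuantizedModelWgpu>, // proposed new field
    cpu: Option<OwnedQuantizedModel>,
}

// Parity checks are simplified to booleans here; the real code would run
// the backend parity test and compare against the 0.98 threshold.
fn init_batch_model(cuda_parity_ok: bool, wgpu_parity_ok: bool) -> BatchModel {
    if cuda_parity_ok {
        BatchModel { gpu: Some(OwnedQuantizedModelCuda), wgpu: None, cpu: None }
    } else if wgpu_parity_ok {
        // New step: try wgpu before giving up and using the CPU.
        BatchModel { gpu: None, wgpu: Some(OwnedQuantizedModelWgpu), cpu: None }
    } else {
        BatchModel { gpu: None, wgpu: None, cpu: Some(OwnedQuantizedModel) }
    }
}
```

The point of the shape is that CPU becomes the last resort rather than the immediate fallback when CUDA parity fails.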
Contract
gpu-multi-backend-parity-v1.yaml, equation backend_priority:
select(backends) = first(b in [cuda, wgpu, cpu] where parity(b) >= 0.98)
Currently violated in batch path — batch skips wgpu entirely.
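The backend_priority equation can be expressed directly in code. A minimal sketch, assuming parity scores are available per backend (the scores below are illustrative inputs, not real measurements):

```rust
// Parity threshold from the gpu-multi-backend-parity-v1.yaml contract.
const PARITY_THRESHOLD: f64 = 0.98;

// select(backends) = first(b in [cuda, wgpu, cpu] where parity(b) >= 0.98).
// Callers pass backends in priority order with their measured parity scores.
fn select_backend(parity: &[(&'static str, f64)]) -> Option<&'static str> {
    parity
        .iter()
        .copied()
        .find(|&(_, p)| p >= PARITY_THRESHOLD)
        .map(|(name, _)| name)
}
```

Under this selection rule, the current batch path is equivalent to evaluating only [cuda, cpu], which is exactly the contract violation described above.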
Acceptance Criteria
- apr run model.apr --batch-jsonl prompts.jsonl --gpu uses wgpu when CUDA parity fails
- Batch output shows "used_gpu": true with wgpu backend
- 32B MBPP eval completes without timeouts
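The second criterion could be checked mechanically against batch output. A hedged sketch: the exact output schema is assumed from the criteria above (only "used_gpu": true is quoted in the source; the backend field name is a guess), so this is a plain substring check rather than a schema-aware parse.

```rust
// Returns true if a batch-output JSONL line reports GPU use via wgpu.
// Field names other than "used_gpu" are assumptions, not confirmed schema.
fn batch_line_used_wgpu(line: &str) -> bool {
    line.contains("\"used_gpu\": true") && line.contains("wgpu")
}
```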