
GH-560: Wire wgpu fallback into batch inference path #560

@noahgift

Description


Five-Whys

  1. Why doesn't batch mode use wgpu? → init_batch_model() in batch.rs only tries OwnedQuantizedModelCuda. When CUDA parity fails, it falls back to CPU.
  2. Why no wgpu in batch init? → The BatchModel struct only has gpu: Option<OwnedQuantizedModelCuda> and cpu: Option<OwnedQuantizedModel>. No wgpu variant.
  3. Why no wgpu variant? → Batch was designed before wgpu inference existed.
  4. Why does this matter? → 32B model in worker mode times out (316s per problem). Batch mode loads model once. Without wgpu batch, 32B eval is CPU-only (slow) or worker-mode (timeouts).
  5. Fix: Add a wgpu fallback to init_batch_model() — when CUDA parity fails, try wgpu before falling back to CPU.

Contract

gpu-multi-backend-parity-v1.yaml equation backend_priority:

select(backends) = first(b in [cuda, wgpu, cpu] where parity(b) >= 0.98)

This contract is currently violated in the batch path — batch inference skips wgpu entirely and goes straight from CUDA to CPU.

Acceptance Criteria

  • apr run model.apr --batch-jsonl prompts.jsonl --gpu uses wgpu when CUDA parity fails
  • Batch output shows "used_gpu": true with wgpu backend
  • 32B MBPP eval completes without timeouts
