
Runtime CUDA Memory Ballooning to OOM in Mistralrs while Candle-vllm is Stable #1589

@sempervictus

Description

Describe the bug

Attaching files, sending RAG data, or sending images to models run by mistralrs causes the runtime to allocate progressively more GPU memory until an OOM crash occurs.

This is reproducible with any coding agent, web search MCP, or shell agent that can send files. The problem occurs on both CC890 and CC700 devices.
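A minimal reproduction harness can be sketched as follows. It assumes mistralrs is serving its OpenAI-compatible HTTP API; the endpoint URL, port, model name, and sizes are all illustrative assumptions, not values from this report. The idea is simply to send progressively larger pasted "attachments" while watching `nvidia-smi` in another terminal.

```python
import json
import urllib.request

# Assumed endpoint/port for a locally running mistralrs server (illustrative).
SERVER = "http://localhost:1234/v1/chat/completions"

def build_payload(attachment_chars: int, model: str = "default") -> dict:
    """Build a chat request whose user turn carries a large pasted 'file'."""
    blob = "x" * attachment_chars  # stand-in for real file/RAG content
    return {
        "model": model,
        "messages": [
            {"role": "user", "content": f"Summarize this file:\n{blob}"},
        ],
        "max_tokens": 64,
    }

def send(payload: dict) -> None:
    """POST one chat completion request to the (assumed) server."""
    req = urllib.request.Request(
        SERVER,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req).read()

def ramp(sizes_kib=(64, 256, 1024, 4096)) -> None:
    """Send progressively larger attachments; each round should be a
    fresh request, so GPU memory should not grow round over round."""
    for kib in sizes_kib:
        send(build_payload(kib * 1024))
```

Running `ramp()` against a live server while watching `nvidia-smi` should show the footprint plateau on a stable engine; per this report, mistralrs instead grows until OOM.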

When a model can be run by candle-vllm, its memory profile is stable and no additional GPU memory is consumed when passing data; context-window exhaustion is the limiting factor on that side. Mistralrs can truncate, but truncation causes the same effect as the OOM - it requires a restart of the service.
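To make the comparison between the two runtimes concrete, a small watcher can poll GPU memory while requests are in flight and flag growth beyond a slack threshold. The function names and the 128 MiB slack are illustrative choices; the `nvidia-smi` query flags are standard. Under the behavior described above, candle-vllm should stay within the slack while mistralrs would trip the check.

```python
import subprocess
import time

def parse_mem(csv_out: str) -> list:
    """Parse `nvidia-smi --query-gpu=memory.used` CSV output into MiB per GPU."""
    return [int(line) for line in csv_out.splitlines() if line.strip()]

def gpu_mem_mib() -> list:
    """Query per-GPU used memory (MiB) via nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return parse_mem(out)

def watch(seconds: int = 60, interval: float = 2.0, slack_mib: int = 128) -> bool:
    """Return True if every GPU stayed within `slack_mib` of its starting
    usage for the whole window; False at the first sign of ballooning."""
    baseline = gpu_mem_mib()
    deadline = time.time() + seconds
    while time.time() < deadline:
        time.sleep(interval)
        for base, cur in zip(baseline, gpu_mem_mib()):
            if cur - base > slack_mib:
                return False  # memory grew beyond the allowed slack
    return True
```

Running `watch()` in parallel with the request ramp gives a pass/fail signal for the "stable memory profile" claim without eyeballing `nvidia-smi` output.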

Latest commit or version

Current master on both; the problem has been ongoing over many revisions.
