
Runtime CUDA Memory Ballooning to OOM in Mistralrs while Candle-vllm is Stable #1589

@sempervictus

Description

Describe the bug

Attaching files, sending RAG data, or sending images to models run by mistralrs causes the runtime to allocate progressively more GPU memory until an OOM crash occurs.

This is reproducible with any coding agent, web search MCP, or shell agent that can send files. The problem occurs on both CC890 and CC700 devices.
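A minimal reproduction harness can be sketched as follows. It assumes mistralrs is serving its OpenAI-compatible HTTP API; the endpoint URL, port, model name, and sizes are all illustrative assumptions, not values from this report. The idea is simply to send progressively larger pasted "attachments" while watching `nvidia-smi` in another terminal.

```python
import json
import urllib.request

# Assumed endpoint/port for a locally running mistralrs server (illustrative).
SERVER = "http://localhost:1234/v1/chat/completions"

def build_payload(attachment_chars: int, model: str = "default") -> dict:
    """Build a chat request whose user turn carries a large pasted 'file'."""
    blob = "x" * attachment_chars  # stand-in for real file/RAG content
    return {
        "model": model,
        "messages": [
            {"role": "user", "content": f"Summarize this file:\n{blob}"},
        ],
        "max_tokens": 64,
    }

def send(payload: dict) -> None:
    """POST one chat completion request to the (assumed) server."""
    req = urllib.request.Request(
        SERVER,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req).read()

def ramp(sizes_kib=(64, 256, 1024, 4096)) -> None:
    """Send progressively larger attachments; each round should be a
    fresh request, so GPU memory should not grow round over round."""
    for kib in sizes_kib:
        send(build_payload(kib * 1024))
```

Running `ramp()` against a live server while watching `nvidia-smi` should show the footprint plateau on a stable engine; per this report, mistralrs instead grows until OOM.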

When a model can be run by candle-vllm, its memory profile is stable and no additional GPU memory is consumed when passing data; context-window exhaustion is the limiting factor on that side. Mistralrs can truncate, but truncation causes the same effect as the OOM - it requires a restart of the service.
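To make the comparison between the two runtimes concrete, a small watcher can poll GPU memory while requests are in flight and flag growth beyond a slack threshold. The function names and the 128 MiB slack are illustrative choices; the `nvidia-smi` query flags are standard. Under the behavior described above, candle-vllm should stay within the slack while mistralrs would trip the check.

```python
import subprocess
import time

def parse_mem(csv_out: str) -> list:
    """Parse `nvidia-smi --query-gpu=memory.used` CSV output into MiB per GPU."""
    return [int(line) for line in csv_out.splitlines() if line.strip()]

def gpu_mem_mib() -> list:
    """Query per-GPU used memory (MiB) via nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return parse_mem(out)

def watch(seconds: int = 60, interval: float = 2.0, slack_mib: int = 128) -> bool:
    """Return True if every GPU stayed within `slack_mib` of its starting
    usage for the whole window; False at the first sign of ballooning."""
    baseline = gpu_mem_mib()
    deadline = time.time() + seconds
    while time.time() < deadline:
        time.sleep(interval)
        for base, cur in zip(baseline, gpu_mem_mib()):
            if cur - base > slack_mib:
                return False  # memory grew beyond the allowed slack
    return True
```

Running `watch()` in parallel with the request ramp gives a pass/fail signal for the "stable memory profile" claim without eyeballing `nvidia-smi` output.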

Latest commit or version

Current master on both; the problem has been ongoing over many revisions.
