Describe the bug
Attaching files, or sending RAG data or images, to models run by mistralrs causes the runtime to allocate more and more GPU memory until an OOM crash occurs.
This is reproducible with any coding agent, web search MCP, or shell agent that can send files. The problem occurs on both CC890 and CC700 devices.
When a model can be run by candle-vllm, its memory profile is stable and passing data causes no additional memory use; context window exhaustion is the problem on that side. Mistralrs can truncate, but truncation has the same effect as the OOM: it requires a restart of the service.
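For anyone trying to confirm the growth, here is a minimal sketch for sampling per-GPU memory use while requests with attachments are being sent. It assumes `nvidia-smi` is on the PATH. The helper names, sample count, and interval are arbitrary placeholders, and nothing here is part of mistralrs or candle-vllm.

```python
import subprocess
import time


def parse_used_mib(smi_output: str) -> list[int]:
    """Parse one integer (MiB used) per GPU from nvidia-smi CSV output."""
    return [int(line.strip()) for line in smi_output.splitlines() if line.strip()]


def sample_used_mib() -> list[int]:
    """Query current per-GPU memory use; assumes nvidia-smi is on PATH."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_used_mib(out)


def growth(samples: list[list[int]]) -> list[int]:
    """Per-GPU difference (MiB) between the last and the first sample."""
    return [last - first for first, last in zip(samples[0], samples[-1])]


def main(count: int = 10, interval_s: float = 5.0) -> None:
    """Take `count` samples `interval_s` seconds apart and print the growth."""
    samples = []
    for _ in range(count):
        samples.append(sample_used_mib())
        print(samples[-1])
        time.sleep(interval_s)
    print("growth (MiB):", growth(samples))
```

Calling `main()` while an agent is sending files should show whether the used memory keeps climbing between requests, which is the behavior described above, or stays flat.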
Latest commit or version
Current master on both; the problem has been ongoing over many revisions.