OS
Linux
GPU Library
CUDA 12.x
Python version
3.12
Describe the bug
When prompting Qwen VL models with a prompt long enough to exceed max_seq_len (more than 4096 tokens), the call fails with the following error:
Dec 06 23:53:16 ailab llama-swap[3174452]: models-local/qwen3-vl-32b-instruct-exl3. Skipping inline model load.
Dec 06 23:53:16 ailab llama-swap[3174452]: 2025-12-06 23:53:16.673 INFO: Received chat completion request 96b5accf70144d28907c306816d5513e
Dec 06 23:53:16 ailab llama-swap[3174452]: 2025-12-06 23:53:16.712 ERROR: Traceback (most recent call last):
Dec 06 23:53:16 ailab llama-swap[3174452]: 2025-12-06 23:53:16.712 ERROR:   File "/home/user1/projects/tabbyAPI/endpoints/OAI/utils/chat_completion.py", line 437, in generate_chat_completion
Dec 06 23:53:16 ailab llama-swap[3174452]: 2025-12-06 23:53:16.712 ERROR:     generations = await asyncio.gather(*gen_tasks)
Dec 06 23:53:16 ailab llama-swap[3174452]: 2025-12-06 23:53:16.712 ERROR:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Dec 06 23:53:16 ailab llama-swap[3174452]: 2025-12-06 23:53:16.712 ERROR:   File "/home/user1/projects/tabbyAPI/backends/exllamav3/model.py", line 692, in generate
Dec 06 23:53:16 ailab llama-swap[3174452]: 2025-12-06 23:53:16.712 ERROR:     async for generation in self.stream_generate(
Dec 06 23:53:16 ailab llama-swap[3174452]: 2025-12-06 23:53:16.712 ERROR:   File "/home/user1/projects/tabbyAPI/backends/exllamav3/model.py", line 779, in stream_generate
Dec 06 23:53:16 ailab llama-swap[3174452]: 2025-12-06 23:53:16.712 ERROR:     async for generation_chunk in self.generate_gen(
Dec 06 23:53:16 ailab llama-swap[3174452]: 2025-12-06 23:53:16.712 ERROR:   File "/home/user1/projects/tabbyAPI/backends/exllamav3/model.py", line 968, in generate_gen
Dec 06 23:53:16 ailab llama-swap[3174452]: 2025-12-06 23:53:16.712 ERROR:     raise ValueError(
Dec 06 23:53:16 ailab llama-swap[3174452]: 2025-12-06 23:53:16.712 ERROR: ValueError: Prompt length 10083 is greater than max_seq_len 4096
Dec 06 23:53:16 ailab llama-swap[3174452]: 2025-12-06 23:53:16.715 ERROR: Sent to request: Chat completion 96b5accf70144d28907c306816d5513e aborted. Maybe the model was unloaded? Please check the server console.
Dec 06 23:53:16 ailab llama-swap[3174452]: 2025-12-06 23:53:16.716 INFO: 192.168.10.45:0 - "POST /v1/chat/completions HTTP/1.1" 503
Dec 06 23:53:16 ailab llama-swap[3174452]: [WARN] metrics skipped, HTTP status=503, path=/v1/chat/completions
I believe the model configuration may not be assigned correctly to max_seq_len, so the request fails at this check in tabbyAPI/backends/exllamav3/model.py (line 967 in 8b6b793):

if context_len > self.max_seq_len:
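For reference, this is roughly the fallback I would expect when no max_seq_len override is configured. It is only a sketch with hypothetical helper and field names, not tabbyAPI's actual loading code:

```python
import json
from pathlib import Path


def resolve_max_seq_len(model_dir: str, override: int | None = None) -> int:
    """Sketch: derive max_seq_len from the model's config.json unless overridden.

    Field names follow common HF conventions; the 4096 fallback mirrors the
    value reported in the traceback when nothing else is picked up.
    """
    if override is not None:
        return override

    config = json.loads((Path(model_dir) / "config.json").read_text())
    # Qwen3-VL style configs may nest the text settings under "text_config".
    text_config = config.get("text_config", config)
    return text_config.get("max_position_embeddings", 4096)
```

If the vision-model path never performs a lookup like this, self.max_seq_len would stay at 4096, which would be consistent with the traceback above.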
Please let me know if you need more information.
Reproduction steps
Download a version of turboderp/Qwen3-VL-32B-Instruct-exl3 and send a call to the chat completions endpoint with an image and a text prompt longer than the 4096-token max_seq_len.
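Something along these lines should reproduce it (a minimal sketch; the base URL, model name, and image path are placeholders from my local setup, not fixed values):

```python
import base64
import requests

# Placeholders: adjust host/port, model name, and image path for your setup.
BASE_URL = "http://localhost:5000/v1"
IMAGE_PATH = "example.png"

with open(IMAGE_PATH, "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

payload = {
    "model": "qwen3-vl-32b-instruct-exl3",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                # Any text long enough to push the prompt past 4096 tokens.
                {"type": "text",
                 "text": "Describe this image. " + "Lorem ipsum " * 3000},
            ],
        }
    ],
}

resp = requests.post(f"{BASE_URL}/chat/completions", json=payload, timeout=600)
print(resp.status_code, resp.text)  # expect a 503 with the max_seq_len error above
```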
Expected behavior
The API call should respect the max_seq_len from the model's config.json.
Logs
No response
Additional context
No response
Acknowledgements