OS
Linux
GPU Library
CUDA 12.x
Python version
3.12
Describe the bug
When prompting Qwen VL models with a prompt long enough to exceed max_seq_len (more than 4096 tokens), the call fails with the following error:
Dec 06 23:53:16 ailab llama-swap[3174452]: models-local/qwen3-vl-32b-instruct-exl3. Skipping inline model load.
Dec 06 23:53:16 ailab llama-swap[3174452]: 2025-12-06 23:53:16.673 INFO: Received chat completion request 96b5accf70144d28907c306816d5513e
Dec 06 23:53:16 ailab llama-swap[3174452]: 2025-12-06 23:53:16.712 ERROR: Traceback (most recent call last):
Dec 06 23:53:16 ailab llama-swap[3174452]: 2025-12-06 23:53:16.712 ERROR:   File "/home/user1/projects/tabbyAPI/endpoints/OAI/utils/chat_completion.py", line 437, in generate_chat_completion
Dec 06 23:53:16 ailab llama-swap[3174452]: 2025-12-06 23:53:16.712 ERROR:     generations = await asyncio.gather(*gen_tasks)
Dec 06 23:53:16 ailab llama-swap[3174452]: 2025-12-06 23:53:16.712 ERROR:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Dec 06 23:53:16 ailab llama-swap[3174452]: 2025-12-06 23:53:16.712 ERROR:   File "/home/user1/projects/tabbyAPI/backends/exllamav3/model.py", line 692, in generate
Dec 06 23:53:16 ailab llama-swap[3174452]: 2025-12-06 23:53:16.712 ERROR:     async for generation in self.stream_generate(
Dec 06 23:53:16 ailab llama-swap[3174452]: 2025-12-06 23:53:16.712 ERROR:   File "/home/user1/projects/tabbyAPI/backends/exllamav3/model.py", line 779, in stream_generate
Dec 06 23:53:16 ailab llama-swap[3174452]: 2025-12-06 23:53:16.712 ERROR:     async for generation_chunk in self.generate_gen(
Dec 06 23:53:16 ailab llama-swap[3174452]: 2025-12-06 23:53:16.712 ERROR:   File "/home/user1/projects/tabbyAPI/backends/exllamav3/model.py", line 968, in generate_gen
Dec 06 23:53:16 ailab llama-swap[3174452]: 2025-12-06 23:53:16.712 ERROR:     raise ValueError(
Dec 06 23:53:16 ailab llama-swap[3174452]: 2025-12-06 23:53:16.712 ERROR: ValueError: Prompt length 10083 is greater than max_seq_len 4096
Dec 06 23:53:16 ailab llama-swap[3174452]: 2025-12-06 23:53:16.715 ERROR: Sent to request: Chat completion 96b5accf70144d28907c306816d5513e aborted. Maybe the model was unloaded? Please check the server console.
Dec 06 23:53:16 ailab llama-swap[3174452]: 2025-12-06 23:53:16.716 INFO: 192.168.10.45:0 - "POST /v1/chat/completions HTTP/1.1" 503
Dec 06 23:53:16 ailab llama-swap[3174452]: [WARN] metrics skipped, HTTP status=503, path=/v1/chat/completions
I believe the model configuration may not be assigned correctly to max_seq_len, so the request fails at this check in tabbyAPI/backends/exllamav3/model.py (line 967 in 8b6b793):

if context_len > self.max_seq_len:
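For reference, this is roughly the fallback I would expect when no max_seq_len override is configured. It is only a sketch with hypothetical helper and field names, not tabbyAPI's actual loading code:

```python
import json
from pathlib import Path


def resolve_max_seq_len(model_dir: str, override: int | None = None) -> int:
    """Sketch: derive max_seq_len from the model's config.json unless overridden.

    Field names follow common HF conventions; the 4096 fallback mirrors the
    value reported in the traceback when nothing else is picked up.
    """
    if override is not None:
        return override

    config = json.loads((Path(model_dir) / "config.json").read_text())
    # Qwen3-VL style configs may nest the text settings under "text_config".
    text_config = config.get("text_config", config)
    return text_config.get("max_position_embeddings", 4096)
```

If the vision-model path never performs a lookup like this, self.max_seq_len would stay at 4096, which would be consistent with the traceback above.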
Please let me know if you need more information.
Reproduction steps
Download a version of turboderp/Qwen3-VL-32B-Instruct-exl3 and send a call to the chat completions endpoint with an image and a text prompt longer than the 4096-token max_seq_len.
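Something along these lines should reproduce it (a minimal sketch; the base URL, model name, and image path are placeholders from my local setup, not fixed values):

```python
import base64
import requests

# Placeholders: adjust host/port, model name, and image path for your setup.
BASE_URL = "http://localhost:5000/v1"
IMAGE_PATH = "example.png"

with open(IMAGE_PATH, "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

payload = {
    "model": "qwen3-vl-32b-instruct-exl3",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                # Any text long enough to push the prompt past 4096 tokens.
                {"type": "text",
                 "text": "Describe this image. " + "Lorem ipsum " * 3000},
            ],
        }
    ],
}

resp = requests.post(f"{BASE_URL}/chat/completions", json=payload, timeout=600)
print(resp.status_code, resp.text)  # expect a 503 with the max_seq_len error above
```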
Expected behavior
The API call should respect the max_seq_len from the model's config.json.
Logs
No response
Additional context
No response
Acknowledgements