
How to set token budget for Gemma4 with Ollama #651

Description

@somthing3000

Google's Gemma4 documentation suggests setting the token budget to adjust the image resolution for more accurate OCR:

https://ai.google.dev/gemma/docs/capabilities/vision/image#variable_resolution_token_budget

However, I do not see any configuration options to set it when running AsyncClient.chat(). Adding max_soft_tokens to options does not seem to do anything.

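from ollama import AsyncClient

# Called from inside an async method; OLLAMA_HOST, MODEL, and messages are defined elsewhere.
response = await AsyncClient(host=OLLAMA_HOST).chat(
    model=MODEL,
    messages=messages,
    format=self.result_model.model_json_schema(),  # constrain output to the result schema
    options={
        "num_ctx": 32 * 1024,
        "temperature": 0.0,
        "max_soft_tokens": 560,  # does nothing
    },
)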

By default, Gemma4 models run with a token budget of 280, which is not enough for my OCR task. I am testing with Gemma4-e4b.

To verify, I uploaded a vehicle license plate image to my Ollama endpoint, and the result was missing the last letter. I then loaded an unquantized gemma4-e4b with the plain transformers library, which produced the same result. However, the transformers AutoProcessor has a max_soft_tokens option; setting it to 560 produced the expected result.

// ollama default (q4_k_m) and unquantized gemma4, both at the default token budget
{
    "license_plate_number": "YRSGNB",    // expected YRSGNBY
    "license_plate_state": "California"
}

// unquantized gemma4 with max_soft_tokens=560
{
    "license_plate_number": "YRSGNBY",    // expected YRSGNBY
    "license_plate_state": "California"
}
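
For anyone who wants to repeat the comparison across configurations, here is a minimal sketch of the check in Python. It only compares a model's JSON response against the known plate; it does not call Ollama or transformers. The helper name check_plate and the labels in runs are hypothetical, and the response strings are the two results above with the // comments stripped so they parse as JSON.

```python
import json

EXPECTED_PLATE = "YRSGNBY"  # ground truth for the test image in this issue


def check_plate(raw_json: str, expected: str = EXPECTED_PLATE) -> dict:
    """Parse a model response and report how it differs from the expected plate.

    Hypothetical helper for illustration; not part of ollama or transformers.
    """
    result = json.loads(raw_json)
    actual = result.get("license_plate_number", "")
    return {
        "actual": actual,
        "match": actual == expected,
        # Number of characters the model dropped (0 if none were dropped).
        "missing_chars": max(len(expected) - len(actual), 0),
        # True when the model read the start of the plate but cut it short.
        "is_truncated_prefix": actual != expected and expected.startswith(actual),
    }


# The two responses reported above, keyed by the configuration that produced them.
runs = {
    "default token budget": (
        '{"license_plate_number": "YRSGNB", "license_plate_state": "California"}'
    ),
    "max_soft_tokens=560": (
        '{"license_plate_number": "YRSGNBY", "license_plate_state": "California"}'
    ),
}

for label, raw in runs.items():
    print(label, check_plate(raw))
```

The is_truncated_prefix flag separates the failure seen here (a correct reading that stops one character early) from a misread character, which may help when checking whether a larger token budget changes the outcome.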
