Error with Subjective Judge in Creation MM Benchmark Using Custom Judge Model

Hi,

I am trying to use the Creation MM benchmark to evaluate the Qwen-2.5 7B VLM. Since I don’t have access to an OpenAI API key, I’ve modified the code to use the Hyperbolic API with Meta Llama (`meta-llama/Llama-3.3-70B-Instruct`) as the judge model.

For initial testing, I ran the workflow on only 5 items from the dataset. As shown in the attached image, I get results for the **objective judge**, but I encounter an error for the **subjective judge**. Could you please explain the difference between these two judges and why I might be seeing this error?

I also ran the following standalone script to verify access to the judge model, and that script works as expected.
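```python
from vlmeval.api import OpenAIWrapper

# use_hyperbolic is a flag from my local modification, not an upstream option
model = OpenAIWrapper('meta-llama/Llama-3.3-70B-Instruct', verbose=True, use_hyperbolic=True)
msgs = [dict(type='text', value='Hello!')]
# Send a single text message to the judge model and print what comes back
code, answer, resp = model.generate_inner(msgs)
print(code, answer, resp)
```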

(The code is slightly modified to accommodate the Hyperbolic API.)

<img width="1767" height="641" alt="Image" src="https://github.com/user-attachments/assets/eac2256c-cf8e-4d20-b7dd-420f64ecd17a" />


Let me know if you need more context.
