Description
We are attempting to reproduce the CL-Bench leaderboard results using the official codebase but are consistently getting significantly lower scores for SOTA models like GPT-5.1 and Gemini-3-Pro.
Our investigation points to a systematic issue with output truncation:
The judge prompt in `eval.py` requires detailed Chain-of-Thought (CoT) reasoning. For these models, the generated reasoning is often very long (10k+ characters), which frequently triggers API length limits or content filters. Because the Overall Score is the last field in the required JSON output, it is the part that gets cut off, so the JSON parser fails and the sample is scored as 0.
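For clarity, here is a minimal sketch of the failure mode as we understand it (the field name and parser below are our own illustration, not the exact code in `eval.py`):

```python
import json

# Hypothetical judge response cut off by the API length limit. Because the
# overall score is requested as the last JSON field, it is missing from the
# truncated payload entirely.
truncated_output = (
    '{"reasoning": "Step 1: the answer correctly identifies the key issue...'
)

def score_sample(raw: str) -> float:
    """Mimics a strict parser: any parse failure scores the sample as 0."""
    try:
        graded = json.loads(raw)
        return float(graded["overall_score"])
    except (json.JSONDecodeError, KeyError, ValueError):
        return 0.0

print(score_sample(truncated_output))  # -> 0.0, regardless of answer quality
```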
Reproduction Discrepancy
We used the exact same configuration as the official code (using gpt-5.1 as the judge).
| Model | Official Score (Reported) | Our Reproduction | Failure Rate (Parse Error) |
|---|---|---|---|
| GPT-5.1 | ~21.1% | 12.11% | 9.79% (186/1899 failed) |
| Gemini-3-Pro | ~14.8% | 10.27% | 11.80% (224/1899 failed) |
Observed Phenomenon
We found that simply increasing `max_tokens` (even up to 65,536) does not resolve the issue: the API often terminates generation due to server-side content filters or timeouts when the CoT becomes excessive.
Because the parser strictly requires well-formed JSON, high-quality answers are penalized purely because of this structural truncation.
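As a sketch of how we checked this, we looked at the finish reason reported for the judge calls on failed samples. The snippet below assumes the standard OpenAI Python SDK; `failed_prompts` is a placeholder for the judge prompts of the affected samples, and some newer models may require `max_completion_tokens` instead of `max_tokens`:

```python
from collections import Counter
from openai import OpenAI

client = OpenAI()

def judge_finish_reason(judge_prompt: str, max_tokens: int = 65536) -> str:
    """Run one judge call and report why generation stopped
    ('stop', 'length', or 'content_filter')."""
    response = client.chat.completions.create(
        model="gpt-5.1",  # same judge model as the official config
        messages=[{"role": "user", "content": judge_prompt}],
        max_tokens=max_tokens,
    )
    return response.choices[0].finish_reason

# failed_prompts = [...]  # judge prompts of the samples that failed to parse
# print(Counter(judge_finish_reason(p) for p in failed_prompts))
# Tallying these is how we attribute the parse errors to length / content-filter
# terminations rather than to malformed-but-complete JSON.
```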
Request
- Release of Detailed Results: Could you please release the raw `graded.jsonl` files for the official leaderboard? This would allow us to verify whether our API behavior deviates from yours or whether there is a difference in parsing logic.
- Code Fix: We suspect the official results may have been obtained with a modified prompt (e.g., outputting the score first, as in the sketch below, or requiring a shorter CoT). Could you provide the exact prompt or parsing logic used to achieve the reported scores?
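To make the second point concrete, this is the kind of change we have in mind (purely our guess at what the official setup might do, not a claim about your actual prompt or parser):

```python
import re

# Hypothetical output format with the score emitted before the reasoning, e.g.
#   {"overall_score": 7, "reasoning": "Step 1: ..."}
# With this ordering, the score survives even when the CoT is cut off, and a
# tolerant parser can still recover it from a truncated reply.

def extract_overall_score(raw: str):
    """Return the overall score if present, else None (field name is assumed)."""
    match = re.search(r'"overall_score"\s*:\s*(\d+(?:\.\d+)?)', raw)
    return float(match.group(1)) if match else None

print(extract_overall_score('{"overall_score": 7, "reasoning": "Step 1: the ans'))  # -> 7.0
```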
Thank you!