Description
We are attempting to reproduce the CL-Bench leaderboard results using the official codebase but are consistently getting significantly lower scores for SOTA models like GPT-5.1 and Gemini-3-Pro.
Our investigation points to a systematic issue with output truncation:
The judge prompt in `eval.py` requires detailed Chain-of-Thought (CoT) reasoning. For these models, the generated reasoning is often very long (10k+ characters), which frequently triggers API length limits or content filters. Because the Overall Score is the last field in the required JSON output, it is the part that gets cut off, so the JSON parser fails and the sample is scored as 0.
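For clarity, here is a minimal sketch of the failure mode as we understand it (the field name and parser below are our own illustration, not the exact code in `eval.py`):

```python
import json

# Hypothetical judge response cut off by the API length limit. Because the
# overall score is requested as the last JSON field, it is missing from the
# truncated payload entirely.
truncated_output = (
    '{"reasoning": "Step 1: the answer correctly identifies the key issue...'
)

def score_sample(raw: str) -> float:
    """Mimics a strict parser: any parse failure scores the sample as 0."""
    try:
        graded = json.loads(raw)
        return float(graded["overall_score"])
    except (json.JSONDecodeError, KeyError, ValueError):
        return 0.0

print(score_sample(truncated_output))  # -> 0.0, regardless of answer quality
```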
Reproduction Discrepancy
We used the exact same configuration as the official code (using gpt-5.1 as the judge).
| Model | Official Score (Reported) | Our Reproduction | Failure Rate (Parse Error) |
|---|---|---|---|
| GPT-5.1 | ~21.1% | 12.11% | 9.79% (186/1899 failed) |
| Gemini-3-Pro | ~14.8% | 10.27% | 11.80% (224/1899 failed) |
Observed Phenomenon
We found that simply increasing `max_tokens` (even up to 65,536) does not resolve the issue: the API often terminates generation due to server-side content filters or timeouts when the CoT becomes excessive.
Because the parser strictly requires well-formed JSON, high-quality answers are penalized purely because of this structural truncation.
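As a sketch of how we checked this, we looked at the finish reason reported for the judge calls on failed samples. The snippet below assumes the standard OpenAI Python SDK; `failed_prompts` is a placeholder for the judge prompts of the affected samples, and some newer models may require `max_completion_tokens` instead of `max_tokens`:

```python
from collections import Counter
from openai import OpenAI

client = OpenAI()

def judge_finish_reason(judge_prompt: str, max_tokens: int = 65536) -> str:
    """Run one judge call and report why generation stopped
    ('stop', 'length', or 'content_filter')."""
    response = client.chat.completions.create(
        model="gpt-5.1",  # same judge model as the official config
        messages=[{"role": "user", "content": judge_prompt}],
        max_tokens=max_tokens,
    )
    return response.choices[0].finish_reason

# failed_prompts = [...]  # judge prompts of the samples that failed to parse
# print(Counter(judge_finish_reason(p) for p in failed_prompts))
# Tallying these is how we attribute the parse errors to length / content-filter
# terminations rather than to malformed-but-complete JSON.
```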
Request
- Release of Detailed Results: Could you please release the raw `graded.jsonl` files for the official leaderboard? This would allow us to verify whether our API behavior deviates from yours or whether there is a difference in parsing logic.
- Code Fix: We suspect the official results may have been obtained with a modified prompt (e.g., outputting the score first, as in the sketch below, or requiring a shorter CoT). Could you provide the exact prompt or parsing logic used to achieve the reported scores?
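To make the second point concrete, this is the kind of change we have in mind (purely our guess at what the official setup might do, not a claim about your actual prompt or parser):

```python
import re

# Hypothetical output format with the score emitted before the reasoning, e.g.
#   {"overall_score": 7, "reasoning": "Step 1: ..."}
# With this ordering, the score survives even when the CoT is cut off, and a
# tolerant parser can still recover it from a truncated reply.

def extract_overall_score(raw: str):
    """Return the overall score if present, else None (field name is assumed)."""
    match = re.search(r'"overall_score"\s*:\s*(\d+(?:\.\d+)?)', raw)
    return float(match.group(1)) if match else None

print(extract_overall_score('{"overall_score": 7, "reasoning": "Step 1: the ans'))  # -> 7.0
```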
Thank you!