
Fix eval push verifiers upload parity #588

Open
d42me wants to merge 4 commits into main from fix/eval-push-vf-results-parity

Conversation

@d42me
Contributor

@d42me d42me commented Apr 30, 2026

Summary

  • Reuse the same verifiers result normalization for automatic uploads and prime eval push.
  • Preserve rollout viewer data by carrying non-standard vf-eval fields such as timing, token_usage, trajectory, status flags, and state columns through info.
  • Keep avg_* metrics and metadata intact for manually pushed vf-eval outputs.
  • Set the evaluation dataset from the pushed environment, matching automatic prime eval run uploads.
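For reviewers who want a concrete picture of the "carry through info" behavior, here is a minimal sketch. It is not the code in this PR: the function name `fold_extras_into_info` and the `STANDARD_FIELDS` set are hypothetical, and the real list of standard sample fields lives in the shared normalizer.

```python
from typing import Any

# Hypothetical set of fields the upload schema understands directly.
# The actual list is defined by the shared normalizer, not here.
STANDARD_FIELDS = {"example_id", "prompt", "completion", "reward", "answer", "task"}


def fold_extras_into_info(sample: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `sample` with non-standard keys moved under `info`."""
    # Copy any existing info dict so the caller's sample is not mutated.
    info = dict(sample.get("info") or {})
    normalized: dict[str, Any] = {}
    for key, value in sample.items():
        if key == "info":
            continue
        if key in STANDARD_FIELDS:
            normalized[key] = value
        else:
            # Keys already present in info win over top-level extras.
            info.setdefault(key, value)
    normalized["info"] = info
    return normalized
```

With this shape, fields such as timing, token_usage, or status flags stay available to the rollout viewer without widening the top-level schema.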

Validation

  • uv run pytest packages/prime/tests -q
  • uv run ruff check packages/prime/src/prime_cli/commands/evals.py packages/prime/src/prime_cli/utils/eval_push.py packages/prime/tests/test_eval_push.py
  • uv run ty check packages/prime/src packages/prime/tests/test_eval_push.py

Note

Medium Risk
Changes environment resolution semantics in the Evals SDK and prime eval push, which could affect how evaluations link to environments and may surface new 404/validation errors for previously auto-created names.

Overview
Aligns manual prime eval push uploads with automatic verifiers uploads by reusing shared normalization: avg_* metrics are extracted consistently, result samples are normalized (field aliases, timing-derived total_time/latency_ms), and non-standard vf-eval fields are preserved under info.
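The metric and timing steps described above can be sketched roughly as follows. This is an illustration under assumptions rather than the PR's implementation: the helper names are hypothetical, and the `total_ms` key inside `timing` is an assumed field name, as is the choice to express `total_time` in seconds.

```python
from typing import Any


def extract_avg_metrics(metadata: dict[str, Any]) -> dict[str, float]:
    """Collect numeric avg_* entries, keeping their names unchanged."""
    return {
        key: float(value)
        for key, value in metadata.items()
        if key.startswith("avg_")
        and isinstance(value, (int, float))
        and not isinstance(value, bool)  # bool is an int subclass; skip it
    }


def derive_latency(sample: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `sample` with latency fields derived from timing."""
    out = dict(sample)
    timing = sample.get("timing") or {}
    total_ms = timing.get("total_ms")  # assumed key name
    if isinstance(total_ms, (int, float)) and not isinstance(total_ms, bool):
        # Explicit values already on the sample are left untouched.
        out.setdefault("latency_ms", float(total_ms))
        out.setdefault("total_time", float(total_ms) / 1000.0)  # assumed seconds
    return out
```

Running both the automatic and the manual path through the same helpers is what keeps the two uploads in parity.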

Prevents unintended environment creation during evaluation creation/push: the Evals SDK switches name-based environment resolution from /environmentshub/resolve to lookup-only /environmentshub/lookup, slug lookups use GET /environmentshub/{owner}/{name}/@latest, and create_evaluation now allows dataset-only evaluations (only erroring when environments are provided but none resolve). The CLI now treats bare env names as dataset labels and only links environments when an owner slug is provided, while also setting dataset from the pushed env reference.
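The CLI-side rule (bare name is a dataset label, owner slug links an environment) could look something like the sketch below. `EnvRef` and `classify_env_ref` are hypothetical names, and using the name part of the slug as the dataset value is an assumption; only the `/environmentshub/{owner}/{name}/@latest` path comes from the description above.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import quote


@dataclass(frozen=True)
class EnvRef:
    dataset: str                 # label stored on the evaluation
    lookup_path: Optional[str]   # GET path for a slug lookup, or None


def classify_env_ref(ref: str) -> EnvRef:
    """Split an env reference into a dataset label and an optional lookup path."""
    ref = ref.strip()
    owner, sep, name = ref.partition("/")
    if sep and owner and name and "/" not in name:
        # owner/name slug: link the environment via a lookup-only request.
        path = f"/environmentshub/{quote(owner, safe='')}/{quote(name, safe='')}/@latest"
        return EnvRef(dataset=name, lookup_path=path)
    # Anything else is treated as a plain dataset label; nothing is linked
    # and, importantly, nothing is created on the server.
    return EnvRef(dataset=ref, lookup_path=None)
```

Because the bare-name branch never issues a request, a typo in an environment name can no longer create a new environment as a side effect.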

Reviewed by Cursor Bugbot for commit 61cbfdc. Bugbot is set up for automated code reviews on this repo.

@d42me d42me requested review from JannikSt and burnpiro April 30, 2026 21:52
@JannikSt
Member

JannikSt commented May 4, 2026

@codex review

JannikSt previously approved these changes May 4, 2026
@chatgpt-codex-connector

Codex Review: Didn't find any major issues. 👍

