try to use different backends for binlog and build analysis #502

Draft
baronfel wants to merge 1 commit into dotnet:main from baronfel:eval-binlog-methods

@baronfel
Member

@baronfel baronfel commented Apr 7, 2026

No description provided.

@github-actions
Contributor

github-actions bot commented Apr 7, 2026

Note

This PR is from a fork and modifies infrastructure files (eng/ or .github/).

Changes to infrastructure typically need to be submitted from a branch in dotnet/skills (not a fork) so that CI workflows run with the correct permissions and secrets.

Please consider recreating this PR from an upstream branch. If you don't have push access to dotnet/skills, ask a maintainer to push your branch for you.

@JanKrivanek
Member

/evaluate

github-actions bot added a commit that referenced this pull request Apr 7, 2026
github-actions bot added a commit that referenced this pull request Apr 7, 2026
@github-actions
Contributor

github-actions bot commented Apr 7, 2026

Skill Validation Results

| Skill | Scenario | Quality | Skills Loaded | Overfit | Verdict |
|---|---|---|---|---|---|
| check-bin-obj-clash | Diagnose bin/obj output path clashes | 3.7/5 → 5.0/5 🟢 | ✅ check-bin-obj-clash; tools: skill / ✅ check-bin-obj-clash; binlog-generation; tools: skill | 🟡 0.36 | [1] |
| incremental-build | Analyze incremental build issues | 3.0/5 → 4.7/5 🟢 | ✅ incremental-build; tools: skill, bash | 🟡 0.33 | [2] |
| build-perf-diagnostics | Diagnose slow build for a small project | 4.3/5 → 4.7/5 🟢 | ✅ build-perf-diagnostics; tools: skill / ⚠️ NOT ACTIVATED | 🟡 0.47 | [3] |
| resolve-project-references | Explain misleading ResolveProjectReferences time | 3.0/5 → 5.0/5 🟢 | ✅ resolve-project-references; tools: skill, glob / ✅ resolve-project-references; tools: skill | ✅ 0.11 | |
| eval-performance | Analyze MSBuild evaluation performance issues | 3.0/5 → 4.0/5 🟢 | ✅ eval-performance; tools: skill, bash / ✅ eval-performance; tools: skill | 🟡 0.32 | [4] |
| build-parallelism | Analyze build parallelism bottlenecks | 3.0/5 → 3.3/5 ⏰ 🟢 | ✅ build-parallelism; tools: task, glob, skill, read_agent, bash, edit / ⚠️ NOT ACTIVATED | 🟡 0.39 | [5] |
| binlog-failure-analysis | Diagnose build failures from binlog only (no source files) | 4.3/5 → 5.0/5 🟢 | ✅ binlog-failure-analysis; tools: skill, view | 🟡 0.37 | [6] |

[1] ⚠️ High run-to-run variance (CV=25.78) — consider re-running with --runs 5
[2] ⚠️ High run-to-run variance (CV=1.92) — consider re-running with --runs 5. (Plugin) Quality improved but weighted score is -24.5% due to: judgment, tokens (58411 → 163329), quality, tool calls (6 → 10), time (40.7s → 58.0s)
[3] ⚠️ High run-to-run variance (CV=1.43) — consider re-running with --runs 5
[4] ⚠️ High run-to-run variance (CV=0.64) — consider re-running with --runs 5
[5] (Plugin) Quality improved but weighted score is -0.4% due to: tokens (55678 → 104094), tool calls (7 → 11), time (46.3s → 72.2s)
[6] ⚠️ High run-to-run variance (CV=1.80) — consider re-running with --runs 5

⏰ timeout: run(s) hit the (160s, 360s) scenario timeout limit; scoring may be impacted by aborting model execution before it could produce its full output (increase via timeout in eval.yaml)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

▶ Sessions Visualisation -- interactive replay of all evaluation sessions
