
run-tests: add critical rules, detection guidance, and troubleshooting #514

Draft
Evangelink wants to merge 8 commits into main from skill/run-tests-improvements

@Evangelink
Member

  • Add 'Critical Rules' table to prevent cross-platform VSTest/MTP mistakes
  • Expand Step 1 detection with explicit file-by-file lookup table
  • Add 'dotnet --version' as first detection action
  • Add Troubleshooting section with 7 common error patterns
  • Expand Common Pitfalls from 3 to 7 entries
  • Strengthen negative guidance for SDK version-specific syntax
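The SDK-version-specific syntax mentioned in the list above can be sketched as a small helper. This is a minimal illustration, not the skill's actual logic: the function names `sdk_major` and `build_test_command` are hypothetical, and it assumes the only rule in play is the one named by the evaluation scenarios (MTP on SDK 9 needs the `--` separator, MTP on SDK 10 passes arguments directly, VSTest takes its own arguments directly). It also assumes the SDK 10 project is already configured so that `dotnet test` runs in MTP mode.

```python
def sdk_major(version: str) -> int:
    """Return the major component of a `dotnet --version` string such as '9.0.203'."""
    return int(version.strip().split(".", 1)[0])


def build_test_command(sdk_version: str, platform: str, runner_args: list[str]) -> list[str]:
    """Assemble a `dotnet test` command line for the detected platform.

    platform is either "VSTest" or "MTP" (hypothetical labels for this sketch).
    """
    command = ["dotnet", "test"]
    if not runner_args:
        return command
    if platform == "MTP" and sdk_major(sdk_version) < 10:
        # SDK 9 and earlier: MTP arguments must follow the "--" separator.
        return command + ["--"] + runner_args
    # VSTest, or MTP on SDK 10 and later: arguments are passed directly.
    return command + runner_args


if __name__ == "__main__":
    print(" ".join(build_test_command("9.0.203", "MTP", ["--report-trx"])))
    print(" ".join(build_test_command("10.0.100", "MTP", ["--report-trx"])))
    print(" ".join(build_test_command("9.0.203", "VSTest", ["--logger", "trx"])))
```

Running `dotnet --version` first, as the description proposes, supplies the `sdk_version` input; the platform label would come from the file-by-file detection step.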

@Evangelink
Member Author

/evaluate

github-actions bot added a commit that referenced this pull request Apr 10, 2026
@github-actions
Contributor

Skill Validation Results

Skill Scenario Quality Skills Loaded Overfit Verdict
run-tests Run tests in a VSTest MSTest project 4.0/5 → 5.0/5 🟢 ✅ run-tests; tools: skill, glob
run-tests Run tests with trx reporting on MTP project (SDK 9) 3.0/5 → 3.0/5 ⏰ ✅ run-tests; tools: skill, report_intent, bash, view, edit [1]
run-tests Run tests with blame-hang on MTP project (SDK 10) 3.0/5 ⏰ → 3.0/5 ⏰ ✅ run-tests; tools: skill, bash, edit, glob / ✅ run-tests; tools: skill, create, glob [2]
run-tests Run tests in a multi-TFM project targeting a specific framework 2.0/5 → 4.3/5 🟢 ✅ run-tests; tools: skill, bash, glob / ⚠️ NOT ACTIVATED
run-tests Filter MSTest tests by category on VSTest 5.0/5 → 5.0/5 ✅ run-tests; tools: skill, glob, bash / ⚠️ NOT ACTIVATED [3]
run-tests Filter NUnit tests by class name on VSTest 4.0/5 → 5.0/5 🟢 ✅ run-tests; tools: skill, bash / ⚠️ NOT ACTIVATED [4]
run-tests Filter xUnit v3 tests by class on MTP 1.0/5 → 5.0/5 🟢 ✅ run-tests; tools: report_intent, skill, view / ⚠️ NOT ACTIVATED
run-tests Filter xUnit v3 tests by trait on MTP 1.7/5 → 5.0/5 🟢 ✅ run-tests; tools: skill / ✅ run-tests; tools: skill, glob
run-tests Filter TUnit tests by class using treenode-filter 3.0/5 ⏰ → 3.0/5 ⏰ ✅ run-tests; tools: skill / ✅ run-tests; filter-syntax; tools: skill, bash, glob [5]
run-tests Combine multiple filter criteria on VSTest MSTest 3.0/5 → 3.0/5 ⚠️ NOT ACTIVATED / ✅ run-tests; tools: report_intent, view, skill [6]
run-tests MTP project on SDK 9 must use -- separator for args 3.0/5 → 3.0/5 ⚠️ NOT ACTIVATED / ✅ run-tests; tools: report_intent, skill, view, bash [7]
run-tests MTP project on SDK 10 passes args directly 3.0/5 → 3.0/5 ⚠️ NOT ACTIVATED [8]
run-tests Detect test platform from Directory.Build.props 3.0/5 → 3.0/5 ⚠️ NOT ACTIVATED / ✅ run-tests; tools: report_intent, view, skill, bash [9]
run-tests Negative test: do not use MTP syntax for a VSTest project 3.0/5 → 3.0/5 ⚠️ NOT ACTIVATED [10]

[1] ⚠️ High run-to-run variance (CV=1.18) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -14.5% due to: tokens (270805 → 598314), errors (0 → 1), time (160.4s → 327.9s), tool calls (15 → 27)
[2] ⚠️ High run-to-run variance (CV=2.28) — consider re-running with --runs 5
[3] ⚠️ High run-to-run variance (CV=1.00) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -1.6% due to: tokens (33321 → 41043)
[4] ⚠️ High run-to-run variance (CV=0.50) — consider re-running with --runs 5
[5] ⚠️ High run-to-run variance (CV=0.96) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -8.2% due to: errors (0 → 1), time (48.4s → 120.2s), tool calls (2 → 3)
[6] (Plugin) Quality unchanged but weighted score is -5.0% due to: tokens (0 → 45654), tool calls (0 → 3), time (0ms → 17.9s)
[7] ⚠️ High run-to-run variance (CV=1.73) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -2.5% due to: time (0ms → 1ms)
[8] ⚠️ High run-to-run variance (CV=0.87) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -2.5% due to: time (0ms → 1ms)
[9] ⚠️ High run-to-run variance (CV=0.87) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -2.5% due to: time (0ms → 1ms)
[10] (Plugin) Quality unchanged but weighted score is -5.0% due to: tokens (0 → 43072), tool calls (0 → 2), time (0ms → 16.6s)
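For readers unfamiliar with the CV figure in the footnotes above: the coefficient of variation is the standard deviation divided by the mean. The sketch below shows that calculation only. Which per-run metric the report feeds into it, and whether it uses the population or sample standard deviation, is not stated here, so the choice of `pstdev` and the function name `coefficient_of_variation` are assumptions.

```python
from statistics import mean, pstdev


def coefficient_of_variation(samples: list[float]) -> float:
    """Return stddev / |mean| for a list of per-run measurements.

    Uses the population standard deviation; this is an assumption,
    the evaluation tool may use the sample form instead.
    """
    if not samples:
        raise ValueError("at least one sample is required")
    mu = mean(samples)
    if mu == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return pstdev(samples) / abs(mu)


if __name__ == "__main__":
    # Three identical runs have no spread at all.
    print(coefficient_of_variation([3.0, 3.0, 3.0]))
    # Two runs that differ widely relative to their mean.
    print(coefficient_of_variation([1.0, 3.0]))
```

A CV above 1 means the spread between runs exceeds the average itself, which is why the report suggests re-running with more runs before trusting the comparison.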

⏰ timeout: one or more runs hit the scenario timeout limit (120s, 300s, 360s); scoring may be affected because model execution was aborted before it could produce its full output (increase via timeout in eval.yaml)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

▶ Sessions Visualisation -- interactive replay of all evaluation sessions

@Evangelink
Member Author

/evaluate

@github-actions
Contributor

Skill Validation Results

Skill Scenario Quality Skills Loaded Overfit Verdict
run-tests Run tests in a VSTest MSTest project 4.0/5 → 5.0/5 🟢 ✅ run-tests; tools: skill, glob / ✅ run-tests; tools: skill ✅ 0.17
run-tests Run tests with trx reporting on MTP project (SDK 9) 3.3/5 → 4.0/5 🟢 ✅ run-tests; tools: skill, read_bash, report_intent, bash, view, edit / ✅ run-tests; tools: skill, report_intent, bash, view, edit ✅ 0.17 [1]
run-tests Run tests with blame-hang on MTP project (SDK 10) 2.0/5 → 2.7/5 ⏰ 🟢 ✅ run-tests; tools: skill, bash, edit ✅ 0.17 [2]
run-tests Run tests in a multi-TFM project targeting a specific framework 2.0/5 → 4.0/5 🟢 ✅ run-tests; tools: skill, bash, glob, read_bash / ⚠️ NOT ACTIVATED ✅ 0.17
run-tests Filter MSTest tests by category on VSTest 5.0/5 → 5.0/5 ✅ run-tests; tools: skill, bash, glob / ⚠️ NOT ACTIVATED ✅ 0.17 [3]
run-tests Filter NUnit tests by class name on VSTest 4.0/5 → 5.0/5 🟢 ✅ run-tests; tools: skill, bash, glob / ⚠️ NOT ACTIVATED ✅ 0.17
run-tests Filter xUnit v3 tests by class on MTP 1.0/5 → 5.0/5 🟢 ✅ run-tests; tools: skill, view, bash / ⚠️ NOT ACTIVATED ✅ 0.17 [4]
run-tests Filter xUnit v3 tests by trait on MTP 1.7/5 ⏰ → 5.0/5 🟢 ✅ run-tests; tools: skill / ✅ run-tests; tools: skill, glob ✅ 0.17
run-tests Filter TUnit tests by class using treenode-filter 2.7/5 → 4.3/5 🟢 ✅ run-tests; tools: skill, bash ✅ 0.17
run-tests Combine multiple filter criteria on VSTest MSTest 4.7/5 → 4.7/5 ✅ run-tests; tools: skill, bash, glob ✅ 0.17 [5]
run-tests MTP project on SDK 9 must use -- separator for args 1.3/5 → 5.0/5 🟢 ✅ run-tests; tools: skill, glob ✅ 0.17
run-tests MTP project on SDK 10 passes args directly 3.3/5 → 4.0/5 🟢 ✅ run-tests; tools: skill, glob ✅ 0.17 [6]
run-tests Detect test platform from Directory.Build.props 3.0/5 → 5.0/5 🟢 ✅ run-tests; tools: skill ✅ 0.17 [7]
run-tests Negative test: do not use MTP syntax for a VSTest project 4.0/5 → 5.0/5 🟢 ✅ run-tests; tools: skill, view / ✅ run-tests; tools: skill, view, glob ✅ 0.17

[1] ⚠️ High run-to-run variance (CV=0.53) — consider re-running with --runs 5
[2] ⚠️ High run-to-run variance (CV=18.07) — consider re-running with --runs 5
[3] ⚠️ High run-to-run variance (CV=6.03) — consider re-running with --runs 5
[4] ⚠️ High run-to-run variance (CV=0.65) — consider re-running with --runs 5
[5] ⚠️ High run-to-run variance (CV=2.50) — consider re-running with --runs 5
[6] ⚠️ High run-to-run variance (CV=1.00) — consider re-running with --runs 5
[7] ⚠️ High run-to-run variance (CV=0.98) — consider re-running with --runs 5

⏰ timeout: one or more runs hit the scenario timeout limit (120s, 300s); scoring may be affected because model execution was aborted before it could produce its full output (increase via timeout in eval.yaml)
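The Quality cells in the results above pair two scores out of 5, and the 🟢 and 🔴 markers appear to follow the sign of the change. The sketch below parses one cell and classifies it. It assumes the left score is the baseline and the right score is the run with the skill loaded; the function name `quality_change` and the three labels are hypothetical, and the report's own marker logic may differ.

```python
import re

# Matches scores written as "4.0/5" or "5/5".
_SCORE = re.compile(r"(\d+(?:\.\d+)?)/5")


def quality_change(cell: str) -> tuple[float, float, str]:
    """Parse a Quality cell such as '2.0/5 → 4.0/5 🟢' into (before, after, label)."""
    scores = [float(s) for s in _SCORE.findall(cell)]
    if len(scores) != 2:
        raise ValueError(f"expected two scores in {cell!r}")
    before, after = scores
    if after > before:
        label = "improved"
    elif after < before:
        label = "regressed"
    else:
        label = "unchanged"
    return before, after, label


if __name__ == "__main__":
    for cell in ("2.0/5 → 4.0/5 🟢", "4.7/5 → 4.7/5", "2.0/5 → 2.7/5 ⏰ 🟢"):
        print(cell, "=>", quality_change(cell))
```

Timeout markers (⏰) are ignored by the parser, so a row that timed out still reports its score change; whether that change is trustworthy is a separate question.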

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

▶ Sessions Visualisation -- interactive replay of all evaluation sessions

@Evangelink
Member Author

/evaluate

github-actions bot added a commit that referenced this pull request Apr 12, 2026
@github-actions
Contributor

Skill Validation Results

Skill Scenario Quality Skills Loaded Overfit Verdict
run-tests Run tests in a VSTest MSTest project 4.0/5 → 5.0/5 🟢 ✅ run-tests; tools: skill, glob / ✅ run-tests; tools: skill ✅ 0.12
run-tests Run tests with trx reporting on MTP project (SDK 9) 3.0/5 ⏰ → 3.0/5 ⏰ ✅ run-tests; tools: skill / ✅ run-tests; tools: skill, edit ✅ 0.12 [1]
run-tests Run tests with blame-hang on MTP project (SDK 10) 3.0/5 → 3.0/5 ⏰ ✅ run-tests; tools: skill, bash, edit, glob / ✅ run-tests; tools: skill, bash, edit, create, glob ✅ 0.12 [2]
run-tests Run tests in a multi-TFM project targeting a specific framework 2.0/5 → 4.0/5 🟢 ✅ run-tests; tools: skill, bash, read_bash, glob / ⚠️ NOT ACTIVATED ✅ 0.12
run-tests Filter MSTest tests by category on VSTest 5.0/5 → 5.0/5 ✅ run-tests; tools: skill, bash, glob / ⚠️ NOT ACTIVATED ✅ 0.12 [3]
run-tests Filter NUnit tests by class name on VSTest 4.0/5 → 5.0/5 🟢 ✅ run-tests; tools: skill, bash, glob / ⚠️ NOT ACTIVATED ✅ 0.12
run-tests Filter xUnit v3 tests by class on MTP 1.0/5 → 5.0/5 🟢 ✅ run-tests; tools: skill, view, bash / ⚠️ NOT ACTIVATED ✅ 0.12 [4]
run-tests Filter xUnit v3 tests by trait on MTP 2.0/5 → 5.0/5 🟢 ✅ run-tests; tools: skill ✅ 0.12
run-tests Filter TUnit tests by class using treenode-filter 3.0/5 → 4.0/5 🟢 ✅ run-tests; tools: skill, bash / ⚠️ NOT ACTIVATED ✅ 0.12 [5]
run-tests Combine multiple filter criteria on VSTest MSTest 3.0/5 → 3.7/5 🟢 ✅ run-tests; tools: skill, bash, glob ✅ 0.12 [6]
run-tests MTP project on SDK 9 must use -- separator for args 3.0/5 → 3.0/5 ⚠️ NOT ACTIVATED / ✅ run-tests; tools: report_intent, skill, bash, view ✅ 0.12 [7]
run-tests MTP project on SDK 10 passes args directly 3.0/5 → 3.0/5 ⚠️ NOT ACTIVATED / ✅ run-tests; tools: report_intent, skill, view, bash, edit, glob ✅ 0.12 [8]
run-tests Detect test platform from Directory.Build.props 3.0/5 → 3.0/5 ⚠️ NOT ACTIVATED / ✅ run-tests; tools: report_intent, skill, bash, view ✅ 0.12 [9]
run-tests Negative test: do not use MTP syntax for a VSTest project 3.0/5 → 3.0/5 ⚠️ NOT ACTIVATED / ✅ run-tests; tools: report_intent, skill, bash, view, glob ✅ 0.12 [10]

[1] ⚠️ High run-to-run variance (CV=1.26) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -1.5% due to: tokens (670569 → 813826)
[2] ⚠️ High run-to-run variance (CV=0.56) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -15.0% due to: tokens (174949 → 585135), errors (0 → 1), tool calls (12 → 28), time (107.9s → 291.7s)
[3] ⚠️ High run-to-run variance (CV=1.00) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -3.5% due to: tokens (35157 → 49920), tool calls (2 → 3)
[4] ⚠️ High run-to-run variance (CV=0.64) — consider re-running with --runs 5
[5] ⚠️ High run-to-run variance (CV=1.36) — consider re-running with --runs 5
[6] ⚠️ High run-to-run variance (CV=1.73) — consider re-running with --runs 5
[7] ⚠️ High run-to-run variance (CV=1.73) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -2.5% due to: time (0ms → 1ms)
[8] (Plugin) Quality unchanged but weighted score is -5.0% due to: tokens (0 → 723958), tool calls (0 → 31), time (0ms → 202.3s)
[9] (Isolated) Quality unchanged but weighted score is -2.5% due to: time (0ms → 2ms)
[10] (Plugin) Quality unchanged but weighted score is -5.0% due to: tokens (0 → 84580), tool calls (0 → 8), time (0ms → 30.2s)

⏰ timeout: one or more runs hit the scenario timeout limit (300s, 360s); scoring may be affected because model execution was aborted before it could produce its full output (increase via timeout in eval.yaml)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

▶ Sessions Visualisation -- interactive replay of all evaluation sessions

@Evangelink
Member Author

/evaluate

github-actions bot added a commit that referenced this pull request Apr 12, 2026
github-actions bot added a commit that referenced this pull request Apr 12, 2026
@github-actions
Contributor

Skill Validation Results

Skill Scenario Quality Skills Loaded Overfit Verdict
coverage-analysis Project-wide coverage analysis with existing Cobertura data 2.3/5 → 4.3/5 🟢 ✅ coverage-analysis; tools: skill, create, view ✅ 0.12
coverage-analysis Run coverage from scratch without existing data 3.7/5 → 5.0/5 🟢 ✅ coverage-analysis; tools: skill, create / ✅ coverage-analysis; tools: skill, glob, create, task ✅ 0.12
coverage-analysis Coverage plateau diagnosis 3.0/5 → 4.0/5 🟢 ✅ coverage-analysis; tools: skill, create ✅ 0.12
run-tests Run tests in a VSTest MSTest project 4.0/5 → 5.0/5 🟢 ✅ run-tests; tools: skill [1]
run-tests Run tests with trx reporting on MTP project (SDK 9) 3.0/5 ⏰ → 3.0/5 ⏰ ✅ run-tests; tools: skill, glob, read_bash, edit / ✅ run-tests; tools: skill, edit, stop_bash [2]
run-tests Run tests with blame-hang on MTP project (SDK 10) 3.0/5 → 3.0/5 ⏰ ✅ run-tests; tools: skill, edit, bash, glob [3]
run-tests Run tests in a multi-TFM project targeting a specific framework 4.7/5 → 4.7/5 ✅ run-tests; tools: skill, view, glob, bash / ⚠️ NOT ACTIVATED [4]
run-tests Filter MSTest tests by category on VSTest 5.0/5 → 5.0/5 ✅ run-tests; tools: skill, bash, glob, view / ✅ run-tests; tools: skill, view, bash [5]
run-tests Filter NUnit tests by class name on VSTest 3.0/5 → 3.0/5 ✅ run-tests; tools: skill, report_intent, view, bash / ✅ run-tests; tools: report_intent, skill, view, bash [6]
run-tests Filter xUnit v3 tests by class on MTP 3.0/5 → 3.0/5 ⚠️ NOT ACTIVATED [7]
run-tests Filter xUnit v3 tests by trait on MTP 3.0/5 → 3.0/5 ⚠️ NOT ACTIVATED / ✅ run-tests; tools: report_intent, skill, view, bash, glob [8]
run-tests Filter TUnit tests by class using treenode-filter 3.0/5 → 3.0/5 ⚠️ NOT ACTIVATED [9]
run-tests Combine multiple filter criteria on VSTest MSTest 3.0/5 → 3.0/5 ⚠️ NOT ACTIVATED / ✅ run-tests; tools: report_intent, skill, view, bash, glob [10]
run-tests MTP project on SDK 9 must use -- separator for args 3.0/5 → 3.0/5 ⚠️ NOT ACTIVATED / ✅ run-tests; tools: report_intent, skill, view, bash, glob 🟡
run-tests MTP project on SDK 10 passes args directly 3.0/5 → 3.0/5 ⚠️ NOT ACTIVATED / ✅ run-tests; tools: report_intent, skill, view, bash, edit [11]
run-tests Detect test platform from Directory.Build.props 3.0/5 → 3.0/5 ⚠️ NOT ACTIVATED / ✅ run-tests; tools: report_intent, skill, bash, view 🟡
run-tests Negative test: do not use MTP syntax for a VSTest project 3.0/5 → 3.0/5 ⚠️ NOT ACTIVATED / ✅ run-tests; tools: report_intent, skill, bash, view [12]
mtp-hot-reload Suggest hot reload for failing test in MTP project (SDK 9) 3.0/5 → 3.0/5 ⏰ ✅ mtp-hot-reload; tools: skill ✅ 0.14 [13]
mtp-hot-reload Suggest hot reload for failing test in MTP project (SDK 10) 1.0/5 → 3.7/5 🟢 ✅ mtp-hot-reload; tools: skill, bash, create / ✅ mtp-hot-reload; platform-detection; tools: skill, bash, create ✅ 0.14
mtp-hot-reload Enable hot reload when package already installed 2.0/5 → 5.0/5 🟢 ✅ mtp-hot-reload; tools: skill ✅ 0.14
mtp-hot-reload Suggest launchSettings.json configuration for hot reload 1.0/5 → 4.0/5 🟢 ✅ mtp-hot-reload; tools: skill, bash, create / ✅ mtp-hot-reload; tools: skill, bash, create, glob ✅ 0.14
mtp-hot-reload Use dotnet run not dotnet test for hot reload 2.3/5 → 3.0/5 🟢 ✅ mtp-hot-reload; tools: skill / ✅ mtp-hot-reload; tools: report_intent, skill ✅ 0.14 [14]
mtp-hot-reload Negative: VSTest project cannot use MTP hot reload 2.3/5 → 3.0/5 🟢 ✅ mtp-hot-reload; tools: skill, create ✅ 0.14 [15]
mtp-hot-reload Run specific failing test with hot reload filter 1.0/5 → 5.0/5 🟢 ✅ mtp-hot-reload; tools: skill ✅ 0.14
code-testing-agent Generate tests for ContosoUniversity ASP.NET Core MVC app 3.0/5 → 3.0/5 ✅ code-testing-agent; tools: skill / ✅ code-testing-agent; tools: skill, glob ✅ 0.02

[1] ⚠️ High run-to-run variance (CV=0.65) — consider re-running with --runs 5
[2] ⚠️ High run-to-run variance (CV=0.63) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -3.9% due to: tokens (219034 → 340404), tool calls (13 → 19)
[3] (Isolated) Quality unchanged but weighted score is -15.0% due to: tokens (50151 → 189977), errors (0 → 1), tool calls (5 → 15), time (41.7s → 300.3s)
[4] ⚠️ High run-to-run variance (CV=3.65) — consider re-running with --runs 5
[5] ⚠️ High run-to-run variance (CV=1.90) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -8.3% due to: tokens (30751 → 61808), tool calls (2 → 5), time (13.9s → 24.9s)
[6] ⚠️ High run-to-run variance (CV=92.16) — consider re-running with --runs 5
[7] (Plugin) Quality unchanged but weighted score is -5.0% due to: tokens (0 → 39526), tool calls (0 → 3), time (0ms → 18.2s)
[8] ⚠️ High run-to-run variance (CV=1.73) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -2.5% due to: time (0ms → 1ms)
[9] (Plugin) Quality unchanged but weighted score is -5.0% due to: tokens (0 → 39754), tool calls (0 → 4), time (0ms → 18.4s)
[10] ⚠️ High run-to-run variance (CV=1.00) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -2.5% due to: time (0ms → 1ms)
[11] ⚠️ High run-to-run variance (CV=1.00) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -2.5% due to: time (0ms → 1ms)
[12] (Plugin) Quality unchanged but weighted score is -5.0% due to: tokens (0 → 85289), tool calls (0 → 7), time (0ms → 31.9s)
[13] (Isolated) Quality unchanged but weighted score is -15.0% due to: tokens (148440 → 765178), errors (0 → 1), tool calls (14 → 39), time (117.9s → 360.2s)
[14] ⚠️ High run-to-run variance (CV=0.85) — consider re-running with --runs 5
[15] ⚠️ High run-to-run variance (CV=1.36) — consider re-running with --runs 5

⏰ timeout: one or more runs hit the scenario timeout limit (300s, 360s); scoring may be affected because model execution was aborted before it could produce its full output (increase via timeout in eval.yaml)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

▶ Sessions Visualisation -- interactive replay of all evaluation sessions

@Evangelink
Member Author

/evaluate

github-actions bot added a commit that referenced this pull request Apr 13, 2026
@github-actions
Contributor

Skill Validation Results

Skill Scenario Quality Skills Loaded Overfit Verdict
coverage-analysis Project-wide coverage analysis with existing Cobertura data 2.0/5 → 3.7/5 🟢 ✅ coverage-analysis; tools: skill, view, create ✅ 0.08
coverage-analysis Run coverage from scratch without existing data 3.7/5 → 5.0/5 🟢 ✅ coverage-analysis; tools: skill, create, grep / ✅ coverage-analysis; tools: skill, create ✅ 0.08
coverage-analysis Coverage plateau diagnosis 3.0/5 → 4.0/5 🟢 ✅ coverage-analysis; tools: skill, create, read_bash, stop_bash / ✅ coverage-analysis; tools: skill, create ✅ 0.08 [1]
run-tests Run tests in a VSTest MSTest project 4.0/5 → 5.0/5 🟢 ✅ run-tests; tools: skill, glob ✅ 0.10
run-tests Run tests with trx reporting on MTP project (SDK 9) 3.7/5 → 3.7/5 ⏰ ✅ run-tests; tools: skill, glob / ✅ run-tests; tools: skill ✅ 0.10 [2]
run-tests Run tests with blame-hang on MTP project (SDK 10) 1.7/5 ⏰ → 2.3/5 ⏰ 🟢 ✅ run-tests; tools: skill ✅ 0.10
run-tests Run tests in a multi-TFM project targeting a specific framework 4.7/5 → 4.3/5 🔴 ✅ run-tests; tools: skill, view, glob, read_bash / ⚠️ NOT ACTIVATED ✅ 0.10 [3]
run-tests Filter MSTest tests by category on VSTest 5.0/5 → 5.0/5 ✅ run-tests; tools: skill, bash, view / ✅ run-tests; tools: skill, bash, view, glob ✅ 0.10 [4]
run-tests Filter NUnit tests by class name on VSTest 3.0/5 → 5.0/5 🟢 ✅ run-tests; tools: skill, report_intent, view, bash, glob / ✅ run-tests; tools: report_intent, view, skill, bash ✅ 0.10
run-tests Filter xUnit v3 tests by class on MTP 1.0/5 → 5.0/5 🟢 ✅ run-tests; tools: skill, view, bash, report_intent, grep / ⚠️ NOT ACTIVATED ✅ 0.10
run-tests Filter xUnit v3 tests by trait on MTP 1.0/5 → 5.0/5 🟢 ✅ run-tests; tools: skill, view ✅ 0.10
run-tests Filter TUnit tests by class using treenode-filter 2.0/5 → 4.7/5 🟢 ✅ run-tests; tools: skill, bash, glob / ✅ run-tests; tools: skill, bash ✅ 0.10
run-tests Combine multiple filter criteria on VSTest MSTest 4.7/5 → 4.3/5 🔴 ✅ run-tests; tools: skill, bash, glob ✅ 0.10 [5]
run-tests MTP project on SDK 9 must use -- separator for args 1.7/5 → 5.0/5 🟢 ✅ run-tests; tools: skill ✅ 0.10
run-tests MTP project on SDK 10 passes args directly 3.7/5 → 3.3/5 🔴 ✅ run-tests; tools: skill / ✅ run-tests; tools: skill, glob ✅ 0.10 [6]
run-tests Detect test platform from Directory.Build.props 3.0/5 → 5.0/5 🟢 ✅ run-tests; tools: skill ✅ 0.10
run-tests Negative test: do not use MTP syntax for a VSTest project 4.0/5 → 5.0/5 🟢 ✅ run-tests; tools: skill, view ✅ 0.10
mtp-hot-reload Suggest hot reload for failing test in MTP project (SDK 9) 2.3/5 → 3.3/5 ⏰ 🟢 ✅ mtp-hot-reload; tools: skill, read_bash / ✅ mtp-hot-reload; tools: skill ✅ 0.11 [7]
mtp-hot-reload Suggest hot reload for failing test in MTP project (SDK 10) 1.3/5 → 3.7/5 🟢 ✅ mtp-hot-reload; tools: skill, bash, create ✅ 0.11
mtp-hot-reload Enable hot reload when package already installed 2.0/5 → 5.0/5 🟢 ✅ mtp-hot-reload; tools: skill, glob / ✅ mtp-hot-reload; tools: skill ✅ 0.11
mtp-hot-reload Suggest launchSettings.json configuration for hot reload 1.0/5 ⏰ → 4.0/5 🟢 ✅ mtp-hot-reload; tools: skill, create / ✅ mtp-hot-reload; tools: skill, glob, create ✅ 0.11
mtp-hot-reload Use dotnet run not dotnet test for hot reload 2.3/5 → 3.0/5 🟢 ✅ mtp-hot-reload; tools: skill / ✅ mtp-hot-reload; tools: report_intent, skill ✅ 0.11 [8]
mtp-hot-reload Negative: VSTest project cannot use MTP hot reload 2.3/5 → 3.0/5 🟢 ✅ mtp-hot-reload; tools: skill, create ✅ 0.11 [9]
mtp-hot-reload Run specific failing test with hot reload filter 1.0/5 → 5.0/5 🟢 ✅ mtp-hot-reload; tools: skill, bash / ⚠️ NOT ACTIVATED ✅ 0.11 [10]
code-testing-agent Generate tests for ContosoUniversity ASP.NET Core MVC app 3.0/5 → 3.0/5 ✅ code-testing-agent; tools: skill / ✅ code-testing-agent; tools: skill, task, read_agent, glob, grep ✅ 0.02

[1] ⚠️ High run-to-run variance (CV=0.64) — consider re-running with --runs 5
[2] ⚠️ High run-to-run variance (CV=4.88) — consider re-running with --runs 5
[3] ⚠️ High run-to-run variance (CV=1.53) — consider re-running with --runs 5
[4] ⚠️ High run-to-run variance (CV=6.53) — consider re-running with --runs 5
[5] ⚠️ High run-to-run variance (CV=7.42) — consider re-running with --runs 5
[6] ⚠️ High run-to-run variance (CV=1.22) — consider re-running with --runs 5
[7] ⚠️ High run-to-run variance (CV=3.65) — consider re-running with --runs 5
[8] ⚠️ High run-to-run variance (CV=0.71) — consider re-running with --runs 5
[9] ⚠️ High run-to-run variance (CV=2.83) — consider re-running with --runs 5
[10] (Plugin) Quality unchanged but weighted score is -43.0% due to: judgment, quality, errors (0 → 1)

⏰ timeout: one or more runs hit the scenario timeout limit (180s, 300s, 360s, 1800s); scoring may be affected because model execution was aborted before it could produce its full output (increase via timeout in eval.yaml)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

▶ Sessions Visualisation -- interactive replay of all evaluation sessions

@Evangelink
Member Author

/evaluate

github-actions bot added a commit that referenced this pull request Apr 14, 2026
@github-actions
Contributor

Skill Validation Results

Skill Scenario Quality Skills Loaded Overfit Verdict
coverage-analysis Project-wide coverage analysis with existing Cobertura data 2.0/5 → 4.0/5 🟢 ✅ coverage-analysis; tools: skill, view, read_bash, stop_bash, create / ✅ coverage-analysis; tools: skill, view, create, read_bash ✅ 0.10
coverage-analysis Run coverage from scratch without existing data 3.7/5 → 5.0/5 🟢 ✅ coverage-analysis; tools: skill, task, read_agent, create / ✅ coverage-analysis; tools: skill, create ✅ 0.10
coverage-analysis Coverage plateau diagnosis 3.0/5 → 4.3/5 🟢 ✅ coverage-analysis; tools: skill, create, bash ✅ 0.10
run-tests Run tests in a VSTest MSTest project 4.0/5 → 5.0/5 🟢 ✅ run-tests; tools: skill, glob ✅ 0.12
run-tests Run tests with trx reporting on MTP project (SDK 9) 3.0/5 ⏰ → 3.0/5 ⏰ ✅ run-tests; tools: skill, glob, edit ✅ 0.12 [1]
run-tests Run tests with blame-hang on MTP project (SDK 10) 3.0/5 ⏰ → 3.0/5 ⏰ ✅ run-tests; tools: skill ✅ 0.12 [2]
run-tests Run tests in a multi-TFM project targeting a specific framework 4.7/5 → 4.7/5 ✅ run-tests; tools: skill, glob, view / ⚠️ NOT ACTIVATED ✅ 0.12 [3]
run-tests Filter MSTest tests by category on VSTest 5.0/5 → 5.0/5 ✅ run-tests; tools: skill, bash, glob, view ✅ 0.12 [4]
run-tests Filter NUnit tests by class name on VSTest 3.3/5 → 5.0/5 🟢 ✅ run-tests; tools: skill, report_intent, view, bash / ⚠️ NOT ACTIVATED ✅ 0.12
run-tests Filter xUnit v3 tests by class on MTP 1.0/5 → 5.0/5 🟢 ✅ run-tests; tools: skill, view, report_intent, bash / ⚠️ NOT ACTIVATED ✅ 0.12
run-tests Filter xUnit v3 tests by trait on MTP 1.0/5 → 5.0/5 🟢 ✅ run-tests; tools: skill, view ✅ 0.12
run-tests Filter TUnit tests by class using treenode-filter 3.0/5 → 3.0/5 ✅ run-tests; tools: skill, bash / ✅ run-tests; tools: skill, bash, glob ✅ 0.12
run-tests Combine multiple filter criteria on VSTest MSTest 3.0/5 ⏰ → 3.0/5 ⏰ ✅ run-tests; tools: skill, bash, glob ✅ 0.12 [5]
run-tests MTP project on SDK 9 must use -- separator for args 3.0/5 → 3.0/5 ⚠️ NOT ACTIVATED / ✅ run-tests; tools: report_intent, skill, bash, view, glob ✅ 0.12 🟡 [6]
run-tests MTP project on SDK 10 passes args directly 3.0/5 → 3.0/5 ⚠️ NOT ACTIVATED / ✅ run-tests; tools: report_intent, skill, view, bash, glob, edit ✅ 0.12 🟡 [7]
run-tests Detect test platform from Directory.Build.props 3.0/5 → 3.0/5 ⚠️ NOT ACTIVATED / ✅ run-tests; tools: report_intent, skill, bash, view ✅ 0.12 [8]
run-tests Negative test: do not use MTP syntax for a VSTest project 3.0/5 → 3.0/5 ⚠️ NOT ACTIVATED / ✅ run-tests; tools: report_intent, skill, bash, view, glob ✅ 0.12 [9]
mtp-hot-reload Suggest hot reload for failing test in MTP project (SDK 9) 3.0/5 → 3.0/5 ⏰ ✅ mtp-hot-reload; tools: skill ✅ 0.11 [10]
mtp-hot-reload Suggest hot reload for failing test in MTP project (SDK 10) 2.3/5 ⏰ → 3.7/5 🟢 ✅ mtp-hot-reload; tools: skill, create, bash ✅ 0.11 [11]
mtp-hot-reload Enable hot reload when package already installed 2.3/5 ⏰ → 4.3/5 🟢 ✅ mtp-hot-reload; tools: skill, glob ✅ 0.11 [12]
mtp-hot-reload Suggest launchSettings.json configuration for hot reload 1.0/5 → 4.0/5 🟢 ✅ mtp-hot-reload; tools: skill, create, glob, bash ✅ 0.11
mtp-hot-reload Use dotnet run not dotnet test for hot reload 1.7/5 → 3.0/5 🟢 ✅ mtp-hot-reload; tools: skill / ✅ mtp-hot-reload; tools: skill, report_intent ✅ 0.11 [13]
mtp-hot-reload Negative: VSTest project cannot use MTP hot reload 3.0/5 ⏰ → 3.0/5 ✅ mtp-hot-reload; tools: skill, create ✅ 0.11 [14]
mtp-hot-reload Run specific failing test with hot reload filter 3.0/5 → 3.0/5 ⏰ ✅ mtp-hot-reload; tools: skill / ✅ mtp-hot-reload; tools: skill, bash ✅ 0.11 [15]
code-testing-agent Generate tests for ContosoUniversity ASP.NET Core MVC app 3.0/5 ⏰ → 3.0/5 ⏰ ✅ code-testing-agent; tools: skill, bash, create / ✅ code-testing-agent; tools: skill, bash, glob, create, edit ✅ 0.02 [16]

[1] ⚠️ High run-to-run variance (CV=1.58) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -16.9% due to: completion (✓ → ✗), tokens (600667 → 745380)
[2] ⚠️ High run-to-run variance (CV=2.12) — consider re-running with --runs 5
[3] ⚠️ High run-to-run variance (CV=2.10) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -25.1% due to: judgment, quality, tokens (39477 → 83424), tool calls (3 → 7)
[4] ⚠️ High run-to-run variance (CV=4.30) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -7.5% due to: tokens (30807 → 63873), tool calls (2 → 6), time (22.1s → 32.9s)
[5] (Isolated) Quality unchanged but weighted score is -10.8% due to: errors (0 → 1), tool calls (2 → 6), time (68.9s → 180.1s), tokens (26666 → 30760)
[6] ⚠️ High run-to-run variance (CV=1.73) — consider re-running with --runs 5
[7] ⚠️ High run-to-run variance (CV=0.87) — consider re-running with --runs 5
[8] (Isolated) Quality unchanged but weighted score is -2.5% due to: time (0ms → 1ms)
[9] (Plugin) Quality unchanged but weighted score is -5.0% due to: tokens (0 → 85141), tool calls (0 → 8), time (0ms → 32.9s)
[10] (Isolated) Quality unchanged but weighted score is -15.0% due to: tokens (183856 → 569277), errors (0 → 1), tool calls (13 → 31), time (137.8s → 360.1s)
[11] ⚠️ High run-to-run variance (CV=0.93) — consider re-running with --runs 5
[12] ⚠️ High run-to-run variance (CV=0.52) — consider re-running with --runs 5
[13] ⚠️ High run-to-run variance (CV=0.55) — consider re-running with --runs 5
[14] ⚠️ High run-to-run variance (CV=1.45) — consider re-running with --runs 5
[15] (Isolated) Quality unchanged but weighted score is -11.7% due to: errors (0 → 1), tool calls (2 → 5), time (14.0s → 126.5s), tokens (24861 → 33276)
[16] ⚠️ High run-to-run variance (CV=11.74) — consider re-running with --runs 5

⏰ timeout: one or more runs hit the scenario timeout limit (180s, 240s, 300s, 360s, 1800s); scoring may be affected because model execution was aborted before it could produce its full output (increase via timeout in eval.yaml)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

▶ Sessions Visualisation -- interactive replay of all evaluation sessions

- run-tests: trx reporting MTP SDK 9 (360s -> 480s)
- run-tests: blame-hang MTP SDK 10 (300s -> 420s)
- run-tests: combine filter criteria VSTest (180s -> 300s)
- mtp-hot-reload: hot reload SDK 9 (360s -> 480s)
- mtp-hot-reload: hot reload filter (180s -> 300s)
- code-testing-agent: ContosoUniversity (1800s -> 2400s)
@Evangelink
Member Author

/evaluate

github-actions bot added a commit that referenced this pull request Apr 14, 2026
github-actions bot added a commit that referenced this pull request Apr 14, 2026
@github-actions
Contributor

Skill Validation Results

| Skill | Scenario | Quality | Skills Loaded | Overfit | Verdict |
| --- | --- | --- | --- | --- | --- |
| coverage-analysis | Project-wide coverage analysis with existing Cobertura data | 2.3/5 → 3.7/5 🟢 | ✅ coverage-analysis; tools: skill, bash, create, view | ✅ 0.14 | |
| coverage-analysis | Run coverage from scratch without existing data | 3.3/5 → 5.0/5 🟢 | ✅ coverage-analysis; tools: skill, create / ✅ coverage-analysis; tools: skill, create, grep | ✅ 0.14 | |
| coverage-analysis | Coverage plateau diagnosis | 3.3/5 → 4.7/5 🟢 | ✅ coverage-analysis; tools: skill, create, bash | ✅ 0.14 | [1] |
| run-tests | Run tests in a VSTest MSTest project | 4.0/5 → 5.0/5 🟢 | ✅ run-tests; tools: skill / ✅ run-tests; tools: skill, glob | ✅ 0.14 | |
| run-tests | Run tests with trx reporting on MTP project (SDK 9) | 3.3/5 → 3.7/5 ⏰ 🟢 | ✅ run-tests; tools: skill, read_bash, stop_bash, glob / ✅ run-tests; tools: skill, glob | ✅ 0.14 | |
| run-tests | Run tests with blame-hang on MTP project (SDK 10) | 2.3/5 → 2.7/5 ⏰ 🟢 | ✅ run-tests; tools: skill, glob / ✅ run-tests; tools: skill, read_bash | ✅ 0.14 | [2] |
| run-tests | Run tests in a multi-TFM project targeting a specific framework | 4.3/5 → 4.7/5 🟢 | ✅ run-tests; tools: skill, view, glob / ⚠️ NOT ACTIVATED | ✅ 0.14 | [3] |
| run-tests | Filter MSTest tests by category on VSTest | 5.0/5 → 5.0/5 | ✅ run-tests; tools: skill, bash, glob, view / ✅ run-tests; tools: skill, bash, view, glob | ✅ 0.14 | [4] |
| run-tests | Filter NUnit tests by class name on VSTest | 3.0/5 → 5.0/5 🟢 | ✅ run-tests; tools: skill, report_intent, view, bash / ✅ run-tests; tools: report_intent, view, skill, bash | ✅ 0.14 | |
| run-tests | Filter xUnit v3 tests by class on MTP | 1.0/5 → 5.0/5 🟢 | ✅ run-tests; tools: report_intent, skill, view, bash | ✅ 0.14 | [5] |
| run-tests | Filter xUnit v3 tests by trait on MTP | 1.0/5 → 5.0/5 🟢 | ✅ run-tests; tools: skill, view | ✅ 0.14 | |
| run-tests | Filter TUnit tests by class using treenode-filter | 3.0/5 → 4.7/5 🟢 | ✅ run-tests; tools: skill, bash / ✅ run-tests; filter-syntax; tools: skill, bash | ✅ 0.14 | |
| run-tests | Combine multiple filter criteria on VSTest MSTest | 4.7/5 → 4.3/5 🔴 | ✅ run-tests; tools: skill, bash, glob | ✅ 0.14 | [6] |
| run-tests | MTP project on SDK 9 must use -- separator for args | 2.3/5 → 5.0/5 🟢 | ✅ run-tests; tools: skill / ✅ run-tests; tools: skill, glob | ✅ 0.14 | |
| run-tests | MTP project on SDK 10 passes args directly | 3.7/5 → 3.0/5 ⏰ 🔴 | ✅ run-tests; tools: skill, glob | ✅ 0.14 | [7] |
| run-tests | Detect test platform from Directory.Build.props | 3.0/5 → 5.0/5 🟢 | ✅ run-tests; tools: skill | ✅ 0.14 | |
| run-tests | Negative test: do not use MTP syntax for a VSTest project | 4.0/5 → 5.0/5 🟢 | ✅ run-tests; tools: skill, view, glob / ✅ run-tests; tools: skill, view | ✅ 0.14 | |
| mtp-hot-reload | Suggest hot reload for failing test in MTP project (SDK 9) | 3.0/5 → 3.0/5 ⏰ | ✅ mtp-hot-reload; tools: skill | ✅ 0.07 | [8] |
| mtp-hot-reload | Suggest hot reload for failing test in MTP project (SDK 10) | 1.0/5 → 4.3/5 🟢 | ✅ mtp-hot-reload; tools: skill, bash, create | ✅ 0.07 | |
| mtp-hot-reload | Enable hot reload when package already installed | 2.0/5 → 5.0/5 🟢 | ✅ mtp-hot-reload; tools: skill | ✅ 0.07 | |
| mtp-hot-reload | Suggest launchSettings.json configuration for hot reload | 1.7/5 → 3.7/5 🟢 | ✅ mtp-hot-reload; tools: skill, create, bash / ✅ mtp-hot-reload; tools: skill, glob, create, bash | ✅ 0.07 | |
| mtp-hot-reload | Use dotnet run not dotnet test for hot reload | 3.0/5 → 3.3/5 🟢 | ✅ mtp-hot-reload; tools: skill | ✅ 0.07 | [9] |
| mtp-hot-reload | Negative: VSTest project cannot use MTP hot reload | 3.0/5 ⏰ → 3.0/5 ⏰ | ✅ mtp-hot-reload; tools: skill, edit, create | ✅ 0.07 | [10] |
| mtp-hot-reload | Run specific failing test with hot reload filter | 3.0/5 → 3.0/5 | ⚠️ NOT ACTIVATED | ✅ 0.07 | [11] |
| code-testing-agent | Generate tests for ContosoUniversity ASP.NET Core MVC app | 3.0/5 ⏰ → 3.0/5 ⏰ | ✅ code-testing-agent; tools: skill, edit / ✅ code-testing-agent; tools: skill, glob | ✅ 0.02 | |

[1] ⚠️ High run-to-run variance (CV=23.07) — consider re-running with --runs 5. (Isolated) Quality improved but weighted score is -10.0% due to: tokens (42724 → 251858), tool calls (4 → 19), time (28.6s → 132.3s)
[2] ⚠️ High run-to-run variance (CV=1.58) — consider re-running with --runs 5. (Isolated) Quality improved but weighted score is -11.6% due to: judgment, tokens (371107 → 800235), tool calls (22 → 35), time (256.0s → 369.6s)
[3] ⚠️ High run-to-run variance (CV=5.21) — consider re-running with --runs 5
[4] ⚠️ High run-to-run variance (CV=1.71) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -8.8% due to: tokens (30912 → 54629), tool calls (2 → 6), time (19.5s → 42.0s)
[5] ⚠️ High run-to-run variance (CV=1.38) — consider re-running with --runs 5
[6] ⚠️ High run-to-run variance (CV=10.48) — consider re-running with --runs 5
[7] ⚠️ High run-to-run variance (CV=1.10) — consider re-running with --runs 5
[8] (Plugin) Quality unchanged but weighted score is -15.0% due to: tokens (136892 → 514180), errors (0 → 1), tool calls (13 → 26), time (126.2s → 339.5s)
[9] ⚠️ High run-to-run variance (CV=0.57) — consider re-running with --runs 5
[10] ⚠️ High run-to-run variance (CV=1.05) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -0.8% due to: tokens (48271 → 241645), tool calls (6 → 13)
[11] (Isolated) Quality unchanged but weighted score is -2.5% due to: time (0ms → 2ms)
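
For readers unfamiliar with the variance warnings in the footnotes: a minimal sketch of one way to measure run-to-run variance, assuming "CV" stands for the coefficient of variation (sample standard deviation divided by the mean). The report does not say how its CV is computed or over which metric, so the function name and the per-run scores below are hypothetical and the numbers will not match the footnotes.

```python
from statistics import mean, stdev


def coefficient_of_variation(scores):
    """Sample standard deviation divided by the absolute mean of per-run scores.

    Assumption: this is the textbook definition; the validator's own
    formula is not shown in the report and may differ.
    """
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate variance")
    avg = mean(scores)
    if avg == 0:
        raise ValueError("mean is zero; CV is undefined")
    return stdev(scores) / abs(avg)


# Hypothetical per-run quality scores (not taken from the report).
stable_runs = [5.0, 5.0, 5.0]
noisy_runs = [1.0, 5.0, 3.0]

print(coefficient_of_variation(stable_runs))
print(coefficient_of_variation(noisy_runs))
```

Under this definition, identical runs give a CV of zero, and more runs (as the `--runs 5` suggestion implies) give a steadier estimate of the spread.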

⏰ timeout: run(s) hit the scenario timeout limit (300s, 420s, 480s, or 2400s); scoring may be impacted because model execution was aborted before it could produce its full output (increase the limit via timeout in eval.yaml)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

▶ Sessions Visualisation -- interactive replay of all evaluation sessions
