run-tests: add critical rules, detection guidance, and troubleshooting by Evangelink · Pull Request #514 · dotnet/skills

Evangelink · 2026-04-10T13:48:35Z

Add 'Critical Rules' table to prevent cross-platform VSTest/MTP mistakes
Expand Step 1 detection with explicit file-by-file lookup table
Add 'dotnet --version' as first detection action
Add Troubleshooting section with 7 common error patterns
Expand Common Pitfalls from 3 to 7 entries
Strengthen negative guidance for SDK version-specific syntax

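For readers skimming the thread, the SDK-specific rule that the evaluation scenarios exercise ("MTP project on SDK 9 must use -- separator for args", "MTP project on SDK 10 passes args directly") can be sketched roughly as below. This is only an illustrative sketch, not the skill's actual logic: the function names `parse_sdk_major` and `build_dotnet_test_command` and the `"vstest"` / `"mtp"` labels are hypothetical, and the version cut-off is inferred from the scenario names alone.

```python
# Hypothetical helper names; the rule itself is inferred from the scenario
# titles in this PR, not copied from the skill's source.

def parse_sdk_major(version_output: str) -> int:
    """Extract the major version from `dotnet --version` output, e.g. '9.0.203'."""
    return int(version_output.strip().split(".")[0])


def build_dotnet_test_command(sdk_major: int, platform: str, runner_args: list[str]) -> list[str]:
    """Build an argv list for `dotnet test`.

    platform is "vstest" or "mtp" (Microsoft.Testing.Platform).
    Assumption: MTP arguments need the `--` separator before SDK 10,
    and are passed directly from SDK 10 onwards.
    """
    command = ["dotnet", "test"]
    if not runner_args:
        return command
    if platform == "mtp" and sdk_major < 10:
        # SDK 9 scenario: forward MTP arguments after the `--` separator.
        return command + ["--"] + runner_args
    # VSTest, or MTP on SDK 10: arguments follow `dotnet test` directly.
    return command + runner_args


if __name__ == "__main__":
    sdk9 = parse_sdk_major("9.0.203\n")
    print(" ".join(build_dotnet_test_command(sdk9, "mtp", ["--report-trx"])))
    print(" ".join(build_dotnet_test_command(10, "mtp", ["--report-trx"])))
    print(" ".join(build_dotnet_test_command(sdk9, "vstest", ["--logger", "trx"])))
```

Running `dotnet --version` first, as the description proposes, is what makes the branch on `sdk_major` possible before any test command is composed.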
Evangelink · 2026-04-10T14:35:16Z

/evaluate

github-actions · 2026-04-10T14:50:31Z

Skill Validation Results

Skill	Scenario	Quality	Skills Loaded	Overfit	Verdict
run-tests	Run tests in a VSTest MSTest project	4.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill, glob	—	✅
run-tests	Run tests with trx reporting on MTP project (SDK 9)	3.0/5 → 3.0/5 ⏰	✅ run-tests; tools: skill, report_intent, bash, view, edit	—	❌ [1]
run-tests	Run tests with blame-hang on MTP project (SDK 10)	3.0/5 ⏰ → 3.0/5 ⏰	✅ run-tests; tools: skill, bash, edit, glob / ✅ run-tests; tools: skill, create, glob	—	✅ [2]
run-tests	Run tests in a multi-TFM project targeting a specific framework	2.0/5 → 4.3/5 🟢	✅ run-tests; tools: skill, bash, glob / ⚠️ NOT ACTIVATED	—	✅
run-tests	Filter MSTest tests by category on VSTest	5.0/5 → 5.0/5	✅ run-tests; tools: skill, glob, bash / ⚠️ NOT ACTIVATED	—	❌ [3]
run-tests	Filter NUnit tests by class name on VSTest	4.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill, bash / ⚠️ NOT ACTIVATED	—	✅ [4]
run-tests	Filter xUnit v3 tests by class on MTP	1.0/5 → 5.0/5 🟢	✅ run-tests; tools: report_intent, skill, view / ⚠️ NOT ACTIVATED	—	✅
run-tests	Filter xUnit v3 tests by trait on MTP	1.7/5 → 5.0/5 🟢	✅ run-tests; tools: skill / ✅ run-tests; tools: skill, glob	—	✅
run-tests	Filter TUnit tests by class using treenode-filter	3.0/5 ⏰ → 3.0/5 ⏰	✅ run-tests; tools: skill / ✅ run-tests; filter-syntax; tools: skill, bash, glob	—	❌ [5]
run-tests	Combine multiple filter criteria on VSTest MSTest	3.0/5 → 3.0/5	⚠️ NOT ACTIVATED / ✅ run-tests; tools: report_intent, view, skill	—	❌ [6]
run-tests	MTP project on SDK 9 must use -- separator for args	3.0/5 → 3.0/5	⚠️ NOT ACTIVATED / ✅ run-tests; tools: report_intent, skill, view, bash	—	❌ [7]
run-tests	MTP project on SDK 10 passes args directly	3.0/5 → 3.0/5	⚠️ NOT ACTIVATED	—	❌ [8]
run-tests	Detect test platform from Directory.Build.props	3.0/5 → 3.0/5	⚠️ NOT ACTIVATED / ✅ run-tests; tools: report_intent, view, skill, bash	—	❌ [9]
run-tests	Negative test: do not use MTP syntax for a VSTest project	3.0/5 → 3.0/5	⚠️ NOT ACTIVATED	—	❌ [10]

[1] ⚠️ High run-to-run variance (CV=1.18) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -14.5% due to: tokens (270805 → 598314), errors (0 → 1), time (160.4s → 327.9s), tool calls (15 → 27)
[2] ⚠️ High run-to-run variance (CV=2.28) — consider re-running with --runs 5
[3] ⚠️ High run-to-run variance (CV=1.00) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -1.6% due to: tokens (33321 → 41043)
[4] ⚠️ High run-to-run variance (CV=0.50) — consider re-running with --runs 5
[5] ⚠️ High run-to-run variance (CV=0.96) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -8.2% due to: errors (0 → 1), time (48.4s → 120.2s), tool calls (2 → 3)
[6] (Plugin) Quality unchanged but weighted score is -5.0% due to: tokens (0 → 45654), tool calls (0 → 3), time (0ms → 17.9s)
[7] ⚠️ High run-to-run variance (CV=1.73) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -2.5% due to: time (0ms → 1ms)
[8] ⚠️ High run-to-run variance (CV=0.87) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -2.5% due to: time (0ms → 1ms)
[9] ⚠️ High run-to-run variance (CV=0.87) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -2.5% due to: time (0ms → 1ms)
[10] (Plugin) Quality unchanged but weighted score is -5.0% due to: tokens (0 → 43072), tool calls (0 → 2), time (0ms → 16.6s)
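As a reading aid for the "High run-to-run variance (CV=...)" footnotes: assuming CV here means the usual coefficient of variation (standard deviation divided by the mean), it can be computed as in the sketch below. The report does not say which values are fed in or whether a sample or population deviation is used, so the population form and the input numbers are assumptions for illustration only.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values: list[float]) -> float:
    """Population standard deviation divided by the absolute mean.

    Assumption: this mirrors what the report calls CV; the evaluator's
    exact formula is not shown in this thread.
    """
    avg = mean(values)
    if avg == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return pstdev(values) / abs(avg)


if __name__ == "__main__":
    # Hypothetical per-run measurements, not taken from the report.
    runs = [2, 4, 4, 4, 5, 5, 7, 9]
    print(coefficient_of_variation(runs))
```

Under this definition the ratio grows without bound as the mean approaches zero, which would be one possible explanation for values far above 1 when the underlying quantity is a small difference between runs.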

⏰ timeout — run(s) hit the (120s, 300s, 360s) scenario timeout limit; scoring may be impacted by aborting model execution before it could produce its full output (increase via timeout in eval.yaml)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

▶ Sessions Visualisation -- interactive replay of all evaluation sessions

Evangelink · 2026-04-10T17:12:07Z

/evaluate

github-actions · 2026-04-10T17:25:49Z

Skill Validation Results

Skill	Scenario	Quality	Skills Loaded	Overfit	Verdict
run-tests	Run tests in a VSTest MSTest project	4.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill, glob / ✅ run-tests; tools: skill	✅ 0.17	✅
run-tests	Run tests with trx reporting on MTP project (SDK 9)	3.3/5 → 4.0/5 🟢	✅ run-tests; tools: skill, read_bash, report_intent, bash, view, edit / ✅ run-tests; tools: skill, report_intent, bash, view, edit	✅ 0.17	✅ [1]
run-tests	Run tests with blame-hang on MTP project (SDK 10)	2.0/5 → 2.7/5 ⏰ 🟢	✅ run-tests; tools: skill, bash, edit	✅ 0.17	✅ [2]
run-tests	Run tests in a multi-TFM project targeting a specific framework	2.0/5 → 4.0/5 🟢	✅ run-tests; tools: skill, bash, glob, read_bash / ⚠️ NOT ACTIVATED	✅ 0.17	✅
run-tests	Filter MSTest tests by category on VSTest	5.0/5 → 5.0/5	✅ run-tests; tools: skill, bash, glob / ⚠️ NOT ACTIVATED	✅ 0.17	✅ [3]
run-tests	Filter NUnit tests by class name on VSTest	4.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill, bash, glob / ⚠️ NOT ACTIVATED	✅ 0.17	✅
run-tests	Filter xUnit v3 tests by class on MTP	1.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill, view, bash / ⚠️ NOT ACTIVATED	✅ 0.17	✅ [4]
run-tests	Filter xUnit v3 tests by trait on MTP	1.7/5 ⏰ → 5.0/5 🟢	✅ run-tests; tools: skill / ✅ run-tests; tools: skill, glob	✅ 0.17	✅
run-tests	Filter TUnit tests by class using treenode-filter	2.7/5 → 4.3/5 🟢	✅ run-tests; tools: skill, bash	✅ 0.17	✅
run-tests	Combine multiple filter criteria on VSTest MSTest	4.7/5 → 4.7/5	✅ run-tests; tools: skill, bash, glob	✅ 0.17	✅ [5]
run-tests	MTP project on SDK 9 must use -- separator for args	1.3/5 → 5.0/5 🟢	✅ run-tests; tools: skill, glob	✅ 0.17	✅
run-tests	MTP project on SDK 10 passes args directly	3.3/5 → 4.0/5 🟢	✅ run-tests; tools: skill, glob	✅ 0.17	✅ [6]
run-tests	Detect test platform from Directory.Build.props	3.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill	✅ 0.17	✅ [7]
run-tests	Negative test: do not use MTP syntax for a VSTest project	4.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill, view / ✅ run-tests; tools: skill, view, glob	✅ 0.17	✅

[1] ⚠️ High run-to-run variance (CV=0.53) — consider re-running with --runs 5
[2] ⚠️ High run-to-run variance (CV=18.07) — consider re-running with --runs 5
[3] ⚠️ High run-to-run variance (CV=6.03) — consider re-running with --runs 5
[4] ⚠️ High run-to-run variance (CV=0.65) — consider re-running with --runs 5
[5] ⚠️ High run-to-run variance (CV=2.50) — consider re-running with --runs 5
[6] ⚠️ High run-to-run variance (CV=1.00) — consider re-running with --runs 5
[7] ⚠️ High run-to-run variance (CV=0.98) — consider re-running with --runs 5

⏰ timeout — run(s) hit the (120s, 300s) scenario timeout limit; scoring may be impacted by aborting model execution before it could produce its full output (increase via timeout in eval.yaml)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

Evangelink · 2026-04-12T16:13:26Z

/evaluate

github-actions · 2026-04-12T16:27:21Z

Skill Validation Results

Skill	Scenario	Quality	Skills Loaded	Overfit	Verdict
run-tests	Run tests in a VSTest MSTest project	4.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill, glob / ✅ run-tests; tools: skill	✅ 0.12	✅
run-tests	Run tests with trx reporting on MTP project (SDK 9)	3.0/5 ⏰ → 3.0/5 ⏰	✅ run-tests; tools: skill / ✅ run-tests; tools: skill, edit	✅ 0.12	❌ [1]
run-tests	Run tests with blame-hang on MTP project (SDK 10)	3.0/5 → 3.0/5 ⏰	✅ run-tests; tools: skill, bash, edit, glob / ✅ run-tests; tools: skill, bash, edit, create, glob	✅ 0.12	❌ [2]
run-tests	Run tests in a multi-TFM project targeting a specific framework	2.0/5 → 4.0/5 🟢	✅ run-tests; tools: skill, bash, read_bash, glob / ⚠️ NOT ACTIVATED	✅ 0.12	✅
run-tests	Filter MSTest tests by category on VSTest	5.0/5 → 5.0/5	✅ run-tests; tools: skill, bash, glob / ⚠️ NOT ACTIVATED	✅ 0.12	❌ [3]
run-tests	Filter NUnit tests by class name on VSTest	4.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill, bash, glob / ⚠️ NOT ACTIVATED	✅ 0.12	✅
run-tests	Filter xUnit v3 tests by class on MTP	1.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill, view, bash / ⚠️ NOT ACTIVATED	✅ 0.12	✅ [4]
run-tests	Filter xUnit v3 tests by trait on MTP	2.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill	✅ 0.12	✅
run-tests	Filter TUnit tests by class using treenode-filter	3.0/5 → 4.0/5 🟢	✅ run-tests; tools: skill, bash / ⚠️ NOT ACTIVATED	✅ 0.12	✅ [5]
run-tests	Combine multiple filter criteria on VSTest MSTest	3.0/5 → 3.7/5 🟢	✅ run-tests; tools: skill, bash, glob	✅ 0.12	✅ [6]
run-tests	MTP project on SDK 9 must use -- separator for args	3.0/5 → 3.0/5	⚠️ NOT ACTIVATED / ✅ run-tests; tools: report_intent, skill, bash, view	✅ 0.12	❌ [7]
run-tests	MTP project on SDK 10 passes args directly	3.0/5 → 3.0/5	⚠️ NOT ACTIVATED / ✅ run-tests; tools: report_intent, skill, view, bash, edit, glob	✅ 0.12	❌ [8]
run-tests	Detect test platform from Directory.Build.props	3.0/5 → 3.0/5	⚠️ NOT ACTIVATED / ✅ run-tests; tools: report_intent, skill, bash, view	✅ 0.12	❌ [9]
run-tests	Negative test: do not use MTP syntax for a VSTest project	3.0/5 → 3.0/5	⚠️ NOT ACTIVATED / ✅ run-tests; tools: report_intent, skill, bash, view, glob	✅ 0.12	❌ [10]

[1] ⚠️ High run-to-run variance (CV=1.26) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -1.5% due to: tokens (670569 → 813826)
[2] ⚠️ High run-to-run variance (CV=0.56) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -15.0% due to: tokens (174949 → 585135), errors (0 → 1), tool calls (12 → 28), time (107.9s → 291.7s)
[3] ⚠️ High run-to-run variance (CV=1.00) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -3.5% due to: tokens (35157 → 49920), tool calls (2 → 3)
[4] ⚠️ High run-to-run variance (CV=0.64) — consider re-running with --runs 5
[5] ⚠️ High run-to-run variance (CV=1.36) — consider re-running with --runs 5
[6] ⚠️ High run-to-run variance (CV=1.73) — consider re-running with --runs 5
[7] ⚠️ High run-to-run variance (CV=1.73) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -2.5% due to: time (0ms → 1ms)
[8] (Plugin) Quality unchanged but weighted score is -5.0% due to: tokens (0 → 723958), tool calls (0 → 31), time (0ms → 202.3s)
[9] (Isolated) Quality unchanged but weighted score is -2.5% due to: time (0ms → 2ms)
[10] (Plugin) Quality unchanged but weighted score is -5.0% due to: tokens (0 → 84580), tool calls (0 → 8), time (0ms → 30.2s)

⏰ timeout — run(s) hit the (300s, 360s) scenario timeout limit; scoring may be impacted by aborting model execution before it could produce its full output (increase via timeout in eval.yaml)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

Evangelink · 2026-04-12T17:01:15Z

/evaluate

github-actions · 2026-04-12T17:15:24Z

Skill Validation Results

Skill	Scenario	Quality	Skills Loaded	Overfit	Verdict
coverage-analysis	Project-wide coverage analysis with existing Cobertura data	2.3/5 → 4.3/5 🟢	✅ coverage-analysis; tools: skill, create, view	✅ 0.12	✅
coverage-analysis	Run coverage from scratch without existing data	3.7/5 → 5.0/5 🟢	✅ coverage-analysis; tools: skill, create / ✅ coverage-analysis; tools: skill, glob, create, task	✅ 0.12	✅
coverage-analysis	Coverage plateau diagnosis	3.0/5 → 4.0/5 🟢	✅ coverage-analysis; tools: skill, create	✅ 0.12	✅
run-tests	Run tests in a VSTest MSTest project	4.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill	—	✅ [1]
run-tests	Run tests with trx reporting on MTP project (SDK 9)	3.0/5 ⏰ → 3.0/5 ⏰	✅ run-tests; tools: skill, glob, read_bash, edit / ✅ run-tests; tools: skill, edit, stop_bash	—	❌ [2]
run-tests	Run tests with blame-hang on MTP project (SDK 10)	3.0/5 → 3.0/5 ⏰	✅ run-tests; tools: skill, edit, bash, glob	—	❌ [3]
run-tests	Run tests in a multi-TFM project targeting a specific framework	4.7/5 → 4.7/5	✅ run-tests; tools: skill, view, glob, bash / ⚠️ NOT ACTIVATED	—	❌ [4]
run-tests	Filter MSTest tests by category on VSTest	5.0/5 → 5.0/5	✅ run-tests; tools: skill, bash, glob, view / ✅ run-tests; tools: skill, view, bash	—	❌ [5]
run-tests	Filter NUnit tests by class name on VSTest	3.0/5 → 3.0/5	✅ run-tests; tools: skill, report_intent, view, bash / ✅ run-tests; tools: report_intent, skill, view, bash	—	✅ [6]
run-tests	Filter xUnit v3 tests by class on MTP	3.0/5 → 3.0/5	⚠️ NOT ACTIVATED	—	❌ [7]
run-tests	Filter xUnit v3 tests by trait on MTP	3.0/5 → 3.0/5	⚠️ NOT ACTIVATED / ✅ run-tests; tools: report_intent, skill, view, bash, glob	—	❌ [8]
run-tests	Filter TUnit tests by class using treenode-filter	3.0/5 → 3.0/5	⚠️ NOT ACTIVATED	—	❌ [9]
run-tests	Combine multiple filter criteria on VSTest MSTest	3.0/5 → 3.0/5	⚠️ NOT ACTIVATED / ✅ run-tests; tools: report_intent, skill, view, bash, glob	—	❌ [10]
run-tests	MTP project on SDK 9 must use -- separator for args	3.0/5 → 3.0/5	⚠️ NOT ACTIVATED / ✅ run-tests; tools: report_intent, skill, view, bash, glob	—	🟡
run-tests	MTP project on SDK 10 passes args directly	3.0/5 → 3.0/5	⚠️ NOT ACTIVATED / ✅ run-tests; tools: report_intent, skill, view, bash, edit	—	❌ [11]
run-tests	Detect test platform from Directory.Build.props	3.0/5 → 3.0/5	⚠️ NOT ACTIVATED / ✅ run-tests; tools: report_intent, skill, bash, view	—	🟡
run-tests	Negative test: do not use MTP syntax for a VSTest project	3.0/5 → 3.0/5	⚠️ NOT ACTIVATED / ✅ run-tests; tools: report_intent, skill, bash, view	—	❌ [12]
mtp-hot-reload	Suggest hot reload for failing test in MTP project (SDK 9)	3.0/5 → 3.0/5 ⏰	✅ mtp-hot-reload; tools: skill	✅ 0.14	❌ [13]
mtp-hot-reload	Suggest hot reload for failing test in MTP project (SDK 10)	1.0/5 → 3.7/5 🟢	✅ mtp-hot-reload; tools: skill, bash, create / ✅ mtp-hot-reload; platform-detection; tools: skill, bash, create	✅ 0.14	✅
mtp-hot-reload	Enable hot reload when package already installed	2.0/5 → 5.0/5 🟢	✅ mtp-hot-reload; tools: skill	✅ 0.14	✅
mtp-hot-reload	Suggest launchSettings.json configuration for hot reload	1.0/5 → 4.0/5 🟢	✅ mtp-hot-reload; tools: skill, bash, create / ✅ mtp-hot-reload; tools: skill, bash, create, glob	✅ 0.14	✅
mtp-hot-reload	Use dotnet run not dotnet test for hot reload	2.3/5 → 3.0/5 🟢	✅ mtp-hot-reload; tools: skill / ✅ mtp-hot-reload; tools: report_intent, skill	✅ 0.14	✅ [14]
mtp-hot-reload	Negative: VSTest project cannot use MTP hot reload	2.3/5 → 3.0/5 🟢	✅ mtp-hot-reload; tools: skill, create	✅ 0.14	✅ [15]
mtp-hot-reload	Run specific failing test with hot reload filter	1.0/5 → 5.0/5 🟢	✅ mtp-hot-reload; tools: skill	✅ 0.14	✅
code-testing-agent	Generate tests for ContosoUniversity ASP.NET Core MVC app	3.0/5 → 3.0/5	✅ code-testing-agent; tools: skill / ✅ code-testing-agent; tools: skill, glob	✅ 0.02	❌

[1] ⚠️ High run-to-run variance (CV=0.65) — consider re-running with --runs 5
[2] ⚠️ High run-to-run variance (CV=0.63) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -3.9% due to: tokens (219034 → 340404), tool calls (13 → 19)
[3] (Isolated) Quality unchanged but weighted score is -15.0% due to: tokens (50151 → 189977), errors (0 → 1), tool calls (5 → 15), time (41.7s → 300.3s)
[4] ⚠️ High run-to-run variance (CV=3.65) — consider re-running with --runs 5
[5] ⚠️ High run-to-run variance (CV=1.90) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -8.3% due to: tokens (30751 → 61808), tool calls (2 → 5), time (13.9s → 24.9s)
[6] ⚠️ High run-to-run variance (CV=92.16) — consider re-running with --runs 5
[7] (Plugin) Quality unchanged but weighted score is -5.0% due to: tokens (0 → 39526), tool calls (0 → 3), time (0ms → 18.2s)
[8] ⚠️ High run-to-run variance (CV=1.73) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -2.5% due to: time (0ms → 1ms)
[9] (Plugin) Quality unchanged but weighted score is -5.0% due to: tokens (0 → 39754), tool calls (0 → 4), time (0ms → 18.4s)
[10] ⚠️ High run-to-run variance (CV=1.00) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -2.5% due to: time (0ms → 1ms)
[11] ⚠️ High run-to-run variance (CV=1.00) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -2.5% due to: time (0ms → 1ms)
[12] (Plugin) Quality unchanged but weighted score is -5.0% due to: tokens (0 → 85289), tool calls (0 → 7), time (0ms → 31.9s)
[13] (Isolated) Quality unchanged but weighted score is -15.0% due to: tokens (148440 → 765178), errors (0 → 1), tool calls (14 → 39), time (117.9s → 360.2s)
[14] ⚠️ High run-to-run variance (CV=0.85) — consider re-running with --runs 5
[15] ⚠️ High run-to-run variance (CV=1.36) — consider re-running with --runs 5

⏰ timeout — run(s) hit the (300s, 360s) scenario timeout limit; scoring may be impacted by aborting model execution before it could produce its full output (increase via timeout in eval.yaml)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

Evangelink · 2026-04-13T08:17:40Z

/evaluate

github-actions · 2026-04-13T08:52:08Z

Skill Validation Results

Skill	Scenario	Quality	Skills Loaded	Overfit	Verdict
coverage-analysis	Project-wide coverage analysis with existing Cobertura data	2.0/5 → 3.7/5 🟢	✅ coverage-analysis; tools: skill, view, create	✅ 0.08	✅
coverage-analysis	Run coverage from scratch without existing data	3.7/5 → 5.0/5 🟢	✅ coverage-analysis; tools: skill, create, grep / ✅ coverage-analysis; tools: skill, create	✅ 0.08	✅
coverage-analysis	Coverage plateau diagnosis	3.0/5 → 4.0/5 🟢	✅ coverage-analysis; tools: skill, create, read_bash, stop_bash / ✅ coverage-analysis; tools: skill, create	✅ 0.08	✅ [1]
run-tests	Run tests in a VSTest MSTest project	4.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill, glob	✅ 0.10	✅
run-tests	Run tests with trx reporting on MTP project (SDK 9)	3.7/5 → 3.7/5 ⏰	✅ run-tests; tools: skill, glob / ✅ run-tests; tools: skill	✅ 0.10	✅ [2]
run-tests	Run tests with blame-hang on MTP project (SDK 10)	1.7/5 ⏰ → 2.3/5 ⏰ 🟢	✅ run-tests; tools: skill	✅ 0.10	✅
run-tests	Run tests in a multi-TFM project targeting a specific framework	4.7/5 → 4.3/5 🔴	✅ run-tests; tools: skill, view, glob, read_bash / ⚠️ NOT ACTIVATED	✅ 0.10	❌ [3]
run-tests	Filter MSTest tests by category on VSTest	5.0/5 → 5.0/5	✅ run-tests; tools: skill, bash, view / ✅ run-tests; tools: skill, bash, view, glob	✅ 0.10	✅ [4]
run-tests	Filter NUnit tests by class name on VSTest	3.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill, report_intent, view, bash, glob / ✅ run-tests; tools: report_intent, view, skill, bash	✅ 0.10	✅
run-tests	Filter xUnit v3 tests by class on MTP	1.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill, view, bash, report_intent, grep / ⚠️ NOT ACTIVATED	✅ 0.10	✅
run-tests	Filter xUnit v3 tests by trait on MTP	1.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill, view	✅ 0.10	✅
run-tests	Filter TUnit tests by class using treenode-filter	2.0/5 → 4.7/5 🟢	✅ run-tests; tools: skill, bash, glob / ✅ run-tests; tools: skill, bash	✅ 0.10	✅
run-tests	Combine multiple filter criteria on VSTest MSTest	4.7/5 → 4.3/5 🔴	✅ run-tests; tools: skill, bash, glob	✅ 0.10	✅ [5]
run-tests	MTP project on SDK 9 must use -- separator for args	1.7/5 → 5.0/5 🟢	✅ run-tests; tools: skill	✅ 0.10	✅
run-tests	MTP project on SDK 10 passes args directly	3.7/5 → 3.3/5 🔴	✅ run-tests; tools: skill / ✅ run-tests; tools: skill, glob	✅ 0.10	✅ [6]
run-tests	Detect test platform from Directory.Build.props	3.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill	✅ 0.10	✅
run-tests	Negative test: do not use MTP syntax for a VSTest project	4.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill, view	✅ 0.10	✅
mtp-hot-reload	Suggest hot reload for failing test in MTP project (SDK 9)	2.3/5 → 3.3/5 ⏰ 🟢	✅ mtp-hot-reload; tools: skill, read_bash / ✅ mtp-hot-reload; tools: skill	✅ 0.11	✅ [7]
mtp-hot-reload	Suggest hot reload for failing test in MTP project (SDK 10)	1.3/5 → 3.7/5 🟢	✅ mtp-hot-reload; tools: skill, bash, create	✅ 0.11	✅
mtp-hot-reload	Enable hot reload when package already installed	2.0/5 → 5.0/5 🟢	✅ mtp-hot-reload; tools: skill, glob / ✅ mtp-hot-reload; tools: skill	✅ 0.11	✅
mtp-hot-reload	Suggest launchSettings.json configuration for hot reload	1.0/5 ⏰ → 4.0/5 🟢	✅ mtp-hot-reload; tools: skill, create / ✅ mtp-hot-reload; tools: skill, glob, create	✅ 0.11	✅
mtp-hot-reload	Use dotnet run not dotnet test for hot reload	2.3/5 → 3.0/5 🟢	✅ mtp-hot-reload; tools: skill / ✅ mtp-hot-reload; tools: report_intent, skill	✅ 0.11	✅ [8]
mtp-hot-reload	Negative: VSTest project cannot use MTP hot reload	2.3/5 → 3.0/5 🟢	✅ mtp-hot-reload; tools: skill, create	✅ 0.11	✅ [9]
mtp-hot-reload	Run specific failing test with hot reload filter	1.0/5 → 5.0/5 🟢	✅ mtp-hot-reload; tools: skill, bash / ⚠️ NOT ACTIVATED	✅ 0.11	❌ [10]
code-testing-agent	Generate tests for ContosoUniversity ASP.NET Core MVC app	3.0/5 → 3.0/5	✅ code-testing-agent; tools: skill / ✅ code-testing-agent; tools: skill, task, read_agent, glob, grep	✅ 0.02	❌

[1] ⚠️ High run-to-run variance (CV=0.64) — consider re-running with --runs 5
[2] ⚠️ High run-to-run variance (CV=4.88) — consider re-running with --runs 5
[3] ⚠️ High run-to-run variance (CV=1.53) — consider re-running with --runs 5
[4] ⚠️ High run-to-run variance (CV=6.53) — consider re-running with --runs 5
[5] ⚠️ High run-to-run variance (CV=7.42) — consider re-running with --runs 5
[6] ⚠️ High run-to-run variance (CV=1.22) — consider re-running with --runs 5
[7] ⚠️ High run-to-run variance (CV=3.65) — consider re-running with --runs 5
[8] ⚠️ High run-to-run variance (CV=0.71) — consider re-running with --runs 5
[9] ⚠️ High run-to-run variance (CV=2.83) — consider re-running with --runs 5
[10] (Plugin) Quality unchanged but weighted score is -43.0% due to: judgment, quality, errors (0 → 1)

⏰ timeout — run(s) hit the (180s, 300s, 360s, 1800s) scenario timeout limit; scoring may be impacted by aborting model execution before it could produce its full output (increase via timeout in eval.yaml)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

Evangelink · 2026-04-14T10:31:05Z

/evaluate

github-actions · 2026-04-14T11:03:45Z

Skill Validation Results

Skill	Scenario	Quality	Skills Loaded	Overfit	Verdict
coverage-analysis	Project-wide coverage analysis with existing Cobertura data	2.0/5 → 4.0/5 🟢	✅ coverage-analysis; tools: skill, view, read_bash, stop_bash, create / ✅ coverage-analysis; tools: skill, view, create, read_bash	✅ 0.10	✅
coverage-analysis	Run coverage from scratch without existing data	3.7/5 → 5.0/5 🟢	✅ coverage-analysis; tools: skill, task, read_agent, create / ✅ coverage-analysis; tools: skill, create	✅ 0.10	✅
coverage-analysis	Coverage plateau diagnosis	3.0/5 → 4.3/5 🟢	✅ coverage-analysis; tools: skill, create, bash	✅ 0.10	✅
run-tests	Run tests in a VSTest MSTest project	4.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill, glob	✅ 0.12	✅
run-tests	Run tests with trx reporting on MTP project (SDK 9)	3.0/5 ⏰ → 3.0/5 ⏰	✅ run-tests; tools: skill, glob, edit	✅ 0.12	❌ [1]
run-tests	Run tests with blame-hang on MTP project (SDK 10)	3.0/5 ⏰ → 3.0/5 ⏰	✅ run-tests; tools: skill	✅ 0.12	✅ [2]
run-tests	Run tests in a multi-TFM project targeting a specific framework	4.7/5 → 4.7/5	✅ run-tests; tools: skill, glob, view / ⚠️ NOT ACTIVATED	✅ 0.12	❌ [3]
run-tests	Filter MSTest tests by category on VSTest	5.0/5 → 5.0/5	✅ run-tests; tools: skill, bash, glob, view	✅ 0.12	❌ [4]
run-tests	Filter NUnit tests by class name on VSTest	3.3/5 → 5.0/5 🟢	✅ run-tests; tools: skill, report_intent, view, bash / ⚠️ NOT ACTIVATED	✅ 0.12	✅
run-tests	Filter xUnit v3 tests by class on MTP	1.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill, view, report_intent, bash / ⚠️ NOT ACTIVATED	✅ 0.12	✅
run-tests	Filter xUnit v3 tests by trait on MTP	1.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill, view	✅ 0.12	✅
run-tests	Filter TUnit tests by class using treenode-filter	3.0/5 → 3.0/5	✅ run-tests; tools: skill, bash / ✅ run-tests; tools: skill, bash, glob	✅ 0.12	✅
run-tests	Combine multiple filter criteria on VSTest MSTest	3.0/5 ⏰ → 3.0/5 ⏰	✅ run-tests; tools: skill, bash, glob	✅ 0.12	❌ [5]
run-tests	MTP project on SDK 9 must use -- separator for args	3.0/5 → 3.0/5	⚠️ NOT ACTIVATED / ✅ run-tests; tools: report_intent, skill, bash, view, glob	✅ 0.12	🟡 [6]
run-tests	MTP project on SDK 10 passes args directly	3.0/5 → 3.0/5	⚠️ NOT ACTIVATED / ✅ run-tests; tools: report_intent, skill, view, bash, glob, edit	✅ 0.12	🟡 [7]
run-tests	Detect test platform from Directory.Build.props	3.0/5 → 3.0/5	⚠️ NOT ACTIVATED / ✅ run-tests; tools: report_intent, skill, bash, view	✅ 0.12	❌ [8]
run-tests	Negative test: do not use MTP syntax for a VSTest project	3.0/5 → 3.0/5	⚠️ NOT ACTIVATED / ✅ run-tests; tools: report_intent, skill, bash, view, glob	✅ 0.12	❌ [9]
mtp-hot-reload	Suggest hot reload for failing test in MTP project (SDK 9)	3.0/5 → 3.0/5 ⏰	✅ mtp-hot-reload; tools: skill	✅ 0.11	❌ [10]
mtp-hot-reload	Suggest hot reload for failing test in MTP project (SDK 10)	2.3/5 ⏰ → 3.7/5 🟢	✅ mtp-hot-reload; tools: skill, create, bash	✅ 0.11	✅ [11]
mtp-hot-reload	Enable hot reload when package already installed	2.3/5 ⏰ → 4.3/5 🟢	✅ mtp-hot-reload; tools: skill, glob	✅ 0.11	✅ [12]
mtp-hot-reload	Suggest launchSettings.json configuration for hot reload	1.0/5 → 4.0/5 🟢	✅ mtp-hot-reload; tools: skill, create, glob, bash	✅ 0.11	✅
mtp-hot-reload	Use dotnet run not dotnet test for hot reload	1.7/5 → 3.0/5 🟢	✅ mtp-hot-reload; tools: skill / ✅ mtp-hot-reload; tools: skill, report_intent	✅ 0.11	✅ [13]
mtp-hot-reload	Negative: VSTest project cannot use MTP hot reload	3.0/5 ⏰ → 3.0/5	✅ mtp-hot-reload; tools: skill, create	✅ 0.11	✅ [14]
mtp-hot-reload	Run specific failing test with hot reload filter	3.0/5 → 3.0/5 ⏰	✅ mtp-hot-reload; tools: skill / ✅ mtp-hot-reload; tools: skill, bash	✅ 0.11	❌ [15]
code-testing-agent	Generate tests for ContosoUniversity ASP.NET Core MVC app	3.0/5 ⏰ → 3.0/5 ⏰	✅ code-testing-agent; tools: skill, bash, create / ✅ code-testing-agent; tools: skill, bash, glob, create, edit	✅ 0.02	✅ [16]

[1] ⚠️ High run-to-run variance (CV=1.58) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -16.9% due to: completion (✓ → ✗), tokens (600667 → 745380)
[2] ⚠️ High run-to-run variance (CV=2.12) — consider re-running with --runs 5
[3] ⚠️ High run-to-run variance (CV=2.10) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -25.1% due to: judgment, quality, tokens (39477 → 83424), tool calls (3 → 7)
[4] ⚠️ High run-to-run variance (CV=4.30) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -7.5% due to: tokens (30807 → 63873), tool calls (2 → 6), time (22.1s → 32.9s)
[5] (Isolated) Quality unchanged but weighted score is -10.8% due to: errors (0 → 1), tool calls (2 → 6), time (68.9s → 180.1s), tokens (26666 → 30760)
[6] ⚠️ High run-to-run variance (CV=1.73) — consider re-running with --runs 5
[7] ⚠️ High run-to-run variance (CV=0.87) — consider re-running with --runs 5
[8] (Isolated) Quality unchanged but weighted score is -2.5% due to: time (0ms → 1ms)
[9] (Plugin) Quality unchanged but weighted score is -5.0% due to: tokens (0 → 85141), tool calls (0 → 8), time (0ms → 32.9s)
[10] (Isolated) Quality unchanged but weighted score is -15.0% due to: tokens (183856 → 569277), errors (0 → 1), tool calls (13 → 31), time (137.8s → 360.1s)
[11] ⚠️ High run-to-run variance (CV=0.93) — consider re-running with --runs 5
[12] ⚠️ High run-to-run variance (CV=0.52) — consider re-running with --runs 5
[13] ⚠️ High run-to-run variance (CV=0.55) — consider re-running with --runs 5
[14] ⚠️ High run-to-run variance (CV=1.45) — consider re-running with --runs 5
[15] (Isolated) Quality unchanged but weighted score is -11.7% due to: errors (0 → 1), tool calls (2 → 5), time (14.0s → 126.5s), tokens (24861 → 33276)
[16] ⚠️ High run-to-run variance (CV=11.74) — consider re-running with --runs 5

⏰ timeout — run(s) hit the (180s, 240s, 300s, 360s, 1800s) scenario timeout limit; scoring may be impacted by aborting model execution before it could produce its full output (increase via timeout in eval.yaml)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

- run-tests: trx reporting MTP SDK 9 (360s -> 480s)
- run-tests: blame-hang MTP SDK 10 (300s -> 420s)
- run-tests: combine filter criteria VSTest (180s -> 300s)
- mtp-hot-reload: hot reload SDK 9 (360s -> 480s)
- mtp-hot-reload: hot reload filter (180s -> 300s)
- code-testing-agent: ContosoUniversity (1800s -> 2400s)

Evangelink · 2026-04-14T16:30:25Z

/evaluate

github-actions · 2026-04-14T17:13:18Z

Skill Validation Results

Skill	Scenario	Quality	Skills Loaded	Overfit	Verdict
coverage-analysis	Project-wide coverage analysis with existing Cobertura data	2.3/5 → 3.7/5 🟢	✅ coverage-analysis; tools: skill, bash, create, view	✅ 0.14	✅
coverage-analysis	Run coverage from scratch without existing data	3.3/5 → 5.0/5 🟢	✅ coverage-analysis; tools: skill, create / ✅ coverage-analysis; tools: skill, create, grep	✅ 0.14	✅
coverage-analysis	Coverage plateau diagnosis	3.3/5 → 4.7/5 🟢	✅ coverage-analysis; tools: skill, create, bash	✅ 0.14	❌ [1]
run-tests	Run tests in a VSTest MSTest project	4.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill / ✅ run-tests; tools: skill, glob	✅ 0.14	✅
run-tests	Run tests with trx reporting on MTP project (SDK 9)	3.3/5 → 3.7/5 ⏰ 🟢	✅ run-tests; tools: skill, read_bash, stop_bash, glob / ✅ run-tests; tools: skill, glob	✅ 0.14	✅
run-tests	Run tests with blame-hang on MTP project (SDK 10)	2.3/5 → 2.7/5 ⏰ 🟢	✅ run-tests; tools: skill, glob / ✅ run-tests; tools: skill, read_bash	✅ 0.14	❌ [2]
run-tests	Run tests in a multi-TFM project targeting a specific framework	4.3/5 → 4.7/5 🟢	✅ run-tests; tools: skill, view, glob / ⚠️ NOT ACTIVATED	✅ 0.14	✅ [3]
run-tests	Filter MSTest tests by category on VSTest	5.0/5 → 5.0/5	✅ run-tests; tools: skill, bash, glob, view / ✅ run-tests; tools: skill, bash, view, glob	✅ 0.14	❌ [4]
run-tests	Filter NUnit tests by class name on VSTest	3.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill, report_intent, view, bash / ✅ run-tests; tools: report_intent, view, skill, bash	✅ 0.14	✅
run-tests	Filter xUnit v3 tests by class on MTP	1.0/5 → 5.0/5 🟢	✅ run-tests; tools: report_intent, skill, view, bash	✅ 0.14	✅ [5]
run-tests	Filter xUnit v3 tests by trait on MTP	1.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill, view	✅ 0.14	✅
run-tests	Filter TUnit tests by class using treenode-filter	3.0/5 → 4.7/5 🟢	✅ run-tests; tools: skill, bash / ✅ run-tests; filter-syntax; tools: skill, bash	✅ 0.14	✅
run-tests	Combine multiple filter criteria on VSTest MSTest	4.7/5 → 4.3/5 🔴	✅ run-tests; tools: skill, bash, glob	✅ 0.14	✅ [6]
run-tests	MTP project on SDK 9 must use -- separator for args	2.3/5 → 5.0/5 🟢	✅ run-tests; tools: skill / ✅ run-tests; tools: skill, glob	✅ 0.14	✅
run-tests	MTP project on SDK 10 passes args directly	3.7/5 → 3.0/5 ⏰ 🔴	✅ run-tests; tools: skill, glob	✅ 0.14	✅ [7]
run-tests	Detect test platform from Directory.Build.props	3.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill	✅ 0.14	✅
run-tests	Negative test: do not use MTP syntax for a VSTest project	4.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill, view, glob / ✅ run-tests; tools: skill, view	✅ 0.14	✅
mtp-hot-reload	Suggest hot reload for failing test in MTP project (SDK 9)	3.0/5 → 3.0/5 ⏰	✅ mtp-hot-reload; tools: skill	✅ 0.07	❌ [8]
mtp-hot-reload	Suggest hot reload for failing test in MTP project (SDK 10)	1.0/5 → 4.3/5 🟢	✅ mtp-hot-reload; tools: skill, bash, create	✅ 0.07	✅
mtp-hot-reload	Enable hot reload when package already installed	2.0/5 → 5.0/5 🟢	✅ mtp-hot-reload; tools: skill	✅ 0.07	✅
mtp-hot-reload	Suggest launchSettings.json configuration for hot reload	1.7/5 → 3.7/5 🟢	✅ mtp-hot-reload; tools: skill, create, bash / ✅ mtp-hot-reload; tools: skill, glob, create, bash	✅ 0.07	✅
mtp-hot-reload	Use dotnet run not dotnet test for hot reload	3.0/5 → 3.3/5 🟢	✅ mtp-hot-reload; tools: skill	✅ 0.07	✅ [9]
mtp-hot-reload	Negative: VSTest project cannot use MTP hot reload	3.0/5 ⏰ → 3.0/5 ⏰	✅ mtp-hot-reload; tools: skill, edit, create	✅ 0.07	❌ [10]
mtp-hot-reload	Run specific failing test with hot reload filter	3.0/5 → 3.0/5	⚠️ NOT ACTIVATED	✅ 0.07	❌ [11]
code-testing-agent	Generate tests for ContosoUniversity ASP.NET Core MVC app	3.0/5 ⏰ → 3.0/5 ⏰	✅ code-testing-agent; tools: skill, edit / ✅ code-testing-agent; tools: skill, glob	✅ 0.02	✅

[1] ⚠️ High run-to-run variance (CV=23.07) — consider re-running with --runs 5. (Isolated) Quality improved but weighted score is -10.0% due to: tokens (42724 → 251858), tool calls (4 → 19), time (28.6s → 132.3s)
[2] ⚠️ High run-to-run variance (CV=1.58) — consider re-running with --runs 5. (Isolated) Quality improved but weighted score is -11.6% due to: judgment, tokens (371107 → 800235), tool calls (22 → 35), time (256.0s → 369.6s)
[3] ⚠️ High run-to-run variance (CV=5.21) — consider re-running with --runs 5
[4] ⚠️ High run-to-run variance (CV=1.71) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -8.8% due to: tokens (30912 → 54629), tool calls (2 → 6), time (19.5s → 42.0s)
[5] ⚠️ High run-to-run variance (CV=1.38) — consider re-running with --runs 5
[6] ⚠️ High run-to-run variance (CV=10.48) — consider re-running with --runs 5
[7] ⚠️ High run-to-run variance (CV=1.10) — consider re-running with --runs 5
[8] (Plugin) Quality unchanged but weighted score is -15.0% due to: tokens (136892 → 514180), errors (0 → 1), tool calls (13 → 26), time (126.2s → 339.5s)
[9] ⚠️ High run-to-run variance (CV=0.57) — consider re-running with --runs 5
[10] ⚠️ High run-to-run variance (CV=1.05) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -0.8% due to: tokens (48271 → 241645), tool calls (6 → 13)
[11] (Isolated) Quality unchanged but weighted score is -2.5% due to: time (0ms → 2ms)
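
The footnotes above flag several scenarios for high run-to-run variance using a `CV` value. The report does not say how that value is computed, so the following is only a minimal sketch of the textbook coefficient of variation (sample standard deviation divided by the mean). The function name and the sample scores are hypothetical and are not taken from the evaluation tool or its data.

```python
from statistics import mean, stdev


def coefficient_of_variation(scores):
    """Return sample standard deviation divided by the mean of the scores."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to measure variance")
    avg = mean(scores)
    if avg == 0:
        raise ValueError("mean is zero; coefficient of variation is undefined")
    return stdev(scores) / avg


# Hypothetical per-run quality scores (assumption, not from the report).
runs = [3.0, 4.0, 5.0]
print(f"CV = {coefficient_of_variation(runs):.2f}")
```

Under this definition, adding runs (for example with `--runs 5`, as the footnotes suggest) gives a more stable estimate of the spread; whether the validator uses this exact formula is an assumption.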

⏰ timeout — run(s) hit the (300s, 420s, 480s, 2400s) scenario timeout limit; scoring may be impacted by aborting model execution before it could produce its full output (increase via timeout in eval.yaml)
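
Two scenarios in the table above ("MTP project on SDK 9 must use -- separator for args" and "MTP project on SDK 10 passes args directly") differ only in how Microsoft.Testing.Platform arguments reach `dotnet test`. A rough sketch of that rule follows. The helper name is hypothetical, `--report-trx` is used only as an example argument, and the sketch assumes the SDK 10 project is already configured to run on MTP.

```python
def build_mtp_test_command(sdk_major, mtp_args):
    """Build a `dotnet test` argument list for an MTP project.

    Assumption: SDK 10 and later accept MTP arguments directly, while
    earlier SDKs need them after a `--` separator, as the scenario names
    in the table state.
    """
    base = ["dotnet", "test"]
    if sdk_major >= 10:
        return base + list(mtp_args)
    return base + ["--"] + list(mtp_args)


# Example argument only; the real flags depend on the installed extensions.
print(" ".join(build_mtp_test_command(9, ["--report-trx"])))
print(" ".join(build_mtp_test_command(10, ["--report-trx"])))
```

The major version would typically come from `dotnet --version`, which the PR description lists as the first detection action.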

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

▶ Sessions Visualisation -- interactive replay of all evaluation sessions

github-actions bot added a commit that referenced this pull request Apr 10, 2026

Update session data (PR #514)

007e5b7

Evangelink added 2 commits April 10, 2026 17:28

Improve description

7226d25

Improve evaluation

3cf9713

Add expected and rejected tools

f223c4e

github-actions bot added a commit that referenced this pull request Apr 12, 2026

Update session data (PR #514)

c0dae44

Evangelink added 2 commits April 12, 2026 18:48

Change

e5c56d6

Improve do not use sections

c7e93fa

github-actions bot added a commit that referenced this pull request Apr 12, 2026

Update PR token usage data (PR #514)

d318eeb

github-actions bot added a commit that referenced this pull request Apr 12, 2026

Update session data (PR #514)

eb9d330

Improve prompt

7ab5c54

github-actions bot added a commit that referenced this pull request Apr 13, 2026

Update PR token usage data (PR #514)

9f9a709

github-actions bot added a commit that referenced this pull request Apr 14, 2026

Update PR token usage data (PR #514)

30b44eb

github-actions bot added a commit that referenced this pull request Apr 14, 2026

Update PR token usage data (PR #514)

d0ee948

github-actions bot added a commit that referenced this pull request Apr 14, 2026

Update session data (PR #514)

dd2d2d9

Conversation

Evangelink commented Apr 10, 2026

Evangelink commented Apr 10, 2026

github-actions bot commented Apr 10, 2026

Skill Validation Results

Evangelink commented Apr 10, 2026

github-actions bot commented Apr 10, 2026

Skill Validation Results

Evangelink commented Apr 12, 2026

github-actions bot commented Apr 12, 2026

Skill Validation Results

Evangelink commented Apr 12, 2026

github-actions bot commented Apr 12, 2026

Skill Validation Results

Evangelink commented Apr 13, 2026

github-actions bot commented Apr 13, 2026

Skill Validation Results

Evangelink commented Apr 14, 2026

github-actions bot commented Apr 14, 2026

Skill Validation Results

Evangelink commented Apr 14, 2026

github-actions bot commented Apr 14, 2026

Skill Validation Results

1 participant