Fix the repo path traversal check #525

Merged
JanKrivanek merged 1 commit into main from dev/jankrivanek/fix-repo-refs-check on Apr 14, 2026

@JanKrivanek
Member

Follow-up to #524

Motivation

When we allow traversal of paths outside the skill folder (for repo-specific skills that are not meant to be shareable), we should not check the depth of a referenced path unless it is inside the current skill folder. Otherwise the depth check would trigger on those external references.

Copilot AI review requested due to automatic review settings April 14, 2026 15:32
@JanKrivanek
Member Author

/evaluate

Contributor

Copilot AI left a comment

Pull request overview

This PR adjusts the skill-validator’s file reference depth/traversal checks so that, when repo traversal is explicitly allowed, parent-directory (..) references are no longer subjected to the “max 1 directory deep” rule, which is intended to keep portable skills self-contained.

Changes:

  • Updates SkillProfiler to suppress depth checking for references that include .. when AllowRepoTraversal is enabled.
  • Updates/extends unit tests to distinguish between deep external refs (allowed with repo traversal) vs deep internal refs (still rejected).
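
The rule described in this overview can be sketched roughly as follows. This is an illustrative, self-contained sketch and not the code in SkillProfiler.cs: the class name `RefDepthSketch`, the method `Check`, the constant `MaxDirectoryDepth`, and the error message wording are all hypothetical. It also assumes the reference is normalized before classification, which the actual change may or may not do.

```csharp
#nullable enable
using System;
using System.IO;

// Illustrative sketch only; names and messages are hypothetical.
public static class RefDepthSketch
{
    // Hypothetical constant mirroring the "max 1 directory deep" rule.
    public const int MaxDirectoryDepth = 1;

    // Returns an error message, or null when the reference is acceptable.
    public static string? Check(string skillDir, string reference, bool allowRepoTraversal)
    {
        string root = Path.GetFullPath(skillDir);
        // Normalize first so that ".." segments cannot hide a deep internal path.
        string target = Path.GetFullPath(Path.Combine(root, reference));
        string relative = Path.GetRelativePath(root, target);

        bool outsideSkill =
            relative == ".." ||
            relative.StartsWith(".." + Path.DirectorySeparatorChar, StringComparison.Ordinal) ||
            Path.IsPathRooted(relative); // e.g. a different drive on Windows

        if (outsideSkill)
        {
            // External references are a traversal error only when the option is off;
            // the depth rule is never applied to them.
            return allowRepoTraversal
                ? null
                : $"Reference '{reference}' uses path traversal outside the skill folder.";
        }

        // Count the directory levels between the skill root and the file itself.
        int depth = relative.Split(Path.DirectorySeparatorChar).Length - 1;
        return depth > MaxDirectoryDepth
            ? $"Reference '{reference}' is {depth} directories deep (max {MaxDirectoryDepth})."
            : null;
    }
}
```

Under these assumptions, `Check(skillDir, "../../../documentation/guides/setup.md", true)` returns null, while `refs/utils/foo/readme.md` is rejected by the depth rule whether or not repo traversal is allowed. Because the path is normalized before it is classified, `references/../refs/utils/foo/readme.md` is treated as an internal reference and rejected as well.
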
| File | Description |
| --- | --- |
| eng/skill-validator/src/Check/SkillProfiler.cs | Changes traversal/depth validation flow to skip depth checks for .. refs when repo traversal is allowed. |
| eng/skill-validator/tests/Check/SkillProfileTests.cs | Updates and adds tests for the updated repo-traversal behavior. |

Copilot's findings

Comments suppressed due to low confidence (1)

eng/skill-validator/tests/Check/SkillProfileTests.cs:339

  • The new coverage verifies that ..-based external refs don't trigger depth errors, but it doesn't cover the important edge case where a path contains .. yet still resolves inside the skill directory (e.g. references/../refs/utils/foo/readme.md). Add a test that expects the depth error in that scenario (with AllowRepoTraversal = true) so the implementation can't accidentally allow deep internal refs by inserting .. segments.
    [Fact]
    public void AllowRepoTraversalAllowsDeepExternalRefs()
    {
        var content = "---\nname: test-skill\n---\n# Title\n1. Step\n```bash\necho\n```\nSee [ref](../../../documentation/guides/setup.md)\n" + new string('x', 4000);
        var options = new CheckOptions { AllowRepoTraversal = true };
        var profile = SkillProfiler.AnalyzeSkill(MakeSkill(content), options);
        Assert.DoesNotContain(profile.Errors, e => e.Contains("traversal") || e.Contains("directories deep"));
    }

    [Fact]
    public void AllowRepoTraversalStillChecksDepthForInternalRefs()
    {
        var content = "---\nname: test-skill\n---\n# Title\n1. Step\n```bash\necho\n```\nSee [ref](refs/utils/foo/readme.md)\n" + new string('x', 4000);
        var options = new CheckOptions { AllowRepoTraversal = true };
        var profile = SkillProfiler.AnalyzeSkill(MakeSkill(content), options);
        Assert.Contains(profile.Errors, e => e.Contains("directories deep"));
    }
  • Files reviewed: 2/2 changed files
  • Comments generated: 1

@JanKrivanek
Member Author

/evaluate

@JanKrivanek JanKrivanek merged commit e6f4168 into main Apr 14, 2026
41 checks passed
@JanKrivanek JanKrivanek deleted the dev/jankrivanek/fix-repo-refs-check branch April 14, 2026 16:52
github-actions bot added a commit that referenced this pull request Apr 14, 2026
github-actions bot added a commit that referenced this pull request Apr 14, 2026
@github-actions
Contributor

Skill Validation Results

Skill Scenario Quality Skills Loaded Overfit Verdict
coverage-analysis Project-wide coverage analysis with existing Cobertura data 3.0/5 → 5.0/5 🟢 ✅ coverage-analysis; tools: skill, view, create ✅ 0.09
coverage-analysis Run coverage from scratch without existing data 4.0/5 → 5.0/5 🟢 ✅ coverage-analysis; tools: skill ✅ 0.09
coverage-analysis Coverage plateau diagnosis 3.0/5 → 4.0/5 🟢 ✅ coverage-analysis; tools: skill / ✅ coverage-analysis; tools: skill, create ✅ 0.09
migrate-vstest-to-mtp Migrate MSTest project from VSTest to Microsoft.Testing.Platform 5.0/5 → 5.0/5 ✅ migrate-vstest-to-mtp; tools: skill / ✅ migrate-vstest-to-mtp; tools: report_intent, skill ✅ 0.10 [1]
migrate-vstest-to-mtp Migrate NUnit project from VSTest to Microsoft.Testing.Platform 2.0/5 → 5.0/5 🟢 ✅ migrate-vstest-to-mtp; tools: skill ✅ 0.10
migrate-vstest-to-mtp Migrate xUnit.net v2 project from VSTest to Microsoft.Testing.Platform 2.0/5 → 4.0/5 🟢 ✅ migrate-vstest-to-mtp; tools: skill, report_intent, view, bash / ✅ migrate-vstest-to-mtp; tools: skill ✅ 0.10
migrate-vstest-to-mtp Update Azure DevOps pipeline from VSTest task to MTP 2.0/5 → 5.0/5 🟢 ✅ migrate-vstest-to-mtp; tools: skill / ✅ migrate-vstest-to-mtp; tools: report_intent, skill ✅ 0.10
migrate-vstest-to-mtp Migrate MSTest.Sdk project that explicitly uses VSTest 3.0/5 → 5.0/5 🟢 ✅ migrate-vstest-to-mtp; tools: skill ✅ 0.10
migrate-vstest-to-mtp Translate dotnet test VSTest arguments to MTP equivalents 3.0/5 → 5.0/5 🟢 ✅ migrate-vstest-to-mtp; tools: skill / ✅ migrate-vstest-to-mtp; tools: report_intent, skill ✅ 0.10
migrate-vstest-to-mtp Handle exit code 8 when migrating from VSTest to MTP 2.0/5 → 5.0/5 🟢 ✅ migrate-vstest-to-mtp; tools: skill / ⚠️ NOT ACTIVATED ✅ 0.10
migrate-vstest-to-mtp Configure dotnet test MTP mode on .NET 10 SDK 2.0/5 → 5.0/5 🟢 ✅ migrate-vstest-to-mtp; tools: skill ✅ 0.10
migrate-vstest-to-mtp Migrate xUnit.net VSTest filter syntax to MTP 1.0/5 → 5.0/5 🟢 ✅ migrate-vstest-to-mtp; tools: skill / ⚠️ NOT ACTIVATED ✅ 0.10
migrate-vstest-to-mtp Full VSTest to MTP migration plan for MSTest solution 4.0/5 → 5.0/5 🟢 ✅ migrate-vstest-to-mtp; tools: skill ✅ 0.10
test-anti-patterns Detect mixed severity anti-patterns in repository service tests 5.0/5 → 5.0/5 ✅ test-anti-patterns; tools: skill, report_intent / ⚠️ NOT ACTIVATED ✅ 0.05
test-anti-patterns Detect flakiness indicators and test coupling 3.0/5 → 5.0/5 🟢 ✅ test-anti-patterns; tools: skill, report_intent / ⚠️ NOT ACTIVATED ✅ 0.05 [2]
test-anti-patterns Detect duplicated tests and magic values 3.0/5 → 5.0/5 🟢 ✅ test-anti-patterns; tools: report_intent, skill ✅ 0.05
test-anti-patterns Recognize well-written tests without inventing false positives 2.0/5 → 5.0/5 🟢 ✅ test-anti-patterns; tools: report_intent, skill ✅ 0.05
migrate-static-to-wrapper Migrate DateTime.UtcNow to TimeProvider in a service class 5.0/5 → 5.0/5 ✅ migrate-static-to-wrapper; tools: skill, bash / ✅ migrate-static-to-wrapper; tools: skill ✅ 0.07 [3]
migrate-static-to-wrapper Migrate only in scoped files, leaving others untouched 5.0/5 → 5.0/5 ✅ migrate-static-to-wrapper; tools: skill, bash / ⚠️ NOT ACTIVATED ✅ 0.07 [4]
migrate-static-to-wrapper Decline migration when wrapper does not exist yet 4.0/5 → 5.0/5 🟢 ✅ migrate-static-to-wrapper; tools: skill / ✅ migrate-static-to-wrapper; tools: skill, glob ✅ 0.07 [5]
run-tests Run tests in a VSTest MSTest project 4.0/5 → 4.0/5 ✅ run-tests; tools: skill, glob / ✅ run-tests; tools: skill ✅ 0.19
run-tests Run tests with trx reporting on MTP project (SDK 9) 4.0/5 → 4.0/5 ✅ run-tests; tools: skill / ✅ run-tests; tools: skill, glob ✅ 0.19
run-tests Run tests with blame-hang on MTP project (SDK 10) 1.0/5 ⏰ → 2.0/5 🟢 ✅ run-tests; tools: skill ✅ 0.19
run-tests Run tests in a multi-TFM project targeting a specific framework 2.0/5 → 4.0/5 🟢 ⚠️ NOT ACTIVATED ✅ 0.19
run-tests Filter MSTest tests by category on VSTest 5.0/5 → 5.0/5 ⚠️ NOT ACTIVATED ✅ 0.19 [6]
run-tests Filter NUnit tests by class name on VSTest 4.0/5 → 5.0/5 🟢 ✅ run-tests; tools: skill, glob / ⚠️ NOT ACTIVATED ✅ 0.19
run-tests Filter xUnit v3 tests by class on MTP 1.0/5 → 5.0/5 🟢 ✅ run-tests; tools: skill, view / ⚠️ NOT ACTIVATED ✅ 0.19
run-tests Filter xUnit v3 tests by trait on MTP 1.0/5 → 5.0/5 🟢 ✅ run-tests; tools: skill / ⚠️ NOT ACTIVATED ✅ 0.19
run-tests Filter TUnit tests by class using treenode-filter 4.0/5 → 4.0/5 ✅ run-tests; tools: skill / ⚠️ NOT ACTIVATED ✅ 0.19 [7]
run-tests Combine multiple filter criteria on VSTest MSTest 4.0/5 → 4.0/5 ⚠️ NOT ACTIVATED / ✅ run-tests; tools: skill ✅ 0.19
run-tests MTP project on SDK 9 must use -- separator for args 2.0/5 → 5.0/5 🟢 ✅ run-tests; tools: skill / ⚠️ NOT ACTIVATED ✅ 0.19 [8]
run-tests MTP project on SDK 10 passes args directly 3.0/5 → 3.0/5 ✅ run-tests; tools: skill / ⚠️ NOT ACTIVATED ✅ 0.19 [9]
run-tests Detect test platform from Directory.Build.props 2.0/5 → 5.0/5 🟢 ✅ run-tests; tools: skill ✅ 0.19
run-tests Negative test: do not use MTP syntax for a VSTest project 4.0/5 → 5.0/5 🟢 ✅ run-tests; tools: skill, view, glob / ⚠️ NOT ACTIVATED ✅ 0.19 [10]
writing-mstest-tests Write unit tests for a service class 5.0/5 → 4.0/5 🔴 ✅ writing-mstest-tests; tools: skill / ⚠️ NOT ACTIVATED 🟡 0.27
writing-mstest-tests Write data-driven tests for a calculator 4.0/5 → 5.0/5 🟢 ✅ writing-mstest-tests; tools: skill 🟡 0.27
writing-mstest-tests Write async tests with cancellation 2.0/5 → 5.0/5 🟢 ✅ writing-mstest-tests; tools: skill 🟡 0.27
writing-mstest-tests Fix swapped Assert.AreEqual arguments 5.0/5 → 5.0/5 ✅ writing-mstest-tests; tools: skill / ⚠️ NOT ACTIVATED 🟡 0.27 [11]
writing-mstest-tests Modernize legacy test patterns 4.0/5 → 5.0/5 🟢 ✅ writing-mstest-tests; tools: skill 🟡 0.27
writing-mstest-tests Replace ExpectedException with Assert.Throws 3.0/5 → 3.0/5 ✅ writing-mstest-tests; tools: skill / ✅ writing-mstest-tests; tools: report_intent, skill 🟡 0.27
writing-mstest-tests Use proper collection assertions 3.0/5 → 2.0/5 🔴 ✅ writing-mstest-tests; tools: skill 🟡 0.27
writing-mstest-tests Use proper type assertions instead of casts 3.0/5 → 3.0/5 ✅ writing-mstest-tests; tools: skill / ⚠️ NOT ACTIVATED 🟡 0.27
writing-mstest-tests Set up test lifecycle correctly 3.0/5 → 4.0/5 🟢 ✅ writing-mstest-tests; tools: skill 🟡 0.27
writing-mstest-tests Use DynamicData with ValueTuples over object arrays 2.0/5 → 5.0/5 🟢 ✅ writing-mstest-tests; tools: report_intent, skill / ⚠️ NOT ACTIVATED 🟡 0.27
mtp-hot-reload Suggest hot reload for failing test in MTP project (SDK 9) 1.0/5 → 2.0/5 ⏰ 🟢 ✅ mtp-hot-reload; tools: skill ✅ 0.14
mtp-hot-reload Suggest hot reload for failing test in MTP project (SDK 10) 1.0/5 → 4.0/5 🟢 ✅ mtp-hot-reload; tools: skill, bash, create ✅ 0.14
mtp-hot-reload Enable hot reload when package already installed 2.0/5 → 5.0/5 🟢 ✅ mtp-hot-reload; tools: skill ✅ 0.14
mtp-hot-reload Suggest launchSettings.json configuration for hot reload 1.0/5 → 4.0/5 🟢 ✅ mtp-hot-reload; tools: skill, bash ✅ 0.14
mtp-hot-reload Use dotnet run not dotnet test for hot reload 3.0/5 → 3.0/5 ✅ mtp-hot-reload; tools: skill ✅ 0.14
mtp-hot-reload Negative: VSTest project cannot use MTP hot reload 1.0/5 → 3.0/5 🟢 ✅ mtp-hot-reload; tools: skill, create ✅ 0.14
mtp-hot-reload Run specific failing test with hot reload filter 1.0/5 → 5.0/5 🟢 ✅ mtp-hot-reload; tools: skill ✅ 0.14
migrate-xunit-to-xunit-v3 Migrate basic xUnit.net v2 project to v3 2.0/5 → 5.0/5 🟢 ✅ migrate-xunit-to-xunit-v3; tools: skill, glob / ✅ migrate-xunit-to-xunit-v3; tools: skill ✅ 0.05
migrate-xunit-to-xunit-v3 Detect incompatible target framework and stop migration 1.0/5 → 5.0/5 🟢 ✅ migrate-xunit-to-xunit-v3; tools: skill ✅ 0.05
migrate-xunit-to-xunit-v3 Convert async void test methods to async Task 5.0/5 → 5.0/5 ✅ migrate-xunit-to-xunit-v3; tools: skill, glob, create / ✅ migrate-xunit-to-xunit-v3; tools: skill ✅ 0.05 [12]
migrate-xunit-to-xunit-v3 Convert string-based attribute constructors to typeof syntax 5.0/5 → 5.0/5 ✅ migrate-xunit-to-xunit-v3; tools: skill, create / ✅ migrate-xunit-to-xunit-v3; tools: skill, web_fetch, create ✅ 0.05 [13]
migrate-xunit-to-xunit-v3 Update custom FactAttribute to include source information parameters 5.0/5 → 5.0/5 ✅ migrate-xunit-to-xunit-v3; tools: skill / ✅ migrate-xunit-to-xunit-v3; tools: skill, web_fetch ✅ 0.05
migrate-xunit-to-xunit-v3 Update BeforeAfterTestAttribute overrides with IXunitTest parameter 5.0/5 → 4.0/5 🔴 ✅ migrate-xunit-to-xunit-v3; tools: skill, create ✅ 0.05
migrate-xunit-to-xunit-v3 Migrate project with YTest.MTP.XUnit2 to xUnit.net v3 preserving MTP 3.0/5 → 5.0/5 🟢 ✅ migrate-xunit-to-xunit-v3; tools: skill, web_fetch / ✅ migrate-xunit-to-xunit-v3; tools: skill ✅ 0.05
migrate-xunit-to-xunit-v3 Migrate Xunit.SkippableFact to xUnit.net v3 built-in skip APIs 5.0/5 → 5.0/5 ✅ migrate-xunit-to-xunit-v3; tools: skill, create / ✅ migrate-xunit-to-xunit-v3; tools: skill, glob ✅ 0.05 [14]
migrate-xunit-to-xunit-v3 Migrate xUnit v2 packages managed via Central Package Management 5.0/5 → 5.0/5 ✅ migrate-xunit-to-xunit-v3; tools: skill, web_fetch / ✅ migrate-xunit-to-xunit-v3; tools: skill, create ✅ 0.05 [15]
migrate-xunit-to-xunit-v3 Recognize project already on xUnit.net v3 — no migration needed 2.0/5 → 5.0/5 🟢 ✅ migrate-xunit-to-xunit-v3; tools: skill / ✅ migrate-xunit-to-xunit-v3; tools: skill, glob ✅ 0.05
migrate-xunit-to-xunit-v3 Consolidate xunit.extensibility packages and remove xunit.abstractions 3.0/5 → 3.0/5 ✅ migrate-xunit-to-xunit-v3; tools: skill ✅ 0.05
migrate-xunit-to-xunit-v3 Update Xunit.Combinatorial and Xunit.StaFact companion packages 5.0/5 → 4.0/5 🔴 ✅ migrate-xunit-to-xunit-v3; tools: skill, create / ✅ migrate-xunit-to-xunit-v3; tools: skill, glob, create ✅ 0.05
generate-testability-wrappers Generate TimeProvider adoption for DateTime.UtcNow 3.0/5 → 5.0/5 🟢 ✅ generate-testability-wrappers; tools: skill / ✅ generate-testability-wrappers; tools: glob, skill, bash ✅ 0.08
generate-testability-wrappers Generate custom Environment wrapper 3.0/5 → 5.0/5 🟢 ✅ generate-testability-wrappers; tools: skill ✅ 0.08
generate-testability-wrappers Recommend System.IO.Abstractions for file system calls 2.0/5 → 5.0/5 🟢 ✅ generate-testability-wrappers; tools: skill, report_intent, view / ✅ generate-testability-wrappers; tools: skill ✅ 0.08
generate-testability-wrappers Decline wrapper generation for already-abstracted code 2.0/5 → 5.0/5 🟢 ✅ generate-testability-wrappers; tools: skill ✅ 0.08
code-testing-agent Generate tests for ContosoUniversity ASP.NET Core MVC app 3.0/5 → 3.0/5 ✅ code-testing-agent; tools: skill, task, read_agent / ✅ code-testing-agent; tools: skill, task, read_bash, read_agent ✅ 0.02 [16]
migrate-mstest-v1v2-to-v3 Migrate MSTest v1 project with assembly reference 3.0/5 → 5.0/5 🟢 ✅ migrate-mstest-v1v2-to-v3; tools: skill, edit, bash / ✅ migrate-mstest-v1v2-to-v3; tools: skill ✅ 0.04
migrate-mstest-v1v2-to-v3 Migrate MSTest v2 NuGet project to v3 3.0/5 → 3.0/5 ✅ migrate-mstest-v1v2-to-v3; tools: skill ✅ 0.04 [17]
migrate-mstest-v1v2-to-v3 Fix Assert.AreEqual object overload errors after v3 upgrade 4.0/5 → 5.0/5 🟢 ✅ migrate-mstest-v1v2-to-v3; tools: skill / ⚠️ NOT ACTIVATED ✅ 0.04
migrate-mstest-v1v2-to-v3 Migrate from .testsettings to .runsettings 4.0/5 → 4.0/5 ✅ migrate-mstest-v1v2-to-v3; tools: skill, bash ✅ 0.04 [18]
migrate-mstest-v1v2-to-v3 Fix DataRow type mismatch errors after v3 upgrade 3.0/5 → 3.0/5 ✅ migrate-mstest-v1v2-to-v3; tools: skill ✅ 0.04
migrate-mstest-v1v2-to-v3 Migrate to MSTest.Sdk project style 3.0/5 → 5.0/5 🟢 ✅ migrate-mstest-v1v2-to-v3; tools: skill, bash ✅ 0.04
migrate-mstest-v1v2-to-v3 Handle dropped target framework during v3 migration 5.0/5 → 5.0/5 ⚠️ NOT ACTIVATED ✅ 0.04 [19]
migrate-mstest-v1v2-to-v3 Migrate complex MSTest v2 project with testsettings, DataRow issues, and dropped TFM 3.0/5 → 5.0/5 🟢 ✅ migrate-mstest-v1v2-to-v3; tools: skill ✅ 0.04
migrate-mstest-v1v2-to-v3 Correctly identify MSTest v1 vs v2 and recommend different migration paths 4.0/5 → 5.0/5 🟢 ✅ migrate-mstest-v1v2-to-v3; tools: skill ✅ 0.04
detect-static-dependencies Identify static dependencies in a multi-class project 4.0/5 → 4.0/5 ✅ detect-static-dependencies; tools: skill, glob / ✅ detect-static-dependencies; tools: skill ✅ 0.06 [20]
detect-static-dependencies Detect time-related statics and recommend TimeProvider 5.0/5 → 5.0/5 ✅ detect-static-dependencies; tools: skill ✅ 0.06 [21]
detect-static-dependencies Decline scan for non-C# project 5.0/5 → 5.0/5 ℹ️ not activated (expected) ✅ 0.06 [22]
migrate-mstest-v3-to-v4 Migrate custom TestMethodAttribute from Execute to ExecuteAsync 2.0/5 → 3.0/5 🟢 ✅ migrate-mstest-v3-to-v4; tools: skill ✅ 0.06
migrate-mstest-v3-to-v4 Replace ExpectedExceptionAttribute with Assert.ThrowsExactly 3.0/5 → 4.0/5 🟢 ✅ migrate-mstest-v3-to-v4; tools: skill ✅ 0.06
migrate-mstest-v3-to-v4 Fix multiple v4 breaking changes: Assert, ClassCleanup, TestContext, Timeout 5.0/5 → 5.0/5 ✅ migrate-mstest-v3-to-v4; tools: skill ✅ 0.06
migrate-mstest-v3-to-v4 Handle net6.0 target framework dropped in MSTest v4 3.0/5 → 5.0/5 🟢 ✅ migrate-mstest-v3-to-v4; tools: skill / ⚠️ NOT ACTIVATED ✅ 0.06
migrate-mstest-v3-to-v4 Fix TestMethodAttribute CallerInfo constructor breaking change 5.0/5 → 5.0/5 ✅ migrate-mstest-v3-to-v4; tools: skill ✅ 0.06
migrate-mstest-v3-to-v4 Understand behavioral changes after MSTest v4 upgrade 3.0/5 → 5.0/5 🟢 ✅ migrate-mstest-v3-to-v4; tools: skill ✅ 0.06
migrate-mstest-v3-to-v4 Handle MSTest.Sdk and MTP changes in v4 2.0/5 → 3.0/5 🟢 ✅ migrate-mstest-v3-to-v4; tools: report_intent, skill ✅ 0.06
migrate-mstest-v3-to-v4 Full MSTest v3 to v4 migration with multiple breaking changes 2.0/5 → 5.0/5 🟢 ✅ migrate-mstest-v3-to-v4; tools: skill ✅ 0.06
migrate-mstest-v3-to-v4 Migrate MSTest.Sdk v3 project using ManagedType and TestTimeout 2.0/5 → 4.0/5 🟢 ✅ migrate-mstest-v3-to-v4; tools: skill ✅ 0.06
migrate-mstest-v3-to-v4 Correctly identify MSTest v3 project and recommend v4 migration 3.0/5 → 5.0/5 🟢 ✅ migrate-mstest-v3-to-v4; tools: skill ✅ 0.06
crap-score Calculate CRAP score for a single method with partial coverage 4.0/5 → 4.0/5 ✅ crap-score; tools: skill, glob / ✅ crap-score; tools: skill ✅ 0.11
crap-score Identify riskiest methods across a file 4.0/5 → 5.0/5 🟢 ✅ crap-score; tools: skill ✅ 0.11
crap-score Generate coverage then compute CRAP score 3.0/5 → 4.0/5 🟢 ✅ crap-score; tools: skill ✅ 0.11
exp-mock-usage-analysis Detect unused and unreachable mock setups 4.0/5 → 5.0/5 🟢 ✅ exp-mock-usage-analysis; tools: skill ✅ 0.07
exp-mock-usage-analysis Detect redundant mock configurations duplicated across tests 3.0/5 → 3.0/5 ✅ exp-mock-usage-analysis; tools: skill ✅ 0.07 [23]
exp-mock-usage-analysis Detect mocking of stable framework types 3.0/5 → 5.0/5 🟢 ✅ exp-mock-usage-analysis; tools: skill ✅ 0.07
exp-mock-usage-analysis Analyze mock usage in NSubstitute tests 5.0/5 → 4.0/5 🔴 ✅ exp-mock-usage-analysis; tools: skill ✅ 0.07 [24]
exp-mock-usage-analysis Analyze mock usage in FakeItEasy tests 4.0/5 → 5.0/5 🟢 ✅ exp-mock-usage-analysis; tools: skill ✅ 0.07
exp-mock-usage-analysis Detect excessive mock configuration sprawl 3.0/5 → 4.0/5 🟢 ✅ exp-mock-usage-analysis; tools: skill ✅ 0.07
exp-test-tagging Tag an untagged MSTest test suite 3.0/5 → 5.0/5 🟢 ✅ exp-test-tagging; tools: skill / ✅ exp-test-tagging; tools: skill, read_bash 🟡 0.36
exp-test-tagging Tag an untagged xUnit test suite 4.0/5 → 5.0/5 🟢 ✅ exp-test-tagging; tools: skill / ⚠️ NOT ACTIVATED 🟡 0.36
exp-test-tagging Tag an untagged NUnit test suite 3.0/5 → 4.0/5 🟢 ✅ exp-test-tagging; tools: skill, glob / ⚠️ NOT ACTIVATED 🟡 0.36
exp-test-tagging Audit test distribution without modifying files 5.0/5 → 5.0/5 ✅ exp-test-tagging; tools: skill 🟡 0.36 [25]
exp-test-tagging Decline request to write new tests 4.0/5 → 4.0/5 ℹ️ not activated (expected) 🟡 0.36 [26]
exp-test-smell-detection Detect multiple test smells in order processing test suite 4.0/5 → 5.0/5 🟢 ✅ exp-test-smell-detection; tools: skill ✅ 0.05
exp-test-smell-detection Recognize well-written tests with no significant smells 3.0/5 → 5.0/5 🟢 ✅ exp-test-smell-detection; tools: skill ✅ 0.05
exp-test-smell-detection Recognize integration tests and avoid false positives for external resources 5.0/5 → 5.0/5 ✅ exp-test-smell-detection; tools: skill ✅ 0.05 [27]
exp-test-smell-detection Decline request to write new tests from scratch 5.0/5 → 5.0/5 ℹ️ not activated (expected) ✅ 0.05 [28]
exp-test-gap-analysis Find boundary mutation gaps in tiered discount and shipping logic 4.0/5 → 5.0/5 🟢 ✅ exp-test-gap-analysis; tools: skill ✅ 0.10
exp-test-gap-analysis Find logic and null-check mutation gaps in access control code 5.0/5 → 5.0/5 ✅ exp-test-gap-analysis; tools: skill ✅ 0.10 [29]
exp-test-gap-analysis Acknowledge well-tested code with few surviving mutations 4.0/5 → 4.0/5 ✅ exp-test-gap-analysis; tools: skill, glob / ✅ exp-test-gap-analysis; tools: skill ✅ 0.10
exp-test-gap-analysis Decline request to write new tests from scratch 4.0/5 → 4.0/5 ℹ️ not activated (expected) ✅ 0.10 [30]
exp-assertion-quality Identify low assertion diversity in equality-dominated test suite 4.0/5 → 5.0/5 🟢 ✅ exp-assertion-quality; tools: skill / ✅ exp-assertion-quality; tools: skill, glob ✅ 0.10
exp-assertion-quality Flag assertion-free tests and trivial-only assertions 3.0/5 → 4.0/5 🟢 ✅ exp-assertion-quality; tools: skill / ⚠️ NOT ACTIVATED ✅ 0.10
exp-assertion-quality Recognize well-diversified assertion usage 3.0/5 → 5.0/5 🟢 ✅ exp-assertion-quality; tools: skill ✅ 0.10
exp-assertion-quality Decline request to write new tests from scratch 2.0/5 ⏰ → 3.0/5 ⏰ 🟢 ℹ️ not activated (expected) ✅ 0.10
exp-test-maintainability Recommend data-driven patterns with display names for unclear parameters 4.0/5 → 4.0/5 ⚠️ NOT ACTIVATED ✅ 0.08 [31]
exp-test-maintainability Recognize well-maintained tests that need minimal changes 4.0/5 → 5.0/5 🟢 ⚠️ NOT ACTIVATED / ✅ exp-test-maintainability; tools: report_intent, skill ✅ 0.08
exp-test-maintainability Detect repeated object construction and setup across test methods 3.0/5 → 4.0/5 🟢 ✅ exp-test-maintainability; tools: skill ✅ 0.08
exp-test-maintainability Recognize tests with minimal boilerplate that need no refactoring 4.0/5 → 5.0/5 🟢 ✅ exp-test-maintainability; tools: skill ✅ 0.08 [32]
exp-simd-vectorization Optimize manual min/max with TensorPrimitives 1.0/5 → 5.0/5 🟢 ✅ exp-simd-vectorization; tools: skill, glob, create 🟡 0.22
exp-simd-vectorization Optimize manual product with TensorPrimitives 1.0/5 → 5.0/5 🟢 ✅ exp-simd-vectorization; tools: skill, glob, create, bash / ⚠️ NOT ACTIVATED 🟡 0.22 [33]
exp-simd-vectorization No optimization opportunity — dictionary-based lookup service 1.0/5 → 4.0/5 🟢 ⚠️ NOT ACTIVATED 🟡 0.22
exp-simd-vectorization Optimize int array conditional increment with SIMD 3.0/5 → 4.0/5 🟢 ✅ exp-simd-vectorization; tools: skill 🟡 0.22
exp-simd-vectorization Optimize byte buffer bit reversal with SIMD 1.0/5 ⏰ → 4.0/5 🟢 ✅ exp-simd-vectorization; tools: skill, edit, bash / ✅ exp-simd-vectorization; tools: skill 🟡 0.22 [34]

[1] (Plugin) Quality unchanged but weighted score is -7.9% due to: tokens (19082 → 59936), tool calls (0 → 2)
[2] (Plugin) Quality unchanged but weighted score is -2.0% due to: quality, tokens (20438 → 24549)
[3] (Isolated) Quality unchanged but weighted score is -12.3% due to: tokens (66796 → 161170), quality, tool calls (9 → 17), time (31.8s → 55.3s)
[4] (Isolated) Quality unchanged but weighted score is -6.7% due to: tokens (100103 → 162734), tool calls (11 → 19), time (40.8s → 69.8s)
[5] (Isolated) Quality improved but weighted score is -5.7% due to: tokens (71806 → 122970), time (33.3s → 57.8s)
[6] (Isolated) Quality unchanged but weighted score is -2.6% due to: tokens (31188 → 38509), tool calls (3 → 4), time (11.7s → 14.6s)
[7] (Isolated) Quality unchanged but weighted score is -0.7% due to: tokens (38306 → 59456), tool calls (4 → 5), time (26.3s → 31.8s)
[8] (Plugin) Quality unchanged but weighted score is -29.5% due to: quality, judgment, tokens (62063 → 74826)
[9] (Isolated) Quality unchanged but weighted score is -16.5% due to: completion (✓ → ✗), tokens (130577 → 340674), time (67.5s → 150.0s), tool calls (10 → 18)
[10] (Plugin) Quality unchanged but weighted score is -5.8% due to: tokens (30713 → 55560), tool calls (2 → 3), time (14.1s → 17.0s)
[11] (Isolated) Quality unchanged but weighted score is -9.3% due to: tokens (18444 → 38539), tool calls (0 → 1), time (11.5s → 19.8s)
[12] (Plugin) Quality unchanged but weighted score is -10.0% due to: tokens (156817 → 1018419), tool calls (19 → 45), time (94.2s → 293.2s)
[13] (Isolated) Quality unchanged but weighted score is -18.0% due to: judgment, quality, tokens (451439 → 529026)
[14] (Plugin) Quality unchanged but weighted score is -0.5% due to: tokens (189866 → 231027)
[15] (Isolated) Quality unchanged but weighted score is -12.4% due to: judgment, tokens (132558 → 150073)
[16] (Isolated) Quality unchanged but weighted score is -30.4% due to: quality, judgment, tokens (1018206 → 1633192), tool calls (69 → 132), time (366.9s → 533.2s)
[17] (Isolated) Quality unchanged but weighted score is -12.3% due to: completion (✓ → ✗), tokens (33153 → 51417), tool calls (3 → 5)
[18] (Isolated) Quality unchanged but weighted score is -17.6% due to: judgment, quality, tokens (59260 → 68176)
[19] (Plugin) Quality unchanged but weighted score is -1.6% due to: tokens (30748 → 38968)
[20] (Plugin) Quality unchanged but weighted score is -1.8% due to: tokens (73256 → 125415), time (31.7s → 65.4s)
[21] (Isolated) Quality unchanged but weighted score is -18.0% due to: quality, tokens (46245 → 109492), tool calls (5 → 11), time (28.9s → 72.1s)
[22] (Plugin) Quality dropped but weighted score is +0.9% due to: efficiency metrics
[23] (Isolated) Quality unchanged but weighted score is -1.2% due to: tokens (48445 → 70755), time (53.8s → 88.3s), tool calls (5 → 6)
[24] (Plugin) Quality unchanged but weighted score is -2.6% due to: tokens (28649 → 50226), time (43.0s → 78.0s), tool calls (3 → 4)
[25] (Isolated) Quality unchanged but weighted score is -19.2% due to: judgment, quality, tokens (45352 → 66031), tool calls (7 → 9)
[26] (Plugin) Quality unchanged but weighted score is -2.0% due to: tokens (41837 → 206912), tool calls (4 → 14), time (48.4s → 131.3s)
[27] (Plugin) Quality unchanged but weighted score is -8.1% due to: tokens (39926 → 71189), time (42.9s → 81.9s), tool calls (4 → 7)
[28] (Plugin) Quality unchanged but weighted score is -10.0% due to: tokens (40845 → 286462), tool calls (3 → 17), time (29.4s → 142.3s)
[29] (Isolated) Quality unchanged but weighted score is -9.7% due to: judgment, quality
[30] (Plugin) Quality unchanged but weighted score is -3.5% due to: tokens (185339 → 263698), time (78.7s → 104.6s), tool calls (14 → 17)
[31] (Plugin) Quality unchanged but weighted score is -0.2% due to: efficiency metrics
[32] (Plugin) Quality unchanged but weighted score is -6.6% due to: tokens (39395 → 66399), time (24.2s → 63.2s), tool calls (4 → 5)
[33] (Plugin) Quality unchanged but weighted score is -17.1% due to: judgment, quality
[34] (Plugin) Quality unchanged but weighted score is -7.5% due to: tokens (12219 → 27443), tool calls (2 → 4)

timeout — run(s) hit the (120s, 180s, 300s, 360s) scenario timeout limit; scoring may be impacted by aborting model execution before it could produce its full output (increase via timeout in eval.yaml)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

▶ Sessions Visualisation -- interactive replay of all evaluation sessions
