Skip to content

tidy: clean up old benchmark and add gym#7081

Merged
codefromthecrypt merged 3 commits intomainfrom
micn/bench-cleanup
Feb 9, 2026
Merged

tidy: clean up old benchmark and add gym#7081
codefromthecrypt merged 3 commits intomainfrom
micn/bench-cleanup

Conversation

@michaelneale
Copy link
Collaborator

The old benchmark code was wired in to cli and I don't think we use it.

Also thought I would try moving the "gym" over to evals dir where we can park a bunch of things in there which are not tied in to the code directly but useful.

Copilot AI review requested due to automatic review settings February 9, 2026 01:50
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Removes the legacy goose-bench benchmarking integration from the Rust CLI, and adds a standalone “Open Model Gym” evaluation harness under evals/ for running scenario matrices across models/runners (including an MCP tool harness).

Changes:

  • Removed the bench CLI command path and the goose-bench crate/dependency.
  • Added evals/open-model-gym/ including a TypeScript suite runner, validators, and example scenarios.
  • Added an MCP harness (TypeScript MCP server) for simulating common tool integrations used by eval scenarios.

Reviewed changes

Copilot reviewed 69 out of 76 changed files in this pull request and generated 13 comments.

Show a summary per file
File Description
evals/open-model-gym/config.yaml Defines the models/runners/scenario matrix for the Open Model Gym runs.
evals/open-model-gym/Justfile Adds just recipes to install/build/run/report the gym suite and MCP harness.
evals/open-model-gym/README.md Documents how the Open Model Gym works, how to configure it, and how to run it.
evals/open-model-gym/.gitignore Ignores build outputs, logs, caches, and runtime artifacts for the gym.
evals/open-model-gym/mcp-harness/README.md Documents the simulated MCP server and its tool set.
evals/open-model-gym/mcp-harness/package.json Defines MCP harness build/start scripts and dependencies.
evals/open-model-gym/mcp-harness/run.sh Convenience script to build and start the MCP harness.
evals/open-model-gym/mcp-harness/src/index.ts Implements the simulated MCP server and tool call logging.
evals/open-model-gym/mcp-harness/tsconfig.json TypeScript configuration for building the MCP harness.
evals/open-model-gym/suite/.gitignore Ignores suite-local artifacts (node_modules/workdir/cache/etc.).
evals/open-model-gym/suite/package.json Defines suite build/run scripts for the scenario runner.
evals/open-model-gym/suite/package-lock.json Locks suite npm dependencies for reproducible installs.
evals/open-model-gym/suite/tsconfig.json TypeScript configuration for building the suite runner.
evals/open-model-gym/suite/src/runner.ts Main suite runner: loads config/scenarios, runs agent commands, caches, and renders reports.
evals/open-model-gym/suite/src/types.ts Shared types for scenarios, validations, and test results.
evals/open-model-gym/suite/src/validator.ts Implements validation rules (file checks, regex checks, tool call checks, command checks).
evals/open-model-gym/suite/src/gym.png Embedded image asset used by the generated report.
evals/open-model-gym/suite/scenarios/everyday-app-automation.yaml Adds a multi-step scenario validating MCP tool usage across Slack/Jira/Calendar.
evals/open-model-gym/suite/scenarios/file-editing.yaml Adds a single-turn code-editing scenario with file/regex/compile validations.
evals/open-model-gym/suite/scenarios/multi-turn-edit.yaml Adds a multi-turn scenario validating session continuation behavior.
crates/goose-cli/Cargo.toml Drops the goose-bench dependency from the CLI crate.
crates/goose-cli/src/cli.rs Removes bench subcommand wiring; adjusts logging setup call sites.
crates/goose-cli/src/commands/mod.rs Removes the bench command module export.
crates/goose-cli/src/commands/bench.rs Removes the old benchmark CLI command implementation.
crates/goose-cli/src/commands/web.rs Adjusts logging setup invocation for the web command.
crates/goose-cli/src/logging.rs Simplifies logging setup by removing benchmark-specific error capture plumbing.
crates/goose-cli/src/main.rs Updates to the new setup_logging signature.
crates/goose-bench/Cargo.toml Removes the goose-bench crate manifest.
crates/goose-bench/README.md Removes legacy benchmark crate documentation.
crates/goose-bench/src/lib.rs Removes benchmark crate module exports.
crates/goose-bench/src/bench_config.rs Removes benchmark configuration types/serialization.
crates/goose-bench/src/bench_session.rs Removes benchmark session abstraction used by evals.
crates/goose-bench/src/bench_work_dir.rs Removes benchmark workdir management utilities.
crates/goose-bench/src/error_capture.rs Removes benchmark tracing layer that captured errors into vectors.
crates/goose-bench/src/reporting.rs Removes benchmark result/reporting structures.
crates/goose-bench/src/utilities.rs Removes benchmark process/thread helpers.
crates/goose-bench/src/runners/mod.rs Removes benchmark runner module exports.
crates/goose-bench/src/runners/bench_runner.rs Removes the benchmark suite runner orchestration.
crates/goose-bench/src/runners/eval_runner.rs Removes the eval runner that executed a single evaluation.
crates/goose-bench/src/runners/metric_aggregator.rs Removes the Python-postprocessing leaderboard generation wrapper.
crates/goose-bench/src/runners/model_runner.rs Removes the model runner that executed suites across runs/models.
crates/goose-bench/src/eval_suites/mod.rs Removes eval suite module structure.
crates/goose-bench/src/eval_suites/evaluation.rs Removes core eval trait and metric types.
crates/goose-bench/src/eval_suites/factory.rs Removes the eval registry and selector matching logic.
crates/goose-bench/src/eval_suites/metrics.rs Removes baseline metric collection utilities.
crates/goose-bench/src/eval_suites/utils.rs Removes eval output-to-file helpers.
crates/goose-bench/src/eval_suites/core/mod.rs Removes the “core” eval suite module.
crates/goose-bench/src/eval_suites/core/example.rs Removes the example evaluation.
crates/goose-bench/src/eval_suites/core/memory/mod.rs Removes memory eval suite module.
crates/goose-bench/src/eval_suites/core/memory/save_fact.rs Removes memory “save fact” evaluation.
crates/goose-bench/src/eval_suites/core/developer/mod.rs Removes developer eval suite module.
crates/goose-bench/src/eval_suites/core/developer/create_file.rs Removes developer “create file” evaluation.
crates/goose-bench/src/eval_suites/core/developer/list_files.rs Removes developer “list files” evaluation.
crates/goose-bench/src/eval_suites/core/developer/simple_repo_clone_test.rs Removes developer “repo clone test” evaluation.
crates/goose-bench/src/eval_suites/core/developer_image/mod.rs Removes developer image eval suite module.
crates/goose-bench/src/eval_suites/core/developer_image/image.rs Removes developer “screen capture” evaluation.
crates/goose-bench/src/eval_suites/core/developer_search_replace/mod.rs Removes developer search/replace eval suite module.
crates/goose-bench/src/eval_suites/core/developer_search_replace/search_replace.rs Removes developer search/replace evaluation.
crates/goose-bench/src/eval_suites/core/computercontroller/mod.rs Removes computer controller eval suite module.
crates/goose-bench/src/eval_suites/core/computercontroller/script.rs Removes computer controller “script” evaluation.
crates/goose-bench/src/eval_suites/core/computercontroller/web_scrape.rs Removes computer controller “web scrape” evaluation.
crates/goose-bench/src/eval_suites/vibes/mod.rs Removes the “vibes” eval suite module.
crates/goose-bench/src/eval_suites/vibes/blog_summary.rs Removes the blog summary evaluation.
crates/goose-bench/src/eval_suites/vibes/flappy_bird.rs Removes the flappy bird evaluation.
crates/goose-bench/src/eval_suites/vibes/goose_wiki.rs Removes the goose wiki evaluation.
crates/goose-bench/src/eval_suites/vibes/restaurant_research.rs Removes the restaurant research evaluation.
crates/goose-bench/src/eval_suites/vibes/squirrel_census.rs Removes the squirrel census evaluation.
crates/goose-bench/src/assets/kubernetes.patch Removes benchmark asset patch file.
crates/goose-bench/src/assets/vscode.patch Removes benchmark asset patch file.
crates/goose-bench/src/assets/vscode_config_registry.ts Removes benchmark asset source file.
Cargo.lock Removes goose-bench from the dependency graph lockfile.
Files not reviewed (2)
  • evals/open-model-gym/mcp-harness/package-lock.json: Language not supported
  • evals/open-model-gym/suite/package-lock.json: Language not supported

Comment on lines 240 to 242
validate_network_auth(&host, &auth_token, no_auth);
crate::logging::setup_logging(Some("goose-web"), None)?;
let _ = crate::logging::setup_logging(Some("goose-web"));

Copy link

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

setup_logging errors are being silently ignored here, which can hide failures to initialize file/telemetry logging; please at least emit the same warning used in main.rs (or propagate the error) so operators know logging is degraded.

Copilot uses AI. Check for mistakes.
Comment on lines +52 to +65
case "file_matches": {
const fullPath = join(workdir, rule.path);
if (!existsSync(fullPath)) {
return { passed: false, message: `File not found: ${rule.path}` };
}
const content = readFileSync(fullPath, "utf-8");
const regex = new RegExp(rule.regex);
const matches = regex.test(content);
return {
passed: matches,
message: matches
? undefined
: `File ${rule.path} does not match regex: ${rule.regex}`,
};
Copy link

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

file_matches/file_not_matches create new RegExp(rule.regex) without handling invalid patterns, so a bad regex will throw and crash the whole run instead of returning a failed validation; wrap regex compilation in a try/catch and return a helpful failure message.

Copilot uses AI. Check for mistakes.
Comment on lines +125 to +163
// Check if any call matches the arg requirements
for (const call of matchingCalls) {
const args = call.arguments || {};
let allMatch = true;

for (const [key, expected] of Object.entries(rule.args)) {
const actual = args[key];
if (actual === undefined) {
allMatch = false;
break;
}

// If expected starts/ends with /, treat as regex pattern
if (typeof expected === "string" && expected.startsWith("/") && expected.endsWith("/")) {
const pattern = new RegExp(expected.slice(1, -1), "i");
if (!pattern.test(String(actual))) {
allMatch = false;
break;
}
} else {
// Exact match (case-insensitive for strings)
const actualStr = String(actual).toLowerCase();
const expectedStr = String(expected).toLowerCase();
if (!actualStr.includes(expectedStr)) {
allMatch = false;
break;
}
}
}

if (allMatch) {
return { passed: true };
}
}

return {
passed: false,
message: `Tool ${rule.tool} called but args didn't match: expected ${JSON.stringify(rule.args)}`,
};
Copy link

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ValidationRule.tool_called.args is typed to allow RegExp, but validateRule doesn't actually handle RegExp values (it stringifies them and does a substring check); either implement proper expected instanceof RegExp matching or change the type to string to avoid misleading API consumers.

Copilot uses AI. Check for mistakes.
Comment on lines +84 to +93
case "command_succeeds": {
try {
execSync(rule.command, { cwd: workdir, stdio: "pipe" });
return { passed: true };
} catch (err) {
return {
passed: false,
message: `Command failed: ${rule.command}`,
};
}
Copy link

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On command_succeeds failures, the returned message drops the underlying stderr/stdout, which makes debugging scenario failures hard; include the caught error’s output (or at least err.message) in the validation message.

Copilot uses AI. Check for mistakes.
Comment on lines 1 to 2
pub mod configure;
pub mod info;
Copy link

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the bench command/module is being removed, there are now stale references to crates/goose-bench in repo docs (e.g. AGENTS.md and benchmark blog posts); please update those references or clarify the new location so documentation doesn’t point at deleted paths.

Copilot uses AI. Check for mistakes.
return new Date().toISOString();
}

function randomDelay(): number {
Copy link

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused function randomDelay.

Copilot uses AI. Check for mistakes.
@@ -0,0 +1,1504 @@
#!/usr/bin/env node
import { mkdirSync, writeFileSync, rmSync, readdirSync, existsSync, copyFileSync } from "node:fs";
import { join, basename, dirname } from "node:path";
Copy link

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused import dirname.

Copilot uses AI. Check for mistakes.
return `${pair.model.name}::${pair.runner.name}`;
}

function resultKey(result: TestResultWithLog): string {
Copy link

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused function resultKey.

Copilot uses AI. Check for mistakes.
}

const getResult = (scenario: string, rowKey: string) => {
const [modelPart, runnerName] = rowKey.split("::");
Copy link

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused variable modelPart.

Copilot uses AI. Check for mistakes.
Copy link
Collaborator

@alexhancock alexhancock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this is good to do. Can experiment with a range of evals here.

Copilot AI review requested due to automatic review settings February 9, 2026 03:44
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 69 out of 76 changed files in this pull request and generated 6 comments.

Files not reviewed (2)
  • evals/open-model-gym/mcp-harness/package-lock.json: Language not supported
  • evals/open-model-gym/suite/package-lock.json: Language not supported

Comment on lines +16 to +18
case "file_exists": {
const fullPath = join(workdir, rule.path);
const exists = existsSync(fullPath);
Copy link

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rule.path is joined onto workdir without validation, so a scenario can use ../ or absolute paths to read/stat files outside the workspace; consider resolving the path and rejecting anything that escapes workdir (and optionally disallow absolute paths).

Copilot uses AI. Check for mistakes.
Comment on lines +57 to +59
const content = readFileSync(fullPath, "utf-8");
const regex = new RegExp(rule.regex);
const matches = regex.test(content);
Copy link

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new RegExp(rule.regex) can throw on an invalid pattern and crash the whole run; wrap regex compilation in a try/catch and return a failed validation with an explanatory message instead.

Copilot uses AI. Check for mistakes.
Comment on lines +73 to +75
const content = readFileSync(fullPath, "utf-8");
const regex = new RegExp(rule.regex);
const matches = regex.test(content);
Copy link

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new RegExp(rule.regex) can throw on an invalid pattern and crash the whole run; wrap regex compilation in a try/catch and return a failed validation with an explanatory message instead.

Copilot uses AI. Check for mistakes.
Comment on lines +145 to +149
// Exact match (case-insensitive for strings)
const actualStr = String(actual).toLowerCase();
const expectedStr = String(expected).toLowerCase();
if (!actualStr.includes(expectedStr)) {
allMatch = false;
Copy link

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment says this is an “Exact match” check, but the implementation uses includes, which can lead to false-positive validations; either change the check to strict equality (case-insensitive) or update the comment/behavior to reflect substring matching.

Copilot uses AI. Check for mistakes.
Comment on lines +42 to +44
| { type: "command_succeeds"; command: string; name?: string }
| { type: "tool_called"; tool: string; args?: Record<string, string | RegExp>; name?: string }
| { type: "custom"; fn: string; name?: string };
Copy link

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tool_called.args is typed as allowing RegExp, but the validator only treats regex expectations as strings wrapped in slashes (e.g. "/foo/"); either update the type to match the on-disk YAML/JSON representation or add support for actual RegExp values in validateRule.

Copilot uses AI. Check for mistakes.
Comment on lines +15 to +19
result: result,
};
const line = JSON.stringify(entry) + '\n';
fs.appendFileSync(LOG_FILE, line);
}
Copy link

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

appendFileSync will throw (and crash the server) if MCP_HARNESS_LOG points to a path whose parent directory doesn’t exist; ensure the directory is created (or catch/log the error) before writing.

Copilot uses AI. Check for mistakes.
@codefromthecrypt codefromthecrypt added this pull request to the merge queue Feb 9, 2026
Copy link
Collaborator

@codefromthecrypt codefromthecrypt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

red is a lovely color

Merged via the queue into main with commit a3ba124 Feb 9, 2026
25 checks passed
@codefromthecrypt codefromthecrypt deleted the micn/bench-cleanup branch February 9, 2026 06:14
jh-block added a commit that referenced this pull request Feb 9, 2026
* origin/main: (54 commits)
  chore: strip posthog for sessions/models/daily only (#7079)
  tidy: clean up old benchmark and add gym (#7081)
  fix: use command.process_group(0) for CLI providers, not just MCP (#7083)
  added build notify (#6891)
  test(mcp): add image tool test and consolidate MCP test fixtures (#7019)
  fix: remove Option from model listing return types, propagate errors (#7074)
  fix: lazy provider creation for goose acp (#7026) (#7066)
  Smoke tests: split compaction test and use debug build (#6984)
  fix(deps): trim bat to resolve RUSTSEC-2024-0320 (#7061)
  feat: expose AGENT_SESSION_ID env var to extension child processes (#7072)
  fix: add XML tool call parsing fallback for Qwen3-coder via Ollama (#6882)
  Remove clippy too_many_lines lint and decompose long functions (#7064)
  refactor: move disable_session_naming into AgentConfig (#7062)
  Add global config switch to disable automatic session naming (#7052)
  docs: add blog post - 8 Things You Didn't Know About Code Mode (#7059)
  fix: ensure animated elements are visible when prefers-reduced-motion is enabled (#7047)
  Show recommended model on failture (#7040)
  feat(ui): add session content search via API (#7050)
  docs: fix img url (#7053)
  Desktop UI for deleting custom providers (#7042)
  ...
tlongwell-block added a commit that referenced this pull request Feb 9, 2026
* origin/main:
  Docs: require auth optional for custom providers (#7098)
  fix: improve text-muted contrast for better readability (#7095)
  Always sync bundled extensions (#7057)
  feat: Add tom (Top Of Mind) platform extension (#7073)
  chore(docs): update GOOSE_SESSION_ID -> AGENT_SESSION_ID (#6669)
  fix(ci): switch from cargo-audit to cargo-deny for advisory scanning (#7032)
  chore(deps): bump @isaacs/brace-expansion from 5.0.0 to 5.0.1 in /evals/open-model-gym/suite (#7085)
  chore(deps): bump @modelcontextprotocol/sdk from 1.25.3 to 1.26.0 in /evals/open-model-gym/mcp-harness (#7086)
  fix: switch to windows msvc (#7080)
  fix: allow unlisted models for CLI providers (#7090)
  Use goose port (#7089)
  chore: strip posthog for sessions/models/daily only (#7079)
  tidy: clean up old benchmark and add gym (#7081)
  fix: use command.process_group(0) for CLI providers, not just MCP (#7083)
  added build notify (#6891)
michaelneale added a commit that referenced this pull request Feb 10, 2026
* main: (125 commits)
  chore: add a new scenario (#7107)
  fix: Goose Desktop missing Calendar and Reminders entitlements (#7100)
  Fix 'Edit In Place' and 'Fork Session' features (#6970)
  Fix: Only send command content to command injection classifier (excluding part of tool call dict) (#7082)
  Docs: require auth optional for custom providers (#7098)
  fix: improve text-muted contrast for better readability (#7095)
  Always sync bundled extensions (#7057)
  feat: Add tom (Top Of Mind) platform extension (#7073)
  chore(docs): update GOOSE_SESSION_ID -> AGENT_SESSION_ID (#6669)
  fix(ci): switch from cargo-audit to cargo-deny for advisory scanning (#7032)
  chore(deps): bump @isaacs/brace-expansion from 5.0.0 to 5.0.1 in /evals/open-model-gym/suite (#7085)
  chore(deps): bump @modelcontextprotocol/sdk from 1.25.3 to 1.26.0 in /evals/open-model-gym/mcp-harness (#7086)
  fix: switch to windows msvc (#7080)
  fix: allow unlisted models for CLI providers (#7090)
  Use goose port (#7089)
  chore: strip posthog for sessions/models/daily only (#7079)
  tidy: clean up old benchmark and add gym (#7081)
  fix: use command.process_group(0) for CLI providers, not just MCP (#7083)
  added build notify (#6891)
  test(mcp): add image tool test and consolidate MCP test fixtures (#7019)
  ...
lifeizhou-ap added a commit that referenced this pull request Feb 11, 2026
* main: (85 commits)
  Fix 'Edit In Place' and 'Fork Session' features (#6970)
  Fix: Only send command content to command injection classifier (excluding part of tool call dict) (#7082)
  Docs: require auth optional for custom providers (#7098)
  fix: improve text-muted contrast for better readability (#7095)
  Always sync bundled extensions (#7057)
  feat: Add tom (Top Of Mind) platform extension (#7073)
  chore(docs): update GOOSE_SESSION_ID -> AGENT_SESSION_ID (#6669)
  fix(ci): switch from cargo-audit to cargo-deny for advisory scanning (#7032)
  chore(deps): bump @isaacs/brace-expansion from 5.0.0 to 5.0.1 in /evals/open-model-gym/suite (#7085)
  chore(deps): bump @modelcontextprotocol/sdk from 1.25.3 to 1.26.0 in /evals/open-model-gym/mcp-harness (#7086)
  fix: switch to windows msvc (#7080)
  fix: allow unlisted models for CLI providers (#7090)
  Use goose port (#7089)
  chore: strip posthog for sessions/models/daily only (#7079)
  tidy: clean up old benchmark and add gym (#7081)
  fix: use command.process_group(0) for CLI providers, not just MCP (#7083)
  added build notify (#6891)
  test(mcp): add image tool test and consolidate MCP test fixtures (#7019)
  fix: remove Option from model listing return types, propagate errors (#7074)
  fix: lazy provider creation for goose acp (#7026) (#7066)
  ...
Tyler-Hardin pushed a commit to Tyler-Hardin/goose that referenced this pull request Feb 11, 2026
Tyler-Hardin pushed a commit to Tyler-Hardin/goose that referenced this pull request Feb 11, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants