tidy: clean up old benchmark and add gym#7081
Conversation
There was a problem hiding this comment.
Pull request overview
Removes the legacy goose-bench benchmarking integration from the Rust CLI, and adds a standalone “Open Model Gym” evaluation harness under evals/ for running scenario matrices across models/runners (including an MCP tool harness).
Changes:
- Removed the
benchCLI command path and thegoose-benchcrate/dependency. - Added
evals/open-model-gym/including a TypeScript suite runner, validators, and example scenarios. - Added an MCP harness (TypeScript MCP server) for simulating common tool integrations used by eval scenarios.
Reviewed changes
Copilot reviewed 69 out of 76 changed files in this pull request and generated 13 comments.
Show a summary per file
| File | Description |
|---|---|
| evals/open-model-gym/config.yaml | Defines the models/runners/scenario matrix for the Open Model Gym runs. |
| evals/open-model-gym/Justfile | Adds just recipes to install/build/run/report the gym suite and MCP harness. |
| evals/open-model-gym/README.md | Documents how the Open Model Gym works, how to configure it, and how to run it. |
| evals/open-model-gym/.gitignore | Ignores build outputs, logs, caches, and runtime artifacts for the gym. |
| evals/open-model-gym/mcp-harness/README.md | Documents the simulated MCP server and its tool set. |
| evals/open-model-gym/mcp-harness/package.json | Defines MCP harness build/start scripts and dependencies. |
| evals/open-model-gym/mcp-harness/run.sh | Convenience script to build and start the MCP harness. |
| evals/open-model-gym/mcp-harness/src/index.ts | Implements the simulated MCP server and tool call logging. |
| evals/open-model-gym/mcp-harness/tsconfig.json | TypeScript configuration for building the MCP harness. |
| evals/open-model-gym/suite/.gitignore | Ignores suite-local artifacts (node_modules/workdir/cache/etc.). |
| evals/open-model-gym/suite/package.json | Defines suite build/run scripts for the scenario runner. |
| evals/open-model-gym/suite/package-lock.json | Locks suite npm dependencies for reproducible installs. |
| evals/open-model-gym/suite/tsconfig.json | TypeScript configuration for building the suite runner. |
| evals/open-model-gym/suite/src/runner.ts | Main suite runner: loads config/scenarios, runs agent commands, caches, and renders reports. |
| evals/open-model-gym/suite/src/types.ts | Shared types for scenarios, validations, and test results. |
| evals/open-model-gym/suite/src/validator.ts | Implements validation rules (file checks, regex checks, tool call checks, command checks). |
| evals/open-model-gym/suite/src/gym.png | Embedded image asset used by the generated report. |
| evals/open-model-gym/suite/scenarios/everyday-app-automation.yaml | Adds a multi-step scenario validating MCP tool usage across Slack/Jira/Calendar. |
| evals/open-model-gym/suite/scenarios/file-editing.yaml | Adds a single-turn code-editing scenario with file/regex/compile validations. |
| evals/open-model-gym/suite/scenarios/multi-turn-edit.yaml | Adds a multi-turn scenario validating session continuation behavior. |
| crates/goose-cli/Cargo.toml | Drops the goose-bench dependency from the CLI crate. |
| crates/goose-cli/src/cli.rs | Removes bench subcommand wiring; adjusts logging setup call sites. |
| crates/goose-cli/src/commands/mod.rs | Removes the bench command module export. |
| crates/goose-cli/src/commands/bench.rs | Removes the old benchmark CLI command implementation. |
| crates/goose-cli/src/commands/web.rs | Adjusts logging setup invocation for the web command. |
| crates/goose-cli/src/logging.rs | Simplifies logging setup by removing benchmark-specific error capture plumbing. |
| crates/goose-cli/src/main.rs | Updates to the new setup_logging signature. |
| crates/goose-bench/Cargo.toml | Removes the goose-bench crate manifest. |
| crates/goose-bench/README.md | Removes legacy benchmark crate documentation. |
| crates/goose-bench/src/lib.rs | Removes benchmark crate module exports. |
| crates/goose-bench/src/bench_config.rs | Removes benchmark configuration types/serialization. |
| crates/goose-bench/src/bench_session.rs | Removes benchmark session abstraction used by evals. |
| crates/goose-bench/src/bench_work_dir.rs | Removes benchmark workdir management utilities. |
| crates/goose-bench/src/error_capture.rs | Removes benchmark tracing layer that captured errors into vectors. |
| crates/goose-bench/src/reporting.rs | Removes benchmark result/reporting structures. |
| crates/goose-bench/src/utilities.rs | Removes benchmark process/thread helpers. |
| crates/goose-bench/src/runners/mod.rs | Removes benchmark runner module exports. |
| crates/goose-bench/src/runners/bench_runner.rs | Removes the benchmark suite runner orchestration. |
| crates/goose-bench/src/runners/eval_runner.rs | Removes the eval runner that executed a single evaluation. |
| crates/goose-bench/src/runners/metric_aggregator.rs | Removes the Python-postprocessing leaderboard generation wrapper. |
| crates/goose-bench/src/runners/model_runner.rs | Removes the model runner that executed suites across runs/models. |
| crates/goose-bench/src/eval_suites/mod.rs | Removes eval suite module structure. |
| crates/goose-bench/src/eval_suites/evaluation.rs | Removes core eval trait and metric types. |
| crates/goose-bench/src/eval_suites/factory.rs | Removes the eval registry and selector matching logic. |
| crates/goose-bench/src/eval_suites/metrics.rs | Removes baseline metric collection utilities. |
| crates/goose-bench/src/eval_suites/utils.rs | Removes eval output-to-file helpers. |
| crates/goose-bench/src/eval_suites/core/mod.rs | Removes the “core” eval suite module. |
| crates/goose-bench/src/eval_suites/core/example.rs | Removes the example evaluation. |
| crates/goose-bench/src/eval_suites/core/memory/mod.rs | Removes memory eval suite module. |
| crates/goose-bench/src/eval_suites/core/memory/save_fact.rs | Removes memory “save fact” evaluation. |
| crates/goose-bench/src/eval_suites/core/developer/mod.rs | Removes developer eval suite module. |
| crates/goose-bench/src/eval_suites/core/developer/create_file.rs | Removes developer “create file” evaluation. |
| crates/goose-bench/src/eval_suites/core/developer/list_files.rs | Removes developer “list files” evaluation. |
| crates/goose-bench/src/eval_suites/core/developer/simple_repo_clone_test.rs | Removes developer “repo clone test” evaluation. |
| crates/goose-bench/src/eval_suites/core/developer_image/mod.rs | Removes developer image eval suite module. |
| crates/goose-bench/src/eval_suites/core/developer_image/image.rs | Removes developer “screen capture” evaluation. |
| crates/goose-bench/src/eval_suites/core/developer_search_replace/mod.rs | Removes developer search/replace eval suite module. |
| crates/goose-bench/src/eval_suites/core/developer_search_replace/search_replace.rs | Removes developer search/replace evaluation. |
| crates/goose-bench/src/eval_suites/core/computercontroller/mod.rs | Removes computer controller eval suite module. |
| crates/goose-bench/src/eval_suites/core/computercontroller/script.rs | Removes computer controller “script” evaluation. |
| crates/goose-bench/src/eval_suites/core/computercontroller/web_scrape.rs | Removes computer controller “web scrape” evaluation. |
| crates/goose-bench/src/eval_suites/vibes/mod.rs | Removes the “vibes” eval suite module. |
| crates/goose-bench/src/eval_suites/vibes/blog_summary.rs | Removes the blog summary evaluation. |
| crates/goose-bench/src/eval_suites/vibes/flappy_bird.rs | Removes the flappy bird evaluation. |
| crates/goose-bench/src/eval_suites/vibes/goose_wiki.rs | Removes the goose wiki evaluation. |
| crates/goose-bench/src/eval_suites/vibes/restaurant_research.rs | Removes the restaurant research evaluation. |
| crates/goose-bench/src/eval_suites/vibes/squirrel_census.rs | Removes the squirrel census evaluation. |
| crates/goose-bench/src/assets/kubernetes.patch | Removes benchmark asset patch file. |
| crates/goose-bench/src/assets/vscode.patch | Removes benchmark asset patch file. |
| crates/goose-bench/src/assets/vscode_config_registry.ts | Removes benchmark asset source file. |
| Cargo.lock | Removes goose-bench from the dependency graph lockfile. |
Files not reviewed (2)
- evals/open-model-gym/mcp-harness/package-lock.json: Language not supported
- evals/open-model-gym/suite/package-lock.json: Language not supported
| validate_network_auth(&host, &auth_token, no_auth); | ||
| crate::logging::setup_logging(Some("goose-web"), None)?; | ||
| let _ = crate::logging::setup_logging(Some("goose-web")); | ||
|
|
There was a problem hiding this comment.
setup_logging errors are being silently ignored here, which can hide failures to initialize file/telemetry logging; please at least emit the same warning used in main.rs (or propagate the error) so operators know logging is degraded.
| case "file_matches": { | ||
| const fullPath = join(workdir, rule.path); | ||
| if (!existsSync(fullPath)) { | ||
| return { passed: false, message: `File not found: ${rule.path}` }; | ||
| } | ||
| const content = readFileSync(fullPath, "utf-8"); | ||
| const regex = new RegExp(rule.regex); | ||
| const matches = regex.test(content); | ||
| return { | ||
| passed: matches, | ||
| message: matches | ||
| ? undefined | ||
| : `File ${rule.path} does not match regex: ${rule.regex}`, | ||
| }; |
There was a problem hiding this comment.
file_matches/file_not_matches create new RegExp(rule.regex) without handling invalid patterns, so a bad regex will throw and crash the whole run instead of returning a failed validation; wrap regex compilation in a try/catch and return a helpful failure message.
| // Check if any call matches the arg requirements | ||
| for (const call of matchingCalls) { | ||
| const args = call.arguments || {}; | ||
| let allMatch = true; | ||
|
|
||
| for (const [key, expected] of Object.entries(rule.args)) { | ||
| const actual = args[key]; | ||
| if (actual === undefined) { | ||
| allMatch = false; | ||
| break; | ||
| } | ||
|
|
||
| // If expected starts/ends with /, treat as regex pattern | ||
| if (typeof expected === "string" && expected.startsWith("/") && expected.endsWith("/")) { | ||
| const pattern = new RegExp(expected.slice(1, -1), "i"); | ||
| if (!pattern.test(String(actual))) { | ||
| allMatch = false; | ||
| break; | ||
| } | ||
| } else { | ||
| // Exact match (case-insensitive for strings) | ||
| const actualStr = String(actual).toLowerCase(); | ||
| const expectedStr = String(expected).toLowerCase(); | ||
| if (!actualStr.includes(expectedStr)) { | ||
| allMatch = false; | ||
| break; | ||
| } | ||
| } | ||
| } | ||
|
|
||
| if (allMatch) { | ||
| return { passed: true }; | ||
| } | ||
| } | ||
|
|
||
| return { | ||
| passed: false, | ||
| message: `Tool ${rule.tool} called but args didn't match: expected ${JSON.stringify(rule.args)}`, | ||
| }; |
There was a problem hiding this comment.
ValidationRule.tool_called.args is typed to allow RegExp, but validateRule doesn't actually handle RegExp values (it stringifies them and does a substring check); either implement proper expected instanceof RegExp matching or change the type to string to avoid misleading API consumers.
| case "command_succeeds": { | ||
| try { | ||
| execSync(rule.command, { cwd: workdir, stdio: "pipe" }); | ||
| return { passed: true }; | ||
| } catch (err) { | ||
| return { | ||
| passed: false, | ||
| message: `Command failed: ${rule.command}`, | ||
| }; | ||
| } |
There was a problem hiding this comment.
On command_succeeds failures, the returned message drops the underlying stderr/stdout, which makes debugging scenario failures hard; include the caught error’s output (or at least err.message) in the validation message.
| pub mod configure; | ||
| pub mod info; |
There was a problem hiding this comment.
Since the bench command/module is being removed, there are now stale references to crates/goose-bench in repo docs (e.g. AGENTS.md and benchmark blog posts); please update those references or clarify the new location so documentation doesn’t point at deleted paths.
| return new Date().toISOString(); | ||
| } | ||
|
|
||
| function randomDelay(): number { |
There was a problem hiding this comment.
Unused function randomDelay.
| @@ -0,0 +1,1504 @@ | |||
| #!/usr/bin/env node | |||
| import { mkdirSync, writeFileSync, rmSync, readdirSync, existsSync, copyFileSync } from "node:fs"; | |||
| import { join, basename, dirname } from "node:path"; | |||
There was a problem hiding this comment.
Unused import dirname.
| return `${pair.model.name}::${pair.runner.name}`; | ||
| } | ||
|
|
||
| function resultKey(result: TestResultWithLog): string { |
There was a problem hiding this comment.
Unused function resultKey.
| } | ||
|
|
||
| const getResult = (scenario: string, rowKey: string) => { | ||
| const [modelPart, runnerName] = rowKey.split("::"); |
There was a problem hiding this comment.
Unused variable modelPart.
alexhancock
left a comment
There was a problem hiding this comment.
Yeah this is good to do. Can experiment with a range of evals here.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 69 out of 76 changed files in this pull request and generated 6 comments.
Files not reviewed (2)
- evals/open-model-gym/mcp-harness/package-lock.json: Language not supported
- evals/open-model-gym/suite/package-lock.json: Language not supported
| case "file_exists": { | ||
| const fullPath = join(workdir, rule.path); | ||
| const exists = existsSync(fullPath); |
There was a problem hiding this comment.
rule.path is joined onto workdir without validation, so a scenario can use ../ or absolute paths to read/stat files outside the workspace; consider resolving the path and rejecting anything that escapes workdir (and optionally disallow absolute paths).
| const content = readFileSync(fullPath, "utf-8"); | ||
| const regex = new RegExp(rule.regex); | ||
| const matches = regex.test(content); |
There was a problem hiding this comment.
new RegExp(rule.regex) can throw on an invalid pattern and crash the whole run; wrap regex compilation in a try/catch and return a failed validation with an explanatory message instead.
| const content = readFileSync(fullPath, "utf-8"); | ||
| const regex = new RegExp(rule.regex); | ||
| const matches = regex.test(content); |
There was a problem hiding this comment.
new RegExp(rule.regex) can throw on an invalid pattern and crash the whole run; wrap regex compilation in a try/catch and return a failed validation with an explanatory message instead.
| // Exact match (case-insensitive for strings) | ||
| const actualStr = String(actual).toLowerCase(); | ||
| const expectedStr = String(expected).toLowerCase(); | ||
| if (!actualStr.includes(expectedStr)) { | ||
| allMatch = false; |
There was a problem hiding this comment.
The comment says this is an “Exact match” check, but the implementation uses includes, which can lead to false-positive validations; either change the check to strict equality (case-insensitive) or update the comment/behavior to reflect substring matching.
| | { type: "command_succeeds"; command: string; name?: string } | ||
| | { type: "tool_called"; tool: string; args?: Record<string, string | RegExp>; name?: string } | ||
| | { type: "custom"; fn: string; name?: string }; |
There was a problem hiding this comment.
tool_called.args is typed as allowing RegExp, but the validator only treats regex expectations as strings wrapped in slashes (e.g. "/foo/"); either update the type to match the on-disk YAML/JSON representation or add support for actual RegExp values in validateRule.
| result: result, | ||
| }; | ||
| const line = JSON.stringify(entry) + '\n'; | ||
| fs.appendFileSync(LOG_FILE, line); | ||
| } |
There was a problem hiding this comment.
appendFileSync will throw (and crash the server) if MCP_HARNESS_LOG points to a path whose parent directory doesn’t exist; ensure the directory is created (or catch/log the error) before writing.
codefromthecrypt
left a comment
There was a problem hiding this comment.
red is a lovely color
* origin/main: (54 commits) chore: strip posthog for sessions/models/daily only (#7079) tidy: clean up old benchmark and add gym (#7081) fix: use command.process_group(0) for CLI providers, not just MCP (#7083) added build notify (#6891) test(mcp): add image tool test and consolidate MCP test fixtures (#7019) fix: remove Option from model listing return types, propagate errors (#7074) fix: lazy provider creation for goose acp (#7026) (#7066) Smoke tests: split compaction test and use debug build (#6984) fix(deps): trim bat to resolve RUSTSEC-2024-0320 (#7061) feat: expose AGENT_SESSION_ID env var to extension child processes (#7072) fix: add XML tool call parsing fallback for Qwen3-coder via Ollama (#6882) Remove clippy too_many_lines lint and decompose long functions (#7064) refactor: move disable_session_naming into AgentConfig (#7062) Add global config switch to disable automatic session naming (#7052) docs: add blog post - 8 Things You Didn't Know About Code Mode (#7059) fix: ensure animated elements are visible when prefers-reduced-motion is enabled (#7047) Show recommended model on failture (#7040) feat(ui): add session content search via API (#7050) docs: fix img url (#7053) Desktop UI for deleting custom providers (#7042) ...
* origin/main: Docs: require auth optional for custom providers (#7098) fix: improve text-muted contrast for better readability (#7095) Always sync bundled extensions (#7057) feat: Add tom (Top Of Mind) platform extension (#7073) chore(docs): update GOOSE_SESSION_ID -> AGENT_SESSION_ID (#6669) fix(ci): switch from cargo-audit to cargo-deny for advisory scanning (#7032) chore(deps): bump @isaacs/brace-expansion from 5.0.0 to 5.0.1 in /evals/open-model-gym/suite (#7085) chore(deps): bump @modelcontextprotocol/sdk from 1.25.3 to 1.26.0 in /evals/open-model-gym/mcp-harness (#7086) fix: switch to windows msvc (#7080) fix: allow unlisted models for CLI providers (#7090) Use goose port (#7089) chore: strip posthog for sessions/models/daily only (#7079) tidy: clean up old benchmark and add gym (#7081) fix: use command.process_group(0) for CLI providers, not just MCP (#7083) added build notify (#6891)
* main: (125 commits) chore: add a new scenario (#7107) fix: Goose Desktop missing Calendar and Reminders entitlements (#7100) Fix 'Edit In Place' and 'Fork Session' features (#6970) Fix: Only send command content to command injection classifier (excluding part of tool call dict) (#7082) Docs: require auth optional for custom providers (#7098) fix: improve text-muted contrast for better readability (#7095) Always sync bundled extensions (#7057) feat: Add tom (Top Of Mind) platform extension (#7073) chore(docs): update GOOSE_SESSION_ID -> AGENT_SESSION_ID (#6669) fix(ci): switch from cargo-audit to cargo-deny for advisory scanning (#7032) chore(deps): bump @isaacs/brace-expansion from 5.0.0 to 5.0.1 in /evals/open-model-gym/suite (#7085) chore(deps): bump @modelcontextprotocol/sdk from 1.25.3 to 1.26.0 in /evals/open-model-gym/mcp-harness (#7086) fix: switch to windows msvc (#7080) fix: allow unlisted models for CLI providers (#7090) Use goose port (#7089) chore: strip posthog for sessions/models/daily only (#7079) tidy: clean up old benchmark and add gym (#7081) fix: use command.process_group(0) for CLI providers, not just MCP (#7083) added build notify (#6891) test(mcp): add image tool test and consolidate MCP test fixtures (#7019) ...
* main: (85 commits) Fix 'Edit In Place' and 'Fork Session' features (#6970) Fix: Only send command content to command injection classifier (excluding part of tool call dict) (#7082) Docs: require auth optional for custom providers (#7098) fix: improve text-muted contrast for better readability (#7095) Always sync bundled extensions (#7057) feat: Add tom (Top Of Mind) platform extension (#7073) chore(docs): update GOOSE_SESSION_ID -> AGENT_SESSION_ID (#6669) fix(ci): switch from cargo-audit to cargo-deny for advisory scanning (#7032) chore(deps): bump @isaacs/brace-expansion from 5.0.0 to 5.0.1 in /evals/open-model-gym/suite (#7085) chore(deps): bump @modelcontextprotocol/sdk from 1.25.3 to 1.26.0 in /evals/open-model-gym/mcp-harness (#7086) fix: switch to windows msvc (#7080) fix: allow unlisted models for CLI providers (#7090) Use goose port (#7089) chore: strip posthog for sessions/models/daily only (#7079) tidy: clean up old benchmark and add gym (#7081) fix: use command.process_group(0) for CLI providers, not just MCP (#7083) added build notify (#6891) test(mcp): add image tool test and consolidate MCP test fixtures (#7019) fix: remove Option from model listing return types, propagate errors (#7074) fix: lazy provider creation for goose acp (#7026) (#7066) ...
The old benchmark code was wired in to cli and I don't think we use it.
Also thought I would try moving the "gym" over to evals dir where we can park a bunch of things in there which are not tied in to the code directly but useful.