tidy: clean up old benchmark and add gym by michaelneale · Pull Request #7081 · block/goose

michaelneale · 2026-02-09T01:50:37Z

The old benchmark code was wired in to cli and I don't think we use it.

Also thought I would try moving the "gym" over to evals dir where we can park a bunch of things in there which are not tied in to the code directly but useful.

Copilot

Pull request overview

Removes the legacy goose-bench benchmarking integration from the Rust CLI, and adds a standalone “Open Model Gym” evaluation harness under evals/ for running scenario matrices across models/runners (including an MCP tool harness).

Changes:

Removed the bench CLI command path and the goose-bench crate/dependency.
Added evals/open-model-gym/ including a TypeScript suite runner, validators, and example scenarios.
Added an MCP harness (TypeScript MCP server) for simulating common tool integrations used by eval scenarios.

Reviewed changes

Copilot reviewed 69 out of 76 changed files in this pull request and generated 13 comments.

Show a summary per file

File	Description
evals/open-model-gym/config.yaml	Defines the models/runners/scenario matrix for the Open Model Gym runs.
evals/open-model-gym/Justfile	Adds `just` recipes to install/build/run/report the gym suite and MCP harness.
evals/open-model-gym/README.md	Documents how the Open Model Gym works, how to configure it, and how to run it.
evals/open-model-gym/.gitignore	Ignores build outputs, logs, caches, and runtime artifacts for the gym.
evals/open-model-gym/mcp-harness/README.md	Documents the simulated MCP server and its tool set.
evals/open-model-gym/mcp-harness/package.json	Defines MCP harness build/start scripts and dependencies.
evals/open-model-gym/mcp-harness/run.sh	Convenience script to build and start the MCP harness.
evals/open-model-gym/mcp-harness/src/index.ts	Implements the simulated MCP server and tool call logging.
evals/open-model-gym/mcp-harness/tsconfig.json	TypeScript configuration for building the MCP harness.
evals/open-model-gym/suite/.gitignore	Ignores suite-local artifacts (node_modules/workdir/cache/etc.).
evals/open-model-gym/suite/package.json	Defines suite build/run scripts for the scenario runner.
evals/open-model-gym/suite/package-lock.json	Locks suite npm dependencies for reproducible installs.
evals/open-model-gym/suite/tsconfig.json	TypeScript configuration for building the suite runner.
evals/open-model-gym/suite/src/runner.ts	Main suite runner: loads config/scenarios, runs agent commands, caches, and renders reports.
evals/open-model-gym/suite/src/types.ts	Shared types for scenarios, validations, and test results.
evals/open-model-gym/suite/src/validator.ts	Implements validation rules (file checks, regex checks, tool call checks, command checks).
evals/open-model-gym/suite/src/gym.png	Embedded image asset used by the generated report.
evals/open-model-gym/suite/scenarios/everyday-app-automation.yaml	Adds a multi-step scenario validating MCP tool usage across Slack/Jira/Calendar.
evals/open-model-gym/suite/scenarios/file-editing.yaml	Adds a single-turn code-editing scenario with file/regex/compile validations.
evals/open-model-gym/suite/scenarios/multi-turn-edit.yaml	Adds a multi-turn scenario validating session continuation behavior.
crates/goose-cli/Cargo.toml	Drops the `goose-bench` dependency from the CLI crate.
crates/goose-cli/src/cli.rs	Removes `bench` subcommand wiring; adjusts logging setup call sites.
crates/goose-cli/src/commands/mod.rs	Removes the `bench` command module export.
crates/goose-cli/src/commands/bench.rs	Removes the old benchmark CLI command implementation.
crates/goose-cli/src/commands/web.rs	Adjusts logging setup invocation for the web command.
crates/goose-cli/src/logging.rs	Simplifies logging setup by removing benchmark-specific error capture plumbing.
crates/goose-cli/src/main.rs	Updates to the new `setup_logging` signature.
crates/goose-bench/Cargo.toml	Removes the `goose-bench` crate manifest.
crates/goose-bench/README.md	Removes legacy benchmark crate documentation.
crates/goose-bench/src/lib.rs	Removes benchmark crate module exports.
crates/goose-bench/src/bench_config.rs	Removes benchmark configuration types/serialization.
crates/goose-bench/src/bench_session.rs	Removes benchmark session abstraction used by evals.
crates/goose-bench/src/bench_work_dir.rs	Removes benchmark workdir management utilities.
crates/goose-bench/src/error_capture.rs	Removes benchmark tracing layer that captured errors into vectors.
crates/goose-bench/src/reporting.rs	Removes benchmark result/reporting structures.
crates/goose-bench/src/utilities.rs	Removes benchmark process/thread helpers.
crates/goose-bench/src/runners/mod.rs	Removes benchmark runner module exports.
crates/goose-bench/src/runners/bench_runner.rs	Removes the benchmark suite runner orchestration.
crates/goose-bench/src/runners/eval_runner.rs	Removes the eval runner that executed a single evaluation.
crates/goose-bench/src/runners/metric_aggregator.rs	Removes the Python-postprocessing leaderboard generation wrapper.
crates/goose-bench/src/runners/model_runner.rs	Removes the model runner that executed suites across runs/models.
crates/goose-bench/src/eval_suites/mod.rs	Removes eval suite module structure.
crates/goose-bench/src/eval_suites/evaluation.rs	Removes core eval trait and metric types.
crates/goose-bench/src/eval_suites/factory.rs	Removes the eval registry and selector matching logic.
crates/goose-bench/src/eval_suites/metrics.rs	Removes baseline metric collection utilities.
crates/goose-bench/src/eval_suites/utils.rs	Removes eval output-to-file helpers.
crates/goose-bench/src/eval_suites/core/mod.rs	Removes the “core” eval suite module.
crates/goose-bench/src/eval_suites/core/example.rs	Removes the example evaluation.
crates/goose-bench/src/eval_suites/core/memory/mod.rs	Removes memory eval suite module.
crates/goose-bench/src/eval_suites/core/memory/save_fact.rs	Removes memory “save fact” evaluation.
crates/goose-bench/src/eval_suites/core/developer/mod.rs	Removes developer eval suite module.
crates/goose-bench/src/eval_suites/core/developer/create_file.rs	Removes developer “create file” evaluation.
crates/goose-bench/src/eval_suites/core/developer/list_files.rs	Removes developer “list files” evaluation.
crates/goose-bench/src/eval_suites/core/developer/simple_repo_clone_test.rs	Removes developer “repo clone test” evaluation.
crates/goose-bench/src/eval_suites/core/developer_image/mod.rs	Removes developer image eval suite module.
crates/goose-bench/src/eval_suites/core/developer_image/image.rs	Removes developer “screen capture” evaluation.
crates/goose-bench/src/eval_suites/core/developer_search_replace/mod.rs	Removes developer search/replace eval suite module.
crates/goose-bench/src/eval_suites/core/developer_search_replace/search_replace.rs	Removes developer search/replace evaluation.
crates/goose-bench/src/eval_suites/core/computercontroller/mod.rs	Removes computer controller eval suite module.
crates/goose-bench/src/eval_suites/core/computercontroller/script.rs	Removes computer controller “script” evaluation.
crates/goose-bench/src/eval_suites/core/computercontroller/web_scrape.rs	Removes computer controller “web scrape” evaluation.
crates/goose-bench/src/eval_suites/vibes/mod.rs	Removes the “vibes” eval suite module.
crates/goose-bench/src/eval_suites/vibes/blog_summary.rs	Removes the blog summary evaluation.
crates/goose-bench/src/eval_suites/vibes/flappy_bird.rs	Removes the flappy bird evaluation.
crates/goose-bench/src/eval_suites/vibes/goose_wiki.rs	Removes the goose wiki evaluation.
crates/goose-bench/src/eval_suites/vibes/restaurant_research.rs	Removes the restaurant research evaluation.
crates/goose-bench/src/eval_suites/vibes/squirrel_census.rs	Removes the squirrel census evaluation.
crates/goose-bench/src/assets/kubernetes.patch	Removes benchmark asset patch file.
crates/goose-bench/src/assets/vscode.patch	Removes benchmark asset patch file.
crates/goose-bench/src/assets/vscode_config_registry.ts	Removes benchmark asset source file.
Cargo.lock	Removes `goose-bench` from the dependency graph lockfile.

Files not reviewed (2)

evals/open-model-gym/mcp-harness/package-lock.json: Language not supported
evals/open-model-gym/suite/package-lock.json: Language not supported

Copilot · 2026-02-09T01:56:16Z

crates/goose-cli/src/commands/web.rs

    validate_network_auth(&host, &auth_token, no_auth);
-    crate::logging::setup_logging(Some("goose-web"), None)?;
+    let _ = crate::logging::setup_logging(Some("goose-web"));



setup_logging errors are being silently ignored here, which can hide failures to initialize file/telemetry logging; please at least emit the same warning used in main.rs (or propagate the error) so operators know logging is degraded.

evals/open-model-gym/Justfile

Copilot · 2026-02-09T01:56:17Z

evals/open-model-gym/suite/src/validator.ts

+    case "file_matches": {
+      const fullPath = join(workdir, rule.path);
+      if (!existsSync(fullPath)) {
+        return { passed: false, message: `File not found: ${rule.path}` };
+      }
+      const content = readFileSync(fullPath, "utf-8");
+      const regex = new RegExp(rule.regex);
+      const matches = regex.test(content);
+      return {
+        passed: matches,
+        message: matches
+          ? undefined
+          : `File ${rule.path} does not match regex: ${rule.regex}`,
+      };


file_matches/file_not_matches create new RegExp(rule.regex) without handling invalid patterns, so a bad regex will throw and crash the whole run instead of returning a failed validation; wrap regex compilation in a try/catch and return a helpful failure message.

Copilot · 2026-02-09T01:56:17Z

evals/open-model-gym/suite/src/validator.ts

+      // Check if any call matches the arg requirements
+      for (const call of matchingCalls) {
+        const args = call.arguments || {};
+        let allMatch = true;
+
+        for (const [key, expected] of Object.entries(rule.args)) {
+          const actual = args[key];
+          if (actual === undefined) {
+            allMatch = false;
+            break;
+          }
+
+          // If expected starts/ends with /, treat as regex pattern
+          if (typeof expected === "string" && expected.startsWith("/") && expected.endsWith("/")) {
+            const pattern = new RegExp(expected.slice(1, -1), "i");
+            if (!pattern.test(String(actual))) {
+              allMatch = false;
+              break;
+            }
+          } else {
+            // Exact match (case-insensitive for strings)
+            const actualStr = String(actual).toLowerCase();
+            const expectedStr = String(expected).toLowerCase();
+            if (!actualStr.includes(expectedStr)) {
+              allMatch = false;
+              break;
+            }
+          }
+        }
+
+        if (allMatch) {
+          return { passed: true };
+        }
+      }
+
+      return {
+        passed: false,
+        message: `Tool ${rule.tool} called but args didn't match: expected ${JSON.stringify(rule.args)}`,
+      };


ValidationRule.tool_called.args is typed to allow RegExp, but validateRule doesn't actually handle RegExp values (it stringifies them and does a substring check); either implement proper expected instanceof RegExp matching or change the type to string to avoid misleading API consumers.

Copilot · 2026-02-09T01:56:17Z

evals/open-model-gym/suite/src/validator.ts

+    case "command_succeeds": {
+      try {
+        execSync(rule.command, { cwd: workdir, stdio: "pipe" });
+        return { passed: true };
+      } catch (err) {
+        return {
+          passed: false,
+          message: `Command failed: ${rule.command}`,
+        };
+      }


On command_succeeds failures, the returned message drops the underlying stderr/stdout, which makes debugging scenario failures hard; include the caught error’s output (or at least err.message) in the validation message.

Copilot · 2026-02-09T01:56:18Z

crates/goose-cli/src/commands/mod.rs

 pub mod configure;
 pub mod info;


Since the bench command/module is being removed, there are now stale references to crates/goose-bench in repo docs (e.g. AGENTS.md and benchmark blog posts); please update those references or clarify the new location so documentation doesn’t point at deleted paths.

Copilot · 2026-02-09T01:56:18Z

evals/open-model-gym/mcp-harness/src/index.ts

+  return new Date().toISOString();
+}
+
+function randomDelay(): number {


Unused function randomDelay.

Copilot · 2026-02-09T01:56:18Z

evals/open-model-gym/suite/src/runner.ts

@@ -0,0 +1,1504 @@
+#!/usr/bin/env node
+import { mkdirSync, writeFileSync, rmSync, readdirSync, existsSync, copyFileSync } from "node:fs";
+import { join, basename, dirname } from "node:path";


Unused import dirname.

Copilot · 2026-02-09T01:56:19Z

evals/open-model-gym/suite/src/runner.ts

+  return `${pair.model.name}::${pair.runner.name}`;
+}
+
+function resultKey(result: TestResultWithLog): string {


Unused function resultKey.

Copilot · 2026-02-09T01:56:19Z

evals/open-model-gym/suite/src/runner.ts

+  }
+
+  const getResult = (scenario: string, rowKey: string) => {
+    const [modelPart, runnerName] = rowKey.split("::");


Unused variable modelPart.

alexhancock

Yeah this is good to do. Can experiment with a range of evals here.

Copilot

Pull request overview

Copilot reviewed 69 out of 76 changed files in this pull request and generated 6 comments.

Files not reviewed (2)

evals/open-model-gym/mcp-harness/package-lock.json: Language not supported
evals/open-model-gym/suite/package-lock.json: Language not supported

Copilot · 2026-02-09T03:48:53Z

evals/open-model-gym/suite/src/validator.ts

+    case "file_exists": {
+      const fullPath = join(workdir, rule.path);
+      const exists = existsSync(fullPath);


rule.path is joined onto workdir without validation, so a scenario can use ../ or absolute paths to read/stat files outside the workspace; consider resolving the path and rejecting anything that escapes workdir (and optionally disallow absolute paths).

Copilot · 2026-02-09T03:48:53Z

evals/open-model-gym/suite/src/validator.ts

+      const content = readFileSync(fullPath, "utf-8");
+      const regex = new RegExp(rule.regex);
+      const matches = regex.test(content);


new RegExp(rule.regex) can throw on an invalid pattern and crash the whole run; wrap regex compilation in a try/catch and return a failed validation with an explanatory message instead.

Copilot · 2026-02-09T03:48:53Z

evals/open-model-gym/suite/src/validator.ts

+      const content = readFileSync(fullPath, "utf-8");
+      const regex = new RegExp(rule.regex);
+      const matches = regex.test(content);


new RegExp(rule.regex) can throw on an invalid pattern and crash the whole run; wrap regex compilation in a try/catch and return a failed validation with an explanatory message instead.

Copilot · 2026-02-09T03:48:54Z

evals/open-model-gym/suite/src/validator.ts

+            // Exact match (case-insensitive for strings)
+            const actualStr = String(actual).toLowerCase();
+            const expectedStr = String(expected).toLowerCase();
+            if (!actualStr.includes(expectedStr)) {
+              allMatch = false;


The comment says this is an “Exact match” check, but the implementation uses includes, which can lead to false-positive validations; either change the check to strict equality (case-insensitive) or update the comment/behavior to reflect substring matching.

Copilot · 2026-02-09T03:48:54Z

evals/open-model-gym/suite/src/types.ts

+  | { type: "command_succeeds"; command: string; name?: string }
+  | { type: "tool_called"; tool: string; args?: Record<string, string | RegExp>; name?: string }
+  | { type: "custom"; fn: string; name?: string };


tool_called.args is typed as allowing RegExp, but the validator only treats regex expectations as strings wrapped in slashes (e.g. "/foo/"); either update the type to match the on-disk YAML/JSON representation or add support for actual RegExp values in validateRule.

Copilot · 2026-02-09T03:48:54Z

evals/open-model-gym/mcp-harness/src/index.ts

+    result: result,
+  };
+  const line = JSON.stringify(entry) + '\n';
+  fs.appendFileSync(LOG_FILE, line);
+}


appendFileSync will throw (and crash the server) if MCP_HARNESS_LOG points to a path whose parent directory doesn’t exist; ensure the directory is created (or catch/log the error) before writing.

codefromthecrypt

red is a lovely color

* origin/main: (54 commits) chore: strip posthog for sessions/models/daily only (#7079) tidy: clean up old benchmark and add gym (#7081) fix: use command.process_group(0) for CLI providers, not just MCP (#7083) added build notify (#6891) test(mcp): add image tool test and consolidate MCP test fixtures (#7019) fix: remove Option from model listing return types, propagate errors (#7074) fix: lazy provider creation for goose acp (#7026) (#7066) Smoke tests: split compaction test and use debug build (#6984) fix(deps): trim bat to resolve RUSTSEC-2024-0320 (#7061) feat: expose AGENT_SESSION_ID env var to extension child processes (#7072) fix: add XML tool call parsing fallback for Qwen3-coder via Ollama (#6882) Remove clippy too_many_lines lint and decompose long functions (#7064) refactor: move disable_session_naming into AgentConfig (#7062) Add global config switch to disable automatic session naming (#7052) docs: add blog post - 8 Things You Didn't Know About Code Mode (#7059) fix: ensure animated elements are visible when prefers-reduced-motion is enabled (#7047) Show recommended model on failture (#7040) feat(ui): add session content search via API (#7050) docs: fix img url (#7053) Desktop UI for deleting custom providers (#7042) ...

* origin/main: Docs: require auth optional for custom providers (#7098) fix: improve text-muted contrast for better readability (#7095) Always sync bundled extensions (#7057) feat: Add tom (Top Of Mind) platform extension (#7073) chore(docs): update GOOSE_SESSION_ID -> AGENT_SESSION_ID (#6669) fix(ci): switch from cargo-audit to cargo-deny for advisory scanning (#7032) chore(deps): bump @isaacs/brace-expansion from 5.0.0 to 5.0.1 in /evals/open-model-gym/suite (#7085) chore(deps): bump @modelcontextprotocol/sdk from 1.25.3 to 1.26.0 in /evals/open-model-gym/mcp-harness (#7086) fix: switch to windows msvc (#7080) fix: allow unlisted models for CLI providers (#7090) Use goose port (#7089) chore: strip posthog for sessions/models/daily only (#7079) tidy: clean up old benchmark and add gym (#7081) fix: use command.process_group(0) for CLI providers, not just MCP (#7083) added build notify (#6891)

* main: (125 commits) chore: add a new scenario (#7107) fix: Goose Desktop missing Calendar and Reminders entitlements (#7100) Fix 'Edit In Place' and 'Fork Session' features (#6970) Fix: Only send command content to command injection classifier (excluding part of tool call dict) (#7082) Docs: require auth optional for custom providers (#7098) fix: improve text-muted contrast for better readability (#7095) Always sync bundled extensions (#7057) feat: Add tom (Top Of Mind) platform extension (#7073) chore(docs): update GOOSE_SESSION_ID -> AGENT_SESSION_ID (#6669) fix(ci): switch from cargo-audit to cargo-deny for advisory scanning (#7032) chore(deps): bump @isaacs/brace-expansion from 5.0.0 to 5.0.1 in /evals/open-model-gym/suite (#7085) chore(deps): bump @modelcontextprotocol/sdk from 1.25.3 to 1.26.0 in /evals/open-model-gym/mcp-harness (#7086) fix: switch to windows msvc (#7080) fix: allow unlisted models for CLI providers (#7090) Use goose port (#7089) chore: strip posthog for sessions/models/daily only (#7079) tidy: clean up old benchmark and add gym (#7081) fix: use command.process_group(0) for CLI providers, not just MCP (#7083) added build notify (#6891) test(mcp): add image tool test and consolidate MCP test fixtures (#7019) ...

* main: (85 commits) Fix 'Edit In Place' and 'Fork Session' features (#6970) Fix: Only send command content to command injection classifier (excluding part of tool call dict) (#7082) Docs: require auth optional for custom providers (#7098) fix: improve text-muted contrast for better readability (#7095) Always sync bundled extensions (#7057) feat: Add tom (Top Of Mind) platform extension (#7073) chore(docs): update GOOSE_SESSION_ID -> AGENT_SESSION_ID (#6669) fix(ci): switch from cargo-audit to cargo-deny for advisory scanning (#7032) chore(deps): bump @isaacs/brace-expansion from 5.0.0 to 5.0.1 in /evals/open-model-gym/suite (#7085) chore(deps): bump @modelcontextprotocol/sdk from 1.25.3 to 1.26.0 in /evals/open-model-gym/mcp-harness (#7086) fix: switch to windows msvc (#7080) fix: allow unlisted models for CLI providers (#7090) Use goose port (#7089) chore: strip posthog for sessions/models/daily only (#7079) tidy: clean up old benchmark and add gym (#7081) fix: use command.process_group(0) for CLI providers, not just MCP (#7083) added build notify (#6891) test(mcp): add image tool test and consolidate MCP test fixtures (#7019) fix: remove Option from model listing return types, propagate errors (#7074) fix: lazy provider creation for goose acp (#7026) (#7066) ...

clean up old benchmark and add gym

5fea265

Copilot AI review requested due to automatic review settings February 9, 2026 01:50

Copilot started reviewing on behalf of michaelneale February 9, 2026 01:51 View session

Copilot AI reviewed Feb 9, 2026

View reviewed changes

alexhancock approved these changes Feb 9, 2026

View reviewed changes

michaelneale added 2 commits February 9, 2026 14:34

don't need this

ac7c9f7

oops

db3c80d

Copilot AI review requested due to automatic review settings February 9, 2026 03:44

Copilot started reviewing on behalf of michaelneale February 9, 2026 03:44 View session

Copilot AI reviewed Feb 9, 2026

View reviewed changes

codefromthecrypt added this pull request to the merge queue Feb 9, 2026

codefromthecrypt approved these changes Feb 9, 2026

View reviewed changes

Merged via the queue into main with commit a3ba124 Feb 9, 2026
25 checks passed

codefromthecrypt deleted the micn/bench-cleanup branch February 9, 2026 06:14

github-actions bot mentioned this pull request Feb 10, 2026

chore(release): release version 1.24.0 (minor) #7102

Open

Tyler-Hardin pushed a commit to Tyler-Hardin/goose that referenced this pull request Feb 11, 2026

tidy: clean up old benchmark and add gym (block#7081)

3521e08

Tyler-Hardin pushed a commit to Tyler-Hardin/goose that referenced this pull request Feb 11, 2026

tidy: clean up old benchmark and add gym (block#7081)

e0f9087

Conversation

michaelneale commented Feb 9, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

Uh oh!

alexhancock left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

Uh oh!

codefromthecrypt left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants