Test/benchmarking (#59)

acgetchell · web-flow · commit fda43d76af0c · 2025-09-02T10:48:50.000-07:00
* Simplifies benchmark workflow and reporting

Refactors the benchmark workflow to rely solely on release baselines for performance comparison.

This removes the generation of fallback baselines during pull request workflows, streamlining the process and ensuring comparisons are made against accurate, full-setting benchmarks.

Updates the reporting to provide clearer guidance on how to enable performance regression testing via release tags.

Reduces baseline artifact retention from 365 days (1 year) to 90 days to match the repository maximum.

* Improves performance baseline artifact retrieval

Modifies the workflow to improve the way performance baseline artifacts are located.

The workflow now first attempts to find baseline artifacts associated with release tags by searching recent successful workflow runs of 'generate-baseline.yml'.

If no release-tagged baseline is found, it falls back to searching for any recent baseline artifact generated by the same workflow, regardless of its origin (e.g., manual trigger).

This ensures that a baseline is found even when no tagged release is available or the baseline was generated manually.
The failure message shown when no baseline is found now includes clearer instructions on how to generate one.
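
The two-stage lookup described above could be sketched roughly as follows. This is a simplified, offline illustration rather than the workflow's actual script: `selectBaseline`, the sample cache, and the artifact name `performance-baseline-manual` are hypothetical, while the tag sanitization regex and the `source_type` values follow the diff.

```javascript
// Sketch only: `selectBaseline` and the sample data are hypothetical.
// `cache` maps artifact name -> { run_id, run_created_at }.
function selectBaseline(releaseTags, cache) {
  // Stage 1: prefer a baseline whose name matches a release tag.
  for (const tag of releaseTags) {
    const name = `performance-baseline-${tag.replace(/[^a-zA-Z0-9._-]/g, '_')}`;
    if (cache.has(name)) {
      return { name, source_type: 'release', release_tag: tag, ...cache.get(name) };
    }
  }
  // Stage 2: fall back to whichever cached artifact came from the newest run.
  let best = null;
  for (const [name, info] of cache) {
    if (best === null || new Date(info.run_created_at) > new Date(best.run_created_at)) {
      best = { name, source_type: 'manual', release_tag: 'manual-baseline', ...info };
    }
  }
  return best; // null when the cache is empty
}

// Hypothetical cache contents for illustration.
const cache = new Map([
  ['performance-baseline-v0.4.2', { run_id: 11, run_created_at: '2025-08-01T00:00:00Z' }],
  ['performance-baseline-manual', { run_id: 12, run_created_at: '2025-08-20T00:00:00Z' }],
]);

console.log(selectBaseline(['v0.4.3', 'v0.4.2'], cache).source_type);
console.log(selectBaseline(['v0.5.0'], cache).name);
```

Release tags are tried in the order given, so a release baseline is chosen even when a manual baseline is newer.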

* Improves benchmark baseline handling

Ensures that expired artifacts are ignored when searching for performance baselines. This prevents the benchmark workflow from erroneously using outdated data.

Adds more detailed logging about the baseline source, differentiating between baselines from artifacts and releases. The origin of the baseline (release or manual) is now included in the comparison output and summary.
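
As a rough illustration of the expired-artifact rule, the filter below keeps only baseline artifacts that are not marked as expired. The helper name `usableBaselines` and the sample artifacts are hypothetical; the `performance-baseline-` prefix and the `expired !== true` check follow the diff.

```javascript
// Sketch only: `usableBaselines` and the sample artifacts are hypothetical.
// Each artifact is assumed to carry the `name` and `expired` fields the workflow reads.
function usableBaselines(artifacts) {
  return artifacts.filter(
    (a) => a.name.startsWith('performance-baseline-') && a.expired !== true
  );
}

const sampleArtifacts = [
  { name: 'performance-baseline-v0.4.1', expired: true },  // expired: ignored
  { name: 'performance-baseline-v0.4.2', expired: false }, // usable
  { name: 'coverage-report', expired: false },             // not a baseline: ignored
];

console.log(usableBaselines(sampleArtifacts).map((a) => a.name));
```

Checking `expired !== true` rather than `expired === false` means an artifact with no `expired` field is still treated as usable.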

* Improves baseline artifact lookup efficiency

Optimizes the benchmark workflow by caching baseline artifacts from recent successful workflow runs. This significantly reduces the number of API calls required to locate the correct baseline, particularly for release baselines. It also falls back to the most recent baseline artifact if no release baseline is found.

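Also, the `compare` command now runs inside an `if`/`elif` that checks whether `benchmark-utils` is available as an entrypoint or as the `scripts.benchmark_utils` module, and exits with code 2 if neither is found. Unlike the previous `||` chain, a failing comparison is no longer retried through the module fallback.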

* Improves benchmark workflow reliability

Limits the number of workflow runs fetched to avoid excessive API calls and ensures that only unique, non-expired baseline artifacts are cached, enhancing the reliability and efficiency of the benchmark workflow.

Fixes a bug where the oldest baseline artifact was used: GitHub lists workflow runs newest first, so the first match for each artifact name is now preferred.
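
A minimal sketch of the two ideas above, capping the number of runs and keeping the first (newest) entry per artifact name. It works on plain in-memory data instead of the GitHub API; `takeUpTo`, `buildCache`, and the `artifactNames` field are hypothetical simplifications.

```javascript
// Sketch only: helper names and the `artifactNames` field are hypothetical.
// Take at most `limit` items across several pages, without overshooting.
function takeUpTo(pages, limit) {
  const out = [];
  for (const page of pages) {
    const remaining = limit - out.length;
    if (remaining <= 0) break;
    out.push(...page.slice(0, remaining));
  }
  return out;
}

// Runs are assumed to arrive newest first, so the first entry seen for a name wins.
function buildCache(runsNewestFirst) {
  const cache = new Map();
  for (const run of runsNewestFirst) {
    for (const name of run.artifactNames) {
      if (!cache.has(name)) {
        cache.set(name, { run_id: run.id, run_created_at: run.created_at });
      }
    }
  }
  return cache;
}

const sampleRuns = [
  { id: 30, created_at: '2025-08-30T00:00:00Z', artifactNames: ['performance-baseline-v0.4.2'] },
  { id: 10, created_at: '2025-08-10T00:00:00Z',
    artifactNames: ['performance-baseline-v0.4.2', 'performance-baseline-v0.4.1'] },
];

console.log(takeUpTo([[1, 2, 3], [4, 5, 6]], 4));
console.log(buildCache(sampleRuns).get('performance-baseline-v0.4.2').run_id);
```

Overwriting existing entries instead of skipping them would leave the oldest run in the cache, which is the behavior the fix removes.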

* Improves benchmark baseline artifact retrieval

Optimizes the retrieval of baseline artifacts by pre-computing expected release baseline names and short-circuiting the search when a release baseline is found.

This significantly reduces the number of API calls and speeds up the process, especially when release baselines are available.
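
The precompute-then-short-circuit idea might look roughly like this. The sketch runs on in-memory sample data; `expectedBaselineNames`, `scanRuns`, and the `artifactNames` field are hypothetical, while the sanitization regex and the draft/prerelease filter follow the diff.

```javascript
// Sketch only: helper names, `artifactNames`, and the sample data are hypothetical.
// Build the set of artifact names that published releases are expected to have.
function expectedBaselineNames(releases) {
  const names = new Set();
  for (const release of releases) {
    if (!release.draft && !release.prerelease) {
      const cleanTag = release.tag_name.replace(/[^a-zA-Z0-9._-]/g, '_');
      names.add(`performance-baseline-${cleanTag}`);
    }
  }
  return names;
}

// Walk runs in order and stop at the first one that holds an expected name.
function scanRuns(runs, expected) {
  let inspected = 0;
  for (const run of runs) {
    inspected += 1;
    const hit = run.artifactNames.find((name) => expected.has(name));
    if (hit !== undefined) return { hit, inspected };
  }
  return { hit: null, inspected };
}

const sampleReleases = [
  { tag_name: 'v0.4.3', draft: false, prerelease: false },
  { tag_name: 'v0.5.0-rc1', draft: false, prerelease: true }, // skipped
  { tag_name: 'release/1.0', draft: false, prerelease: false }, // '/' becomes '_'
];
const runsToScan = [
  { id: 3, artifactNames: ['performance-baseline-manual'] },
  { id: 2, artifactNames: ['performance-baseline-v0.4.3'] },
  { id: 1, artifactNames: ['performance-baseline-v0.4.2'] },
];

const expectedNames = expectedBaselineNames(sampleReleases);
console.log([...expectedNames]);
console.log(scanRuns(runsToScan, expectedNames));
```

Because membership in a `Set` is a constant-time check, each run costs one artifact listing and the scan ends as soon as a release baseline appears.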

Also exports `BASELINE_TAG` to the environment via `$GITHUB_ENV`, making it available for subsequent steps.
diff --git a/.github/workflows/benchmarks.yml b/.github/workflows/benchmarks.yml
@@ -68,73 +68,151 @@ jobs:
         run: uv --version
 
 
-      - name: Find latest release with baseline artifact
+      - name: Find baseline artifact
         id: find_baseline
         uses: actions/github-script@60a0d83039c74a4aee543508d2ffcb1c3799cdea  # v7.0.1
         with:
           script: |
             try {
-              // Get all releases with pagination to handle large repos
+              // Precompute expected baseline names for early-hit optimization
               const releases = await github.paginate(
                 github.rest.repos.listReleases,
-                { owner: context.repo.owner, repo: context.repo.repo, per_page: 100 }
+                { owner: context.repo.owner, repo: context.repo.repo, per_page: 50 }
               );
 
-              console.log(`Found ${releases.length} releases (paginated)`);
-
-              // Look for releases with baseline artifacts
+              const expectedReleaseBaselines = new Set();
               for (const release of releases) {
-                console.log(`Checking release ${release.tag_name}...`);
-                if (release.draft || release.prerelease) {
-                  console.log(`Skipping draft/prerelease ${release.tag_name}`);
-                  continue;
+                if (!release.draft && !release.prerelease) {
+                  const cleanTag = release.tag_name.replace(/[^a-zA-Z0-9._-]/g, '_');
+                  expectedReleaseBaselines.add(`performance-baseline-${cleanTag}`);
                 }
+              }
+              console.log(`Precomputed ${expectedReleaseBaselines.size} expected release baseline names`);
+
+              // Fetch all successful generate-baseline.yml runs once (O(runs) API calls)
+              console.log('Fetching recent generate-baseline.yml runs...');
+              let count = 0;
+              const runs = await github.paginate(
+                github.rest.actions.listWorkflowRuns,
+                {
+                  owner: context.repo.owner,
+                  repo: context.repo.repo,
+                  workflow_id: 'generate-baseline.yml',
+                  status: 'completed',
+                  conclusion: 'success',
+                  per_page: 100
+                },
+                (response, done) => {
+                  // Limit to 150 runs total across pages (no overshoot)
+                  const remaining = Math.max(0, 150 - count);
+                  if (remaining === 0) { done(); return []; }
+                  const slice = response.data.slice(0, remaining);
+                  count += slice.length;
+                  if (count >= 150) done();
+                  return slice;
+                }
+              );
 
-                // List recent runs and look for an artifact matching the release tag
+              console.log(`Found ${runs.length} successful generate-baseline runs`);
+
+              // Build artifact cache: artifact name → {run_id, run_created_at}
+              const artifactCache = new Map();
+              let foundReleaseBaseline = false;
+              for (const run of runs) {
                 try {
-                  // Mirror sanitization in generate-baseline.yml (allow [A-Za-z0-9._-], replace others with _)
-                  const cleanTag = release.tag_name.replace(/[^a-zA-Z0-9._-]/g, '_');
-                  const expectedName = `performance-baseline-${cleanTag}`;
-                  let found = false;
-                  for (let page = 1; page <= 5 && !found; page++) {
-                    const runs = await github.rest.actions.listWorkflowRuns({
-                      owner: context.repo.owner,
-                      repo: context.repo.repo,
-                      workflow_id: 'generate-baseline.yml',
-                      status: 'completed',
-                      conclusion: 'success',
-                      per_page: 100,
-                      page
-                    });
-                    for (const run of runs.data.workflow_runs) {
-                      const artifacts = await github.rest.actions.listWorkflowRunArtifacts({
-                        owner: context.repo.owner,
-                        repo: context.repo.repo,
-                        run_id: run.id
-                      });
-                      const baselineArtifact = artifacts.data.artifacts.find(a => a.name === expectedName);
-                      if (baselineArtifact) {
-                        console.log(
-                          `Found baseline artifact ${expectedName} in run ${run.id} ` +
-                          `for release ${release.tag_name}`
-                        );
-                        core.setOutput('found', 'true');
-                        core.setOutput('release_tag', release.tag_name);
-                        core.setOutput('artifact_name', expectedName);
-                        core.setOutput('run_id', run.id.toString());
-                        found = true;
-                        break;
+                  const artifacts = await github.rest.actions.listWorkflowRunArtifacts({
+                    owner: context.repo.owner,
+                    repo: context.repo.repo,
+                    run_id: run.id
+                  });
+
+                  for (const artifact of artifacts.data.artifacts) {
+                    if (artifact.name.startsWith('performance-baseline-') && artifact.expired !== true) {
+                      if (!artifactCache.has(artifact.name)) {
+                        artifactCache.set(artifact.name, {
+                          run_id: run.id,
+                          run_created_at: run.created_at
+                        });
+
+                        // Early-hit optimization: stop if we found a release baseline
+                        if (expectedReleaseBaselines.has(artifact.name)) {
+                          console.log(`Early hit: found release baseline ${artifact.name}, stopping search`);
+                          foundReleaseBaseline = true;
+                        }
                       }
                     }
                   }
-                  if (found) return;
+
+                  // Short-circuit if we found a release baseline
+                  if (foundReleaseBaseline) break;
                 } catch (error) {
-                  console.log(`Error checking release ${release.tag_name}: ${error.message}`);
+                  console.log(`Warning: Could not fetch artifacts for run ${run.id}: ${error.message}`);
                   continue;
                 }
               }
 
-              console.log('No baseline artifacts found in any release');
+              console.log(`Built cache of ${artifactCache.size} baseline artifacts`);
+
+              console.log(`Found ${releases.length} releases (already fetched)`);
+
+              // Look for releases with baseline artifacts (now O(releases) lookups)
+              for (const release of releases) {
+                console.log(`Checking release ${release.tag_name}...`);
+                if (release.draft || release.prerelease) {
+                  console.log(`Skipping draft/prerelease ${release.tag_name}`);
+                  continue;
+                }
+
+                // Must match sanitize step in .github/workflows/generate-baseline.yml
+                const cleanTag = release.tag_name.replace(/[^a-zA-Z0-9._-]/g, '_');
+                const expectedName = `performance-baseline-${cleanTag}`;
+
+                if (artifactCache.has(expectedName)) {
+                  const artifactInfo = artifactCache.get(expectedName);
+                  console.log(
+                    `Found release baseline artifact ${expectedName} in run ${artifactInfo.run_id} ` +
+                    `for release ${release.tag_name}`
+                  );
+                  core.setOutput('found', 'true');
+                  core.setOutput('release_tag', release.tag_name);
+                  core.setOutput('artifact_name', expectedName);
+                  core.setOutput('run_id', artifactInfo.run_id.toString());
+                  core.setOutput('source_type', 'release');
+                  return;
+                }
+              }
+
+              console.log('No release baseline artifacts found, checking for any recent baselines...');
+
+              // Fallback: look for any recent baseline artifact (including manual runs)
+              if (artifactCache.size > 0) {
+                // Find the most recent artifact (by run creation time)
+                let mostRecentArtifact = null;
+                let mostRecentTime = null;
+
+                for (const [artifactName, artifactInfo] of artifactCache.entries()) {
+                  const runTime = new Date(artifactInfo.run_created_at);
+                  if (!mostRecentTime || runTime > mostRecentTime) {
+                    mostRecentTime = runTime;
+                    mostRecentArtifact = { name: artifactName, ...artifactInfo };
+                  }
+                }
+
+                if (mostRecentArtifact) {
+                  console.log(
+                    `Found baseline artifact ${mostRecentArtifact.name} in run ${mostRecentArtifact.run_id} ` +
+                    `(created: ${mostRecentArtifact.run_created_at})`
+                  );
+                  core.setOutput('found', 'true');
+                  core.setOutput('release_tag', 'manual-baseline');
+                  core.setOutput('artifact_name', mostRecentArtifact.name);
+                  core.setOutput('run_id', mostRecentArtifact.run_id.toString());
+                  core.setOutput('source_type', 'manual');
+                  return;
+                }
+              }
+
+              console.log('No baseline artifacts found in any recent runs');
               core.setOutput('found', 'false');
             } catch (error) {
               console.error(`Error searching for baseline artifacts: ${error.message}`);
@@ -156,6 +234,7 @@ jobs:
         shell: bash
         env:
           RELEASE_TAG: ${{ steps.find_baseline.outputs.release_tag }}
+          SOURCE_TYPE: ${{ steps.find_baseline.outputs.source_type }}
         run: |
           set -euo pipefail
           if [[ -f "baseline-artifact/baseline_results.txt" ]]; then
@@ -164,6 +243,10 @@ jobs:
             cp "baseline-artifact/baseline_results.txt" "benches/baseline_results.txt"
             echo "BASELINE_EXISTS=true" >> "$GITHUB_ENV"
             echo "BASELINE_SOURCE=artifact" >> "$GITHUB_ENV"
+            echo "BASELINE_ORIGIN=${SOURCE_TYPE:-unknown}" >> "$GITHUB_ENV"
+            if [[ -n "${RELEASE_TAG:-}" ]]; then
+              echo "BASELINE_TAG=${RELEASE_TAG}" >> "$GITHUB_ENV"
+            fi
 
             # Show baseline metadata
             echo "=== Baseline Information (from artifact) ==="
@@ -174,29 +257,15 @@ jobs:
             echo "BASELINE_SOURCE=missing" >> "$GITHUB_ENV"
           fi
 
-      - name: Generate fallback baseline if none found
+      - name: Set baseline status if none found
         if: steps.find_baseline.outputs.found != 'true'
         shell: bash
         run: |
           set -euo pipefail
-          echo "📈 No baseline artifact found, generating fallback baseline..."
-          echo "   This baseline will be used for this comparison only."
-          echo ""
-
-          # Generate baseline using dev mode for faster execution
-          if uv run benchmark-utils generate-baseline --dev \
-            || uv run python -m scripts.benchmark_utils generate-baseline --dev; then
-            echo "BASELINE_EXISTS=true" >> "$GITHUB_ENV"
-            echo "BASELINE_SOURCE=generated" >> "$GITHUB_ENV"
-
-            echo "✅ Generated fallback baseline successfully"
-            echo "=== Generated Baseline Information ==="
-            head -n 3 benches/baseline_results.txt
-          else
-            echo "❌ Failed to generate fallback baseline"
-            echo "BASELINE_EXISTS=false" >> "$GITHUB_ENV"
-            echo "BASELINE_SOURCE=failed" >> "$GITHUB_ENV"
-          fi
+          echo "📈 No baseline artifact found for performance comparison"
+          echo "BASELINE_EXISTS=false" >> "$GITHUB_ENV"
+          echo "BASELINE_SOURCE=none" >> "$GITHUB_ENV"
+          echo "BASELINE_ORIGIN=none" >> "$GITHUB_ENV"
 
       - name: Extract baseline commit SHA
         if: env.BASELINE_EXISTS == 'true'
@@ -254,18 +323,14 @@ jobs:
         run: |
           set -euo pipefail
           echo "⚠️ No performance baseline available for comparison."
-          if [[ "${BASELINE_SOURCE:-}" == "failed" ]]; then
-            echo "   - Failed to generate fallback baseline"
-            echo "   - Check the baseline generation workflow logs for issues"
-          else
-            echo "   - No baseline artifacts found in recent releases"
-            echo "   - Create a new release tag to generate a baseline"
-          fi
+          echo "   - No baseline artifacts found in recent workflow runs"
+          echo "   - Performance regression testing requires a baseline"
           echo ""
-          echo "💡 To resolve:"
-          echo "   1. Create a new release tag (e.g., v0.4.2)"
-          echo "   2. The baseline generation workflow will run automatically"
+          echo "💡 To enable performance regression testing:"
+          echo "   1. Create a release tag (e.g., v0.4.3), or"
+          echo "   2. Manually trigger the 'Generate Performance Baseline' workflow"
           echo "   3. Future PRs and pushes will use that baseline for comparison"
+          echo "   4. Baselines use full benchmark settings for accurate comparisons"
 
       - name: Run performance regression test
         if: env.BASELINE_EXISTS == 'true' && env.SKIP_BENCHMARKS == 'false'
@@ -274,17 +339,18 @@ jobs:
           set -euo pipefail
           echo "🚀 Running performance regression test..."
           echo "   Baseline source: ${BASELINE_SOURCE:-unknown}"
+          echo "   Baseline origin: ${BASELINE_ORIGIN:-unknown}"
 
           # This will exit with code 1 if significant regressions are found
-          # Use --dev mode when comparing against a generated baseline to match settings
-          if [[ "${BASELINE_SOURCE:-}" == "generated" ]]; then
-            echo "   Using --dev mode to match generated baseline settings"
-            uv run benchmark-utils compare --baseline benches/baseline_results.txt --dev \
-              || uv run python -m scripts.benchmark_utils compare --baseline benches/baseline_results.txt --dev
+          echo "   Using full comparison mode against ${BASELINE_ORIGIN:-unknown} baseline"
+          if uv run benchmark-utils --help >/dev/null 2>&1; then
+            uv run benchmark-utils compare --baseline benches/baseline_results.txt
+          elif uv run python -c "import importlib; importlib.import_module('scripts.benchmark_utils')" \
+               >/dev/null 2>&1; then
+            uv run python -m scripts.benchmark_utils compare --baseline benches/baseline_results.txt
           else
-            echo "   Using full comparison mode against release baseline"
-            uv run benchmark-utils compare --baseline benches/baseline_results.txt \
-              || uv run python -m scripts.benchmark_utils compare --baseline benches/baseline_results.txt
+            echo "❌ benchmark-utils entrypoint and module not found" >&2
+            exit 2
           fi
 
       - name: Display regression test results
@@ -316,6 +382,8 @@ jobs:
           echo "📊 Performance Regression Testing Summary"
           echo "==========================================="
           echo "Baseline source: ${BASELINE_SOURCE:-none}"
+          echo "Baseline origin: ${BASELINE_ORIGIN:-unknown}"
+          echo "Baseline tag: ${BASELINE_TAG:-n/a}"
           echo "Baseline exists: ${BASELINE_EXISTS:-false}"
           echo "Skip benchmarks: ${SKIP_BENCHMARKS:-unknown}"
           echo "Skip reason: ${SKIP_REASON:-n/a}"
diff --git a/.github/workflows/generate-baseline.yml b/.github/workflows/generate-baseline.yml
@@ -141,7 +141,7 @@ jobs:
         with:
           name: ${{ steps.safe_name.outputs.artifact_name }}
           path: baseline-artifact/
-          retention-days: 365  # Keep baselines for 1 year
+          retention-days: 90   # Keep baselines ~90 days (align with repo settings; adjust if needed)
           compression-level: 6  # Good balance of speed/compression
 
       - name: Display next steps