perf: fix areas() query timeout on large repos (80K+ files) by Copilot · Pull Request #12 · agent-sh/agent-analyzer

Copilot · 2026-04-03T15:17:46Z

repo-intel query onboard and can-i-help timed out (>30s) on repos like TypeScript (81K files) and Deno (28K files) due to three compounding issues in areas() and detect_commands().

Root causes fixed

O(D × S) symbols loop - areas() iterated the entire symbols map once per directory to gather complexity/symbol counts. For 80K files across thousands of directories this is O(dirs × symbols) with String allocations per iteration.

Before:
```
for (file_path, file_syms) in symbols {       // iterates all symbols...
    if file_dir(file_path) == area {           // ...for every directory
```
After: single O(S) pre-pass into HashMap<dir, (sym_count, complexities)>, then O(1) lookup per directory.

Large-repo fast path via active_areas() - A new private active_areas() helper applies a LARGE_REPO_FILE_THRESHOLD = 10_000 filter: on repos above this size, only files with recent_changes > 0 are included in directory grouping. This still scans file_activity to apply the filter, but reduces the downstream grouped working set and number of directories processed. onboard and can-i-help call active_areas() internally. The public areas() function is unchanged and always returns all directories regardless of repo size.

detect_commands() used Vec::contains() - O(N) per check for 8 build-file tests + 17 entry-point candidates = ~2M redundant comparisons on 80K-file repos. Replaced with a HashSet built once from file_activity keys.
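The pre-pass described in the first root cause can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the `Symbol` struct, the `index_symbols_by_dir` and `median` helpers, and this `file_dir` implementation are assumptions made for the example, and the real types in queries.rs are likely richer.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the per-symbol data the analyzer stores.
pub struct Symbol {
    pub complexity: u32,
}

/// Parent directory of a path, or "." for top-level files (assumed behavior).
pub fn file_dir(path: &str) -> &str {
    match path.rfind('/') {
        Some(idx) => &path[..idx],
        None => ".",
    }
}

/// Single O(S) pass: group symbol counts and complexities by directory,
/// so each directory can later be served with one O(1) map lookup.
pub fn index_symbols_by_dir(
    symbols: &HashMap<String, Vec<Symbol>>,
) -> HashMap<String, (usize, Vec<u32>)> {
    let mut by_dir: HashMap<String, (usize, Vec<u32>)> = HashMap::new();
    for (file_path, file_syms) in symbols {
        let entry = by_dir.entry(file_dir(file_path).to_string()).or_default();
        entry.0 += file_syms.len();
        entry.1.extend(file_syms.iter().map(|s| s.complexity));
    }
    by_dir
}

/// Median of a slice (upper median for even lengths); None when empty.
pub fn median(values: &[u32]) -> Option<u32> {
    if values.is_empty() {
        return None;
    }
    let mut sorted = values.to_vec();
    sorted.sort_unstable();
    Some(sorted[sorted.len() / 2])
}
```

With an index like this, the per-directory loop only calls `by_dir.get(area)` instead of rescanning every symbol, which is where the O(D × S) cost came from.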

Tests added

Four new tests in queries::tests:

test_areas_large_repo_includes_all_dirs - public areas() always returns all dirs even above the threshold
test_active_areas_large_repo_skips_dormant_dirs - active_areas() excludes dormant dirs above threshold
test_areas_small_repo_includes_dormant_dirs - both areas() and active_areas() return all dirs below threshold
test_areas_symbols_pre_index_consistent - pre-indexed symbols produce correct counts and medians

- areas(): pre-index symbols by directory once O(S) instead of O(D*S)
- areas(): skip dormant files when file_activity > 10K (large-repo fast path)
- detect_commands(): use HashSet for O(1) lookups instead of Vec::contains() O(N)
- Add 3 tests: large-repo threshold, small-repo include all, symbols pre-index consistency

Agent-Logs-Url: https://github.com/agent-sh/agent-analyzer/sessions/b6d5d883-13ae-427e-aaf9-f7d87846a797
Co-authored-by: avifenesh <55848801+avifenesh@users.noreply.github.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 807f2a72d7

chatgpt-codex-connector · 2026-04-03T15:47:45Z

crates/analyzer-git-map/src/queries.rs

+        if large_repo && activity.recent_changes == 0 {
+            continue;


Keep areas() complete for large repositories

Applying recent_changes == 0 pruning inside areas() makes the public repo-intel query areas output incomplete whenever the repo crosses the 10k-file threshold, because entire dormant directories are dropped instead of being reported with low activity. This is a behavioral regression from the previous implementation (which returned all non-deleted/non-renamed areas) and can hide stale but risky modules from area-health consumers; the pruning should be scoped to callers like onboarding guidance rather than the core areas() query.

Copilot

Pull request overview

Improves performance of repo-intel queries on very large repositories by reducing worst-case complexity in areas() and detect_commands(), addressing observed timeouts for onboard / can-i-help.

Changes:

Add a large-repo fast path in areas() to skip dormant files (no recent changes) above a file-count threshold.
Pre-index symbols by directory to avoid O(directories × symbols) work in areas().
Replace repeated Vec::contains() checks with a HashSet in detect_commands(), and add tests covering the new behaviors.
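As a rough illustration of the HashSet change in detect_commands(), the sketch below builds the set once from the file_activity keys and then probes it per candidate. The `BUILD_FILES` table, the command strings, and the `detect_build_commands` name are hypothetical; the actual candidate lists in the PR are not shown in this thread.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical build-file table; the real list in detect_commands() may differ.
const BUILD_FILES: &[(&str, &str)] = &[
    ("Cargo.toml", "cargo build"),
    ("package.json", "npm install"),
    ("Makefile", "make"),
];

/// Returns the build commands whose marker file is tracked in the repo.
/// Generic over the value type so the sketch does not depend on FileActivity.
pub fn detect_build_commands<V>(file_activity: &HashMap<String, V>) -> Vec<&'static str> {
    // Build the set once: O(N). Each probe below is then O(1) on average,
    // instead of an O(N) Vec::contains() scan per candidate.
    let files: HashSet<&str> = file_activity.keys().map(String::as_str).collect();
    BUILD_FILES
        .iter()
        .filter(|(marker, _)| files.contains(*marker))
        .map(|(_, command)| *command)
        .collect()
}
```

The set borrows the keys as `&str`, so building it does not clone any path strings.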

Copilot · 2026-04-08T12:35:36Z

crates/analyzer-git-map/src/queries.rs

+    // For large repos, skip dormant files (no recent changes) to keep the
+    // working set manageable.  `onboard` and `can-i-help` only need active
+    // areas; showing thousands of dormant directories adds no value.
+    let large_repo = map.file_activity.len() > LARGE_REPO_FILE_THRESHOLD;
+
    // Group files by directory, excluding deleted and renamed-from files
    let mut dir_files: HashMap<String, Vec<(&String, &FileActivity)>> = HashMap::new();
    for (path, activity) in &map.file_activity {
        if deleted_paths.contains(path.as_str()) || renamed_from.contains(path.as_str()) {
            continue;
        }
+        if large_repo && activity.recent_changes == 0 {
+            continue;
+        }


areas() now skips files with recent_changes == 0 when file_activity.len() > LARGE_REPO_FILE_THRESHOLD, which changes the output of repo-intel query areas on large repos (dormant directories will disappear). Since areas is also a standalone query and is described in SPEC.md as a general area overview, consider making this filtering opt-in (e.g., a parameter/flag or a separate active_areas() helper) or explicitly documenting in the public query/spec output that large repos return only “active” areas.

Copilot · 2026-04-08T12:35:37Z

crates/analyzer-git-map/src/queries.rs

+// For repos above this size, `areas()` only processes files with recent
+// activity.  This avoids O(N) overhead over 80K+ file_activity entries on
+// large repos like TypeScript or Deno.


The comment above LARGE_REPO_FILE_THRESHOLD says this avoids “O(N) overhead over 80K+ file_activity entries”, but areas() still iterates over the full file_activity map to apply the filter. Consider rewording this to reflect that it reduces the grouped working set / number of directories processed (rather than eliminating the initial O(N) scan).

Suggested change

-// For repos above this size, `areas()` only processes files with recent
-// activity.  This avoids O(N) overhead over 80K+ file_activity entries on
-// large repos like TypeScript or Deno.
+// For repos above this size, `areas()` limits area-level analysis to files
+// with recent activity. This still scans `file_activity` to apply the filter,
+// but it reduces the downstream grouped working set and number of directories
+// processed on large repos like TypeScript or Deno.

Copilot · 2026-04-08T12:35:37Z

crates/analyzer-git-map/src/queries.rs

+        // Use a file count just above the threshold so the large-repo path fires
+        let map = make_large_repo_map(LARGE_REPO_FILE_THRESHOLD + 1);


test_areas_large_repo_skips_dormant_dirs uses LARGE_REPO_FILE_THRESHOLD + 1, and make_large_repo_map then adds an additional active file, so the test builds threshold + 2 entries. To keep the test as light as possible (runtime + memory) while still exercising the large-repo path (len() > threshold), consider passing LARGE_REPO_FILE_THRESHOLD here instead.

Suggested change

-        // Use a file count just above the threshold so the large-repo path fires
-        let map = make_large_repo_map(LARGE_REPO_FILE_THRESHOLD + 1);
+        // Use the threshold here; make_large_repo_map adds one active file, so
+        // the large-repo path still fires with len() > LARGE_REPO_FILE_THRESHOLD.
+        let map = make_large_repo_map(LARGE_REPO_FILE_THRESHOLD);

avifenesh · 2026-04-08T12:45:58Z

@copilot address all comments

- Move dormant-file filtering out of public areas() into a private active_areas() function; public areas() always returns all areas
- onboard() and can_i_help() call active_areas() for the fast path
- Fix LARGE_REPO_FILE_THRESHOLD comment to accurately describe the fast path
- Update tests: areas_large_repo now asserts public areas() includes all dirs, new test_active_areas_large_repo_skips_dormant_dirs tests the fast path, small-repo test checks both areas() and active_areas() include all dirs
- Drop +1 in test (LARGE_REPO_FILE_THRESHOLD itself triggers len() > threshold because make_large_repo_map adds one extra active file)

Agent-Logs-Url: https://github.com/agent-sh/agent-analyzer/sessions/0e2d6974-bf00-4e30-8054-f3d77be06657
Co-authored-by: avifenesh <55848801+avifenesh@users.noreply.github.com>

Copilot · 2026-04-08T13:04:20Z

@copilot address all comments

Addressed all three review comments in 29375a7:

Filtering opt-in: Moved the dormant-file filtering out of the public areas() into a new private active_areas() helper. The public areas() now delegates to areas_impl(map, false) (no filtering). onboard and can_i_help call active_areas() → areas_impl(map, true) to get the fast path. Standalone repo-intel query areas is unaffected and returns all directories.
Comment reworded: The LARGE_REPO_FILE_THRESHOLD comment now says "still scans file_activity to apply the filter, but it reduces the downstream grouped working set and number of directories processed" - matching the suggestion.
Test uses exact threshold: Changed LARGE_REPO_FILE_THRESHOLD + 1 to LARGE_REPO_FILE_THRESHOLD since make_large_repo_map(N) inserts N dormant files plus one active file, so len() == N + 1 > N still fires the fast path. Also added a dedicated test_active_areas_large_repo_skips_dormant_dirs test to verify the fast path directly.
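The delegation and the fixture arithmetic described above might look roughly like this. It is a simplified sketch under stated assumptions: `RepoMap`, the one-field `FileActivity`, and returning plain directory names are stand-ins invented for the example (the real areas() presumably returns richer per-area records), and active_areas() is private in the PR but public here so it is easy to exercise.

```rust
use std::collections::{BTreeSet, HashMap};

/// Simplified stand-in for the analyzer's per-file record.
#[derive(Default)]
pub struct FileActivity {
    pub recent_changes: u32,
}

/// Simplified stand-in for the repo map.
#[derive(Default)]
pub struct RepoMap {
    pub file_activity: HashMap<String, FileActivity>,
}

pub const LARGE_REPO_FILE_THRESHOLD: usize = 10_000;

fn file_dir(path: &str) -> &str {
    path.rfind('/').map_or(".", |idx| &path[..idx])
}

/// Shared implementation; `active_only` opts in to the large-repo fast path.
fn areas_impl(map: &RepoMap, active_only: bool) -> BTreeSet<String> {
    let large_repo = active_only && map.file_activity.len() > LARGE_REPO_FILE_THRESHOLD;
    map.file_activity
        .iter()
        // Still scans every entry; only the grouped working set shrinks.
        .filter(|(_, activity)| !(large_repo && activity.recent_changes == 0))
        .map(|(path, _)| file_dir(path).to_string())
        .collect()
}

/// Public query: always reports every directory.
pub fn areas(map: &RepoMap) -> BTreeSet<String> {
    areas_impl(map, false)
}

/// Helper for onboard / can-i-help: drops dormant files on large repos.
pub fn active_areas(map: &RepoMap) -> BTreeSet<String> {
    areas_impl(map, true)
}

/// Fixture: `dormant` dormant files plus one active file, so the map holds
/// `dormant + 1` entries and passing the threshold itself exceeds it.
pub fn make_large_repo_map(dormant: usize) -> RepoMap {
    let mut map = RepoMap::default();
    for i in 0..dormant {
        map.file_activity
            .insert(format!("dormant/f{i}.rs"), FileActivity { recent_changes: 0 });
    }
    map.file_activity
        .insert("active/main.rs".to_string(), FileActivity { recent_changes: 3 });
    map
}
```

Keeping the filter behind a boolean on a shared implementation means the standalone areas query and the fast path cannot drift apart in how they group files.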

Initial plan

0172ac1

Copilot AI assigned Copilot and avifenesh Apr 3, 2026

Copilot AI linked an issue Apr 3, 2026 that may be closed by this pull request

perf: query timeout on large repos (80K+ files) #8

Closed

Copilot started work on behalf of avifenesh April 3, 2026 15:17

Copilot AI changed the title from "[WIP] Fix query timeout on large repos for areas function" to "perf: fix areas() query timeout on large repos (80K+ files)" Apr 3, 2026

Copilot finished work on behalf of avifenesh April 3, 2026 15:38

Copilot AI requested a review from avifenesh April 3, 2026 15:38

avifenesh marked this pull request as ready for review April 3, 2026 15:43

avifenesh approved these changes Apr 3, 2026

chatgpt-codex-connector bot reviewed Apr 3, 2026

avifenesh requested a review from Copilot April 8, 2026 12:30

Copilot started reviewing on behalf of avifenesh April 8, 2026 12:31

Copilot AI reviewed Apr 8, 2026

Copilot started work on behalf of avifenesh April 8, 2026 12:46

Copilot finished work on behalf of avifenesh April 8, 2026 13:09

Copilot AI requested a review from avifenesh April 8, 2026 13:09

avifenesh merged commit 337700f into main Apr 8, 2026
3 checks passed

avifenesh merged 3 commits into main from copilot/fix-query-timeout-large-repos
