
perf: fix areas() query timeout on large repos (80K+ files)#12

Merged
avifenesh merged 3 commits into main from copilot/fix-query-timeout-large-repos
Apr 8, 2026

Conversation

Contributor

Copilot AI commented Apr 3, 2026

repo-intel query onboard and can-i-help timed out (>30s) on repos like TypeScript (81K files) and Deno (28K files) due to three compounding issues in areas() and detect_commands().

Root causes fixed

  • O(D × S) symbols loop - areas() iterated the entire symbols map once per directory to gather complexity/symbol counts. For 80K files across thousands of directories this is O(dirs × symbols) with String allocations per iteration.

    Before:

    for (file_path, file_syms) in symbols {       // iterates all symbols...
        if file_dir(file_path) == area {           // ...for every directory

    After: single O(S) pre-pass into HashMap<dir, (sym_count, complexities)>, then O(1) lookup per directory.

  • Large-repo fast path via active_areas() - A new private active_areas() helper applies a LARGE_REPO_FILE_THRESHOLD = 10_000 filter: on repos above this size, only files with recent_changes > 0 are included in directory grouping. This still scans file_activity to apply the filter, but reduces the downstream grouped working set and number of directories processed. onboard and can-i-help call active_areas() internally. The public areas() function is unchanged and always returns all directories regardless of repo size.

  • detect_commands() used Vec::contains() - O(N) per check for 8 build-file tests + 17 entry-point candidates = ~2M redundant comparisons on 80K-file repos. Replaced with a HashSet built once from file_activity keys.
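The O(S) pre-pass described in the first bullet can be sketched as follows. This is a minimal illustration, not the PR's actual code: `pre_index` is a hypothetical name, and the real symbol types and `areas()` signature in `queries` differ.

```rust
use std::collections::HashMap;

/// Hypothetical sketch of the single O(S) pre-pass: one walk over the symbols
/// map builds `dir -> (symbol_count, complexities)`, so each directory later
/// gets its counts via an O(1) lookup instead of re-scanning every symbol.
fn pre_index(symbols: &HashMap<String, Vec<u32>>) -> HashMap<String, (usize, Vec<u32>)> {
    let mut by_dir: HashMap<String, (usize, Vec<u32>)> = HashMap::new();
    for (path, complexities) in symbols {
        // Derive the containing directory; files at the repo root map to ".".
        let dir = path.rsplit_once('/').map_or(".", |(d, _)| d);
        let entry = by_dir.entry(dir.to_string()).or_default();
        entry.0 += complexities.len();
        entry.1.extend_from_slice(complexities);
    }
    by_dir
}
```

With this index in hand, the per-directory loop only pays one hash lookup per directory, turning the overall cost from O(dirs × symbols) into O(symbols + dirs).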
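The `Vec::contains()` → `HashSet` change in the last bullet follows a standard pattern; a minimal sketch (with a hypothetical `detect_build_files` helper standing in for the real `detect_commands()` internals):

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the membership checks inside `detect_commands()`:
/// build a HashSet from the file list once, then test each candidate in O(1)
/// instead of paying O(N) per `Vec::contains` call.
fn detect_build_files(paths: &[String], candidates: &[&str]) -> Vec<String> {
    let index: HashSet<&str> = paths.iter().map(String::as_str).collect();
    candidates
        .iter()
        .filter(|c| index.contains(**c)) // O(1) hash lookup per candidate
        .map(|c| c.to_string())
        .collect()
}
```

Building the set is a one-time O(N) cost; the ~25 candidate checks then cost O(1) each, instead of ~25 full scans of an 80K-entry vector.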

Tests added

Four new tests in queries::tests:

  • test_areas_large_repo_includes_all_dirs - public areas() always returns all dirs even above the threshold
  • test_active_areas_large_repo_skips_dormant_dirs - active_areas() excludes dormant dirs above threshold
  • test_areas_small_repo_includes_dormant_dirs - both areas() and active_areas() return all dirs below threshold
  • test_areas_symbols_pre_index_consistent - pre-indexed symbols produce correct counts and medians

Copilot AI linked an issue Apr 3, 2026 that may be closed by this pull request
- areas(): pre-index symbols by directory once O(S) instead of O(D*S)
- areas(): skip dormant files when file_activity > 10K (large-repo fast path)
- detect_commands(): use HashSet for O(1) lookups instead of Vec::contains() O(N)
- Add 3 tests: large-repo threshold, small-repo include all, symbols pre-index consistency

Agent-Logs-Url: https://github.com/agent-sh/agent-analyzer/sessions/b6d5d883-13ae-427e-aaf9-f7d87846a797

Co-authored-by: avifenesh <55848801+avifenesh@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix query timeout on large repos for areas function" to "perf: fix areas() query timeout on large repos (80K+ files)" Apr 3, 2026
Copilot AI requested a review from avifenesh April 3, 2026 15:38
@avifenesh avifenesh marked this pull request as ready for review April 3, 2026 15:43

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 807f2a72d7


Comment on lines +517 to +518
if large_repo && activity.recent_changes == 0 {
    continue;


P1: Keep areas() complete for large repositories

Applying recent_changes == 0 pruning inside areas() makes the public repo-intel query areas output incomplete whenever the repo crosses the 10k-file threshold, because entire dormant directories are dropped instead of being reported with low activity. This is a behavioral regression from the previous implementation (which returned all non-deleted/non-renamed areas) and can hide stale but risky modules from area-health consumers; the pruning should be scoped to callers like onboarding guidance rather than the core areas() query.



Copilot AI left a comment


Pull request overview

Improves performance of repo-intel queries on very large repositories by reducing worst-case complexity in areas() and detect_commands(), addressing observed timeouts for onboard / can-i-help.

Changes:

  • Add a large-repo fast path in areas() to skip dormant files (no recent changes) above a file-count threshold.
  • Pre-index symbols by directory to avoid O(directories × symbols) work in areas().
  • Replace repeated Vec::contains() checks with a HashSet in detect_commands(), and add tests covering the new behaviors.


Comment on lines +506 to +519
// For large repos, skip dormant files (no recent changes) to keep the
// working set manageable. `onboard` and `can-i-help` only need active
// areas; showing thousands of dormant directories adds no value.
let large_repo = map.file_activity.len() > LARGE_REPO_FILE_THRESHOLD;

// Group files by directory, excluding deleted and renamed-from files
let mut dir_files: HashMap<String, Vec<(&String, &FileActivity)>> = HashMap::new();
for (path, activity) in &map.file_activity {
    if deleted_paths.contains(path.as_str()) || renamed_from.contains(path.as_str()) {
        continue;
    }
    if large_repo && activity.recent_changes == 0 {
        continue;
    }

Copilot AI Apr 8, 2026


areas() now skips files with recent_changes == 0 when file_activity.len() > LARGE_REPO_FILE_THRESHOLD, which changes the output of repo-intel query areas on large repos (dormant directories will disappear). Since areas is also a standalone query and is described in SPEC.md as a general area overview, consider making this filtering opt-in (e.g., a parameter/flag or a separate active_areas() helper) or explicitly documenting in the public query/spec output that large repos return only “active” areas.

Comment on lines +492 to +494
// For repos above this size, `areas()` only processes files with recent
// activity. This avoids O(N) overhead over 80K+ file_activity entries on
// large repos like TypeScript or Deno.

Copilot AI Apr 8, 2026


The comment above LARGE_REPO_FILE_THRESHOLD says this avoids “O(N) overhead over 80K+ file_activity entries”, but areas() still iterates over the full file_activity map to apply the filter. Consider rewording this to reflect that it reduces the grouped working set / number of directories processed (rather than eliminating the initial O(N) scan).

Suggested change
// For repos above this size, `areas()` only processes files with recent
// activity. This avoids O(N) overhead over 80K+ file_activity entries on
// large repos like TypeScript or Deno.
// For repos above this size, `areas()` limits area-level analysis to files
// with recent activity. This still scans `file_activity` to apply the filter,
// but it reduces the downstream grouped working set and number of directories
// processed on large repos like TypeScript or Deno.

Comment on lines +3191 to +3192
// Use a file count just above the threshold so the large-repo path fires
let map = make_large_repo_map(LARGE_REPO_FILE_THRESHOLD + 1);

Copilot AI Apr 8, 2026


test_areas_large_repo_skips_dormant_dirs uses LARGE_REPO_FILE_THRESHOLD + 1, and make_large_repo_map then adds an additional active file, so the test builds threshold + 2 entries. To keep the test as light as possible (runtime + memory) while still exercising the large-repo path (len() > threshold), consider passing LARGE_REPO_FILE_THRESHOLD here instead.

Suggested change
// Use a file count just above the threshold so the large-repo path fires
let map = make_large_repo_map(LARGE_REPO_FILE_THRESHOLD + 1);
// Use the threshold here; make_large_repo_map adds one active file, so
// the large-repo path still fires with len() > LARGE_REPO_FILE_THRESHOLD.
let map = make_large_repo_map(LARGE_REPO_FILE_THRESHOLD);

@avifenesh
Contributor

@copilot address all comments

- Move dormant-file filtering out of public areas() into a private
  active_areas() function; public areas() always returns all areas
- onboard() and can_i_help() call active_areas() for the fast path
- Fix LARGE_REPO_FILE_THRESHOLD comment to accurately describe the fast path
- Update tests: areas_large_repo now asserts public areas() includes all dirs,
  new test_active_areas_large_repo_skips_dormant_dirs tests the fast path,
  small-repo test checks both areas() and active_areas() include all dirs
- Drop +1 in test (LARGE_REPO_FILE_THRESHOLD itself triggers len() > threshold
  because make_large_repo_map adds one extra active file)

Agent-Logs-Url: https://github.com/agent-sh/agent-analyzer/sessions/0e2d6974-bf00-4e30-8054-f3d77be06657

Co-authored-by: avifenesh <55848801+avifenesh@users.noreply.github.com>
Contributor Author

Copilot AI commented Apr 8, 2026

@copilot address all comments

Addressed all three review comments in 29375a7:

  1. Filtering opt-in: Moved the dormant-file filtering out of the public areas() into a new private active_areas() helper. The public areas() now delegates to areas_impl(map, false) (no filtering). onboard and can_i_help call active_areas(), which delegates to areas_impl(map, true), to get the fast path. Standalone repo-intel query areas is unaffected and returns all directories.

  2. Comment reworded: The LARGE_REPO_FILE_THRESHOLD comment now says "still scans file_activity to apply the filter, but it reduces the downstream grouped working set and number of directories processed" - matching the suggestion.

  3. Test uses exact threshold: Changed LARGE_REPO_FILE_THRESHOLD + 1 to LARGE_REPO_FILE_THRESHOLD since make_large_repo_map(N) inserts N dormant files plus one active file, so len() == N + 1 > N still fires the fast path. Also added a dedicated test_active_areas_large_repo_skips_dormant_dirs test to verify the fast path directly.
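The resulting split can be sketched like this. It is a simplified shape, not the PR's actual code: `file_activity` is modeled here as `(path, recent_changes)` pairs and the return type as bare paths, whereas the real signatures in `queries` differ.

```rust
// Repos with more files than this take the dormant-file fast path.
const LARGE_REPO_FILE_THRESHOLD: usize = 10_000;

// Shared core: optionally drops dormant entries on large repos.
fn areas_impl(file_activity: &[(String, u32)], filter_dormant: bool) -> Vec<String> {
    let large_repo = file_activity.len() > LARGE_REPO_FILE_THRESHOLD;
    file_activity
        .iter()
        .filter(|(_, recent_changes)| !(filter_dormant && large_repo && *recent_changes == 0))
        .map(|(path, _)| path.clone())
        .collect()
}

/// Public query: always complete, regardless of repo size.
fn areas(file_activity: &[(String, u32)]) -> Vec<String> {
    areas_impl(file_activity, false)
}

/// Private fast path used by `onboard` and `can_i_help`.
fn active_areas(file_activity: &[(String, u32)]) -> Vec<String> {
    areas_impl(file_activity, true)
}
```

This keeps the public query's output stable (addressing the Codex P1) while letting the two latency-sensitive callers opt into the reduced working set.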

Copilot AI requested a review from avifenesh April 8, 2026 13:09
@avifenesh avifenesh merged commit 337700f into main Apr 8, 2026
3 checks passed


Development

Successfully merging this pull request may close these issues.

perf: query timeout on large repos (80K+ files)

3 participants