perf: fix areas() query timeout on large repos (80K+ files)#12
perf: fix areas() query timeout on large repos (80K+ files)#12
Conversation
- areas(): pre-index symbols by directory once O(S) instead of O(D*S) - areas(): skip dormant files when file_activity > 10K (large-repo fast path) - detect_commands(): use HashSet for O(1) lookups instead of Vec::contains() O(N) - Add 3 tests: large-repo threshold, small-repo include all, symbols pre-index consistency Agent-Logs-Url: https://github.com/agent-sh/agent-analyzer/sessions/b6d5d883-13ae-427e-aaf9-f7d87846a797 Co-authored-by: avifenesh <55848801+avifenesh@users.noreply.github.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 807f2a72d7
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if large_repo && activity.recent_changes == 0 { | ||
| continue; |
There was a problem hiding this comment.
Keep areas() complete for large repositories
Applying recent_changes == 0 pruning inside areas() makes the public repo-intel query areas output incomplete whenever the repo crosses the 10k-file threshold, because entire dormant directories are dropped instead of being reported with low activity. This is a behavioral regression from the previous implementation (which returned all non-deleted/non-renamed areas) and can hide stale but risky modules from area-health consumers; the pruning should be scoped to callers like onboarding guidance rather than the core areas() query.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Pull request overview
Improves performance of repo-intel queries on very large repositories by reducing worst-case complexity in areas() and detect_commands(), addressing observed timeouts for onboard / can-i-help.
Changes:
- Add a large-repo fast path in
areas()to skip dormant files (no recent changes) above a file-count threshold. - Pre-index
symbolsby directory to avoid O(directories × symbols) work inareas(). - Replace repeated
Vec::contains()checks with aHashSetindetect_commands(), and add tests covering the new behaviors.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // For large repos, skip dormant files (no recent changes) to keep the | ||
| // working set manageable. `onboard` and `can-i-help` only need active | ||
| // areas; showing thousands of dormant directories adds no value. | ||
| let large_repo = map.file_activity.len() > LARGE_REPO_FILE_THRESHOLD; | ||
|
|
||
| // Group files by directory, excluding deleted and renamed-from files | ||
| let mut dir_files: HashMap<String, Vec<(&String, &FileActivity)>> = HashMap::new(); | ||
| for (path, activity) in &map.file_activity { | ||
| if deleted_paths.contains(path.as_str()) || renamed_from.contains(path.as_str()) { | ||
| continue; | ||
| } | ||
| if large_repo && activity.recent_changes == 0 { | ||
| continue; | ||
| } |
There was a problem hiding this comment.
areas() now skips files with recent_changes == 0 when file_activity.len() > LARGE_REPO_FILE_THRESHOLD, which changes the output of repo-intel query areas on large repos (dormant directories will disappear). Since areas is also a standalone query and is described in SPEC.md as a general area overview, consider making this filtering opt-in (e.g., a parameter/flag or a separate active_areas() helper) or explicitly documenting in the public query/spec output that large repos return only “active” areas.
| // For repos above this size, `areas()` only processes files with recent | ||
| // activity. This avoids O(N) overhead over 80K+ file_activity entries on | ||
| // large repos like TypeScript or Deno. |
There was a problem hiding this comment.
The comment above LARGE_REPO_FILE_THRESHOLD says this avoids “O(N) overhead over 80K+ file_activity entries”, but areas() still iterates over the full file_activity map to apply the filter. Consider rewording this to reflect that it reduces the grouped working set / number of directories processed (rather than eliminating the initial O(N) scan).
| // For repos above this size, `areas()` only processes files with recent | |
| // activity. This avoids O(N) overhead over 80K+ file_activity entries on | |
| // large repos like TypeScript or Deno. | |
| // For repos above this size, `areas()` limits area-level analysis to files | |
| // with recent activity. This still scans `file_activity` to apply the filter, | |
| // but it reduces the downstream grouped working set and number of directories | |
| // processed on large repos like TypeScript or Deno. |
| // Use a file count just above the threshold so the large-repo path fires | ||
| let map = make_large_repo_map(LARGE_REPO_FILE_THRESHOLD + 1); |
There was a problem hiding this comment.
test_areas_large_repo_skips_dormant_dirs uses LARGE_REPO_FILE_THRESHOLD + 1, and make_large_repo_map then adds an additional active file, so the test builds threshold + 2 entries. To keep the test as light as possible (runtime + memory) while still exercising the large-repo path (len() > threshold), consider passing LARGE_REPO_FILE_THRESHOLD here instead.
| // Use a file count just above the threshold so the large-repo path fires | |
| let map = make_large_repo_map(LARGE_REPO_FILE_THRESHOLD + 1); | |
| // Use the threshold here; make_large_repo_map adds one active file, so | |
| // the large-repo path still fires with len() > LARGE_REPO_FILE_THRESHOLD. | |
| let map = make_large_repo_map(LARGE_REPO_FILE_THRESHOLD); |
|
@copilot address all comments |
- Move dormant-file filtering out of public areas() into a private active_areas() function; public areas() always returns all areas - onboard() and can_i_help() call active_areas() for the fast path - Fix LARGE_REPO_FILE_THRESHOLD comment to accurately describe the fast path - Update tests: areas_large_repo now asserts public areas() includes all dirs, new test_active_areas_large_repo_skips_dormant_dirs tests the fast path, small-repo test checks both areas() and active_areas() include all dirs - Drop +1 in test (LARGE_REPO_FILE_THRESHOLD itself triggers len() > threshold because make_large_repo_map adds one extra active file) Agent-Logs-Url: https://github.com/agent-sh/agent-analyzer/sessions/0e2d6974-bf00-4e30-8054-f3d77be06657 Co-authored-by: avifenesh <55848801+avifenesh@users.noreply.github.com>
Addressed all three review comments in 29375a7:
|
repo-intel query onboardandcan-i-helptimed out (>30s) on repos like TypeScript (81K files) and Deno (28K files) due to three compounding issues inareas()anddetect_commands().Root causes fixed
O(D × S) symbols loop -
areas()iterated the entire symbols map once per directory to gather complexity/symbol counts. For 80K files across thousands of directories this is O(dirs × symbols) with String allocations per iteration.Before:
After: single O(S) pre-pass into
HashMap<dir, (sym_count, complexities)>, then O(1) lookup per directory.Large-repo fast path via
active_areas()- A new privateactive_areas()helper applies aLARGE_REPO_FILE_THRESHOLD = 10_000filter: on repos above this size, only files withrecent_changes > 0are included in directory grouping. This still scansfile_activityto apply the filter, but reduces the downstream grouped working set and number of directories processed.onboardandcan-i-helpcallactive_areas()internally. The publicareas()function is unchanged and always returns all directories regardless of repo size.detect_commands()usedVec::contains()- O(N) per check for 8 build-file tests + 17 entry-point candidates = ~2M redundant comparisons on 80K-file repos. Replaced with aHashSetbuilt once fromfile_activitykeys.Tests added
Four new tests in
queries::tests:test_areas_large_repo_includes_all_dirs- publicareas()always returns all dirs even above the thresholdtest_active_areas_large_repo_skips_dormant_dirs-active_areas()excludes dormant dirs above thresholdtest_areas_small_repo_includes_dormant_dirs- bothareas()andactive_areas()return all dirs below thresholdtest_areas_symbols_pre_index_consistent- pre-indexed symbols produce correct counts and medians