
fix: prevent crash when KnnSearch and Tune run concurrently#1764

Open
Copilot wants to merge 2 commits into main from copilot/fix-crash-knnsearch-tune

Conversation

Contributor

Copilot AI commented Mar 27, 2026

Change Type

  • Bug fix
  • New feature
  • Improvement/Refactor
  • Documentation
  • CI/Build/Infra

Linked Issue

  • Issue: #crash-knnsearch-tune-concurrent

What Changed

Tune() mutates basic_flatten_codes_, high_precise_codes_, and use_reorder_ under add_mutex_, while search methods hold global_mutex_. Different mutexes → no synchronization → use-after-free in SIMD distance computation when Tune destroys the old flatten codes mid-search.

  • Tune(): Defer member mutations to local variables (new_use_reorder, drop_precise_codes). Build new flatten codes fully, then atomically swap all shared state under exclusive global_mutex_.
  • RangeSearch(): Add missing shared_lock(global_mutex_) — was present in KnnSearch/SearchWithRequest but absent here.

Lock order add_mutex_ → global_mutex_ is preserved (consistent with add_one_point).

Test Evidence

  • make fmt
  • make lint
  • make test
  • make cov, run tests, and collect coverage
  • Other (describe below)

Test details:

# algorithm target (includes hgraph.cpp) builds cleanly
cmake --build build/ --target algorithm -j$(nproc)
# [100%] Built target algorithm
# Pre-existing build error in unrelated window_result_queue.cpp prevents full build

Compatibility Impact

  • API/ABI compatibility: none
  • Behavior changes: Tune() now blocks concurrent searches only during the pointer swap (not during training/rebuild). RangeSearch() now correctly synchronizes with Tune().

Performance and Concurrency Impact

  • Performance impact: none — lock scope on global_mutex_ in Tune is minimal (pointer swaps only; training happens outside the lock)
  • Concurrency/thread-safety impact: Fixes data race between Tune and all search methods. RangeSearch gains the same global_mutex_ protection other search methods already had.

Documentation Impact

  • No docs update needed
  • Updated docs:
    • README.md
    • DEVELOPMENT.md
    • CONTRIBUTING.md
    • Other:

Risk and Rollback

  • Risk level: low
  • Rollback plan: revert single commit

Checklist

  • I have linked the relevant issue (or explained why not applicable)
  • I have added/updated tests for new behavior or bug fixes
  • I have considered API compatibility impact
  • I have updated docs if behavior/workflow changed
  • My commit messages follow project conventions (Conventional Commits, optional [skip ci] prefix)


The crash occurred because Tune() modified basic_flatten_codes_,
high_precise_codes_, and use_reorder_ under add_mutex_ while search
methods (KnnSearch, SearchWithRequest) held shared global_mutex_ —
these are different mutexes providing no synchronization.

Changes:
- In Tune(), defer member variable mutations using local variables,
  build new flatten codes first, then atomically swap all shared state
  under exclusive global_mutex_ lock
- In RangeSearch(), add missing shared_lock(global_mutex_) for
  consistency with KnnSearch and SearchWithRequest

Agent-Logs-Url: https://github.com/antgroup/vsag/sessions/ff0a358d-0694-4cb2-a40e-daf1830bd6f4

Co-authored-by: wxyucs <12595343+wxyucs@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix crash when KnnSearch and Tune concurrently" to "fix: prevent crash when KnnSearch and Tune run concurrently" on Mar 27, 2026
Copilot AI requested a review from wxyucs March 27, 2026 07:04
@wxyucs wxyucs marked this pull request as ready for review March 27, 2026 08:16
@wxyucs wxyucs requested a review from LHT129 as a code owner March 27, 2026 08:16
Copilot AI review requested due to automatic review settings March 27, 2026 08:16
Contributor

Copilot AI left a comment


Pull request overview

Fixes a concurrency crash in VSAG’s HGraph index when Tune() runs concurrently with search operations by ensuring shared search state is swapped under the same mutex that protects readers.

Changes:

  • Update HGraph::Tune() to rebuild flatten codes off-lock and then atomically swap basic_flatten_codes_, high_precise_codes_, and use_reorder_ under an exclusive global_mutex_.
  • Add the missing shared_lock(global_mutex_) in HGraph::RangeSearch() to synchronize with Tune().

Comment on lines +620 to +633
// Acquire exclusive global lock to atomically swap flatten codes,
// preventing concurrent searches from accessing partially updated state.
{
    std::scoped_lock<std::shared_mutex> wlock(this->global_mutex_);
    basic_flatten_codes_ = new_basic;
    if (drop_precise_codes) {
        high_precise_codes_.reset();
    } else {
        high_precise_codes_ = new_precise;
    }
    use_reorder_ = new_use_reorder;

    // set status
    if (disable_future_tuning) {
        this->index_feature_list_->SetFeature(IndexFeature::SUPPORT_TUNE, false);
        this->raw_vector_.reset();
        has_raw_vector_ = false;
        create_new_raw_vector_ = false;
        check_and_init_raw_vector(param->raw_vector_param, common_param, false);
        init_resize_bit_and_reorder();

Copilot AI Mar 27, 2026


Tune now swaps basic_flatten_codes_/high_precise_codes_ under global_mutex_, but other public read paths (e.g., CalcDistanceById/CalDistanceById) access these members without taking global_mutex_. That means concurrent Tune() can still race with those APIs and potentially trigger the same use-after-free. Consider taking a std::shared_lock on global_mutex_ in those distance-by-id methods (and any other readers of these pointers), or otherwise ensuring the swapped state is read atomically everywhere.

Comment on lines +616 to +629
auto new_basic = tune_and_rebuild(is_tune_base_code, basic_flatten_codes_, new_basic_code);
auto new_precise =
    tune_and_rebuild(is_tune_precise_code, high_precise_codes_, new_precise_code);

check_and_init_raw_vector(param->raw_vector_param, common_param, false);
init_resize_bit_and_reorder();
// Acquire exclusive global lock to atomically swap flatten codes,
// preventing concurrent searches from accessing partially updated state.
{
    std::scoped_lock<std::shared_mutex> wlock(this->global_mutex_);
    basic_flatten_codes_ = new_basic;
    if (drop_precise_codes) {
        high_precise_codes_.reset();
    } else {
        high_precise_codes_ = new_precise;
    }

Copilot AI Mar 27, 2026


This PR fixes a specific concurrent crash scenario (Tune() vs searches) but doesn't add a regression test that runs Tune() concurrently with KnnSearch()/RangeSearch() to ensure the race stays fixed. There are already concurrent tests for read/write in tests/test_hgraph.cpp, so adding a focused Tune-vs-search stress/regression test would help prevent future reintroductions.

Copilot generated this review using guidance from repository custom instructions.

Labels

kind/bug (Something isn't working), size/M, version/1.0

3 participants