fix: prevent crash when KnnSearch and Tune run concurrently#1764
Conversation
The crash occurred because Tune() modified basic_flatten_codes_, high_precise_codes_, and use_reorder_ under add_mutex_, while search methods (KnnSearch, SearchWithRequest) held shared global_mutex_; these are different mutexes, providing no synchronization.

Changes:
- In Tune(), defer member variable mutations using local variables, build the new flatten codes first, then atomically swap all shared state under an exclusive global_mutex_ lock.
- In RangeSearch(), add the missing shared_lock(global_mutex_) for consistency with KnnSearch and SearchWithRequest.

Agent-Logs-Url: https://github.com/antgroup/vsag/sessions/ff0a358d-0694-4cb2-a40e-daf1830bd6f4
Co-authored-by: wxyucs <12595343+wxyucs@users.noreply.github.com>
Pull request overview
Fixes a concurrency crash in VSAG’s HGraph index when Tune() runs concurrently with search operations by ensuring shared search state is swapped under the same mutex that protects readers.
Changes:
- Update `HGraph::Tune()` to rebuild flatten codes off-lock and then atomically swap `basic_flatten_codes_`, `high_precise_codes_`, and `use_reorder_` under an exclusive `global_mutex_`.
- Add the missing `shared_lock(global_mutex_)` in `HGraph::RangeSearch()` to synchronize with `Tune()`.
```cpp
// Acquire exclusive global lock to atomically swap flatten codes,
// preventing concurrent searches from accessing partially updated state.
{
    std::scoped_lock<std::shared_mutex> wlock(this->global_mutex_);
    basic_flatten_codes_ = new_basic;
    if (drop_precise_codes) {
        high_precise_codes_.reset();
    } else {
        high_precise_codes_ = new_precise;
    }
    use_reorder_ = new_use_reorder;
}

// set status
if (disable_future_tuning) {
    this->index_feature_list_->SetFeature(IndexFeature::SUPPORT_TUNE, false);
    this->raw_vector_.reset();
    has_raw_vector_ = false;
    create_new_raw_vector_ = false;
    check_and_init_raw_vector(param->raw_vector_param, common_param, false);
    init_resize_bit_and_reorder();
```
Tune now swaps basic_flatten_codes_/high_precise_codes_ under global_mutex_, but other public read paths (e.g., CalcDistanceById/CalDistanceById) access these members without taking global_mutex_. That means concurrent Tune() can still race with those APIs and potentially trigger the same use-after-free. Consider taking a std::shared_lock on global_mutex_ in those distance-by-id methods (and any other readers of these pointers), or otherwise ensuring the swapped state is read atomically everywhere.
```cpp
auto new_basic = tune_and_rebuild(is_tune_base_code, basic_flatten_codes_, new_basic_code);
auto new_precise =
    tune_and_rebuild(is_tune_precise_code, high_precise_codes_, new_precise_code);

check_and_init_raw_vector(param->raw_vector_param, common_param, false);
init_resize_bit_and_reorder();
// Acquire exclusive global lock to atomically swap flatten codes,
// preventing concurrent searches from accessing partially updated state.
{
    std::scoped_lock<std::shared_mutex> wlock(this->global_mutex_);
    basic_flatten_codes_ = new_basic;
    if (drop_precise_codes) {
        high_precise_codes_.reset();
    } else {
        high_precise_codes_ = new_precise;
    }
```
This PR fixes a specific concurrent crash scenario (Tune() vs searches) but doesn't add a regression test that runs Tune() concurrently with KnnSearch()/RangeSearch() to ensure the race stays fixed. There are already concurrent tests for read/write in tests/test_hgraph.cpp, so adding a focused Tune-vs-search stress/regression test would help prevent future reintroductions.
Change Type
Linked Issue
What Changed
- Root cause: `Tune()` mutates `basic_flatten_codes_`, `high_precise_codes_`, and `use_reorder_` under `add_mutex_`, while search methods hold `global_mutex_`. Different mutexes → no synchronization → use-after-free in SIMD distance computation when Tune destroys the old flatten codes mid-search.
- `Tune()`: Defer member mutations to local variables (`new_use_reorder`, `drop_precise_codes`). Build the new flatten codes fully, then atomically swap all shared state under an exclusive `global_mutex_`.
- `RangeSearch()`: Add the missing `shared_lock(global_mutex_)`, which was present in `KnnSearch`/`SearchWithRequest` but absent here.

Lock order
- `add_mutex_` → `global_mutex_` is preserved (consistent with `add_one_point`).

Test Evidence
- `make fmt`
- `make lint`
- `make test`
- `make cov`, run tests, and collect coverage

Test details:
Compatibility Impact
- `Tune()` now blocks concurrent searches only during the pointer swap (not during training/rebuild).
- `RangeSearch()` now correctly synchronizes with `Tune()`.

Performance and Concurrency Impact
- The exclusive hold on `global_mutex_` in Tune is minimal (pointer swaps only; training happens outside the lock).

Documentation Impact
- `README.md`
- `DEVELOPMENT.md`
- `CONTRIBUTING.md`

Risk and Rollback
Checklist
(`[skip ci]` prefix)