feat(pyramid): add PyramidAnalyzer for index statistics and diagnostics#1766
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: jinjiabao.jjb <jinjiabao.jjb@antgroup.com>
Code Review
This pull request implements a comprehensive PyramidAnalyzer to provide detailed diagnostic statistics for the Pyramid index, covering node structure, leaf size distribution, and sub-index quality. Key changes include adding GetStats and GetVectorByInnerId to the Pyramid class, extending the analyze_index tool for Pyramid support, and fixing a stream-handling bug during index loading. Review feedback identifies a missing implementation for quantization error calculation, suggests replacing magic numbers with named constants, and recommends making several hardcoded analysis thresholds and logging intervals configurable to improve flexibility and performance.
        return {0.0F, 0.0F};
    }

    float total_quantization_error = 0.0F;
The variable total_quantization_error is initialized to 0.0F but is never updated within the calculate_quantization_result function. As a result, the first element of the returned tuple will always be 0.0F, which is likely not the intended behavior. Please add the logic to calculate and update this variable.
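One way the missing accumulation could look is sketched below. This is an illustration only: the helper name `mean_quantization_error`, the flat row-major buffer layout, and the choice of L2 distance as the error measure are all assumptions, since the body of calculate_quantization_result is not shown in the review.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper, not taken from the PR. It averages the L2 distance
// between each original vector and its reconstructed (decoded) counterpart.
// Both inputs are flat row-major buffers holding count * dim floats.
inline float
mean_quantization_error(const std::vector<float>& original,
                        const std::vector<float>& reconstructed,
                        std::size_t dim) {
    assert(dim > 0);
    assert(original.size() == reconstructed.size());
    assert(original.size() % dim == 0);

    const std::size_t count = original.size() / dim;
    if (count == 0) {
        return 0.0F;
    }

    float total_quantization_error = 0.0F;
    for (std::size_t i = 0; i < count; ++i) {
        float squared = 0.0F;
        for (std::size_t d = 0; d < dim; ++d) {
            const float diff = original[i * dim + d] - reconstructed[i * dim + d];
            squared += diff * diff;
        }
        // The accumulator is updated for every sampled vector, which is the
        // step the review comment reports as missing.
        total_quantization_error += std::sqrt(squared);
    }
    return total_quantization_error / static_cast<float>(count);
}
```

Whatever error measure the analyzer actually uses, the accumulator has to be updated inside the sampling loop before it is returned.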
get_suitable_max_degree(int64_t data_num) {
    if (data_num < 100'000) {
        return 16;
        return 24;
    analyzer_param.topk = 10;
    analyzer_param.base_sample_size = std::min<uint64_t>(10, this->GetNumElements());
    analyzer_param.search_params = R"({"pyramid": {"ef_search": 500}})";
    auto start_time = std::chrono::steady_clock::now();

    for (uint32_t i = 0; i < sample_size; ++i) {
        if (i % 1 == 0) {
The progress logging in calculate_search_result occurs on every iteration (i % 1 == 0). This can generate an excessive amount of log messages and significantly degrade performance for large sample sizes. It should be adjusted to log less frequently, for example, i % 100 == 0, or i % (sample_size / 10) == 0 provided the divisor is clamped to at least 1 so that sample sizes below 10 do not cause a modulo by zero.
-        if (i % 1 == 0) {
+        if (i % 100 == 0) {
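If the interval is derived from sample_size instead of being fixed at 100, it needs a guard, because sample_size / 10 evaluates to zero for fewer than ten samples. The sketch below shows one possible shape; `kProgressReportCount`, `progress_log_interval`, and `should_log_progress` are hypothetical names introduced for illustration, not identifiers from the PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical constant: roughly how many progress messages to emit per run.
constexpr uint32_t kProgressReportCount = 10;

// Returns the number of iterations between progress messages.
// Clamped to at least 1 so the modulo below never divides by zero.
inline uint32_t
progress_log_interval(uint32_t sample_size) {
    return std::max<uint32_t>(1, sample_size / kProgressReportCount);
}

// Returns true when iteration i should emit a progress message.
inline bool
should_log_progress(uint32_t i, uint32_t sample_size) {
    assert(i < sample_size);
    return i % progress_log_interval(sample_size) == 0;
}
```

Inside the sampling loop the condition would then read `if (should_log_progress(i, sample_size))`, which yields about ten messages for large runs and one per iteration for very small ones.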
    uint32_t sample_count =
        std::min(static_cast<uint32_t>(100), static_cast<uint32_t>(node_ids.size()));
        total_weighted_recall += recall * static_cast<float>(size);
        total_size += size;

        if (recall < 0.8F) {
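The review summary asks for hardcoded thresholds such as the 0.8F above to become named or configurable values. A minimal sketch of that idea follows, under stated assumptions: the struct and function names are hypothetical, and the final division by total_size is a guess at how the weighted sum is normalized, since the rest of the loop is not shown.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical types for illustration; the PR's actual data layout is not shown.
struct SubIndexRecall {
    float recall;   // recall measured for one sub-index, in [0, 1]
    uint64_t size;  // number of vectors in that sub-index
};

struct RecallSummary {
    float weighted_recall;      // size-weighted average recall
    uint32_t low_recall_count;  // sub-indexes below the threshold
};

// Named default replacing the bare 0.8F literal.
constexpr float kDefaultLowRecallThreshold = 0.8F;

inline RecallSummary
summarize_recall(const std::vector<SubIndexRecall>& samples,
                 float low_recall_threshold = kDefaultLowRecallThreshold) {
    assert(low_recall_threshold >= 0.0F && low_recall_threshold <= 1.0F);

    float total_weighted_recall = 0.0F;
    uint64_t total_size = 0;
    uint32_t low_recall_count = 0;
    for (const auto& sample : samples) {
        total_weighted_recall += sample.recall * static_cast<float>(sample.size);
        total_size += sample.size;
        if (sample.recall < low_recall_threshold) {
            ++low_recall_count;
        }
    }
    // Assumed normalization: divide by the total size, guarding the empty case.
    const float weighted_recall =
        total_size == 0 ? 0.0F : total_weighted_recall / static_cast<float>(total_size);
    return {weighted_recall, low_recall_count};
}
```

Passing the threshold as a parameter with a named default keeps the current behavior while letting callers, such as the analyze_index tool, tighten or relax it.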
Summary
Add a new PyramidAnalyzer class to provide comprehensive statistics and diagnostics for Pyramid indexes.

Closes #1765
Changes
New Files
src/analyzer/pyramid_analyzer.h - PyramidAnalyzer class declaration
src/analyzer/pyramid_analyzer.cpp - PyramidAnalyzer implementation
Modified Files
src/algorithm/pyramid.cpp - Add GetStats() and GetVectorByInnerId() methods
src/algorithm/pyramid.h - Add declarations and friend class
src/analyzer/analyzer.cpp - Add PyramidAnalyzer factory method
src/analyzer/CMakeLists.txt - Add compilation config
src/datacell/compressed_graph_datacell.* - Add GetIds() method
tests/test_pyramid.cpp - Add analyzer tests
tools/analyze_index/analyze_index.cpp - Support Pyramid index analysis
Features
Index Structure Analysis
Subindex Quality Analysis
Recall Statistics
Query-based Analysis
Test
make release compiles successfully