feat(pyramid): add PyramidAnalyzer for index statistics and diagnostics#1766

Open
inabao wants to merge 2 commits into main from feat-pyramid-analyzer

@inabao inabao commented Mar 27, 2026

Summary

Add a new PyramidAnalyzer class to provide comprehensive statistics and diagnostics for Pyramid indexes.

Closes #1765

Changes

New Files

  • src/analyzer/pyramid_analyzer.h - PyramidAnalyzer class declaration
  • src/analyzer/pyramid_analyzer.cpp - PyramidAnalyzer implementation

Modified Files

  • src/algorithm/pyramid.cpp - Add GetStats() and GetVectorByInnerId() methods
  • src/algorithm/pyramid.h - Add declarations and friend class
  • src/analyzer/analyzer.cpp - Add PyramidAnalyzer factory method
  • src/analyzer/CMakeLists.txt - Add compilation config
  • src/datacell/compressed_graph_datacell.* - Add GetIds() method
  • tests/test_pyramid.cpp - Add analyzer tests
  • tools/analyze_index/analyze_index.cpp - Support Pyramid index analysis

Features

Index Structure Analysis

  • Node structure statistics (total nodes, max depth, distribution)
  • Leaf node size distribution (percentiles, histogram)
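
The percentile part of the leaf-size distribution could be computed along these lines. This is a minimal sketch using the nearest-rank definition; the function name `leaf_size_percentile` and its signature are assumptions for illustration and are not taken from the PR.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Nearest-rank percentile over leaf sizes; `p` is a percentage in [0, 100].
// The vector is taken by value because the function sorts its own copy.
inline uint64_t
leaf_size_percentile(std::vector<uint64_t> leaf_sizes, double p) {
    if (leaf_sizes.empty()) {
        return 0;  // no leaves: report 0 instead of indexing out of bounds
    }
    std::sort(leaf_sizes.begin(), leaf_sizes.end());
    p = std::clamp(p, 0.0, 100.0);
    // Nearest rank: the smallest value with at least p% of samples at or below it.
    auto rank = static_cast<std::size_t>(
        std::ceil(p / 100.0 * static_cast<double>(leaf_sizes.size())));
    if (rank == 0) {
        rank = 1;  // p == 0 maps to the minimum
    }
    return leaf_sizes[rank - 1];
}
```

Nearest-rank always returns a value that actually occurs in the data, which keeps the report easy to cross-check against individual leaves; an interpolating definition would be an equally valid choice.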

Subindex Quality Analysis

  • Graph vs FLAT node statistics
  • Vectors-in-graph ratio

Recall Statistics

  • Weighted recall calculation
  • Low recall node detection (< 80%)
  • Duplicate ratio for low recall nodes
  • Entry point duplicate check

Query-based Analysis

  • Recall with user queries
  • Quantization error analysis

Test

  • Unit tests pass
  • Functional tests added
  • make release compiles successfully

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: jinjiabao.jjb <jinjiabao.jjb@antgroup.com>

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request implements a comprehensive PyramidAnalyzer to provide detailed diagnostic statistics for the Pyramid index, covering node structure, leaf size distribution, and sub-index quality. Key changes include adding GetStats and GetVectorByInnerId to the Pyramid class, extending the analyze_index tool for Pyramid support, and fixing a stream-handling bug during index loading. Review feedback identifies a missing implementation for quantization error calculation, suggests replacing magic numbers with named constants, and recommends making several hardcoded analysis thresholds and logging intervals configurable to improve flexibility and performance.

return {0.0F, 0.0F};
}

float total_quantization_error = 0.0F;

high

The variable total_quantization_error is initialized to 0.0F but is never updated within the calculate_quantization_result function. As a result, the first element of the returned tuple will always be 0.0F, which is likely not the intended behavior. Please add the logic to calculate and update this variable.
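
One possible shape for the missing accumulation is sketched below. The PR excerpt does not show how quantization error is defined, so this assumes it is the mean absolute gap between exact and quantized distances; the function name and both input vectors are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Mean absolute gap between exact and quantized distances.
// ASSUMPTION: this definition of "quantization error" is illustrative only.
inline float
mean_quantization_error(const std::vector<float>& exact_dists,
                        const std::vector<float>& quantized_dists) {
    if (exact_dists.empty() || exact_dists.size() != quantized_dists.size()) {
        return 0.0F;  // nothing to compare, or mismatched inputs
    }
    float total_quantization_error = 0.0F;
    for (std::size_t i = 0; i < exact_dists.size(); ++i) {
        // The per-sample update that the review reports as missing.
        total_quantization_error += std::fabs(exact_dists[i] - quantized_dists[i]);
    }
    return total_quantization_error / static_cast<float>(exact_dists.size());
}
```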

get_suitable_max_degree(int64_t data_num) {
if (data_num < 100'000) {
return 16;
return 24;

medium

The value 24 for max_degree is a magic number. Consider defining this as a named constant (e.g., DEFAULT_MAX_DEGREE_SMALL_DATASET) to improve readability and maintainability.

Suggested change
- return 24;
+ return DEFAULT_MAX_DEGREE_SMALL_DATASET;
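
A sketch of the named-constant approach follows. Only the values 100'000, 16 and 24 come from the excerpt; the constant names, the return type, and the two-tier shape are assumptions, and the real function may have more branches than the diff shows.

```cpp
#include <cstdint>

// Hypothetical names; only the numeric values appear in the reviewed code.
constexpr int64_t SMALL_DATASET_LIMIT = 100'000;
constexpr uint64_t MAX_DEGREE_SMALL_DATASET = 16;
constexpr uint64_t MAX_DEGREE_DEFAULT = 24;

// Two-tier selection: smaller datasets get the smaller graph degree.
constexpr uint64_t
suitable_max_degree(int64_t data_num) {
    return data_num < SMALL_DATASET_LIMIT ? MAX_DEGREE_SMALL_DATASET : MAX_DEGREE_DEFAULT;
}
```

Naming both tiers, not just the value 24, keeps the threshold and its two outcomes readable in one place.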

Comment on lines +924 to +926
analyzer_param.topk = 10;
analyzer_param.base_sample_size = std::min<uint64_t>(10, this->GetNumElements());
analyzer_param.search_params = R"({"pyramid": {"ef_search": 500}})";

medium

The topk and base_sample_size parameters for the AnalyzerParam are hardcoded. It would be more flexible to allow these to be configurable, perhaps by passing them as arguments to GetStats() or through the Pyramid constructor if they are meant to be index-wide analysis parameters.
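
One way to make these values configurable is to pass them in with defaults that match the current hardcoded ones. In the sketch below, `AnalyzerParamSketch` is a hypothetical stand-in for the real `AnalyzerParam` (only the three field names come from the excerpt), and `make_analyzer_param` is an invented helper.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>

// Hypothetical stand-in for AnalyzerParam; the real type may have more fields.
struct AnalyzerParamSketch {
    uint64_t topk = 10;
    uint64_t base_sample_size = 10;
    std::string search_params = R"({"pyramid": {"ef_search": 500}})";
};

// Caller-supplied values override the defaults; the sample size is clamped
// so it never exceeds the number of elements in the index.
inline AnalyzerParamSketch
make_analyzer_param(uint64_t num_elements, uint64_t topk = 10, uint64_t base_sample_size = 10) {
    AnalyzerParamSketch param;
    param.topk = topk;
    param.base_sample_size = std::min<uint64_t>(base_sample_size, num_elements);
    return param;
}
```

Keeping the defaults equal to the existing constants means callers that pass nothing would see the same behavior as before.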

auto start_time = std::chrono::steady_clock::now();

for (uint32_t i = 0; i < sample_size; ++i) {
if (i % 1 == 0) {

medium

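The progress logging in calculate_search_result runs on every iteration, because i % 1 == 0 is always true. This can generate an excessive number of log messages and significantly degrade performance for large sample sizes. Log less frequently instead, for example with i % 100 == 0. An interval derived from the sample size, such as i % (sample_size / 10) == 0, also works, but only if the divisor is guarded: sample_size / 10 is zero whenever sample_size < 10, and a modulo by zero is undefined behavior.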

Suggested change
- if (i % 1 == 0) {
+ if (i % 100 == 0) {
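
If the interval is derived from the sample size rather than fixed at 100, the divisor needs a floor of 1, since `sample_size / 10` is zero for fewer than ten samples. A minimal sketch, with hypothetical helper names:

```cpp
#include <algorithm>
#include <cstdint>

// Aim for roughly ten progress messages per run. The result is never 0,
// so `i % interval` is always well defined.
inline uint32_t
progress_interval(uint32_t sample_size) {
    return std::max<uint32_t>(1, sample_size / 10);
}

// True on the iterations where a progress line should be emitted.
inline bool
should_log_progress(uint32_t i, uint32_t sample_size) {
    return i % progress_interval(sample_size) == 0;
}
```

For very small samples this logs on every iteration, which is harmless because there are at most nine of them.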

Comment on lines +1037 to +1038
uint32_t sample_count =
std::min(static_cast<uint32_t>(100), static_cast<uint32_t>(node_ids.size()));

medium

The sample_count for neighbor recall analysis is hardcoded to 100. This value might need tuning based on the dataset or desired analysis depth. Consider making it a configurable parameter in AnalyzerParam.

total_weighted_recall += recall * static_cast<float>(size);
total_size += size;

if (recall < 0.8F) {

medium

The recall threshold (0.8F) for identifying low-recall nodes is hardcoded. This value might need tuning based on the dataset or desired analysis depth. Consider making it a configurable parameter in AnalyzerParam.
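
A configurable threshold could look like the sketch below, which defaults to the current 0.8F so existing behavior is preserved. The function name and the plain vector of recalls are assumptions for illustration; in the PR the value would more likely live in `AnalyzerParam`.

```cpp
#include <cstddef>
#include <vector>

// Returns the positions of nodes whose recall is strictly below `threshold`.
// The default mirrors the hardcoded 0.8F in the reviewed code.
inline std::vector<std::size_t>
find_low_recall_nodes(const std::vector<float>& node_recalls, float threshold = 0.8F) {
    std::vector<std::size_t> low_recall_nodes;
    for (std::size_t i = 0; i < node_recalls.size(); ++i) {
        if (node_recalls[i] < threshold) {
            low_recall_nodes.push_back(i);
        }
    }
    return low_recall_nodes;
}
```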
