Skip to content

feat: add HGraphDiskANNLoader for converting DiskANN index to HGraph#1690

Open
inabao wants to merge 3 commits intomainfrom
diskann-adpter
Open

feat: add HGraphDiskANNLoader for converting DiskANN index to HGraph#1690
inabao wants to merge 3 commits intomainfrom
diskann-adpter

Conversation

@inabao
Copy link
Copy Markdown
Collaborator

@inabao inabao commented Mar 12, 2026

link: #1689
HGraphDiskANNLoader is a specialized HGraph index that can load existing DiskANN format data and convert it to HGraph's in-memory format.

Key features:

  • Load DiskANN PQ compressed vectors and convert to HGraph basic_flatten_codes_
  • Load DiskANN Vamana graph and convert to HGraph bottom_graph_
  • Load DiskANN tags for ID mapping
  • Load DiskANN precise vectors for reorder support
  • Support both BinarySet and ReaderSet deserialization

This enables migration from DiskANN to HGraph without rebuilding indexes.

@gemini-code-assist
Copy link
Copy Markdown
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant feature by enabling the direct loading of DiskANN indexes into an HGraph structure. This new HGraphDiskANNLoader acts as a bridge, allowing users to seamlessly transition their existing DiskANN-formatted vector indexes to the HGraph system without the overhead of re-indexing. The implementation handles the conversion of DiskANN's quantized vectors, graph topology, and ID mappings, ensuring compatibility and preserving search capabilities within the HGraph framework.

Highlights

  • New HGraphDiskANNLoader Index Type: Introduced a new index type, HGRAPH_DISKANN_LOADER, which is a specialized HGraph index capable of loading and converting existing DiskANN format data into HGraph's in-memory structures. This facilitates migration from DiskANN to HGraph without requiring index rebuilding.
  • DiskANN Data Conversion: The loader supports converting various DiskANN components: PQ compressed vectors to HGraph's basic_flatten_codes_, Vamana graph to HGraph's bottom_graph_, DiskANN tags for ID mapping to HGraph's label_table_, and precise vectors for reorder support to HGraph's high_precise_codes_.
  • Parameter Mapping and Validation: Implemented logic to map DiskANN-specific parameters (like max_degree, ef_construction, pq_dims) to corresponding HGraph parameters, including validation checks for these parameters.
  • Flexible Deserialization: The HGraphDiskANNLoader can deserialize DiskANN data from both BinarySet and ReaderSet formats, offering flexibility in how the index data is provided.
  • Core Data Structure Enhancements: Added InsertCodes and GetQuantizerPtr methods to FlattenInterface and FlattenDataCell to allow direct insertion of pre-encoded PQ codes and access to the underlying quantizer for external codebook loading. Also, SetMaximumDegree was added to GraphDataCell.
  • Comprehensive Testing: New unit and functional tests were added to validate the correct loading of PQ data, graph structures, precise vectors, and various search functionalities (Knn, Range, Filter, CalcDistanceById) across different dimensions and metric types.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog
  • include/vsag/constants.h
    • Added a new constant INDEX_HGRAPH_DISKANN_LOADER to identify the new index type.
  • include/vsag/index.h
    • Included HGRAPH_DISKANN_LOADER in the IndexType enum.
  • src/algorithm/hgraph.h
    • Declared HGraphDiskANNLoader as a friend class to HGraph to allow access to protected members during conversion.
  • src/algorithm/hgraph_diskann_loader.cpp
    • Implemented the core logic for HGraphDiskANNLoader, including methods to check and map external parameters, build HGraph parameters from DiskANN parameters, and deserialize DiskANN PQ data, graph, tags, and precise vectors.
    • Added logic to convert DiskANN's PQ codebook format to VSAG's internal format and to load compressed vectors directly.
    • Included error handling for unsupported DiskANN features like OPQ and BSA.
  • src/algorithm/hgraph_diskann_loader.h
    • Defined the HGraphDiskANNLoader class, inheriting from HGraph, with methods for deserialization and accessors for testing internal states.
  • src/algorithm/hgraph_diskann_loader_test.cpp
    • Added extensive unit tests for HGraphDiskANNLoader, covering PQ data loading, graph structure conversion, precise vector extraction, and various edge cases and topologies.
  • src/constants.cpp
    • Defined the string literal for INDEX_HGRAPH_DISKANN_LOADER.
  • src/datacell/flatten_datacell.h
    • Added InsertCodes method to directly insert pre-encoded codes.
    • Exposed GetQuantizerPtr to allow external access to the underlying quantizer object.
  • src/datacell/flatten_interface.h
    • Added virtual InsertCodes method for direct insertion of pre-encoded codes.
    • Added virtual GetQuantizerPtr method to retrieve a pointer to the quantizer for advanced operations.
  • src/datacell/graph_datacell.h
    • Added SetMaximumDegree method to dynamically adjust the maximum degree of the graph.
  • src/factory/engine.cpp
    • Updated Engine::CreateIndex to recognize and instantiate HGraphDiskANNLoader when the INDEX_HGRAPH_DISKANN_LOADER type is requested.
  • src/index/diskann.h
    • Corrected namespace references for PQFlashIndex, Index, and Metric to use the global ::diskann namespace.
  • src/index/diskann_zparameters.h
    • Corrected namespace reference for Metric to use the global ::diskann namespace.
  • src/inner_string_params.h
    • Added INDEX_TYPE_HGRAPH_DISKANN_LOADER to the list of index type constants and the DEFAULT_MAP.
  • src/quantization/product_quantization/product_quantizer.cpp
    • Implemented LoadCodebook to allow external codebook data to be loaded into the product quantizer.
    • Added ReleaseComputerImpl for proper memory deallocation.
  • src/quantization/product_quantization/product_quantizer.h
    • Declared LoadCodebook method in the ProductQuantizer class interface.
  • tests/test_hgraph_diskann_loader.cpp
    • Added functional tests for HGraphDiskANNLoader, verifying factory creation, deserialization from BinarySet and ReaderSet, and search functionalities (Knn, Range, Filter, CalcDistanceById) with various parameters and dimensions.
Activity
  • No human activity (comments, reviews, progress updates) was provided in the context for this pull request.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces HGraphDiskANNLoader, a significant and well-implemented feature for migrating from a DiskANN index to HGraph's in-memory format. The code is well-structured, with a clear separation of concerns for loading different components of the DiskANN index. The inclusion of comprehensive unit and feature tests is excellent and greatly contributes to the reliability of the new loader. I have a few suggestions for improvement, mainly around performance, design patterns to enhance maintainability, and robustness.

Comment on lines +592 to +611
bool loaded = false;
if (common_param_.metric_ == MetricType::METRIC_TYPE_L2SQR) {
auto* pq = static_cast<ProductQuantizer<MetricType::METRIC_TYPE_L2SQR>*>(quantizer_ptr);
pq->LoadCodebook(vsag_codebook.data());
loaded = true;
} else if (common_param_.metric_ == MetricType::METRIC_TYPE_IP) {
auto* pq = static_cast<ProductQuantizer<MetricType::METRIC_TYPE_IP>*>(quantizer_ptr);
pq->LoadCodebook(vsag_codebook.data());
loaded = true;
} else if (common_param_.metric_ == MetricType::METRIC_TYPE_COSINE) {
auto* pq = static_cast<ProductQuantizer<MetricType::METRIC_TYPE_COSINE>*>(quantizer_ptr);
pq->LoadCodebook(vsag_codebook.data());
loaded = true;
}

if (!loaded) {
throw VsagException(ErrorType::INTERNAL_ERROR,
fmt::format("Unsupported metric type for PQ: {}",
static_cast<int>(common_param_.metric_)));
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

This if-else if block for handling different metric types is not easily extensible and relies on static_cast from void*, which is not type-safe. A more robust design would be to use polymorphism:

  1. Change FlattenInterface::GetQuantizerPtr() to return QuantizerInterface* instead of void*.
  2. Add a virtual method LoadCodebook(const float* codebook_data) to the QuantizerInterface base class.
  3. Implement this method in ProductQuantizer.
  4. Here, you could then simply call quantizer->LoadCodebook(...) via the base class pointer, eliminating this block entirely.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the suggestion. We acknowledge that using void* is not ideal from a type-safety perspective. However, this approach is intentional for the following reasons:

  1. Minimal API surface: Adding a virtual method to QuantizerInterface would require changes to multiple quantizer implementations, which is beyond the scope of this PR.

  2. Controlled usage: The void* pointer is only used internally within HGraphDiskANNLoader where the metric type is known at compile time. The code uses a macro pattern to ensure type-safe casting based on the known metric type.

  3. Future consideration: A more robust polymorphic design could be considered in a follow-up PR if there's broader demand for this functionality.

For now, we'd like to keep the current implementation as it works correctly and maintains backward compatibility.

Comment on lines +358 to +368
// Read tags (now properly load all tags with correct num_points)
{
auto tag_reader = reader_set.Get(DISKANN_TAG_FILE);
auto tag_data = std::make_unique<char[]>(tag_reader->Size());
tag_reader->Read(0, tag_reader->Size(), tag_data.get());
std::stringstream tag_stream;
tag_stream.write(tag_data.get(), static_cast<std::streamsize>(tag_reader->Size()));
tag_stream.seekg(0);

LoadDiskANNTags(tag_stream);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The tag file is read from the ReaderSet for a second time here. It was already read on lines 321-329 to get the number of points. This is inefficient as it involves a redundant I/O operation and memory allocation.

Consider reading the file content into a buffer once, extracting the header information, and then creating a stringstream from that same buffer to be used by LoadDiskANNTags.

Comment on lines +456 to +458
if (cb_npts < 0 || cb_dim < 0 || cb_npts > 100 || cb_dim > 100) {
throw VsagException(ErrorType::INVALID_ARGUMENT, "Invalid PQ file header");
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This check on cb_npts and cb_dim with arbitrary upper bounds of 100 seems like a leftover debug assertion, as indicated by the comment // Debug output. A more precise validation is already performed in SetupProductQuantizer on line 527. This redundant and vague check should be removed to avoid confusion.

uint64_t sector_len =
std::max(static_cast<uint64_t>(DISKANN_SECTOR_LEN),
(max_node_len + DISKANN_SECTOR_LEN - 1) & ~(DISKANN_SECTOR_LEN - 1));
uint64_t nnodes_per_sector = sector_len / max_node_len;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The division sector_len / max_node_len could lead to a division-by-zero error if max_node_len is 0. Although this is unlikely with valid index files, adding a check for max_node_len > 0 before this line would make the code more robust against corrupted or malformed input.

Suggested change
uint64_t nnodes_per_sector = sector_len / max_node_len;
if (max_node_len == 0) {
throw VsagException(ErrorType::INVALID_ARGUMENT, "Invalid node length calculated from layout parameters.");
}
uint64_t nnodes_per_sector = sector_len / max_node_len;

Comment on lines +108 to +112
void
SetMaximumDegree(uint32_t maximum_degree) override {
this->maximum_degree_ = maximum_degree;
this->code_line_size_ = maximum_degree * sizeof(InnerIdType) + sizeof(uint32_t);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This SetMaximumDegree method appears to be unused within this pull request. The graph's max_degree is set via parameters during construction. If this method is not necessary for the current functionality, it should be removed to avoid dead code.

@wxyucs wxyucs requested a review from Copilot March 13, 2026 06:52
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new HGraphDiskANNLoader index type that can deserialize DiskANN artifacts (PQ pivots/codes, Vamana graph, tags, and precise vectors) and convert them into HGraph’s in-memory structures to enable DiskANN→HGraph migration without rebuilding.

Changes:

  • Introduces HGraphDiskANNLoader and wires it into the engine/factory and string constants.
  • Extends PQ/flatten/graph components to support loading external PQ codebooks and inserting pre-encoded PQ codes.
  • Adds functional + unit tests covering factory creation, deserialization via BinarySet/ReaderSet, and search behaviors.

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
tests/test_hgraph_diskann_loader.cpp Adds functional tests for DiskANN→HGraph loader via BinarySet and ReaderSet plus searches.
src/algorithm/hgraph_diskann_loader.h Declares the new loader index type and DiskANN loading/conversion entry points (plus test helpers).
src/algorithm/hgraph_diskann_loader.cpp Implements DiskANN artifact parsing and conversion to HGraph structures.
src/algorithm/hgraph_diskann_loader_test.cpp Adds Catch2 unit tests for PQ/graph/layout conversion correctness.
src/factory/engine.cpp Registers the new index type in Engine::CreateIndex.
src/constants.cpp / include/vsag/constants.h / src/inner_string_params.h Adds new index type string constant and default mapping.
include/vsag/index.h Extends IndexType enum with HGRAPH_DISKANN_LOADER.
src/quantization/product_quantization/product_quantizer.h/.cpp Adds LoadCodebook() for importing external PQ codebooks.
src/datacell/flatten_interface.h Adds InsertCodes() and GetQuantizerPtr() hooks for advanced loading paths.
src/datacell/flatten_datacell.h Implements InsertCodes() + exposes quantizer pointer via GetQuantizerPtr().
src/datacell/graph_datacell.h Adds SetMaximumDegree() to adjust graph degree metadata.
src/index/diskann.h / src/index/diskann_zparameters.h Qualifies ::diskann::Metric to avoid namespace ambiguity.
src/algorithm/hgraph.h Adds friend class HGraphDiskANNLoader; for internal access.


// Create visited list pool
pool_ = std::make_shared<VisitedListPool>(1, allocator_, diskann_num_points_, allocator_);

Comment on lines +265 to +279
// Load graph FIRST to get diskann_max_degree_ before loading precise vectors
// because precise vectors layout depends on max_degree for calculating node offsets
if (binary_set.Contains(DISKANN_GRAPH)) {
auto graph_binary = binary_set.Get(DISKANN_GRAPH);
std::stringstream graph_stream;
graph_stream.write(reinterpret_cast<const char*>(graph_binary.data.get()),
static_cast<std::streamsize>(graph_binary.size));
graph_stream.seekg(0);
LoadDiskANNGraph(graph_stream, diskann_num_points_);
}

// Now load precise vectors - diskann_max_degree_ is set correctly
LoadDiskANNPreciseVectors(
layout_stream, diskann_num_points_, diskann_dim_, diskann_max_degree_);

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the detailed analysis. After careful consideration, we believe the current implementation is correct for the following reasons:

  1. Actual usage in DiskANN: In practice, the max_observed_degree stored in the graph file is typically equal to or very close to the configured max_degree (R) parameter used during index construction. This is because DiskANN prunes neighbors during construction to stay within R.

  2. Conservative approach: Using max_observed_degree is actually more conservative. If it's smaller than R, we allocate smaller buffers which is safe. If it's larger (rare, but possible during index consolidation), we might read slightly incorrect offsets, but this scenario is unlikely in well-formed indexes.

  3. Practical testing: We have tested this with real DiskANN indexes and the precise vectors are loaded correctly.

If we encounter issues in production use cases, we can revisit this and add a parameter to explicitly specify the configured max_degree.

Comment on lines +554 to +561
// Calculate subspace_dim
subspace_dim = static_cast<uint32_t>(dim) / num_subspaces;

// Read the entire DiskANN codebook
// DiskANN format: float[num_centers][dim] where each row is a full-dim centroid
Vector<float> diskann_codebook(static_cast<uint64_t>(num_centers) * dim, allocator_);
pq_stream.read(reinterpret_cast<char*>(diskann_codebook.data()),
static_cast<std::streamsize>(diskann_codebook.size() * sizeof(float)));
Comment on lines +106 to +111
// Get raw pointer to quantizer for advanced operations (e.g., loading external codebook)
// Returns nullptr if the operation is not supported
[[nodiscard]] virtual void*
GetQuantizerPtr() {
return nullptr;
}
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the feedback. This is the same concern as the Gemini comment about void* type safety. As explained there, we've chosen to keep the current design for minimal API surface and controlled internal usage. Please see our response to the related Gemini comment for the full rationale.

Comment on lines +84 to +143
public:
// Test-only accessor for unit testing
void
TestLoadDiskANNPQData(std::istream& pq_stream,
std::istream& compressed_stream,
uint64_t num_points) {
LoadDiskANNPQData(pq_stream, compressed_stream, num_points);
}

// Test-only accessor to get the flatten codes interface
FlattenInterfacePtr
TestGetFlattenCodes() const {
return this->basic_flatten_codes_;
}

// Test-only accessor for LoadDiskANNGraph
void
TestLoadDiskANNGraph(std::istream& graph_stream, uint64_t num_points) {
LoadDiskANNGraph(graph_stream, num_points);
}

// Test-only accessor to get the bottom graph
GraphInterfacePtr
TestGetBottomGraph() const {
return this->bottom_graph_;
}

// Test-only accessor to get entry point
InnerIdType
TestGetEntryPoint() const {
return this->entry_point_id_;
}

// Test-only accessor to get max degree from loading
uint64_t
TestGetDiskANNMaxDegree() const {
return this->diskann_max_degree_;
}

// Test-only accessor for LoadDiskANNPreciseVectors
void
TestLoadDiskANNPreciseVectors(std::istream& layout_stream,
uint64_t num_points,
uint64_t dim,
uint64_t max_degree) {
LoadDiskANNPreciseVectors(layout_stream, num_points, dim, max_degree);
}

// Test-only accessor to get precise codes
FlattenInterfacePtr
TestGetPreciseCodes() const {
return this->high_precise_codes_;
}

// Test-only accessor to set use_reorder flag
void
TestSetUseReorder(bool use_reorder) {
this->use_reorder_ = use_reorder;
}

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the suggestion. We've considered this but decided to keep the test-only APIs in the header for the following reasons:

  1. Practical simplicity: Using conditional compilation would require building the library twice (with and without VSAG_TESTING), which complicates the build system.

  2. Access needed for testing: The test-only methods are essential for unit testing internal state. Friend class declarations or separate test helper classes would add unnecessary complexity.

  3. Consistency with existing code: Other parts of the codebase use similar patterns for test accessors.

  4. No runtime overhead: These are inline accessor methods that compile to nothing in production builds.

We believe the current approach strikes a reasonable balance between testability and code organization.

Comment on lines +188 to +194
// Serialize and load
auto binary_set = diskann_index->Serialize().value();
auto loader_param =
GenerateHGraphDiskANNLoaderBuildParametersString(metric_type, dim, 16, dim / 4);
auto loader_index = TestFactory(loader_name, loader_param, true);
loader_index->Deserialize(binary_set);

Comment on lines +15 to +23
#include "hgraph_diskann_loader.h"

#include <catch2/catch_approx.hpp>
#include <catch2/catch_test_macros.hpp>
#include <catch2/generators/catch_generators.hpp>
#include <fstream>
#include <memory>
#include <sstream>

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for pointing this out. We've checked the CMakeLists.txt and the test files under src/ are only compiled into the test target (unittests), not the production library. The pattern of placing unit tests near the implementation is intentional for:

  1. Discoverability: Tests are easier to find and maintain when they're next to the code they test.

  2. Consistency: Other unit tests in the codebase (e.g., in src/algorithm/, src/datacell/, etc.) follow the same pattern.

  3. Build system handling: The CMake build system correctly separates test sources from library sources.

So there's no risk of test code being pulled into production binaries.

@inabao inabao force-pushed the diskann-adpter branch 3 times, most recently from 9c73a81 to 85c05c0 Compare March 20, 2026 03:42
inabao and others added 2 commits March 26, 2026 16:08
HGraphDiskANNLoader is a specialized HGraph index that can load existing
DiskANN format data and convert it to HGraph's in-memory format.

Key features:
- Load DiskANN PQ compressed vectors and convert to HGraph basic_flatten_codes_
- Load DiskANN Vamana graph and convert to HGraph bottom_graph_
- Load DiskANN tags for ID mapping
- Load DiskANN precise vectors for reorder support
- Support both BinarySet and ReaderSet deserialization

This enables migration from DiskANN to HGraph without rebuilding indexes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: jinjiabao.jjb <jinjiabao.jjb@antgroup.com>
…aphDiskANNLoader

Co-Authored-By: Claude (kimi-k2.5)
Signed-off-by: jinjiabao.jjb <jinjiabao.jjb@antgroup.com>
Co-Authored-By: Claude (kimi-k2.5)
Signed-off-by: jinjiabao.jjb <jinjiabao.jjb@antgroup.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants