feat: add HGraphDiskANNLoader for converting DiskANN index to HGraph#1690
feat: add HGraphDiskANNLoader for converting DiskANN index to HGraph#1690
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a significant feature by enabling the direct loading of DiskANN indexes into an HGraph structure. This new Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces HGraphDiskANNLoader, a significant and well-implemented feature for migrating from a DiskANN index to HGraph's in-memory format. The code is well-structured, with a clear separation of concerns for loading different components of the DiskANN index. The inclusion of comprehensive unit and feature tests is excellent and greatly contributes to the reliability of the new loader. I have a few suggestions for improvement, mainly around performance, design patterns to enhance maintainability, and robustness.
| bool loaded = false; | ||
| if (common_param_.metric_ == MetricType::METRIC_TYPE_L2SQR) { | ||
| auto* pq = static_cast<ProductQuantizer<MetricType::METRIC_TYPE_L2SQR>*>(quantizer_ptr); | ||
| pq->LoadCodebook(vsag_codebook.data()); | ||
| loaded = true; | ||
| } else if (common_param_.metric_ == MetricType::METRIC_TYPE_IP) { | ||
| auto* pq = static_cast<ProductQuantizer<MetricType::METRIC_TYPE_IP>*>(quantizer_ptr); | ||
| pq->LoadCodebook(vsag_codebook.data()); | ||
| loaded = true; | ||
| } else if (common_param_.metric_ == MetricType::METRIC_TYPE_COSINE) { | ||
| auto* pq = static_cast<ProductQuantizer<MetricType::METRIC_TYPE_COSINE>*>(quantizer_ptr); | ||
| pq->LoadCodebook(vsag_codebook.data()); | ||
| loaded = true; | ||
| } | ||
|
|
||
| if (!loaded) { | ||
| throw VsagException(ErrorType::INTERNAL_ERROR, | ||
| fmt::format("Unsupported metric type for PQ: {}", | ||
| static_cast<int>(common_param_.metric_))); | ||
| } |
There was a problem hiding this comment.
This if-else if block for handling different metric types is not easily extensible and relies on static_cast from void*, which is not type-safe. A more robust design would be to use polymorphism:
- Change
FlattenInterface::GetQuantizerPtr()to returnQuantizerInterface*instead ofvoid*. - Add a virtual method
LoadCodebook(const float* codebook_data)to theQuantizerInterfacebase class. - Implement this method in
ProductQuantizer. - Here, you could then simply call
quantizer->LoadCodebook(...)via the base class pointer, eliminating this block entirely.
There was a problem hiding this comment.
Thank you for the suggestion. We acknowledge that using void* is not ideal from a type-safety perspective. However, this approach is intentional for the following reasons:
-
Minimal API surface: Adding a virtual method to
QuantizerInterfacewould require changes to multiple quantizer implementations, which is beyond the scope of this PR. -
Controlled usage: The
void*pointer is only used internally withinHGraphDiskANNLoaderwhere the metric type is known at compile time. The code uses a macro pattern to ensure type-safe casting based on the known metric type. -
Future consideration: A more robust polymorphic design could be considered in a follow-up PR if there's broader demand for this functionality.
For now, we'd like to keep the current implementation as it works correctly and maintains backward compatibility.
| // Read tags (now properly load all tags with correct num_points) | ||
| { | ||
| auto tag_reader = reader_set.Get(DISKANN_TAG_FILE); | ||
| auto tag_data = std::make_unique<char[]>(tag_reader->Size()); | ||
| tag_reader->Read(0, tag_reader->Size(), tag_data.get()); | ||
| std::stringstream tag_stream; | ||
| tag_stream.write(tag_data.get(), static_cast<std::streamsize>(tag_reader->Size())); | ||
| tag_stream.seekg(0); | ||
|
|
||
| LoadDiskANNTags(tag_stream); | ||
| } |
There was a problem hiding this comment.
The tag file is read from the ReaderSet for a second time here. It was already read on lines 321-329 to get the number of points. This is inefficient as it involves a redundant I/O operation and memory allocation.
Consider reading the file content into a buffer once, extracting the header information, and then creating a stringstream from that same buffer to be used by LoadDiskANNTags.
| if (cb_npts < 0 || cb_dim < 0 || cb_npts > 100 || cb_dim > 100) { | ||
| throw VsagException(ErrorType::INVALID_ARGUMENT, "Invalid PQ file header"); | ||
| } |
There was a problem hiding this comment.
This check on cb_npts and cb_dim with arbitrary upper bounds of 100 seems like a leftover debug assertion, as indicated by the comment // Debug output. A more precise validation is already performed in SetupProductQuantizer on line 527. This redundant and vague check should be removed to avoid confusion.
| uint64_t sector_len = | ||
| std::max(static_cast<uint64_t>(DISKANN_SECTOR_LEN), | ||
| (max_node_len + DISKANN_SECTOR_LEN - 1) & ~(DISKANN_SECTOR_LEN - 1)); | ||
| uint64_t nnodes_per_sector = sector_len / max_node_len; |
There was a problem hiding this comment.
The division sector_len / max_node_len could lead to a division-by-zero error if max_node_len is 0. Although this is unlikely with valid index files, adding a check for max_node_len > 0 before this line would make the code more robust against corrupted or malformed input.
| uint64_t nnodes_per_sector = sector_len / max_node_len; | |
| if (max_node_len == 0) { | |
| throw VsagException(ErrorType::INVALID_ARGUMENT, "Invalid node length calculated from layout parameters."); | |
| } | |
| uint64_t nnodes_per_sector = sector_len / max_node_len; |
src/datacell/graph_datacell.h
Outdated
| void | ||
| SetMaximumDegree(uint32_t maximum_degree) override { | ||
| this->maximum_degree_ = maximum_degree; | ||
| this->code_line_size_ = maximum_degree * sizeof(InnerIdType) + sizeof(uint32_t); | ||
| } |
There was a problem hiding this comment.
Pull request overview
Adds a new HGraphDiskANNLoader index type that can deserialize DiskANN artifacts (PQ pivots/codes, Vamana graph, tags, and precise vectors) and convert them into HGraph’s in-memory structures to enable DiskANN→HGraph migration without rebuilding.
Changes:
- Introduces
HGraphDiskANNLoaderand wires it into the engine/factory and string constants. - Extends PQ/flatten/graph components to support loading external PQ codebooks and inserting pre-encoded PQ codes.
- Adds functional + unit tests covering factory creation, deserialization via
BinarySet/ReaderSet, and search behaviors.
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/test_hgraph_diskann_loader.cpp | Adds functional tests for DiskANN→HGraph loader via BinarySet and ReaderSet plus searches. |
| src/algorithm/hgraph_diskann_loader.h | Declares the new loader index type and DiskANN loading/conversion entry points (plus test helpers). |
| src/algorithm/hgraph_diskann_loader.cpp | Implements DiskANN artifact parsing and conversion to HGraph structures. |
| src/algorithm/hgraph_diskann_loader_test.cpp | Adds Catch2 unit tests for PQ/graph/layout conversion correctness. |
| src/factory/engine.cpp | Registers the new index type in Engine::CreateIndex. |
| src/constants.cpp / include/vsag/constants.h / src/inner_string_params.h | Adds new index type string constant and default mapping. |
| include/vsag/index.h | Extends IndexType enum with HGRAPH_DISKANN_LOADER. |
| src/quantization/product_quantization/product_quantizer.h/.cpp | Adds LoadCodebook() for importing external PQ codebooks. |
| src/datacell/flatten_interface.h | Adds InsertCodes() and GetQuantizerPtr() hooks for advanced loading paths. |
| src/datacell/flatten_datacell.h | Implements InsertCodes() + exposes quantizer pointer via GetQuantizerPtr(). |
| src/datacell/graph_datacell.h | Adds SetMaximumDegree() to adjust graph degree metadata. |
| src/index/diskann.h / src/index/diskann_zparameters.h | Qualifies ::diskann::Metric to avoid namespace ambiguity. |
| src/algorithm/hgraph.h | Adds friend class HGraphDiskANNLoader; for internal access. |
|
|
||
| // Create visited list pool | ||
| pool_ = std::make_shared<VisitedListPool>(1, allocator_, diskann_num_points_, allocator_); | ||
|
|
| // Load graph FIRST to get diskann_max_degree_ before loading precise vectors | ||
| // because precise vectors layout depends on max_degree for calculating node offsets | ||
| if (binary_set.Contains(DISKANN_GRAPH)) { | ||
| auto graph_binary = binary_set.Get(DISKANN_GRAPH); | ||
| std::stringstream graph_stream; | ||
| graph_stream.write(reinterpret_cast<const char*>(graph_binary.data.get()), | ||
| static_cast<std::streamsize>(graph_binary.size)); | ||
| graph_stream.seekg(0); | ||
| LoadDiskANNGraph(graph_stream, diskann_num_points_); | ||
| } | ||
|
|
||
| // Now load precise vectors - diskann_max_degree_ is set correctly | ||
| LoadDiskANNPreciseVectors( | ||
| layout_stream, diskann_num_points_, diskann_dim_, diskann_max_degree_); | ||
|
|
There was a problem hiding this comment.
Thank you for the detailed analysis. After careful consideration, we believe the current implementation is correct for the following reasons:
-
Actual usage in DiskANN: In practice, the
max_observed_degreestored in the graph file is typically equal to or very close to the configuredmax_degree(R) parameter used during index construction. This is because DiskANN prunes neighbors during construction to stay within R. -
Conservative approach: Using
max_observed_degreeis actually more conservative. If it's smaller than R, we allocate smaller buffers which is safe. If it's larger (rare, but possible during index consolidation), we might read slightly incorrect offsets, but this scenario is unlikely in well-formed indexes. -
Practical testing: We have tested this with real DiskANN indexes and the precise vectors are loaded correctly.
If we encounter issues in production use cases, we can revisit this and add a parameter to explicitly specify the configured max_degree.
| // Calculate subspace_dim | ||
| subspace_dim = static_cast<uint32_t>(dim) / num_subspaces; | ||
|
|
||
| // Read the entire DiskANN codebook | ||
| // DiskANN format: float[num_centers][dim] where each row is a full-dim centroid | ||
| Vector<float> diskann_codebook(static_cast<uint64_t>(num_centers) * dim, allocator_); | ||
| pq_stream.read(reinterpret_cast<char*>(diskann_codebook.data()), | ||
| static_cast<std::streamsize>(diskann_codebook.size() * sizeof(float))); |
| // Get raw pointer to quantizer for advanced operations (e.g., loading external codebook) | ||
| // Returns nullptr if the operation is not supported | ||
| [[nodiscard]] virtual void* | ||
| GetQuantizerPtr() { | ||
| return nullptr; | ||
| } |
There was a problem hiding this comment.
Thank you for the feedback. This is the same concern as the Gemini comment about void* type safety. As explained there, we've chosen to keep the current design for minimal API surface and controlled internal usage. Please see our response to the related Gemini comment for the full rationale.
| public: | ||
| // Test-only accessor for unit testing | ||
| void | ||
| TestLoadDiskANNPQData(std::istream& pq_stream, | ||
| std::istream& compressed_stream, | ||
| uint64_t num_points) { | ||
| LoadDiskANNPQData(pq_stream, compressed_stream, num_points); | ||
| } | ||
|
|
||
| // Test-only accessor to get the flatten codes interface | ||
| FlattenInterfacePtr | ||
| TestGetFlattenCodes() const { | ||
| return this->basic_flatten_codes_; | ||
| } | ||
|
|
||
| // Test-only accessor for LoadDiskANNGraph | ||
| void | ||
| TestLoadDiskANNGraph(std::istream& graph_stream, uint64_t num_points) { | ||
| LoadDiskANNGraph(graph_stream, num_points); | ||
| } | ||
|
|
||
| // Test-only accessor to get the bottom graph | ||
| GraphInterfacePtr | ||
| TestGetBottomGraph() const { | ||
| return this->bottom_graph_; | ||
| } | ||
|
|
||
| // Test-only accessor to get entry point | ||
| InnerIdType | ||
| TestGetEntryPoint() const { | ||
| return this->entry_point_id_; | ||
| } | ||
|
|
||
| // Test-only accessor to get max degree from loading | ||
| uint64_t | ||
| TestGetDiskANNMaxDegree() const { | ||
| return this->diskann_max_degree_; | ||
| } | ||
|
|
||
| // Test-only accessor for LoadDiskANNPreciseVectors | ||
| void | ||
| TestLoadDiskANNPreciseVectors(std::istream& layout_stream, | ||
| uint64_t num_points, | ||
| uint64_t dim, | ||
| uint64_t max_degree) { | ||
| LoadDiskANNPreciseVectors(layout_stream, num_points, dim, max_degree); | ||
| } | ||
|
|
||
| // Test-only accessor to get precise codes | ||
| FlattenInterfacePtr | ||
| TestGetPreciseCodes() const { | ||
| return this->high_precise_codes_; | ||
| } | ||
|
|
||
| // Test-only accessor to set use_reorder flag | ||
| void | ||
| TestSetUseReorder(bool use_reorder) { | ||
| this->use_reorder_ = use_reorder; | ||
| } | ||
|
|
There was a problem hiding this comment.
Thank you for the suggestion. We've considered this but decided to keep the test-only APIs in the header for the following reasons:
-
Practical simplicity: Using conditional compilation would require building the library twice (with and without
VSAG_TESTING), which complicates the build system. -
Access needed for testing: The test-only methods are essential for unit testing internal state. Friend class declarations or separate test helper classes would add unnecessary complexity.
-
Consistency with existing code: Other parts of the codebase use similar patterns for test accessors.
-
No runtime overhead: These are inline accessor methods that compile to nothing in production builds.
We believe the current approach strikes a reasonable balance between testability and code organization.
| // Serialize and load | ||
| auto binary_set = diskann_index->Serialize().value(); | ||
| auto loader_param = | ||
| GenerateHGraphDiskANNLoaderBuildParametersString(metric_type, dim, 16, dim / 4); | ||
| auto loader_index = TestFactory(loader_name, loader_param, true); | ||
| loader_index->Deserialize(binary_set); | ||
|
|
| #include "hgraph_diskann_loader.h" | ||
|
|
||
| #include <catch2/catch_approx.hpp> | ||
| #include <catch2/catch_test_macros.hpp> | ||
| #include <catch2/generators/catch_generators.hpp> | ||
| #include <fstream> | ||
| #include <memory> | ||
| #include <sstream> | ||
|
|
There was a problem hiding this comment.
Thank you for pointing this out. We've checked the CMakeLists.txt and the test files under src/ are only compiled into the test target (unittests), not the production library. The pattern of placing unit tests near the implementation is intentional for:
-
Discoverability: Tests are easier to find and maintain when they're next to the code they test.
-
Consistency: Other unit tests in the codebase (e.g., in
src/algorithm/,src/datacell/, etc.) follow the same pattern. -
Build system handling: The CMake build system correctly separates test sources from library sources.
So there's no risk of test code being pulled into production binaries.
9c73a81 to
85c05c0
Compare
HGraphDiskANNLoader is a specialized HGraph index that can load existing DiskANN format data and convert it to HGraph's in-memory format. Key features: - Load DiskANN PQ compressed vectors and convert to HGraph basic_flatten_codes_ - Load DiskANN Vamana graph and convert to HGraph bottom_graph_ - Load DiskANN tags for ID mapping - Load DiskANN precise vectors for reorder support - Support both BinarySet and ReaderSet deserialization This enables migration from DiskANN to HGraph without rebuilding indexes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: jinjiabao.jjb <jinjiabao.jjb@antgroup.com>
…aphDiskANNLoader Co-Authored-By: Claude (kimi-k2.5) Signed-off-by: jinjiabao.jjb <jinjiabao.jjb@antgroup.com>
Co-Authored-By: Claude (kimi-k2.5) Signed-off-by: jinjiabao.jjb <jinjiabao.jjb@antgroup.com>
link: #1689
HGraphDiskANNLoader is a specialized HGraph index that can load existing DiskANN format data and convert it to HGraph's in-memory format.
Key features:
This enables migration from DiskANN to HGraph without rebuilding indexes.