Feature: Add WARP Index for Multi-Vector Document Retrieval #1601

Description

@inabao

Background

In modern information retrieval, ColBERT (Contextualized Late Interaction over BERT) has emerged as an effective approach for document retrieval. Unlike traditional single-vector representations, ColBERT represents each document and query as multiple vectors (one per token), enabling fine-grained matching through "late interaction".

The key similarity metric for ColBERT-style retrieval is MaxSim (Maximum Similarity), which computes the sum of maximum similarities between each query vector and all document vectors.

Feature Description

Add a new index type called WARP (Weighted Aggregation of Representative Points) that supports:

  1. Multi-Vector Document Storage: Each document can contain multiple vectors instead of a single aggregated vector
  2. MaxSim Similarity Search: Efficient computation of ColBERT-style MaxSim similarity
  3. Standard API Compatibility: Works with existing VSAG index interfaces

Technical Details

Similarity Metric

For a query with m vectors q_1, q_2, ..., q_m and a document with n vectors d_1, d_2, ..., d_n:

MaxSim(q, d) = sum_i(max_j(similarity(q_i, d_j)))

Where similarity can be inner product (IP), cosine similarity, or L2 distance.
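
As a minimal sketch of the formula above, assuming inner product as the similarity and row-major float buffers, MaxSim can be computed as follows. The names `InnerProduct` and `MaxSim` are illustrative only and are not part of the VSAG API; returning 0 for an empty query or document is a convention chosen for this sketch.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

// Inner product of two dim-dimensional vectors.
float InnerProduct(const float* a, const float* b, std::size_t dim) {
    float sum = 0.0f;
    for (std::size_t k = 0; k < dim; ++k) {
        sum += a[k] * b[k];
    }
    return sum;
}

// MaxSim(q, d) = sum_i max_j IP(q_i, d_j).
// query holds m * dim floats and doc holds n * dim floats, both row-major.
// Hypothetical helper: an empty query or document scores 0 in this sketch.
float MaxSim(const std::vector<float>& query,
             const std::vector<float>& doc,
             std::size_t dim) {
    const std::size_t m = query.size() / dim;
    const std::size_t n = doc.size() / dim;
    if (m == 0 || n == 0) {
        return 0.0f;
    }
    float total = 0.0f;
    for (std::size_t i = 0; i < m; ++i) {
        // Best match for query vector i among all document vectors.
        float best = std::numeric_limits<float>::lowest();
        for (std::size_t j = 0; j < n; ++j) {
            best = std::max(best, InnerProduct(&query[i * dim], &doc[j * dim], dim));
        }
        total += best;
    }
    return total;
}
```

With a single query vector (m = 1) the sum has one term, so the score reduces to the best inner product against any document vector.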

New Dataset API

Add a VectorCounts field to the Dataset class to specify the number of vectors per document:

auto dataset = vsag::Dataset::Make();
dataset->NumElements(num_docs)
       ->Dim(dim)
       ->Ids(ids.data())
       ->Float32Vectors(datas.data())
       ->VectorCounts(vector_counts.data())  // Number of vectors per document
       ->Owner(false);
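
One way to read this layout, sketched under assumptions: the Float32Vectors buffer holds all vectors concatenated document by document, and vector_counts[i] gives the number of vectors belonging to document i. Under that assumption, a prefix sum over the counts locates each document's vectors. The element type `uint32_t` and the helpers `VectorOffsets` and `DocVector` are hypothetical and not taken from VSAG.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// offsets[i] is the index of document i's first vector in the flat buffer;
// offsets.back() is the total number of vectors across all documents.
// Assumption: counts are stored as uint32_t (not confirmed by the issue).
std::vector<std::size_t> VectorOffsets(const std::vector<uint32_t>& vector_counts) {
    std::vector<std::size_t> offsets(vector_counts.size() + 1, 0);
    for (std::size_t i = 0; i < vector_counts.size(); ++i) {
        offsets[i + 1] = offsets[i] + vector_counts[i];
    }
    return offsets;
}

// Pointer to the j-th vector of document `doc`, assuming vectors are
// concatenated document by document with dim floats each.
const float* DocVector(const std::vector<float>& datas,
                       const std::vector<std::size_t>& offsets,
                       std::size_t dim,
                       std::size_t doc,
                       std::size_t j) {
    return datas.data() + (offsets[doc] + j) * dim;
}
```

For counts {2, 1, 3}, the offsets are {0, 2, 3, 6}, so the buffer is expected to hold 6 * dim floats in total.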

Index Parameters

{
    "dtype": "float32",
    "metric_type": "ip",
    "dim": 128
}

Example Usage

// Create WARP index
auto index = vsag::Factory::CreateIndex("warp", build_parameters).value();

// Build with multi-vector documents
index->Build(multi_vector_dataset);

// Search with multi-vector query
auto result = index->KnnSearch(multi_vector_query, topk, search_parameters);

Implementation Components

| Component | File | Description |
|---|---|---|
| Index Implementation | src/algorithm/warp.cpp | Core WARP algorithm implementation |
| Index Header | src/algorithm/warp.h | WARP class definition |
| Parameter | src/algorithm/warp_parameter.cpp | Index parameter handling |
| Example | examples/cpp/110_index_warp.cpp | Usage example |
| Tests | tests/test_warp.cpp | Unit tests |

Features Supported

  • Build multi-vector document index
  • KnnSearch with multi-vector queries
  • KnnSearch with single-vector queries (backward compatible)
  • RangeSearch support
  • Serialization/Deserialization
  • Memory usage estimation
  • Add documents incrementally

Use Cases

  1. ColBERT-style Document Retrieval: Late interaction model for precise document matching
  2. Multi-Modal Document Search: Documents with multiple representations (e.g., text + image embeddings)
  3. Token-level Semantic Matching: Fine-grained semantic search at token level

Performance Considerations

  • Storage: Stores all document vectors without compression (FlattenInterface)
  • Search Complexity: O(total_vectors) similarity computations per query vector for brute-force MaxSim computation
  • Suitable for: Small to medium-scale document collections with multi-vector representations
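
To make the brute-force cost concrete, here is a rough sketch of an exhaustive top-k scan, assuming inner-product similarity and a flat buffer with per-document counts. `Hit` and `BruteForceTopK` are hypothetical names for illustration, not the actual WARP implementation. Every stored vector is compared against every query vector, so one search performs m * total_vectors inner products.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

struct Hit {
    std::size_t doc;  // document position in the dataset
    float score;      // MaxSim score, higher is better
};

// Scores every document against the query and keeps the topk best.
// datas: all document vectors concatenated document by document (assumed layout).
// vector_counts[i]: number of vectors in document i.
std::vector<Hit> BruteForceTopK(const std::vector<float>& datas,
                                const std::vector<uint32_t>& vector_counts,
                                const std::vector<float>& query,
                                std::size_t dim,
                                std::size_t topk) {
    const std::size_t m = query.size() / dim;
    std::vector<Hit> hits;
    hits.reserve(vector_counts.size());
    std::size_t first = 0;  // index of the current document's first vector
    for (std::size_t doc = 0; doc < vector_counts.size(); ++doc) {
        const std::size_t n = vector_counts[doc];
        float score = 0.0f;  // empty documents keep a score of 0 in this sketch
        for (std::size_t i = 0; n > 0 && i < m; ++i) {
            const float* q = query.data() + i * dim;
            float best = std::numeric_limits<float>::lowest();
            for (std::size_t j = 0; j < n; ++j) {
                const float* d = datas.data() + (first + j) * dim;
                float ip = 0.0f;
                for (std::size_t k = 0; k < dim; ++k) {
                    ip += q[k] * d[k];
                }
                best = std::max(best, ip);
            }
            score += best;
        }
        hits.push_back({doc, score});
        first += n;
    }
    // Keep only the topk highest-scoring documents, best first.
    const std::size_t keep = std::min(topk, hits.size());
    std::partial_sort(hits.begin(), hits.begin() + keep, hits.end(),
                      [](const Hit& a, const Hit& b) { return a.score > b.score; });
    hits.resize(keep);
    return hits;
}
```

Because the scan touches every stored vector for each query vector, latency grows linearly with collection size, which is consistent with the small-to-medium scale recommendation.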

Acceptance Criteria

  • Index can store documents with a variable number of vectors
  • Search returns correct top-k documents based on MaxSim similarity
  • Single-vector query works as fallback
  • All unit tests pass
  • Example code compiles and runs correctly

Related Work

Labels

kind/feature (brand-new functionality or capabilities), version/1.0
