### Is your feature request related to a problem?
Multi-vector retrieval models like ColBERT and ColPali produce per-token embeddings that require MaxSim scoring across all token pairs. This is expensive at scale because there's no way to do ANN prefetch on variable-length multi-vector representations — you're forced to either brute-force score every document or rely on external tooling to pre-encode vectors client-side.
Currently, the k-NN plugin supports `lateInteractionScore` for MaxSim reranking, but the inner query is typically `match_all` or a text filter, so every matching document gets scored. OpenSearch also supports a mean-pooling approach, where the query and document token vectors are averaged into single vectors that can be used for regular ANN search. However, this is lossy: averaging discards much of the token-level information and defeats the purpose of multi-vector late-interaction models. There is no native way to narrow candidates using the multi-vector embeddings themselves without losing information.
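For reference, MaxSim is the late-interaction score that `lateInteractionScore` computes: each query token vector is matched against its best document token vector, and the maxima are summed. A minimal sketch (illustrative only, using inner product as the similarity):

```python
import numpy as np

def max_sim(query_vectors: np.ndarray, doc_vectors: np.ndarray) -> float:
    """MaxSim: for each query token, take the best inner-product match
    among the document's token vectors, then sum over query tokens."""
    # (n_query_tokens, n_doc_tokens) similarity matrix
    sims = query_vectors @ doc_vectors.T
    return float(sims.max(axis=1).sum())

# Toy example: 2 query tokens, 2 doc tokens, dim 2
q = np.array([[1.0, 0.0], [0.0, 1.0]])
d = np.array([[1.0, 0.0], [0.0, 2.0]])
print(max_sim(q, d))  # 3.0
```

Each candidate document costs O(|Q| × |D|) dot products, which is why scoring every match of a `match_all` inner query does not scale.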
### What solution would you like?
Add two new processors implementing the MUVERA algorithm (Multi-Vector Retrieval via Fixed Dimensional Encodings, [paper](https://arxiv.org/abs/2405.19504)):
- `muvera` ingest processor — Converts variable-length multi-vector embeddings into a single fixed-dimensional encoding (FDE) vector using SimHash clustering and random projections. The FDE is stored in a `knn_vector` field for ANN indexing; the original multi-vectors remain in `_source` for reranking.
- `muvera_query` search request processor — Intercepts `script_score` queries containing `query_vectors` in script params, MUVERA-encodes them into an FDE, and replaces the inner `match_all` with a `knn` query on the FDE field. The `lateInteractionScore` script wrapper stays intact for MaxSim reranking on the prefetched candidates.
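To make the encoding concrete, here is a simplified sketch of the FDE construction from the paper: random hyperplanes (SimHash) partition token vectors into 2^k_sim buckets, each bucket's aggregate is randomly projected down to dim_proj values, and this is repeated r_reps times and concatenated. This is an assumption-laden sketch, not the proposed implementation — in particular, real MUVERA fills empty document-side buckets with the nearest vector, which is omitted here:

```python
import numpy as np

def fde_encode(vectors, k_sim=4, dim_proj=8, r_reps=20, seed=42, is_query=True):
    """Simplified MUVERA fixed-dimensional encoding (sketch).
    Omits the empty-bucket fill step that full MUVERA applies to documents."""
    vectors = np.asarray(vectors, dtype=np.float64)
    dim = vectors.shape[1]
    rng = np.random.default_rng(seed)
    blocks = []
    for _ in range(r_reps):
        planes = rng.standard_normal((k_sim, dim))            # SimHash hyperplanes
        proj = rng.standard_normal((dim, dim_proj)) / np.sqrt(dim_proj)
        bits = (vectors @ planes.T > 0).astype(np.int64)      # (n, k_sim) sign bits
        buckets = bits @ (1 << np.arange(k_sim))              # bucket id per token
        block = np.zeros((2 ** k_sim, dim_proj))
        for b in range(2 ** k_sim):
            members = vectors[buckets == b]
            if len(members):
                # queries sum per bucket, documents average per bucket
                agg = members.sum(0) if is_query else members.mean(0)
                block[b] = agg @ proj
        blocks.append(block.ravel())
    # length = r_reps * 2^k_sim * dim_proj, deterministic for a fixed seed
    return np.concatenate(blocks)
```

With the defaults this yields a 2560-dimensional vector, and the inner product of a query FDE with a document FDE approximates the MaxSim score, which is what makes the ANN prefetch meaningful.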
### User flow
**Step 1: Create ingest pipeline**

```json
PUT _ingest/pipeline/muvera-ingest
{
  "description": "MUVERA FDE encoding for ColBERT vectors",
  "processors": [
    {
      "muvera": {
        "source_field": "colbert_vectors",
        "target_field": "muvera_fde",
        "dim": 128,
        "fde_dimension": 2560
      }
    }
  ]
}
```
Defaults: `k_sim=4`, `dim_proj=8`, `r_reps=20`, `seed=42`. FDE dimension = `r_reps * 2^k_sim * dim_proj` = 2560. The `fde_dimension` parameter validates the computed value so the user explicitly acknowledges the output size.
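The default output size follows directly from the hyperparameters:

```python
# Default MUVERA hyperparameters from this proposal
k_sim, dim_proj, r_reps = 4, 8, 20

# Each of the r_reps repetitions contributes 2^k_sim buckets
# of dim_proj projected values, concatenated into one FDE vector.
fde_dimension = r_reps * (2 ** k_sim) * dim_proj
print(fde_dimension)  # 2560
```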
**Step 2: Create index**

```json
PUT muvera-index
{
  "settings": {
    "index.knn": true,
    "default_pipeline": "muvera-ingest"
  },
  "mappings": {
    "dynamic": false,
    "properties": {
      "muvera_fde": {
        "type": "knn_vector",
        "dimension": 2560,
        "method": {
          "name": "hnsw",
          "space_type": "innerproduct",
          "engine": "faiss"
        }
      },
      "title": { "type": "text" }
    }
  }
}
```
Note: `colbert_vectors` is intentionally left unmapped; it stays in `_source` for reranking but doesn't need its own field mapping.
**Step 3: Index documents**

```json
POST muvera-index/_doc/1
{
  "title": "example document",
  "colbert_vectors": [
    [0.1, 0.2, ...],
    [0.3, 0.4, ...],
    [0.5, 0.6, ...]
  ]
}
```
The ingest processor reads `colbert_vectors`, produces the FDE, and writes it to `muvera_fde`. Both fields end up in the stored document.
**Step 4: Create search pipeline**

```json
PUT _search/pipeline/muvera-search
{
  "request_processors": [
    {
      "muvera_query": {
        "target_field": "muvera_fde",
        "dim": 128,
        "fde_dimension": 2560,
        "oversample_factor": 4
      }
    }
  ]
}
```
Same MUVERA hyperparameters as ingest (they must match). `oversample_factor` controls how many candidates the `knn` prefetch retrieves relative to the requested result size.
**Step 5: Search**

```json
POST muvera-index/_search?search_pipeline=muvera-search
{
  "size": 10,
  "query": {
    "script_score": {
      "query": { "match_all": {} },
      "script": {
        "source": "lateInteractionScore(params.query_vectors, 'colbert_vectors', params._source, params.space_type)",
        "params": {
          "query_vectors": [[0.1, 0.2, ...], [0.3, 0.4, ...]],
          "space_type": "innerproduct"
        }
      }
    }
  }
}
```
What happens:
- The search processor extracts `query_vectors` from script params
- MUVERA-encodes them into a query FDE
- Replaces `match_all` with `knn` on `muvera_fde` (k = size × oversample_factor = 40)
- `lateInteractionScore` reranks the 40 candidates using exact MaxSim on the original multi-vectors
- Top 10 are returned to the user
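Conceptually, the rewritten query that reaches the shards would look roughly like this (illustrative only — the exact rewritten shape is up to the implementation, and the FDE values are placeholders):

```json
{
  "size": 10,
  "query": {
    "script_score": {
      "query": {
        "knn": {
          "muvera_fde": {
            "vector": [0.07, -0.12, ...],
            "k": 40
          }
        }
      },
      "script": {
        "source": "lateInteractionScore(params.query_vectors, 'colbert_vectors', params._source, params.space_type)",
        "params": {
          "query_vectors": [[0.1, 0.2, ...], [0.3, 0.4, ...]],
          "space_type": "innerproduct"
        }
      }
    }
  }
}
```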
### What alternatives have you considered?
- Client-side MUVERA encoding (works but requires users to maintain encoding logic outside OpenSearch)
- Binary quantization of multi-vectors (lossy, doesn't preserve MaxSim structure)
- Text-based prefetch with BM25 (misses semantic signal from embeddings)
### Do you have any additional context?
- MUVERA is already implemented in [fastembed](https://github.com/qdrant/fastembed) (Python) and used in production with Qdrant
- We have a working implementation with unit tests, stable across multiple iterations with random seeds
- The implementation uses only public APIs — no reflection or core OpenSearch modifications required
- Tested end-to-end on a live cluster: ingest pipeline creates FDE vectors, search pipeline rewrites queries, `lateInteractionScore` reranking produces correct MaxSim scores