Preserve original vectors for Faiss + cosine similarity with derived source #3294
Open
krocky-cooky wants to merge 4 commits into
Open
Preserve original vectors for Faiss + cosine similarity with derived source #3294krocky-cooky wants to merge 4 commits into
krocky-cooky wants to merge 4 commits into
Conversation
Faiss does not natively support cosine similarity. Previously the plugin L2-normalized vectors in-place in KNNVectorFieldMapper#parseCreateField before they were written to BinaryDocValues. As a result, derived source returned normalized vectors instead of the user-indexed ones (issue opensearch-project#3083). Defer the normalization to the native index build path so that doc values keep the original vectors. A new NormalizingKNNFloatVectorValues wraps the KNNFloatVectorValues fed to the Faiss builder and returns L2-normalized copies on demand. Derived source reconstruction reads BinaryDocValues directly and therefore returns the original vectors without any additional denormalization step. Changes: - KNNVectorFieldMapper: drop in-place L2 normalization in parseCreateField - KNN80DocValuesConsumer: wrap KNNVectorValues with NormalizingKNNFloatVectorValues when the field is Faiss + cosine + float - NormalizingKNNFloatVectorValues: new decorator that returns normalized copies - DerivedSourceIT: add integration tests covering original vector restoration, zero vector handling, and multi-segment merge behaviour Related: opensearch-project#3083 Signed-off-by: Takuo Kuroki <kurotaku9679.sub@gmail.com>
Consolidates the two previously separate checks that decided whether a vector field requires L2 normalization into a single entry point on VectorTransformerFactory. Both the query-time path (KNNVectorFieldType. transformQueryVector) and the native-index build path (KNN80DocValuesConsumer. addKNNBinaryField) now go through the same factory/transformer pair. Changes: - VectorTransformer: add wrap(KNNFloatVectorValues) and wrap(KNNByteVectorValues) default no-op methods so callers can uniformly apply transformations to a stream of vector values. - NormalizeVectorTransformer: override wrap(KNNFloatVectorValues) to return a NormalizingKNNFloatVectorValues. - VectorTransformerFactory: add getVectorTransformer(FieldInfo) overload that resolves SpaceType and MethodComponentContext directly from field attributes (SPACE_TYPE / MODEL_ID / PARAMETERS) and falls back to ModelUtil cluster state lookup for model-based fields. MethodComponentContext.parse is reused by narrowing the JSON map to name/parameters first, which allows the Lucene flat and Lucene SQ 1-bit conditions to be evaluated the same way as at query time. - KNN80DocValuesConsumer: replace the field-local needsNormalizationForFaissBuild helper and ModelUtil plumbing with a single call to VectorTransformerFactory.getVectorTransformer(field).wrap(...). Unused imports (StringUtils, SpaceType, KNNConstants, ModelMetadata, ModelUtil, NormalizingKNNFloatVectorValues) are removed. Tests: - NormalizeVectorTransformerTests: cover wrap(KNNFloatVectorValues). - VectorTransformerFactoryTests: cover getVectorTransformer(FieldInfo) for Faiss cosine, Faiss L2, missing attributes, Lucene cosine + flat, Lucene cosine + SQ 1-bit, and Lucene cosine + HNSW without encoder. Related: opensearch-project#3083 Signed-off-by: Takuo Kuroki <kurotaku9679.sub@gmail.com>
…(FieldInfo) Parsing the PARAMETERS attribute JSON on every vector flush/merge adds a cost that is only worth paying when a caller actually needs the Lucene flat / SQ 1-bit conditions to be evaluated. The build path in KNN80DocValuesConsumer only uses engine + space type (Faiss + cosine), so it can safely skip the parse. Add a boolean resolveMethodContext flag to VectorTransformerFactory.getVectorTransformer(FieldInfo, boolean): - false: reuse just engine + space type (cheap, covers the Faiss + cosine case that drives the codec-layer wrap). - true: additionally reconstruct MethodComponentContext from the PARAMETERS attribute, enabling the Lucene-specific conditions. Tests exercise this path for Lucene + cosine + flat and Lucene + cosine + SQ 1-bit. Callers updated: - KNN80DocValuesConsumer: passes false. - VectorTransformerFactoryTests: passes true to keep coverage of the Lucene flat / SQ 1-bit paths. Signed-off-by: Takuo Kuroki <kurotaku9679.sub@gmail.com>
BinaryDocValues now hold the original, unnormalized vectors bit-for-bit, so the retrieved _source vectors should match the indexed vectors exactly. The previous 1e-4 tolerance was needed when norm*denorm roundtripping introduced rounding; it no longer applies. Replace the 1e-4f delta with 0.0f in the three cosine-derived-source tests and update the surrounding comment to reflect the new contract. Signed-off-by: Takuo Kuroki <kurotaku9679.sub@gmail.com>
5 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
Fixes the behaviour where derived source returns L2-normalized vectors instead of the user-indexed vectors when the field uses Faiss engine with cosine similarity.
Faiss does not natively support cosine similarity, so the plugin converts cosine to inner product and feeds unit vectors to the index. Previously the normalization was applied in-place inside
KNNVectorFieldMapper#parseCreateFieldbefore the vectors were written toBinaryDocValues. As a result, derived source reconstruction — which reads vectors back fromBinaryDocValues— returned the normalized vectors and the original data was lost.This PR defers normalization to the native index build path.
BinaryDocValueskeeps the original (unnormalized) vectors, and normalization is applied only to the stream of vectors handed to the Faiss builderRelated Issues
Resolves #3083
Implementation
KNNVectorFieldMapper#parseCreateField— remove the in-placeVectorTransformer#transform(array, true)call for theFLOATbranch. Vectors are now written to the Document unchanged.NormalizingKNNFloatVectorValues— a new decorator aroundKNNFloatVectorValuesthat returns an L2-normalized copy of each vector.getVector()allocates a freshfloat[]per call and appliesVectorUtil.l2normalize, which is SIMD-optimized via the Panama Vector API.KNN80DocValuesConsumer#addKNNBinaryField— when the field is Faiss + cosine + float, wrap theKNNVectorValueswithNormalizingKNNFloatVectorValuesbefore handing it toNativeIndexWriter. Space type is resolved from the field attribute first, falling back toModelUtil#getModelMetadata(reads cluster state, no network I/O) for model-based fields.With the original vectors retained in
BinaryDocValues, derived source reconstruction, translog source handling, merge, ExactSearcher, and quantization all continue to read the original vectors without any further changes.Performance
The added work per document at flush/merge time is a single
float[]copy plusVectorUtil.l2normalize, bothO(d). This is the same amount of work that previously ran once per document at indexing time — the computation is moved rather than duplicated. Compared to HNSW graph construction cost (O(d · M · efConstruction)distance computations per doc), the normalization is several orders of magnitude cheaper and does not dominate refresh/merge. No JNI changes.Check List
--signoff.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.