Skip to content

Preserve original vectors for Faiss + cosine similarity with derived source #3294

Open
krocky-cooky wants to merge 4 commits into
opensearch-project:mainfrom
krocky-cooky:feat/normalize-at-faiss-build-3083
Open

Preserve original vectors for Faiss + cosine similarity with derived source #3294
krocky-cooky wants to merge 4 commits into
opensearch-project:mainfrom
krocky-cooky:feat/normalize-at-faiss-build-3083

Conversation

@krocky-cooky
Copy link
Copy Markdown

Description

Fixes the behaviour where derived source returns L2-normalized vectors instead of the user-indexed vectors when the field uses Faiss engine with cosine similarity.

Faiss does not natively support cosine similarity, so the plugin converts cosine to inner product and feeds unit vectors to the index. Previously the normalization was applied in-place inside KNNVectorFieldMapper#parseCreateField before the vectors were written to BinaryDocValues. As a result, derived source reconstruction — which reads vectors back from BinaryDocValues — returned the normalized vectors and the original data was lost.

This PR defers normalization to the native index build path. BinaryDocValues keeps the original (unnormalized) vectors, and normalization is applied only to the stream of vectors handed to the Faiss builder

Related Issues

Resolves #3083

Implementation

  1. KNNVectorFieldMapper#parseCreateField — remove the in-place VectorTransformer#transform(array, true) call for the FLOAT branch. Vectors are now written to the Document unchanged.

  2. NormalizingKNNFloatVectorValues — a new decorator around KNNFloatVectorValues that returns an L2-normalized copy of each vector. getVector() allocates a fresh float[] per call and applies VectorUtil.l2normalize, which is SIMD-optimized via the Panama Vector API.

  3. KNN80DocValuesConsumer#addKNNBinaryField — when the field is Faiss + cosine + float, wrap the KNNVectorValues with NormalizingKNNFloatVectorValues before handing it to NativeIndexWriter. Space type is resolved from the field attribute first, falling back to ModelUtil#getModelMetadata (reads cluster state, no network I/O) for model-based fields.

With the original vectors retained in BinaryDocValues, derived source reconstruction, translog source handling, merge, ExactSearcher, and quantization all continue to read the original vectors without any further changes.

Performance

The added work per document at flush/merge time is a single float[] copy plus VectorUtil.l2normalize, both O(d). This is the same amount of work that previously ran once per document at indexing time — the computation is moved rather than duplicated. Compared to HNSW graph construction cost (O(d · M · efConstruction) distance computations per doc), the normalization is several orders of magnitude cheaper and does not dominate refresh/merge. No JNI changes.

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Faiss does not natively support cosine similarity. Previously the
plugin L2-normalized vectors in-place in KNNVectorFieldMapper#parseCreateField
before they were written to BinaryDocValues. As a result, derived source
returned normalized vectors instead of the user-indexed ones (issue opensearch-project#3083).

Defer the normalization to the native index build path so that doc values
keep the original vectors. A new NormalizingKNNFloatVectorValues wraps the
KNNFloatVectorValues fed to the Faiss builder and returns L2-normalized
copies on demand. Derived source reconstruction reads BinaryDocValues
directly and therefore returns the original vectors without any additional
denormalization step.

Changes:
- KNNVectorFieldMapper: drop in-place L2 normalization in parseCreateField
- KNN80DocValuesConsumer: wrap KNNVectorValues with NormalizingKNNFloatVectorValues
  when the field is Faiss + cosine + float
- NormalizingKNNFloatVectorValues: new decorator that returns normalized copies
- DerivedSourceIT: add integration tests covering original vector restoration,
  zero vector handling, and multi-segment merge behaviour

Related: opensearch-project#3083
Signed-off-by: Takuo Kuroki <kurotaku9679.sub@gmail.com>
Consolidates the two previously separate checks that decided whether a
vector field requires L2 normalization into a single entry point on
VectorTransformerFactory. Both the query-time path (KNNVectorFieldType.
transformQueryVector) and the native-index build path (KNN80DocValuesConsumer.
addKNNBinaryField) now go through the same factory/transformer pair.

Changes:
- VectorTransformer: add wrap(KNNFloatVectorValues) and wrap(KNNByteVectorValues)
  default no-op methods so callers can uniformly apply transformations to a
  stream of vector values.
- NormalizeVectorTransformer: override wrap(KNNFloatVectorValues) to return a
  NormalizingKNNFloatVectorValues.
- VectorTransformerFactory: add getVectorTransformer(FieldInfo) overload that
  resolves SpaceType and MethodComponentContext directly from field attributes
  (SPACE_TYPE / MODEL_ID / PARAMETERS) and falls back to ModelUtil cluster
  state lookup for model-based fields. MethodComponentContext.parse is reused
  by narrowing the JSON map to name/parameters first, which allows the Lucene
  flat and Lucene SQ 1-bit conditions to be evaluated the same way as at
  query time.
- KNN80DocValuesConsumer: replace the field-local needsNormalizationForFaissBuild
  helper and ModelUtil plumbing with a single call to
  VectorTransformerFactory.getVectorTransformer(field).wrap(...). Unused imports
  (StringUtils, SpaceType, KNNConstants, ModelMetadata, ModelUtil,
  NormalizingKNNFloatVectorValues) are removed.

Tests:
- NormalizeVectorTransformerTests: cover wrap(KNNFloatVectorValues).
- VectorTransformerFactoryTests: cover getVectorTransformer(FieldInfo) for
  Faiss cosine, Faiss L2, missing attributes, Lucene cosine + flat, Lucene
  cosine + SQ 1-bit, and Lucene cosine + HNSW without encoder.

Related: opensearch-project#3083
Signed-off-by: Takuo Kuroki <kurotaku9679.sub@gmail.com>
…(FieldInfo)

Parsing the PARAMETERS attribute JSON on every vector flush/merge adds
a cost that is only worth paying when a caller actually needs the Lucene
flat / SQ 1-bit conditions to be evaluated. The build path in
KNN80DocValuesConsumer only uses engine + space type (Faiss + cosine),
so it can safely skip the parse.

Add a boolean resolveMethodContext flag to
VectorTransformerFactory.getVectorTransformer(FieldInfo, boolean):

- false: reuse just engine + space type (cheap, covers the Faiss + cosine
  case that drives the codec-layer wrap).
- true: additionally reconstruct MethodComponentContext from the PARAMETERS
  attribute, enabling the Lucene-specific conditions. Tests exercise this
  path for Lucene + cosine + flat and Lucene + cosine + SQ 1-bit.

Callers updated:
- KNN80DocValuesConsumer: passes false.
- VectorTransformerFactoryTests: passes true to keep coverage of the
  Lucene flat / SQ 1-bit paths.

Signed-off-by: Takuo Kuroki <kurotaku9679.sub@gmail.com>
BinaryDocValues now hold the original, unnormalized vectors bit-for-bit,
so the retrieved _source vectors should match the indexed vectors
exactly. The previous 1e-4 tolerance was needed when norm*denorm
roundtripping introduced rounding; it no longer applies.

Replace the 1e-4f delta with 0.0f in the three cosine-derived-source
tests and update the surrounding comment to reflect the new contract.

Signed-off-by: Takuo Kuroki <kurotaku9679.sub@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEATURE] Enable cosine similarity to return original vectors rather than normalized vectors with Faiss Engine

1 participant