Skip to content

Tracking issue: ANN (Approximate Nearest Neighbors) Index #25

@asg017

Description

@asg017

sqlite-vec as of v0.1.0 will be brute-force search only, which slows down on large datasets (>1M w/ large dimensions). I want to include some form of approximate nearest neighbors search before v1, which trades accuracy/resource usage for speed.

This issue is a general "tracking issue" for how ANN will be implemented in sqlite-vec. The open questions I have:

Which ANN index should we use?

We want something that fits well with SQLite - meaning storing data in shadow tables, data that fits in pages, low memory usage, etc.

The main options I see:

  • IVF: Like Faiss IndexIVF, pre-compute centroids of a training dataset, search nearest N centroids, etc.
  • HNSW: Should "scale" more, way more params to tune. Would be pretty complicated to implement
  • DiskANN: Kinda dig the simplicitly of this, especially LM-DiskANN

Unsure which one will turn out best, will need to reseach more. It's possible we add support for all these options.

How should one "declare" an index?

SQLite doesn't have custom indexes, so I think the best way would be to include index info in the CREATE VIRTUAL TABLE constructor. Like:

create virtual table vec_movies(
  synopsis_embeddings float[768] INDEXED BY diskann(...)
);

or:

create virtual table vec_movies(
  synopsis_embeddings float[768] index=hnsw(...)
);

syntax heavily depends what ANN index we pick. Also how would training work?

How would they work with metadata filtering?

How do we allow bruteforce + ANN on the same table?

How do we pick between KNN/ANN in a SQL query?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions