Description
As of v0.1.0, sqlite-vec supports brute-force search only, which slows down on large datasets (more than 1M vectors with large dimensions). I want to include some form of approximate nearest neighbors (ANN) search before v1, which trades accuracy and resource usage for speed.
This issue is a general "tracking issue" for how ANN will be implemented in sqlite-vec. The open questions I have:
Which ANN index should we use?
We want something that fits well with SQLite - meaning storing data in shadow tables, data that fits in pages, low memory usage, etc.
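To make the storage constraint concrete, here is a rough sketch of vectors kept as BLOBs in an ordinary table and scanned one row at a time. The table name `vec_movies_vectors`, its columns, and the helper names are made up for illustration and are not sqlite-vec's actual shadow-table layout. It also doubles as a picture of what brute-force KNN does today: every row is read and scored.

```python
import heapq
import sqlite3
import struct

def pack(vec):
    """Serialize a list of floats as little-endian float32 bytes."""
    return struct.pack(f"<{len(vec)}f", *vec)

def unpack(blob):
    """Inverse of pack(): 4 bytes per float32 element."""
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))

def dist2(a, b):
    """Squared L2 distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

db = sqlite3.connect(":memory:")
# Hypothetical shadow table: one row per vector, embedding stored as a BLOB.
db.execute(
    "create table vec_movies_vectors(rowid integer primary key, embedding blob)"
)
db.executemany(
    "insert into vec_movies_vectors(rowid, embedding) values (?, ?)",
    [(1, pack([0.0, 0.0, 0.0, 1.0])), (2, pack([1.0, 0.0, 0.0, 0.0]))],
)

def knn(db, query, k):
    """Brute-force scan: stream rows from SQLite and keep only the k best."""
    rows = db.execute("select rowid, embedding from vec_movies_vectors")
    scored = ((dist2(query, unpack(blob)), rowid) for rowid, blob in rows)
    return heapq.nsmallest(k, scored)

nearest_row = knn(db, [1.0, 0.0, 0.0, 0.1], k=1)[0][1]
bytes_per_768d_vector = len(pack([0.0] * 768))
```

A 768-dimension float32 vector packs to 3072 bytes, which is under SQLite's default 4096-byte page size. Any ANN index would need its own per-node or per-list data to be sized with the same constraint in mind.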
The main options I see:
- IVF: Like Faiss IndexIVF, pre-compute centroids of a training dataset, search nearest N centroids, etc.
- HNSW: Should "scale" more, way more params to tune. Would be pretty complicated to implement
- DiskANN: Kinda dig the simplicity of this, especially LM-DiskANN
Unsure which one will turn out best; will need to research more. It's possible we add support for all these options.
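For reference, the IVF option can be sketched in a few dozen lines of plain Python. This is only an illustration of the general idea (train centroids, assign vectors to inverted lists, probe the nearest lists at query time), not how Faiss or sqlite-vec implements it. Plain k-means stands in for training, and all names and parameters (`train_centroids`, `nprobe`, etc.) are assumptions chosen for the example.

```python
import random

def dist2(a, b):
    """Squared L2 distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(centroids, v, n=1):
    """Indexes of the n centroids closest to v."""
    order = sorted(range(len(centroids)), key=lambda i: dist2(centroids[i], v))
    return order[:n]

def train_centroids(vectors, k, iters=10, seed=0):
    """Plain k-means (Lloyd's algorithm) as a stand-in for real training."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for v in vectors:
            buckets[nearest(centroids, v)[0]].append(v)
        for i, bucket in enumerate(buckets):
            if bucket:  # keep the old centroid if a cluster went empty
                centroids[i] = [sum(col) / len(bucket) for col in zip(*bucket)]
    return centroids

def build_ivf(vectors, centroids):
    """Inverted lists: centroid index -> ids of the vectors assigned to it."""
    lists = {i: [] for i in range(len(centroids))}
    for vid, v in enumerate(vectors):
        lists[nearest(centroids, v)[0]].append(vid)
    return lists

def ivf_search(query, vectors, centroids, lists, k, nprobe):
    """Scan only the nprobe closest inverted lists instead of every vector."""
    probed = nearest(centroids, query, nprobe)
    candidates = [vid for c in probed for vid in lists[c]]
    candidates.sort(key=lambda vid: dist2(vectors[vid], query))
    return candidates[:k]

rng = random.Random(42)
vectors = [[rng.random() for _ in range(8)] for _ in range(500)]
centroids = train_centroids(vectors, k=16)
lists = build_ivf(vectors, centroids)

query = vectors[7]
approx = ivf_search(query, vectors, centroids, lists, k=5, nprobe=4)
# Probing every list degenerates to an exact brute-force search.
exact = ivf_search(query, vectors, centroids, lists, k=5, nprobe=16)
```

The accuracy/speed trade-off lives in `nprobe`: fewer probed lists means fewer distance computations but a higher chance of missing a true neighbor that landed in an unprobed list. In a SQLite setting, the centroids and inverted lists would presumably live in shadow tables.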
How should one "declare" an index?
SQLite doesn't have custom indexes, so I think the best way would be to include index info in the CREATE VIRTUAL TABLE constructor. Like:
create virtual table vec_movies(
  synopsis_embeddings float[768] INDEXED BY diskann(...)
);
or:
create virtual table vec_movies(
  synopsis_embeddings float[768] index=hnsw(...)
);
The syntax heavily depends on which ANN index we pick. Also, how would training work?
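As a rough illustration of what the constructor would have to parse, here is a small sketch that accepts both proposed spellings. Neither syntax is final, the option list inside the parentheses is left as an opaque string because its contents are undecided, and treating a missing index clause as "brute force" is my assumption. `parse_column` is a hypothetical helper, not part of sqlite-vec.

```python
import re

# Matches e.g. "name float[768]", optionally followed by either
# "INDEXED BY diskann(...)" or "index=hnsw(...)".
COLUMN_RE = re.compile(
    r"""^\s*(?P<name>\w+)\s+
        (?P<type>\w+)\[(?P<dims>\d+)\]
        (?:\s+(?:INDEXED\s+BY\s+|index\s*=\s*)
           (?P<index>\w+)\((?P<options>[^)]*)\))?
        \s*$""",
    re.IGNORECASE | re.VERBOSE,
)

def parse_column(decl):
    """Split a vector column definition into its parts."""
    m = COLUMN_RE.match(decl)
    if m is None:
        raise ValueError(f"unrecognized column definition: {decl!r}")
    return {
        "name": m["name"],
        "type": m["type"].lower(),
        "dims": int(m["dims"]),
        # Assumption: no index clause means plain brute-force search.
        "index": m["index"].lower() if m["index"] else None,
        # Kept as raw text; the real option grammar is undecided.
        "options": m["options"],
    }

a = parse_column("synopsis_embeddings float[768] INDEXED BY diskann(...)")
b = parse_column("synopsis_embeddings float[768] index=hnsw(...)")
c = parse_column("synopsis_embeddings float[768]")
```

Both forms reduce to the same parsed structure, so the choice between them is mostly about how the SQL reads rather than what the implementation can express.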
How would they work with metadata filtering?
How do we allow bruteforce + ANN on the same table?
How do we pick between KNN/ANN in a SQL query?