Tracking issue: ANN (Approximate Nearest Neighbors) Index #25

As of `v0.1.0`, `sqlite-vec` will support brute-force search only, which slows down on large datasets (more than 1M vectors with large dimensions). I want to include some form of approximate nearest neighbors (ANN) search before `v1`, which trades some accuracy and resource usage for speed.
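For reference, here is roughly what "brute-force" means: every stored vector is compared against the query, so cost grows linearly with both row count and dimensions. This is a minimal pure-Python sketch, not how `sqlite-vec` is actually implemented; `l2_distance` and `brute_force_knn` are made-up names for illustration.

```python
import heapq
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def brute_force_knn(vectors, query, k):
    """Scan every vector and keep the k closest.

    vectors: dict mapping rowid -> list of floats (hypothetical layout).
    Returns a list of (distance, rowid) pairs, nearest first.
    """
    scored = ((l2_distance(v, query), rowid) for rowid, v in vectors.items())
    return heapq.nsmallest(k, scored)


vectors = {1: [0.0, 0.0], 2: [1.0, 1.0], 3: [5.0, 5.0], 4: [0.9, 1.2]}
nearest = brute_force_knn(vectors, [1.0, 1.0], k=2)
nearest_rowids = [rowid for _, rowid in nearest]
```

An ANN index would avoid the full scan by looking at only a subset of the vectors, at the cost of sometimes missing a true neighbor.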

This issue is a general "tracking issue" for how ANN will be implemented in `sqlite-vec`. The open questions I have:

## Which ANN index should we use?

We want something that fits well with SQLite - meaning storing data in shadow tables, data that fits in pages, low memory usage, etc.
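To make the "fits in pages" constraint a bit more concrete, here is a small sketch using Python's built-in `sqlite3` module. It stores one vector as a float32 BLOB in an ordinary table standing in for a shadow table, and compares the BLOB size to the database page size. The table name `vec_movies_vectors` and its columns are assumptions for illustration only, not the layout `sqlite-vec` uses.

```python
import sqlite3
import struct


def pack_f32(vec):
    """Serialize a list of floats as little-endian float32 bytes."""
    return struct.pack(f"<{len(vec)}f", *vec)


def unpack_f32(blob):
    """Inverse of pack_f32."""
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


db = sqlite3.connect(":memory:")

# Hypothetical shadow table: one row per vector, stored as a BLOB.
db.execute("create table vec_movies_vectors(id integer primary key, vector blob)")

# Values chosen to be exactly representable in float32, so the round trip is exact.
embedding = [i * 0.125 for i in range(768)]
db.execute("insert into vec_movies_vectors(id, vector) values (?, ?)", (1, pack_f32(embedding)))

blob = db.execute("select vector from vec_movies_vectors where id = 1").fetchone()[0]
page_size = db.execute("pragma page_size").fetchone()[0]

blob_bytes = len(blob)                    # 768 dimensions * 4 bytes each
fits_in_one_page = blob_bytes < page_size  # depends on the build's page size
restored = unpack_f32(blob)
```

A `float[768]` vector is 3072 bytes before any per-row overhead, so whatever index we pick has little room per page for extra data like neighbor lists, unless vectors are quantized or spread across pages.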

The main options I see:

- IVF: Like [Faiss IndexIVF](https://faiss.ai/cpp_api/struct/structfaiss_1_1IndexIVF.html): pre-compute centroids from a training dataset, then at query time search only the nearest N centroids, etc.
- HNSW: Should "scale" more, but has way more params to tune. Would be pretty complicated to implement.
- DiskANN: Kinda dig the simplicity of this, especially LM-DiskANN.

Unsure which one will turn out best, will need to research more. It's possible we add support for all of these options.
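As a rough illustration of the IVF idea above (train centroids, bucket vectors by nearest centroid, probe only the nearest few buckets at query time), here is a small pure-Python sketch. It uses plain Lloyd's k-means seeded with the first few training vectors, which is an assumption made to keep the example short; all function names and the `nprobe` parameter name are illustrative, not a proposed API.

```python
import heapq


def sq_l2(a, b):
    """Squared Euclidean distance (enough for ranking)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centroid(vec, centroids):
    """Index of the centroid closest to vec."""
    return min(range(len(centroids)), key=lambda i: sq_l2(vec, centroids[i]))


def train_centroids(training, n_lists, iterations=10):
    """Plain Lloyd's k-means, seeded with the first n_lists training vectors."""
    centroids = [list(v) for v in training[:n_lists]]
    for _ in range(iterations):
        buckets = [[] for _ in centroids]
        for v in training:
            buckets[nearest_centroid(v, centroids)].append(v)
        for i, bucket in enumerate(buckets):
            if bucket:  # keep the old centroid if nothing was assigned to it
                centroids[i] = [sum(col) / len(bucket) for col in zip(*bucket)]
    return centroids


def build_inverted_lists(vectors, centroids):
    """Group rowids by their nearest centroid."""
    lists = [[] for _ in centroids]
    for rowid, v in vectors.items():
        lists[nearest_centroid(v, centroids)].append(rowid)
    return lists


def ivf_search(query, vectors, centroids, lists, k, nprobe):
    """Scan only the nprobe lists whose centroids are closest to the query."""
    probe = heapq.nsmallest(
        nprobe, range(len(centroids)), key=lambda i: sq_l2(query, centroids[i])
    )
    candidates = (rowid for i in probe for rowid in lists[i])
    return heapq.nsmallest(k, candidates, key=lambda rowid: sq_l2(query, vectors[rowid]))


vectors = {
    1: [0.0, 0.1], 2: [0.2, 0.0], 3: [0.1, 0.3],
    4: [9.0, 9.1], 5: [9.2, 8.9], 6: [8.8, 9.0],
}
centroids = train_centroids(list(vectors.values()), n_lists=2)
lists = build_inverted_lists(vectors, centroids)
hits = ivf_search([9.0, 9.0], vectors, centroids, lists, k=2, nprobe=1)
```

With `nprobe=1` only half of this toy dataset is scanned. Raising `nprobe` scans more lists, which improves recall and costs speed. In a SQLite setting, the centroids and inverted lists would presumably live in shadow tables.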

## How should one "declare" an index?

SQLite doesn't have custom indexes, so I think the best way would be to include index info in the `CREATE VIRTUAL TABLE` constructor. Like:

```sql
create virtual table vec_movies using vec0(
  synopsis_embeddings float[768] INDEXED BY diskann(...)
);
```

or:

```sql
create virtual table vec_movies using vec0(
  synopsis_embeddings float[768] index=hnsw(...)
);
```

The syntax heavily depends on which ANN index we pick. Also, how would training work?

## How would they work with metadata filtering?

## How do we allow bruteforce + ANN on the same table?

How do we pick between KNN/ANN in a SQL query?
