feat: add ArcadeDB document store integration#2898
feat: add ArcadeDB document store integration#2898julian-risch merged 16 commits intodeepset-ai:mainfrom
Conversation
Add ArcadeDBDocumentStore and ArcadeDBEmbeddingRetriever for Haystack 2.x. ArcadeDB is an open-source multi-model database that combines document storage, HNSW vector search (LSM_VECTOR), and SQL metadata filtering in a single engine. This integration connects via the HTTP/JSON API using only the requests library — no special drivers needed. Components: - ArcadeDBDocumentStore: full DocumentStore protocol (count, filter, write, delete) with automatic schema/index initialization - ArcadeDBEmbeddingRetriever: pipeline component for vector similarity retrieval with FilterPolicy support - Filter conversion: Haystack filter dicts → ArcadeDB SQL WHERE clauses - Document converters: Haystack Document ↔ ArcadeDB record mapping Includes CI workflow with ArcadeDB Docker service, unit tests for filter conversion, and integration tests for all DocumentStore operations.
- Remove unused noqa: B008 directives (B008 already in ignore list) - Use HTTPStatus.BAD_REQUEST instead of magic value 400 (PLR2004) - Add S608 to ruff ignore (SQL string construction is intentional for ArcadeDB HTTP/JSON API with proper value escaping) - Set requests>=2.28.0 minimum to ensure Python 3.13 compatibility (older versions use removed cgi module)
julian-risch
left a comment
There was a problem hiding this comment.
Thank you for opening this pull request @lvca ! Looks quite good to me already but there are also a couple of things we need to change before we can merge it. I'll make most of the changes myself. What you could please do is open a PR in https://github.com/deepset-ai/haystack-integrations with a page to highlight this new integration using this structure. Feel free to reuse what you had in mind for the README.md there.
...ons/arcadedb/src/haystack_integrations/components/retrievers/arcadedb/embedding_retriever.py
Outdated
Show resolved
Hide resolved
integrations/arcadedb/src/haystack_integrations/document_stores/arcadedb/document_store.py
Outdated
Show resolved
Hide resolved
integrations/arcadedb/src/haystack_integrations/document_stores/arcadedb/document_store.py
Outdated
Show resolved
Hide resolved
integrations/arcadedb/src/haystack_integrations/document_stores/arcadedb/document_store.py
Outdated
Show resolved
Hide resolved
integrations/arcadedb/src/haystack_integrations/document_stores/arcadedb/document_store.py
Outdated
Show resolved
Hide resolved
|
I ran the example code successfully on my local machine and applied suggestions from my code review. Main task remaining is to use mixin DocumentStore tests based on https://github.com/deepset-ai/haystack/blob/main/haystack/testing/document_store.py#L252 I'll do that next. |
…rievers/arcadedb/embedding_retriever.py
…s/arcadedb/document_store.py
|
I updated the tests to use DocumentStore mixin tests. Both unit and integration tests passed locally. |
Summary
ArcadeDBDocumentStoreimplementing the full HaystackDocumentStoreprotocol (count, filter, write/upsert/skip, delete) with automatic database, schema, and HNSW vector index initializationArcadeDBEmbeddingRetrieverpipeline component for vector similarity retrieval withFilterPolicysupport (REPLACE/MERGE)requests— no special database drivers neededArcadeDB is an open-source multi-model database (~4k GitHub stars) that combines document storage, HNSW vector search (LSM_VECTOR), full-text search, and graph capabilities in a single engine. For Haystack users, this means one database replaces separate backends for document storage and vector retrieval.
Key design decisions
POST /api/v1/command/{database}with SQL — no Bolt protocol, no neo4j driver, no custom binary protocolvectorNeighbors()returns{record, distance}maps — metadata filters are applied as a post-filter stepHow did you test it?
ArcadeDBEmbeddingRetrieverdemonstrating filtered vector searcharcadedata/arcadedb:latestChecklist
feat: