Skip to content

feat: add ArcadeDB document store integration#2898

Merged
julian-risch merged 16 commits intodeepset-ai:mainfrom
lvca:feat/arcadedb-integration
Mar 2, 2026
Merged

feat: add ArcadeDB document store integration#2898
julian-risch merged 16 commits intodeepset-ai:mainfrom
lvca:feat/arcadedb-integration

Conversation

@lvca
Copy link
Contributor

@lvca lvca commented Feb 28, 2026

Summary

  • Add ArcadeDBDocumentStore implementing the full Haystack DocumentStore protocol (count, filter, write/upsert/skip, delete) with automatic database, schema, and HNSW vector index initialization
  • Add ArcadeDBEmbeddingRetriever pipeline component for vector similarity retrieval with FilterPolicy support (REPLACE/MERGE)
  • Pure HTTP/JSON API integration via requests — no special database drivers needed

ArcadeDB is an open-source multi-model database (~4k GitHub stars) that combines document storage, HNSW vector search (LSM_VECTOR), full-text search, and graph capabilities in a single engine. For Haystack users, this means one database replaces separate backends for document storage and vector retrieval.

Key design decisions

  • HTTP-only: All operations use POST /api/v1/command/{database} with SQL — no Bolt protocol, no neo4j driver, no custom binary protocol
  • LSM_VECTOR index: Uses ArcadeDB's ACID-compliant HNSW implementation with configurable dimension and similarity metric (cosine, euclidean, dot product)
  • ARRAY_OF_FLOATS property type: Required by ArcadeDB's vector index (not generic LIST)
  • Post-filtering for vector search: vectorNeighbors() returns {record, distance} maps — metadata filters are applied as a post-filter step
  • Automatic schema setup: Creates database, vertex type, properties, unique index, and vector index on first use — zero manual setup required

How did you test it?

  • 14 unit tests for filter conversion (no ArcadeDB required)
  • 12 integration tests against a live ArcadeDB Docker container covering: count, write (overwrite/skip/duplicate), delete, filter (equality/comparison/AND), embedding retrieval, and serialization round-trip
  • End-to-end pipeline example with ArcadeDBEmbeddingRetriever demonstrating filtered vector search
  • All 26 tests pass against arcadedata/arcadedb:latest

Checklist

  • I have read the contributors guidelines and the code of conduct
  • I added unit tests and updated the docstrings
  • I've used one of the conventional commit types for my PR title: feat:
  • CI workflow with ArcadeDB Docker service included
  • Labeler configuration updated
  • README inventory table updated

Add ArcadeDBDocumentStore and ArcadeDBEmbeddingRetriever for Haystack 2.x.

ArcadeDB is an open-source multi-model database that combines document
storage, HNSW vector search (LSM_VECTOR), and SQL metadata filtering
in a single engine. This integration connects via the HTTP/JSON API
using only the requests library — no special drivers needed.

Components:
- ArcadeDBDocumentStore: full DocumentStore protocol (count, filter,
  write, delete) with automatic schema/index initialization
- ArcadeDBEmbeddingRetriever: pipeline component for vector similarity
  retrieval with FilterPolicy support
- Filter conversion: Haystack filter dicts → ArcadeDB SQL WHERE clauses
- Document converters: Haystack Document ↔ ArcadeDB record mapping

Includes CI workflow with ArcadeDB Docker service, unit tests for
filter conversion, and integration tests for all DocumentStore operations.
@lvca lvca requested a review from a team as a code owner February 28, 2026 23:10
@lvca lvca requested review from julian-risch and removed request for a team February 28, 2026 23:10
@CLAassistant
Copy link

CLAassistant commented Feb 28, 2026

CLA assistant check
All committers have signed the CLA.

@github-actions github-actions bot added topic:CI type:documentation Improvements or additions to documentation labels Feb 28, 2026
lvca added 3 commits February 28, 2026 18:33
- Remove unused noqa: B008 directives (B008 already in ignore list)
- Use HTTPStatus.BAD_REQUEST instead of magic value 400 (PLR2004)
- Add S608 to ruff ignore (SQL string construction is intentional for
  ArcadeDB HTTP/JSON API with proper value escaping)
- Set requests>=2.28.0 minimum to ensure Python 3.13 compatibility
  (older versions use removed cgi module)
Copy link
Member

@julian-risch julian-risch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for opening this pull request @lvca ! Looks quite good to me already but there are also a couple of things we need to change before we can merge it. I'll make most of the changes myself. What you could please do is open a PR in https://github.com/deepset-ai/haystack-integrations with a page to highlight this new integration using this structure. Feel free to reuse what you had in mind for the README.md there.

@julian-risch
Copy link
Member

I ran the example code successfully on my local machine and applied suggestions from my code review. Main task remaining is to use mixin DocumentStore tests based on https://github.com/deepset-ai/haystack/blob/main/haystack/testing/document_store.py#L252 I'll do that next.

@julian-risch julian-risch self-assigned this Mar 2, 2026
@julian-risch
Copy link
Member

I updated the tests to use DocumentStore mixin tests. Both unit and integration tests passed locally.

@julian-risch julian-risch self-requested a review March 2, 2026 13:09
@julian-risch julian-risch merged commit d65b2fd into deepset-ai:main Mar 2, 2026
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

integration:arcadedb topic:CI type:documentation Improvements or additions to documentation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants