VectorDB Modular Design #316

Open: CShorb wants to merge 2 commits into mlcommons:main from CShorb:main

@CShorb CShorb commented Apr 6, 2026

VectorDB Modular Design

Adds a modular, backend-agnostic vector database benchmarking framework that measures load throughput, search QPS, recall@K, and latency percentiles (P50/P90/P99) across pluggable database backends.
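
As a rough illustration of the reported metrics, the sketch below shows one common way to compute recall@K and latency percentiles. The function names are hypothetical and the nearest-rank percentile method is an assumption; the PR's own implementation may differ.

```python
import math


def recall_at_k(retrieved, ground_truth, k):
    """Fraction of the true top-k neighbour IDs found in the retrieved top-k."""
    truth = set(ground_truth[:k])
    if not truth:
        return 0.0
    hits = len(truth.intersection(retrieved[:k]))
    return hits / len(truth)


def percentile(samples, pct):
    """Nearest-rank percentile of a list of latency samples (assumed method)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Nearest rank: smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(latencies_ms):
    """Return the P50/P90/P99 figures named in the description."""
    return {f"p{p}": percentile(latencies_ms, p) for p in (50, 90, 99)}
```

Search QPS would then be the number of completed queries divided by the wall-clock duration of the search phase.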

Architecture

The framework introduces an abstract VectorDBBackend interface with a self-describing descriptor system and auto-discovery registry, enabling new backends to be added by simply dropping a sub-package into the backends/ directory with no other code changes required.

Included Backends

Three backend implementations are included out of the box:

  • Milvus (HNSW, DiskANN) — leverages native segment compaction
  • Elasticsearch (HNSW) — leverages bulk API
  • pgvector/PostgreSQL (HNSW, IVFFlat) — leverages GUC parameter tuning

Benchmark Pipeline

The pipeline uses a three-way producer-consumer architecture:

  1. A background thread generates L2-normalized vectors onto a bounded queue
  2. The main thread ingests batches into the database
  3. A parallel executor incrementally builds a brute-force ground-truth table in streaming fashion with capped memory usage
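
The three roles above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the PR's code: the constants, function names, the dict standing in for the database, and the choice of a thread pool are all hypothetical. Ground truth stays memory-capped because only the best K candidates per query are kept after each batch.

```python
import heapq
import math
import queue
import random
import threading
from concurrent.futures import ThreadPoolExecutor

DIM, BATCH, TOTAL, K = 8, 100, 500, 5  # toy sizes, not the PR's 1M x 1536
_DONE = object()  # sentinel marking the end of the stream


def produce(out_q, rng):
    """Background producer: emit batches of (id, L2-normalized vector)."""
    next_id = 0
    while next_id < TOTAL:
        batch = []
        for _ in range(BATCH):
            vec = [rng.gauss(0, 1) for _ in range(DIM)]
            norm = math.sqrt(sum(x * x for x in vec))
            batch.append((next_id, [x / norm for x in vec]))
            next_id += 1
        out_q.put(batch)  # blocks while the bounded queue is full
    out_q.put(_DONE)


def batch_top_k(query, batch):
    """Brute-force top-K of one batch by inner product (cosine on unit vectors)."""
    scored = ((sum(a * b for a, b in zip(query, vec)), vid) for vid, vec in batch)
    return heapq.nlargest(K, scored)


def run(queries, seed=0):
    q = queue.Queue(maxsize=4)
    threading.Thread(target=produce, args=(q, random.Random(seed)), daemon=True).start()
    stored = {}                      # stand-in for the real database
    best = [[] for _ in queries]     # running top-K per query
    with ThreadPoolExecutor(max_workers=4) as pool:
        while (batch := q.get()) is not _DONE:
            futures = [pool.submit(batch_top_k, query, batch) for query in queries]
            stored.update(batch)     # main thread "ingests" the batch
            for i, fut in enumerate(futures):
                # Merge batch candidates into the running top-K, then discard the rest.
                best[i] = heapq.nlargest(K, best[i] + fut.result())
    return stored, [[vid for _, vid in row] for row in best]
```

Merging per-batch top-K lists is exact, since every member of the global top-K must also be in the top-K of its own batch.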

Configuration & CLI

  • YAML-driven configuration with layered precedence: CLI > environment > YAML > defaults
  • --what-if dry-run mode
  • --plan execution planning
  • Interactive collection administration tool for inspecting and managing collections across backends
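
The layered precedence could be resolved roughly as in the sketch below. The VDB_ environment prefix, the key names, the default backend, and resolve_config itself are hypothetical; the YAML layer is passed in as an already-parsed dict to keep the example dependency-free.

```python
import argparse
from collections import ChainMap

# Built-in defaults (lowest precedence). Values are illustrative.
DEFAULTS = {"backend": "milvus", "num_vectors": 1_000_000, "dimension": 1536}


def resolve_config(cli_args, environ, yaml_values, prefix="VDB_"):
    """Merge layers so that CLI > environment > YAML > defaults."""
    # Only flags the user actually passed count as CLI overrides.
    cli = {k: v for k, v in vars(cli_args).items() if v is not None}
    env = {}
    for key, default in DEFAULTS.items():
        raw = environ.get(prefix + key.upper())
        if raw is not None:
            env[key] = type(default)(raw)  # coerce to the default's type
    # ChainMap returns the value from the first mapping that has the key.
    return dict(ChainMap(cli, env, yaml_values, DEFAULTS))


parser = argparse.ArgumentParser()
parser.add_argument("--backend")
parser.add_argument("--num-vectors", type=int)
parser.add_argument("--dimension", type=int)

args = parser.parse_args(["--backend", "pgvector"])
config = resolve_config(
    args,
    environ={"VDB_BACKEND": "elasticsearch", "VDB_NUM_VECTORS": "5000"},
    yaml_values={"num_vectors": 10000, "dimension": 768},
)
```

Here backend comes from the CLI, num_vectors from the environment, and dimension from the YAML layer, because each key is taken from the highest-precedence layer that sets it.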

Included Configs

Four ready-to-use benchmark configs (1M vectors, 1536 dimensions) are provided along with a .env.example template for connection credentials.

@CShorb CShorb requested a review from a team April 6, 2026 19:33

github-actions Bot commented Apr 6, 2026

MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅

Comment thread vdb_benchmark/README.md

Can we add either a requirements.txt or a dependency update in storage/pyproject.toml for the new optional dependencies (elasticsearch, psycopg2-binary, pgvector, python-dotenv)? The existing pyproject.toml currently lists only pymilvus. These can be added as optional dependencies.

russfellows commented Apr 8, 2026 via email

idevasena commented May 18, 2026

@CShorb I have finally reviewed your PR in full.

Thank you for the modular VDB benchmark work. I like the backend abstraction, descriptor/registry model, config/env precedence, and the producer/consumer load + ground-truth pipeline.

I think we need the changes below before merging. I have also updated the respective files and will push them as a separate commit to this PR next:

  1. The new runner is not wired into the current mlpstorage vectordb path: mlpstorage_py/benchmarks/vectordbbench.py still dispatches to the existing load-vdb, vdbbench, and enhanced-bench scripts. There are two options: either integrate python -m vdbbench.benchmark into mlpstorage, or document it clearly as a standalone preview.

  2. Need to add optional dependency extras and validation for the new backends: elasticsearch, psycopg2-binary, pgvector, and optionally python-dotenv. Backend-specific missing-dependency errors should provide install hints. I am adding these extras to pyproject.toml.

  3. The pgvector backend appears to call psycopg2.extensions.quote_ident(name) without the required connection/cursor scope. Need to fix identifier quoting, preferably with psycopg2.sql.Identifier.

  4. Need to add CI/smoke coverage: fake-backend unit test, Milvus 10k HNSW smoke, pgvector 5k smoke, Elasticsearch 5k smoke, and regression tests proving the existing ./mlpstorage vectordb commands still work.

  5. Need to update vdb_benchmark/README.md, not only the nested benchmark README.
    We will document the modular runner, backend dependencies, env vars, --what-if/--plan, and artifact reuse, and explicitly state that multi-node / multi-client VDB execution is not implemented yet. Multi-node execution will be added in a separate PR.
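
To illustrate item 2, backend-specific dependency validation with install hints might look like the sketch below. The mapping, exception class, and function name are hypothetical; only the package names come from this thread, and the import names shown are the ones those packages install.

```python
import importlib

# Hypothetical mapping: backend -> {import name: pip package name}.
BACKEND_REQUIREMENTS = {
    "milvus": {"pymilvus": "pymilvus"},
    "elasticsearch": {"elasticsearch": "elasticsearch"},
    "pgvector": {"psycopg2": "psycopg2-binary", "pgvector": "pgvector"},
}


class MissingBackendDependency(ImportError):
    """Raised when a selected backend's optional packages are not installed."""


def require_backend_deps(backend, importer=importlib.import_module):
    """Import each module the backend needs; raise one error listing all gaps."""
    missing = []
    for module, package in BACKEND_REQUIREMENTS[backend].items():
        try:
            importer(module)
        except ImportError:
            missing.append(package)
    if missing:
        raise MissingBackendDependency(
            f"Backend '{backend}' requires missing packages: {', '.join(missing)}. "
            f"Install with: pip install {' '.join(missing)}"
        )
```

Calling require_backend_deps before constructing a backend means a user who selects pgvector without its packages sees a single actionable message instead of a bare ModuleNotFoundError. The importer parameter exists only so the check can be unit-tested without the real packages.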

3 participants