VectorDB Modular Design by CShorb · Pull Request #316 · mlcommons/storage

CShorb · 2026-04-06T19:33:46Z

VectorDB Modular Design

Adds a modular, backend-agnostic vector database benchmarking framework that measures load throughput, search QPS, recall@K, and latency percentiles (P50/P90/P99) across pluggable database backends.
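For readers unfamiliar with the metrics named above, here is a minimal sketch of how recall@K and a latency percentile can be computed. The function names are illustrative and are not taken from the PR. The percentile uses the nearest-rank method, which is only one of several common definitions, so the framework may compute it differently.

```python
import math
from typing import Sequence


def recall_at_k(retrieved: Sequence[int], ground_truth: Sequence[int], k: int) -> float:
    """Fraction of the true top-k neighbor IDs found among the top-k retrieved IDs."""
    if k <= 0:
        raise ValueError("k must be positive")
    truth = set(ground_truth[:k])
    if not truth:
        return 0.0
    hits = len(truth.intersection(retrieved[:k]))
    return hits / len(truth)


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples <= it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    latencies_ms = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
    print(recall_at_k([1, 2, 3, 9], [1, 2, 3, 4], k=4))
    print([percentile(latencies_ms, p) for p in (50, 90, 99)])
```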

Architecture

The framework introduces an abstract VectorDBBackend interface with a self-describing descriptor system and auto-discovery registry, enabling new backends to be added by simply dropping a sub-package into the backends/ directory with no other code changes required.
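The descriptor-plus-registry idea can be sketched as follows. Only the name VectorDBBackend comes from the PR description. The descriptor fields, the method signatures, the register decorator, and discover_backends are assumptions made for illustration, and the actual interface in the PR is likely richer.

```python
import importlib
import pkgutil
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple, Type


@dataclass(frozen=True)
class BackendDescriptor:
    """Hypothetical self-description a backend exposes to the registry."""
    name: str
    index_types: Tuple[str, ...]


class VectorDBBackend(ABC):
    """Abstract interface; the method set here is an assumption, not the PR's API."""
    descriptor: BackendDescriptor

    @abstractmethod
    def insert(self, ids: Sequence[int], vectors: Sequence[Sequence[float]]) -> None: ...

    @abstractmethod
    def search(self, query: Sequence[float], k: int) -> List[int]: ...


_REGISTRY: Dict[str, Type[VectorDBBackend]] = {}


def register(cls: Type[VectorDBBackend]) -> Type[VectorDBBackend]:
    """Class decorator: index the backend class under its descriptor name."""
    _REGISTRY[cls.descriptor.name] = cls
    return cls


def discover_backends(package_name: str) -> None:
    """Import every sub-package so that its @register decorators run."""
    package = importlib.import_module(package_name)
    for module_info in pkgutil.iter_modules(package.__path__):
        if module_info.ispkg:
            importlib.import_module(f"{package_name}.{module_info.name}")


def create_backend(name: str) -> VectorDBBackend:
    try:
        return _REGISTRY[name]()
    except KeyError:
        raise KeyError(f"unknown backend {name!r}; available: {sorted(_REGISTRY)}") from None


@register
class InMemoryBackend(VectorDBBackend):
    """Toy backend used only to exercise the registry."""
    descriptor = BackendDescriptor(name="in-memory", index_types=("flat",))

    def __init__(self) -> None:
        self._rows: Dict[int, Sequence[float]] = {}

    def insert(self, ids, vectors):
        self._rows.update(zip(ids, vectors))

    def search(self, query, k):
        def score(item):
            return sum(a * b for a, b in zip(query, item[1]))
        ranked = sorted(self._rows.items(), key=score, reverse=True)
        return [vid for vid, _ in ranked[:k]]
```

With this shape, adding a backend means adding a sub-package whose import registers a class. The benchmark driver would call discover_backends once at startup and never needs to name individual backends.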

Included Backends

Three backend implementations are included out of the box:

Milvus (HNSW, DiskANN) — leverages native segment compaction
Elasticsearch (HNSW) — leverages bulk API
pgvector/PostgreSQL (HNSW, IVFFlat) — leverages GUC parameter tuning

Benchmark Pipeline

Uses a three-way producer-consumer architecture:

A background thread generates L2-normalized vectors onto a bounded queue
The main thread ingests batches into the database
A parallel executor incrementally builds a brute-force ground-truth table in streaming fashion with capped memory usage
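The three roles above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the PR's implementation: the constants, the function names, and the stand-in for the database insert are all hypothetical, and inner-product similarity is assumed for the ground truth. Memory stays bounded because the queue has a maximum size, only a few ground-truth tasks are in flight at once, and only K candidates are kept per query.

```python
import heapq
import math
import queue
import random
import threading
from collections import deque
from concurrent.futures import ThreadPoolExecutor

DIM, BATCH, TOTAL, K = 8, 100, 1000, 5   # illustrative sizes
MAX_IN_FLIGHT = 4                        # cap on pending ground-truth tasks
SENTINEL = None


def normalized_vector(rng, dim):
    """Random vector scaled to unit L2 norm."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def producer(out_q, rng):
    """Background thread: put batches of (id, vector) on the bounded queue."""
    for start in range(0, TOTAL, BATCH):
        batch = [(start + i, normalized_vector(rng, DIM)) for i in range(BATCH)]
        out_q.put(batch)          # blocks while the queue is full
    out_q.put(SENTINEL)


def batch_top_k(batch, queries, k):
    """Brute force: per query, the k best (similarity, id) pairs within one batch."""
    return [
        heapq.nlargest(
            k, ((sum(a * b for a, b in zip(q, vec)), vid) for vid, vec in batch)
        )
        for q in queries
    ]


def run_pipeline(seed=0):
    queries = [normalized_vector(random.Random(seed), DIM) for _ in range(3)]
    work_q = queue.Queue(maxsize=4)
    threading.Thread(
        target=producer, args=(work_q, random.Random(seed + 1)), daemon=True
    ).start()

    truth = [[] for _ in queries]     # running top-K per query
    pending = deque()
    ingested = 0

    def merge(result):
        for i, candidates in enumerate(result):
            truth[i] = heapq.nlargest(K, truth[i] + candidates)

    with ThreadPoolExecutor(max_workers=2) as pool:
        while True:
            batch = work_q.get()
            if batch is SENTINEL:
                break
            ingested += len(batch)    # stand-in for the backend insert call
            pending.append(pool.submit(batch_top_k, batch, queries, K))
            if len(pending) > MAX_IN_FLIGHT:
                merge(pending.popleft().result())
        while pending:
            merge(pending.popleft().result())
    return ingested, truth
```

Merging per-batch top-K lists into a running top-K gives the same result as one global brute-force pass, because any vector in the global top K must also be in the top K of its own batch.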

Configuration & CLI

YAML-driven configuration with layered precedence: CLI > environment > YAML > defaults
--what-if dry-run mode
--plan execution planning
Interactive collection administration tool for inspecting and managing collections across backends
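The layered precedence (CLI > environment > YAML > defaults) maps naturally onto a chain of dictionaries searched in order. The sketch below is an assumption about how such a resolver could look. The VDBBENCH_ prefix, the key names, and the default values are hypothetical, and the YAML layer is passed in as an already-parsed mapping so that the example needs no third-party parser.

```python
from collections import ChainMap
from typing import Any, Dict, Mapping

# Hypothetical defaults; the real framework's keys and values may differ.
DEFAULTS: Dict[str, Any] = {"host": "localhost", "batch_size": 1000}


def env_layer(environ: Mapping[str, str], prefix: str = "VDBBENCH_") -> Dict[str, str]:
    """Turn PREFIX_SOME_KEY=value into {'some_key': value}."""
    return {
        key[len(prefix):].lower(): value
        for key, value in environ.items()
        if key.startswith(prefix)
    }


def resolve_config(
    cli: Mapping[str, Any],
    environ: Mapping[str, str],
    yaml_values: Mapping[str, Any],
    defaults: Mapping[str, Any] = DEFAULTS,
) -> Dict[str, Any]:
    """Earlier layers win: CLI > environment > YAML > defaults."""
    # Unset CLI options usually arrive as None; drop them so lower layers show through.
    cli_set = {k: v for k, v in cli.items() if v is not None}
    layers = ChainMap(cli_set, env_layer(environ), dict(yaml_values), dict(defaults))
    return dict(layers)


if __name__ == "__main__":
    config = resolve_config(
        cli={"host": None, "batch_size": 500},
        environ={"VDBBENCH_HOST": "db.internal"},
        yaml_values={"host": "yaml-host", "dim": 1536},
    )
    print(config)
```

Note that values read from the environment are strings, so a real resolver would also need a type-coercion step for numeric and boolean settings.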

Included Configs

Four ready-to-use benchmark configs (1M vectors, 1536 dimensions) are provided along with a .env.example template for connection credentials.

github-actions · 2026-04-06T19:34:01Z

MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅

idevasena · 2026-04-08T14:58:51Z

Can we add either a requirements.txt or a dependency update in storage/pyproject.toml for the new optional dependencies (elasticsearch, psycopg2-binary, pgvector, python-dotenv)? The existing pyproject.toml currently lists only pymilvus. These can be added as optional deps.

russfellows · 2026-04-08T15:13:42Z

Devasena, I believe the best way to handle this is via optional requirements in pyproject.toml. I can help with this if desired. Regards, Russ

idevasena · 2026-05-18T23:08:07Z

@CShorb Finally reviewed your PR completely.

Thank you for the modular VDB benchmark work. I like the backend abstraction, descriptor/registry model, config/env precedence, and the producer/consumer load + ground-truth pipeline.

I think we need the changes below before merging. I have also updated the respective files and will push them as a separate commit to this PR next:

The new runner is not wired into the current mlpstorage vectordb path. mlpstorage_py/benchmarks/vectordbbench.py still dispatches to the existing load-vdb, vdbbench, and enhanced-bench scripts. There are two options: either integrate python -m vdbbench.benchmark into mlpstorage, or document it clearly as a standalone preview.
Need to add optional dependency extras and validation for the new backends: elasticsearch, psycopg2-binary, pgvector, and optional python-dotenv. Backend-specific missing-dependency errors should provide install hints. I am adding these to pyproject.toml.
The pgvector backend appears to call psycopg2.extensions.quote_ident(name) without the required connection/cursor scope. Need to fix identifier quoting, preferably with psycopg2.sql.Identifier.
Need to add CI/smoke coverage: fake-backend unit test, Milvus 10k HNSW smoke, pgvector 5k smoke, Elasticsearch 5k smoke, and regression tests proving the existing ./mlpstorage vectordb commands still work.
Need to update vdb_benchmark/README.md, not only the nested benchmark README.
We will document the modular runner, backend dependencies, env vars, --what-if/--plan, and artifact reuse, and explicitly state that multi-node / multi-client VDB execution is not implemented yet. Multi-node execution will be addressed in a separate PR.
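The request for install hints on missing backend dependencies could look roughly like the sketch below. It is an assumption about one possible shape and not the commit that was pushed: the exception class, the function name, and the backend keys are hypothetical. The import-name to package-name pairs (for example, psycopg2 is installed from psycopg2-binary) follow the packages listed in the review.

```python
import importlib
from typing import Dict, Mapping, Optional

# Import name -> pip package name, per backend (backend keys are hypothetical).
BACKEND_REQUIREMENTS: Dict[str, Dict[str, str]] = {
    "milvus": {"pymilvus": "pymilvus"},
    "elasticsearch": {"elasticsearch": "elasticsearch"},
    "pgvector": {"psycopg2": "psycopg2-binary", "pgvector": "pgvector"},
}


class MissingBackendDependency(ImportError):
    """Raised when a selected backend's client libraries are not installed."""


def require_backend_deps(
    backend: str,
    requirements: Optional[Mapping[str, Mapping[str, str]]] = None,
) -> None:
    """Import each module the backend needs; raise with a pip hint if any is missing."""
    table = BACKEND_REQUIREMENTS if requirements is None else requirements
    missing = []
    for module_name, package_name in table[backend].items():
        try:
            importlib.import_module(module_name)
        except ImportError:
            missing.append(package_name)
    if missing:
        packages = " ".join(missing)
        raise MissingBackendDependency(
            f"Backend {backend!r} needs packages that are not installed: {packages}. "
            f"Try: pip install {packages}"
        )
```

Calling require_backend_deps("pgvector") before constructing the backend turns a bare ImportError deep inside a driver module into an actionable message at startup.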

CShorb and others added 2 commits April 6, 2026 15:20

Modular VectorDB Benchmark

10a8bb1

Update README.md

69164d5

CShorb requested a review from a team April 6, 2026 19:33

idevasena reviewed Apr 8, 2026

CShorb wants to merge 2 commits into mlcommons:main from CShorb:main

3 participants

CShorb commented Apr 6, 2026 • edited by idevasena