vdb benchmark shows a very low recall@10 because the flat_gt collection size is too small #375

@Raysmond

We ran a few dry-run tests of the vdb benchmark on a single-host system with a single Gen5 NVMe drive (Solidigm D7-PS1010).
The results show a very low mean recall@10 of only about 0.0090. We looked into the issue and found that the collection mlps_1m_1shards_1536dim_uniform_flat_gt contains 10000 vectors, which is only 1% of the 1M vectors in the source collection. With a FLAT ground-truth collection this small, how can recall@10 be any higher? I think the FLAT collection should contain the same number of vectors as the source collection, which seems more reasonable. I don't know whether this is by design or a bug.

(.venv) root@cnit-zz-01:~/raysmond/workspace/mlperf_v3/storage2/vdb_benchmark# python vdbbench/list_collections.py
2026-05-14 00:58:06,702 - INFO - Connected to Milvus server at 127.0.0.1:19530
2026-05-14 00:58:06,704 - INFO - Found 2 collections
2026-05-14 00:58:06,704 - INFO - Getting information for collection: mlps_1m_1shards_1536dim_uniform
2026-05-14 00:58:06,714 - INFO - Getting information for collection: mlps_1m_1shards_1536dim_uniform_flat_gt
+-----------------------------------------+----------------+-------------+---------------+----------------+--------------+
| Collection Name                         |   Vector Count |   Dimension | Index Types   | Metric Types   |   Partitions |
+=========================================+================+=============+===============+================+==============+
| mlps_1m_1shards_1536dim_uniform         |        1000000 |        1536 | DISKANN       | COSINE         |            1 |
+-----------------------------------------+----------------+-------------+---------------+----------------+--------------+
| mlps_1m_1shards_1536dim_uniform_flat_gt |          10000 |        1536 | FLAT          | COSINE         |            1 |
+-----------------------------------------+----------------+-------------+---------------+----------------+--------------+
2026-05-14 00:58:06,900 - INFO - Disconnected from Milvus server
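
To make the effect on recall concrete, here is a minimal sketch of how recall@k is commonly computed. It is not taken from enhanced_bench.py; the function name `recall_at_k` and the sample id lists are hypothetical. It shows that when the ground truth comes from a small subset of the data, most ids returned by the ANN search on the full collection cannot match. As a rough consistency check only: the FLAT collection holds 10000 / 1000000 = 1% of the vectors, which is the same order of magnitude as the observed mean recall of 0.0090.

```python
def recall_at_k(ann_ids, gt_ids, k=10):
    """Fraction of the top-k ground-truth ids that the ANN search also returned."""
    gt_top = set(gt_ids[:k])
    if not gt_top:
        return 0.0
    hits = len(gt_top.intersection(ann_ids[:k]))
    return hits / len(gt_top)


# Hypothetical ids returned by the ANN index on the full collection.
ann_ids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Ground truth computed over the same full collection: large overlap.
gt_full = [1, 2, 3, 4, 5, 6, 7, 8, 9, 11]

# Ground truth computed over a small subset: almost no overlap is possible.
gt_subset = [10, 101, 102, 103, 104, 105, 106, 107, 108, 109]

print(recall_at_k(ann_ids, gt_full))    # 0.9
print(recall_at_k(ann_ids, gt_subset))  # 0.1
```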

Here are the commands we executed and the dryrun logs:

# load the vector database
(.venv) root@cnit-zz-01:~/raysmond/workspace/mlperf_v3/storage2# ./mlpstorage vectordb datagen     --host 127.0.0.1 --port 19530 --config default     --force --results-dir ./vdb_results --file

# run the query cmd (PATH A)
(.venv) root@cnit-zz-01:~/raysmond/workspace/mlperf_v3/storage2/vdb_benchmark# python vdbbench/enhanced_bench.py \
  --host 127.0.0.1 \
  --collection mlps_1m_1shards_1536dim_uniform \
  --auto-create-flat \
  --runtime 120 \
  --batch-size 10 \
  --processes 8 \
  --search-limit 10 \
  --search-ef 200 \
  --queries 100000 \
  --recall-k 10 \
  --cache-state cold \
  --drop-caches-cmd "sh -c 'echo 3 > /proc/sys/vm/drop_caches'"

Logs:

============================================================
ENHANCED VDB BENCH — runtime/query-count mode
============================================================
Results will be saved to: vdbbench_results/20260514_005609

============================================================
Database Verification and Collection Loading
============================================================
Connecting to Milvus server at 127.0.0.1:19530...
Collection mlps_1m_1shards_1536dim_uniform already loaded.

+---------------------------------+----------------+-------------+---------------+----------------+--------------+
| Collection Name                 |   Vector Count |   Dimension | Index Types   | Metric Types   |   Partitions |
+=================================+================+=============+===============+================+==============+
| mlps_1m_1shards_1536dim_uniform |        1000000 |        1536 | DISKANN       | COSINE         |            1 |
+---------------------------------+----------------+-------------+---------------+----------------+--------------+
Detected source vector field: 'vector'

============================================================
RECALL SETUP (outside benchmark timing)
============================================================
Ground truth is pre-computed using a FLAT (brute-force) index.
Using metric type: COSINE

Generating 1000 query vectors (dim=1536, seed=42)...
Generated 1000 query vectors.

Setting up FLAT collection: mlps_1m_1shards_1536dim_uniform_flat_gt
FLAT collection exists but has 10000 vs 1000000 vectors. Dropping and recreating...
Creating FLAT collection 'mlps_1m_1shards_1536dim_uniform_flat_gt' from source 'mlps_1m_1shards_1536dim_uniform'...
Source schema: pk_field='id' (INT64), vec_field='vector', vectors=1000000
Copying 1000000 vectors to FLAT collection (batch_size=5000)...
  Copied 152/1000000 vectors (0.0%)
  Copied 304/1000000 vectors (0.0%)
  Copied 456/1000000 vectors (0.0%)
  Copied 608/1000000 vectors (0.1%)
  Copied 760/1000000 vectors (0.1%)
  Copied 912/1000000 vectors (0.1%)
  Copied 1064/1000000 vectors (0.1%)
  Copied 1216/1000000 vectors (0.1%)
  Copied 1368/1000000 vectors (0.1%)
  Copied 1520/1000000 vectors (0.2%)
  Copied 1672/1000000 vectors (0.2%)
  Copied 1824/1000000 vectors (0.2%)
  Copied 1976/1000000 vectors (0.2%)
  Copied 2132/1000000 vectors (0.2%)
  Copied 2289/1000000 vectors (0.2%)
  Copied 2446/1000000 vectors (0.2%)
  Copied 2603/1000000 vectors (0.3%)
  Copied 2757/1000000 vectors (0.3%)
  Copied 2909/1000000 vectors (0.3%)
  Copied 3063/1000000 vectors (0.3%)
  Copied 3220/1000000 vectors (0.3%)
  Copied 3376/1000000 vectors (0.3%)
  Copied 3528/1000000 vectors (0.4%)
  Copied 3680/1000000 vectors (0.4%)
  Copied 3837/1000000 vectors (0.4%)
  Copied 3994/1000000 vectors (0.4%)
  Copied 4146/1000000 vectors (0.4%)
  Copied 4298/1000000 vectors (0.4%)
  Copied 4450/1000000 vectors (0.4%)
  Copied 4602/1000000 vectors (0.5%)
  Copied 4754/1000000 vectors (0.5%)
  Copied 4906/1000000 vectors (0.5%)
  Copied 10000/1000000 vectors (100.0%)
Building FLAT index...
FLAT collection 'mlps_1m_1shards_1536dim_uniform_flat_gt' ready with 10000 vectors.
Pre-computing ground truth for 1000 queries using FLAT index (top_k=10)...
Ground truth pre-computation complete: 1000 queries in 0.61s
Ground truth ready: 1000 queries pre-computed.

Collecting initial disk statistics...

============================================================
Benchmark Execution
============================================================
Starting benchmark: 8 processes × 12500 queries/process
Recall: 1000 pre-generated queries, recall@10
NOTE: batch_end timing is placed BEFORE recall capture — performance unaffected.
NOTE: recall hits written to per-worker recall_hits_p<N>.jsonl files.
Staggering process startup by 0.125s
Starting process 0...
Process 0 initialized
Process 0 - Loading collection
Process 0: Writing results to vdbbench_results/20260514_005609/milvus_benchmark_p0.csv
Process 0: Starting benchmark ...
Starting process 1...
Process 1 initialized



Calculating recall from per-worker JSONL files...
  Loaded ANN hits for 1000 unique query indices from 8 worker(s).
Calculating benchmark statistics...

============================================================
BENCHMARK SUMMARY
============================================================
Total Queries: 100000
Total Batches: 10000
Total Runtime: 46.55s

QUERY STATISTICS
------------------------------------------------------------
Mean Latency:      3.64 ms
Median Latency:    3.67 ms
P95 Latency:       3.98 ms
P99 Latency:       4.16 ms
P99.9 Latency:     4.52 ms
P99.99 Latency:    5.72 ms
Throughput:        2148.38 queries/second

BATCH STATISTICS
------------------------------------------------------------
Mean Batch Time:   36.40 ms
Median Batch Time: 36.74 ms
P95 Batch Time:    39.84 ms
P99 Batch Time:    41.64 ms
P99.9 Batch Time:  45.19 ms
P99.99 Batch Time: 57.24 ms
Max Batch Time:    97.02 ms
Batch Throughput:  27.47 batches/second

RECALL STATISTICS (recall@10)
------------------------------------------------------------
Mean Recall:       0.0090
Median Recall:     0.0000
Min Recall:        0.0000
Max Recall:        0.3000
P95 Recall:        0.1000
P99 Recall:        0.1000
Queries Evaluated: 1000

DISK I/O DURING BENCHMARK
------------------------------------------------------------
Total Read:        295.53 GB  (6501.54 MB/s,  832193 IOPS)
Total Write:       223.03 MB  (4.79 MB/s,  82 IOPS)

Per-Device Breakdown:
  nvme11n1:
    Read:  600.00 KB  (0.01 MB/s, 0 IOPS)
    Write: 3.82 MB  (0.08 MB/s, 8 IOPS)
  nvme11n1p3:
    Read:  600.00 KB  (0.01 MB/s, 0 IOPS)
    Write: 3.82 MB  (0.08 MB/s, 8 IOPS)
  nvme14n1:
    Read:  147.77 GB  (3250.75 MB/s, 416096 IOPS)
    Write: 105.79 MB  (2.27 MB/s, 26 IOPS)
  nvme14n1p1:
    Read:  147.77 GB  (3250.75 MB/s, 416096 IOPS)
    Write: 105.79 MB  (2.27 MB/s, 26 IOPS)
  dm-0:
    Read:  600.00 KB  (0.01 MB/s, 0 IOPS)
    Write: 3.82 MB  (0.08 MB/s, 14 IOPS)

Detailed results: vdbbench_results/20260514_005609
Recall details:   vdbbench_results/20260514_005609/recall_stats.json
============================================================

As the logs show, the test suite copies only 1% (10000) of the vectors to the new FLAT collection, yet the progress is reported as 100%.

**Root cause (guess)**
I think there might be an issue with the insert_data(collection, vectors, batch_size=10000) method in load_vdb.py.
The caller always passes the new chunk_vectors to insert_data instead of the whole vector set, so insert_data generates the ids for every chunk starting again from 0 (0 to 10000). As a result, the final collection contains many duplicated ids.
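
The suspected behaviour can be sketched as follows. This is a simplified simulation, not the actual code in load_vdb.py: the function `generate_ids` and its `offset_ids` flag are hypothetical, and the total is scaled down to 100000 to keep the example fast. It compares ids that restart at 0 for every chunk with ids that are offset by the chunk's starting position.

```python
def generate_ids(total, chunk_size, offset_ids):
    """Simulate id assignment for a chunked load.

    offset_ids=False mimics the suspected bug: every chunk numbers its rows
    from 0. offset_ids=True adds the chunk's start position to each id.
    """
    ids = []
    for start in range(0, total, chunk_size):
        n = min(chunk_size, total - start)
        base = start if offset_ids else 0
        ids.extend(range(base, base + n))
    return ids


buggy = generate_ids(total=100_000, chunk_size=10_000, offset_ids=False)
fixed = generate_ids(total=100_000, chunk_size=10_000, offset_ids=True)

print(len(buggy), len(set(buggy)))  # 100000 10000
print(len(fixed), len(set(fixed)))  # 100000 100000
```

If the guess is right, a possible fix would be for the caller to pass the running offset to insert_data, or for insert_data to receive the full vector set so that ids remain unique across chunks.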

For that reason, when we run enhanced_bench.py with --auto-create-flat, it counts and copies only 10000 vectors to the FLAT collection.
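
One way duplicated ids could shrink the copy is sketched below. It assumes the copy step is effectively keyed on the primary key, which has not been verified against enhanced_bench.py; `copy_by_primary_key` and the row layout are hypothetical, and the sizes are scaled down (100 chunks of 1000 rows) while keeping the same 1% ratio.

```python
def copy_by_primary_key(rows):
    """Copy (pk, vector) rows into a destination keyed by primary key.

    Rows sharing a primary key overwrite each other, so only one row per
    distinct key survives.
    """
    dest = {}
    for pk, vec in rows:
        dest[pk] = vec
    return dest


chunk_size, num_chunks = 1_000, 100

# Source rows whose ids restart at 0 in every chunk (the suspected bug).
# The "vector" is a placeholder tuple recording which chunk the row came from.
rows = [(i, (chunk, i)) for chunk in range(num_chunks) for i in range(chunk_size)]

dest = copy_by_primary_key(rows)

print(len(rows), len(dest))   # 100000 1000
print(len(dest) / len(rows))  # 0.01
```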
