Closed

22 commits
- `0c2561d` Add initial KV Cache benchmark implementation for MLPerf Storage v3 (hazemawadalla, Nov 21, 2025)
- `073fe61` feat: Replace legacy spillover logic with Waterfall LRU architecture (hazemawadalla, Dec 9, 2025)
- `2eb39cf` Fix two runtime errors in RAG-enabled benchmark mode (hazemawadalla, Dec 19, 2025)
- `f78bf60` Add detailed README.md for running the different invocations of kv-ca… (hazemawadalla, Dec 19, 2025)
- `2464edf` fix: line endings from dos2unix; increase cpu memory to 4GB for mlper… (hazemawadalla, Dec 19, 2025)
- `70b8f69` Update MLperf v3 KV cache proposal.md to recommend using a minimum of… (hazemawadalla, Dec 19, 2025)
- `9e60b98` Add storage throughput metric, ShareGPT integration, LMCache validati… (hazemawadalla, Jan 10, 2026)
- `db82626` Update MLPerf v3 submission guidelines with discovery test validation (hazemawadalla, Jan 13, 2026)
- `f1ff963` Improve test suite with HTML reporting and flexible tier assertions (hazemawadalla, Jan 13, 2026)
- `e016954` Add pytest-html dependency for HTML test reports (hazemawadalla, Jan 13, 2026)
- `c1e5ff7` Add unit test HTML report showing all 112 tests passing (hazemawadalla, Jan 13, 2026)
- `e995340` Update NVMe Bandwidth specification to 14,000 MB/s (hazemawadalla, Jan 13, 2026)
- `bad674c` Fix KV cache size per token values in discovery doc (hazemawadalla, Jan 13, 2026)
- `2159bef` Merge pull request #224 from hazemawadalla/TF_KVCache (FileSystemGuy, Jan 13, 2026)
- `fafd1c6` allow claude to bypass cla check (#234) (BarnacleBob, Feb 9, 2026)
- `549c6a8` Remove unused imports and ShareGPT dataset loader (FileSystemGuy, Feb 13, 2026)
- `fdd95fd` Revise KV Cache Benchmark script for MLPerf updates (FileSystemGuy, Feb 13, 2026)
- `01ca824` Revise README for KV Cache benchmark implementation (FileSystemGuy, Feb 13, 2026)
- `71a79cb` Enhance KV cache benchmark with ShareGPT integration (FileSystemGuy, Feb 13, 2026)
- `9d3e0bd` Update allowlist format in CLA workflow (FileSystemGuy, Feb 13, 2026)
- `56b969b` Merge pull request #238 from mlcommons/FileSystemGuy-KVCache-revert (FileSystemGuy, Feb 15, 2026)
- `5b991b4` Merge pull request #239 from mlcommons/FileSystemGuy-claudebot (FileSystemGuy, Feb 17, 2026)
2 changes: 1 addition & 1 deletion .github/workflows/cla.yml

```diff
@@ -22,7 +22,7 @@ jobs:
       path-to-signatures: 'cla-bot/v1/cla.json'
       # branch should not be protected
       branch: 'main'
-      allowlist: user1,bot*
+      allowlist: user1,claude[bot],claude,bot*
       remote-organization-name: mlcommons
       remote-repository-name: systems
```
1,204 changes: 1,204 additions & 0 deletions kv_cache_benchmark/MLperf v3 KV cache proposal.md

Large diffs are not rendered by default.

Binary file not shown.
39 changes: 39 additions & 0 deletions kv_cache_benchmark/README.md
@@ -0,0 +1,39 @@
# MLPerf Storage KV Cache Benchmark

This directory contains the initial implementation of the KV Cache benchmark for MLPerf Storage v3.

## Overview

The KV Cache benchmark simulates the storage access patterns of Large Language Model (LLM) inference systems, specifically focusing on key-value cache operations that are critical for multi-turn conversations and long-context processing.

## Components

### Core Scripts

- **kv-cache.py**: Main benchmark implementation for KV cache storage performance testing
- **kv-cache_sharegpt_replay.py**: ShareGPT conversation replay-based benchmark for realistic workload simulation
- **kv-cache-wrapper.sh**: Wrapper script for running benchmark configurations
- **validate.sh**: Validation script for benchmark results

### Documentation

- **MLperf v3 KV cache proposal.md**: Detailed proposal for KV cache benchmark integration into MLPerf Storage
- **MLperf v3 KV cache proposal.pdf**: PDF version of the proposal
- **sources.md**: References and source documentation

## Purpose

This benchmark addresses the growing need to measure storage system performance under AI/ML inference workloads, particularly:

- Key-value cache read/write patterns
- Mixed sequential and random access patterns
- Multi-threaded concurrent access
- Realistic conversation-based workload replay
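The per-token KV footprint that drives these read/write patterns can be estimated from model shape alone. A minimal sketch of the standard transformer KV arithmetic; the layer/head counts in the comments are the commonly published values for these model families, assumed here for illustration rather than taken from this benchmark:

```python
# Estimate KV cache bytes per token from model shape.
# bytes/token = 2 (K and V) * layers * kv_heads * head_dim * dtype_bytes
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """KV cache footprint per token, assuming fp16/bf16 (2 bytes) by default."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

# Commonly published shapes (assumed for illustration):
print(kv_bytes_per_token(32, 32, 128))  # llama2-7b (MHA): 512 KiB/token
print(kv_bytes_per_token(32, 8, 128))   # llama3.1-8b (GQA): 128 KiB/token
print(kv_bytes_per_token(80, 8, 128))   # llama3.1-70b (GQA): 320 KiB/token
```

Grouped-query attention (GQA) is why the 8B and 70B models write far fewer KV bytes per token than llama2-7b despite having more parameters.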

## Getting Started

See the proposal documents for detailed information about the benchmark design, metrics, and validation criteria.

## Status

Initial implementation; work in progress for MLPerf Storage v3.0.
@@ -0,0 +1,91 @@
## Recommended Invocations by Model

### Why Two Invocations (cpu_mem=0 vs cpu_mem=4)?

| cpu_mem | Purpose | Primary Metric | Why |
| -------- | -------------------------------- | ---------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------- |
| **0 GB** | **Maximum Storage Stress** | Decode Bytes Read, Wall-Clock Throughput | All I/O goes through NVMe. 4x more read traffic. True test of storage bandwidth. |
| **4 GB** | **Storage Throughput Benchmark** | Storage Throughput (tok/s) | Some data cached in RAM. Storage Throughput metric works correctly (2.2x ratio). More representative of production inference workloads. |
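
The `cpu_mem` knob can be pictured as a bounded RAM tier sitting in front of NVMe: hits are absorbed by RAM, misses fall through to storage. A minimal sketch of that idea (illustrative only; this is not the benchmark's actual Waterfall LRU implementation):

```python
# Two-tier read path sketch: a bounded LRU RAM cache in front of NVMe.
# With a 0-byte RAM tier (cpu_mem=0), every read falls through to storage.
from collections import OrderedDict

class TwoTierCache:
    def __init__(self, ram_capacity_bytes: int):
        self.capacity = ram_capacity_bytes
        self.used = 0
        self.ram = OrderedDict()   # key -> size, ordered oldest-first (LRU)
        self.nvme_reads = 0        # reads that had to hit storage

    def read(self, key: str, size: int) -> None:
        if key in self.ram:
            self.ram.move_to_end(key)   # RAM hit: refresh LRU position
            return
        self.nvme_reads += 1            # RAM miss: read from NVMe ...
        while self.used + size > self.capacity and self.ram:
            _, evicted = self.ram.popitem(last=False)   # evict LRU entry
            self.used -= evicted
        if size <= self.capacity:       # ... then cache it if it can fit
            self.ram[key] = size
            self.used += size
```

With `ram_capacity_bytes=0` every repeat read counts against `nvme_reads`, which is why the cpu_mem=0 invocation maximizes storage stress.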

---

### llama2-7b

| Parameter | cpu_mem=0 (Storage Stress) | cpu_mem=4 (Throughput) |
| ------------------------- | -------------------------- | ---------------------- |
| `--cpu-memory-gb` | **0** | **4** |
| `--max-concurrent-allocs` | **0** | **4** |
| `--users` | **150** | **200** |
| `--duration` | **300** | **300** |
| `--generation-mode` | **none** | **none** |
| **Expected Ratio** | WC Tput: **4.64x** | Stor Tput: **2.34x** |

```bash
# llama2-7b: Storage Stress (cpu_mem=0)
python kv-cache.py --model llama2-7b --cpu-memory-gb 0 --max-concurrent-allocs 0 --users 150 --duration 300 --generation-mode none --output results/llama2-7b_stress_trial${N}.json

# llama2-7b: Throughput Benchmark (cpu_mem=4)
python kv-cache.py --model llama2-7b --cpu-memory-gb 4 --max-concurrent-allocs 4 --users 200 --duration 300 --generation-mode none --output results/llama2-7b_tput_trial${N}.json
```

---

### llama3.1-8b

| Parameter | cpu_mem=0 (Storage Stress) | cpu_mem=4 (Throughput) |
|-----------|---------------------------|------------------------|
| `--cpu-memory-gb` | **0** | **4** |
| `--max-concurrent-allocs` | **0** | **0** |
| `--users` | **200** | **150** |
| `--duration` | **300** | **300** |
| `--generation-mode` | **none** | **none** |
| **Expected Ratio** | WC Tput: **2.70x** | Stor Tput: **2.87x** |

```bash
# llama3.1-8b: Storage Stress (cpu_mem=0)
python kv-cache.py --model llama3.1-8b --cpu-memory-gb 0 --max-concurrent-allocs 0 --users 200 --duration 300 --generation-mode none --output results/llama3.1-8b_stress_trial${N}.json

# llama3.1-8b: Throughput Benchmark (cpu_mem=4)
python kv-cache.py --model llama3.1-8b --cpu-memory-gb 4 --max-concurrent-allocs 0 --users 150 --duration 300 --generation-mode none --output results/llama3.1-8b_tput_trial${N}.json
```

---

### llama3.1-70b-instruct

| Parameter | cpu_mem=0 (Storage Stress) | cpu_mem=4 (Throughput) |
|-----------|---------------------------|------------------------|
| `--cpu-memory-gb` | **0** | **4** |
| `--max-concurrent-allocs` | **0** | **4** |
| `--users` | **70** | **20** |
| `--duration` | **300** | **300** |
| `--generation-mode` | **none** | **none** |
| **Expected Ratio** | WC Tput: **2.44x** | Stor Tput: **3.25x** |

```bash
# llama3.1-70b: Storage Stress (cpu_mem=0)
python kv-cache.py --model llama3.1-70b-instruct --cpu-memory-gb 0 --max-concurrent-allocs 0 --users 70 --duration 300 --generation-mode none --output results/llama3.1-70b_stress_trial${N}.json

# llama3.1-70b: Throughput Benchmark (cpu_mem=4)
python kv-cache.py --model llama3.1-70b-instruct --cpu-memory-gb 4 --max-concurrent-allocs 4 --users 20 --duration 300 --generation-mode none --output results/llama3.1-70b_tput_trial${N}.json
```

---

## Summary Table

| Model | Invocation | cpu_mem | mca | users | Primary Metric | Expected Ratio |
|-------|------------|---------|-----|-------|----------------|----------------|
| **llama2-7b** | Stress | 0 | 0 | 150 | WC Throughput | 4.64x |
| **llama2-7b** | Tput | 4 | 4 | 200 | Stor Throughput | 2.34x |
| **llama3.1-8b** | Stress | 0 | 0 | 200 | WC Throughput | 2.70x |
| **llama3.1-8b** | Tput | 4 | 0 | 150 | Stor Throughput | 2.87x |
| **llama3.1-70b** | Stress | 0 | 0 | 70 | WC Throughput | 2.44x |
| **llama3.1-70b** | Tput | 4 | 4 | 20 | Stor Throughput | 3.25x |

**Notes:**
- **The 70b model uses fewer users** because its larger KV cache consumes more memory per request.
- **mca=0 is often best at cpu_mem=0** (no allocation throttling when the run is fully I/O-bound).
- **mca=4 is often best at cpu_mem=4** (moderate throttling improves throughput).
- **gen_mode=none** gives a pure storage benchmark (no simulated token-generation delays).
- **Run 3-5 trials** and report the median.
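
To aggregate trials, a small post-processing script can compute the median from the emitted JSON files. A sketch under one assumption: that the output JSON exposes a top-level throughput field (the key name `storage_throughput_tok_s` below is illustrative, not the benchmark's documented schema; adapt it to whatever kv-cache.py actually writes):

```python
# Aggregate trial JSONs (e.g. results/llama2-7b_tput_trial*.json) and
# report the median of one metric across trials.
import glob
import json
import statistics

def median_metric(pattern: str, key: str) -> float:
    """Load every trial JSON matching pattern and return the median of key."""
    values = []
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            values.append(json.load(f)[key])
    return statistics.median(values)

# Example usage (key name is an assumption):
#   median_metric("results/llama2-7b_tput_trial*.json",
#                 "storage_throughput_tok_s")
```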