Closed

22 commits
- `0c2561d` Add initial KV Cache benchmark implementation for MLPerf Storage v3 (hazemawadalla, Nov 21, 2025)
- `073fe61` feat: Replace legacy spillover logic with Waterfall LRU architecture (hazemawadalla, Dec 9, 2025)
- `2eb39cf` Fix two runtime errors in RAG-enabled benchmark mode (hazemawadalla, Dec 19, 2025)
- `f78bf60` Add detailed README.md for running the different invocations of kv-ca… (hazemawadalla, Dec 19, 2025)
- `2464edf` fix: line endings from dos2unix; increase cpu memory to 4GB for mlper… (hazemawadalla, Dec 19, 2025)
- `70b8f69` Update MLperf v3 KV cache proposal.md to recommend using a minimum of… (hazemawadalla, Dec 19, 2025)
- `9e60b98` Add storage throughput metric, ShareGPT integration, LMCache validati… (hazemawadalla, Jan 10, 2026)
- `db82626` Update MLPerf v3 submission guidelines with discovery test validation (hazemawadalla, Jan 13, 2026)
- `f1ff963` Improve test suite with HTML reporting and flexible tier assertions (hazemawadalla, Jan 13, 2026)
- `e016954` Add pytest-html dependency for HTML test reports (hazemawadalla, Jan 13, 2026)
- `c1e5ff7` Add unit test HTML report showing all 112 tests passing (hazemawadalla, Jan 13, 2026)
- `e995340` Update NVMe Bandwidth specification to 14,000 MB/s (hazemawadalla, Jan 13, 2026)
- `bad674c` Fix KV cache size per token values in discovery doc (hazemawadalla, Jan 13, 2026)
- `2159bef` Merge pull request #224 from hazemawadalla/TF_KVCache (FileSystemGuy, Jan 13, 2026)
- `fafd1c6` allow claude to bypass cla check (#234) (BarnacleBob, Feb 9, 2026)
- `549c6a8` Remove unused imports and ShareGPT dataset loader (FileSystemGuy, Feb 13, 2026)
- `fdd95fd` Revise KV Cache Benchmark script for MLPerf updates (FileSystemGuy, Feb 13, 2026)
- `01ca824` Revise README for KV Cache benchmark implementation (FileSystemGuy, Feb 13, 2026)
- `71a79cb` Enhance KV cache benchmark with ShareGPT integration (FileSystemGuy, Feb 13, 2026)
- `9d3e0bd` Update allowlist format in CLA workflow (FileSystemGuy, Feb 13, 2026)
- `56b969b` Merge pull request #238 from mlcommons/FileSystemGuy-KVCache-revert (FileSystemGuy, Feb 15, 2026)
- `5b991b4` Merge pull request #239 from mlcommons/FileSystemGuy-claudebot (FileSystemGuy, Feb 17, 2026)
2 changes: 1 addition & 1 deletion .github/workflows/cla.yml

```diff
@@ -22,7 +22,7 @@ jobs:
       path-to-signatures: 'cla-bot/v1/cla.json'
       # branch should not be protected
       branch: 'main'
-      allowlist: user1,bot*
+      allowlist: user1,claude[bot],claude,bot*
       remote-organization-name: mlcommons
       remote-repository-name: systems
```
1,204 changes: 1,204 additions & 0 deletions kv_cache_benchmark/MLperf v3 KV cache proposal.md

Large diffs are not rendered by default.

Binary file not shown.
39 changes: 39 additions & 0 deletions kv_cache_benchmark/README.md
@@ -0,0 +1,39 @@
# MLPerf Storage KV Cache Benchmark

This directory contains the initial implementation of the KV Cache benchmark for MLPerf Storage v3.

## Overview

The KV Cache benchmark simulates the storage access patterns of Large Language Model (LLM) inference systems, specifically focusing on key-value cache operations that are critical for multi-turn conversations and long-context processing.

## Components

### Core Scripts

- **kv-cache.py**: Main benchmark implementation for KV cache storage performance testing
- **kv-cache_sharegpt_replay.py**: ShareGPT conversation replay-based benchmark for realistic workload simulation
- **kv-cache-wrapper.sh**: Wrapper script for running benchmark configurations
- **validate.sh**: Validation script for benchmark results

### Documentation

- **MLperf v3 KV cache proposal.md**: Detailed proposal for KV cache benchmark integration into MLPerf Storage
- **MLperf v3 KV cache proposal.pdf**: PDF version of the proposal
- **sources.md**: References and source documentation

## Purpose

This benchmark addresses the growing need to measure storage system performance under AI/ML inference workloads, particularly:

- Key-value cache read/write patterns
- Mixed sequential and random access patterns
- Multi-threaded concurrent access
- Realistic conversation-based workload replay
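The per-token KV footprint that drives these read/write patterns can be estimated from model shape alone. A minimal sketch of the standard transformer KV arithmetic; the layer/head counts in the comments are the commonly published values for these model families, assumed here for illustration rather than taken from this benchmark:

```python
# Estimate KV cache bytes per token from model shape.
# bytes/token = 2 (K and V) * layers * kv_heads * head_dim * dtype_bytes
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """KV cache footprint per token, assuming fp16/bf16 (2 bytes) by default."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

# Commonly published shapes (assumed for illustration):
print(kv_bytes_per_token(32, 32, 128))  # llama2-7b (MHA): 512 KiB/token
print(kv_bytes_per_token(32, 8, 128))   # llama3.1-8b (GQA): 128 KiB/token
print(kv_bytes_per_token(80, 8, 128))   # llama3.1-70b (GQA): 320 KiB/token
```

Grouped-query attention (GQA) is why the 8B and 70B models write far fewer KV bytes per token than llama2-7b despite having more parameters.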

## Getting Started

See the proposal documents for detailed information about the benchmark design, metrics, and validation criteria.

## Status

Initial implementation; work in progress for MLPerf Storage v3.0.
@@ -0,0 +1,91 @@
## Recommended Invocations by Model

### Why Two Invocations (cpu_mem=0 vs cpu_mem=4)?

| cpu_mem | Purpose | Primary Metric | Why |
| -------- | -------------------------------- | ---------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------- |
| **0 GB** | **Maximum Storage Stress** | Decode Bytes Read, Wall-Clock Throughput | All I/O goes through NVMe. 4x more read traffic. True test of storage bandwidth. |
| **4 GB** | **Storage Throughput Benchmark** | Storage Throughput (tok/s) | Some data cached in RAM. Storage Throughput metric works correctly (2.2x ratio). More representative of production inference workloads. |
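
The `cpu_mem` knob can be pictured as a bounded RAM tier sitting in front of NVMe: hits are absorbed by RAM, misses fall through to storage. A minimal sketch of that idea (illustrative only; this is not the benchmark's actual Waterfall LRU implementation):

```python
# Two-tier read path sketch: a bounded LRU RAM cache in front of NVMe.
# With a 0-byte RAM tier (cpu_mem=0), every read falls through to storage.
from collections import OrderedDict

class TwoTierCache:
    def __init__(self, ram_capacity_bytes: int):
        self.capacity = ram_capacity_bytes
        self.used = 0
        self.ram = OrderedDict()   # key -> size, ordered oldest-first (LRU)
        self.nvme_reads = 0        # reads that had to hit storage

    def read(self, key: str, size: int) -> None:
        if key in self.ram:
            self.ram.move_to_end(key)   # RAM hit: refresh LRU position
            return
        self.nvme_reads += 1            # RAM miss: read from NVMe ...
        while self.used + size > self.capacity and self.ram:
            _, evicted = self.ram.popitem(last=False)   # evict LRU entry
            self.used -= evicted
        if size <= self.capacity:       # ... then cache it if it can fit
            self.ram[key] = size
            self.used += size
```

With `ram_capacity_bytes=0` every repeat read counts against `nvme_reads`, which is why the cpu_mem=0 invocation maximizes storage stress.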

---

### llama2-7b

| Parameter | cpu_mem=0 (Storage Stress) | cpu_mem=4 (Throughput) |
| ------------------------- | -------------------------- | ---------------------- |
| `--cpu-memory-gb` | **0** | **4** |
| `--max-concurrent-allocs` | **0** | **4** |
| `--users` | **150** | **200** |
| `--duration` | **300** | **300** |
| `--generation-mode` | **none** | **none** |
| **Expected Ratio** | WC Tput: **4.64x** | Stor Tput: **2.34x** |

```bash
# llama2-7b: Storage Stress (cpu_mem=0)
python kv-cache.py --model llama2-7b --cpu-memory-gb 0 --max-concurrent-allocs 0 --users 150 --duration 300 --generation-mode none --output results/llama2-7b_stress_trial${N}.json

# llama2-7b: Throughput Benchmark (cpu_mem=4)
python kv-cache.py --model llama2-7b --cpu-memory-gb 4 --max-concurrent-allocs 4 --users 200 --duration 300 --generation-mode none --output results/llama2-7b_tput_trial${N}.json
```

---

### llama3.1-8b

| Parameter | cpu_mem=0 (Storage Stress) | cpu_mem=4 (Throughput) |
|-----------|---------------------------|------------------------|
| `--cpu-memory-gb` | **0** | **4** |
| `--max-concurrent-allocs` | **0** | **0** |
| `--users` | **200** | **150** |
| `--duration` | **300** | **300** |
| `--generation-mode` | **none** | **none** |
| **Expected Ratio** | WC Tput: **2.70x** | Stor Tput: **2.87x** |

```bash
# llama3.1-8b: Storage Stress (cpu_mem=0)
python kv-cache.py --model llama3.1-8b --cpu-memory-gb 0 --max-concurrent-allocs 0 --users 200 --duration 300 --generation-mode none --output results/llama3.1-8b_stress_trial${N}.json

# llama3.1-8b: Throughput Benchmark (cpu_mem=4)
python kv-cache.py --model llama3.1-8b --cpu-memory-gb 4 --max-concurrent-allocs 0 --users 150 --duration 300 --generation-mode none --output results/llama3.1-8b_tput_trial${N}.json
```

---

### llama3.1-70b-instruct

| Parameter | cpu_mem=0 (Storage Stress) | cpu_mem=4 (Throughput) |
|-----------|---------------------------|------------------------|
| `--cpu-memory-gb` | **0** | **4** |
| `--max-concurrent-allocs` | **0** | **4** |
| `--users` | **70** | **20** |
| `--duration` | **300** | **300** |
| `--generation-mode` | **none** | **none** |
| **Expected Ratio** | WC Tput: **2.44x** | Stor Tput: **3.25x** |

```bash
# llama3.1-70b: Storage Stress (cpu_mem=0)
python kv-cache.py --model llama3.1-70b-instruct --cpu-memory-gb 0 --max-concurrent-allocs 0 --users 70 --duration 300 --generation-mode none --output results/llama3.1-70b_stress_trial${N}.json

# llama3.1-70b: Throughput Benchmark (cpu_mem=4)
python kv-cache.py --model llama3.1-70b-instruct --cpu-memory-gb 4 --max-concurrent-allocs 4 --users 20 --duration 300 --generation-mode none --output results/llama3.1-70b_tput_trial${N}.json
```

---

## Summary Table

| Model | Invocation | cpu_mem | mca | users | Primary Metric | Expected Ratio |
|-------|------------|---------|-----|-------|----------------|----------------|
| **llama2-7b** | Stress | 0 | 0 | 150 | WC Throughput | 4.64x |
| **llama2-7b** | Tput | 4 | 4 | 200 | Stor Throughput | 2.34x |
| **llama3.1-8b** | Stress | 0 | 0 | 200 | WC Throughput | 2.70x |
| **llama3.1-8b** | Tput | 4 | 0 | 150 | Stor Throughput | 2.87x |
| **llama3.1-70b** | Stress | 0 | 0 | 70 | WC Throughput | 2.44x |
| **llama3.1-70b** | Tput | 4 | 4 | 20 | Stor Throughput | 3.25x |

**Notes:**
- **The 70b model uses fewer users** because its larger KV cache consumes more memory per request.
- **mca=0 is often best at cpu_mem=0** (no allocation throttling when the run is fully I/O-bound).
- **mca=4 is often best at cpu_mem=4** (moderate throttling improves throughput).
- **gen_mode=none** gives a pure storage benchmark (no simulated token-generation delays).
- **Run 3-5 trials** and report the median.
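
To aggregate trials, a small post-processing script can compute the median from the emitted JSON files. A sketch under one assumption: that the output JSON exposes a top-level throughput field (the key name `storage_throughput_tok_s` below is illustrative, not the benchmark's documented schema; adapt it to whatever kv-cache.py actually writes):

```python
# Aggregate trial JSONs (e.g. results/llama2-7b_tput_trial*.json) and
# report the median of one metric across trials.
import glob
import json
import statistics

def median_metric(pattern: str, key: str) -> float:
    """Load every trial JSON matching pattern and return the median of key."""
    values = []
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            values.append(json.load(f)[key])
    return statistics.median(values)

# Example usage (key name is an assumption):
#   median_metric("results/llama2-7b_tput_trial*.json",
#                 "storage_throughput_tok_s")
```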