This guide covers running benchmarks, interpreting results, and optimizing performance.
- Understanding benchmark_app
- Running Benchmarks
- Benchmark Parameters
- Performance Metrics
- Optimization Strategies
- Result Analysis
- Best Practices
## Understanding benchmark_app

benchmark_app is OpenVINO's official tool for measuring inference performance. It:
- Loads a model and runs inference
- Measures throughput and latency
- Supports various execution modes
- Provides detailed performance metrics
Key concepts:

- Throughput: inferences per second (FPS)
- Latency: time for a single inference (ms)
- Synchronous API: blocking, sequential execution
- Asynchronous API: non-blocking, parallel execution
- Inference request: a single inference operation
- Stream: a parallel execution pipeline
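As a rough, self-contained sketch of how these terms relate (the latency numbers are invented, not from any device): in synchronous mode one request runs at a time, so throughput is the inverse of latency; in asynchronous mode several in-flight requests overlap, so throughput can exceed that bound even though per-request latency usually grows.

```python
def sync_fps(latency_ms: float) -> float:
    """Throughput of a blocking loop: one request at a time."""
    return 1000.0 / latency_ms

def ideal_async_fps(latency_ms: float, nireq: int) -> float:
    """Upper bound with nireq perfectly overlapped requests."""
    return nireq * 1000.0 / latency_ms

print(f"sync:  {sync_fps(6.4):.1f} FPS")
print(f"async: {ideal_async_fps(6.4, 4):.1f} FPS (ideal, nireq=4)")
```

Real async speedups fall short of this ideal because requests contend for cores, caches, and memory bandwidth.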
## Running Benchmarks

```bash
# Simple benchmark with default settings
ovmobilebench run -c experiments/basic.yaml
```

Basic configuration:
```yaml
run:
  repeats: 3
  matrix:
    niter: [100]
    api: ["sync"]
    device: ["CPU"]
    threads: [4]
```

Full matrix configuration:

```yaml
run:
  repeats: 5
  warmup_runs: 2
  cooldown_sec: 30
  timeout_sec: 600
  matrix:
    niter: [100, 200, 500]
    api: ["sync", "async"]
    nireq: [1, 2, 4, 8]
    nstreams: ["1", "2", "AUTO"]
    device: ["CPU"]
    threads: [1, 2, 4, 8]
    infer_precision: ["FP32", "FP16", "INT8"]
```
```bash
# Run specific configuration
ovmobilebench run -c experiments/config.yaml \
  --filter "threads=4,nstreams=2"
```

## Benchmark Parameters

| Parameter | Description | Values | Impact |
|---|---|---|---|
| `niter` | Number of iterations | 50-10000 | Higher = more accurate |
| `api` | Execution API | sync, async | Async = better throughput |
| `nireq` | Inference requests | 1-16 | More = parallel execution |
| `nstreams` | Execution streams | 1, 2, AUTO | AUTO = OpenVINO optimizes |
| `device` | Target device | CPU, GPU, NPU | Hardware selection |
| `threads` | CPU threads | 1-N cores | Parallelism level |
Performance hints:

```yaml
run:
  matrix:
    hint: ["LATENCY", "THROUGHPUT", "CUMULATIVE_THROUGHPUT"]
```

- LATENCY: optimize for minimal latency
- THROUGHPUT: optimize for maximum FPS
- CUMULATIVE_THROUGHPUT: optimize for running multiple models simultaneously
Inference precision:

```yaml
run:
  matrix:
    infer_precision: ["FP32", "FP16", "INT8"]
```

- FP32: full precision (baseline)
- FP16: half precision (up to ~2x faster, minimal accuracy loss)
- INT8: 8-bit integers (up to ~4x faster, requires calibration)
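To see where part of the speedup comes from, compare the memory footprint of a single activation tensor per precision (the tensor shape is illustrative; actual savings depend on the model and device):

```python
# Bytes per element for each numeric precision
BYTES_PER_ELEMENT = {"FP32": 4, "FP16": 2, "INT8": 1}

# One hypothetical 1x3x224x224 input tensor
elements = 1 * 3 * 224 * 224

for precision, nbytes in BYTES_PER_ELEMENT.items():
    kib = elements * nbytes / 1024
    print(f"{precision}: {kib:.0f} KiB")
```

Halving the bytes per element halves the memory traffic, which is why lower precision helps most on bandwidth-bound mobile CPUs.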
## Performance Metrics

Example throughput output:

```text
Throughput: 156.23 FPS
```

- Total inferences per second
- Higher is better
- Best for batch processing

Example latency output:

```text
Latency:
  Median: 6.4 ms
  Average: 6.5 ms
  Min: 5.8 ms
  Max: 8.2 ms
```

- Time per inference
- Lower is better
- Critical for real-time applications
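A latency summary like the one above can be reproduced from raw per-iteration timings with the standard library (the sample values below are invented):

```python
import statistics

# Hypothetical per-iteration latencies in milliseconds
latencies_ms = [6.1, 6.4, 6.3, 6.5, 5.8, 8.2, 6.4, 6.6]

print(f"Median: {statistics.median(latencies_ms):.1f} ms")
print(f"Average: {statistics.mean(latencies_ms):.1f} ms")
print(f"Min: {min(latencies_ms):.1f} ms")
print(f"Max: {max(latencies_ms):.1f} ms")
```

Note how the single 8.2 ms outlier pulls the average above the median; this is why the median is the more robust headline number.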
Monitor the device during a run:

```bash
# Monitor during benchmark
adb shell top -d 1 | grep benchmark_app

# Check memory consumption
adb shell dumpsys meminfo | grep benchmark_app

# Battery stats (Android)
adb shell dumpsys batterystats
```

## Optimization Strategies

Find the optimal thread count:
```yaml
run:
  matrix:
    threads: [1, 2, 3, 4, 5, 6, 7, 8]
```

Analysis:
```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('results.csv')
thread_perf = df.groupby('threads')['throughput_fps'].mean()

thread_perf.plot(kind='line', marker='o')
plt.xlabel('Threads')
plt.ylabel('Throughput (FPS)')
plt.title('Performance vs Thread Count')
plt.show()
```

Tune streams and inference requests together:

```yaml
run:
  matrix:
    nstreams: ["1", "2", "4", "AUTO"]
    nireq: [1, 2, 4, 8]
```

Best practices:
- nstreams ≤ number of CPU cores
- nireq ≥ nstreams for async mode
- Use AUTO for automatic optimization
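These rules of thumb can be encoded as a quick sanity check before launching a long matrix run. This is only a sketch; the function name and warning messages are made up and not part of ovmobilebench:

```python
import os

def check_stream_config(nstreams, nireq, api, cores=None):
    """Flag configurations that break the rules of thumb above."""
    cores = cores or os.cpu_count() or 1
    warnings = []
    if nstreams > cores:
        warnings.append(f"nstreams={nstreams} exceeds {cores} CPU cores")
    if api == "async" and nireq < nstreams:
        warnings.append(f"nireq={nireq} < nstreams={nstreams} may starve streams")
    return warnings

print(check_stream_config(nstreams=2, nireq=4, api="async", cores=8))  # []
print(check_stream_config(nstreams=4, nireq=2, api="async", cores=2))
```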
Batch size tuning:

```yaml
models:
  - name: "model"
    path: "model.xml"
    batch_size: [1, 2, 4, 8, 16]
run:
  matrix:
    batch: [1, 2, 4, 8, 16]
```

Trade-offs:

- Larger batch = better throughput
- Larger batch = higher latency
- Memory constraints limit the maximum batch size
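A toy cost model makes the trade-off concrete: if each inference call pays a fixed overhead plus a per-item cost (both numbers below are invented), throughput climbs with batch size while latency climbs too.

```python
def batch_metrics(batch, per_item_ms=5.0, overhead_ms=3.0):
    """Latency and throughput for a hypothetical fixed-overhead model."""
    latency_ms = overhead_ms + batch * per_item_ms
    throughput_fps = batch * 1000.0 / latency_ms
    return latency_ms, throughput_fps

for batch in [1, 2, 4, 8, 16]:
    latency, fps = batch_metrics(batch)
    print(f"batch={batch:2d}: latency={latency:6.1f} ms, throughput={fps:6.1f} FPS")
```

The throughput gain shrinks as the fixed overhead becomes a smaller fraction of each call, which is the "diminishing returns" pattern seen in real runs.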
Memory optimization:

```yaml
run:
  advanced:
    cache_dir: "/data/local/tmp/cache"
    enable_mmap: true
    memory_reuse: true
```

CPU affinity:

```bash
# Pin to big cores (Android)
adb shell "taskset 0xF0 benchmark_app ..."

# Pin to specific cores (Linux)
taskset -c 4-7 benchmark_app ...
```

## Result Analysis

```python
import pandas as pd
import numpy as np

# Load results
df = pd.read_csv('results.csv')

# Group by configuration
grouped = df.groupby(['model', 'threads', 'nstreams'])

# Calculate statistics
stats = grouped['throughput_fps'].agg([
    'mean',
    'median',
    'std',
    ('cv', lambda x: x.std() / x.mean()),  # Coefficient of variation
    ('q25', lambda x: x.quantile(0.25)),
    ('q75', lambda x: x.quantile(0.75))
])
print(stats)
```
Compare configurations:

```python
# Speedup and efficiency relative to the single-thread baseline
baseline = df[df['threads'] == 1]['throughput_fps'].median()
for threads in [2, 4, 8]:
    perf = df[df['threads'] == threads]['throughput_fps'].median()
    speedup = perf / baseline
    efficiency = speedup / threads
    print(f"Threads={threads}: Speedup={speedup:.2f}x, Efficiency={efficiency:.2%}")
```

Visualization:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Heatmap of performance
pivot = df.pivot_table(
    values='throughput_fps',
    index='threads',
    columns='nstreams',
    aggfunc='median'
)
plt.figure(figsize=(10, 6))
sns.heatmap(pivot, annot=True, fmt='.1f', cmap='YlOrRd')
plt.title('Performance Heatmap: Threads vs Streams')
plt.show()
```

Regression detection:

```python
def detect_regression(current, baseline, threshold=-0.05):
    """Detect a performance regression relative to a baseline."""
    change = (current - baseline) / baseline
    if change < threshold:
        return True, change
    return False, change

# Example usage
baseline_fps = 100.0
current_fps = 92.0
is_regression, change = detect_regression(current_fps, baseline_fps)
if is_regression:
    print(f"REGRESSION: {change:.1%} drop in performance")
```
## Best Practices

1. Control variables
   - Vary one parameter at a time
   - Document all fixed parameters
   - Record environmental conditions
2. Statistical validity
   - Run at least 3 repetitions
   - Discard warmup runs
   - Use the median instead of the mean
3. Fair comparison
   - Same model, same input
   - Same device state
   - Same thermal conditions
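Applied together, the statistical rules above look like this (the FPS values are invented):

```python
import statistics

# Hypothetical throughput from 5 repetitions; the first run is warmup
runs_fps = [88.1, 95.2, 96.0, 94.8, 95.5]

measured = runs_fps[1:]                 # discard the warmup run
reported = statistics.median(measured)  # median resists outliers
print(f"Reported throughput: {reported:.1f} FPS")
```

The 88.1 FPS warmup run would drag a naive mean down by several percent; discarding it and taking the median keeps one slow or fast outlier from skewing the reported number.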
Benchmark workflow:

```mermaid
graph TD
    A[Prepare Device] --> B[Warmup Run]
    B --> C[Cooldown]
    C --> D[Benchmark Run]
    D --> E[Collect Metrics]
    E --> F{More Runs?}
    F -->|Yes| C
    F -->|No| G[Analyze Results]
```
Latency-optimized configuration:

```yaml
run:
  matrix:
    api: ["sync"]
    nireq: [1]
    nstreams: ["1"]
    hint: ["LATENCY"]
```

Throughput-optimized configuration:

```yaml
run:
  matrix:
    api: ["async"]
    nireq: [8]
    nstreams: ["AUTO"]
    hint: ["THROUGHPUT"]
```

Power-efficient mobile configuration:

```yaml
run:
  matrix:
    threads: [2]  # Use efficiency cores
    nstreams: ["1"]
    device: ["CPU"]
```

## Benchmark Report
### Configuration
- Model: ResNet-50
- Device: Snapdragon 888
- Precision: FP16
- Batch Size: 1
### Results
| Threads | Streams | Throughput (FPS) | Latency (ms) |
|---------|---------|------------------|--------------|
| 1 | 1 | 25.3 | 39.5 |
| 4 | 2 | 78.2 | 12.8 |
| 8 | AUTO | 95.6 | 10.5 |
### Analysis
- Optimal configuration: 8 threads, AUTO streams
- Linear scaling up to 4 threads
- Diminishing returns beyond 4 threads

Detect thermal throttling:

```python
import numpy as np

def detect_throttling(fps_over_time):
    """Detect performance degradation over time."""
    first_quarter = np.mean(fps_over_time[:len(fps_over_time)//4])
    last_quarter = np.mean(fps_over_time[-len(fps_over_time)//4:])
    degradation = (first_quarter - last_quarter) / first_quarter
    if degradation > 0.1:  # 10% drop
        return True, degradation
    return False, degradation
```

Stabilize noisy results:

```yaml
run:
  warmup_runs: 5  # Increase for stable results
  matrix:
    niter: [200]  # Ensure enough iterations
```

```bash
# Check for background processes
adb shell ps -A | grep -v idle

# Monitor CPU frequency
adb shell "while true; do
  cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
  sleep 1
done"
```

Multi-model benchmarking:

```yaml
models:
  - name: "resnet50"
    path: "models/resnet50.xml"
  - name: "mobilenet"
    path: "models/mobilenet.xml"
  - name: "yolo"
    path: "models/yolo.xml"
run:
  multi_model_mode: "sequential"  # or "parallel"
```

Dynamic input shapes:

```yaml
models:
  - name: "model"
    path: "model.xml"
    dynamic_shapes:
      input: [1, 3, -1, -1]  # Dynamic H, W
run:
  matrix:
    input_shape:
      - [1, 3, 224, 224]
      - [1, 3, 416, 416]
      - [1, 3, 640, 640]
```

Custom efficiency metrics:

```python
import pandas as pd

def calculate_efficiency_metrics(df):
    """Calculate custom efficiency metrics."""
    metrics = {}
    # FPS per thread
    metrics['fps_per_thread'] = df['throughput_fps'] / df['threads']
    # FPS per watt (if power data available)
    if 'power_w' in df.columns:
        metrics['fps_per_watt'] = df['throughput_fps'] / df['power_w']
    # Latency consistency
    if 'latency_std' in df.columns:
        metrics['latency_cv'] = df['latency_std'] / df['latency_avg']
    return pd.DataFrame(metrics)
```

Low performance:

- Check thermal state
- Verify CPU frequency
- Ensure proper thread affinity
- Check memory availability
- Verify model optimization
Inconsistent results:

- Increase warmup runs
- Add cooldown periods
- Disable background apps
- Use performance governor
- Pin CPU frequency
Crashes or failures:

- Check memory limits
- Verify model compatibility
- Reduce batch size
- Check library dependencies
- Review device logs
See also:

- CI/CD Integration - Automated benchmarking
- API Reference - Programming interface
- Troubleshooting - Common issues