This guide covers running benchmarks, interpreting results, and optimizing performance.
- Understanding benchmark_app
- Running Benchmarks
- Benchmark Parameters
- Performance Metrics
- Optimization Strategies
- Result Analysis
- Best Practices
## Understanding benchmark_app

benchmark_app is OpenVINO's official tool for measuring inference performance. It:
- Loads a model and runs inference
- Measures throughput and latency
- Supports various execution modes
- Provides detailed performance metrics
Key concepts:

- Throughput: inferences per second (FPS)
- Latency: time for a single inference (ms)
- Synchronous API: blocking, sequential execution
- Asynchronous API: non-blocking, parallel execution
- Inference request: a single inference operation
- Stream: a parallel execution pipeline
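As a rough, self-contained sketch of how these terms relate (the latency numbers are invented, not from any device): in synchronous mode one request runs at a time, so throughput is the inverse of latency; in asynchronous mode several in-flight requests overlap, so throughput can exceed that bound even though per-request latency usually grows.

```python
def sync_fps(latency_ms: float) -> float:
    """Throughput of a blocking loop: one request at a time."""
    return 1000.0 / latency_ms

def ideal_async_fps(latency_ms: float, nireq: int) -> float:
    """Upper bound with nireq perfectly overlapped requests."""
    return nireq * 1000.0 / latency_ms

print(f"sync:  {sync_fps(6.4):.1f} FPS")
print(f"async: {ideal_async_fps(6.4, 4):.1f} FPS (ideal, nireq=4)")
```

Real async speedups fall short of this ideal because requests contend for cores, caches, and memory bandwidth.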
## Running Benchmarks

```bash
# Simple benchmark with default settings
ovmobilebench run -c experiments/basic.yaml
```

Basic configuration:
```yaml
run:
  repeats: 3
  matrix:
    niter: [100]
    api: ["sync"]
    device: ["CPU"]
    threads: [4]
```

Full matrix configuration:

```yaml
run:
  repeats: 5
  warmup_runs: 2
  cooldown_sec: 30
  timeout_sec: 600
  matrix:
    niter: [100, 200, 500]
    api: ["sync", "async"]
    nireq: [1, 2, 4, 8]
    nstreams: ["1", "2", "AUTO"]
    device: ["CPU"]
    threads: [1, 2, 4, 8]
    infer_precision: ["FP32", "FP16", "INT8"]
```
```bash
# Run specific configuration
ovmobilebench run -c experiments/config.yaml \
  --filter "threads=4,nstreams=2"
```

## Benchmark Parameters

| Parameter | Description | Values | Impact |
|---|---|---|---|
| `niter` | Number of iterations | 50-10000 | Higher = more accurate |
| `api` | Execution API | sync, async | Async = better throughput |
| `nireq` | Inference requests | 1-16 | More = parallel execution |
| `nstreams` | Execution streams | 1, 2, AUTO | AUTO = OpenVINO optimizes |
| `device` | Target device | CPU, GPU, NPU | Hardware selection |
| `threads` | CPU threads | 1-N cores | Parallelism level |
Performance hints:

```yaml
run:
  matrix:
    hint: ["LATENCY", "THROUGHPUT", "CUMULATIVE_THROUGHPUT"]
```

- LATENCY: optimize for minimal latency
- THROUGHPUT: optimize for maximum FPS
- CUMULATIVE_THROUGHPUT: optimize for running multiple models simultaneously
Inference precision:

```yaml
run:
  matrix:
    infer_precision: ["FP32", "FP16", "INT8"]
```

- FP32: full precision (baseline)
- FP16: half precision (up to ~2x faster, minimal accuracy loss)
- INT8: 8-bit integers (up to ~4x faster, requires calibration)
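To see where part of the speedup comes from, compare the memory footprint of a single activation tensor per precision (the tensor shape is illustrative; actual savings depend on the model and device):

```python
# Bytes per element for each numeric precision
BYTES_PER_ELEMENT = {"FP32": 4, "FP16": 2, "INT8": 1}

# One hypothetical 1x3x224x224 input tensor
elements = 1 * 3 * 224 * 224

for precision, nbytes in BYTES_PER_ELEMENT.items():
    kib = elements * nbytes / 1024
    print(f"{precision}: {kib:.0f} KiB")
```

Halving the bytes per element halves the memory traffic, which is why lower precision helps most on bandwidth-bound mobile CPUs.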
## Performance Metrics

Example throughput output:

```text
Throughput: 156.23 FPS
```

- Total inferences per second
- Higher is better
- Best for batch processing

Example latency output:

```text
Latency:
  Median: 6.4 ms
  Average: 6.5 ms
  Min: 5.8 ms
  Max: 8.2 ms
```

- Time per inference
- Lower is better
- Critical for real-time applications
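A latency summary like the one above can be reproduced from raw per-iteration timings with the standard library (the sample values below are invented):

```python
import statistics

# Hypothetical per-iteration latencies in milliseconds
latencies_ms = [6.1, 6.4, 6.3, 6.5, 5.8, 8.2, 6.4, 6.6]

print(f"Median: {statistics.median(latencies_ms):.1f} ms")
print(f"Average: {statistics.mean(latencies_ms):.1f} ms")
print(f"Min: {min(latencies_ms):.1f} ms")
print(f"Max: {max(latencies_ms):.1f} ms")
```

Note how the single 8.2 ms outlier pulls the average above the median; this is why the median is the more robust headline number.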
Monitor the device during a run:

```bash
# Monitor during benchmark
adb shell top -d 1 | grep benchmark_app

# Check memory consumption
adb shell dumpsys meminfo | grep benchmark_app

# Battery stats (Android)
adb shell dumpsys batterystats
```

## Optimization Strategies

Find the optimal thread count:
```yaml
run:
  matrix:
    threads: [1, 2, 3, 4, 5, 6, 7, 8]
```

Analysis:
```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('results.csv')
thread_perf = df.groupby('threads')['throughput_fps'].mean()

thread_perf.plot(kind='line', marker='o')
plt.xlabel('Threads')
plt.ylabel('Throughput (FPS)')
plt.title('Performance vs Thread Count')
plt.show()
```

Tune streams and inference requests together:

```yaml
run:
  matrix:
    nstreams: ["1", "2", "4", "AUTO"]
    nireq: [1, 2, 4, 8]
```

Best practices:
- nstreams ≤ number of CPU cores
- nireq ≥ nstreams for async mode
- Use AUTO for automatic optimization
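These rules of thumb can be encoded as a quick sanity check before launching a long matrix run. This is only a sketch; the function name and warning messages are made up and not part of ovmobilebench:

```python
import os

def check_stream_config(nstreams, nireq, api, cores=None):
    """Flag configurations that break the rules of thumb above."""
    cores = cores or os.cpu_count() or 1
    warnings = []
    if nstreams > cores:
        warnings.append(f"nstreams={nstreams} exceeds {cores} CPU cores")
    if api == "async" and nireq < nstreams:
        warnings.append(f"nireq={nireq} < nstreams={nstreams} may starve streams")
    return warnings

print(check_stream_config(nstreams=2, nireq=4, api="async", cores=8))  # []
print(check_stream_config(nstreams=4, nireq=2, api="async", cores=2))
```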
Batch size tuning:

```yaml
models:
  - name: "model"
    path: "model.xml"
    batch_size: [1, 2, 4, 8, 16]
run:
  matrix:
    batch: [1, 2, 4, 8, 16]
```

Trade-offs:

- Larger batch = better throughput
- Larger batch = higher latency
- Memory constraints limit the maximum batch size
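A toy cost model makes the trade-off concrete: if each inference call pays a fixed overhead plus a per-item cost (both numbers below are invented), throughput climbs with batch size while latency climbs too.

```python
def batch_metrics(batch, per_item_ms=5.0, overhead_ms=3.0):
    """Latency and throughput for a hypothetical fixed-overhead model."""
    latency_ms = overhead_ms + batch * per_item_ms
    throughput_fps = batch * 1000.0 / latency_ms
    return latency_ms, throughput_fps

for batch in [1, 2, 4, 8, 16]:
    latency, fps = batch_metrics(batch)
    print(f"batch={batch:2d}: latency={latency:6.1f} ms, throughput={fps:6.1f} FPS")
```

The throughput gain shrinks as the fixed overhead becomes a smaller fraction of each call, which is the "diminishing returns" pattern seen in real runs.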
Memory optimization:

```yaml
run:
  advanced:
    cache_dir: "/data/local/tmp/cache"
    enable_mmap: true
    memory_reuse: true
```

CPU affinity:

```bash
# Pin to big cores (Android)
adb shell "taskset 0xF0 benchmark_app ..."

# Pin to specific cores (Linux)
taskset -c 4-7 benchmark_app ...
```

## Result Analysis

```python
import pandas as pd
import numpy as np

# Load results
df = pd.read_csv('results.csv')

# Group by configuration
grouped = df.groupby(['model', 'threads', 'nstreams'])

# Calculate statistics
stats = grouped['throughput_fps'].agg([
    'mean',
    'median',
    'std',
    ('cv', lambda x: x.std() / x.mean()),  # Coefficient of variation
    ('q25', lambda x: x.quantile(0.25)),
    ('q75', lambda x: x.quantile(0.75))
])
print(stats)
```
Compare configurations:

```python
# Speedup and efficiency relative to the single-thread baseline
baseline = df[df['threads'] == 1]['throughput_fps'].median()
for threads in [2, 4, 8]:
    perf = df[df['threads'] == threads]['throughput_fps'].median()
    speedup = perf / baseline
    efficiency = speedup / threads
    print(f"Threads={threads}: Speedup={speedup:.2f}x, Efficiency={efficiency:.2%}")
```

Visualization:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Heatmap of performance
pivot = df.pivot_table(
    values='throughput_fps',
    index='threads',
    columns='nstreams',
    aggfunc='median'
)
plt.figure(figsize=(10, 6))
sns.heatmap(pivot, annot=True, fmt='.1f', cmap='YlOrRd')
plt.title('Performance Heatmap: Threads vs Streams')
plt.show()
```

Regression detection:

```python
def detect_regression(current, baseline, threshold=-0.05):
    """Detect a performance regression relative to a baseline."""
    change = (current - baseline) / baseline
    if change < threshold:
        return True, change
    return False, change

# Example usage
baseline_fps = 100.0
current_fps = 92.0
is_regression, change = detect_regression(current_fps, baseline_fps)
if is_regression:
    print(f"REGRESSION: {change:.1%} drop in performance")
```
## Best Practices

1. Control variables
   - Vary one parameter at a time
   - Document all fixed parameters
   - Record environmental conditions
2. Statistical validity
   - Run at least 3 repetitions
   - Discard warmup runs
   - Use the median instead of the mean
3. Fair comparison
   - Same model, same input
   - Same device state
   - Same thermal conditions
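Applied together, the statistical rules above look like this (the FPS values are invented):

```python
import statistics

# Hypothetical throughput from 5 repetitions; the first run is warmup
runs_fps = [88.1, 95.2, 96.0, 94.8, 95.5]

measured = runs_fps[1:]                 # discard the warmup run
reported = statistics.median(measured)  # median resists outliers
print(f"Reported throughput: {reported:.1f} FPS")
```

The 88.1 FPS warmup run would drag a naive mean down by several percent; discarding it and taking the median keeps one slow or fast outlier from skewing the reported number.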
Benchmark workflow:

```mermaid
graph TD
    A[Prepare Device] --> B[Warmup Run]
    B --> C[Cooldown]
    C --> D[Benchmark Run]
    D --> E[Collect Metrics]
    E --> F{More Runs?}
    F -->|Yes| C
    F -->|No| G[Analyze Results]
```
Latency-optimized configuration:

```yaml
run:
  matrix:
    api: ["sync"]
    nireq: [1]
    nstreams: ["1"]
    hint: ["LATENCY"]
```

Throughput-optimized configuration:

```yaml
run:
  matrix:
    api: ["async"]
    nireq: [8]
    nstreams: ["AUTO"]
    hint: ["THROUGHPUT"]
```

Power-efficient mobile configuration:

```yaml
run:
  matrix:
    threads: [2]  # Use efficiency cores
    nstreams: ["1"]
    device: ["CPU"]
```

## Benchmark Report
### Configuration
- Model: ResNet-50
- Device: Snapdragon 888
- Precision: FP16
- Batch Size: 1
### Results
| Threads | Streams | Throughput (FPS) | Latency (ms) |
|---------|---------|------------------|--------------|
| 1 | 1 | 25.3 | 39.5 |
| 4 | 2 | 78.2 | 12.8 |
| 8 | AUTO | 95.6 | 10.5 |
### Analysis
- Optimal configuration: 8 threads, AUTO streams
- Linear scaling up to 4 threads
- Diminishing returns beyond 4 threads

Detect thermal throttling:

```python
import numpy as np

def detect_throttling(fps_over_time):
    """Detect performance degradation over time."""
    first_quarter = np.mean(fps_over_time[:len(fps_over_time)//4])
    last_quarter = np.mean(fps_over_time[-len(fps_over_time)//4:])
    degradation = (first_quarter - last_quarter) / first_quarter
    if degradation > 0.1:  # 10% drop
        return True, degradation
    return False, degradation
```

Stabilize noisy results:

```yaml
run:
  warmup_runs: 5  # Increase for stable results
  matrix:
    niter: [200]  # Ensure enough iterations
```

```bash
# Check for background processes
adb shell ps -A | grep -v idle

# Monitor CPU frequency
adb shell "while true; do
  cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
  sleep 1
done"
```

Multi-model benchmarking:

```yaml
models:
  - name: "resnet50"
    path: "models/resnet50.xml"
  - name: "mobilenet"
    path: "models/mobilenet.xml"
  - name: "yolo"
    path: "models/yolo.xml"
run:
  multi_model_mode: "sequential"  # or "parallel"
```

Dynamic input shapes:

```yaml
models:
  - name: "model"
    path: "model.xml"
    dynamic_shapes:
      input: [1, 3, -1, -1]  # Dynamic H, W
run:
  matrix:
    input_shape:
      - [1, 3, 224, 224]
      - [1, 3, 416, 416]
      - [1, 3, 640, 640]
```

Custom efficiency metrics:

```python
import pandas as pd

def calculate_efficiency_metrics(df):
    """Calculate custom efficiency metrics."""
    metrics = {}
    # FPS per thread
    metrics['fps_per_thread'] = df['throughput_fps'] / df['threads']
    # FPS per watt (if power data available)
    if 'power_w' in df.columns:
        metrics['fps_per_watt'] = df['throughput_fps'] / df['power_w']
    # Latency consistency
    if 'latency_std' in df.columns:
        metrics['latency_cv'] = df['latency_std'] / df['latency_avg']
    return pd.DataFrame(metrics)
```

Low performance:

- Check thermal state
- Verify CPU frequency
- Ensure proper thread affinity
- Check memory availability
- Verify model optimization
Inconsistent results:

- Increase warmup runs
- Add cooldown periods
- Disable background apps
- Use performance governor
- Pin CPU frequency
Crashes or failures:

- Check memory limits
- Verify model compatibility
- Reduce batch size
- Check library dependencies
- Review device logs
See also:

- CI/CD Integration - Automated benchmarking
- API Reference - Programming interface
- Troubleshooting - Common issues