The CASE Benchmark includes 24 evaluation protocols organized into 6 categories. Each protocol tests a specific carrier condition with clean enrollment audio.
Protocols follow the naming scheme `clean_<category>_<variant>`:

- `clean_` prefix indicates clean enrollment audio
- `<category>` is one of: clean, codec, mic, noise, reverb, playback
- `<variant>` specifies the exact degradation applied
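Given this scheme, a protocol name can be split mechanically. A minimal sketch (the `parse_protocol` helper is hypothetical, not part of the benchmark API):

```python
def parse_protocol(name: str) -> tuple[str, str, str]:
    """Split a protocol name into (enrollment, category, variant).

    Hypothetical helper for illustration; variants may themselves
    contain underscores (e.g. "snr_10"), so only the first two
    fields are split off.
    """
    enrollment, category, *rest = name.split("_")
    return enrollment, category, "_".join(rest)
```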
| Protocol | Description | Challenge |
|---|---|---|
| clean_clean | Clean vs Clean | Baseline performance |
Tests robustness to audio codec compression artifacts.
| Protocol | Codec | Bitrate | Use Case |
|---|---|---|---|
| clean_codec_gsm | GSM-FR | 13 kbps | 2G mobile |
| clean_codec_ulaw | G.711 μ-law | 64 kbps | US telephony |
| clean_codec_alaw | G.711 A-law | 64 kbps | EU telephony |
| clean_codec_opus_6k | Opus | 6 kbps | Low-bandwidth VoIP |
| clean_codec_opus_12k | Opus | 12 kbps | Standard VoIP |
| clean_codec_opus_24k | Opus | 24 kbps | HD VoIP |
| clean_codec_mp3_32k | MP3 | 32 kbps | Compressed audio |
Expected difficulty: GSM and low-bitrate Opus are hardest due to aggressive compression.
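To build intuition for what companding codecs do to the waveform, here is a sketch of a G.711-style μ-law encode/decode round trip in NumPy. This is an illustration of the companding math only, not the benchmark's actual codec pipeline (which presumably uses real codec implementations):

```python
import numpy as np

def mulaw_roundtrip(x: np.ndarray, mu: int = 255, bits: int = 8) -> np.ndarray:
    """Approximate a mu-law encode/decode cycle on audio in [-1, 1]."""
    # Compress: logarithmic companding gives fine resolution to quiet samples.
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    # Quantize the companded signal to 2**bits levels, as the 8-bit codec would.
    q = np.round((y + 1) / 2 * (2**bits - 1)) / (2**bits - 1) * 2 - 1
    # Expand back to the linear domain.
    return np.sign(q) * ((1 + mu) ** np.abs(q) - 1) / mu
```

The round trip is nearly transparent for speech-level signals, which is why the 64 kbps G.711 protocols sit in the middle of the difficulty range rather than at the top.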
Tests robustness to microphone frequency response variations.
| Protocol | Microphone Profile | Characteristics |
|---|---|---|
| clean_mic_webcam_budget | Budget webcam | Bass rolloff, harsh highs |
| clean_mic_webcam_quality | Quality webcam | Flatter response |
| clean_mic_headset_usb | USB headset | Narrow band |
| clean_mic_laptop_internal | Laptop mic | Resonant, colored |
| clean_mic_phone | Phone mic | Telephony band |
| clean_mic_smartphone_flagship | Flagship phone | Wide band |
| clean_mic_conference_ceiling | Ceiling mic | Room coloration |
Expected difficulty: Budget webcam and laptop internal are hardest.
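Frequency-response coloration of this kind can be mimicked with simple filters. As a sketch, a first-order high-pass approximating a budget webcam's bass rolloff (the 300 Hz cutoff is an illustrative assumption; the benchmark's actual microphone profiles are not reproduced here):

```python
import numpy as np

def one_pole_highpass(x: np.ndarray, sr: int, cutoff_hz: float) -> np.ndarray:
    """First-order high-pass: y[n] = a * (y[n-1] + x[n] - x[n-1])."""
    a = 1.0 / (1.0 + 2 * np.pi * cutoff_hz / sr)
    y = np.empty_like(x)
    y[0] = x[0]
    for n in range(1, len(x)):
        y[n] = a * (y[n - 1] + x[n] - x[n - 1])
    return y
```

With a 300 Hz cutoff, content around 100 Hz loses roughly two-thirds of its amplitude while content at 1 kHz passes nearly untouched, which is the "bass rolloff" character listed above.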
Tests robustness to additive background noise at various SNR levels.
| Protocol | SNR | Difficulty |
|---|---|---|
| clean_noise_snr_25 | 25 dB | Easy |
| clean_noise_snr_20 | 20 dB | Moderate |
| clean_noise_snr_15 | 15 dB | Challenging |
| clean_noise_snr_10 | 10 dB | Hard |
| clean_noise_snr_5 | 5 dB | Very hard |
Noise types: DEMAND corpus (domestic, office, transport, nature, etc.)
Note: the benchmark uses DEMAND noise (not MUSAN), so augmenting your training data with MUSAN keeps training and evaluation noise properly separated.
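Mixing noise at a target SNR amounts to scaling the noise so the speech-to-noise power ratio hits the requested level. A minimal NumPy sketch (an illustration, not the benchmark's internal mixing code):

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then mix.

    Assumes `noise` is at least as long as `speech`.
    """
    noise = noise[: len(speech)]
    p_speech = np.mean(speech**2)
    p_noise = np.mean(noise**2)
    # Choose gain g so that p_speech / (g**2 * p_noise) == 10**(snr_db / 10).
    g = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + g * noise
```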
Tests robustness to room reverberation using real Room Impulse Responses.
| Protocol | Description |
|---|---|
| clean_reverb | Real RIRs from OpenSLR-28 + BUT ReverbDB |
RIR sources: OpenSLR-28 (~417 real RIRs) + BUT ReverbDB (~1,500 real RIRs)
Expected difficulty: High - reverb smears temporal features.
Note: the benchmark uses real RIRs (not simulated), so training with pyroomacoustics-simulated RIRs or OpenSLR-26 keeps training and evaluation reverberation properly separated.
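Applying an RIR is a convolution of the dry signal with the measured impulse response. A minimal sketch (for illustration; the benchmark's exact normalization and trimming choices are not specified here):

```python
import numpy as np

def apply_rir(audio: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Convolve dry audio with a room impulse response, trimmed to input length."""
    rir = rir / np.max(np.abs(rir))  # normalize so the direct path keeps its level
    wet = np.convolve(audio, rir)
    return wet[: len(audio)]
```

The late reflections in `rir` are what smear temporal features: each output sample becomes a weighted sum of many past input samples.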
Tests robustness to full playback chains (the hardest category).
| Protocol | Chain |
|---|---|
| clean_playback_alaw_snr25_phone | A-law → speaker → room → phone mic (SNR 25 dB) |
| clean_playback_gsm_snr20_webcam_budget | GSM → speaker → room → budget webcam (SNR 20 dB) |
| clean_playback_ulaw_snr15_laptop_internal | μ-law → speaker → room → laptop mic (SNR 15 dB) |
Expected difficulty: Very high - combines codec, room acoustics, and microphone effects.
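The chains above compose independent degradations in a fixed order (codec, then room acoustics, then microphone). One way to express that composition as a sketch, with toy stand-in stages that are NOT the benchmark's real processing:

```python
import numpy as np

def playback_chain(audio: np.ndarray, stages) -> np.ndarray:
    """Apply degradation stages in order: codec -> speaker/room -> mic."""
    for stage in stages:
        audio = stage(audio)
    return audio

# Toy stand-ins for the three stages, purely for illustration:
toy_stages = [
    lambda x: np.clip(x, -0.5, 0.5),                      # crude codec saturation
    lambda x: np.convolve(x, [1.0, 0.0, 0.3])[: len(x)],  # 2-tap "room" echo
    lambda x: x * 0.9,                                    # mic sensitivity loss
]
```

Because the stages compound rather than average out, the combined chains are harder than any single degradation, which is why every playback protocol lands in the hardest tier below.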
Based on typical model performance:
- Easy: clean_clean, clean_noise_snr_25, clean_mic_smartphone_flagship
- Moderate: clean_codec_opus_24k, clean_noise_snr_20, clean_mic_headset_usb
- Challenging: clean_codec_ulaw, clean_codec_alaw, clean_noise_snr_10
- Hard: clean_codec_gsm, clean_reverb, clean_codec_opus_6k
- Very Hard: All playback protocols
You can evaluate on specific protocols instead of the full benchmark:
```python
from case_benchmark import CASEBenchmark, load_model

benchmark = CASEBenchmark("/path/to/benchmark")
model = load_model("speechbrain")

# Evaluate only codec protocols
codec_protocols = [p for p in benchmark.list_protocols() if "codec" in p]
results = benchmark.evaluate(model, protocols=codec_protocols)
```

Or via the CLI:

```bash
case-benchmark evaluate \
    --model speechbrain \
    --benchmark-dir /path/to/benchmark \
    --protocols clean_clean clean_codec_gsm clean_reverb
```