fix: bypass pyannote-rs segmentation bug that drops all speech#70
Conversation
silverstein
left a comment
There was a problem hiding this comment.
Nice root-cause analysis, @gregoire22enpc — the x as f32 without /32768.0 in pyannote-rs is a real bug and your workaround is well-motivated. A few things to think about before this is ready to merge:
Architecture concern: duplicating the ONNX model interface
segment_speech() reimplements pyannote-rs's sliding-window inference loop (~80 lines) with hard-coded frame constants (frame_size = 270, frame_start = 721, window_size = sr * 10). This works today, but:
-
These constants are tied to the segmentation-3.0 model architecture. If a future pyannote model changes window size or frame stride, this code silently breaks. Worth adding a comment documenting which model version these constants assume, or even asserting the model's input/output shapes at session init.
-
Have you considered contributing the fix upstream to pyannote-rs? The one-line fix (
x as f32 / 32768.0) in pyannote-rs would eliminate the need for the entiresegment_speech()function + theortandndarraydependencies. That's a much smaller surface area to maintain. Even if upstream is slow to merge, a patched fork pinned to a commit would be lighter than carrying a parallel inference path. This PR is a fine stopgap, but I'd love to see an upstream issue/PR linked. -
Two new heavyweight deps (
ort2.0.0-rc.10 +ndarray) are added to thediarizefeature.ortis a release candidate — are there stability concerns? Also, pyannote-rs already links ONNX Runtime internally, so you may end up with two copies of the runtime in the binary. Worth verifying.
Quiet-audio retry logic
The retry-on-empty approach (let the model decide, then normalize if needed) is elegant. One edge case:
TARGET_PEAK = 0.5is applied viagain = 0.5 / peak. For extremely quiet audio (peak = 0.0001), this is a 5000× gain, which will amplify noise floor dramatically. The model might then "find" speech in amplified noise. Consider capping the gain (e.g.,gain.min(100.0)) or logging a warning when gain exceeds some reasonable bound.
Trailing speech flush
Good catch that pyannote-rs drops the last segment when is_speeching is still true at EOF. That's a real fix.
Minor
- The
mut sessioninsegment_speech— doessession.run()actually need&mut self? If so, fine, just unusual for an inference session. load_audioreturns bothVec<f32>andVec<i16>(doubling memory). For large files this matters. Since i16 is only needed for embedding extraction on individual segments, you could defer the conversion to the per-segment loop and avoid allocating the full i16 copy upfront.
Overall this is solid work on a real, impactful bug. The main question is whether to carry the parallel inference path long-term or invest in the upstream fix.
Generated by Claude Code
1570b80 to
8ca20b6
Compare
|
Thank you for the comment. PR is still draft. I do agree with your architecture concern. Still, we need a fix as I'm fully unable to enroll a voice on |
|
Hey @gregoire22enpc, thanks for the reply -- and you're absolutely right that Here's what I'm thinking for a path forward: Short-term fix (this PR): The core bug is the missing Upstream PR: In parallel, I can open (or you could open) a one-line fix on pyannote-rs for the Quiet-audio retry: The retry logic is clever, but I'd like to defer it to a follow-up PR so we can think through the gain-capping implications separately. The normalization fix alone should resolve the "zero segments" issue for normal recordings. What do you think? Happy to pair on scoping this down if that helps. |
|
Agreed — I've reworked this PR per your feedback: What changed:
The diff is now 2 files changed, 6 insertions, 15 deletions — just the dependency pointer. Upstream PR: thewh1teagle/pyannote-rs#28 — once that lands, we switch back to crates.io. Quiet-audio retry: deferred to a follow-up PR as suggested. Ready for re-review. |
pyannote-rs's get_segments casts i16 to f32 without dividing by 32768, feeding the ONNX segmentation model values in [-32768, 32767] when it expects [-1.0, 1.0]. This causes the model to classify all frames as non-speech, breaking enrollment and diarization entirely. Switch to a patched fork (gregoire22enpc/pyannote-rs@ae9fbc1) that: 1. Normalizes i16→f32 with / 32768.0 in get_segments 2. Flushes trailing speech segments at end-of-audio Upstream PR: thewh1teagle/pyannote-rs#28 Once merged, we'll switch back to the crates.io release. Made-with: Cursor
c97167b to
303e522
Compare
@silverstein , I'm afraid the patch will be very slow to merge. Last release is 2 years old, and there are multiple old PRs open. |
silverstein
left a comment
There was a problem hiding this comment.
Merging this. The fork approach is the right call for unblocking enrollment now.
On the upstream concern you raised: I filed #79 to track it. The plan is to give thewh1teagle/pyannote-rs#28 about 30 days. If it doesn't move, we vendor pyannote-rs into the repo so we own the code directly and eliminate the git dependency. Either way, you won't be maintaining a fork long-term.
Thanks for narrowing this to just the dependency pointer. Clean and minimal.
Problem
Diarization produces unreliable results — speech segments are missed or absent entirely. This manifests as:
minutes enrollfailing with "No speech detected in the recording"minutes recordproducing transcripts with missing or no speaker labelsThe root cause is in
pyannote_rs::get_segments(line 59 of segment.rs): it converts i16 samples to f32 viax as f32without dividing by 32768. The ONNX segmentation model was trained on input in [-1.0, 1.0], but receives values in [-32768, 32767]. The model's weights, biases, and normalization layers are calibrated for the expected range, so feeding values 32768× larger produces unreliable frame classifications. In our testing, this resulted in zero speech segments detected.A secondary bug: when speech extends to end-of-audio,
get_segmentssilently drops the final segment (no flush whenis_speechingis still true at EOF).Solution
Rather than reimplementing the ONNX inference loop in minutes-core (which would pull in
ortandndarrayas direct deps), this PR switches to a patched fork of pyannote-rs (gregoire22enpc/pyannote-rs@ae9fbc1) that applies two targeted fixes:x as f32 / 32768.0instead ofx as f32Upstream PR: thewh1teagle/pyannote-rs#28 — once merged, we'll switch back to the crates.io release.
What changed in minutes
Only
Cargo.tomlandCargo.lock— the pyannote-rs dependency now points to the patched fork. Zero changes todiarize.rsor any Rust code in minutes-core.Test plan
minutes enroll --file recording.wavno longer errors with "No speech detected"minutes recordproduces diarized transcripts with speaker labelscargo test -p minutes-core --no-default-features— 265/266 pass (1 pre-existingvault::tests::tcc_protected_documentsfailure unrelated to this PR)cargo fmt --all -- --check— passcargo check -p minutes-core --features diarize --no-default-features— pass (only pre-existing warnings)