fix: bypass pyannote-rs segmentation bug that drops all speech by gregoire22enpc · Pull Request #70 · silverstein/minutes

gregoire22enpc · 2026-04-04T11:51:49Z

Problem

Diarization produces unreliable results — speech segments are missed or absent entirely. This manifests as:

minutes enroll failing with "No speech detected in the recording"
minutes record producing transcripts with missing or no speaker labels

The root cause is in pyannote_rs::get_segments (line 59 of segment.rs): it converts i16 samples to f32 via x as f32 without dividing by 32768. The ONNX segmentation model was trained on input in [-1.0, 1.0], but receives values in [-32768, 32767]. The model's weights, biases, and normalization layers are calibrated for the expected range, so feeding values 32768× larger produces unreliable frame classifications. In our testing, this resulted in zero speech segments detected.

A secondary bug: when speech extends to end-of-audio, get_segments silently drops the final segment (no flush when is_speeching is still true at EOF).

Solution

Rather than reimplementing the ONNX inference loop in minutes-core (which would pull in ort and ndarray as direct deps), this PR switches to a patched fork of pyannote-rs (gregoire22enpc/pyannote-rs@ae9fbc1) that applies two targeted fixes:

Normalize i16→f32: x as f32 / 32768.0 instead of x as f32
Flush trailing speech: emit the final segment when speech extends to end-of-audio

Upstream PR: thewh1teagle/pyannote-rs#28 — once merged, we'll switch back to the crates.io release.

What changed in minutes

Only Cargo.toml and Cargo.lock — the pyannote-rs dependency now points to the patched fork. Zero changes to diarize.rs or any Rust code in minutes-core.

Test plan

minutes enroll --file recording.wav no longer errors with "No speech detected"
minutes record produces diarized transcripts with speaker labels
cargo test -p minutes-core --no-default-features — 265/266 pass (1 pre-existing vault::tests::tcc_protected_documents failure unrelated to this PR)
cargo fmt --all -- --check — pass
cargo check -p minutes-core --features diarize --no-default-features — pass (only pre-existing warnings)

silverstein

Nice root-cause analysis, @gregoire22enpc — the x as f32 without /32768.0 in pyannote-rs is a real bug and your workaround is well-motivated. A few things to think about before this is ready to merge:

Architecture concern: duplicating the ONNX model interface

segment_speech() reimplements pyannote-rs's sliding-window inference loop (~80 lines) with hard-coded frame constants (frame_size = 270, frame_start = 721, window_size = sr * 10). This works today, but:

These constants are tied to the segmentation-3.0 model architecture. If a future pyannote model changes window size or frame stride, this code silently breaks. Worth adding a comment documenting which model version these constants assume, or even asserting the model's input/output shapes at session init.
Have you considered contributing the fix upstream to pyannote-rs? The one-line fix (x as f32 / 32768.0) in pyannote-rs would eliminate the need for the entire segment_speech() function + the ort and ndarray dependencies. That's a much smaller surface area to maintain. Even if upstream is slow to merge, a patched fork pinned to a commit would be lighter than carrying a parallel inference path. This PR is a fine stopgap, but I'd love to see an upstream issue/PR linked.
Two new heavyweight deps (ort 2.0.0-rc.10 + ndarray) are added to the diarize feature. ort is a release candidate — are there stability concerns? Also, pyannote-rs already links ONNX Runtime internally, so you may end up with two copies of the runtime in the binary. Worth verifying.

Quiet-audio retry logic

The retry-on-empty approach (let the model decide, then normalize if needed) is elegant. One edge case:

TARGET_PEAK = 0.5 is applied via gain = 0.5 / peak. For extremely quiet audio (peak = 0.0001), this is a 5000× gain, which will amplify noise floor dramatically. The model might then "find" speech in amplified noise. Consider capping the gain (e.g., gain.min(100.0)) or logging a warning when gain exceeds some reasonable bound.

Trailing speech flush

Good catch that pyannote-rs drops the last segment when is_speeching is still true at EOF. That's a real fix.

Minor

The mut session in segment_speech — does session.run() actually need &mut self? If so, fine, just unusual for an inference session.
load_audio returns both Vec<f32> and Vec<i16> (doubling memory). For large files this matters. Since i16 is only needed for embedding extraction on individual segments, you could defer the conversion to the per-segment loop and avoid allocating the full i16 copy upfront.

Overall this is solid work on a real, impactful bug. The main question is whether to carry the parallel inference path long-term or invest in the upstream fix.

Generated by Claude Code

gregoire22enpc · 2026-04-04T21:12:04Z

Thank you for the comment. PR is still draft. I do agree with your architecture concern. Still, we need a fix as I'm fully unable to enroll a voice on main. Keen to have your opinion there.

silverstein · 2026-04-04T22:04:12Z

Hey @gregoire22enpc, thanks for the reply -- and you're absolutely right that main is broken for enrollment right now, so I appreciate the urgency.

Here's what I'm thinking for a path forward:

Short-term fix (this PR): The core bug is the missing / 32768.0 in pyannote-rs's segment.rs:59. Would you be open to narrowing this PR to just the normalization fix applied locally (i.e., normalize the i16 samples to [-1.0, 1.0] before passing them to the model), without pulling in ort/ndarray as direct dependencies? That would unblock enrollment immediately with minimal new surface area.

Upstream PR: In parallel, I can open (or you could open) a one-line fix on pyannote-rs for the x as f32 cast. Once that lands and we bump the dep, the local workaround gets removed.

Quiet-audio retry: The retry logic is clever, but I'd like to defer it to a follow-up PR so we can think through the gain-capping implications separately. The normalization fix alone should resolve the "zero segments" issue for normal recordings.

What do you think? Happy to pair on scoping this down if that helps.

gregoire22enpc · 2026-04-05T12:13:03Z

Agreed — I've reworked this PR per your feedback:

What changed:

Removed segment_speech(), SpeechSegment, load_audio(), and the quiet-audio retry logic entirely
Removed ort and ndarray as direct dependencies
Applied the fix upstream in a patched fork of pyannote-rs with two changes:
1. x as f32 / 32768.0 (normalization fix)
2. Trailing speech flush at end-of-audio
Cargo.toml now points to the fork via git rev

The diff is now 2 files changed, 6 insertions, 15 deletions — just the dependency pointer.

Upstream PR: thewh1teagle/pyannote-rs#28 — once that lands, we switch back to crates.io.

Quiet-audio retry: deferred to a follow-up PR as suggested.

Ready for re-review.

pyannote-rs's get_segments casts i16 to f32 without dividing by 32768, feeding the ONNX segmentation model values in [-32768, 32767] when it expects [-1.0, 1.0]. This causes the model to classify all frames as non-speech, breaking enrollment and diarization entirely. Switch to a patched fork (gregoire22enpc/pyannote-rs@ae9fbc1) that: 1. Normalizes i16→f32 with / 32768.0 in get_segments 2. Flushes trailing speech segments at end-of-audio Upstream PR: thewh1teagle/pyannote-rs#28 Once merged, we'll switch back to the crates.io release. Made-with: Cursor

gregoire22enpc · 2026-04-05T22:19:19Z

Agreed — I've reworked this PR per your feedback:

What changed:

Removed segment_speech(), SpeechSegment, load_audio(), and the quiet-audio retry logic entirely

Removed ort and ndarray as direct dependencies

Applied the fix upstream in a patched fork of pyannote-rs with two changes:

x as f32 / 32768.0 (normalization fix)

Trailing speech flush at end-of-audio

Cargo.toml now points to the fork via git rev

The diff is now 2 files changed, 6 insertions, 15 deletions — just the dependency pointer.

Upstream PR: thewh1teagle/pyannote-rs#28 — once that lands, we switch back to crates.io.

Quiet-audio retry: deferred to a follow-up PR as suggested.

Ready for re-review.

@silverstein , I'm afraid the patch will be very slow to merge. Last release is 2 years old, and there are multiple old PRs open.
The fork approach seems shaky to me - let me know what you think

silverstein

Merging this. The fork approach is the right call for unblocking enrollment now.

On the upstream concern you raised: I filed #79 to track it. The plan is to give thewh1teagle/pyannote-rs#28 about 30 days. If it doesn't move, we vendor pyannote-rs into the repo so we own the code directly and eliminate the git dependency. Either way, you won't be maintaining a fork long-term.

Thanks for narrowing this to just the dependency pointer. Clean and minimal.

gregoire22enpc mentioned this pull request Apr 4, 2026

feat: improve speaker clustering with running-average templates and L2 normalization #71

Closed

7 tasks

gregoire22enpc marked this pull request as draft April 4, 2026 11:56

gregoire22enpc changed the title ~~fix: bypass pyannote-rs segmentation normalization bug~~ fix: bypass pyannote-rs segmentation bug that drops all speech Apr 4, 2026

silverstein reviewed Apr 4, 2026

View reviewed changes

silverstein mentioned this pull request Apr 4, 2026

fix: use diarization embeddings directly for voice enrollment #73

Merged

7 tasks

gregoire22enpc force-pushed the fix/speech-segmentation branch from 1570b80 to 8ca20b6 Compare April 4, 2026 21:06

gregoire22enpc marked this pull request as ready for review April 5, 2026 12:13

gregoire22enpc force-pushed the fix/speech-segmentation branch from c97167b to 303e522 Compare April 5, 2026 12:16

gregoire22enpc mentioned this pull request Apr 5, 2026

feat: make speaker embedding model configurable, add CAM++_LM option #77

Closed

silverstein mentioned this pull request Apr 5, 2026

Track pyannote-rs upstream fix for i16 normalization bug #79

Open

silverstein approved these changes Apr 6, 2026

View reviewed changes

silverstein merged commit d9595b3 into silverstein:main Apr 6, 2026
6 of 7 checks passed

gregoire22enpc deleted the fix/speech-segmentation branch April 6, 2026 10:03

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: bypass pyannote-rs segmentation bug that drops all speech#70

fix: bypass pyannote-rs segmentation bug that drops all speech#70
silverstein merged 1 commit intosilverstein:mainfrom
gregoire22enpc:fix/speech-segmentation

gregoire22enpc commented Apr 4, 2026 •

edited

Loading

Uh oh!

silverstein left a comment

Uh oh!

gregoire22enpc commented Apr 4, 2026

Uh oh!

silverstein commented Apr 4, 2026 •

edited

Loading

Uh oh!

gregoire22enpc commented Apr 5, 2026

Uh oh!

gregoire22enpc commented Apr 5, 2026

Uh oh!

silverstein left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

gregoire22enpc commented Apr 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Solution

What changed in minutes

Test plan

Uh oh!

silverstein left a comment

Choose a reason for hiding this comment

Architecture concern: duplicating the ONNX model interface

Quiet-audio retry logic

Trailing speech flush

Minor

Uh oh!

gregoire22enpc commented Apr 4, 2026

Uh oh!

silverstein commented Apr 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gregoire22enpc commented Apr 5, 2026

Uh oh!

gregoire22enpc commented Apr 5, 2026

Uh oh!

silverstein left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

gregoire22enpc commented Apr 4, 2026 •

edited

Loading

silverstein commented Apr 4, 2026 •

edited

Loading