Skip to content

fix: bypass pyannote-rs segmentation bug that drops all speech#70

Merged
silverstein merged 1 commit intosilverstein:mainfrom
gregoire22enpc:fix/speech-segmentation
Apr 6, 2026
Merged

fix: bypass pyannote-rs segmentation bug that drops all speech#70
silverstein merged 1 commit intosilverstein:mainfrom
gregoire22enpc:fix/speech-segmentation

Conversation

@gregoire22enpc
Copy link
Copy Markdown
Contributor

@gregoire22enpc gregoire22enpc commented Apr 4, 2026

Problem

Diarization produces unreliable results — speech segments are missed or absent entirely. This manifests as:

  • minutes enroll failing with "No speech detected in the recording"
  • minutes record producing transcripts with missing or no speaker labels

The root cause is in pyannote_rs::get_segments (line 59 of segment.rs): it converts i16 samples to f32 via x as f32 without dividing by 32768. The ONNX segmentation model was trained on input in [-1.0, 1.0], but receives values in [-32768, 32767]. The model's weights, biases, and normalization layers are calibrated for the expected range, so feeding values 32768× larger produces unreliable frame classifications. In our testing, this resulted in zero speech segments detected.

A secondary bug: when speech extends to end-of-audio, get_segments silently drops the final segment (no flush when is_speeching is still true at EOF).

Solution

Rather than reimplementing the ONNX inference loop in minutes-core (which would pull in ort and ndarray as direct deps), this PR switches to a patched fork of pyannote-rs (gregoire22enpc/pyannote-rs@ae9fbc1) that applies two targeted fixes:

  1. Normalize i16→f32: x as f32 / 32768.0 instead of x as f32
  2. Flush trailing speech: emit the final segment when speech extends to end-of-audio

Upstream PR: thewh1teagle/pyannote-rs#28 — once merged, we'll switch back to the crates.io release.

What changed in minutes

Only Cargo.toml and Cargo.lock — the pyannote-rs dependency now points to the patched fork. Zero changes to diarize.rs or any Rust code in minutes-core.

Test plan

  • minutes enroll --file recording.wav no longer errors with "No speech detected"
  • minutes record produces diarized transcripts with speaker labels
  • cargo test -p minutes-core --no-default-features — 265/266 pass (1 pre-existing vault::tests::tcc_protected_documents failure unrelated to this PR)
  • cargo fmt --all -- --check — pass
  • cargo check -p minutes-core --features diarize --no-default-features — pass (only pre-existing warnings)

@gregoire22enpc gregoire22enpc marked this pull request as draft April 4, 2026 11:56
@gregoire22enpc gregoire22enpc changed the title fix: bypass pyannote-rs segmentation normalization bug fix: bypass pyannote-rs segmentation bug that drops all speech Apr 4, 2026
Copy link
Copy Markdown
Owner

@silverstein silverstein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice root-cause analysis, @gregoire22enpc — the x as f32 without /32768.0 in pyannote-rs is a real bug and your workaround is well-motivated. A few things to think about before this is ready to merge:

Architecture concern: duplicating the ONNX model interface

segment_speech() reimplements pyannote-rs's sliding-window inference loop (~80 lines) with hard-coded frame constants (frame_size = 270, frame_start = 721, window_size = sr * 10). This works today, but:

  1. These constants are tied to the segmentation-3.0 model architecture. If a future pyannote model changes window size or frame stride, this code silently breaks. Worth adding a comment documenting which model version these constants assume, or even asserting the model's input/output shapes at session init.

  2. Have you considered contributing the fix upstream to pyannote-rs? The one-line fix (x as f32 / 32768.0) in pyannote-rs would eliminate the need for the entire segment_speech() function + the ort and ndarray dependencies. That's a much smaller surface area to maintain. Even if upstream is slow to merge, a patched fork pinned to a commit would be lighter than carrying a parallel inference path. This PR is a fine stopgap, but I'd love to see an upstream issue/PR linked.

  3. Two new heavyweight deps (ort 2.0.0-rc.10 + ndarray) are added to the diarize feature. ort is a release candidate — are there stability concerns? Also, pyannote-rs already links ONNX Runtime internally, so you may end up with two copies of the runtime in the binary. Worth verifying.

Quiet-audio retry logic

The retry-on-empty approach (let the model decide, then normalize if needed) is elegant. One edge case:

  • TARGET_PEAK = 0.5 is applied via gain = 0.5 / peak. For extremely quiet audio (peak = 0.0001), this is a 5000× gain, which will amplify noise floor dramatically. The model might then "find" speech in amplified noise. Consider capping the gain (e.g., gain.min(100.0)) or logging a warning when gain exceeds some reasonable bound.

Trailing speech flush

Good catch that pyannote-rs drops the last segment when is_speeching is still true at EOF. That's a real fix.

Minor

  • The mut session in segment_speech — does session.run() actually need &mut self? If so, fine, just unusual for an inference session.
  • load_audio returns both Vec<f32> and Vec<i16> (doubling memory). For large files this matters. Since i16 is only needed for embedding extraction on individual segments, you could defer the conversion to the per-segment loop and avoid allocating the full i16 copy upfront.

Overall this is solid work on a real, impactful bug. The main question is whether to carry the parallel inference path long-term or invest in the upstream fix.


Generated by Claude Code

@gregoire22enpc
Copy link
Copy Markdown
Contributor Author

Thank you for the comment. PR is still draft. I do agree with your architecture concern. Still, we need a fix as I'm fully unable to enroll a voice on main. Keen to have your opinion there.

Copy link
Copy Markdown
Owner

silverstein commented Apr 4, 2026

Hey @gregoire22enpc, thanks for the reply -- and you're absolutely right that main is broken for enrollment right now, so I appreciate the urgency.

Here's what I'm thinking for a path forward:

Short-term fix (this PR): The core bug is the missing / 32768.0 in pyannote-rs's segment.rs:59. Would you be open to narrowing this PR to just the normalization fix applied locally (i.e., normalize the i16 samples to [-1.0, 1.0] before passing them to the model), without pulling in ort/ndarray as direct dependencies? That would unblock enrollment immediately with minimal new surface area.

Upstream PR: In parallel, I can open (or you could open) a one-line fix on pyannote-rs for the x as f32 cast. Once that lands and we bump the dep, the local workaround gets removed.

Quiet-audio retry: The retry logic is clever, but I'd like to defer it to a follow-up PR so we can think through the gain-capping implications separately. The normalization fix alone should resolve the "zero segments" issue for normal recordings.

What do you think? Happy to pair on scoping this down if that helps.

@gregoire22enpc
Copy link
Copy Markdown
Contributor Author

Agreed — I've reworked this PR per your feedback:

What changed:

  • Removed segment_speech(), SpeechSegment, load_audio(), and the quiet-audio retry logic entirely
  • Removed ort and ndarray as direct dependencies
  • Applied the fix upstream in a patched fork of pyannote-rs with two changes:
    1. x as f32 / 32768.0 (normalization fix)
    2. Trailing speech flush at end-of-audio
  • Cargo.toml now points to the fork via git rev

The diff is now 2 files changed, 6 insertions, 15 deletions — just the dependency pointer.

Upstream PR: thewh1teagle/pyannote-rs#28 — once that lands, we switch back to crates.io.

Quiet-audio retry: deferred to a follow-up PR as suggested.

Ready for re-review.

@gregoire22enpc gregoire22enpc marked this pull request as ready for review April 5, 2026 12:13
pyannote-rs's get_segments casts i16 to f32 without dividing by 32768,
feeding the ONNX segmentation model values in [-32768, 32767] when it
expects [-1.0, 1.0]. This causes the model to classify all frames as
non-speech, breaking enrollment and diarization entirely.

Switch to a patched fork (gregoire22enpc/pyannote-rs@ae9fbc1) that:
1. Normalizes i16→f32 with / 32768.0 in get_segments
2. Flushes trailing speech segments at end-of-audio

Upstream PR: thewh1teagle/pyannote-rs#28
Once merged, we'll switch back to the crates.io release.

Made-with: Cursor
@gregoire22enpc
Copy link
Copy Markdown
Contributor Author

Agreed — I've reworked this PR per your feedback:

What changed:

  • Removed segment_speech(), SpeechSegment, load_audio(), and the quiet-audio retry logic entirely

  • Removed ort and ndarray as direct dependencies

  • Applied the fix upstream in a patched fork of pyannote-rs with two changes:

    1. x as f32 / 32768.0 (normalization fix)
    2. Trailing speech flush at end-of-audio
  • Cargo.toml now points to the fork via git rev

The diff is now 2 files changed, 6 insertions, 15 deletions — just the dependency pointer.

Upstream PR: thewh1teagle/pyannote-rs#28 — once that lands, we switch back to crates.io.

Quiet-audio retry: deferred to a follow-up PR as suggested.

Ready for re-review.

@silverstein , I'm afraid the patch will be very slow to merge. Last release is 2 years old, and there are multiple old PRs open.
The fork approach seems shaky to me - let me know what you think

Copy link
Copy Markdown
Owner

@silverstein silverstein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merging this. The fork approach is the right call for unblocking enrollment now.

On the upstream concern you raised: I filed #79 to track it. The plan is to give thewh1teagle/pyannote-rs#28 about 30 days. If it doesn't move, we vendor pyannote-rs into the repo so we own the code directly and eliminate the git dependency. Either way, you won't be maintaining a fork long-term.

Thanks for narrowing this to just the dependency pointer. Clean and minimal.

@silverstein silverstein merged commit d9595b3 into silverstein:main Apr 6, 2026
6 of 7 checks passed
@gregoire22enpc gregoire22enpc deleted the fix/speech-segmentation branch April 6, 2026 10:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants