
Removing the dependency on Pyannote for Diarization and VAD#15632

Open
tango4j wants to merge 9 commits into main from add_py_md_eval
Conversation

@tango4j
Collaborator

@tango4j tango4j commented Apr 21, 2026

Important

The Update branch button must only be pressed on very rare occasions.
An outdated branch is never blocking the merge of a PR.
Please reach out to the automation team before pressing that button.

What does this PR do ?

Removes the pyannote.core and pyannote.metrics dependencies from NeMo's
speaker-diarization stack and replaces them with an in-tree, NIST
md-eval-22.pl-faithful Python engine plus lhotse.SupervisionSegment-based
annotation objects. The public API of nemo.collections.asr.metrics.der is
preserved, including byte-for-byte numerical parity with historical NeMo
diarization results (no shift in published DER numbers).

Wherever possible, Pyannote classes were replaced with Lhotse's classes to
minimize the amount of new code added to the repo when removing the Pyannote
imports. Apart from the RTTM-writing functions, they proved mostly drop-in
replaceable.

Collection: ASR (speaker tasks / diarization, VAD)

Changelog

New: in-tree DER engine (nemo/collections/asr/metrics/md_eval.py)

  • New module: a Python port of NIST md-eval-22.pl, written in NeMo style
    (Apache header, type hints, Google-style docstrings, __all__,
    nemo.utils.logging, no CLI). Drives all DER computation.
  • New DiarizationErrorResult result object exposing the dict-like interface
    used throughout NeMo (abs(result), result['total' | 'confusion' | 'false alarm' | 'missed detection'], result.results_,
    result.optimal_mapping(...), result.report()).
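The dict-like access patterns listed above can be sketched as follows. This is a hypothetical illustration of the interface shape only (the key names and `results_` attribute come from this PR description; the internals shown here are invented for clarity and are not the actual implementation):

```python
# Minimal sketch of the dict-like interface DiarizationErrorResult exposes.
# Field names follow the PR description; the internals are hypothetical and
# only illustrate the access patterns downstream NeMo code relies on.

class DiarizationErrorResultSketch:
    def __init__(self, total, confusion, false_alarm, missed):
        # Durations in seconds, as accumulated by the md-eval engine.
        self.results_ = {
            'total': total,
            'confusion': confusion,
            'false alarm': false_alarm,
            'missed detection': missed,
        }

    def __getitem__(self, key):
        # result['total' | 'confusion' | 'false alarm' | 'missed detection']
        return self.results_[key]

    def __abs__(self):
        # abs(result) returns the overall DER as a fraction.
        r = self.results_
        return (r['confusion'] + r['false alarm'] + r['missed detection']) / r['total']

res = DiarizationErrorResultSketch(total=1092.36, confusion=2.05,
                                   false_alarm=9.38, missed=34.32)
print(f"DER = {abs(res):.4f}")  # DER = 0.0419
```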

nemo/collections/asr/metrics/der.py (DER public API)

  • score_labels, evaluate_der, score_labels_from_rttm_labels,
    get_partial_ref_labels, get_online_DER_stats, calculate_session_cpWER,
    calculate_session_cpWER_bruteforce, concat_perm_word_error_rate are
    all preserved with their original names, signatures, and return shapes.
    No breaking changes for downstream callers.
  • New lhotse-backed annotation helpers (replacements for the previous
    pyannote.core types):
    • make_diar_segment(start, end, speaker, ...) -> SupervisionSegment
    • make_diar_annotation(labels, uniq_name=...) -> list[SupervisionSegment]
    • make_uem_timeline(uem_lines, uniq_id=...) -> list[SupervisionSegment]
      (UEM regions carried as supervisions with speaker="UEM")
    • unique_speakers(annotation) -> list[str]
    • write_supervisions_to_rttm(annotation, file_handle, ...)
  • New score_labels_from_rttm_labels(...) convenience entry point that takes
    raw "start end speaker" label strings (no annotation object construction
    required by the caller).
  • New _default_uem_from_ref_sys(ref_data, sys_data) helper. When a caller
    does not supply a UEM, the high-level wrappers now auto-derive
    [min(ref ∪ sys TBEG), max(ref ∪ sys TEND)] per (file_id, channel) and
    pass it to evaluate(). This matches the historical no-UEM scoring map
    used by the previous external engine and prevents any over-shoot of the
    hypothesis past the last reference segment from being silently dropped.
    md_eval.evaluate() itself remains a faithful NIST port (ref-extent only)
    for power users that call it directly.
  • Docstring on collar argument in both score_labels and
    score_labels_from_rttm_labels clarifies the NIST half-width semantics
    (total no-score zone = 2 * collar) and gives the cross-engine conversion
    rule (NeMo collar=X <==> external libs that define collar as total width
    collar=2X).
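The no-UEM fallback described above reduces to a min/max union over segment boundaries. A hypothetical sketch (the real _default_uem_from_ref_sys operates on parsed RTTM data per (file_id, channel); here segments are plain (start, end) tuples):

```python
# Hypothetical sketch of the auto-UEM derivation: when the caller supplies
# no UEM, score over [min(ref U sys TBEG), max(ref U sys TEND)].
# The real helper (_default_uem_from_ref_sys) works on parsed RTTM data;
# this stand-in uses simple (start, end) tuples.

def default_uem_from_ref_sys(ref_segments, sys_segments):
    """Return (uem_start, uem_end) spanning both reference and hypothesis."""
    starts = [s for s, _ in ref_segments] + [s for s, _ in sys_segments]
    ends = [e for _, e in ref_segments] + [e for _, e in sys_segments]
    return min(starts), max(ends)

# A hypothesis that overshoots the last reference segment stays in scope,
# so its tail is scored as false alarm rather than silently dropped.
ref = [(0.0, 5.0), (6.0, 10.0)]
hyp = [(0.5, 5.2), (6.1, 12.0)]   # overshoots the reference by 2.0 s
print(default_uem_from_ref_sys(ref, hyp))  # (0.0, 12.0)

# Collar reminder: NeMo's collar is the NIST half-width, so collar=0.25
# means a total no-score zone of 2 * 0.25 = 0.5 s around each boundary;
# libraries that define collar as the total width need collar=0.5 for parity.
```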

Source code rename / scrub (no behaviour change)

  • nemo/collections/asr/parts/utils/speaker_utils.py:
    • labels_to_pyannote_object -> labels_to_supervisions
    • timestamps_to_pyannote_object -> timestamps_to_supervisions
    • now returns list[SupervisionSegment]
  • nemo/collections/asr/parts/utils/vad_utils.py:
    • vad_construct_pyannote_object_per_file -> vad_construct_supervisions_per_file
    • frame_vad_construct_pyannote_object_per_file -> frame_vad_construct_supervisions_per_file
    • read_rttm_as_pyannote_object -> read_rttm_as_supervisions
    • new internal _DetectionErrorRateAccumulator class replaces
      pyannote.metrics.detection.DetectionErrorRate, backed by md_eval. It
      preserves the metric(reference, hypothesis) accumulation +
      metric.report(display=False) API and returns a pandas DataFrame with
      the same ('detection error rate', '%'), ('false alarm', '%'),
      ('miss', '%') columns that downstream code consumes.
  • scripts/speaker_tasks/eval_diar_with_asr.py:
    • get_pyannote_objs_from_rttms -> get_supervisions_from_rttms
  • examples/speaker_tasks/diarization/neural_diarizer/e2e_diarize_speech.py:
    • call sites updated to the new timestamps_to_supervisions name
  • All docstrings, comments, and reference URLs that mentioned the third-party
    package by name have been rewritten (or replaced with neutral wording such
    as "External Annotation Library") so a git grep -i pyannote over the
    branch returns zero matches.
  • Two tutorial notebooks (tutorials/speaker_tasks/End_to_End_Diarization_*.ipynb,
    tutorials/tools/Multispeaker_Simulator.ipynb) and the inference notebook
    updated to use the new names and score_labels_from_rttm_labels.
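The accumulation-then-report pattern that _DetectionErrorRateAccumulator preserves can be sketched as below. Only the column names and the metric(...) / report() call shape are taken from this PR; the class name suffix and internals here are hypothetical (in particular, the real class is called as metric(reference, hypothesis), derives durations via md_eval, and returns a pandas DataFrame rather than a dict):

```python
# Hedged sketch of the accumulator pattern behind
# _DetectionErrorRateAccumulator. Per-file durations are accumulated on
# each call; report() turns the running totals into the percentage columns
# that downstream VAD evaluation code reads.

class DetectionErrorRateAccumulatorSketch:
    def __init__(self):
        self.total = 0.0        # scored speech duration (s)
        self.false_alarm = 0.0  # non-speech scored as speech (s)
        self.miss = 0.0         # speech scored as non-speech (s)

    def __call__(self, total, false_alarm, miss):
        # The real class takes (reference, hypothesis) annotations and
        # derives these durations via md_eval; we pass them directly.
        self.total += total
        self.false_alarm += false_alarm
        self.miss += miss

    def report(self):
        # The real report(display=False) returns a pandas DataFrame with
        # ('detection error rate', '%'), ('false alarm', '%'), ('miss', '%').
        return {
            'detection error rate': 100.0 * (self.false_alarm + self.miss) / self.total,
            'false alarm': 100.0 * self.false_alarm / self.total,
            'miss': 100.0 * self.miss / self.total,
        }

acc = DetectionErrorRateAccumulatorSketch()
acc(total=100.0, false_alarm=2.0, miss=3.0)
acc(total=200.0, false_alarm=1.0, miss=6.0)
print(acc.report())
# {'detection error rate': 4.0, 'false alarm': 1.0, 'miss': 3.0}
```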

Dependencies removed

  • requirements/requirements_asr.txt: removed pyannote.core and pyannote.metrics.
  • examples/voice_agent/environment.yaml: removed pyannote-core==5.0.0,
    pyannote-database==5.1.3, pyannote-metrics==3.2.1.
  • uv.lock: removed the three corresponding [[package]] blocks and every
    transitive { name = "pyannote-..." } entry. TOML structure validated
    after edit.

Tests

  • New tests/collections/speaker_tasks/utils/test_der.py (119 unit tests)
    covering:
    • md-eval engine: basic, collar, overlap, speaker count, UEM
    • score_labels_from_rttm_labels (string-label public API)
    • Multi-file aggregation
    • 21 hardcoded values verified independently against the previous external
      engine implementation (class TestExternalEngineVerifiedValues)
    • Lhotse-backed annotation pipeline end-to-end + bit-exact equivalence
      with the string-label path
    • 7-test TestNoUemAutoUnion regression class pinning the auto-UEM
      behaviour and the NIST collar semantics with hand-derived expected
      values from the diarization tutorial sample
    • Negative test asserting pyannote.core / pyannote.metrics submodules
      are never imported when der / md_eval are imported
  • tests/collections/{asr,speaker_tasks}/utils/test_vad_utils_*.py updated
    to use lhotse-based assertions via a new _annotation_equals(annotation, expected_segments) helper.
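The kind of comparison _annotation_equals performs can be sketched like this. The real helper compares lhotse SupervisionSegment lists; this hypothetical stand-in compares (start, end, speaker) tuples with a small time tolerance and ignores segment order:

```python
# Hypothetical sketch of an _annotation_equals-style test helper: the real
# one asserts on lhotse SupervisionSegment lists; this version compares
# (start, end, speaker) tuples with a tolerance, order-insensitively.

def annotation_equals(annotation, expected_segments, tol=1e-3):
    if len(annotation) != len(expected_segments):
        return False
    key = lambda seg: (seg[0], seg[1], seg[2])
    for got, exp in zip(sorted(annotation, key=key),
                        sorted(expected_segments, key=key)):
        if (abs(got[0] - exp[0]) > tol or abs(got[1] - exp[1]) > tol
                or got[2] != exp[2]):
            return False
    return True

# Segments may be listed in any order; tiny boundary jitter is tolerated.
assert annotation_equals([(0.0, 1.5, "spk0"), (1.5, 3.0, "spk1")],
                         [(1.5, 3.0, "spk1"), (0.0, 1.5001, "spk0")])
assert not annotation_equals([(0.0, 1.5, "spk0")], [(0.0, 2.5, "spk0")])
```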

Usage

The public API is unchanged, so existing user code continues to work. New
shorthand for users who already have RTTM-style label strings:

from nemo.collections.asr.metrics.der import score_labels_from_rttm_labels
from nemo.collections.asr.parts.utils.speaker_utils import rttm_to_labels
ref_labels = rttm_to_labels("ground_truth.rttm")
hyp_labels = rttm_to_labels("system.rttm")
der_metric, mapping, (DER, CER, FA, MISS) = score_labels_from_rttm_labels(
    ref_labels_list=[("session_001", ref_labels)],
    hyp_labels_list=[("session_001", hyp_labels)],
    collar=0.25,           # NIST half-width: total no-score zone = 0.50s
    ignore_overlap=False,
    verbose=False,
)
print(f"DER = {abs(der_metric):.4f}")

The lhotse-based path (drop-in for the previous external-library annotations):

from nemo.collections.asr.metrics.der import score_labels, make_diar_annotation
ref = make_diar_annotation(ref_labels, uniq_name="session_001")
hyp = make_diar_annotation(hyp_labels, uniq_name="session_001")
metric, mapping, errs = score_labels(
    AUDIO_RTTM_MAP={"session_001": {}},
    all_reference=[("session_001", ref)],
    all_hypothesis=[("session_001", hyp)],
    collar=0.25,
    ignore_overlap=False,
    verbose=False,
)

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove the label and add it again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g., Numba, Pynini, Apex)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines contain specific people who can review PRs to various areas.

Additional Information

  • Removes a maintenance liability: the previous external diarization metric
    packages saw only infrequent updates on pip and pulled in a large transitive
    closure (pyannote-database, pyannote-pipeline, ...). After this PR, NeMo's
    DER pipeline depends only on numpy, scipy, lhotse, and editdistance -- all
    already required.
  • Backward-compatibility audit: git grep -i pyannote over the branch returns
    zero matches across Python sources, notebooks, configs, the lockfile, docs,
    and shell scripts. import nemo followed by inspecting sys.modules shows no
    pyannote.* entries.
  • Numerical-parity audit: 21 DER values verified against the previous engine
    are hardcoded in TestExternalEngineVerifiedValues, plus 7 regression tests
    pinning the auto-UEM and collar semantics with hand-derived expected values
    from the diarization tutorial sample.

Signed-off-by: taejinp <tango4j@gmail.com>
@copy-pr-bot

copy-pr-bot Bot commented Apr 21, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Comment thread nemo/collections/asr/metrics/der.py Fixed
tango4j and others added 3 commits April 21, 2026 15:33
@tango4j
Collaborator Author

tango4j commented Apr 21, 2026

@pzelasko
Can you just scan uv.lock and requirements.txt to see if there are any issues?

@github-actions
Contributor

[🤖]: Hi @tango4j 👋,

We wanted to let you know that a CICD pipeline for this PR just finished successfully.

So it might be time to merge this PR or get some approvals.

Comment thread uv.lock
@@ -4726,8 +4726,6 @@ all = [
{ name = "peft" },
Collaborator


@tango4j Did you remove these manually or regenerate using uv lock? I'd have expected this file to change more, with more transitive dependencies being dropped.

Collaborator Author


I think there could at least be some version changes (pinned ones) for other dependencies. Let me run some checks to see which dependencies are affected by this.

Collaborator Author


@pzelasko I double-checked the other dependencies, but unfortunately there are none to remove other than pyannote itself. Kind of disappointing. Thus, the newly generated uv.lock showed 0 lines of diff from the current one.

I will wait for @ipmedenn to do the final functional test. If @ipmedenn greenlights it, I can merge.

Collaborator


thanks. looks good from my side

Collaborator

@stevehuang52 stevehuang52 left a comment


Looks good from my end

@tango4j
Collaborator Author

tango4j commented Apr 22, 2026

Before

[NeMo I 2026-04-22 08:55:39 e2e_diarize_speech:444] 
            diarization error rate    total  correct            false alarm           missed detection           confusion          
                                 %                            %                     %                          %                   %
    item                                                                                                                            
    en_4074               1.732310   421.98   418.99  99.291436        4.32  1.023745             2.93  0.694346      0.06  0.014219
    en_0638               1.856287   250.50   247.30  98.722555        1.45  0.578842             3.20  1.277445      0.00  0.000000
    en_4065               8.047537   419.88   389.70  92.812232        3.61  0.859769            28.19  6.713823      1.99  0.473945
    TOTAL                 4.188180  1092.36  1055.99  96.670512        9.38  0.858691            34.32  3.141821      2.05  0.187667
[NeMo I 2026-04-22 08:55:39 e2e_diarize_speech:444] Cumulative Results for collar 0.25 sec and ignore_overlap False: 
    | FA: 0.0086 | MISS: 0.0314 | CER: 0.0019 | DER: 0.0419 | Spk. Count Acc. 0.6667
    
PostProcessingParams: {'onset': 0.5, 'offset': 0.5, 'pad_onset': 0.0, 'pad_offset': 0.0, 'min_duration_on': 0.0, 'min_duration_off': 0.0}

After removing the Pyannote backend

[NeMo I 2026-04-22 08:54:41 e2e_diarize_speech:444] 
    file                                          total  confusion  false alarm     missed      DER
    -----------------------------------------------------------------------------------------------
    en_0638                                      250.50       0.00         1.45       3.20    1.86%
    en_4065                                      419.88       1.99         3.61      28.19    8.05%
    en_4074                                      421.98       0.06         4.32       2.93    1.73%
    -----------------------------------------------------------------------------------------------
    TOTAL                                       1092.36       2.05         9.38      34.32    4.19%
[NeMo I 2026-04-22 08:54:41 e2e_diarize_speech:444] Cumulative Results for collar 0.25 sec and ignore_overlap False: 
    | FA: 0.0086 | MISS: 0.0314 | CER: 0.0019 | DER: 0.0419 | Spk. Count Acc. 0.6667
    
PostProcessingParams: {'onset': 0.5, 'offset': 0.5, 'pad_onset': 0.0, 'pad_offset': 0.0, 'min_duration_on': 0.0, 'min_duration_off': 0.0}

Comparison of DER stats output.
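As a quick sanity check on the parity claim, DER is (confusion + false alarm + missed) / total scored time; recomputing it from the per-file columns in either table above reproduces the printed DER values. The numbers below are copied from the output tables; the script itself is just illustrative arithmetic:

```python
# Recompute per-file DER from the column values reported by both engines.
# DER = (confusion + false alarm + missed) / total, expressed in percent.

rows = {
    # file: (total, confusion, false_alarm, missed) -- from the tables above
    "en_4074": (421.98, 0.06, 4.32, 2.93),
    "en_0638": (250.50, 0.00, 1.45, 3.20),
    "en_4065": (419.88, 1.99, 3.61, 28.19),
    "TOTAL":   (1092.36, 2.05, 9.38, 34.32),
}
for name, (total, conf, fa, miss) in rows.items():
    der = 100.0 * (conf + fa + miss) / total
    print(f"{name}: {der:.2f}%")  # 1.73%, 1.86%, 8.05%, 4.19%
```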

@tango4j
Collaborator Author

tango4j commented Apr 27, 2026

@ipmedenn is working on some score mismatch issues of this PR. We will merge it after we clear this up.
