Improve ASR models' invariance to padding/batch size #13827
Signed-off-by: Piotr Żelasko <petezor@gmail.com>
nithinraok
left a comment
LGTM. Checked with parakeet models as well.
@pytest.mark.skip(reason="Used only for debugging.")
@pytest.mark.parametrize("length", [16000])
def test_canary_invariant_to_padding(deterministic_rng, length):
    model = ASRModel.from_pretrained("nvidia/canary-180m-flash").eval()
…hub.com/nvidia/nemo into fix-pad-inconsistency-feature-extractor
Signed-off-by: tango4j <tango4j@users.noreply.github.com>
Just commenting for future reference. For Sortformer, Lhotse-based inference is supported but training is not supported yet.
Signed-off-by: taejinp <tango4j@gmail.com>
@pzelasko I checked the diarization unit tests. As long as all unit tests and CI tests pass, I think the change causes no issues for Sortformer diarization.
* Fix feature extractor to be invariant to padding
  Signed-off-by: Piotr Żelasko <petezor@gmail.com>
* fix ci
  Signed-off-by: Piotr Żelasko <petezor@gmail.com>
* preliminary conformer inference parity with/without padding
  Signed-off-by: Piotr Żelasko <petezor@gmail.com>
* fix
  Signed-off-by: Piotr Żelasko <petezor@gmail.com>
* fix tests
  Signed-off-by: Piotr Żelasko <petezor@gmail.com>
* fixes
  Signed-off-by: Piotr Żelasko <petezor@gmail.com>
* fix CI check
  Signed-off-by: Piotr Żelasko <petezor@gmail.com>
* fix to cache-aware models
  Signed-off-by: Piotr Żelasko <petezor@gmail.com>
* fix a bunch of tests
  Signed-off-by: Piotr Żelasko <petezor@gmail.com>
* Fix failing CI tests
  Signed-off-by: Piotr Żelasko <petezor@gmail.com>
* Fix failing CI tests part 2
  Signed-off-by: Piotr Żelasko <petezor@gmail.com>
* Unit test fixes for too short feature extractor inputs
  Signed-off-by: Piotr Żelasko <petezor@gmail.com>
* Resolved feature frame length issue in E2E diarization dataloader
  Signed-off-by: taejinp <tango4j@gmail.com>
* Apply isort and black reformatting
  Signed-off-by: tango4j <tango4j@users.noreply.github.com>
* fix ci
  Signed-off-by: Piotr Żelasko <petezor@gmail.com>
* removed test_ds from YAML file since it is not used
  Signed-off-by: taejinp <tango4j@gmail.com>
* fix diarization unit tests after recent changes
  Signed-off-by: Piotr Żelasko <petezor@gmail.com>
---------
Signed-off-by: Piotr Żelasko <petezor@gmail.com>
Signed-off-by: taejinp <tango4j@gmail.com>
Signed-off-by: tango4j <tango4j@users.noreply.github.com>
Co-authored-by: oliver könig <okoenig@nvidia.com>
Co-authored-by: Charlie Truong <chtruong@nvidia.com>
Co-authored-by: taejinp <tango4j@gmail.com>
Co-authored-by: tango4j <tango4j@users.noreply.github.com>
Signed-off-by: Amir Hussein <amhussein@nvidia.com>
What does this PR do?
Adds tests and fixes inconsistencies in the ASR feature extractor and subsampling when processing the same input with and without padding. Specifically:
audio_length < audio.shape["time"])
As a result, the models' WER varies much less with batch size, although it is still not 100% identical across batch sizes. For example, for parakeet-tdt-0.6b-v2, parakeet-rnnt-1.1b, and canary-180m-flash, the absolute difference between batch sizes 128 and 512 was 0.01% WER.
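The property being fixed can be illustrated with a toy example. The sketch below is not NeMo's feature extractor: it uses a per-utterance mean/variance normalization as a stand-in, and the function names `normalize_naive` and `normalize_masked` are hypothetical. It only shows why statistics computed over padded positions make the output depend on how much padding a batch adds, and how restricting them to the valid length removes that dependence.

```python
from typing import List


def normalize_naive(samples: List[float]) -> List[float]:
    """Mean/variance-normalize over every sample, padding included."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    std = (var + 1e-5) ** 0.5
    return [(x - mean) / std for x in samples]


def normalize_masked(samples: List[float], length: int) -> List[float]:
    """Normalize using statistics from the first `length` samples only.

    Padding positions are reset to zero so they cannot leak into later steps.
    """
    valid = samples[:length]
    mean = sum(valid) / length
    var = sum((x - mean) ** 2 for x in valid) / length
    std = (var + 1e-5) ** 0.5
    normalized = [(x - mean) / std for x in valid]
    return normalized + [0.0] * (len(samples) - length)


audio = [0.1, -0.2, 0.3, 0.4]   # toy "utterance" of 4 samples
padded = audio + [0.0] * 4      # same utterance padded to a batch length of 8

# The naive version changes its output once padding is appended...
naive_differs = normalize_naive(padded)[:4] != normalize_naive(audio)
# ...while the masked version returns identical values for the valid region.
masked_matches = normalize_masked(padded, 4)[:4] == normalize_masked(audio, 4)
print(naive_differs, masked_matches)
```

In a real model the same idea has to hold at every stage that mixes information across time, which is why the PR touches both the feature extractor and subsampling.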
Comparison of all NVIDIA NeMo ASR models on Open ASR Leaderboard (offline only):
I also checked the results on NVTalks for one cache-aware model:
Collection: ASR
Changelog
Usage
# Add a code snippet demonstrating how to use this
GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The contributor guidelines list specific people who can review PRs in various areas.
Additional Information