Add Bengali (Bangla) Language Support to Parakeet V3 / ASR Models #15653

@Istiyaq-Khan

Is your feature request related to a problem? Please describe.
The Parakeet V3 ASR model currently supports 25 European languages but not Bengali (Bangla). Bengali is the 7th most spoken language in the world, with over 270 million speakers. Developers building lightweight, offline speech-to-text applications, particularly on CPU-only hardware, benefit greatly from Parakeet V3's faster inference compared with Whisper. Without Bengali support, they have to rely on slower models.

Describe the solution you'd like
I would like the NeMo research team to include Bengali (Bangla) in future multilingual training pipelines for Parakeet, or to release a fine-tuned Parakeet V3 checkpoint that supports Bengali transcription.

Describe alternatives you've considered
I currently have to use whisper.cpp (Whisper Medium/Small) to run Bengali ASR on low-end hardware. It is accurate, but inference is much slower than with Parakeet V3's architecture.

Additional context: Open-Source Bengali Datasets
To assist with data sourcing, here are several high-quality, open-source Bengali speech datasets already available for training:

  1. OpenSLR (SLR53) - Large Bengali ASR training data set: ~196K transcribed utterances (License: CC BY-SA 4.0).
    Link: https://www.openslr.org/53/
  2. Mozilla Common Voice (bn): Growing dataset of crowd-sourced Bengali speech.
    Link: https://commonvoice.mozilla.org/en/datasets
  3. Shrutilipi (AI4Bharat): Large corpus of labeled Indian-language audio, including a substantial amount of Bengali.
    Link: https://ai4bharat.iitm.ac.in/shrutilipi/
  4. OpenSLR (SLR37) - High quality TTS data for Bengali: Multi-speaker data for Bangladesh (bn-BD) and Indian Bengali (bn-IN).
    Link: https://www.openslr.org/37/
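
To show how one of these datasets might be prepared for NeMo training, here is a minimal sketch that converts a transcript TSV into NeMo-style JSON-lines manifest entries (`audio_filepath`, `duration`, `text`). The TSV layout (utterance ID, speaker ID, text), the flat audio directory, the `.flac` extension, and the helper name `tsv_to_manifest_lines` are all assumptions for illustration; the duration lookup is passed in by the caller because it depends on the audio library you use.

```python
import csv
import json
from typing import Callable, Iterable, Iterator


def tsv_to_manifest_lines(
    tsv_lines: Iterable[str],
    audio_dir: str,
    get_duration: Callable[[str], float],
    audio_ext: str = ".flac",
) -> Iterator[str]:
    """Yield one JSON manifest line per (utterance_id, speaker_id, text) row.

    Hypothetical helper: the column order and the flat directory layout
    are assumptions and may need adjusting for a given dataset.
    """
    reader = csv.reader(tsv_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip malformed rows and rows with an empty transcript.
        if len(row) < 3 or not row[2].strip():
            continue
        utt_id, text = row[0], row[2].strip()
        path = f"{audio_dir.rstrip('/')}/{utt_id}{audio_ext}"
        entry = {
            "audio_filepath": path,
            "duration": round(get_duration(path), 3),  # seconds
            "text": text,
        }
        # ensure_ascii=False keeps Bengali script readable in the manifest.
        yield json.dumps(entry, ensure_ascii=False)
```

For example, `list(tsv_to_manifest_lines(open("transcripts.tsv", encoding="utf-8"), "/data/bn", my_duration_fn))` would produce the lines to write to a manifest file, where `transcripts.tsv` and `my_duration_fn` are placeholders for your own transcript file and duration function.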

Thank you for considering this request and for the continued excellent work on the NeMo project.
