Awesome-Odia-AI

A curated list of AI-related resources for the Odia language.

Table of Contents

NLP
Audio
Computer Vision
- OCR
- Multimodal
Applications
Events
Community
- Policy & Ecosystem

NLP

Translation

Sua: Machine translation from English to Odia. [dataset][code]
IndicTrans: [paper] [web]
IndicTrans2: [paper][web][code]
WAT 2025 NLLB-200 Fine-tuned Model: A 3.3B-parameter NLLB-200 neural machine translation model, fine-tuned and evaluated on WAT 2025. [model]
NLTM-EILMT Bhashini Translation API System: Bhashini-backed open-source repository providing bidirectional REST API endpoints for English-Odia machine translation. [code]

Transliteration

IndicXlit: [paper][web][code][Demo][PyPi]
Open-source Unicode converter for transliterating text from various languages into Odia. Demo
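
To illustrate how Unicode-level script conversion can work, here is a minimal sketch. It is not the method used by the tools listed above. It relies on the fact that the Unicode Devanagari block (U+0900 to U+097F) and the Odia (Oriya) block (U+0B00 to U+0B7F) share a parallel layout, so many characters can be mapped by a fixed code point offset. The function name `devanagari_to_odia` is hypothetical, and real transliteration systems handle many cases (nukta forms, script-specific letters, phonetic differences) that this sketch ignores.

```python
import unicodedata

DEVANAGARI_START = 0x0900  # first code point of the Devanagari block
ODIA_START = 0x0B00        # first code point of the Odia (Oriya) block
BLOCK_SIZE = 0x80          # both blocks span 128 code points


def devanagari_to_odia(text: str) -> str:
    """Shift Devanagari characters into the Odia block (hypothetical helper)."""
    out = []
    for ch in text:
        cp = ord(ch)
        if DEVANAGARI_START <= cp < DEVANAGARI_START + BLOCK_SIZE:
            candidate = chr(cp - DEVANAGARI_START + ODIA_START)
            # Some positions are unassigned in the Odia block (category "Cn"),
            # for example the danda, which Odia shares with Devanagari.
            # In that case the original character is kept unchanged.
            if unicodedata.category(candidate) != "Cn":
                out.append(candidate)
            else:
                out.append(ch)
        else:
            # Characters outside the Devanagari block pass through untouched.
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(devanagari_to_odia("भारत"))  # prints ଭାରତ
```

The offset approach only changes the script, not the spelling conventions, so its output should be treated as a rough first pass rather than a finished transliteration.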

Language Understanding

Datasets

IndicCorp: Large sentence-level monolingual corpora for 11 Indian languages and Indian English containing 8.5 billion words (250 million sentences) from multiple news domain sources. [paper][code][web]
Naamapadam: Training and evaluation datasets for named entity recognition in multiple Indian languages. [paper][huggingface][web]
IndicCorp v2: The largest collection of texts for Indic languages, consisting of 20.9 billion tokens, of which 14.4B tokens correspond to 23 Indic languages and 6.5B tokens to Indian English content curated from Indian websites. [paper][code]
L3Cube-IndicSQuAD (Odia): An extractive question answering dataset with over 118,000 samples. [dataset]

Language Model

Language Model : Pretrained Odia Language Model.
BertOdia : Bert-based Odia Language Model.
IndicBERT: Multilingual, compact ALBERT language model trained on IndicCorp covering 11 major Indian languages and English. Small model (18 million parameters) that is competitive with large LMs for Indian language tasks. [paper][code][web]
IndicNER: Named Entity Recognizer models for multiple Indian languages. The models are trained on the Naamapadam NER dataset mined from Samanantar parallel corpora. [paper][huggingface][web]
IndicBERTv2: Language model trained on IndicCorp v2 with competitive performance on IndicXTREME [paper][code][web]
Oriya-BERT-SQuAD: A 0.2B-parameter OdiaBERT model fine-tuned on the IndicSQuAD dataset. [model]

Word Embedding

FastText (CommonCrawl + Wikipedia) : Pretrained word vectors trained on Common Crawl and Wikipedia using fastText. Select the language "oriya" from the model list.
FastText (Wikipedia) : Pretrained word vectors trained on Wikipedia using fastText. Select the language "oriya" from the model list.
IndicFT: Word embeddings for 11 Indian languages trained on IndicCorp. The embeddings are based on the fastText model and are well suited for the morphologically rich nature of Indic languages. [paper][code][web]

Morphanalyzers

IndicNLP Morphanalyzers : Unsupervised morphanalyzers for 10 Indian languages, including Odia, learned using Morfessor.

Language Generation

Instruction Set

Odia master data llama2: This dataset contains 180k Odia instruction sets translated from open-source instruction sets and Odia domain knowledge instruction sets.
Odia context 10k llama2 set: This dataset contains 10K instructions that span various facets of Odisha's unique identity. The instructions cover a wide array of subjects, ranging from the culinary delights in 'RECIPES,' the historical significance of 'HISTORICAL PLACES,' and 'TEMPLES OF ODISHA,' to the intellectual pursuits in 'ARITHMETIC,' 'HEALTH,' and 'GEOGRAPHY.' It also explores the artistic tapestry of Odisha through 'ART AND CULTURE,' which celebrates renowned figures in 'FAMOUS ODIA POETS/WRITERS', and 'FAMOUS ODIA POLITICAL LEADERS'. Furthermore, it encapsulates 'SPORTS' and the 'GENERAL KNOWLEDGE OF ODISHA,' providing an all-encompassing representation of the state.
Roleplay Odia: This dataset contains a 1k Odia role-play instruction set in conversation format.
OdiEnCorp translation instructions 25k: This dataset contains a 25k English-to-Odia translation instruction set.
Odia Reasoning Datasets (GU, OD, OpenThoughts): A suite of validated datasets designed to move LLMs from basic conversational generation to step-by-step logical reasoning.

Pre-train Dataset

CulturaX: A multilingual dataset containing monolingual data for several Indic languages (Hindi, Bangla, Tamil, Malayalam, Marathi, Telugu, Kannada, Gujarati, Punjabi, Odia, Assamese, etc.). Paper
Varta: The dataset contains 41.8 million news articles in 14 Indic languages and English, crawled from DailyHunt, a popular news aggregator in India that pulls high-quality articles from multiple trusted and reputed news publishers.

Foundation LLM

Qwen 1.5 Odia 7B: A pre-trained Odia large language model with 7 billion parameters, based on Qwen 1.5-7B. The model is pre-trained on the Culturex-Odia dataset, a filtered version of the original CulturaX dataset for Odia text. As per the authors, it is a base model and not meant to be used as is; they recommend first fine-tuning it on downstream tasks. Blog
Odia-Gemma-2B-Base : Pre-trained 2B-parameter decoder-only Odia LLM (Gemma 2B based).

Fine-Tuned LLM

Odia llama2 7B base: odia_llama2_7B_base is based on Llama2-7b and fine-tuned on a 180k Odia instruction set. Paper
Llama3_8B_Odia_Unsloth & Llama 3.x R1 Scalable Series: Advanced Llama-3 based generative models optimized using the Unsloth library with 4-bit bnb quantization.
odia-t5-base : Multilingual Text-to-Text Transformer (mT5-based) fine-tuned for Odia translation, summarization, and QA.

Benchmarking Set

Airavata Evaluation Suite: A collection of benchmarks used for evaluation of Airavata, a Hindi instruction-tuned model on top of Sarvam's OpenHathi base model.
Indic LLM Benchmark: A collection of LLM benchmark data in Gujarati, Nepali, Malayalam, Hindi, Telugu, Marathi, Kannada, Bengali.
MILU: Multi-task Indic Language Understanding: A culturally grounded evaluation framework featuring 4,525 verified Odia questions spanning 41 domains.

Text Dataset

Parallel Translation Corpus

OdiEnCorp 2.0 : This dataset contains 97K English-Odia parallel sentences and was used in the WAT 2020 Odia-English machine translation task. Paper
OPUS Corpus : It contains parallel sentences pairing other languages with Odia. The data are domain-specific and noisy.
OdiEnCorp 1.0 : This dataset contains 30K English-Odia parallel sentences. Paper
IndoWordnet Parallel Corpus : Parallel corpora mined from IndoWordNet gloss and/or examples for Indian-Indian language corpora (6.3 million segments, 18 languages including Odia). Paper
PMIndia : Parallel corpus for En-Indian languages mined from Mann ki Baat speeches of the PM of India. It contains 38K English-Odia parallel sentences. Paper
CVIT PIB : Parallel corpus for En-Indian languages mined from the Press Information Bureau website of India. It contains 60K English-Odia parallel sentences.
Samanantar : The largest publicly available parallel corpora collection for Indic languages. The corpus has 49.6M sentence pairs between English and Indian languages.
BPCC : A comprehensive, publicly available parallel corpus containing a mix of human-labelled data and automatically mined data, totaling approximately 230 million bitext pairs. [Paper]
English–Odia Entertainment Parallel Corpus : Domain-specific English–Odia sentence-aligned parallel corpus (50k+ pairs) focused on entertainment content.

Monolingual Corpus

Odia News Corpus : Odia monolingual news corpus of more than 1.5 GB. Dataset
EMILLE Corpus : It contains fourteen monolingual corpora for Indian languages including Odia. Manual
OdiEnCorp 1.0 : This dataset contains 221K Odia sentences. Paper
AI4Bharat-IndicNLP Corpus : The text corpus is not available now (it will be available later). 3.5M Odia sentences were used to build the embeddings. Vocabulary frequency files are available. Paper
OSCAR Corpus : It contains around 300K Odia sentences.
Large Odia LLM Pre-training Corpus (pre_train_odia_data_processed): A uniformly processed, deduplicated collection of Odia text data totaling 6.4 GB.
IndicDialogue (Odia Subtitles) : Large subtitle and dialogue corpus from OpenSubtitles in 10 Indic languages including Odia, with 6.8M+ Odia dialogues.
Odia-data-collection : Aggregated Odia text dataset on Hugging Face created for language training.

Lexical Resources

IndoWordNet : Wordnet for Indian languages including Odia.

POS Tagged corpus

Indian Language Corpora Initiative : It contains parallel annotated corpora in 12 Indian languages including Odia (tourism and health domain).
Odia Treebank : The treebank contains approx. 1082 tokens (100 sentences) in Odia. Paper

Dialect Detection corpus

Odia-Santali Dialect Detection Corpus : This corpus contains text data of Odia and Santali written in Odia script.

Text Classification

Odia News Article Classification : This dataset contains approximately 19,000 news article headlines collected from Odia news websites. The labeled dataset is split into training and test sets suitable for supervised text classification.
AI4Bharat IndicNLP News Articles : This dataset comprises news articles and their categories for 9 languages including Odia. For Odia, it has 4 classes (business, crime, entertainment, sports), and each class contains 7.5K news articles. The dataset is balanced across classes. Paper
MTEB: OdiaNewsClassification : A curated 3-class classification benchmark comprising over 17,200 news articles, integrated into MTEB.
Odia Sentiment MuRIL v4 : A sentiment classifier built on MuRIL.

NLP Libraries / Tools

Indic NLP Library : A Python-based NLP library for Indian language text processing, including Odia.
Odia Romanization Script : The Perl script "odiaroman" maps Odia script to Latin.
OpenOdia : Tools for Odia language

Other NLP Resources

TDIL : It contains language applications, resources, and tools for Indian languages including Odia, such as an Odia terminology application, an Odia language search engine, a wordnet, an English-Odia parallel text corpus, English-Odia machine-assisted translation, text-to-speech software, and many more.

Audio

Speech Recognition

IndicWav2Vec-Odia : An acoustic model that applies the self-supervised wav2vec 2.0 framework to unannotated Odia speech.
wav2vec2-large-xlsr-53-odia : Fine-tuned Wav2Vec2-Large-XLSR-53 model for Odia ASR.
whisper-small-odia-finetuned : Whisper-small model fine-tuned with LoRA on an Odia-English bilingual ASR dataset.
Olive_Odia_ASR : OdiaGenAI toolkit and scripts for fine-tuning and serving Whisper-based Odia ASR models.

Text-to-Speech

Indic-TTS: [Paper][Code][Try It Live][Video]

Speech Dataset

IIT Madras IndicTTS : The Indic TTS project develops the text-to-speech (TTS) synthesis system for Indian languages including Odia. The database contains spoken sentences/utterances recorded by both Male and Female native speakers.
LDC-IL : It includes Odia annotated speech corpora with the voices of 450 different native speakers.
Mozilla Common Voice : The Mozilla Common Voice project is a community-led project to build a large multilingual dataset for speech recognition.
Odia text to speech dataset : 55,000 Odia words pronounced in various dialects of Odisha, such as Baleswari, Puri, and Cuttack.
Odia ASR Benchmark Dataset for Noisy Speech Recognition : About 500 MB of CC-BY-4.0 data.
ODEN-speech Corpus : A 462-hour speech corpus merged from 8 separate corpora and standardized to 16 kHz mono audio (51.9 GB).
Odia General Conversation Speech Dataset : Real-world, unscripted conversational Odia speech with detailed metadata.
Odia Speech Datasets Collection : Commercial suite of Odia speech datasets for ASR, TTS, and voice-assistant training.

Computer Vision

OCR

Indic-OCR : OCR tools for Indic scripts including Odia. Also supports Ol Chiki (Santali).
IndicPhotoOCR (Scene Text Recognition) : An end-to-end scene text recognition toolkit for reading Odia script in visually noisy, real-world images.
DocuExtract: Odia Handwritten OCR : A handwritten-text OCR system that uses progressive vision-language fine-tuning to transcribe irregular handwritten Odia text.
Odia Lipi OCR Data : Open Odia OCR benchmark dataset of scanned Odia text with validated Unicode annotations.
Odia OCR Merged Multi-Source Dataset : Merged OCR dataset combining OdiaGenAIOCR and other sources (≈192k samples) for printed and handwritten Odia OCR.
odia-ocr-qwen-finetuned : Production-ready Qwen2.5-VL-3B-Instruct VLM fine-tuned on 58,720 Odia text-image pairs for robust OCR.
odia-ocr-qwen-finetuned_v2 : Updated Odia OCR Qwen model trained on ~73k word-level Odia crops spanning diverse fonts and print qualities.

Multimodal

Odia Visual Genome (OVG) : Multimodal English-Odia dataset of Visual Genome image-caption pairs.
Odia Image Captioning Dataset : FutureBeeAI multimodal dataset of diverse images with Odia captions and rich metadata.
odia-llava-dataset : Hugging Face dataset of Odia instruction-style image-text pairs suitable for training LLaVA-style multimodal Odia models.

Applications

Odia Lingua Chatbot : AI-powered Odia language chatbot built with Groq API, Odia TTS (MMS-TTS-ory), and Google Search integration.

Events

Global Conference: ICON 2024 | WAT 2025 | Odisha AI Summit 2025 & Conference 2024 | 2023 Pt 2 | 2023 Pt 1 | 2022 | 2021 | 2020
Summer School / Workshop: OdiaGenAI Generative AI Workshops 2025 | 2022
