- Global Benchmarks
- Multilingual Benchmarks
- Supported Languages
- Custom Benchmarks
- Quantization
- Multi-GPU
- Debug Mode
Eka-Eval is the official evaluation pipeline for the EKA project, designed to provide reliable, reproducible, and low-resource multilingual evaluation of LLMs.
It combines:
- Global benchmarks
- Low-resource Multilingual benchmarks
- Long-context evaluation
- Code, math, reasoning, QA
Eka-Eval provides a uniform interface, structured results, and production-ready performance features.
- 30+ Global benchmarks: MMLU, GSM8K, ARC-Challenge, HumanEval, HellaSwag, etc.
- 23 Low-resource Multilingual benchmarks: MMLU-IN, BoolQ-IN, ARC-IN, MILU, Flores-IN, etc.
- Long-context: ZeroSCROLLS, InfiniteBench, Multi-Needle
- Code generation with pass@k
- Math & logical reasoning
- Multilingual evaluation across 11 languages
- 120+ languages
- Smart transliteration
- Per-language scores
- Unified prompt templates
- Multi-GPU distributed evaluation
- 4-bit / 8-bit quantization
- Efficient batching
- Automatic CUDA memory cleanup
- Modular task registry
- Easy custom-benchmark integration
- JSON-based configs
- Clear logging + progress tracking
- CSV summary
- JSONL detailed results
- Per-language metrics
- Error analysis
- Full reproducibility with configuration dump
## Global Benchmarks

| Category | Count | Benchmarks | Metrics |
|---|---|---|---|
| 🌍 Multilingual & Low-Resource Suite | 23 | **Knowledge:** IndicMMLU-Pro, MMLU-IN, TriviaQA-IN, MILU<br>**Reasoning:** HellaSwag-IN, ARC-C-IN, IndicCOPA, XCOPA, GSM8K-IN<br>**Reading & QA:** Belebele, BoolQ-IN, XQuAD-IN, XorQA-IN, Indic-QA<br>**Generation (NLG):** Flores-IN, IndicParaphrase, IndicWikiBio, IndicQuestionGeneration, IndicSentenceSummarization, IndicHeadlineGeneration<br>**NLU:** IndicNER, IndicSentiment, IndicGLUE, XNLI | Accuracy, F1, BLEU, chrF++, ROUGE-L |
| 🧠 Reasoning | 10 | ARC-Challenge, ARC-Easy, HellaSwag, PIQA, SIQA, WinoGrande, OpenBookQA, CommonSenseQA, BBH, AGI-Eval | Accuracy, Normalized Accuracy |
| 📚 Knowledge | 4 | MMLU, MMLU-Pro, TriviaQA, NaturalQuestions | Accuracy, Exact Match |
| 🧮 Math & Code | 7 | **Math:** GSM8K, MATH, GPQA<br>**Code:** HumanEval, MBPP, HumanEval+, MBPP+, PythonSaga | Accuracy, pass@1 |
| 📖 Reading | 3 | SQuAD, QuAC, BoolQ | F1, Exact Match |
| 🛠️ Tool & Context | 6 | **Long Context:** InfiniteBench, ZeroSCROLLS, NeedleInAHaystack<br>**Tool Use:** API-Bank, API-Bench, ToolBench | Retrieval Acc, Success Rate |
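The pass@1 metric in the Math & Code row is a special case of pass@k. For reference, the standard unbiased estimator from the HumanEval paper is sketched below; eka-eval's internal implementation is not shown in this README and may differ:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).

    n = samples generated per problem, c = samples that passed, k = budget.
    """
    if n - c < k:
        # Fewer failures than the budget: at least one pass is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k=1 this reduces to the fraction of passing samples, which is what pass@1 reports.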
## Multilingual Benchmarks

| Benchmark | Description | Metric |
|---|---|---|
| MMLU-IN | Indian-subject knowledge | Accuracy |
| ARC-Challenge-IN | Indian science reasoning | Accuracy |
| BoolQ-IN | Indic yes/no QA | Accuracy |
| MILU | Multilingual Indic understanding | Accuracy |
| Flores-IN | Translation | BLEU, ChrF |
| XQuAD-IN | Reading Comprehension | F1, EM |
## Supported Languages

Arabic (ar), Swahili (sw), Hindi (hi), Bengali (bn), Gujarati (gu), Kannada (kn), Malayalam (ml), Marathi (mr), Odia (or), Punjabi (pa), Tamil (ta), Telugu (te), Assamese (as), Urdu (ur), Indonesian (id), Greek (el), Quechua (qu), Yoruba (yo), Oromo (om), English (en)
```bash
git clone https://github.com/lingo-iitgn/eka-eval.git
cd eka-eval
```

We use Conda to manage Python 3.10 environments to ensure compatibility across macOS, Linux, and Windows. Run this on any system:

```bash
# Create environment with Python 3.10
conda create -n eka-env python=3.10 pip -y

# Activate the environment
conda activate eka-env
```

Choose the option that matches your hardware.

**CPU-only** (clean file without NVIDIA/CUDA packages):

```bash
pip install -r requirements-cpu.txt
```

**GPU** (includes bitsandbytes, CUDA extensions, and quantization support):

```bash
pip install -r requirements-gpu.txt
```

Install the project in editable (`-e`) mode:

```bash
pip install -e .
```

Some models require authentication (e.g., Llama 3, Gemma). Create a token at Hugging Face → Settings → Access Tokens (usually "Read" or "Write") and log in:

```bash
huggingface-cli login
```

To launch an evaluation:

```bash
python3 scripts/run_benchmarks.py
```
eka-eval includes a fully interactive CLI wizard for evaluating models across English and Indic benchmark suites.
To start the wizard, simply run:
```bash
python scripts/run_benchmarks.py
```

This will launch a guided, step-by-step interface.
```text
--- Model Selection ---
1. Hugging Face / Local Model
2. API Model (OpenAI, Anthropic, etc.)

Enter choice: 1
Enter model name: google/gemma-2-2b
```

```text
--- Available Benchmark Task Groups ---
1. CODE GENERATION
2. TOOL USE
3. MATH
4. READING COMPREHENSION
5. COMMONSENSE REASONING
6. WORLD KNOWLEDGE
7. LONG CONTEXT
8. GENERAL
9. LOW-RESOURCE MULTILINGUAL BENCHMARKS
10. ALL Task Groups

Select task group #(s) (e.g., '1', '1 3', 'ALL'): 2 9
```

You can select multiple groups by entering space-separated numbers (e.g., 2 9).

```text
--- Select benchmarks for INDIC BENCHMARKS ---
1. MMLU-IN        4. ARC-Challenge-IN
2. BoolQ-IN       5. ALL
3. Flores-IN      6. SKIP

Select benchmark #(s): 4 5
```

Again, multiple selections are supported.
... Evaluation Complete ...
| Model | Task | Benchmark | Score |
|-------------------|-------------------------|-----------|-------|
| google/gemma-2-2b | MULTILINGUAL BENCHMARKS | ARC-IN | 33.5% |
When prompted:

```text
Do you want to create visualizations for the results? (yes/no): yes
```

The system will generate plots:

```text
✅ Visualizations created successfully!
Saved to: results_output/visualizations
```
```text
eka-eval/
├─ eka_eval/
│  ├─ benchmarks/
│  │  ├─ tasks/
│  │  │  ├─ code/
│  │  │  ├─ math/
│  │  │  ├─ multilingual/
│  │  │  ├─ reasoning/
│  │  │  ├─ long_context/
│  │  │  └─ general/
│  │  └─ benchmark_registry.py
│  ├─ core/
│  ├─ utils/
│  └─ config/
├─ prompts/
├─ scripts/
│  ├─ run_benchmarks.py
│  └─ evaluation_worker.py
└─ results_output/
```
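`benchmark_registry.py` maps benchmark names to dotted `evaluation_function` paths such as `indic.milu_in.evaluate_milu_in`. One plausible resolution mechanism, sketched with `importlib` (the real registry may work differently; the default `package` below is an assumption based on the tree above):

```python
import importlib

def resolve_evaluation_function(dotted_path: str,
                                package: str = "eka_eval.benchmarks.tasks"):
    """Resolve a dotted path like 'indic.milu_in.evaluate_milu_in' to a callable.

    Sketch only: splits off the final attribute name, imports the module
    relative to `package`, and looks the function up with getattr.
    """
    module_path, func_name = dotted_path.rsplit(".", 1)
    module = importlib.import_module(f"{package}.{module_path}")
    return getattr(module, func_name)
```

The same pattern works for any package, e.g. `resolve_evaluation_function("path.join", package="os")` returns `os.path.join`.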
Eka-Eval provides extensive customization for Indic languages, few-shot settings, prompt templates, and even fully custom benchmarks.
Benchmarks like MILU or ARC-Challenge-Indic can be restricted to specific languages by modifying `eka_eval/config/benchmark_config.py`:
```python
"MILU": {
    "description": "Accuracy on the Massive Indic Language Understanding benchmark",
    "evaluation_function": "indic.milu_in.evaluate_milu_in",
    "task_args": {
        "dataset_name": "ai4bharat/MILU",
        "target_languages": ["Bengali"],  # restrict to one language
        "dataset_split": "test",
        "max_new_tokens": 5,
        "save_detailed": False,
        "prompt_file_benchmark_key": "milu_in"
    }
}
```

Control the number of demonstration examples and batch sizes directly via `task_args`:
```python
"ARC-Challenge-Indic": {
    "description": "Zero-shot ARC-Challenge-Indic evaluation across 11 languages",
    "evaluation_function": "indic.arc_c_in.evaluate_arc_c_in",
    "task_args": {
        "dataset_name": "sarvamai/arc-challenge-indic",
        "target_languages": ["bn"],  # only Bengali
        "dataset_split": "validation",
        "num_few_shot": 0,  # zero-shot; set >0 for few-shot
        "max_new_tokens": 10,
        "generation_batch_size": 8,
        # switch prompt templates
        "prompt_template_name_zeroshot": "arc_c_in_0shot",
        "prompt_template_name_fewshot": "arc_c_in_5shot",
        "prompt_file_benchmark_key": "arc_c_in",
        "prompt_file_category": "indic"
    }
}
```

Prompts are stored under the `prompts/` directory and can be fully customized.
```json
{
  "boolq_in_0shot": {
    "template": "Passage: {passage}\nQuestion: {question}\nAnswer (Yes/No):",
    "description": "Standard zero-shot prompt"
  },
  "default_few_shot_examples_boolq_in": [
    {
      "passage": "भारत दक्षिण एशिया में स्थित एक देश है। यह दुनिया का सातवां सबसे बड़ा देश है।",
      "question": "क्या भारत एशिया में है?",
      "answer": "हाँ"
    },
    {
      "passage": "सूर्य पृथ्वी के चारों ओर घूमता है। यह हमारे सौर मंडल का केंद्र है।",
      "question": "क्या सूर्य पृथ्वी के चारों ओर घूमता है?",
      "answer": "नहीं"
    }
  ]
}
```

You can edit:

- the template and its instructions
- the placeholders, e.g. `{question}`, `{context}`, `{choices}`
- the few-shot examples list
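Templates of this shape can be filled with plain `str.format`; a small illustration using the `boolq_in_0shot` template from above (the passage and question values are made up):

```python
# Template copied from the boolq_in_0shot prompt entry above.
template = "Passage: {passage}\nQuestion: {question}\nAnswer (Yes/No):"

# Fill the placeholders with one example's fields.
prompt = template.format(
    passage="India is a country in South Asia.",
    question="Is India in Asia?",
)
print(prompt)
```

Few-shot prompts would presumably be built the same way, concatenating the formatted examples before the final question.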
## Custom Benchmarks

You can add entirely new datasets and evaluators.

File: `eka_eval/benchmarks/tasks/custom/my_task.py`

```python
def evaluate_my_task(pipe, tokenizer, model_name_for_logging, device, **kwargs):
    score = 85.5  # your logic here
    return {"MyTask": score}
```
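Inside an evaluator like `evaluate_my_task`, you would typically generate predictions with `pipe` and score them against references. A hypothetical exact-match scorer, shown only as an illustration (not part of eka-eval's API):

```python
def exact_match_accuracy(predictions, references):
    """Hypothetical scorer: percent of case-insensitive exact matches."""
    hits = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    # Guard against an empty reference list.
    return 100.0 * hits / max(len(references), 1)
```

The returned percentage could then be placed in the `{"MyTask": score}` dict that the evaluator returns.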
File: `prompts/custom/my_task.json`

```json
{
  "my_task_0shot": {
    "template": "Question: {question}\nAnswer:"
  }
}
```

Add to `benchmark_config.py`:
```python
"MyTask": {
    "evaluation_function": "custom.my_task.evaluate_my_task",
    "task_args": {
        "dataset_name": "my_org/custom_dataset",
        "prompt_file_category": "custom"
    }
}
```

Eka-Eval supports quantization and multi-GPU evaluation out of the box.
## Quantization

Useful for running 33B–70B models on consumer GPUs:

```bash
python scripts/run_benchmarks.py \
  --model_name "meta-llama/Llama-2-70b-hf" \
  --quantization "4bit"
```

## Multi-GPU

Distribute the workload across multiple GPUs:

```bash
export CUDA_VISIBLE_DEVICES=0,1,2
python scripts/run_benchmarks.py --num_gpus 3
```

The summary is written to `results_output/calculated.csv`:

```csv
model,task,benchmark,score
gemma-2b,MATH,GSM8K,42.3
```
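The summary CSV is easy to post-process with the standard library. A sketch operating on an inline copy of the example row above (in practice you would open `results_output/calculated.csv`):

```python
import csv
import io

# Inline copy of the summary format; read the real file instead in practice.
csv_text = "model,task,benchmark,score\ngemma-2b,MATH,GSM8K,42.3\n"

# DictReader yields one dict per row, keyed by the header line.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Index scores by (model, benchmark) for easy comparison across runs.
scores = {(r["model"], r["benchmark"]): float(r["score"]) for r in rows}
```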
Per-example detailed results go to `results_output/detailed_results/*.json`:

```json
{
  "id": 123,
  "question": "...",
  "predicted": "4",
  "correct": true
}
```

Per-language metrics are also reported:

```json
{
  "BoolQ-IN_hi": 65.2,
  "BoolQ-IN_bn": 70.1
}
```

| Issue | Fix |
|---|---|
| CUDA OOM | Reduce batch size, use quantization |
| HF 404 | Wrong model name or missing token |
| Missing prompt template | Check prompts folder |
| Code evaluator error | Set `export HF_ALLOW_CODE_EVAL=1` |
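One last post-processing tip: if you need a single headline number from the per-language metrics shown earlier (e.g., `BoolQ-IN_hi`, `BoolQ-IN_bn`), a macro average is a one-liner. The helper below is hypothetical, not an eka-eval API:

```python
def macro_average(per_language_scores: dict) -> float:
    """Hypothetical helper: unweighted mean of per-language scores."""
    return sum(per_language_scores.values()) / len(per_language_scores)

# Values copied from the per-language metrics example above.
overall = macro_average({"BoolQ-IN_hi": 65.2, "BoolQ-IN_bn": 70.1})
```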
We welcome contributions!
- Report issues
- Add new benchmarks
- Improve documentation
- Submit PRs
- MMLU
- GSM8K
- HumanEval
- BBH
- AGIEval
- AI4Bharat Indic datasets
```bibtex
@misc{sinha2025ekaevalcomprehensiveevaluation,
      title={Eka-Eval: A Comprehensive Evaluation Framework for Large Language Models in Indian Languages},
      author={Samridhi Raj Sinha and Rajvee Sheth and Abhishek Upperwal and Mayank Singh},
      year={2025},
      eprint={2507.01853},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```