- Global Benchmarks
- Multilingual Benchmarks
- Supported Languages
- Custom Benchmarks
- Quantization
- Multi-GPU
- Debug Mode
Eka-Eval is the official evaluation pipeline for the EKA project, designed to provide reliable, reproducible, and low-resource multilingual evaluation of LLMs.
It combines:
- Global benchmarks
- Low-resource Multilingual benchmarks
- Long-context evaluation
- Code, math, reasoning, QA
Eka-Eval provides a uniform interface, structured results, and production-ready performance features.
- 30+ Global benchmarks: MMLU, GSM8K, ARC-Challenge, HumanEval, HellaSwag, etc.
- 23 Low-resource Multilingual benchmarks: MMLU-IN, BoolQ-IN, ARC-IN, MILU, Flores-IN, etc.
- Long-context: ZeroSCROLLS, InfiniteBench, Multi-Needle
- Code generation with pass@k
- Math & logical reasoning
- Multilingual evaluation across 11 languages
- 120+ languages
- Smart transliteration
- Per-language scores
- Unified prompt templates
- Multi-GPU distributed evaluation
- 4-bit / 8-bit quantization
- Efficient batching
- Automatic CUDA memory cleanup
- Modular task registry
- Easy custom-benchmark integration
- JSON-based configs
- Clear logging + progress tracking
- CSV summary
- JSONL detailed results
- Per-language metrics
- Error analysis
- Full reproducibility with configuration dump
## Global Benchmarks

| Category | Count | Benchmarks | Metrics |
|---|---|---|---|
| 🌍 Multilingual & Low-Resource Suite | 23 | **Knowledge:** IndicMMLU-Pro, MMLU-IN, TriviaQA-IN, MILU<br>**Reasoning:** HellaSwag-IN, ARC-C-IN, IndicCOPA, XCOPA, GSM8K-IN<br>**Reading & QA:** Belebele, BoolQ-IN, XQuAD-IN, XorQA-IN, Indic-QA<br>**Generation (NLG):** Flores-IN, IndicParaphrase, IndicWikiBio, IndicQuestionGeneration, IndicSentenceSummarization, IndicHeadlineGeneration<br>**NLU:** IndicNER, IndicSentiment, IndicGLUE, XNLI | Accuracy, F1, BLEU, chrF++, ROUGE-L |
| 🧠 Reasoning | 10 | ARC-Challenge, ARC-Easy, HellaSwag, PIQA, SIQA, WinoGrande, OpenBookQA, CommonSenseQA, BBH, AGI-Eval | Accuracy, Normalized Accuracy |
| 📚 Knowledge | 4 | MMLU, MMLU-Pro, TriviaQA, NaturalQuestions | Accuracy, Exact Match |
| 🧮 Math & Code | 7 | **Math:** GSM8K, MATH, GPQA<br>**Code:** HumanEval, MBPP, HumanEval+, MBPP+, PythonSaga | Accuracy, pass@1 |
| 📖 Reading | 3 | SQuAD, QuAC, BoolQ | F1, Exact Match |
| 🛠️ Tool & Context | 6 | **Long Context:** InfiniteBench, ZeroSCROLLS, NeedleInAHaystack<br>**Tool Use:** API-Bank, API-Bench, ToolBench | Retrieval Acc, Success Rate |
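The pass@1 metric in the Math & Code row is a special case of pass@k. For reference, the standard unbiased estimator from the HumanEval paper is sketched below; eka-eval's internal implementation is not shown in this README and may differ:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).

    n = samples generated per problem, c = samples that passed, k = budget.
    """
    if n - c < k:
        # Fewer failures than the budget: at least one pass is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k=1 this reduces to the fraction of passing samples, which is what pass@1 reports.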
## Multilingual Benchmarks

| Benchmark | Description | Metric |
|---|---|---|
| MMLU-IN | Indian-subject knowledge | Accuracy |
| ARC-Challenge-IN | Indian science reasoning | Accuracy |
| BoolQ-IN | Indic yes/no QA | Accuracy |
| MILU | Multilingual Indic understanding | Accuracy |
| Flores-IN | Translation | BLEU, ChrF |
| XQuAD-IN | Reading Comprehension | F1, EM |
## Supported Languages

Arabic (ar), Swahili (sw), Hindi (hi), Bengali (bn), Gujarati (gu), Kannada (kn), Malayalam (ml), Marathi (mr), Odia (or), Punjabi (pa), Tamil (ta), Telugu (te), Assamese (as), Urdu (ur), Indonesian (id), Greek (el), Quechua (qu), Yoruba (yo), Oromo (om), English (en)
```bash
git clone https://github.com/lingo-iitgn/eka-eval.git
cd eka-eval
```

We use Conda to manage Python 3.10 environments to ensure compatibility across macOS, Linux, and Windows. Run this on any system:

```bash
# Create environment with Python 3.10
conda create -n eka-env python=3.10 pip -y

# Activate the environment
conda activate eka-env
```

Choose the option that matches your hardware.

**CPU-only** (clean file without NVIDIA/CUDA packages):

```bash
pip install -r requirements-cpu.txt
```

**GPU** (includes bitsandbytes, CUDA extensions, and quantization support):

```bash
pip install -r requirements-gpu.txt
```

Install the project in editable (`-e`) mode:

```bash
pip install -e .
```

Some models require authentication (e.g., Llama 3, Gemma). Create a token at Hugging Face → Settings → Access Tokens (usually "Read" or "Write") and log in:

```bash
huggingface-cli login
```

To launch an evaluation:

```bash
python3 scripts/run_benchmarks.py
```
eka-eval includes a fully interactive CLI wizard for evaluating models across English and Indic benchmark suites.
To start the wizard, simply run:
```bash
python scripts/run_benchmarks.py
```

This will launch a guided, step-by-step interface.
```text
--- Model Selection ---
1. Hugging Face / Local Model
2. API Model (OpenAI, Anthropic, etc.)

Enter choice: 1
Enter model name: google/gemma-2-2b
```

```text
--- Available Benchmark Task Groups ---
1. CODE GENERATION
2. TOOL USE
3. MATH
4. READING COMPREHENSION
5. COMMONSENSE REASONING
6. WORLD KNOWLEDGE
7. LONG CONTEXT
8. GENERAL
9. LOW-RESOURCE MULTILINGUAL BENCHMARKS
10. ALL Task Groups

Select task group #(s) (e.g., '1', '1 3', 'ALL'): 2 9
```

You can select multiple groups by entering space-separated numbers (e.g., 2 9).

```text
--- Select benchmarks for INDIC BENCHMARKS ---
1. MMLU-IN        4. ARC-Challenge-IN
2. BoolQ-IN       5. ALL
3. Flores-IN      6. SKIP

Select benchmark #(s): 4 5
```

Again, multiple selections are supported.
... Evaluation Complete ...
| Model | Task | Benchmark | Score |
|-------------------|-------------------------|-----------|-------|
| google/gemma-2-2b | MULTILINGUAL BENCHMARKS | ARC-IN | 33.5% |
When prompted:

```text
Do you want to create visualizations for the results? (yes/no): yes
```

The system will generate plots:

```text
✅ Visualizations created successfully!
Saved to: results_output/visualizations
```
```text
eka-eval/
├─ eka_eval/
│  ├─ benchmarks/
│  │  ├─ tasks/
│  │  │  ├─ code/
│  │  │  ├─ math/
│  │  │  ├─ multilingual/
│  │  │  ├─ reasoning/
│  │  │  ├─ long_context/
│  │  │  └─ general/
│  │  └─ benchmark_registry.py
│  ├─ core/
│  ├─ utils/
│  └─ config/
├─ prompts/
├─ scripts/
│  ├─ run_benchmarks.py
│  └─ evaluation_worker.py
└─ results_output/
```
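`benchmark_registry.py` maps benchmark names to dotted `evaluation_function` paths such as `indic.milu_in.evaluate_milu_in`. One plausible resolution mechanism, sketched with `importlib` (the real registry may work differently; the default `package` below is an assumption based on the tree above):

```python
import importlib

def resolve_evaluation_function(dotted_path: str,
                                package: str = "eka_eval.benchmarks.tasks"):
    """Resolve a dotted path like 'indic.milu_in.evaluate_milu_in' to a callable.

    Sketch only: splits off the final attribute name, imports the module
    relative to `package`, and looks the function up with getattr.
    """
    module_path, func_name = dotted_path.rsplit(".", 1)
    module = importlib.import_module(f"{package}.{module_path}")
    return getattr(module, func_name)
```

The same pattern works for any package, e.g. `resolve_evaluation_function("path.join", package="os")` returns `os.path.join`.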
Eka-Eval provides extensive customization for Indic languages, few-shot settings, prompt templates, and even fully custom benchmarks.
Benchmarks like MILU or ARC-Challenge-Indic can be restricted to specific languages by modifying `eka_eval/config/benchmark_config.py`:
```python
"MILU": {
    "description": "Accuracy on the Massive Indic Language Understanding benchmark",
    "evaluation_function": "indic.milu_in.evaluate_milu_in",
    "task_args": {
        "dataset_name": "ai4bharat/MILU",
        "target_languages": ["Bengali"],  # restrict to one language
        "dataset_split": "test",
        "max_new_tokens": 5,
        "save_detailed": False,
        "prompt_file_benchmark_key": "milu_in"
    }
}
```

Control the number of demonstration examples and batch sizes directly via `task_args`:
```python
"ARC-Challenge-Indic": {
    "description": "Zero-shot ARC-Challenge-Indic evaluation across 11 languages",
    "evaluation_function": "indic.arc_c_in.evaluate_arc_c_in",
    "task_args": {
        "dataset_name": "sarvamai/arc-challenge-indic",
        "target_languages": ["bn"],  # only Bengali
        "dataset_split": "validation",
        "num_few_shot": 0,  # zero-shot; set >0 for few-shot
        "max_new_tokens": 10,
        "generation_batch_size": 8,
        # switch prompt templates
        "prompt_template_name_zeroshot": "arc_c_in_0shot",
        "prompt_template_name_fewshot": "arc_c_in_5shot",
        "prompt_file_benchmark_key": "arc_c_in",
        "prompt_file_category": "indic"
    }
}
```

Prompts are stored under the `prompts/` directory and can be fully customized.
```json
{
  "boolq_in_0shot": {
    "template": "Passage: {passage}\nQuestion: {question}\nAnswer (Yes/No):",
    "description": "Standard zero-shot prompt"
  },
  "default_few_shot_examples_boolq_in": [
    {
      "passage": "भारत दक्षिण एशिया में स्थित एक देश है। यह दुनिया का सातवां सबसे बड़ा देश है।",
      "question": "क्या भारत एशिया में है?",
      "answer": "हाँ"
    },
    {
      "passage": "सूर्य पृथ्वी के चारों ओर घूमता है। यह हमारे सौर मंडल का केंद्र है।",
      "question": "क्या सूर्य पृथ्वी के चारों ओर घूमता है?",
      "answer": "नहीं"
    }
  ]
}
```

You can edit:

- the template and its instructions
- the placeholders, e.g. `{question}`, `{context}`, `{choices}`
- the few-shot examples list
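Templates of this shape can be filled with plain `str.format`; a small illustration using the `boolq_in_0shot` template from above (the passage and question values are made up):

```python
# Template copied from the boolq_in_0shot prompt entry above.
template = "Passage: {passage}\nQuestion: {question}\nAnswer (Yes/No):"

# Fill the placeholders with one example's fields.
prompt = template.format(
    passage="India is a country in South Asia.",
    question="Is India in Asia?",
)
print(prompt)
```

Few-shot prompts would presumably be built the same way, concatenating the formatted examples before the final question.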
## Custom Benchmarks

You can add entirely new datasets and evaluators.

File: `eka_eval/benchmarks/tasks/custom/my_task.py`

```python
def evaluate_my_task(pipe, tokenizer, model_name_for_logging, device, **kwargs):
    score = 85.5  # your logic here
    return {"MyTask": score}
```
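Inside an evaluator like `evaluate_my_task`, you would typically generate predictions with `pipe` and score them against references. A hypothetical exact-match scorer, shown only as an illustration (not part of eka-eval's API):

```python
def exact_match_accuracy(predictions, references):
    """Hypothetical scorer: percent of case-insensitive exact matches."""
    hits = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    # Guard against an empty reference list.
    return 100.0 * hits / max(len(references), 1)
```

The returned percentage could then be placed in the `{"MyTask": score}` dict that the evaluator returns.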
File: `prompts/custom/my_task.json`

```json
{
  "my_task_0shot": {
    "template": "Question: {question}\nAnswer:"
  }
}
```

Add to `benchmark_config.py`:
```python
"MyTask": {
    "evaluation_function": "custom.my_task.evaluate_my_task",
    "task_args": {
        "dataset_name": "my_org/custom_dataset",
        "prompt_file_category": "custom"
    }
}
```

Eka-Eval supports quantization and multi-GPU evaluation out of the box.
## Quantization

Useful for running 33B–70B models on consumer GPUs:

```bash
python scripts/run_benchmarks.py \
  --model_name "meta-llama/Llama-2-70b-hf" \
  --quantization "4bit"
```

## Multi-GPU

Distribute the workload across multiple GPUs:

```bash
export CUDA_VISIBLE_DEVICES=0,1,2
python scripts/run_benchmarks.py --num_gpus 3
```

The summary is written to `results_output/calculated.csv`:

```csv
model,task,benchmark,score
gemma-2b,MATH,GSM8K,42.3
```
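The summary CSV is easy to post-process with the standard library. A sketch operating on an inline copy of the example row above (in practice you would open `results_output/calculated.csv`):

```python
import csv
import io

# Inline copy of the summary format; read the real file instead in practice.
csv_text = "model,task,benchmark,score\ngemma-2b,MATH,GSM8K,42.3\n"

# DictReader yields one dict per row, keyed by the header line.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Index scores by (model, benchmark) for easy comparison across runs.
scores = {(r["model"], r["benchmark"]): float(r["score"]) for r in rows}
```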
Per-example detailed results go to `results_output/detailed_results/*.json`:

```json
{
  "id": 123,
  "question": "...",
  "predicted": "4",
  "correct": true
}
```

Per-language metrics are also reported:

```json
{
  "BoolQ-IN_hi": 65.2,
  "BoolQ-IN_bn": 70.1
}
```

| Issue | Fix |
|---|---|
| CUDA OOM | Reduce batch size, use quantization |
| HF 404 | Wrong model name or missing token |
| Missing prompt template | Check prompts folder |
| Code evaluator error | Set `export HF_ALLOW_CODE_EVAL=1` |
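One last post-processing tip: if you need a single headline number from the per-language metrics shown earlier (e.g., `BoolQ-IN_hi`, `BoolQ-IN_bn`), a macro average is a one-liner. The helper below is hypothetical, not an eka-eval API:

```python
def macro_average(per_language_scores: dict) -> float:
    """Hypothetical helper: unweighted mean of per-language scores."""
    return sum(per_language_scores.values()) / len(per_language_scores)

# Values copied from the per-language metrics example above.
overall = macro_average({"BoolQ-IN_hi": 65.2, "BoolQ-IN_bn": 70.1})
```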
We welcome contributions!
- Report issues
- Add new benchmarks
- Improve documentation
- Submit PRs
- MMLU
- GSM8K
- HumanEval
- BBH
- AGIEval
- AI4Bharat Indic datasets
```bibtex
@misc{sinha2025ekaevalcomprehensiveevaluation,
      title={Eka-Eval: A Comprehensive Evaluation Framework for Large Language Models in Indian Languages},
      author={Samridhi Raj Sinha and Rajvee Sheth and Abhishek Upperwal and Mayank Singh},
      year={2025},
      eprint={2507.01853},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```