Jeopardy Language Model Benchmarking System
Benchmark LLMs with Jeopardy! questions. Tournament-style testing for large language models. What is... your model's true performance?
A lightning-fast benchmarking system that evaluates language models using authentic Jeopardy! questions, delivering tournament-style competition results with statistical precision and entertaining flair.
Note: This project now uses Bun as the JavaScript runtime and package manager for blazing-fast performance and an improved developer experience.
This system transforms LLM evaluation into an engaging tournament experience:
- Tournament Mode: Pit multiple language models against each other in head-to-head Jeopardy! competition
- Smart Evaluation: Multiple answer-matching strategies handle the quirks of "What is..." format responses
- Lightning Fast: Powered by Bun runtime with intelligent caching to minimize costs and maximize speed
- Real-time Drama: Watch models compete with live progress bars and instant scoring
- Cost Conscious: Smart caching and sampling keep your API budget happy
- Professional Results: Generate comprehensive reports worthy of a game show finale
- Multi-Model Showdowns: Test 2-10 language models simultaneously via OpenRouter
- Smart Question Selection: Automatic sampling from authentic Jeopardy! datasets
- Live Competition Feed: Real-time terminal updates as models battle for supremacy
- Podium Rankings: Clear winner determination with accuracy, speed, and cost metrics
- Replay System: Intelligent result caching avoids duplicate API calls
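The Replay System bullet above can be pictured with a small in-memory cache keyed by model and question hash. This is only a sketch: `ResultCache`, `cacheKey`, `CachedResult`, and the FNV-1a-style hash are hypothetical stand-ins chosen for illustration, and the project's real cache (which lives under `cache/`) may be organized quite differently.

```typescript
// Hypothetical shape of a cached answer; not taken from the project source.
interface CachedResult {
  answer: string;
  correct: boolean;
}

// FNV-1a-style 32-bit hash over UTF-16 code units (a stand-in for whatever
// hash the project actually uses). Returns 8 lowercase hex digits.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return ("0000000" + (hash >>> 0).toString(16)).slice(-8);
}

// Key = model name + hash of the normalized question text.
function cacheKey(model: string, question: string): string {
  return `${model}:${fnv1a(question.trim().toLowerCase())}`;
}

class ResultCache {
  private store = new Map<string, CachedResult>();

  // Returns the cached result, or undefined when the model has not yet
  // answered this question (meaning an API call is still needed).
  get(model: string, question: string): CachedResult | undefined {
    return this.store.get(cacheKey(model, question));
  }

  set(model: string, question: string, result: CachedResult): void {
    this.store.set(cacheKey(model, question), result);
  }
}
```

A runner would call `get` before each request and `set` after each response, so a repeated tournament only pays for model and question pairs it has not seen.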
The system uses tournament-grade evaluation with multiple strategies:
- Exact Match: Perfect accuracy for precise responses
- Jeopardy Format: Handles "What is..." and "Who is..." responses like a pro
- Substring Detection: Finds correct answers buried in verbose responses
- Word Matching: Matches significant terms (70% threshold)
- Fuzzy Logic: Character similarity matching (80% threshold)
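One way the strategies above could be chained, trying the strictest first, is sketched below. The function names, the normalization rules, the "longer than two characters" definition of a significant term, and the use of Levenshtein distance for character similarity are all assumptions made for illustration; only the 70% and 80% thresholds come from the list above.

```typescript
type Strategy = "exact" | "jeopardy" | "substring" | "word" | "fuzzy";

// Lowercase, replace punctuation with spaces, collapse whitespace.
function normalize(text: string): string {
  return text.toLowerCase().replace(/[^a-z0-9\s]/g, " ").replace(/\s+/g, " ").trim();
}

// Drop a leading "What is ..." / "Who are ..." prefix and a leading article.
function stripJeopardyPrefix(text: string): string {
  return normalize(text)
    .replace(/^(what|who|where|when)\s+(is|are|was|were)\s+/, "")
    .replace(/^(a|an|the)\s+/, "");
}

// Dynamic-programming Levenshtein edit distance, one row at a time.
function levenshtein(a: string, b: string): number {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Returns the first strategy that accepts the response, or null if none do.
function matchAnswer(response: string, expected: string): Strategy | null {
  const got = normalize(response);
  const want = normalize(expected);
  if (want.length === 0) return null;
  if (got === want) return "exact";

  const gotCore = stripJeopardyPrefix(response);
  const wantCore = stripJeopardyPrefix(expected);
  if (gotCore === wantCore) return "jeopardy";

  // Pad with spaces so only whole-word runs count as a substring hit.
  if (` ${got} `.includes(` ${want} `)) return "substring";

  // Assumed definition of "significant": words longer than two characters.
  const wantWords = want.split(" ").filter((w) => w.length > 2);
  const gotWords = new Set(got.split(" "));
  if (wantWords.length > 0) {
    const hits = wantWords.filter((w) => gotWords.has(w)).length;
    if (hits / wantWords.length >= 0.7) return "word";
  }

  const longest = Math.max(gotCore.length, wantCore.length);
  if (longest > 0 && 1 - levenshtein(gotCore, wantCore) / longest >= 0.8) {
    return "fuzzy";
  }
  return null;
}
```

For example, `matchAnswer("What is Paris?", "Paris")` is accepted by the Jeopardy-format step, while a misspelling such as `"Shakespear"` against `"Shakespeare"` falls through to the fuzzy step.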
Track what matters in the tournament:
- Accuracy Rate: Correct responses / total questions
- Response Speed: Average time per question
- Cost Efficiency: API costs per correct answer
- Token Usage: Input/output token consumption
- Consistency: Performance variance across question types
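The metrics above are simple aggregates over per-question results. The sketch below shows one plausible way to compute them; the `QuestionResult` and `ModelMetrics` shapes are hypothetical, and measuring consistency as the standard deviation of per-category accuracy is an assumption, not necessarily how the project defines it.

```typescript
// Hypothetical per-question record for one model.
interface QuestionResult {
  category: string;
  correct: boolean;
  latencyMs: number;
  costUsd: number;
  inputTokens: number;
  outputTokens: number;
}

interface ModelMetrics {
  accuracy: number;               // correct responses / total questions
  avgLatencyMs: number;           // average time per question
  costPerCorrect: number | null;  // null when nothing was answered correctly
  totalTokens: number;            // input + output tokens
  categoryAccuracyStdDev: number; // assumed consistency measure
}

function summarize(results: QuestionResult[]): ModelMetrics {
  if (results.length === 0) throw new Error("no results to summarize");
  const total = results.length;
  const correct = results.filter((r) => r.correct).length;
  const sum = (pick: (r: QuestionResult) => number): number =>
    results.reduce((acc, r) => acc + pick(r), 0);

  // Accuracy per category, used for the consistency figure.
  const byCategory = new Map<string, { correct: number; total: number }>();
  for (const r of results) {
    const entry = byCategory.get(r.category) ?? { correct: 0, total: 0 };
    entry.total += 1;
    if (r.correct) entry.correct += 1;
    byCategory.set(r.category, entry);
  }
  const rates = Array.from(byCategory.values()).map((e) => e.correct / e.total);
  const mean = rates.reduce((a, b) => a + b, 0) / rates.length;
  const variance = rates.reduce((a, b) => a + (b - mean) ** 2, 0) / rates.length;

  return {
    accuracy: correct / total,
    avgLatencyMs: sum((r) => r.latencyMs) / total,
    costPerCorrect: correct > 0 ? sum((r) => r.costUsd) / correct : null,
    totalTokens: sum((r) => r.inputTokens + r.outputTokens),
    categoryAccuracyStdDev: Math.sqrt(variance),
  };
}
```

A lower `categoryAccuracyStdDev` means the model performs similarly across categories; a higher value points to strong and weak subjects.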
- Bun runtime (latest version recommended)
- OpenRouter API key
- Internet connection for the competition
# Clone the tournament system
git clone <repository-url>
cd alex-treBENCH
# Install with lightning speed
bun install

# Set your API key for tournament access
export OPENROUTER_API_KEY=your_api_key_here
# Or add to .env file for convenience
echo "OPENROUTER_API_KEY=your_api_key_here" > .env

# Download sample questions for a quick match
bun run dev download --sample 50
# Start your first tournament!
bun run dev benchmark

# Download sample questions (perfect for testing)
bun run dev download --sample 50
# Focus on specific categories (like the real show!)
bun run dev download --sample 30 --category "SCIENCE"
# Force fresh download (bypass cache)
bun run dev download --force

# Quick championship match with default contenders
bun run dev benchmark
# Custom tournament with your favorite models
bun run dev benchmark --models gpt-4o-mini claude-3-haiku gemini-2.0-flash
# Extended tournament with more questions
bun run dev benchmark --sample 100
# Category-specific showdown
bun run dev benchmark --category "HISTORY" --sample 25
# Fresh results on every run (disable caching)
bun run dev benchmark --no-cache
# Tournament on steroids (increase concurrency)
bun run dev benchmark --concurrency 10

# View all available tournament contenders
bun run dev models

alex-treBENCH Tournament Results
══════════════════════════════════════════════
Tournament: Quick Championship
Questions: 25 | Categories: Mixed | Duration: 2m 15s

FINAL STANDINGS

CHAMPION: gpt-4o-mini
   Accuracy: 84.0% (21/25 correct)
   Speed: 1250ms average response
   Cost: $0.0023

RUNNER-UP: gemini-2.0-flash
   Accuracy: 80.0% (20/25 correct)
   Speed: 1100ms average response
   Cost: $0.0019

THIRD PLACE: claude-3-haiku
   Accuracy: 76.0% (19/25 correct)
   Speed: 980ms average response (fastest!)
   Cost: $0.0015 (most economical!)

Tournament Highlights:
• Most challenging category: SCIENCE (62% avg accuracy)
• Easiest category: POTPOURRI (88% avg accuracy)
• Closest match: Questions 12-15 (all models within 5%)
• Speed demon: claude-3-haiku dominated response times

Results saved to: ./results/tournament_2024_01_15_143022.json
alex-treBENCH combines the best elements from multiple benchmarking approaches:
From Professional Jeopardy! Systems:
- Robust question downloading and intelligent caching
- Multiple answer evaluation strategies for real-world accuracy
- Professional error handling and recovery
From Modern Benchmarking Tools:
- Clean, functional TypeScript architecture with AI SDK integration
- Real-time progress feedback and tournament atmosphere
- Smart result caching to minimize API costs
Simplified Excellence:
- Reduced complexity while maintaining tournament-grade functionality
- Focus on user experience and entertainment value
- Clear separation of concerns with modular design
alex-treBENCH/
├── src/
│   ├── data/
│   │   └── downloader.ts    # Question acquisition system
│   ├── models/
│   │   └── config.ts        # Tournament competitor configurations
│   ├── bench/
│   │   ├── evaluator.ts     # Answer evaluation and scoring
│   │   └── runner.ts        # Tournament engine and orchestration
│   └── index.ts             # Tournament command center
├── results/                 # Tournament archives
├── cache/                   # Question and result cache
├── package.json             # Tournament manifest
└── README.md                # This guide to glory!
Currently supporting tournament-ready models via OpenRouter:
- OpenAI: GPT-4 Turbo, GPT-4o, O1 Preview/Mini
- Anthropic: Claude 3.5 Sonnet, Claude 3 Opus
- Google: Gemini 2.0 Flash, Gemini 1.5 Pro
- OpenAI: GPT-3.5 Turbo, GPT-4o Mini
- Anthropic: Claude 3 Haiku, Claude 3 Sonnet
- Meta: Llama 3.1 405B, Llama 3.1 70B
- Google: Gemini 1.5 Flash
- DeepSeek: DeepSeek V2.5, DeepSeek Coder
- Qwen: Qwen 2.5 72B, Qwen 2.5 32B
- Mistral: Mistral Large, Mistral 7B
- Others: Many more available through OpenRouter
Keep your tournament costs under control:
- Smart Caching: Results cached by model + question hash
- Strategic Sampling: Start with smaller question sets
- Concurrency Control: Limit simultaneous API requests
- Usage Tracking: Monitor costs per model in real-time
- Quick Matches: Use `--sample 25` for fast, cheap tournaments
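The Concurrency Control item above amounts to capping how many requests are in flight at once. A minimal sketch of that idea follows; `mapWithConcurrency` is a hypothetical helper written for illustration and is not taken from the project's runner.

```typescript
// Runs `worker` over `items` with at most `limit` calls in flight at once.
// Results keep the order of `items`, regardless of completion order.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>
): Promise<R[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be a positive integer");
  }
  const results: R[] = new Array(items.length);
  let next = 0;

  // Each lane repeatedly claims the next unprocessed index until none remain.
  async function runLane(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const laneCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: laneCount }, () => runLane()));
  return results;
}
```

With a limit of 5, for instance, a 25-question run would never have more than five requests outstanding, which trades some wall-clock time for gentler rate-limit behavior.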
- Quick Tournament (25 questions, 3 models): ~$0.05-0.15
- Standard Tournament (100 questions, 5 models): ~$0.20-0.60
- Championship (500 questions, 10 models): ~$1.00-3.00
Key settings in src/models/config.ts:
export const CONFIG = {
  maxConcurrency: 5,       // Simultaneous API calls
  testRunsPerModel: 1,     // Repetitions for consistency
  timeoutSeconds: 30,      // Request timeout
  defaultSampleSize: 50,   // Default question count
  cacheResults: true,      // Enable intelligent caching
  showProgress: true       // Live tournament feed
}

# Install dependencies at lightning speed
bun install
# Run in development mode
bun run dev
# Build for production tournaments
bun run build
# Run the championship build
bun start
# Test the tournament system
bun test

- OPENROUTER_API_KEY: Your tournament access key (required)
- JEOPARDY_CACHE_DIR: Custom cache location (optional)
- TOURNAMENT_LOG_LEVEL: Logging verbosity (optional)
| Feature | Original Complex | Simple Bench | alex-treBENCH |
|---|---|---|---|
| Setup Complexity | Very High | Simple | Super Simple |
| Tournament Feel | Technical | Basic | Engaging & Fun |
| Evaluation Quality | 6 strategies | 1 strategy | 4 Smart Strategies |
| Performance | Slow | Fast | Lightning Fast |
| Cost Control | Basic | Good | Excellent |
| User Experience | CLI | Terminal | Tournament Drama |
| Result Presentation | Text | Plain | Championship Style |
- Category Championships: Specialized tournaments by subject
- Speed Rounds: Lightning-fast evaluation challenges
- Historical Tracking: Tournament season statistics
- Web Interface: Browser-based tournament viewing
- Custom Datasets: Import your own question sets
- Jeopardy!: For creating the ultimate question-and-answer format
- Kaggle: For providing authentic Jeopardy! datasets
- OpenRouter: For unified access to tournament-worthy language models
- Bun: For making JavaScript fast enough for real-time tournaments
Need help running your tournaments?
- Quick Start Issues: Check your API key and internet connection
- Tournament Questions: Review the command examples above
- Feature Requests: Open an issue in the GitHub repository
- Bug Reports: Include your tournament logs and configuration
Ready to Tournament!
What is... the champion in your next LLM tournament? Fire up alex-treBENCH and find out!
"Thank you for playing alex-treBENCH! We have some lovely parting gifts..."