Kilo-Org/alex-treBENCH

alex-treBENCH! 🎯

Jeopardy Language Model Benchmarking System

🎮 Benchmark LLMs with Jeopardy! questions. Tournament-style testing for large language models. What is... your model's true performance?

Bun Ready | License: MIT | TypeScript

A lightning-fast benchmarking system that evaluates language models using authentic Jeopardy! questions, delivering tournament-style competition results with statistical precision and entertaining flair.

Note: This project now uses Bun as the JavaScript runtime and package manager for blazing-fast performance and an improved developer experience.

๐Ÿ† What is... alex-treBENCH?

This system transforms LLM evaluation into an engaging tournament experience:

  • ✅ Tournament Mode: Pit multiple language models against each other in head-to-head Jeopardy! competition
  • ✅ Smart Evaluation: Multiple answer-matching strategies handle the quirks of "What is..." format responses
  • ✅ Lightning Fast: Powered by Bun runtime with intelligent caching to minimize costs and maximize speed
  • ✅ Real-time Drama: Watch models compete with live progress bars and instant scoring
  • ✅ Cost Conscious: Smart caching and sampling keep your API budget happy
  • ✅ Professional Results: Generate comprehensive reports worthy of a game show finale

🎯 Key Features

Tournament Capabilities

  • Multi-Model Showdowns: Test 2-10 language models simultaneously via OpenRouter
  • Smart Question Selection: Automatic sampling from authentic Jeopardy! datasets
  • Live Competition Feed: Real-time terminal updates as models battle for supremacy
  • Podium Rankings: Clear winner determination with accuracy, speed, and cost metrics
  • Replay System: Intelligent result caching avoids duplicate API calls
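
As a rough illustration of the question-selection step, the sketch below filters by category and draws a reproducible random sample. The `Question` shape, the `sampleQuestions` name, and the seeded generator are assumptions made for this example; the actual downloader may work differently.

```typescript
// Hypothetical question shape; the real dataset fields may differ.
interface Question {
  category: string;
  clue: string;
  answer: string;
}

// Small seeded PRNG (mulberry32) so a sample can be reproduced across runs.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Filter by category (if given), then pick `size` questions with a partial
// Fisher-Yates shuffle. The input array is never mutated.
function sampleQuestions(
  all: Question[],
  size: number,
  category?: string,
  seed = 1,
): Question[] {
  const pool = category
    ? all.filter((q) => q.category.toUpperCase() === category.toUpperCase())
    : [...all];
  const rand = mulberry32(seed);
  const n = Math.min(size, pool.length);
  for (let i = 0; i < n; i++) {
    const j = i + Math.floor(rand() * (pool.length - i));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }
  return pool.slice(0, n);
}
```

Using a fixed seed means every model in a tournament can be given the same question set, which keeps the comparison fair.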

Evaluation Excellence

The system uses tournament-grade evaluation with multiple strategies:

  1. 🎯 Exact Match: Perfect accuracy for precise responses
  2. 🎪 Jeopardy Format: Handles "What is..." and "Who is..." responses like a pro
  3. 🔍 Substring Detection: Finds correct answers buried in verbose responses
  4. 📝 Word Matching: Matches significant terms (70% threshold)
  5. 🌟 Fuzzy Logic: Character similarity matching (80% threshold)
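
The cascade above could be sketched roughly as follows. This is not the code in src/bench/evaluator.ts: the function names, the normalization rules, the "words longer than two characters are significant" rule, and the use of Levenshtein distance for character similarity are all assumptions made for illustration. Only the 70% and 80% thresholds come from the list.

```typescript
type Strategy = "exact" | "jeopardy" | "substring" | "words" | "fuzzy" | "none";

// Lowercase, replace punctuation with spaces, collapse whitespace (ASCII only).
function normalize(text: string): string {
  return text.toLowerCase().replace(/[^a-z0-9\s]/g, " ").replace(/\s+/g, " ").trim();
}

// Remove a leading "what is", "who are", etc., then a leading article.
function stripJeopardyPrefix(normalized: string): string {
  return normalized
    .replace(/^(what|who|where|when)\s+(is|are|was|were)\s+/, "")
    .replace(/^(a|an|the)\s+/, "");
}

// Classic dynamic-programming edit distance, keeping two rows.
function levenshtein(a: string, b: string): number {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Try each strategy in order and report the first one that accepts the response.
function evaluate(response: string, expected: string): Strategy {
  const want = normalize(expected);
  const got = normalize(response);
  if (want.length === 0) return "none";
  if (got === want) return "exact";

  const stripped = stripJeopardyPrefix(got);
  if (stripped === want || stripped === stripJeopardyPrefix(want)) return "jeopardy";

  // Pad with spaces so "mars" does not match inside "marshmallows".
  if (` ${got} `.includes(` ${want} `)) return "substring";

  const wantWords = want.split(" ").filter((w) => w.length > 2);
  const gotWords = new Set(got.split(" "));
  if (wantWords.length > 0) {
    const hits = wantWords.filter((w) => gotWords.has(w)).length;
    if (hits / wantWords.length >= 0.7) return "words";
  }

  const longest = Math.max(stripped.length, want.length);
  const similarity = 1 - levenshtein(stripped, want) / longest;
  return similarity >= 0.8 ? "fuzzy" : "none";
}
```

Ordering the checks from strictest to loosest means the returned strategy name also says how much slack a response needed, which can be handy when auditing borderline answers.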

Performance Metrics

Track what matters in the tournament:

  • Accuracy Rate: Correct responses / total questions
  • Response Speed: Average time per question
  • Cost Efficiency: API costs per correct answer
  • Token Usage: Input/output token consumption
  • Consistency: Performance variance across question types
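
To make the definitions concrete, here is a minimal sketch of how these metrics might be computed and turned into a ranking. The record fields, the `summarize` and `rank` names, and the tie-break order (accuracy, then speed, then cost) are assumptions for this example; the consistency metric is left out because its exact definition is not given here.

```typescript
// Hypothetical per-question record; field names are assumptions.
interface QuestionResult {
  correct: boolean;
  latencyMs: number;
  costUsd: number;
  inputTokens: number;
  outputTokens: number;
}

interface ModelSummary {
  model: string;
  accuracy: number;          // correct / total, in [0, 1]
  avgLatencyMs: number;      // mean response time per question
  totalCostUsd: number;
  costPerCorrectUsd: number; // Infinity when nothing was answered correctly
  totalTokens: number;       // input + output tokens
}

function summarize(model: string, results: QuestionResult[]): ModelSummary {
  const total = results.length;
  const correct = results.filter((r) => r.correct).length;
  const sum = (pick: (r: QuestionResult) => number): number =>
    results.reduce((acc, r) => acc + pick(r), 0);
  const totalCostUsd = sum((r) => r.costUsd);
  return {
    model,
    accuracy: total === 0 ? 0 : correct / total,
    avgLatencyMs: total === 0 ? 0 : sum((r) => r.latencyMs) / total,
    totalCostUsd,
    costPerCorrectUsd: correct === 0 ? Infinity : totalCostUsd / correct,
    totalTokens: sum((r) => r.inputTokens + r.outputTokens),
  };
}

// Rank by accuracy (higher first); faster, then cheaper, breaks ties.
function rank(summaries: ModelSummary[]): ModelSummary[] {
  return [...summaries].sort(
    (a, b) =>
      b.accuracy - a.accuracy ||
      a.avgLatencyMs - b.avgLatencyMs ||
      a.totalCostUsd - b.totalCostUsd,
  );
}
```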

🚀 Quick Tournament Setup

Prerequisites

  • Bun runtime (latest version recommended)
  • OpenRouter API key (create one from your OpenRouter account)
  • Internet connection for the competition

Installation

# Clone the tournament system
git clone <repository-url>
cd alex-treBENCH

# Install with lightning speed
bun install

Competition Setup

# Set your API key for tournament access
export OPENROUTER_API_KEY=your_api_key_here
# Or add to .env file for convenience
echo "OPENROUTER_API_KEY=your_api_key_here" >> .env

Your First Tournament

# Download sample questions for a quick match
bun run dev download --sample 50

# Start your first tournament!
bun run dev benchmark

🎮 Tournament Commands

Question Management

# Download sample questions (perfect for testing)
bun run dev download --sample 50

# Focus on specific categories (like the real show!)
bun run dev download --sample 30 --category "SCIENCE"

# Force fresh download (bypass cache)
bun run dev download --force

Running Tournaments

# Quick championship match with default contenders
bun run dev benchmark

# Custom tournament with your favorite models
bun run dev benchmark --models gpt-4o-mini claude-3-haiku gemini-2.0-flash

# Extended tournament with more questions
bun run dev benchmark --sample 100

# Category-specific showdown
bun run dev benchmark --category "HISTORY" --sample 25

# Fresh results (disable caching so every question is sent to the API again)
bun run dev benchmark --no-cache

# Tournament on steroids (increase concurrency)
bun run dev benchmark --concurrency 10

Competitor Information

# View all available tournament contenders
bun run dev models

๐Ÿ… Tournament Results

Sample Championship Output

🎯 alex-treBENCH Tournament Results
══════════════════════════════════════════════

🎮 Tournament: Quick Championship
📊 Questions: 25 | Categories: Mixed | Duration: 2m 15s

🏆 FINAL STANDINGS 🏆

🥇 CHAMPION: gpt-4o-mini
   📈 Accuracy: 84.0% (21/25 correct)
   ⚡ Speed: 1250ms average response
   💰 Cost: $0.0023
   
🥈 RUNNER-UP: gemini-2.0-flash
   📈 Accuracy: 80.0% (20/25 correct)
   ⚡ Speed: 1100ms average response
   💰 Cost: $0.0019

🥉 THIRD PLACE: claude-3-haiku
   📈 Accuracy: 76.0% (19/25 correct)
   ⚡ Speed: 980ms average response (fastest!)
   💰 Cost: $0.0015 (most economical!)

🎪 Tournament Highlights:
• Most challenging category: SCIENCE (62% avg accuracy)
• Easiest category: POTPOURRI (88% avg accuracy)
• Closest match: Questions 12-15 (all models within 5%)
• Speed demon: claude-3-haiku dominated response times

Results saved to: ./results/tournament_2024_01_15_143022.json

๐Ÿ—๏ธ Tournament Architecture

Design Philosophy

alex-treBENCH combines the best elements from multiple benchmarking approaches:

🎯 From Professional Jeopardy! Systems:

  • Robust question downloading and intelligent caching
  • Multiple answer evaluation strategies for real-world accuracy
  • Professional error handling and recovery

⚡ From Modern Benchmarking Tools:

  • Clean, functional TypeScript architecture with AI SDK integration
  • Real-time progress feedback and tournament atmosphere
  • Smart result caching to minimize API costs

🚀 Simplified Excellence:

  • Reduced complexity while maintaining tournament-grade functionality
  • Focus on user experience and entertainment value
  • Clear separation of concerns with modular design

File Structure

alex-treBENCH/
├── src/
│   ├── data/
│   │   └── downloader.ts     # Question acquisition system
│   ├── models/
│   │   └── config.ts         # Tournament competitor configurations
│   ├── bench/
│   │   ├── evaluator.ts      # Answer evaluation and scoring
│   │   └── runner.ts         # Tournament engine and orchestration
│   └── index.ts              # Tournament command center
├── results/                  # Tournament archives
├── cache/                    # Question and result cache
├── package.json              # Tournament manifest
└── README.md                 # This guide to glory!

🎯 Available Competitors

Currently supporting tournament-ready models via OpenRouter:

๐Ÿ† Championship Tier

  • OpenAI: GPT-4 Turbo, GPT-4o, O1 Preview/Mini
  • Anthropic: Claude 3.5 Sonnet, Claude 3 Opus
  • Google: Gemini 2.0 Flash, Gemini 1.5 Pro

🥇 Professional Tier

  • OpenAI: GPT-3.5 Turbo, GPT-4o Mini
  • Anthropic: Claude 3 Haiku, Claude 3 Sonnet
  • Meta: Llama 3.1 405B, Llama 3.1 70B
  • Google: Gemini 1.5 Flash

🥉 Challenger Tier

  • DeepSeek: DeepSeek V2.5, DeepSeek Coder
  • Qwen: Qwen 2.5 72B, Qwen 2.5 32B
  • Mistral: Mistral Large, Mistral 7B
  • Others: Many more available through OpenRouter

💰 Tournament Budget Management

Keep your tournament costs under control:

  • 🎯 Smart Caching: Results cached by model + question hash
  • 📊 Strategic Sampling: Start with smaller question sets
  • ⚡ Concurrency Control: Limit simultaneous API requests
  • 📈 Usage Tracking: Monitor costs per model in real-time
  • 🎮 Quick Matches: Use --sample 25 for fast, cheap tournaments
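
One way to picture "cached by model + question hash" is the sketch below. The key format, the choice of SHA-256, and the in-memory `ResultCache` class are assumptions for illustration only; the project's on-disk cache layout is not described here.

```typescript
import { createHash } from "node:crypto";

// Hypothetical key format: "<model>:<sha256 of the trimmed question text>".
function cacheKey(model: string, questionText: string): string {
  const hash = createHash("sha256").update(questionText.trim()).digest("hex");
  return `${model}:${hash}`;
}

// In-memory stand-in for the result cache; a real one would persist to disk.
class ResultCache<T> {
  private entries = new Map<string, T>();

  get(model: string, questionText: string): T | undefined {
    return this.entries.get(cacheKey(model, questionText));
  }

  set(model: string, questionText: string, value: T): void {
    this.entries.set(cacheKey(model, questionText), value);
  }

  get size(): number {
    return this.entries.size;
  }
}
```

Because the model name is part of the key, rerunning a tournament with one extra contender only pays for that model's questions; everything already answered is served from the cache.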

Cost Examples (Approximate)

  • Quick Tournament (25 questions, 3 models): ~$0.05-0.15
  • Standard Tournament (100 questions, 5 models): ~$0.20-0.60
  • Championship (500 questions, 10 models): ~$1.00-3.00

🛠️ Tournament Configuration

Key settings in src/models/config.ts:

export const CONFIG = {
  maxConcurrency: 5,           // Simultaneous API calls
  testRunsPerModel: 1,         // Repetitions for consistency
  timeoutSeconds: 30,          // Request timeout
  defaultSampleSize: 50,       // Default question count
  cacheResults: true,          // Enable intelligent caching
  showProgress: true           // Live tournament feed
}
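
For a sense of how `maxConcurrency` and `timeoutSeconds` might be enforced, here is a small sketch. The helper names `mapWithConcurrency` and `withTimeout` are hypothetical and are not taken from src/bench/runner.ts.

```typescript
// Run `task` over `items` with at most `limit` tasks in flight.
// Results are returned in input order.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  task: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  // Each worker repeatedly claims the next unprocessed index.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await task(items[index], index);
    }
  }

  const workerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}

// Reject if `promise` has not settled within `seconds`.
function withTimeout<T>(promise: Promise<T>, seconds: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`Timed out after ${seconds}s`)),
      seconds * 1000,
    );
    promise.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (error) => {
        clearTimeout(timer);
        reject(error);
      },
    );
  });
}
```

With the defaults above, a runner could call `mapWithConcurrency(questions, CONFIG.maxConcurrency, (q) => withTimeout(ask(q), CONFIG.timeoutSeconds))`, where `ask` stands in for whatever function sends a question to the model.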

🎪 Development & Contributing

Setup Your Development Tournament

# Install dependencies at lightning speed
bun install

# Run in development mode
bun run dev

# Build for production tournaments
bun run build

# Run the championship build
bun start

# Test the tournament system
bun test

Environment Variables

  • OPENROUTER_API_KEY: Your tournament access key (required)
  • JEOPARDY_CACHE_DIR: Custom cache location (optional)
  • TOURNAMENT_LOG_LEVEL: Logging verbosity (optional)
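
A minimal sketch of reading these variables is shown below. The `loadEnv` name and the fallback values ("./cache" and "info") are assumptions for illustration, not documented defaults.

```typescript
interface RuntimeEnv {
  apiKey: string;
  cacheDir: string;
  logLevel: string;
}

// Fallbacks ("./cache", "info") are illustrative assumptions, not documented values.
function loadEnv(env: Record<string, string | undefined>): RuntimeEnv {
  const apiKey = env.OPENROUTER_API_KEY?.trim();
  if (!apiKey) {
    throw new Error("OPENROUTER_API_KEY is required; export it or add it to .env");
  }
  return {
    apiKey,
    cacheDir: env.JEOPARDY_CACHE_DIR ?? "./cache",
    logLevel: env.TOURNAMENT_LOG_LEVEL ?? "info",
  };
}
```

Called as `loadEnv(process.env)`, this fails fast with a readable message when the key is missing, rather than surfacing later as an authentication error from the API.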

๐Ÿ† Tournament Comparison

| Feature | Original Complex | Simple Bench | alex-treBENCH |
|---|---|---|---|
| Setup Complexity | Very High | Simple | Super Simple |
| Tournament Feel | Technical | Basic | 🎯 Engaging & Fun |
| Evaluation Quality | 6 strategies | 1 strategy | 🏆 5 Smart Strategies |
| Performance | Slow | Fast | ⚡ Lightning Fast |
| Cost Control | Basic | Good | 💰 Excellent |
| User Experience | CLI | Terminal | 🎮 Tournament Drama |
| Result Presentation | Text | Plain | 🏅 Championship Style |

🎯 What is... Coming Next?

🚀 Planned Tournament Features

  • Category Championships: Specialized tournaments by subject
  • Speed Rounds: Lightning-fast evaluation challenges
  • Historical Tracking: Tournament season statistics
  • Web Interface: Browser-based tournament viewing
  • Custom Datasets: Import your own question sets

๐Ÿ™ Tournament Acknowledgments

  • ๐Ÿ† Jeopardy!: For creating the ultimate question-and-answer format
  • ๐Ÿ“Š Kaggle: For providing authentic Jeopardy! datasets
  • ๐Ÿš€ OpenRouter: For unified access to tournament-worthy language models
  • โšก Bun: For making JavaScript fast enough for real-time tournaments

📞 Tournament Support

Need help running your tournaments?

  • 🎯 Quick Start Issues: Check your API key and internet connection
  • 🏆 Tournament Questions: Review the command examples above
  • 🚀 Feature Requests: Open an issue in the GitHub repository
  • 💬 Bug Reports: Include your tournament logs and configuration

🎉 Ready to Tournament!

What is... the champion in your next LLM tournament? Fire up alex-treBENCH and find out! 🏆

"Thank you for playing alex-treBENCH! We have some lovely parting gifts..." 🎁
