Replies: 3 comments 4 replies
I am in principle still working on something along those lines: https://github.com/JohannesGaessler/elo_hellm . The intent is to have quality control for the training code, but there are other things I need to take care of first: the benchmarks took a long time even with small models on 6x RTX 4090, so I decided to first work on better server throughput (to reduce GPU time, particularly for multi-GPU use) and on better memory allocation (to reduce person time). From my end I would design it like this:
Steps 1-3 could very well live within llama.cpp and I would be happy to upstream them. The ETA from my end would be several months at the very least (but I would also be happy to help with reviewing if someone else wants to work on it in the meantime). Also, particularly for step 2, I think we have very different ideas of what should be implemented. One could in principle simply send all HTTP requests to a single server, but in my experience that just takes way too long because some of the benchmarks have 10000+ questions. So I would use a batch system like HTCondor for that part in order to make it scalable, but that would very much go against the idea of making the tool "simple". It may make sense to define specs for intermediate formats and to make the model-evaluation step modular.
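The single-server approach mentioned above can be sketched in a few lines. This is an illustrative sketch only, assuming a llama-server instance at `localhost:8080` exposing the OpenAI-compatible chat completions endpoint; `BASE_URL`, the worker count, and the helper names are my own choices, not part of any existing tool:

```python
# Sketch: fan benchmark questions out to a single llama-server instance
# via its OpenAI-compatible endpoint. With 10000+ questions this is still
# bounded by the throughput of one server, which is the scalability
# limitation that motivates a batch system like HTCondor.
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "http://localhost:8080/v1/chat/completions"  # assumed default port

def build_payload(question: str, max_tokens: int = 256) -> dict:
    """Build one OpenAI-style chat completion request body."""
    return {
        "messages": [{"role": "user", "content": question}],
        "max_tokens": max_tokens,
        "temperature": 0.0,  # deterministic answers for scoring
    }

def ask(question: str) -> str:
    """Send a single question and return the model's reply text."""
    req = urllib.request.Request(
        BASE_URL,
        data=json.dumps(build_payload(question)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

def run_benchmark(questions: list[str], workers: int = 8) -> list[str]:
    """Issue requests concurrently; results come back in question order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(ask, questions))
```

The same `run_benchmark` interface could in principle be backed by a batch-system submitter instead, which is where defining an intermediate format for (question, answer) records would pay off.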
Great idea, GG! I found this thread because I was looking for a way to run common benchmarks against llama-server and this was among the first search results. HF's lighteval doesn't seem to support just pointing it at llama-server, so I would love to see this. Furthermore, please consider making this feature easily extensible, so we can just point it at a folder containing our own personal benchmarks. I have my own list of prompts I use to test out new LLMs. My prompts don't have a trivially verifiable format like math, code, or multiple-choice benchmarks. I'd rather not post them here for contamination reasons, but I can email them to you if you wish. Here are a few that require hard-to-automate effort to evaluate properly:
Hi GG. I started taking a look at this problem in relation to this issue. From my understanding, llama-perplexity is able to run:
I don't think it's currently architected for running multiple tests in a row, but one option is to run llama-perplexity multiple times for what it supports and collate the results. It also uses the common lib rather than the OpenAI-compatible endpoint you suggested. I think doing it all in Python would simplify things, but each dataset may need to be preprocessed; lm-evaluation-harness writes its own preprocessor for each dataset. I will take a look at
Recently I looked at NVIDIA's Evaluator library. The idea is to run various evaluations (AIME, MMLU, GSM8K, etc.) against an OpenAI-compatible endpoint. As an idea and a user interface, I think it is good.
However, when I tried to get this running on my Mac, I quickly hit friction: on top of having to provide 3 or 4 API keys for various services, the library requires installing Docker and downloading containers, and the runs then get queued through some sort of manager/scheduler. Overall it looks extremely over-engineered and hard to get running; in the end I was not able to make it run, so I gave up.
Still, I think the idea for such a tool is good and I think we should build one: a very basic Python script, with near-zero dependencies, that runs most of the common evals against a specified endpoint.
Requirements:
The best version of this that I know of is the gpt-oss evals, though it is also quite encumbered and over-engineered for my taste, and it has some issues.
So I think there is room for such a basic tool. If anyone is interested in implementing it, I can help guide the work.
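To make the "basic Python script, near-zero dependencies" idea concrete, here is a minimal stdlib-only sketch. The JSONL item shape (`question`/`answer` keys), the endpoint path, and exact-match grading are illustrative assumptions; real benchmarks need per-eval answer extraction (boxed math answers, choice letters, etc.):

```python
# Sketch: read a JSONL file of {"question": ..., "answer": ...} items,
# query an OpenAI-compatible endpoint, report exact-match accuracy.
import json
import urllib.request

def query(endpoint: str, question: str) -> str:
    """Ask one question via the chat completions API, return reply text."""
    req = urllib.request.Request(
        endpoint + "/v1/chat/completions",
        data=json.dumps({
            "messages": [{"role": "user", "content": question}],
            "temperature": 0.0,
        }).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

def grade(prediction: str, reference: str) -> bool:
    """Exact match after trivial normalization; each eval would plug in
    its own extraction/grading function here."""
    return prediction.strip().lower() == reference.strip().lower()

def run_eval(endpoint: str, jsonl_path: str) -> float:
    """Return accuracy of the model behind `endpoint` on the dataset."""
    with open(jsonl_path) as f:
        items = [json.loads(line) for line in f]
    correct = sum(grade(query(endpoint, it["question"]), it["answer"])
                  for it in items)
    return correct / len(items)
```

Keeping the grading function pluggable per eval is probably the main design decision; everything else is plain request/response plumbing.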
Update: #21152