Replies: 3 comments 4 replies
I am in principle still working on something along those lines: https://github.com/JohannesGaessler/elo_hellm . The intent is to have quality control for the training code, but there are other things I need to take care of first: the benchmarks took a long time even with small models on 6x RTX 4090, so I decided to first work on better server throughput (to reduce GPU time, particularly for multi-GPU use) and on better memory allocation (to reduce person time). From my end I would design it like this:
Steps 1-3 could very well live within llama.cpp and I would be happy to upstream them. The ETA from my end would be several months at the very least (but I would also be happy to help with reviewing if someone else wants to work on it in the meantime). Also, particularly for step 2, I think we have very different ideas of what should be implemented. One could in principle simply send all HTTP requests to a single server, but in my experience that just takes way too long because some of the benchmarks have 10000+ questions. So I would use a batch system like HTCondor for that part in order to make it scalable, but that would very much go against the idea of making the tool "simple". It may make sense to define specs for intermediate formats and to make the model-evaluation step modular.
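The single-server approach mentioned above can be sketched in a few lines. This is an illustrative sketch only, assuming a llama-server instance at `localhost:8080` exposing the OpenAI-compatible chat completions endpoint; `BASE_URL`, the worker count, and the helper names are my own choices, not part of any existing tool:

```python
# Sketch: fan benchmark questions out to a single llama-server instance
# via its OpenAI-compatible endpoint. With 10000+ questions this is still
# bounded by the throughput of one server, which is the scalability
# limitation that motivates a batch system like HTCondor.
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "http://localhost:8080/v1/chat/completions"  # assumed default port

def build_payload(question: str, max_tokens: int = 256) -> dict:
    """Build one OpenAI-style chat completion request body."""
    return {
        "messages": [{"role": "user", "content": question}],
        "max_tokens": max_tokens,
        "temperature": 0.0,  # deterministic answers for scoring
    }

def ask(question: str) -> str:
    """Send a single question and return the model's reply text."""
    req = urllib.request.Request(
        BASE_URL,
        data=json.dumps(build_payload(question)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

def run_benchmark(questions: list[str], workers: int = 8) -> list[str]:
    """Issue requests concurrently; results come back in question order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(ask, questions))
```

The same `run_benchmark` interface could in principle be backed by a batch-system submitter instead, which is where defining an intermediate format for (question, answer) records would pay off.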
Great idea, GG! I found this thread because I was looking for a way to run common benchmarks against llama-server and this was among the first search results. HF's lighteval doesn't seem to support just pointing it at llama-server, so I would love to see this. Furthermore, please consider making this feature easily extensible, so we can just point it at a folder containing our own personal benchmarks. I have my own list of prompts I use to test out new LLMs. My prompts don't have a trivially verifiable format like math, code, or multiple-choice benchmarks. I'd rather not post them here for contamination reasons, but I can email them to you if you wish. Here are a few that require hard-to-automate effort to evaluate properly:
Hi GG. I started taking a look at this problem in relation to this issue. From my understanding, llama-perplexity is able to run:
I don't think it's currently architected for running multiple tests in a row, but one option is to run llama-perplexity multiple times for what it supports and collate the results. It also uses the common lib rather than the OpenAI-compatible endpoint you suggested. I think doing it all in Python would simplify things, but each dataset may need to be preprocessed; lm-evaluation-harness writes its own preprocessor for each dataset. I will take a look at
Recently I looked at NVIDIA's Evaluator library. The idea is to run various evaluations (AIME, MMLU, GSM8K, etc.) against an OpenAI-compatible endpoint. As an idea and a user interface, I think it is good.
However, when I tried to get this running on my Mac, I quickly hit friction: on top of having to provide 3 or 4 API keys for various services, the library requires installing Docker and downloading containers, and the runs then get queued through some sort of manager/scheduler. Overall it looks extremely over-engineered and hard to get running; in the end I was not able to make it run, so I gave up.
Still, I think the idea for such a tool is good and I think we should build one: a very basic Python script, with near-zero dependencies, that runs most of the common evals against a specified endpoint.
Requirements:
The best version of this that I know of is the gpt-oss evals, though it is also quite encumbered and over-engineered for my taste, and it has some issues.
So I think there is room for such a basic tool. If anyone is interested in implementing it, I can help guide the work.
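To make the "basic Python script, near-zero dependencies" idea concrete, here is a minimal stdlib-only sketch. The JSONL item shape (`question`/`answer` keys), the endpoint path, and exact-match grading are illustrative assumptions; real benchmarks need per-eval answer extraction (boxed math answers, choice letters, etc.):

```python
# Sketch: read a JSONL file of {"question": ..., "answer": ...} items,
# query an OpenAI-compatible endpoint, report exact-match accuracy.
import json
import urllib.request

def query(endpoint: str, question: str) -> str:
    """Ask one question via the chat completions API, return reply text."""
    req = urllib.request.Request(
        endpoint + "/v1/chat/completions",
        data=json.dumps({
            "messages": [{"role": "user", "content": question}],
            "temperature": 0.0,
        }).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

def grade(prediction: str, reference: str) -> bool:
    """Exact match after trivial normalization; each eval would plug in
    its own extraction/grading function here."""
    return prediction.strip().lower() == reference.strip().lower()

def run_eval(endpoint: str, jsonl_path: str) -> float:
    """Return accuracy of the model behind `endpoint` on the dataset."""
    with open(jsonl_path) as f:
        items = [json.loads(line) for line in f]
    correct = sum(grade(query(endpoint, it["question"]), it["answer"])
                  for it in items)
    return correct / len(items)
```

Keeping the grading function pluggable per eval is probably the main design decision; everything else is plain request/response plumbing.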
Update: #21152