Top 12 LLM Leaderboards to Help You Choose the Right Model
Here's our pick of the best LLM leaderboards. With these, you can choose the right model for your AI application.
What’s the hardest thing about building a custom AI application? Choosing the right LLM. Not fine-tuning the model, not deploying it: picking the right one in the first place is the hardest part.
How can you make that easier? By consulting LLM leaderboards. In this guide, we’ve covered the best LLM leaderboards you can use to compare and evaluate models.
Best LLM Leaderboards
1. Open LLM Leaderboard
With the rapid growth of LLMs and chatbots emerging left and right, it's tough to separate genuine breakthroughs from marketing buzz. Enter the Open LLM Leaderboard, which steps up to benchmark these models using EleutherAI's Language Model Evaluation Harness.
The leaderboard puts models through their paces across six tests: AI2 Reasoning Challenge, HellaSwag, MMLU, TruthfulQA, Winogrande, and GSM8k, targeting key reasoning and general knowledge abilities. If you’re into details, the leaderboard provides numerical breakdowns and model specs, all available on Hugging Face.
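If you want to reproduce a score locally before trusting a number on the board, the harness itself is pip-installable. Below is a minimal sketch, assuming a recent version of lm-evaluation-harness and an arbitrary Hugging Face model ID; the leaderboard's exact few-shot settings and harness version may differ.

```python
# Minimal sketch: run two of the leaderboard's benchmarks locally with
# EleutherAI's lm-evaluation-harness (pip install lm-eval). The API can
# differ between harness versions, so treat this as illustrative.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=mistralai/Mistral-7B-v0.1",  # any Hugging Face model ID
    tasks=["hellaswag", "gsm8k"],
    num_fewshot=5,   # the leaderboard uses task-specific few-shot counts
    batch_size=8,
)

# Per-task metrics (accuracy and friends) live under results["results"].
for task, metrics in results["results"].items():
    print(task, metrics)
```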
2. MTEB Leaderboard: A Deep Dive Into Text Embeddings
Most text embedding evaluations fall into a narrow scope—one task, one dataset—failing to account for the diverse applications these embeddings could be useful for, like clustering or reranking. The Massive Text Embedding Benchmark (MTEB) aims to fix that.
Spanning eight embedding tasks, 58 datasets, and a whopping 112 languages, it’s the largest-scale evaluation of its kind. With 33 models in the ring, MTEB shows that no single method nails all tasks—proving there's still work to be done if we’re after a universal embedding model.
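Evaluating your own embedding model is straightforward with the `mteb` package. Here is a minimal sketch, assuming a recent `mteb` release and an off-the-shelf Sentence Transformers model; task names and the runner API may vary slightly by version.

```python
# Minimal sketch: score an embedding model on two MTEB tasks
# (pip install mteb sentence-transformers).
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model

evaluation = MTEB(tasks=["Banking77Classification", "STSBenchmark"])
results = evaluation.run(model, output_folder="mteb_results")
print(results)
```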
3. Big Code Models Leaderboard
Inspired by the Open LLM Leaderboard, this leaderboard pits multilingual code generation models against one another using the HumanEval and MultiPL-E benchmarks. HumanEval checks how well models can write Python code with 164 challenges, while MultiPL-E kicks it up a notch by translating these problems into 18 different programming languages.
Performance isn’t just about code quality though—it also tracks throughput for batch sizes of 1 and 50. Rankings are determined by pass@1 score and win rate across languages, with bonus points for efficient memory use, all powered by Optimum-Benchmark.
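For reference, pass@1 is typically estimated with the unbiased pass@k formula introduced alongside HumanEval: sample n completions per problem, count how many pass the unit tests, and estimate the probability that at least one of k random samples passes. A short sketch (the numbers in the example are made up):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper.

    n: total samples generated per problem
    c: number of samples that passed the unit tests
    k: the k in pass@k
    """
    if n - c < k:
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed in a numerically stable way.
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples per problem, 37 pass -> pass@1 estimate of 0.185.
print(pass_at_k(n=200, c=37, k=1))
```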
4. SEAL Leaderboards: Elo for AI
The SEAL Leaderboards use an Elo rating system to compare models, kind of like what you’d see in chess rankings. Human evaluators rate AI responses to prompts, determining whether the model wins, ties, or loses each matchup.
Rankings are calculated using the Bradley-Terry model and binary cross-entropy loss to find out which models come out on top. The rankings cover multiple languages, and the results are bootstrapped for statistical confidence. Models from various APIs are put through their paces, ensuring a broad and up-to-date evaluation.
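To make the ranking math concrete, here is a toy sketch of the Bradley-Terry idea: each model gets a strength parameter, the probability that one model beats another is a sigmoid of the strength difference, and the strengths are fit by minimizing binary cross-entropy over the observed matchups. The match data below is invented, and ties and regularization are omitted for brevity.

```python
import numpy as np

models = ["model_a", "model_b", "model_c"]
# (winner_index, loser_index) pairs from hypothetical human votes.
matches = [(0, 1), (0, 1), (1, 2), (0, 2), (2, 1), (0, 1)]

beta = np.zeros(len(models))  # one strength per model
lr = 0.1
for _ in range(2000):  # plain gradient descent on the BCE loss
    grad = np.zeros_like(beta)
    for w, l in matches:
        p_win = 1.0 / (1.0 + np.exp(-(beta[w] - beta[l])))
        grad[w] -= (1.0 - p_win)  # gradient of -log(p_win) w.r.t. beta[w]
        grad[l] += (1.0 - p_win)
    beta -= lr * grad / len(matches)

# Higher strength = higher rank; strengths can be rescaled to an Elo-like scale.
for name, b in sorted(zip(models, beta), key=lambda x: -x[1]):
    print(f"{name}: {b:+.3f}")
```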
5. Berkeley Function-Calling Leaderboard
The Berkeley Function-Calling Leaderboard (BFCL) exists for one reason: to see how well LLMs can handle function calls. Why? Because function calling is key for powering systems like LangChain and AutoGPT. BFCL uses a rich dataset of 2,000 question-function-answer pairs to test LLMs across various languages and complexity levels, from simple single calls to parallel execution scenarios.
Models are ranked based on their ability to detect the right function, execute it properly, and deliver accurate results, with metrics also covering cost and speed.
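To give a feel for what such a test checks, here is a hypothetical question-function-answer pair (not BFCL's actual schema): given a question and the available function signatures, did the model pick the right function and fill in the parameters correctly? BFCL's real matching is more sophisticated than the naive exact-match check shown here.

```python
# Hypothetical illustration of a question-function-answer test case.
test_case = {
    "question": "What's the weather in Berlin tomorrow in Celsius?",
    "functions": [
        {
            "name": "get_weather",
            "parameters": {
                "location": "string",
                "date": "string (YYYY-MM-DD)",
                "unit": "celsius | fahrenheit",
            },
        },
        {"name": "get_time", "parameters": {"timezone": "string"}},
    ],
    # The call a correct model is expected to produce.
    "expected_call": {
        "name": "get_weather",
        "arguments": {"location": "Berlin", "date": "2024-06-02", "unit": "celsius"},
    },
}

def is_correct(model_call: dict) -> bool:
    """Naive exact-match check; the real leaderboard uses more forgiving matching."""
    expected = test_case["expected_call"]
    return (model_call.get("name") == expected["name"]
            and model_call.get("arguments") == expected["arguments"])
```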
6. Occiglot Euro LLM Leaderboard
Think of the Occiglot Euro LLM Leaderboard as the Open LLM Leaderboard with a European flair. It applies EleutherAI's Language Model Evaluation Harness to five benchmarks—AI2 Reasoning Challenge, HellaSwag, MMLU, TruthfulQA, and Belebele—while focusing on multilingual performance.
It’s a great tool for comparing models across languages and tasks, with detailed results and specs available on Hugging Face. But beware of flagged models, which are marked for caution.
7. LMSYS Chatbot Arena Leaderboard
LMSYS Chatbot Arena is a crowdsourced, open platform where human voters decide which LLMs reign supreme. Over a million pairwise comparisons have been logged, and models are ranked using the Bradley-Terry model on an Elo scale.
With 102 models and nearly 1.15 million votes cast (as of May 2024), it’s a serious contender in the chatbot comparison game. New categories—like coding and long-form responses—are in the works, and if you’re itching to have your say, you can contribute votes directly at chat.lmsys.org.
8. Artificial Analysis LLM Performance Leaderboard
The Artificial Analysis Leaderboard takes a more customer-centric approach by benchmarking LLMs based on their serverless API performance. Pricing is per token, with separate rates for input and output tokens. Key performance metrics include Time to First Token (TTFT), throughput (tokens per second), and overall response time for 100 tokens.
On the quality side, the leaderboard weighs normalized scores from MMLU, MT-Bench, and Chatbot Arena Elo ratings. And in case you’re wondering, these tests run daily under various prompt and load conditions to give you a real-world picture of what you can expect.
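If you want to sanity-check these numbers for your own provider, TTFT and throughput are easy to approximate yourself. Here is a rough sketch against an OpenAI-compatible streaming endpoint; the model name and prompt are placeholders, and streamed chunks only approximate tokens.

```python
import time
from openai import OpenAI

client = OpenAI()  # or OpenAI(base_url=..., api_key=...) for other providers

start = time.perf_counter()
first_token_time = None
n_chunks = 0

stream = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Explain transformers in about 100 tokens."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_time is None:
            first_token_time = time.perf_counter()  # time to first token
        n_chunks += 1
end = time.perf_counter()

print(f"TTFT: {first_token_time - start:.2f}s")
print(f"~{n_chunks / (end - first_token_time):.1f} chunks/s after the first token")
```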
9. Open Medical LLM Leaderboard
The Open Medical LLM Leaderboard focuses on a critical area: medical question answering. This leaderboard ranks LLMs using datasets like MedQA (USMLE), PubMedQA, MedMCQA, and specialized subsets of MMLU dealing with medical topics.
These datasets cover everything from clinical knowledge to anatomy and genetics, and the models are put through both multiple-choice and open-ended tests. Accuracy (ACC) is the primary metric here, and any models you want to submit can be evaluated automatically via the "Submit" page.
10. Hughes Hallucination Evaluation Model (HHEM) Leaderboard
Nobody likes a hallucinating AI, and that’s where the Hughes Hallucination Evaluation Model (HHEM) comes in. This leaderboard measures how often LLMs go off the rails and generate factually incorrect or irrelevant content in their summaries. Using a dataset of 1,006 documents (like the CNN/Daily Mail Corpus), it assigns a hallucination score between 0 and 1.
Other metrics include Hallucination Rate (the percentage of summaries scoring below 0.5), Factual Consistency, and Answer Rate. Even models not hosted on Hugging Face get evaluated here, including GPT variants.
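The headline Hallucination Rate is simply the share of summaries whose consistency score falls below 0.5. A quick sketch with invented scores (in practice they come from the HHEM model itself):

```python
# One consistency score per generated summary; values here are made up.
scores = [0.91, 0.12, 0.77, 0.49, 0.88, 0.33, 0.95]

hallucination_rate = sum(s < 0.5 for s in scores) / len(scores)
print(f"Hallucination rate: {hallucination_rate:.1%}")  # 3 of 7 -> ~42.9%
```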
11. OpenVLM Leaderboard
The Open Vision-Language Model (VLM) Leaderboard evaluates 63 models using the open-source framework VLMEvalKit. Covering 23 multi-modal benchmarks, including MMBench_V11 and MathVista, it provides a thorough comparison of models like GPT-4V, Gemini, QwenVLPlus, and LLaVA.
Average scores (normalized 0-100) and ranks give a clear picture of model performance, with frequent updates to keep everything fresh.
12. LLM-Perf Leaderboard 🏋️
The 🤗 LLM-Perf Leaderboard zeroes in on hardware performance, benchmarking LLMs for latency, throughput, memory, and energy efficiency across different configurations.
Using Optimum-Benchmark, the leaderboard ensures consistency by evaluating models on a single GPU with specific prompt and token parameters. Memory usage, energy consumption (measured in kWh), and other factors are monitored using CodeCarbon to give a comprehensive view of each model’s efficiency.
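To show the measurement pattern, here is a minimal CodeCarbon sketch: wrap a generation workload in an EmissionsTracker and read off the estimate when it stops. The model and prompt are placeholders, and this is a simplification of what the leaderboard's Optimum-Benchmark setup actually runs; the tracker also writes a more detailed log, including energy consumed, to a CSV file by default.

```python
# pip install codecarbon transformers
from codecarbon import EmissionsTracker
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # placeholder model

tracker = EmissionsTracker()
tracker.start()
generator("The quick brown fox", max_new_tokens=128)  # the tracked workload
emissions_kg = tracker.stop()  # estimated kg CO2eq for the tracked block

print(f"Estimated emissions: {emissions_kg:.6f} kg CO2eq")
```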