Open LLM Leaderboard
Track, rank and evaluate open LLMs and chatbots
Cool leaderboard spaces collection for models across modalities! Text, vision, audio, ...
Track, rank and evaluate open LLMs and chatbots
Note The reference leaderboard for open LLMs! Find the best LLM for your size and precision needs, and compare your models to others! (Evaluates on ARC, HellaSwag, TruthfulQA, and MMLU)
Submit code models for evaluation on benchmarks
Note Specialized leaderboard for models with coding capabilities (Evaluates on HumanEval and MultiPL-E)
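Coding leaderboards like the one above typically report pass@k on HumanEval-style suites: the probability that at least one of k sampled generations passes the unit tests. A minimal sketch of the standard unbiased estimator from the HumanEval paper (function name and example numbers are illustrative, not taken from the leaderboard itself):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: given n sampled generations,
    of which c passed the tests, estimate the probability that
    at least one of k draws (without replacement) passes."""
    if n - c < k:
        # Fewer failures than draws: at least one success is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples, 30 correct, k = 1 reduces to c / n.
print(pass_at_k(200, 30, 1))
```

For k = 1 the estimator is simply the fraction of correct samples; larger k rewards models whose correct solutions are spread across many samples.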
Display chatbot leaderboard and stats
Note Pits chatbots against one another to compare their output quality (Evaluates on MT-Bench, an Elo score, and MMLU)
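The Elo score mentioned above comes from pairwise battles: after each vote, the winner's rating rises and the loser's falls by the same amount, scaled by how surprising the result was. A minimal sketch of one Elo update (names are mine; the K-factor of 32 is a common default, not necessarily what any given arena uses):

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """One Elo update after a pairwise battle.
    score_a: 1.0 if model A wins, 0.0 if B wins, 0.5 for a tie."""
    # Expected score of A given the current rating gap.
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Two equally rated models; A wins the vote.
print(elo_update(1000, 1000, 1.0))  # (1016.0, 984.0)
```

Because the update is zero-sum, the total rating mass stays constant; an upset win against a much stronger model moves the ratings more than an expected win.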
Explore LLM performance across hardware
Note Do you want to know which model to use on which hardware? This leaderboard is for you! (Measures the throughput of many LLMs across different hardware settings)
Note This paper introduces (among other things) the Eleuther AI Harness, a reference evaluation suite which is simple to use and quite complete!
Note The HELM paper! A super cool reference paper on the many axes to look at when creating an LLM benchmark or evaluation suite. Super exhaustive and interesting to read.
Note The BigBench paper! A bunch of tasks to evaluate edge cases and unusual LLM capabilities. The associated benchmark has since been extended with a lot of fun crowdsourced tasks.
Embedding Leaderboard
Note A text embeddings benchmark across 58 tasks and 112 languages!
Submit models for evaluation and view leaderboard results
Note A leaderboard for tool augmented LLMs!
Display a web page
Note An LLM leaderboard for Chinese models on many metric axes - super complete
Explore and filter language model benchmark results
Note An Open LLM Leaderboard specially for Korean models by our friends at Upstage!
Note A leaderboard to evaluate the propensity of LLMs to hallucinate
View and submit LLM evaluations
Note A lot of metrics if you are interested in the propensity of LLMs to hallucinate!
Visualize model performance on function calling tasks
Note Tests LLM API usage and function calls (few models at the moment)
Evaluate LLM cybersecurity risks
Note How likely is your LLM to help cyber attacks?
Run a Streamlit web app
Note An aggregation of benchmarks well correlated with human preferences
View and submit machine learning model evaluations
Note Bias, safety, toxicity, all those things that are important to test when your chatbot actually interacts with users
Display leaderboard data for video generation models
Note Text to video generation leaderboard
Generate animated avatars from images
Note Coding benchmark
Display OCRBench leaderboard for model evaluations
Note An OCR benchmark
Explore and compare LLM models through a leaderboard
Note Dynamic leaderboard using complexity classes to create reasoning problems for LLMs - quite a cool one
Display model benchmark results
Note Red teaming datasets success against models
Submit and filter LLM models for evaluation
Note The Open LLM Leaderboard, but for structured state space models!
Analyze images to detect and label objects
Note A multimodal arena!
Upload a video model evaluation and view results
Track, rank and evaluate open LLMs in Portuguese
Note An LLM leaderboard for Portuguese
Track, rank and evaluate open LLMs in the Italian language!
Note An LLM leaderboard for Italian
Realtime Image/Video Gen AI Arena
Note An arena for image generation!
Browse Q-Bench leaderboard for vision model performance
Display leaderboard for text-to-image model evaluations
Browse and submit language model benchmarks
Note A hallucination leaderboard, focused on a different set of tasks
View and filter LLM leaderboard data
Browse and filter leaderboard of language models
Display and explore model leaderboards and chat history
Request evaluation for new speech models
VLMEvalKit Evaluation Results Collection
Explore and analyze RewardBench leaderboard data
Vote on the latest TTS models!
Detect prompt injection risks
Browse and view leaderboard results for coding tasks
Explore GenAI model efficiency on ML.ENERGY leaderboard
Uncensored General Intelligence Leaderboard
Track, rank and evaluate open LLMs' CoT quality
Display a leaderboard of models
Browse and compare Indic language LLMs on a leaderboard
Leaderboard for LLM for Science Reasoning
Browse and submit LLM evaluations
View and compare model evaluation results
Browse and evaluate language models
Visualize Open vs. Proprietary LLM Progress
Track, rank and evaluate open LLMs and chatbots
Explore benchmark results for QA and long doc models
Track, rank and evaluate open Arabic LLMs and chatbots
Display and filter LLM benchmark results
Vote on and view 3D leaderboard entries
Explore and analyze code evaluation data
Browse and submit LLM evaluations
Render a leaderboard for model evaluation
Explore multilingual LLM accuracy and translation benchmarks
Submit and track model performance on benchmarks
Browse and submit evaluation results for AI benchmarks
Benchmarking LLMs on the stability of simulated populations
Compact LLM Battle Arena: Frugal AI Face-Off!
Evaluate open LLMs in the languages of LATAM and Spain.
Explore and compare LLM benchmarks and submit models for evaluation
GIFT-Eval: A Benchmark for General Time Series Forecasting
Compare AI models by voting on responses
Open Persian LLM Leaderboard
Compare two chatbots and vote on the better one
Explore and compare LLM models through interactive leaderboards and submissions
Submit protein prediction models to MLSB 2024 leaderboard
Explore toxicity scores of models
Vote on background-removed images to rank models
Display model benchmark metrics
AI Phone Leaderboard
Explore and filter LLM benchmark data
Display and analyze model leaderboard data
Analyze complex Polish text with a benchmark app
Browse and evaluate model answers and comparisons
DABstep Reasoning Benchmark Leaderboard