Open LLM Leaderboard
Track, rank and evaluate open LLMs and chatbots
Cool leaderboard spaces collection for models across modalities! Text, vision, audio, ...
Track, rank and evaluate open LLMs and chatbots
Note The reference leaderboard for open LLMs! Find the best LLM for your size and precision needs, and compare your models to others! (Evaluates on ARC, HellaSwag, TruthfulQA, and MMLU)
Submit code models for evaluation on benchmarks
Note Specialized leaderboard for models with coding capabilities (Evaluates on HumanEval and MultiPL-E)
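Coding leaderboards like the one above typically report pass@k on HumanEval-style suites: the probability that at least one of k sampled generations passes the unit tests. A minimal sketch of the standard unbiased estimator from the HumanEval paper (function name and example numbers are illustrative, not taken from the leaderboard itself):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: given n sampled generations,
    of which c passed the tests, estimate the probability that
    at least one of k draws (without replacement) passes."""
    if n - c < k:
        # Fewer failures than draws: at least one success is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples, 30 correct, k = 1 reduces to c / n.
print(pass_at_k(200, 30, 1))
```

For k = 1 the estimator is simply the fraction of correct samples; larger k rewards models whose correct solutions are spread across many samples.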
Display chatbot leaderboard and stats
Note Pits chatbots against one another to compare their output quality (Evaluates on MT-Bench, an Elo score, and MMLU)
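The Elo score mentioned above comes from pairwise battles: after each vote, the winner's rating rises and the loser's falls by the same amount, scaled by how surprising the result was. A minimal sketch of one Elo update (names are mine; the K-factor of 32 is a common default, not necessarily what any given arena uses):

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """One Elo update after a pairwise battle.
    score_a: 1.0 if model A wins, 0.0 if B wins, 0.5 for a tie."""
    # Expected score of A given the current rating gap.
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Two equally rated models; A wins the vote.
print(elo_update(1000, 1000, 1.0))  # (1016.0, 984.0)
```

Because the update is zero-sum, the total rating mass stays constant; an upset win against a much stronger model moves the ratings more than an expected win.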
Explore LLM performance across hardware
Note Do you want to know which model to use on which hardware? This leaderboard is for you! (Measures the throughput of many LLMs across different hardware settings)
Note This paper introduces (among other things) the Eleuther AI Harness, a reference evaluation suite which is simple to use and quite complete!
Note The HELM paper! A super cool reference paper on the many axes to look at when creating an LLM benchmark or evaluation suite. Super exhaustive and interesting to read.
Note The BigBench paper! A bunch of tasks to evaluate edge cases and unusual LLM capabilities. The associated benchmark has since been extended with a lot of fun crowdsourced tasks.
Embedding Leaderboard
Note A text embeddings benchmark across 58 tasks and 112 languages!
Submit models for evaluation and view leaderboard results
Note A leaderboard for tool augmented LLMs!
Display a web page
Note An LLM leaderboard for Chinese models on many metric axes - super complete
Explore and filter language model benchmark results
Note An Open LLM Leaderboard specially for Korean models by our friends at Upstage!
Note A leaderboard to evaluate the propensity of LLMs to hallucinate
View and submit LLM evaluations
Note A lot of metrics if you are interested in the propensity of LLMs to hallucinate!
Visualize model performance on function calling tasks
Note Tests LLM API usage and function calls (few models at the moment)
Evaluate LLM cybersecurity risks
Note How likely is your LLM to help cyber attacks?
Run a Streamlit web app
Note An aggregation of benchmarks well correlated with human preferences
View and submit machine learning model evaluations
Note Bias, safety, toxicity, all those things that are important to test when your chatbot actually interacts with users
Display leaderboard data for video generation models
Note Text to video generation leaderboard
Generate animated avatars from images
Note Coding benchmark
Display OCRBench leaderboard for model evaluations
Note An OCR benchmark
Explore and compare LLM models through a leaderboard
Note Dynamic leaderboard using complexity classes to create reasoning problems for LLMs - quite a cool one
Display model benchmark results
Note Red teaming datasets success against models
Submit and filter LLM models for evaluation
Note The Open LLM Leaderboard, but for structured state space models!
Analyze images to detect and label objects
Note A multimodal arena!
Upload a video model evaluation and view results
Track, rank and evaluate open LLMs in Portuguese
Note An LLM leaderboard for Portuguese
Track, rank and evaluate open LLMs in the Italian language!
Note An LLM leaderboard for Italian
Realtime Image/Video Gen AI Arena
Note An arena for image generation!
Browse Q-Bench leaderboard for vision model performance
Display leaderboard for text-to-image model evaluations
Browse and submit language model benchmarks
Note A hallucination leaderboard, focused on a different set of tasks
View and filter LLM leaderboard data
Browse and filter leaderboard of language models
Display and explore model leaderboards and chat history
Request evaluation for new speech models
VLMEvalKit Evaluation Results Collection
Explore and analyze RewardBench leaderboard data
Vote on the latest TTS models!
Detect prompt injection risks
Browse and view leaderboard results for coding tasks
Explore GenAI model efficiency on ML.ENERGY leaderboard
Uncensored General Intelligence Leaderboard
Track, rank and evaluate open LLMs' CoT quality
Display a leaderboard of models
Browse and compare Indic language LLMs on a leaderboard
Leaderboard for LLM for Science Reasoning
Browse and submit LLM evaluations
View and compare model evaluation results
Browse and evaluate language models
Visualize Open vs. Proprietary LLM Progress
Track, rank and evaluate open LLMs and chatbots
Explore benchmark results for QA and long doc models
Track, rank and evaluate open Arabic LLMs and chatbots
Display and filter LLM benchmark results
Vote on and view 3D leaderboard entries
Explore and analyze code evaluation data
Browse and submit LLM evaluations
Render a leaderboard for model evaluation
Explore multilingual LLM accuracy and translation benchmarks
Submit and track model performance on benchmarks
Browse and submit evaluation results for AI benchmarks
Benchmarking LLMs on the stability of simulated populations
Compact LLM Battle Arena: Frugal AI Face-Off!
Evaluate open LLMs in the languages of LATAM and Spain.
Explore and compare LLM benchmarks and submit models for evaluation
GIFT-Eval: A Benchmark for General Time Series Forecasting
Compare AI models by voting on responses
Open Persian LLM Leaderboard
Compare two chatbots and vote on the better one
Explore and compare LLM models through interactive leaderboards and submissions
Submit protein prediction models to MLSB 2024 leaderboard
Explore toxicity scores of models
Vote on background-removed images to rank models
Display model benchmark metrics
AI Phone Leaderboard
Explore and filter LLM benchmark data
Display and analyze model leaderboard data
Analyze complex Polish text with a benchmark app
Browse and evaluate model answers and comparisons
DABstep Reasoning Benchmark Leaderboard