QIMMA قِمّة ⛰: A Quality-First Arabic LLM Leaderboard
QIMMA validates benchmarks before evaluating models, ensuring reported scores reflect genuine Arabic language capability in LLMs.

The world of Arabic large language models (LLMs) is expanding rapidly, but a concern has grown alongside it: are we actually measuring what we think we're measuring? Benchmarks and leaderboards keep multiplying, yet their quality and effectiveness are often called into question. To address this, we developed QIMMA (Arabic for "summit"), a quality-first Arabic LLM leaderboard that rigorously validates benchmarks before evaluating models.
The need for QIMMA arises from several pain points in Arabic NLP evaluation. Despite over 400 million Arabic speakers across diverse dialects and cultural contexts, evaluation of the language remains fragmented. Many Arabic benchmarks are translated from English, which introduces distributional shifts and makes the benchmark data less representative of natural Arabic usage.
Moreover, even native Arabic benchmarks often lack rigorous quality checks, leading to annotation inconsistencies, incorrect gold answers, and cultural bias. Reproducibility gaps and coverage fragmentation also plague the field. QIMMA sets out to change this.
Our platform combines five key properties: open source, predominantly native Arabic content, systematic quality validation, code evaluation, and public per-sample inference outputs. We consolidate 109 subsets from 14 source benchmarks into a unified evaluation suite of over 52,000 samples spanning 7 domains.
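To make the consolidation concrete, here is a minimal sketch of what a unified sample record might look like. The field names below (`source_benchmark`, `question_format`, and so on) are illustrative assumptions, not QIMMA's actual schema:

```python
from dataclasses import dataclass, field

# Hypothetical unified record for one evaluation sample; field names are
# illustrative assumptions, not QIMMA's published schema.
@dataclass
class QimmaSample:
    sample_id: str            # unique ID within the merged suite
    source_benchmark: str     # one of the 14 source benchmarks
    subset: str               # one of the 109 subsets
    domain: str               # one of the 7 domains
    question_format: str     # drives prompt template selection
    prompt: str               # the Arabic question or task statement
    gold: str                 # reference answer
    metadata: dict = field(default_factory=dict)
```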
Before running a single model, we applied a multi-stage validation pipeline to every sample in every benchmark. Each sample is scored by two state-of-the-art LLMs against a 10-point rubric; samples that fail the criteria are eliminated, and flagged samples are sent to human review.
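To make the two-judge filter concrete, here is a minimal sketch of how such routing could work. The `validate` helper, the stub judges, and both thresholds are assumptions for illustration; the post does not publish QIMMA's exact rubric or cutoffs:

```python
from typing import Callable

# Thresholds below are illustrative assumptions, not QIMMA's published cutoffs.
PASS_THRESHOLD = 7   # both judges must score at least this on the 10-point rubric
FLAG_MARGIN = 3      # score gap between judges that triggers human review

def validate(sample: dict, judges: list[Callable[[dict], int]]) -> str:
    """Route one sample: keep it, drop it, or flag it for human review."""
    scores = [judge(sample) for judge in judges]   # two rubric scores, 0-10
    if abs(scores[0] - scores[1]) >= FLAG_MARGIN:
        return "human_review"                      # judges disagree: escalate
    if min(scores) < PASS_THRESHOLD:
        return "eliminated"                        # fails the rubric: drop
    return "kept"

# Toy usage with stub judges standing in for real LLM calls:
stub_judges = [lambda s: 8, lambda s: 9]
print(validate({"prompt": "..."}, stub_judges))    # -> "kept"
```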
This validation surfaced recurring quality issues across benchmarks, including systematic error patterns, corrupt text, and stereotype reinforcement. We addressed them by refining problem statements and eliminating low-quality samples. QIMMA uses LightEval, EvalPlus, and FannOrFlop as its evaluation frameworks, standardizing prompting by question format with six template types (sketched below). Our results, as of April 2026, show a clear but imperfect size-performance correlation across the top 10 evaluated models.
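As an illustration of format-based template dispatch, the sketch below maps a question format to a prompt template. The six format keys and template strings are invented for illustration (and written in English for readability); QIMMA's actual templates are not reproduced in this post:

```python
# Format keys and template strings are illustrative assumptions,
# not QIMMA's actual six templates.
TEMPLATES = {
    "mcq":        "Question: {question}\nChoices: {choices}\nAnswer with the letter only.",
    "true_false": "Statement: {question}\nAnswer true or false.",
    "open_ended": "Question: {question}\nAnswer concisely.",
    "completion": "Complete the following text: {question}",
    "code":       "Write a program that solves the following task:\n{question}",
    "extraction": "Context: {context}\nQuestion: {question}\nAnswer using only the context.",
}

def build_prompt(sample: dict) -> str:
    """Render a sample with the template matching its question format."""
    return TEMPLATES[sample["question_format"]].format(**sample)

print(build_prompt({"question_format": "true_false",
                    "question": "النيل أطول نهر في أفريقيا."}))
```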
The QIMMA leaderboard provides a more accurate representation of Arabic language capability in LLMs. By validating benchmarks before evaluating models, we ensure that reported scores reflect genuine progress in the field. Our platform offers a comprehensive evaluation suite, and we invite researchers to explore the live leaderboard for current rankings.
In conclusion, QIMMA represents a significant step forward in Arabic LLM evaluation. By prioritizing quality and validity, we can ensure that progress in the field is accurately measured and that models are developed to effectively serve the diverse needs of Arabic speakers worldwide.
Source: Hugging Face