The Open Agent Leaderboard
A new benchmark for comparing full AI agent systems, not just the models inside them, has been launched to evaluate their quality and cost across diverse tasks.

How good are general-purpose AI agents? This question is at the heart of a new evaluation framework we've developed, which assesses not just the models powering these agents, but the entire system: the tools they can use, how they plan their steps, what they remember between actions, and how they recover from setbacks. Change any of these components, and the same model can produce vastly different results at significantly different costs.
The Open Agent Leaderboard is an open benchmark that reports both the quality and the cost of full agent systems, giving a more holistic view of their capabilities. It is paired with the Exgentic framework for running and reproducing evaluations and with a paper detailing the methodology and results. Together, these aim to make agent evaluation more transparent and useful for the community.
AI agents have proven useful when tailored to specific tasks, such as coding within a familiar repository or handling customer service with a predefined set of tools. The harder challenge is building agents that can handle a wide range of tasks, each with its own tools, rules, and constraints, without manual customization for each scenario. Generality, in this context, refers to an agent's ability to adapt to new settings with minimal modification.
The Open Agent Leaderboard evaluates agents across six diverse benchmarks, each testing a different kind of realistic task, including coding, customer service, technical support, personal assistance, and research. These benchmarks were chosen for their established credibility within the research community and for their ability to collectively cover a broad range of working settings.
A key innovation of the leaderboard is a unified protocol that lets different benchmarks communicate with agents in a standardized way. Because every agent meets every benchmark through the same interface, performance can be compared more accurately across tasks and settings.
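To make the idea of a unified protocol concrete, here is a minimal sketch in Python. It is not the Exgentic API: the names Observation, Agent, Benchmark, run_episode, EchoBenchmark, and EchoAgent are all hypothetical, chosen only to show how a single interface can let any agent run against any benchmark without benchmark-specific glue code.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Observation:
    """Hypothetical message a benchmark sends to an agent."""
    text: str            # what the benchmark tells the agent
    done: bool = False   # True once the task has ended
    score: float = 0.0   # task score, meaningful only when done is True


class Agent(Protocol):
    """Anything that can turn an observation into an action string."""
    def act(self, observation: Observation) -> str: ...


class Benchmark(Protocol):
    """Anything that can start a task and respond to actions."""
    def reset(self) -> Observation: ...
    def step(self, action: str) -> Observation: ...


def run_episode(agent: Agent, benchmark: Benchmark, max_steps: int = 10) -> float:
    """Drive one task through the shared interface and return its score."""
    obs = benchmark.reset()
    for _ in range(max_steps):
        if obs.done:
            break
        obs = benchmark.step(agent.act(obs))
    # An unfinished task counts as a failure in this sketch.
    return obs.score if obs.done else 0.0


class EchoBenchmark:
    """Toy task (illustration only): succeed by replying with the requested word."""
    def __init__(self, word: str) -> None:
        self.word = word

    def reset(self) -> Observation:
        return Observation(text=f"Reply with: {self.word}")

    def step(self, action: str) -> Observation:
        score = 1.0 if action == self.word else 0.0
        return Observation(text="finished", done=True, score=score)


class EchoAgent:
    """Toy agent (illustration only): returns whatever follows the colon."""
    def act(self, observation: Observation) -> str:
        return observation.text.split(": ", 1)[-1]
```

In this sketch, run_episode never needs to know which benchmark or agent it is driving. That is the property a shared protocol is meant to provide: swapping either side leaves the evaluation loop unchanged.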
The current top five agent systems on the leaderboard show how much the surrounding system matters. The top three all use the same model, yet they differ significantly in both score and cost because of the agent systems wrapped around that model. The cost gap is also notable: the most efficient configuration runs at a fraction of the price of the strongest one.
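One way to read a leaderboard that reports both quality and cost is to ask which systems are not beaten on both axes at once. The sketch below computes that set (a Pareto frontier) in Python. It is an illustration, not the leaderboard's own ranking method; the Result type, the pareto_frontier function, the system names, and every number are invented placeholders rather than real results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    system: str
    score: float  # higher is better
    cost: float   # lower is better, e.g. average cost per task


def pareto_frontier(results: list[Result]) -> list[Result]:
    """Return the systems that no other system beats on both score and cost."""
    frontier = []
    for r in results:
        dominated = any(
            o.score >= r.score
            and o.cost <= r.cost
            and (o.score > r.score or o.cost < r.cost)
            for o in results
        )
        if not dominated:
            frontier.append(r)
    # Cheapest first, so the trade-off reads left to right.
    return sorted(frontier, key=lambda r: r.cost)


# Invented placeholder values, NOT leaderboard results.
results = [
    Result("system_a", score=0.70, cost=4.00),
    Result("system_b", score=0.65, cost=1.00),
    Result("system_c", score=0.60, cost=2.50),
]

for r in pareto_frontier(results):
    print(r.system, r.score, r.cost)
```

With these placeholder values, system_c drops out because system_b is both cheaper and higher scoring, while system_a and system_b both remain: one is the strongest, the other the most efficient.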
One of the most surprising findings is that general-purpose agents are already competitive with specialized ones, even without benchmark-specific tuning. This suggests that a single agent can increasingly handle many kinds of work, not just the one environment it was prepared for. The results also highlight the importance of understanding how agents fail.
Source: Hugging Face