The AI Evaluation Bottleneck: How Cost is Redefining the Field
The cost of evaluating AI models has skyrocketed, making it a new bottleneck in the field, with some evaluations costing tens of thousands of dollars.
The Holistic Agent Leaderboard (HAL) recently spent around $40,000 to run 21,730 agent rollouts across 9 models and 9 benchmarks. A single GAIA run on a frontier model can cost $2,829 before caching. These costs are not a minor line item; they are changing who can run AI evaluations and how they are done.
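A quick back-of-the-envelope check, using only the figures cited above, shows how these bills accumulate (the per-rollout average is illustrative and hides large variation across models and benchmarks):

```python
# Rough per-rollout cost implied by the HAL numbers cited above.
hal_total_usd = 40_000      # approximate total HAL spend
hal_rollouts = 21_730       # agent rollouts across 9 models and 9 benchmarks

cost_per_rollout = hal_total_usd / hal_rollouts
print(f"Average cost per rollout: ${cost_per_rollout:.2f}")   # ~$1.84

# A single GAIA run on a frontier model, before caching.
gaia_run_usd = 2_829
print(f"One GAIA run equals ~{gaia_run_usd / cost_per_rollout:.0f} average-priced rollouts")
```

Averages in the low single dollars per rollout still compound into five-figure totals once they are multiplied across models, benchmarks, and repeated runs.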
The cost problem started before agents. When Stanford's CRFM released HELM in 2022, the paper's own per-model accounting showed API costs ranging from $85 for OpenAI's code-cushman-001 to $10,926 for AI21's J1-Jumbo (178B); the reported API costs and GPU compute together came to roughly $100,000. Nor is the problem confined to API bills: it also reaches training-in-the-loop benchmarks, where running the evaluation can rival the cost of the training run itself.
The field is trying to bring these costs down, and it is not easy. Compression techniques have been proposed for static benchmarks, but the newer agent benchmarks are noisy, scaffold-sensitive, and only partly compressible. Standardized documentation and reliability work are essential, yet they add cost of their own.
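Part of why those agent benchmarks are only partly compressible is plain sampling statistics: with pass/fail rollouts, the uncertainty in a measured score shrinks only with the square root of the number of rollouts. A minimal sketch, where the 40% success rate and the rollout budgets are hypothetical and chosen only to illustrate the scaling:

```python
import math

def score_half_width(success_rate: float, n_rollouts: int) -> float:
    """Approximate 95% confidence half-width for a mean over
    n independent pass/fail rollouts (normal approximation)."""
    return 1.96 * math.sqrt(success_rate * (1 - success_rate) / n_rollouts)

p = 0.40                      # hypothetical agent success rate
for n in (50, 200, 800):      # hypothetical rollout budgets
    print(f"{n:4d} rollouts -> score known to within ±{score_half_width(p, n):.1%}")
# Each halving of the error bar costs ~4x the rollouts, so subsampling a
# noisy benchmark quickly loses the power to separate models whose true
# scores differ by only a few points.
```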
The economics have changed: not long ago, training was expensive and evaluation was cheap. Now evaluation can cost more than training the candidate model.
The implications are significant. Academic groups, AI Safety Institutes, and journalists now hit the budget constraint before the technical one when they try to evaluate frontier agents independently. Industry labs, by contrast, can absorb these costs, so the social process of evaluating AI systems becomes concentrated inside the same labs that build the models, leaving external validation partial, and sometimes absent, unless someone subsidizes it directly.
The conclusion is blunt: whoever can pay for evaluation gets to write the leaderboard. Independent evaluation shouldn't require a frontier lab's compute budget.
There are efforts to make evaluation more affordable and accessible, such as the TAB Platform, which runs 340+ benchmarks across 59 models from 5 providers at $0.03 per text test case. The field needs to rethink its approach to evaluation and find ways to make it more efficient and cost-effective.
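To get a feel for the gap, compare per-test-case pricing with a single frontier agent run; the $0.03 and $2,829 figures come from above, while the 1,000-case benchmark size is a hypothetical assumption for illustration:

```python
# Static text evaluation at TAB-style pricing vs. one frontier GAIA run.
price_per_case = 0.03         # TAB Platform price per text test case (cited above)
benchmark_cases = 1_000       # hypothetical benchmark size

static_cost = price_per_case * benchmark_cases
gaia_run_cost = 2_829         # single GAIA run on a frontier model (cited above)

print(f"Static benchmark ({benchmark_cases} cases): ${static_cost:.2f}")
print(f"Single frontier GAIA agent run:         ${gaia_run_cost:,}")
print(f"Ratio: roughly {gaia_run_cost / static_cost:.0f}x")
```

The two kinds of evaluation are not interchangeable, but a roughly two-orders-of-magnitude gap helps explain why the budget question now shapes what gets measured at all.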
Source: Hugging Face