Frontier AI Models Fail to Impress on New Enterprise IT Benchmark
The first benchmark for agentic enterprise IT tasks, ITBench-AA, reveals that leading AI models score below 50% on Site Reliability Engineering tasks.

Artificial Analysis and IBM Software Innovation Lab have released ITBench-AA, a benchmark designed to assess how well AI models handle complex enterprise IT tasks. Its initial focus is Site Reliability Engineering (SRE), where even the most advanced models score below 50%. The SRE tasks, drawn from the ITBench dataset developed by IBM, evaluate a model's ability to respond to Kubernetes incidents by analyzing logs, tracing dependencies, and pinpointing root causes within complex infrastructure.
The benchmark draws on IBM's expertise in enterprise IT operations. Artificial Analysis has worked with IBM over the past six months to adapt the dataset for evaluating frontier AI models, starting with SRE tasks, with plans to expand to Financial Operations (FinOps) and Chief Information Security Officer (CISO) tasks. ITBench-AA is intended to set a standard for assessing how well AI models perform in real-world enterprise settings.
By challenging models to perform tasks that require a deep understanding of complex IT systems, ITBench-AA offers a more practical evaluation of AI capabilities than traditional benchmarks. The sub-50% scores of frontier models on its SRE tasks point to a sizable gap between current AI capabilities and the demands of enterprise IT. As AI takes on a larger role in managing and optimizing IT operations, the ability of models to diagnose and resolve issues accurately becomes essential.
The results from ITBench-AA will likely inform future work in AI, guiding researchers and developers toward more effective and practical solutions for enterprise IT challenges. The collaboration between Artificial Analysis and IBM underscores the value of cross-organization partnerships in advancing AI research and development. By combining their expertise, the two organizations have created a benchmark that both measures current AI capabilities and gives a clear direction for future progress.
Source: Hugging Face