AI models struggle with real-world knowledge work tasks
Top AI model solves just 3% of realistic knowledge work tasks, highlighting significant challenges in handling complex projects.

Even the best AI model fails at realistic knowledge work, fully solving just 3 percent of tasks. The new AA-Briefcase benchmark from Artificial Analysis puts AI models through multi-week knowledge work projects built from thousands of fragmented source files like Slack threads, emails, meeting transcripts, and large data exports. The top performer, Claude Fable 5, hits the highest rubric pass rate but still nails all criteria on just 3 percent of tasks.
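The gap between a high rubric pass rate and a low share of fully solved tasks follows from how the two numbers are counted. The sketch below uses invented pass/fail flags, not AA-Briefcase data, and it assumes the pass rate is pooled over all criteria; the benchmark's actual aggregation may differ.

```python
# Hypothetical rubric results (illustrative only, not benchmark data):
# task id -> one pass/fail flag per rubric criterion.
results = {
    "task_a": [True, True, True, True],
    "task_b": [True, True, True, False],
    "task_c": [True, False, True, True],
    "task_d": [True, True, False, True],
}

def rubric_pass_rate(results):
    """Share of individual criteria passed, pooled over all tasks (assumed aggregation)."""
    flags = [flag for criteria in results.values() for flag in criteria]
    return sum(flags) / len(flags)

def fully_solved_rate(results):
    """Share of tasks on which every single criterion passed."""
    return sum(all(criteria) for criteria in results.values()) / len(results)

print(f"{rubric_pass_rate(results):.0%}")   # 81%
print(f"{fully_solved_rate(results):.0%}")  # 25%
```

In this toy case a model passes 13 of 16 criteria yet fully solves only one task in four, because a single missed detail is enough to fail a task.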
On 31 of the 91 tasks, no model clears even 50 percent of the rubric. The types of errors shift as models get better. Weaker models choke on basic execution: they miss relevant files or spit out unusable results. Stronger models fail more quietly: they hit the obvious requirements but miss details you'd only catch by piecing together information from multiple sources.
There is also a significant price gap: per-task costs span a range of more than 800x, from about $0.04 for DeepSeek V4 Flash to over $31 for Claude Fable 5.
Why this matters: The AA-Briefcase benchmark results underscore the current limitations of AI in handling complex, real-world knowledge work tasks.
With even the best model fully solving only a tiny fraction of tasks, businesses and developers need to weigh both the capabilities and the costs of AI solutions. As AI is integrated into more industries, understanding where it succeeds and where it fails becomes crucial. The more than 800x spread in per-task costs adds another layer of complexity, suggesting that cost will be a significant factor in deciding which AI models are used and how they are deployed.
Ultimately, these findings highlight the need for further advancements in AI to bridge the gap between current capabilities and the demands of real-world applications.
Source: The Decoder