GPT-5.5 beats Claude Fable 5 on Agents' Last Exam benchmark
Researchers launch Agents' Last Exam, a benchmark testing AI's ability to execute long-horizon professional workflows, with GPT-5.5 taking top spot.

Researchers from the University of California, Berkeley's Center for Responsible, Decentralized Intelligence (RDI), alongside an advisory committee of over 300 domain experts, have launched Agents' Last Exam (ALE), a new benchmark built to measure whether artificial intelligence can actually execute economically valuable, long-horizon professional workflows. In an upset, OpenAI's GPT-5.5 from April, operating through the Codex harness, took the top spot on the new ALE Leaderboard with a 24.0% pass rate. It beat Anthropic's highly anticipated, brand-new Mythos-class Claude Fable 5 model, released just yesterday, which came in third with a score of 22.0%. Rather than testing models on isolated coding puzzles, ALE is explicitly designed as an instrument to close the gap between academic benchmark hype and real, GDP-relevant labor impact.
And right now, the data shows that even the most advanced models in the world are failing the exam.

Ending the Era of 'Cheating' and Brittle Graders

The fundamental shift in ALE lies in its evaluation architecture and the demands it places on the agent. Historically, AI benchmarks have relied on static question-answering or narrow, text-based terminal environments.
ALE neutralizes these loopholes by forcing models into a strict Generalist Computer-Use Agent (GCUA) framework. To pass, an agent cannot merely execute terminal commands. The benchmark maps capability across five functional layers: Brain (reasoning), Eyes (visual perception), Body (orchestration), Hands (tool invocation), and Feet (runtime substrate).
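As a rough illustration of the five-layer breakdown, the layers can be modeled as a small enumeration and used to check which ones a given agent run leaves unexercised. This is only a sketch: the article does not describe ALE's actual schema, so the `Layer` enum, the `missing_layers` helper, and the terminal-only example are all hypothetical.

```python
from enum import Enum


class Layer(Enum):
    """The five GCUA functional layers, with the role the article gives each."""
    BRAIN = "reasoning"
    EYES = "visual perception"
    BODY = "orchestration"
    HANDS = "tool invocation"
    FEET = "runtime substrate"


def missing_layers(exercised: set[Layer]) -> list[Layer]:
    """Return the layers a run did not exercise, in declaration order."""
    return [layer for layer in Layer if layer not in exercised]


# Hypothetical example: an agent that only drives a terminal
# never needs visual perception.
terminal_only = {Layer.BRAIN, Layer.BODY, Layer.HANDS, Layer.FEET}
print([layer.name for layer in missing_layers(terminal_only)])  # ['EYES']
```

Under this framing, a terminal-only agent falls short because it leaves at least one layer untouched.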
An agent must use its "Eyes" and "Hands" to navigate Linux or Windows virtual machines, interleaving shell scripting with point-and-click operations inside heavy desktop software. Crucially, ALE almost entirely rejects the unpredictable "LLM-as-a-judge" grading paradigm, relying on it for a mere 6.8% of its workflows.

Measuring Task Performance Across 55 Industries

ALE launches with 1,490 task instances and is scaling toward a massive 5,000-task target.
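To put the reported percentages in absolute terms, the short calculation below converts them to approximate task counts. It rests on an assumption the article does not confirm: that the 6.8% LLM-judged share and the 24.0% top pass rate are both measured over the 1,490 launch instances. The `approx_count` helper is hypothetical, and the counts are estimates, not published figures.

```python
TASKS_AT_LAUNCH = 1_490   # from the article
TASK_TARGET = 5_000       # from the article
LLM_JUDGED_SHARE = 0.068  # from the article
TOP_PASS_RATE = 0.240     # GPT-5.5 via the Codex harness, from the article


def approx_count(share: float, total: int) -> int:
    """Nearest whole number of tasks that a reported share corresponds to."""
    return round(share * total)


# ASSUMPTION: both percentages are taken over the 1,490 launch instances.
print(f"progress toward target: {TASKS_AT_LAUNCH / TASK_TARGET:.1%}")
print(f"LLM-judged tasks (approx.): {approx_count(LLM_JUDGED_SHARE, TASKS_AT_LAUNCH)}")
print(f"tasks passed by top entry (approx.): {approx_count(TOP_PASS_RATE, TASKS_AT_LAUNCH)}")
```

If that assumption holds, roughly 100 tasks rely on an LLM judge, and the leading entry completes only around 360 of the 1,490.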
What makes the benchmark remarkable is its authenticity. The tasks are strictly anchored in the U.S. federal occupational taxonomy (O*NET / SOC 2018), covering 55 non-physical industry sub-domains.
The victory of GPT-5.5 aligns with recent third-party analysis suggesting that OpenAI's models are currently superior at strictly adhering to multi-part, complex prompts. Conversely, users report that Anthropic's Claude architecture can sometimes be "forgetful" with multi-part instructions, abandoning required steps mid-workflow, a fatal flaw in ALE's rigorous pipeline.

Solving Benchmark Contamination

A core vulnerability in modern AI evaluation is "benchmark contamination": the phenomenon where test questions inevitably leak into the massive data lakes used to train next-generation models.
Source: VentureBeat