New AI optimization framework Arbor outperforms Claude Code and Codex by 2.5x
Arbor framework automates AI-driven research and optimization, outperforming Claude Code and Codex by 2.5x on the same compute budget.

Imagine an AI agent deployed to search through internal company documents and answer employee questions. It works perfectly in development but consistently hallucinates or misses key constraints in production. Fixing this requires a tedious process of tweaking chunking strategies, retrieval methods, and system prompts simultaneously.
Researchers at Renmin University of China and Microsoft Research introduced Arbor, a framework that upgrades AI-driven research and optimization from trial-and-error guesses to a cumulative learning process. Arbor organizes hypotheses, experiments, and insights into a tree that helps the system learn from prior failures to make smarter, verified improvements over time. In practical tests, Arbor delivered more than 2.5 times the verifiable performance gains of standard AI coding agents across real-world engineering tasks while operating under the same resource budget.
For enterprise AI, this technique translates directly into automating the continuous improvement of complex, real-world engineering systems. To see why it matters, it helps to understand the bottleneck in autonomous optimization (AO). Large language models and other AI systems are expected to carry out complex operations such as the autonomous optimization of software systems.
AO captures the fundamental loop of autonomous research. An AI agent starts with an initial mutable artifact and a specific objective, aiming to iteratively improve this artifact through experimental feedback without human supervision. The main challenge of AO is often misunderstood.
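To make the AO loop concrete, here is a minimal sketch in Python. It is not Arbor's implementation or anything taken from the paper: the function names (`optimize`, `toy_metric`, `toy_propose`), the numeric "artifact," and the greedy keep-if-better rule are all assumptions chosen for illustration. It shows only the basic shape of the loop: start from an initial artifact, propose a change, measure it against the objective, and keep it if the metric improves.

```python
from typing import Callable, Iterable, Tuple


def optimize(
    artifact: float,
    evaluate: Callable[[float], float],
    propose: Callable[[float], Iterable[float]],
    budget: int,
) -> Tuple[float, float]:
    """Greedy AO loop: keep a candidate only if the metric improves."""
    best, best_score = artifact, evaluate(artifact)
    for _ in range(budget):
        improved = False
        for candidate in propose(best):
            score = evaluate(candidate)  # experimental feedback
            if score > best_score:
                best, best_score = candidate, score
                improved = True
                break
        if not improved:
            break  # no proposal helped, so stop early
    return best, best_score


# Hypothetical toy objective: the score peaks when the artifact equals 3.0.
def toy_metric(x: float) -> float:
    return -((x - 3.0) ** 2)


# Hypothetical proposal rule: try a small step in each direction.
def toy_propose(x: float) -> Tuple[float, float]:
    return (x + 0.5, x - 0.5)


best, score = optimize(0.0, toy_metric, toy_propose, budget=20)
print(best, score)
```

Note that this loop remembers only the current best artifact. Every rejected candidate is thrown away along with whatever it might have taught the system.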
Simply giving a coding agent more time or compute to optimize a codebase doesn't lead to better results. "Automation can keep an AI working for a very long time — but a loop is not the same as progress," Jiajie Jin, co-author of the paper, told VentureBeat. "If the goal is vague, or the metric is easy to hack, long-running automation often just produces 'improvements' faster that nobody actually wants." Current agent systems treat each attempt in isolation, missing the structural mechanisms to accumulate and act on what they've learned.
They lack the capacity to simultaneously maintain and compare multiple competing research directions. Without that capacity, they cannot interpret both successes and failures to reshape their future exploration. The Arbor framework addresses these challenges by automating the long-horizon loop of exploration, experimentation, and abstraction that characterizes human research.
Arbor separates the strategic direction of research from ground-level coding tasks with two key components: the coordinator and the executors. The coordinator acts like a principal investigator: it owns the overall state of the optimization research, observes accumulated evidence, proposes new hypotheses and directions to explore, and decides what to do with experiment results. Executors are short-lived, highly focused AI agents that implement assigned ideas, run evaluations, debug errors, and report back to the coordinator.
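The division of labor described above can be sketched as a toy Python program. This is an illustrative assumption, not Arbor's actual code or API: the `Node`, `run_executor`, and `Coordinator` names, the numeric artifact, the fixed "shift by a step" hypotheses, and the selection rule are all hypothetical. A real coordinator would be an LLM reasoning over the tree. Here a simple rule stands in for it, so that the structure (a tree of hypotheses, scores, and insights, with short-lived executors reporting back) is visible.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass(eq=False)
class Node:
    """One hypothesis in the tree, with the evidence gathered for it."""
    hypothesis: str
    artifact: float
    parent: Optional["Node"] = None
    score: Optional[float] = None
    insight: str = ""
    children: List["Node"] = field(default_factory=list)


def run_executor(node: Node, evaluate: Callable[[float], float]) -> None:
    """Short-lived worker: run one experiment and report back on the node."""
    node.score = evaluate(node.artifact)
    helped = node.parent is None or node.score > node.parent.score
    node.insight = "helped" if helped else "did not help"


class Coordinator:
    """Owns the tree, picks the next direction, and dispatches executors."""

    def __init__(self, artifact: float,
                 evaluate: Callable[[float], float], step: float = 1.0):
        self.evaluate = evaluate
        self.step = step
        self.root = Node("baseline", artifact)
        run_executor(self.root, evaluate)
        self.nodes = [self.root]

    def select(self) -> Optional[Node]:
        """Pick the best unexpanded node whose experiment helped.

        Failed branches stay in the tree as evidence but are not extended.
        """
        open_nodes = [n for n in self.nodes
                      if not n.children and n.insight == "helped"]
        return max(open_nodes, key=lambda n: n.score, default=None)

    def run(self, rounds: int) -> Node:
        for _ in range(rounds):
            parent = self.select()
            if parent is None:
                break  # every open direction has been ruled out
            for delta in (self.step, -self.step):
                child = Node(f"shift by {delta:+}",
                             parent.artifact + delta, parent=parent)
                parent.children.append(child)
                run_executor(child, self.evaluate)
                self.nodes.append(child)
        return max(self.nodes, key=lambda n: n.score)


# Hypothetical toy objective: the score peaks when the artifact equals 3.0.
coordinator = Coordinator(0.0, lambda x: -((x - 3.0) ** 2))
best = coordinator.run(rounds=10)
failures = [n for n in coordinator.nodes if n.insight == "did not help"]
print(best.artifact, len(coordinator.nodes), len(failures))
```

The point of the sketch is that failed experiments are kept in the tree with an insight attached rather than discarded, so the selection step can compare competing branches and avoid extending directions that have already been ruled out.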
Source: VentureBeat