Only three AI models finish above starting capital in 500-day startup test
Princeton University researchers create CEO-Bench, testing AI agents' ability to run a fictional software company for 500 simulated days.

Researchers at Princeton University built CEO-Bench, a benchmark in which AI agents run a fictional software company for 500 simulated days. Most current models go broke: only three finished with a value above their starting capital, and a simple rule-based heuristic that uses no AI beat nearly all of them.
Most models went bankrupt before the 500 simulated days were up, which shows how much current AI systems struggle in complex, dynamic environments. The findings suggest there is still significant room for improvement before AI systems can manage a task as involved as running a software company.
Why this matters: The results bear on how AI systems are developed and deployed in real-world business settings. That so few models finished above their starting capital, while a simple rule-based heuristic outperformed most of them, points to the need for further research into decision-making, strategic planning, and risk management. As AI systems are used more widely in business and other complex environments, they will need to show higher performance and reliability.
The models' difficulties in CEO-Bench also raise questions about the risks and limits of relying on AI systems for critical business decisions, and underline the need for ongoing evaluation to ensure such systems are safe and effective.
Source: The Decoder