OpenAI researchers want to predict how often AI models will fail before launch

OpenAI researchers propose a method for predicting how often a new AI model will make mistakes after release. It could fill gaps left by standard safety testing.
Before an AI model ships, it goes through safety testing. These tests try to estimate how often the model will later show unwanted behavior, like producing banned content or deceiving users. According to an OpenAI research paper, most of these tests rely on handwritten, synthetic, or deliberately tricky questions.
But these tests only capture a skewed slice of reality. They're designed to probe for weaknesses, not to reflect what real users actually type. On top of that, models often pick up on the fact that they're being tested and behave differently than they would in normal use. Both issues mean test results say little about how a model will actually perform in the wild.
Researchers Marcus Williams, Micah Carroll, and their team propose a straightforward approach called "Deployment Simulation." Instead of crafting new test questions, they pull from real, anonymized conversations that users had with a previous model. They keep the full conversation history, including all prior messages, and have the new, unreleased model regenerate only the next response.
Because the source conversations come from real traffic, the model faces exactly the kinds of situations it'll encounter after launch. And it doesn't realize it's being tested, since it's just looking at a normal user request.
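The replay step described above can be sketched in a few lines of Python. This is a minimal illustration, not OpenAI's implementation: the `Message` and `SimulatedTurn` types, the choice to regenerate the last assistant turn, and the `new_model` callable are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Message:
    role: str      # "user" or "assistant"
    content: str


@dataclass
class SimulatedTurn:
    history: list[Message]     # real, anonymized context shown to the new model
    original_response: str     # what the previous model actually said
    simulated_response: str    # what the unreleased model says instead


def split_at_last_assistant_turn(
    conversation: list[Message],
) -> Optional[tuple[list[Message], Message]]:
    """Return (history, original reply) for the last assistant turn, if any.

    Assumption: the final assistant turn is the one being regenerated.
    """
    for i in range(len(conversation) - 1, -1, -1):
        if conversation[i].role == "assistant":
            return conversation[:i], conversation[i]
    return None


def simulate_deployment(
    conversations: list[list[Message]],
    new_model: Callable[[list[Message]], str],   # hypothetical model interface
) -> list[SimulatedTurn]:
    """Replay real conversation histories through the new model."""
    results = []
    for conversation in conversations:
        split = split_at_last_assistant_turn(conversation)
        if split is None:
            continue  # nothing to regenerate in this conversation
        history, original = split
        results.append(
            SimulatedTurn(history, original.content, new_model(history))
        )
    return results
```

The history is passed through untouched, so the new model sees the same context a production request would carry.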
These simulated responses serve two purposes. First, they can be scanned for new types of misbehavior. Second, researchers can count how often a specific problem shows up and derive a concrete frequency estimate. That estimate is verifiable: after release, the same measurement runs against real production data and gets compared to the prediction.
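Counting flagged responses and turning the count into a frequency estimate might look roughly like this. The Wilson score interval is a choice made for this sketch, since the article does not say how the paper quantifies uncertainty, and the function names are hypothetical.

```python
import math


def wilson_interval(hits: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("need at least one sample")
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def estimate_rate(flags: list[bool]) -> tuple[float, float, float]:
    """Return (point estimate, lower bound, upper bound) for flagged responses.

    Each entry in `flags` says whether one simulated response showed the
    misbehavior in question.
    """
    n = len(flags)
    hits = sum(flags)
    low, high = wilson_interval(hits, n)
    return hits / n, low, high


def prediction_holds(predicted: tuple[float, float, float],
                     production_rate: float) -> bool:
    """Check whether the rate measured after launch falls inside the interval."""
    _, low, high = predicted
    return low <= production_rate <= high
```

After release, the same classifier would run on production traffic and `prediction_holds` would compare the two numbers.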
OpenAI tested the approach on four models in the GPT-5 series using roughly 1.3 million conversations from August 2025 through March 2026.
For GPT-5.4, the researchers applied a stricter protocol: they used the simulation to predict how often the model would show each type of misbehavior after release, then locked in those estimates before they had access to any real usage data. That made it possible to check later, without bias, how well the predictions matched reality. Three older models in the series were analyzed retroactively, after real-world results were already known.
The team examined 20 categories of misbehavior, from banned content to deception. For categories where the frequency shifted significantly between model versions, the simulation correctly predicted whether a problem would increase or decrease 92 percent of the time. Standard tests got that right just 54 percent of the time.
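A directional-accuracy score of this kind can be computed as follows. This is a sketch under assumptions: the article does not define what counts as a significant shift, so the relative-change threshold here is a hypothetical stand-in, and the record layout is invented for illustration.

```python
from typing import Optional


def direction(before: float, after: float) -> int:
    """Return +1 for an increase, -1 for a decrease, 0 for no change."""
    return (after > before) - (after < before)


def directional_accuracy(
    records: list[tuple[float, float, float]],
    min_relative_change: float = 0.5,   # hypothetical significance threshold
) -> Optional[float]:
    """Share of categories where the predicted direction matched reality.

    Each record is (old_model_rate, predicted_new_rate, actual_new_rate)
    for one misbehavior category. Categories whose actual rate moved by
    less than `min_relative_change` relative to the old rate are skipped.
    """
    considered = 0
    correct = 0
    for old, predicted, actual in records:
        if old <= 0:
            continue  # relative change is undefined without a baseline
        if abs(actual - old) / old < min_relative_change:
            continue  # shift too small to count
        considered += 1
        if direction(old, predicted) == direction(old, actual):
            correct += 1
    return correct / considered if considered else None
```

Running the same function once with simulation-based predictions and once with predictions from standard tests would give two scores comparable to the 92 and 54 percent figures above.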
Source: The Decoder