OpenAI Introduces Deployment Simulation for Pre-Deployment Risk Assessment
OpenAI publishes Deployment Simulation, a new pre-deployment safety method that simulates model deployment to assess risks before release.

OpenAI has published a new pre-deployment safety method called Deployment Simulation. The idea is direct: before a model ships, simulate its deployment first. Replay past conversations through the new candidate model and study how it behaves in realistic contexts.
OpenAI already uses insights from the method during model development. It has informed mitigations and deployment decisions, and surfaced blind spots in traditional evaluations. Deployment Simulation is a method for simulating a future deployment before it happens.
OpenAI does this by replaying previous conversations with a new candidate model. The replay is privacy-preserving. The technique is simple at its core.
Take recent conversations from deployment, remove the original assistant response from the older model, regenerate that response with the candidate model to be released, and then evaluate the completions for new failure modes. From those completions, OpenAI estimates deployment-time undesired behavior frequency. The same measurement can run after release on real traffic, making pre-deployment forecasts checkable later.
There is a floor to the approach: it cannot measure behaviors that occur less than once in 200,000 messages. It targets non-tail risks, not the rarest events. Traditional evaluations mix synthetic, manually written, or production prompts chosen to be difficult, high severity, or adversarial.
Deployment Simulation instead samples a distribution representative of recent usage. That representativeness fixes three known problems: it reduces selection bias from hand-picked prompts, improves coverage by simulating more traffic, and reduces evaluation awareness, since contexts look like real deployment. The approach has a clear tradeoff: quality scales with compute, not with manual effort to build evals.
More resampled traffic means more behaviors surfaced. OpenAI team frames the two approaches as complementary. The key difference is detectability: models recognize synthetic tests but struggle to tell simulated traffic from real deployment.
OpenAI pre-registered predictions for 20 types of undesirable behavior on GPT-5.4 Thinking and ran retrospective studies across other GPT-5-series Thinking deployments. The dataset was large, analyzing roughly 1.3 million de-identified conversations spanning GPT-5 Thinking through GPT-5.4, from August 2025 to March 2026. Only traffic from users who allow data for model improvements was used.
Three quality bars were measured: taxonomy coverage checks for missed misaligned behaviors, directional accuracy checks predicted increase or decrease, and rate calibration checks closeness to observed rates, the hardest standard. The aggregate result was a median multiplicative error of 1.5x. For a true rate of 10 in 100k, that means estimating 15 or 6.67 in 100k.
Source: MarkTechPost