Alibaba's Qwen-AgentWorld Improves Agent Performance Across Seven Benchmarks
Alibaba's Qwen team releases Qwen-AgentWorld, two models trained to predict what agent environments return across seven domains, boosting agent performance on seven benchmarks.

Alibaba's Qwen team released Qwen-AgentWorld on Tuesday, a pair of models trained not to act inside agent environments, but to predict what those environments return. The release covers seven domains under a single architecture: MCP, Search, Terminal, Software Engineering, Android, Web, and OS. It extends Alibaba's recent push into autonomous agents; Qwen3.7-Max, released in May, was built around a 35-hour autonomous execution capability.
The shift from acting to predicting targets a ceiling that teams training agents at scale run into directly. Real search engines surface whatever results exist, with no mechanism to inject controlled conditions. Live terminals do not allow injecting a low-disk-space condition on demand. Agent training is therefore bounded by what production environments will surface, with no systematic way to expose the edge cases agents will need to handle but rarely encounter in training.
The research team trained agents inside the simulated environments the models provide and found performance gains that exceeded what training against real environments alone produced.
Using world model training as a warm-up before agentic fine-tuning improved performance across seven benchmarks, including three the model had never seen during training. The paper accompanying the release identified a gap in prior agent research: "We argue that world modeling is a crucial missing piece in the path to general agents."

Qwen-AgentWorld trains on what environments return, not what agents should do.
Most agent models are trained to answer one question: given what the environment just showed me, what should I do next? Qwen-AgentWorld is trained to answer the inverse: given what the agent just did, what will the environment show next? That reversal is the core of what the paper calls a language world model: instead of optimizing for action selection, the model learns to predict the next environment state across all seven domains under a single training objective.
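The contrast between the two questions can be sketched as a pair of interfaces. This is an illustration only, not Qwen-AgentWorld's actual API: the names `Policy`, `WorldModel`, `ToyTerminalWorldModel`, and `rollout`, along with the `write` and `ls` commands, are hypothetical, and the hand-written terminal rules stand in for what a learned model would predict. The example also shows why a simulated environment is useful for training: a rare condition such as a nearly full disk can be set up on demand.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class Step:
    """One agent action and the observation returned for it."""
    action: str
    observation: str


class Policy(Protocol):
    """Conventional agent model: latest observation in, next action out."""
    def next_action(self, history: list[Step], observation: str) -> str: ...


class WorldModel(Protocol):
    """Language world model: latest action in, predicted observation out."""
    def predict_observation(self, history: list[Step], action: str) -> str: ...


@dataclass
class ToyTerminalWorldModel:
    """Hand-written stand-in for a learned world model (assumption).

    It keeps explicit state for simplicity; a learned model would instead
    condition on the interaction history.
    """
    free_disk_mb: int = 10_000
    files: dict[str, int] = field(default_factory=dict)  # name -> size in MB

    def predict_observation(self, history: list[Step], action: str) -> str:
        parts = action.split()
        if len(parts) == 3 and parts[0] == "write":
            name, size_mb = parts[1], int(parts[2])
            if size_mb > self.free_disk_mb:
                return f"write: {name}: No space left on device"
            self.free_disk_mb -= size_mb
            self.files[name] = size_mb
            return f"wrote {size_mb} MB to {name}"
        if parts == ["ls"]:
            return "\n".join(sorted(self.files)) or "(empty)"
        return f"{parts[0] if parts else ''}: command not found"


def rollout(world: WorldModel, actions: list[str]) -> list[Step]:
    """Feed a fixed action sequence to the world model and record each step."""
    history: list[Step] = []
    for action in actions:
        observation = world.predict_observation(history, action)
        history.append(Step(action, observation))
    return history


# Inject the rare condition on demand: only 5 MB of free disk space.
low_disk = ToyTerminalWorldModel(free_disk_mb=5)
steps = rollout(low_disk, ["write log.txt 3", "write dump.bin 50", "ls"])
for step in steps:
    print(f"$ {step.action}\n{step.observation}")
```

In this sketch the second write fails with a disk-full message, an edge case that a live terminal would surface only rarely.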
Alibaba trained both models in three stages on more than 10 million environment interaction trajectories from real agent runs. Stage one teaches the model how environments behave: file systems, terminal states, browser DOM changes, API responses. Stage two trains the model to reason through what comes next before predicting it. Stage three, reinforcement learning, tightens predictions using rule-based checks and open-ended quality scoring.
Both models are Mixture-of-Experts designs, so only a fraction of parameters are active per token. The 35B model activates 3B; the 397B activates 17B.
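To make "only a fraction" concrete, the quoted parameter counts can be turned into active-parameter ratios. This is a back-of-the-envelope sketch using only the figures above; the `MODELS` table and the `active_fraction` helper are illustrative names, not part of the release.

```python
# Parameter counts in billions, as quoted in the article.
MODELS = {
    "35B": {"total_b": 35, "active_b": 3},
    "397B": {"total_b": 397, "active_b": 17},
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of parameters used for each token in a Mixture-of-Experts model."""
    return active_b / total_b


for name, params in MODELS.items():
    frac = active_fraction(params["total_b"], params["active_b"])
    print(f"{name}: {frac:.1%} of parameters active per token")
# 35B: 8.6% of parameters active per token
# 397B: 4.3% of parameters active per token
```

By these figures, the larger model has more than eleven times the total parameters of the smaller one but activates fewer than six times as many per token.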
Both support 256K context windows.
The benchmark scores show how accurately the models predict what environments return. The training results show what that prediction capability is actually worth for teams building agents, and those are the numbers that matter more.
Source: VentureBeat