Meet Qwen-RobotSuite: Three Embodied AI Models for VLA Manipulation, Video World Modeling, and Navigation
AI News Desk
·
MarkTechPost
The Qwen team has released three embodied AI models, grouped as Qwen-Robot-Suite. The three are Qwen-RobotManip, Qwen-RobotWorld, and Qwen-RobotNav. Each is built on a Qwen vision-language backbone and targets a different robotics problem.
Qwen-RobotManip is a Vision-Language-Action model for manipulation, built on Qwen3.5-4B. Qwen-RobotWorld is a language-conditioned video world model with a 60-layer MMDiT and a frozen Qwen2.5-VL encoder. Qwen-RobotNav is a navigation model built on Qwen3-VL, available at 2B, 4B, and 8B sizes.
Qwen-Robot-Suite is not a single model. It is a suite of three independent foundation models. Two of them, RobotManip and RobotNav, ship with public GitHub repositories.
Robotics data is fragmented across hardware and tasks. Different robots use incompatible observation and action formats. A policy trained on one arm rarely transfers to another.
The three research reports address this fragmentation in different ways. RobotManip aligns action representations so manipulation data scales. RobotWorld uses language as a unified action interface for video prediction. RobotNav exposes a controllable observation interface for navigation tasks.
The core split between the three releases: RobotManip covers manipulation, RobotWorld covers video world modeling, and RobotNav covers navigation.
Qwen-RobotManip is a Vision-Language-Action (VLA) foundation model. It is built on Qwen3.5-4B and predicts continuous robot actions.
A VLA model takes camera views and a language instruction. It then outputs low-level robot actions. The challenge is that manipulation data is heterogeneous by nature.
Different robots record states and actions in incompatible formats. When demonstrations arrive with mismatched representations, scaling data produces interference. RobotManip solves this with a unified alignment framework.
The framework has three complementary mechanisms. First is a canonical state-action representation. It is an 80-dimensional vector with per-dimension binary masking.
This vector holds two 29-dimensional per-arm blocks plus 22 reserved dimensions. Each block stores joint positions, end-effector pose, gripper state, and dexterous hand joints. Robots populate only the dimensions they have.
Second is a camera-frame delta pose parameterization. End-effector actions are expressed as deltas in the camera frame. This makes visually similar motions numerically proximate across embodiments.
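The effect of a camera-frame parameterization can be sketched in a few lines of Python. This is a minimal illustration, not the report's implementation: it handles only the translation part of the pose, assumes the base-to-camera rotation is known, and the function names and the two example robots are hypothetical.

```python
from typing import List, Sequence

Vec3 = List[float]
Mat3 = List[List[float]]


def mat_vec(m: Mat3, v: Sequence[float]) -> Vec3:
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(m[r][c] * v[c] for c in range(3)) for r in range(3)]


def camera_frame_delta(r_cam_from_base: Mat3, p_curr: Vec3, p_next: Vec3) -> Vec3:
    """Translation delta between two end-effector positions, in the camera frame.

    r_cam_from_base rotates vectors from the robot base frame into the camera
    frame. A delta is a difference of points, so the camera's translation
    offset cancels and only the rotation is needed.
    """
    delta_base = [n - c for n, c in zip(p_next, p_curr)]
    return mat_vec(r_cam_from_base, delta_base)


# Hypothetical robot A: base frame aligned with the camera.
R_A: Mat3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Hypothetical robot B: base frame rotated 90 degrees about z relative to the camera.
R_B: Mat3 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

# Both robots move 10 cm along the camera's x axis, which looks different
# in their own base frames: +x for robot A, -y for robot B.
delta_a = camera_frame_delta(R_A, [0.3, 0.0, 0.2], [0.4, 0.0, 0.2])
delta_b = camera_frame_delta(R_B, [0.0, 0.5, 0.2], [0.0, 0.4, 0.2])
```

In base-frame coordinates the two motions have nothing in common, but `delta_a` and `delta_b` agree up to floating-point error, which is the sense in which visually similar motions become numerically close.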
Third is an in-context policy adaptation mechanism. It reads recent execution history as an implicit embodiment identifier. The policy adjusts behavior at deployment time without parameter updates.
A dual-stream co-training strategy runs alongside this. It jointly optimizes manipulation data and a vision-language stream. This prevents the backbone’s perception and reasoning from eroding.
RobotManip assembles roughly 38,100 hours of manipulation data. It uses only open-source datasets and human videos. No proprietary data collection was used.
A human-to-robot synthesis pipeline produces most of this scale. It converts egocentric hand demonstrations into robot trajectories. The pipeline renders across 15 robot platforms.
This synthesis alone yields about 24,808 hours of demonstrations. The egocentric source data is about 1,933 hours. Open-source robot datasets contribute over 11,000 hours.
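The quoted figures can be cross-checked with simple arithmetic. The sketch below assumes the egocentric source hours count toward the roughly 38,100-hour total, which the article does not state explicitly; the variable names are illustrative.

```python
# Figures quoted in the article, in hours.
TOTAL_HOURS = 38_100        # "roughly 38,100"
SYNTHESIZED_HOURS = 24_808  # output of the human-to-robot synthesis pipeline
EGOCENTRIC_HOURS = 1_933    # egocentric source data

# Hours of synthesized robot data produced per hour of egocentric source video.
amplification = SYNTHESIZED_HOURS / EGOCENTRIC_HOURS

# Share of the full corpus that comes from synthesis.
synth_share = SYNTHESIZED_HOURS / TOTAL_HOURS

# Assumption: the egocentric hours are part of the total. What is left over
# should then match the "over 11,000 hours" of open-source robot data.
remainder = TOTAL_HOURS - SYNTHESIZED_HOURS - EGOCENTRIC_HOURS

print(f"amplification: {amplification:.1f}x")
print(f"synthesized share: {synth_share:.1%}")
print(f"implied open-source robot hours: {remainder:,}")
```

Under that assumption the synthesis step multiplies the source footage by roughly 12.8 and supplies about 65% of the corpus, and the remainder of 11,359 hours is consistent with the "over 11,000" figure.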
The pipeline separates action alignment from visual alignment. Action alignment retargets hand keypoints to gripper poses. Visual alignment uses SAM3 masking, ProPainter inpainting, and MuJoCo inverse kinematics.
A five-stage curation pipeline then filters the combined corpus. It catches sudden changes, temporal misalignment, and extreme values. One check found 81% of episodes in a subset failed state-action alignment.
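Checks of this kind can be sketched as simple per-episode filters. The article does not give the pipeline's actual rules, so everything below is an assumption: the thresholds are arbitrary, and the alignment check assumes absolute-position actions where the action at step t should roughly equal the state at step t+1.

```python
from typing import Sequence

Trajectory = Sequence[Sequence[float]]


def has_sudden_change(traj: Trajectory, max_step: float) -> bool:
    """True if any dimension jumps by more than max_step between consecutive frames."""
    return any(
        abs(b - a) > max_step
        for prev, curr in zip(traj, traj[1:])
        for a, b in zip(prev, curr)
    )


def has_extreme_value(traj: Trajectory, limit: float) -> bool:
    """True if any value lies outside [-limit, limit]."""
    return any(abs(x) > limit for frame in traj for x in frame)


def state_action_misaligned(states: Trajectory, actions: Trajectory, tol: float) -> bool:
    """Hypothetical alignment check: action[t] should be close to state[t + 1]."""
    return any(
        abs(a - s) > tol
        for act, nxt in zip(actions, states[1:])
        for a, s in zip(act, nxt)
    )


def keep_episode(states: Trajectory, actions: Trajectory) -> bool:
    """Keep an episode only if it passes all three illustrative checks."""
    return not (
        has_sudden_change(states, max_step=0.5)
        or has_extreme_value(states, limit=10.0)
        or state_action_misaligned(states, actions, tol=0.05)
    )


clean_states = [[0.0], [0.1], [0.2]]
clean_actions = [[0.1], [0.2]]
shifted_actions = [[0.0], [0.1]]  # actions lag the states by one step
```

Here `keep_episode(clean_states, clean_actions)` passes, while the same states paired with `shifted_actions` are rejected by the alignment check, a toy version of the temporal misalignment the report describes.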
The research report argues standard benchmarks fail to measure generalization. Models without robot pretraining match pretrained ones on in-distribution tests. RobotManip therefore focuses on out-of-distribution (OOD) settings.
The largest reported gap is on cross-embodiment transfer. RobotManip reaches 23.9% using camera-frame EEF actions. That is 3.2× the 7.5% achieved by π0.5.
The model also ranks 1st on the RoboChallenge Table30-v1 generalist track. It scores a 20% relative improvement over the prior best. Real-robot validation covers AgileX ALOHA, Franka, UR, and ARX platforms.
Qwen-RobotWorld is a language-conditioned video world model. It predicts future visual trajectories from a current observation. Natural language serves as the unified action interface.
A world model learns environment dynamics. Given a current state and an action, it predicts the next state. RobotWorld represents states as video frames and actions as text.
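As an interface, this amounts to a function from a current observation and a text instruction to a sequence of future frames. The Python sketch below shows only that contract; the class and method names are hypothetical, frames are toy nested lists, and the placeholder model learns no dynamics at all.

```python
from typing import List, Protocol

Frame = List[List[float]]  # toy stand-in for an image: rows of grayscale pixels


class LanguageConditionedWorldModel(Protocol):
    """Hypothetical contract: state is a frame, action is text, output is future frames."""

    def predict(self, observation: Frame, instruction: str, horizon: int) -> List[Frame]:
        ...


class CopyLastFrameModel:
    """Trivial placeholder that repeats the observation and ignores the instruction."""

    def predict(self, observation: Frame, instruction: str, horizon: int) -> List[Frame]:
        return [[row[:] for row in observation] for _ in range(horizon)]


def rollout(
    model: LanguageConditionedWorldModel,
    observation: Frame,
    instruction: str,
    horizon: int,
) -> List[Frame]:
    """Ask the model for `horizon` predicted frames following the observation."""
    return model.predict(observation, instruction, horizon)


frames = rollout(CopyLastFrameModel(), [[0.0, 1.0]], "pick up the cup", horizon=3)
```

Because the action argument is a plain string, nothing in the signature depends on a particular robot's joint layout, which is the point of using language as the action interface.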
This matters because language is embodiment-agnostic. One instruction encodes the action sequence, goal, and constraints. It works across a Franka gripper, an ALOHA dual-arm system, or a humanoid.
The model uses a 60-layer double-stream Multimodal Diffusion Transformer. An understanding stream processes a frozen Qwen2.5-VL encoder’s features. A generation stream processes video-VAE latents.
The two streams interact via joint attention at every layer. Using an MLLM as the action encoder gives two advantages. It parses compositional instructions and constrains physically plausible transitions.
The MMDiT has 20B parameters. The VAE adopts the Wan-VAE architecture. The context length supports up to 48,360 video tokens.
A Scene2Robot mechanism reuses this backbone for cross-embodiment synthesis. It processes scene, robot reference, and generation segments together. This enables human-to-robot video transfer without robot-specific prompting.
Training uses the Embodied World Knowledge (EWK) dataset. It contains roughly 8.6M video-text pairs, which together span over 200M observation frames.
The corpus covers four embodied domains plus general video. Manipulation provides about 5.9M samples across 20+ morphologies. Driving, navigation, and human-to-robot transfer fill out the rest.
An action-language mapping framework standardizes everything. It converts 20+ embodiment types and 500+ action categories into language. A hierarchical five-layer annotation pipeline produces the captions.
RobotWorld was evaluated on four established benchmarks. It ranks 1st overall on two of them.
On EWMBench it leads motion fidelity with an HSD of 0.566. That is a 33% gain over the runner-up. Scene consistency reaches 0.914.
On WorldModelBench it scores 1.00 on four physics-adherence categories. These are Newton’s laws, mass conservation, fluid dynamics, and gravity. Penetration scores 0.94, and instruction following scores 2.33 out of 3.0.
Qwen-RobotNav is a scalable navigation model built on Qwen3-VL. It reframes multi-task navigation as observation context modeling. The model exposes a parameterized interface for external control.
Navigation spans many task families. Instruction following, point-goal navigation, object search, target tracking, and driving all differ. Each demands a different strategy for consuming the visual stream.
Instruction following needs long memory to re-reference landmarks. Target tracking needs only the most recent frames. No fixed context strategy serves all tasks well.
RobotNav formulates all tasks as waypoint trajectory prediction. It predicts 8 waypoints, each with a 2D position and heading. A lightweight 4-layer MLP head produces these from the backbone.
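The output shape described above, 8 waypoints with 3 values each, can be illustrated with a toy head in pure Python. Only the waypoint count, the per-waypoint fields, and the depth of 4 layers come from the article; the layer widths, the ReLU activation, the random weights, and the omission of biases are assumptions made for brevity.

```python
import random
from dataclasses import dataclass
from typing import List

NUM_WAYPOINTS = 8  # from the article
WAYPOINT_DIMS = 3  # 2D position (x, y) plus heading


@dataclass
class Waypoint:
    x: float
    y: float
    heading: float


def make_mlp(sizes: List[int], seed: int = 0) -> List[List[List[float]]]:
    """Random weight matrices for consecutive layer sizes (biases omitted)."""
    rng = random.Random(seed)
    return [
        [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
        for n_in, n_out in zip(sizes, sizes[1:])
    ]


def mlp_forward(layers: List[List[List[float]]], x: List[float]) -> List[float]:
    """Apply each linear layer, with ReLU on every layer except the last."""
    for i, weights in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) for row in weights]
        if i < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x


def decode_waypoints(flat: List[float]) -> List[Waypoint]:
    """Reshape a flat output vector into (x, y, heading) waypoints."""
    if len(flat) != NUM_WAYPOINTS * WAYPOINT_DIMS:
        raise ValueError("unexpected output size")
    return [Waypoint(*flat[i:i + WAYPOINT_DIMS]) for i in range(0, len(flat), WAYPOINT_DIMS)]


FEATURE_DIM = 16  # hypothetical backbone feature width
head = make_mlp([FEATURE_DIM, 32, 32, 32, NUM_WAYPOINTS * WAYPOINT_DIMS])  # 4 linear layers
waypoints = decode_waypoints(mlp_forward(head, [0.5] * FEATURE_DIM))
```

Because every task reduces to the same 24-value output, the head stays identical across task families and only the conditioning changes.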
The interface has two configuration dimensions. Task modes select navigation behavior across VLN, PointNav, ObjNav, and Tracking. Observation parameters govern how visual history is encoded.
These observation controls include a visual token budget and temporal decay. They also include per-camera importance weights. Training-time randomization over all parameters ensures robustness.
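One way to picture how these three controls interact is a proportional split of the token budget. The allocation rule below is purely an assumption for illustration, since the article does not describe how RobotNav combines the parameters; the function name and the example camera names are hypothetical.

```python
from typing import Dict, Tuple


def allocate_tokens(
    budget: int,
    camera_weights: Dict[str, float],
    num_frames: int,
    decay: float,
) -> Dict[Tuple[str, int], int]:
    """Split a visual token budget across (camera, frame age) pairs.

    Age 0 is the most recent frame. Each pair is weighted by
    camera_weight * decay ** age, and token counts are floored, so the
    total never exceeds the budget.
    """
    weights = {
        (cam, age): w * decay ** age
        for cam, w in camera_weights.items()
        for age in range(num_frames)
    }
    total = sum(weights.values())
    return {key: int(budget * w / total) for key, w in weights.items()}


# Strong decay: most tokens go to recent frames, as target tracking would want.
tracking_like = allocate_tokens(1024, {"front": 1.0, "rear": 0.5}, num_frames=4, decay=0.5)

# No decay: every frame gets the same share, closer to a long-memory setting.
long_memory = allocate_tokens(800, {"front": 1.0}, num_frames=4, decay=1.0)
```

With `decay=0.5` the front camera's newest frame receives 364 tokens and its oldest 45, while `decay=1.0` gives each of the four frames 200 tokens, showing how one interface can serve both short-memory and long-memory tasks.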
Camera identity and temporal order use natural-language tags. This requires zero architectural modification to Qwen3-VL. Supporting a new platform needs only a new prompt template.
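A prompt template of this kind might look like the sketch below. The tag wording, the `<image>` placeholder, and the function name are all hypothetical; the article states only that camera identity and temporal order are conveyed through natural-language tags.

```python
from typing import List


def build_observation_prompt(
    task_mode: str,
    instruction: str,
    cameras: List[str],
    num_frames: int,
    image_placeholder: str = "<image>",
) -> str:
    """Build a text prompt that tags each image slot with its camera and age."""
    lines = [f"Task mode: {task_mode}", f"Instruction: {instruction}"]
    # Oldest frames first, so the text order matches the temporal order.
    for age in range(num_frames - 1, -1, -1):
        if age == 0:
            when = "current frame"
        else:
            when = f"{age} frame{'s' if age != 1 else ''} ago"
        for cam in cameras:
            lines.append(f"[{cam} camera, {when}] {image_placeholder}")
    return "\n".join(lines)


prompt = build_observation_prompt(
    "ObjNav", "find the red chair", cameras=["front", "left"], num_frames=2
)
print(prompt)
```

Adding a platform with a different camera rig would then mean passing a different `cameras` list, with no change to the model itself.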
The interface makes RobotNav a building block for agentic systems. An upper-tier planner decomposes long-horizon goals into sub-goals. Qwen3.6-Plus serves as this planner in the system.
The planner reconfigures RobotNav’s task mode mid-episode. RobotNav serves as the reactive executor. The two tiers communicate exclusively through natural language.
A two-level memory supports long-horizon reasoning. Single-episode memory summarizes each rollout. Cross-episode memory accumulates durable conclusions like searched regions.
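The two levels can be pictured as a pair of small data structures. This is a hypothetical sketch: the class names and fields are invented for illustration, and joining strings stands in for whatever summarization the real system performs.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class EpisodeMemory:
    """Single-episode memory: a running log that is condensed at the end of a rollout."""

    events: List[str] = field(default_factory=list)

    def record(self, event: str) -> None:
        self.events.append(event)

    def summarize(self) -> str:
        # Placeholder summary: a plain join of the recorded events.
        return "; ".join(self.events)


@dataclass
class CrossEpisodeMemory:
    """Cross-episode memory: durable conclusions that outlive any one rollout."""

    searched_regions: Set[str] = field(default_factory=set)
    summaries: List[str] = field(default_factory=list)

    def absorb(self, episode: EpisodeMemory, regions: List[str]) -> None:
        self.summaries.append(episode.summarize())
        self.searched_regions.update(regions)

    def unsearched(self, candidates: List[str]) -> List[str]:
        """Regions from `candidates` that no earlier episode has covered."""
        return [r for r in candidates if r not in self.searched_regions]


episode = EpisodeMemory()
episode.record("searched kitchen, no mug")
episode.record("searched hallway, no mug")

memory = CrossEpisodeMemory()
memory.absorb(episode, regions=["kitchen", "hallway"])
next_targets = memory.unsearched(["kitchen", "bedroom", "hallway"])
```

After one rollout `next_targets` contains only `"bedroom"`, so a planner consulting the cross-episode memory would avoid revisiting regions it has already ruled out.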
RobotNav was trained on 15.6M samples. Navigation trajectory data forms 85% of this. Vision-language reasoning data fills the remaining 15%.
The agentic system sets a new state of the art on Embodied Question Answering. It improves over the best prior method by 10.8% on HM-EQA. It also improves by 15.4% on EXPRESS-Bench while requiring 77% fewer navigation steps.
The report shows performance improving from 2B to 8B parameters. Joint multi-task training develops a shared spatial-planning substrate. The report states this transfers across task families.
Each model maps to concrete deployment scenarios. The examples below combine report-supported results with illustrative framing.
The table below consolidates the technical details. It is a reference for picking the right model.
The three research reports do not present a combined system. Read together, they cover complementary layers. RobotWorld handles simulation and data generation, RobotManip handles manipulation, and RobotNav handles mobility.
The RobotManip action representation is worth understanding in code terms. It is the mechanism that lets different robots share one model. Below is a simplified illustration of the masking idea.
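A minimal sketch of that idea in Python follows. The 80-dimensional total, the two 29-dimensional arm blocks, and the 22 reserved dimensions come from the article; the sub-slot sizes inside each arm block, the left/right ordering, and all names are hypothetical and chosen only so the numbers add up.

```python
from typing import Dict, List, Tuple

CANONICAL_DIM = 80
ARM_BLOCK_DIM = 29
RESERVED_DIM = 22
assert 2 * ARM_BLOCK_DIM + RESERVED_DIM == CANONICAL_DIM

# Hypothetical sub-slot layout inside one 29-dim arm block (illustrative sizes only).
ARM_SLOTS: Dict[str, Tuple[int, int]] = {
    "joint_positions": (0, 7),
    "eef_pose": (7, 13),
    "gripper": (13, 14),
    "hand_joints": (14, 29),
}
# Hypothetical block order: left arm first, then right arm, then reserved dims.
ARM_OFFSETS = {"left": 0, "right": ARM_BLOCK_DIM}


def encode(robot_values: Dict[Tuple[str, str], List[float]]) -> Tuple[List[float], List[int]]:
    """Pack a robot's values into the canonical vector and build its binary mask."""
    vector = [0.0] * CANONICAL_DIM
    mask = [0] * CANONICAL_DIM
    for (arm, slot), values in robot_values.items():
        start, end = ARM_SLOTS[slot]
        if len(values) > end - start:
            raise ValueError(f"{slot} holds at most {end - start} values")
        base = ARM_OFFSETS[arm] + start
        for i, value in enumerate(values):
            vector[base + i] = value
            mask[base + i] = 1  # mark this dimension as populated
    return vector, mask


# A single-arm robot that reports only an end-effector pose and a gripper state.
single_arm = {
    ("right", "eef_pose"): [0.4, 0.0, 0.3, 0.0, 0.0, 0.0],
    ("right", "gripper"): [1.0],
}
vector, mask = encode(single_arm)
```

For this robot only 7 of the 80 entries are marked active; every other dimension stays at zero with a mask of 0, so it can be excluded from supervision.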
The per-dimension binary mask is the key idea. It ensures gradients flow only through semantically populated entries. This prevents spurious supervision on absent degrees of freedom.
The same masking principle appears in the flow-matching loss. Each sample contributes equally regardless of how many dimensions are active. This stops robots with more populated slots from dominating optimization.
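The per-sample normalization can be sketched as follows. The article does not specify the flow-matching formulation, so the linear-path velocity target used here is an assumption, as are the function names; the point of the example is only the division by the number of active dimensions.

```python
from typing import List


def velocity_target(noise: List[float], action: List[float]) -> List[float]:
    """Target velocity for a straight path from noise to action (assumed convention)."""
    return [a - n for n, a in zip(noise, action)]


def masked_sample_loss(pred: List[float], target: List[float], mask: List[int]) -> float:
    """Squared error over active dimensions, averaged by the number of active dimensions."""
    active = sum(mask)
    if active == 0:
        return 0.0
    squared = sum(m * (p - t) ** 2 for p, t, m in zip(pred, target, mask))
    return squared / active


def masked_batch_loss(
    preds: List[List[float]], targets: List[List[float]], masks: List[List[int]]
) -> float:
    """Mean of per-sample losses, so each sample carries equal weight."""
    losses = [masked_sample_loss(p, t, m) for p, t, m in zip(preds, targets, masks)]
    return sum(losses) / len(losses)


zeros = [0.0, 0.0, 0.0, 0.0]
# Sample A: 2 active dims, unit error on each; large errors on masked dims are ignored.
loss_a = masked_sample_loss([1.0, 1.0, 9.0, 9.0], zeros, [1, 1, 0, 0])
# Sample B: 4 active dims, unit error on each.
loss_b = masked_sample_loss([1.0, 1.0, 1.0, 1.0], zeros, [1, 1, 1, 1])
batch = masked_batch_loss(
    [[1.0, 1.0, 9.0, 9.0], [1.0, 1.0, 1.0, 1.0]],
    [zeros, zeros],
    [[1, 1, 0, 0], [1, 1, 1, 1]],
)
```

Both samples yield a loss of 1.0 even though sample B has twice as many active dimensions. Summing squared errors without the division would instead weight B twice as heavily as A.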