Robotics

Meet Qwen-RobotSuite: Three Embodied AI Models for VLA Manipulation, Video World Modeling, and Navigation

AI News Desk

MarkTechPost

Jun 16, 2026

7 min read

The Qwen team has released three embodied AI models, grouped as Qwen-Robot-Suite.

Meet Qwen-RobotSuite: Three Embodied AI Models for VLA Manipulation, Video World Modeling, and Navigation

The Qwen team has released three embodied AI models, grouped as Qwen-Robot-Suite. The three are Qwen-RobotManip, Qwen-RobotWorld, and Qwen-RobotNav. Each is built on a Qwen vision-language backbone and targets a different robotics problem.

Qwen-RobotManip is a Vision-Language-Action model for manipulation, built on Qwen3.5-4B. Qwen-RobotWorld is a language-conditioned video world model with a 60-layer MMDiT and a frozen Qwen2.5-VL encoder. Qwen-RobotNav is a navigation model built on Qwen3-VL, available at 2B, 4B, and 8B sizes.

Qwen-Robot-Suite is not a single model. It is a suite of three independent foundation models. Two of them, RobotManip and RobotNav, ship with public GitHub repositories.

Robotics data is fragmented across hardware and tasks. Different robots use incompatible observation and action formats. A policy trained on one arm rarely transfers to another.

The three research reports address this fragmentation in different ways. RobotManip aligns action representations so manipulation data scales. RobotWorld uses language as a unified action interface for video prediction. RobotNav exposes a controllable observation interface for navigation tasks.

Here is the core split between the three releases :

Qwen-RobotManip is a Vision-Language-Action (VLA) foundation model. It is built on Qwen-VL and predicts continuous robot actions.

A VLA model takes camera views and a language instruction. It then outputs low-level robot actions. The challenge is that manipulation data is heterogeneous by nature.

Different robots record states and actions in incompatible formats. When demonstrations arrive with mismatched representations, scaling data produces interference. RobotManip solves this with a unified alignment framework.

The framework has three complementary mechanisms. First is a canonical state-action representation. It is an 80-dimensional vector with per-dimension binary masking.

This vector holds two 29-dimensional per-arm blocks plus 22 reserved dimensions. Each block stores joint positions, end-effector pose, gripper state, and dexterous hand joints. Robots populate only the dimensions they have.

Second is a camera-frame delta pose parameterization. End-effector actions are expressed as deltas in the camera frame. This makes visually similar motions numerically proximate across embodiments.

Third is an in-context policy adaptation mechanism. It reads recent execution history as an implicit embodiment identifier. The policy adjusts behavior at deployment time without parameter updates.

Share this article

X LinkedIn Telegram

Source: MarkTechPost