Meet Qwen-RobotSuite: Three Embodied AI Models for VLA Manipulation, Video World Modeling, and Navigation

June 16, 2026, 16:51

Key Takeaways

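The Qwen team has released three embodied AI models, collectively called Qwen-Robot-Suite: Qwen-RobotManip, Qwen-RobotWorld, and Qwen-RobotNav. Each model uses a Qwen vision-language model as its backbone and targets a different robotics problem. Qwen-RobotManip is a Vision-Language-Action model for manipulation, built on Qwen3.5-4B. Qwen-RobotWorld is a language-conditioned video world model that uses a 60-layer MMDiT and a frozen Qwen2.5-VL encoder. Qwen-RobotNav is a navigation model built on Qwen3-VL, available at 2B, 4B, and 8B parameter sizes. Qwen-Robot-Suite is not a single model but a set of three independent foundation models, of which RobotManip and RobotNav ship with public GitHub repositories. Robotics data is fragmented across hardware and tasks, and different robots use differing data formats.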

On-site AI-compiled draft

The Qwen team has released three embodied AI models, grouped as Qwen-Robot-Suite. The three are Qwen-RobotManip, Qwen-RobotWorld, and Qwen-RobotNav. Each is built on a Qwen vision-language backbone and targets a different robotics problem. Qwen-RobotManip is a Vision-Language-Action model for manipulation, built on Qwen3.5-4B. Qwen-RobotWorld is a language-conditioned video world model with a 60-layer MMDiT and a frozen Qwen2.5-VL encoder. Qwen-RobotNav is a navigation model built on Qwen3-VL, available at 2B, 4B, and 8B sizes.

Qwen-Robot-Suite

Qwen-Robot-Suite is not a single model. It is a suite of three independent foundation models. Two of them, RobotManip and RobotNav, ship with public GitHub repositories.

Robotics data is fragmented across hardware and tasks. Different robots use incompatible observation and action formats. A policy trained on one arm rarely transfers to another. The three research reports address this fragmentation in different ways. RobotManip aligns action representations so manipulation data scales. RobotWorld uses language as a unified action interface for video prediction. RobotNav exposes a controllable observation interface for navigation tasks.

Here is the core split between the three releases:

| Model | Problem | Backbone | Output |
| --- | --- | --- | --- |
| Qwen-RobotManip | Robotic manipulation | Qwen3.5-4B (Qwen-VL) | Continuous robot actions |
| Qwen-RobotWorld | Embodied world modeling | Frozen Qwen2.5-VL | Predicted future video |
| Qwen-RobotNav | Mobile navigation | Qwen3-VL (2B/4B/8B) | Waypoint trajectories |

Qwen-RobotManip: Alignment Unlocks Scale for Manipulation

Qwen-RobotManip is a Vision-Language-Action (VLA) foundation model. It is built on Qwen-VL and predicts continuous robot actions. A VLA model takes camera views and a language instruction. It then outputs low-level robot actions. The challenge is that manipulation data is heterogeneous by nature. Different robots record states and actions in incompatible formats.
When demonstrations arrive with mismatched representations, scaling data produces interference. RobotManip solves this with a unified alignment framework.

The Unified Alignment Framework

The framework has three complementary mechanisms.

First is a canonical state-action representation. It is an 80-dimensional vector with per-dimension binary masking. This vector holds two 29-dimensional per-arm blocks plus 22 reserved dimensions. Each block stores joint positions, end-effector pose, gripper state, and dexterous hand joints. Robots populate only the dimensions they have.

Second is a camera-frame delta pose parameterization. End-effector actions are expressed as deltas in the camera frame. This makes visually similar motions numerically proximate across embodiments.

Third is an in-context policy adaptation mechanism. It reads recent execution history as an implicit embodiment identifier. The policy adjusts behavior at deployment time without parameter updates.

A dual-stream co-training strategy runs alongside this. It jointly optimizes manipulation data and a vision-language stream. This prevents the backbone’s perception and reasoning from eroding.

The Data Engine

RobotManip assembles roughly 38,100 hours of manipulation data. It uses only open-source datasets and human videos. No proprietary data collection was used.

A human-to-robot synthesis pipeline produces most of this scale. It converts egocentric hand demonstrations into robot trajectories. The pipeline renders across 15 robot platforms. This synthesis alone yields about 24,808 hours of demonstrations. The egocentric source data is about 1,933 hours. Open-source robot datasets contribute over 11,000 hours.

The pipeline separates action alignment from visual alignment. Action alignment retargets hand keypoints to gripper poses. Visual alignment uses SAM3 masking, ProPainter inpainting, and MuJoCo inverse kinematics.

A five-stage curation pipeline then filters the combined corpus.
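Returning to the canonical state-action representation from the alignment framework, the following is a minimal sketch of how an 80-dimensional masked vector could be packed. The article gives only the sizes (two 29-dimensional per-arm blocks plus 22 reserved dimensions). The block ordering, the helper names `pack_canonical` and `masked_mse`, and the use of the mask in a loss are assumptions made for illustration, not details taken from the research report.

```python
from typing import Dict, List, Tuple

ARM_BLOCK_DIM = 29   # per-arm block size, as stated in the article
NUM_ARMS = 2         # two per-arm blocks, as stated in the article
RESERVED_DIM = 22    # reserved dimensions, as stated in the article
CANONICAL_DIM = NUM_ARMS * ARM_BLOCK_DIM + RESERVED_DIM  # 2 * 29 + 22 = 80


def pack_canonical(
    arm_values: Dict[int, Dict[int, float]]
) -> Tuple[List[float], List[int]]:
    """Pack per-arm readings into an 80-dim vector plus a binary mask.

    arm_values maps an arm index (0 or 1) to {offset inside that arm's
    29-dim block: value}. Dimensions a robot does not provide stay at 0.0
    with mask 0. Placing arm 0 first, then arm 1, then the reserved
    dimensions is an assumption of this sketch.
    """
    vector = [0.0] * CANONICAL_DIM
    mask = [0] * CANONICAL_DIM
    for arm, readings in arm_values.items():
        if not 0 <= arm < NUM_ARMS:
            raise ValueError(f"arm index {arm} out of range")
        for offset, value in readings.items():
            if not 0 <= offset < ARM_BLOCK_DIM:
                raise ValueError(f"offset {offset} outside the per-arm block")
            index = arm * ARM_BLOCK_DIM + offset
            vector[index] = float(value)
            mask[index] = 1
    return vector, mask


def masked_mse(pred: List[float], target: List[float], mask: List[int]) -> float:
    """Mean squared error over populated dimensions only (hypothetical loss)."""
    active = sum(mask)
    if active == 0:
        return 0.0
    total = sum(m * (p - t) ** 2 for p, t, m in zip(pred, target, mask))
    return total / active


# A single-arm robot that fills only three dimensions of arm 0.
vector, mask = pack_canonical({0: {0: 0.1, 1: -0.4, 2: 0.25}})
```

With a mask of this kind, dimensions a given robot never records would contribute nothing to the error term, which is one plausible way heterogeneous robots could share a single output head.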
The curation stage catches sudden changes, temporal misalignment, and extreme values. One check found 81% of episodes in a subset failed state-action alignment.

Benchmark Results

The research report argues standard benchmarks fail to measure generalization. Models without robot pretraining match pretrained ones on in-distribution tests. RobotManip therefore focuses on out-of-distribution (OOD) settings.

| Benchmark (OOD) | Prev. SOTA (π0.5) | Qwen-RobotManip |
| --- | --- | --- |
| LIBERO-Plus | 84.4 | 91.4 |
| RoboTwin-C2R Hard | 47.9 | 69.4 |
| EBench | 27.1 | 45.6 |
| RoboCasa365 | 16.9 | 35.9 |
| RoboTwin-IF | 49.6 | 72.2 |

The largest reported gap is on cross-embodiment transfer. RobotManip reaches 23.9% using camera-frame end-effector (EEF) actions. That is 3.2× the 7.5% achieved by π0.5. The model also ranks 1st on the RoboChallenge Table30-v1 generalist track, with a 20% relative improvement over the prior best. Real-robot validation covers AgileX ALOHA, Franka, UR, and ARX platforms.

Qwen-RobotWorld: Language as a Universal Action Interface

Qwen-RobotWorld is a language-conditioned video world model. It predicts future visual trajectories from a current observation. Natural language serves as the unified action interface.

A world model learns environment dynamics. Given a current state and an action, it predicts the next state. RobotWorld represents states as video frames and actions as text. This is important because language is embodiment-agnostic. One instruction encodes the action sequence, goal, and constraints. It works across a Franka gripper, an Aloha dual-arm system, or a humanoid.

The Double-Stream MMDiT Architecture

The model uses a 60-layer double-stream Multimodal Diffusion Transformer. An understanding stream processes a frozen Qwen2.5-VL encoder’s features.
A generation stream processes video-VAE latents. The two streams interact via joint attention at every layer. Using an MLLM as the action encoder gives two advantages. It parses compositional instructions and constrains physically plausible transitions.

The MMDiT has 20B parameters. The VAE adopts the Wan-VAE architecture. The context length supports up to 48,360 video tokens.

A Scene2Robot mechanism reuses this backbone for cross-embodiment synthesis. It processes scene, robot reference, and generation segments together. This enables human-to-robot video transfer without robot-specific prompting.

The Embodied World Knowledge Dataset

Training uses the Embodied World Knowledge (EWK) dataset. It contains roughly 8.6M video-text pairs. That spans over 200M observation frames. The corpus covers four embodied domains plus general video. Manipulation provides about 5.9M samples across 20+ morphologies. Driving, navigation, and human-to-robot transfer fill out the rest.

An action-language mapping framework standardizes everything. It converts 20+ embodiment types and 500+ action categories into language. A hierarchical five-layer annotation pipeline produces the captions.

Benchmark Results

RobotWorld was evaluated on four established benchmarks. It ranks 1st overall on two of them:

| Benchmark | Result | Ranking |
| --- | --- | --- |
| EWMBench | 4.60 | 1st overall |
| DreamGen Bench | 4.952 | 1st overall |
| WorldModelBench | 8.99 | 1st open-source (3rd overall) |
| PBench | 0.804 | 1st open-source |

On EWMBench it leads motion fidelity with an HSD of 0.566. That is a 33% gain over the runner-up. Scene consistency reaches 0.914. On WorldModelBench it scores 1.00 on four physics-adherence categories. These are Newton’s laws, mass conservation, fluid dynamics, and gravity. Penetration scores 0.94, and instruction following scores 2.33 out of 3.0.
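The joint attention that links the understanding and generation streams in the architecture section can be illustrated with a toy sketch. It shows only the core idea: tokens from both streams are concatenated into one sequence, every token attends over that whole sequence, and the result is split back per stream. The function name `joint_attention` is hypothetical, and the sketch deliberately omits learned query/key/value projections, multiple heads, and everything else a real MMDiT layer contains, so it should not be read as RobotWorld's implementation.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def joint_attention(
    und_tokens: List[Vector], gen_tokens: List[Vector]
) -> Tuple[List[Vector], List[Vector]]:
    """Single-head attention over the concatenation of two token streams.

    Tokens act as their own queries, keys, and values (identity
    projections), which is a simplification made for this sketch.
    """
    tokens = und_tokens + gen_tokens          # one shared sequence
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    mixed: List[Vector] = []
    for query in tokens:
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tokens]
        weights = softmax(scores)
        mixed.append(
            [sum(w * value[d] for w, value in zip(weights, tokens)) for d in range(dim)]
        )
    split = len(und_tokens)
    return mixed[:split], mixed[split:]       # hand each stream its own tokens back


# Two "understanding" tokens (e.g. instruction features) and one
# "generation" token (e.g. a video latent), all toy 2-dim vectors.
und = [[1.0, 0.0], [0.0, 1.0]]
gen = [[1.0, 1.0]]
und_out, gen_out = joint_attention(und, gen)
```

Because the generation token's output is a weighted mix that includes the understanding tokens, changing the instruction features changes the video-side representation, which is the coupling the article describes.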
Qwen-RobotNav: A Con

Original source: MarkTechPost AI
