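NVIDIA Cosmos 3 Arrives: The First Open Omni-model for Physical AI Reasoning and Action
Key Takeaways
NVIDIA Cosmos 3 is now available on Hugging Face, marking a major leap for world foundation models (WFMs) in physical AI. It combines world generation, physical reasoning, and action generation in a single unified model, so there is no need to switch between separate models and inference pipelines. Whether you are building for robotics, autonomous vehicles, or smart spaces, Cosmos 3 provides a foundation for simulating and understanding the physical world.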
Published June 1, 2026
By Asawaree (asawareeb) and Atharva Joshi (atharvajoshi10), NVIDIA

NVIDIA Cosmos 3 is here - and it's available on Hugging Face today. Cosmos 3 represents a major leap forward in world foundation models (WFMs) for physical AI: a single, unified omni-model that combines world generation, physical reasoning, and action generation in one model. No more juggling between different models and inference pipelines - Cosmos 3 does it all. Whether you're building for robotics, autonomous vehicles, or smart spaces, Cosmos 3 gives you the foundation to simulate and understand the physical world.

Here's what's shipping with this release:

- Cosmos 3 Super and Cosmos 3 Nano on Hugging Face with model cards and licensing
- Cosmos 3 Diffusers integration for generation pipelines
- Post-training scripts for training Cosmos 3 on your own data (on GitHub)
- Open synthetic data generation (SDG) datasets for physical AI

TABLE OF CONTENTS

- What's new with Cosmos 3?
- Cosmos 3 Capabilities
- Using Cosmos 3 with Diffusers
- Datasets for physical AI
- Cosmos Framework
- Resources

SECTION 1: What's new with Cosmos 3?

The biggest change in Cosmos 3 compared to previous Cosmos releases is that it's an omni-model, built on a Mixture-of-Transformers (MoT) architecture. Previously, developers had to work with separate models for different capabilities like world generation (Cosmos Predict), controlled generation (Cosmos Transfer), scene understanding (Cosmos Reason) and policy generation (Cosmos Policy). Cosmos 3 enables all of this in a single model that can reason and generate different modalities in one unified forward pass.
This means you can now do all this from one model:

- Generate realistic and physically plausible video worlds from text, images, videos or action inputs
- Reason about physical properties like motion, causality, and spatial relationships
- Predict future video and action sequences based on the current state

Why this matters for physical AI

Cosmos 3 helps build physical AI systems capable of understanding the real world. Not just pixels and tokens, but motion, causality, physics, and action. If you're training a robot to fold laundry, building an autonomous driving simulation, or generating synthetic training data for warehouse safety scenarios, Cosmos 3 is the foundation model designed for exactly these use cases.

[Video: generated by Cosmos 3 for robotics pick and place use cases.]
[Video: generated by Cosmos 3 for long tail driving scenarios.]
[Video: image-to-video generation using Cosmos 3 for warehouse safety data.]
[Video: Cosmos 3 chain-of-thought reasoning in an autonomous driving application.]

Architecture

Cosmos 3 is built on an MoT backbone that processes all modalities - text, image, video, audio, and action - within a single unified architecture. Each modality is first encoded by a dedicated encoder (a ViT for visual understanding, a VAE for visual/audio generation, and domain-aware vectors for actions), then projected into a shared representation space.

The input sequence is split into two subsequences: an autoregressive (AR) subsequence that handles reasoning and understanding via next-token prediction, and a diffusion (DM) subsequence that handles generation via iterative denoising. AR and DM tokens use separate parameter sets within each transformer layer but interact through joint attention - this is what lets a single model seamlessly switch between acting as a VLM, a video generator, a forward/inverse dynamics model, or a robot policy without any architectural changes.
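To make the AR/DM split more concrete, here is a toy, pure-Python sketch of one attention step in which each token is projected with the parameter set of its own subsequence while attention runs jointly over all tokens. This is not NVIDIA's implementation: the function names, the single attention head, the 2-dimensional tokens, and the scaled-identity weights are all illustrative assumptions, and masking, normalization, and feed-forward layers are omitted.

```python
import math


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mot_joint_attention(tokens, kinds, params):
    """One simplified Mixture-of-Transformers attention step (hypothetical).

    tokens: list of d-dimensional vectors
    kinds:  list of "AR" or "DM", one entry per token
    params: {"AR": {"q": W, "k": W, "v": W}, "DM": {"q": W, "k": W, "v": W}}
    """
    d = len(tokens[0])
    # Separate parameter sets: each token uses the weights of its subsequence.
    q = [matvec(params[kind]["q"], x) for x, kind in zip(tokens, kinds)]
    k = [matvec(params[kind]["k"], x) for x, kind in zip(tokens, kinds)]
    v = [matvec(params[kind]["v"], x) for x, kind in zip(tokens, kinds)]

    outputs = []
    for qi in q:
        # Joint attention: scores span AR and DM tokens alike (no masking here).
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        weights = softmax(scores)
        outputs.append(
            [sum(w * vj[c] for w, vj in zip(weights, v)) for c in range(d)]
        )
    return outputs


def scaled_identity(d, s):
    """Return a d x d matrix with s on the diagonal (stand-in for learned weights)."""
    return [[s if i == j else 0.0 for j in range(d)] for i in range(d)]


# Two AR tokens followed by one DM token, all in a shared 2-d space.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
kinds = ["AR", "AR", "DM"]
params = {
    "AR": {name: scaled_identity(2, 1.0) for name in ("q", "k", "v")},
    "DM": {name: scaled_identity(2, 0.5) for name in ("q", "k", "v")},
}
mixed = mot_joint_attention(tokens, kinds, params)
```

Because the attention is joint, changing only the DM weights also changes the outputs for the AR tokens, which is the interaction the architecture description relies on.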
Model Versions

This release of Cosmos 3 includes two model sizes, optimized for different deployment scenarios:

- Cosmos 3 Nano - This is the 8B parameter model (8B reasoner and 8B generator), optimized for efficient inference. Cosmos 3 Nano is designed to run on workstation-grade compute like the RTX PRO 6000 GPU, and is available on Hugging Face at nvidia/Cosmos3-Nano.
- Cosmos 3 Super - This is the 32B parameter model (32B reasoner and 32B generator) designed for large-scale synthetic data generation (SDG) and research, and runs on NVIDIA Hopper and Blackwell GPUs. Cosmos 3 Super is available on Hugging Face at nvidia/Cosmos3-Super.

SECTION 2: Cosmos 3 Capabilities

Cosmos 3 supports multiple input and generation modalities through a single unified model:

Input Modality           Output Modality   Application
Text | Image | Video     Video             Video Model
Text | Video             Text              Vision Language Model (VLM)
Action | Image | Text    Video             Forward Dynamics Model
Text | Video             Action            Inverse Dynamics Model
Image | Text             Video & Action    Policy Model

Prompt Guide

For video generation, we recommend using detailed prompts in the form of narrative paragraphs. For example:

The video begins with a view from inside a vehicle traveling on a multi-lane highway under a clear blue sky. The road is bordered by dense green trees on both sides, creating a tranquil environment. Several vehicles, including a prominent white semi-truck and various cars, are visible ahead, maintaining a steady pace. The highway features multiple lanes separated by concrete barriers, and the scene is bathed in bright sunlight, indicating a clear day. As the video progresses, a large amount of debris suddenly appears on the lane ahead. With little time to avoid it, the ego vehicle has to drive over the debris and continue moving forward. A noticeable jolt occurs as the ego vehicle passes over the scattered objects. A point-of-view shot from inside the vehicle, capturing the road ahead and the surrounding environment.
For action generation, prompts should be concise and provide spatial references. For example:

Put the pot to the left of the purple item. This video is captured from a first-person perspective looking at the scene.

Find the prompt upsampling template and best practices for writing high-quality prompts in the prompting guide on GitHub.

SECTION 3: Using Cosmos 3 with Diffusers

Cosmos 3 is integrated with the Hugging Face Diffusers library, making it easy to use world generation pipelines with just a few lines of code. You can run Cosmos 3 through the familiar DiffusionPipeline via Cosmos3OmniPipeline. The goal is frictionless adoption of Cosmos 3 and easy integration with your existing pipelines.

Let's see a Text-to-Image example for single-frame generation using the Cosmos 3 Nano model:

    import torch
    from diffusers import Cosmos3OmniPipeline

    # Load the Nano checkpoint in bfloat16 and place it on the GPU.
    pipe = Cosmos3OmniPipeline.from_pretrained(
        "nvidia/Cosmos3-Nano", torch_dtype=torch.bfloat16, device_map="cuda"
    )

    prompt = (
        "A medium shot of a modern robotics research laboratory with white walls and a gray floor. "
        "A robotic arm with a metallic finish is mounted on a clean white workbench, its gripper positioned "
        "above a row of small colored objects. A laptop and neatly arranged tools sit beside the robot. "
        "A large monitor on the wall behind displays a software interface. The scene is brightly lit by "
        "overhead fluorescent lights."
    )

    # num_frames=1 requests a single frame, i.e. an image rather than a video.
    result = pipe(prompt=prompt, num_frames=1, height=720, width=1280)
    result.video[0].save("cosmos3_t2i.jpg", format="JPEG", quality=85)

[Image: output generated by the Cosmos 3 Nano model for the prompt above.]

The documentation also has examples for Text-to-Video, Image-to-Video and more. Find more information and API usage in the Cosmos 3 Diffusers documentation.

SECTION 4: Datasets for physical AI

As part of the Cosmos 3 launch, NVIDIA is releasing a set of Synthetic Data Generation (SDG) datasets to help the physical AI community train and evaluate world foundation models.
These datasets were generated by various NVIDIA teams and are available on Hugging Face.

Dataset                       Domain         Description
Embodied-Robot-Scenes         Robotics       Synthetic robot simulation data
Physical-Interaction-Scenes   Physics        Isaac Sim physics simulation data
Spatial-Reasoning             Reasoning      Embodied spatial reasoning data
Digital-Human-Scenes          Human motion   Synthetic human motion da