MarkTechPost AI Model Updates

NVIDIA Releases Nemotron 3 Ultra: A 550B-Parameter Hybrid Mamba-Transformer Mixture-of-Experts Model Built for Long-Running Agents

June 4, 2026, 21:42

Key Summary

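NVIDIA has released Nemotron 3 Ultra, the largest model in its Nemotron 3 family, aimed at the specific problems long-running agents face in planning, tool calling, and multi-turn reasoning. As agents run longer, token counts and inference cost rise with them. Nemotron 3 Ultra is designed to keep accuracy high while making inference faster and cheaper. The model has 550 billion total parameters and uses a Mixture-of-Experts (MoE) architecture that activates only 55 billion parameters per token. Its hybrid Mamba-Attention architecture combines Mamba layers, which scale sub-quadratically on long sequences, with a small number of attention layers kept for precise recall over large contexts.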

Site AI-Compiled Article

NVIDIA has released Nemotron 3 Ultra, the largest model in its Nemotron 3 family. It targets a specific problem: long-running agents that plan, call tools, and reason across many turns. As agents run longer, token counts grow and inference cost climbs. Nemotron 3 Ultra is designed to keep accuracy high while making that inference faster and cheaper.

What is Nemotron 3 Ultra

Nemotron 3 Ultra is a Mixture-of-Experts (MoE) model with 550 billion total parameters. Only 55 billion parameters are active per token. The MoE design improves accuracy per active parameter. It uses a hybrid Mamba-Attention architecture instead of a pure Transformer. Mamba layers handle long sequences with sub-quadratic scaling. A few Attention layers are kept for precise recall over large contexts.

The model was pre-trained on 20 trillion text tokens. Context was then extended to 1 million tokens. It was post-trained using Supervised Fine-Tuning (SFT), Reinforcement Learning (RL), and Multi-teacher On-Policy Distillation (MOPD). The NVIDIA team reports up to roughly 6x higher inference throughput than comparable open LLMs, at on-par accuracy.

https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Ultra-Technical-Report.pdf

The Architecture

The model has 108 layers and a model dimension of 8,192. It uses 64 query heads and only 2 key-value heads, which keeps the KV cache small. Each MoE layer holds 512 experts, with the top 22 activated per token. Three design choices stand out:

LatentMoE routes experts more efficiently. It buys more routed experts at fixed inference cost by trading away hidden-dimension width. The NVIDIA team reports better accuracy per parameter than standard granular MoEs.

Multi-Token Prediction (MTP) predicts several future tokens in one forward pass. It enables native speculative decoding for faster generation. Two MTP heads share parameters during training.

NVFP4 pre-training uses the E2M1 4-bit datatype with two-dimensional block quantization on weights.
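To give a feel for what a 4-bit E2M1 value can represent, here is a minimal sketch of block-scaled quantization. E2M1 (1 sign bit, 2 exponent bits, 1 mantissa bit) can encode only the magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, and 6, so each block of weights is scaled to fit that range before rounding. The function names and the tile size are assumptions made for illustration, and the scale is kept in full precision as a simplification; this is not NVIDIA's training recipe.

```python
# Magnitudes representable by E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
E2M1_MAX = 6.0


def quantize_block(block):
    """Scale a block so its largest |x| maps to 6, snap to E2M1, then rescale.

    Returns the dequantized values and the scale that was used.
    """
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0] * len(block), 1.0
    scale = amax / E2M1_MAX  # simplification: scale kept in full precision
    out = []
    for x in block:
        # Nearest representable magnitude; ties go to the smaller magnitude.
        mag = min(E2M1_MAGNITUDES, key=lambda m: abs(abs(x) / scale - m))
        out.append(mag * scale if x >= 0 else -mag * scale)
    return out, scale


def quantize_2d(weights, tile=2):
    """Quantize a 2D list of weights, one shared scale per tile x tile block.

    The tile size is a hypothetical parameter; the article does not state it.
    """
    rows, cols = len(weights), len(weights[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r0 in range(0, rows, tile):
        for c0 in range(0, cols, tile):
            coords = [
                (r, c)
                for r in range(r0, min(r0 + tile, rows))
                for c in range(c0, min(c0 + tile, cols))
            ]
            values, _ = quantize_block([weights[r][c] for r, c in coords])
            for (r, c), v in zip(coords, values):
                out[r][c] = v
    return out


if __name__ == "__main__":
    matrix = [[6.0, 3.1, 12.0, 1.0], [-0.4, 0.0, 2.0, 5.5]]
    print(quantize_2d(matrix, tile=2))
```

Because every value in a tile shares one scale, a single outlier stretches the grid for its neighbors, which is why block size matters for accuracy.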
NVIDIA calls this pre-training run the largest-scale demonstration of stable, accurate NVFP4 training to date.

The hybrid Mamba-Attention stack matters for agents. Mamba’s per-step decode cost stays constant as sequence length grows. That is why throughput gains widen on long, decode-heavy workloads.

Pretraining and the Data Release

Pretraining used a Warmup-Stable-Decay learning rate schedule over 20 trillion tokens. It was split into two phases. The first 15 trillion tokens were biased toward diversity. The final 5 trillion were biased toward high-quality data.

The NVIDIA team also released new domain-specific pretraining datasets. These include 173 billion refreshed GitHub code tokens. In a Nemotron 3 Nano ablation, a synthetic legal set raised a proxy LegalBench average from 64.6 to 74.7. In a similar ablation, a Wiki-based fact-seeking set raised proxy SimpleQA from 40.2 to 50.2.

The post-training release is also large. NVIDIA adds 10 million new SFT samples and 1 million new RL tasks. It adds 15 new RL environments. Cumulative Nemotron open totals reach 50M SFT samples, 2M RL tasks, and 55 RL environments.

Training was not entirely smooth. NVIDIA documents two loss divergences and treats them as a useful engineering record. The first, near 8 trillion tokens, was traced to moving output-layer gradient reduction from FP32 to BF16. The MTP gradient contribution was effectively lost in BF16’s 7 mantissa bits. Reverting to FP32 gradient reduction re-stabilized training. The second divergence, near 16 trillion tokens, had no confirmed root cause. NVIDIA mitigated it by annealing the learning rate early. It then cut the total token horizon to 20 trillion tokens.

Post-Training: SFT, RLVR, and MOPD

The post-training pipeline runs SFT, then unified RLVR, then MOPD warmup, MOPD, and MTP Boosting. The whole loop can repeat for several cycles. RLVR stands for Reinforcement Learning with Verifiable Reward.
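To make the term concrete: a verifiable reward is computed by a program that checks the outcome, rather than by a learned preference model. The sketch below is a hypothetical illustration only. The function names, the `Answer:` convention, and the test-case format are assumptions, and none of it is taken from NVIDIA's environments.

```python
def math_reward(response: str, expected: str) -> float:
    """Return 1.0 if the last 'Answer:' line matches the reference, else 0.0.

    The 'Answer:' output convention is an assumption for this sketch.
    """
    answer = None
    for line in response.splitlines():
        if line.strip().lower().startswith("answer:"):
            answer = line.split(":", 1)[1].strip()
    return 1.0 if answer is not None and answer == expected.strip() else 0.0


def code_reward(func, test_cases) -> float:
    """Return 1.0 only if func passes every (args, expected) pair.

    Any wrong result or raised exception yields 0.0.
    """
    for args, expected in test_cases:
        try:
            if func(*args) != expected:
                return 0.0
        except Exception:
            return 0.0
    return 1.0


if __name__ == "__main__":
    print(math_reward("Let x = 2, so x * 2 = 4.\nAnswer: 4", "4"))
    print(code_reward(lambda a, b: a + b, [((1, 2), 3), ((0, 0), 0)]))
```

Each rollout earns a single pass-or-fail number here, however many tokens it contains.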
RLVR trains across many environments at once: terminal use, software engineering, search, math, code, safety, and more. The reward in these settings is often sparse and environment-dependent.

MOPD is the main new post-training method. Mixed-environment RLVR dilutes the learning signal as the number of environments grows. To address this, the NVIDIA team trained more than ten domain-specialized teacher models. Each teacher has its own training pipeline. During MOPD, the student model generates its own rollouts across domains. Each rollout is scored by the matching teacher with dense, token-level guidance. This is a denser signal than RLVR’s sparse rewards. The process runs asynchronously, with rollout generation, teacher scoring, and student updates pipelined.

MOPD is also iterative. After one MOPD checkpoint, new teachers are initialized from the improved student. Their gains merge back into the next round. The NVIDIA team ran two MOPD iterations for Nemotron 3 Ultra.

One practical caveat is worth noting. MOPD works best when student rollouts stay within the teacher’s support. A brief SFT warmup aligns the two distributions first. The NVIDIA team found gains are smaller on self-contained reasoning tasks the student rarely samples.

Reasoning Effort Control

Nemotron 3 Ultra supports three reasoning modes: reasoning-off, regular, and medium-effort. The regular and medium modes also accept an inference-time budget control. Medium-effort is the efficiency lever. The NVIDIA team reports it uses about 2.5x fewer tokens than regular mode. The cost is roughly a 7% drop in accuracy. For high-volume agent steps, that trade can lower spend meaningfully.

The Benchmark Case

The comparisons in NVIDIA’s research report use GLM-5.1 (754B), Kimi-K2.6 (1T), and Qwen-3.5 (397B), among others. The picture is competitive rather than dominant.

On agentic tasks, Nemotron 3 Ultra posts 90.0 on PinchBench and 56.0 on ProfBench (Search).
The NVIDIA team reserved both benchmarks as held-out generalization gates, scored only once on the final model. The model scores 71.9 on SWE-Bench Verified and 56.4 on Terminal Bench 2.1. On Terminal Bench, Kimi-K2.6 leads at 67.2.

On reasoning, it scores 570.0 on IOI 2025. The NVIDIA team frames this as top-3-human-level competitive programming. On AA-Omniscience, it records the highest non-hallucination score in the set at 78.7. That suggests a lower tendency to answer when uncertain.

Long context holds up at scale. The model scores 94.7 on RULER at 1 million tokens. Several larger comparison models top out at 256K context.

On an 8K input / 64K output setting at NVFP4 on GB200, Nemotron 3 Ultra reaches 5.9x the throughput of GLM-5.1. It is 4.8x faster than Kimi-K2.6 and 1.6x faster than Qwen-3.5. Note: Nemotron’s numbers use TRT-LLM, while the others use vLLM.

The trade-off is visible on prefill-heavy work. On a 50K input / 2K output setting, it trails Qwen-3.5, because prefill cost tracks active parameters. The NVIDIA team also reports up to 30% lower cost to task completion, driven by fewer tokens per turn on SWE-Bench and Terminal Bench.

The NVIDIA team also stresses harness robustness. The model is trained under multiple agent harnesses per task type, not one. SWE-Bench Verified scores stay between 65% and 70.4% across Pi, OpenHands, Hermes, OpenCode, and Mini SWE Agent. The goal is consistent behavior regardless of deployment framework.

Quantization and Deployment

The NVIDIA team ships a single NVFP4 checkpoint. On Blackwell it runs with native FP4 math. On Hopper it runs as W4A16, since Hopper lacks native FP4 tensor cores.

The final configuration operates at 5.03 bits per element. It mixes NVFP4 routed experts with FP8 layers for shared experts and Mamba linears. Attention layers stay in BF16. The NVIDIA team found accuracy saturated below this budget, so higher precision added no measurable gain.

The reduced weight footprint has a deployment benefit.
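A rough back-of-the-envelope check of that footprint, using only figures from the article (550 billion total parameters, 5.03 bits per element). It assumes the 5.03-bit average applies across all 550 billion parameters and compares against a hypothetical 8-GPU node with 80 GB per GPU. Both are assumptions, and the estimate ignores KV cache, activations, MTP weights, and runtime overhead.

```python
TOTAL_PARAMS = 550e9   # total parameters, from the article
MIXED_BITS = 5.03      # reported average bits per element
GPU_MEMORY_GB = 80     # assumption: 80 GB of memory per GPU
GPUS_PER_NODE = 8      # assumption: one 8-GPU node


def weight_gb(params: float, bits_per_element: float) -> float:
    """Weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_element / 8 / 1e9


mixed = weight_gb(TOTAL_PARAMS, MIXED_BITS)
fp8 = weight_gb(TOTAL_PARAMS, 8)
node_gb = GPU_MEMORY_GB * GPUS_PER_NODE

for name, size in (("5.03-bit mix", mixed), ("FP8", fp8)):
    print(f"{name}: {size:.1f} GB weights, {node_gb - size:.1f} GB left on the node")
# 5.03-bit mix: 345.8 GB weights, 294.2 GB left on the node
# FP8: 550.0 GB weights, 90.0 GB left on the node
```

Under these assumptions, FP8 weights alone would occupy most of the node, while the mixed-precision checkpoint leaves roughly three times as much headroom.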
The W4A16 path leaves room to fit MTP weights on a single 8-GPU H100 node. An FP8 checkpoint could not fit without spanning two nodes.

Key Takeaways

Nemotron 3 Ultra is a 550B open MoE (55B active) using a hybrid Mamba-Attention design for l
