NVIDIA Releases Nemotron-Labs-TwoTower: an Open-Weight Diffusion Language Model Built on a Frozen Autoregressive Nemotron-3-Nano-30B-A3B Backbone
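Key Takeaways
NVIDIA has released Nemotron-Labs-TwoTower, a diffusion language model built on a pretrained autoregressive backbone. It ships as open weights under the NVIDIA Nemotron Open Model License. The release targets a throughput bottleneck in text generation: autoregressive (AR) models decode one token at a time, and that serial process caps generation throughput. Discrete diffusion language models instead generate tokens in parallel and refine them iteratively, but most use a single network both to represent clean tokens and to denoise corrupted ones at every step. TwoTower splits these jobs across two towers, retaining 98.7% of the AR baseline's aggregate benchmark quality while delivering 2.42× higher wall-clock generation throughput. In short, TwoTower separates context representation from denoising, which speeds up inference substantially.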
NVIDIA has released Nemotron-Labs-TwoTower, a diffusion language model built on a pretrained autoregressive backbone. It ships as open weights under the NVIDIA Nemotron Open Model License. The release targets a throughput bottleneck in text generation. Autoregressive (AR) models decode one token at a time. That serial process caps generation throughput. Discrete diffusion language models take another route. They generate tokens in parallel and refine them iteratively. Most diffusion language models use one network for two jobs. It represents clean tokens and denoises corrupted ones at every step. TwoTower separates these jobs into two towers. It keeps 98.7% of the AR baseline’s aggregate benchmark quality. It also reports 2.42× higher wall-clock generation throughput.

TL;DR
- TwoTower splits diffusion into a frozen AR context tower and a trained denoiser tower.
- It retains 98.7% of AR quality at 2.42× throughput (γ=0.8, S=16, 2×H100).
- The denoiser trained on ~2.1T tokens; the backbone used 25T.
- One checkpoint runs diffusion, mock-AR, and AR decoding modes.

Nemotron-Labs-TwoTower
TwoTower is a block-wise autoregressive diffusion model. It is instantiated on Nemotron-3-Nano-30B-A3B, an open-weight hybrid backbone. That backbone interleaves Mamba-2, self-attention, and mixture-of-experts (MoE) layers. Each tower has 52 layers: 23 Mamba-2, 6 self-attention, and 23 MoE. The released checkpoint ships both towers, roughly 60B total parameters. Active parameters per token are about 3B per tower. The MoE uses 128 routable experts, of which 6 activate, plus 2 shared experts.

Both towers start as copies of the same backbone checkpoint. Only the denoiser tower is trained. The AR context tower stays frozen. The denoiser was trained on ~2.1T tokens, a fraction of the backbone’s 25T-token pretraining.

How the Two Towers Work
The AR context tower runs causally over the prompt and committed tokens. It produces per-layer KV cache and final Mamba-2 states.
It preserves the backbone’s autoregressive capability.

The diffusion denoiser tower refines noisy blocks. Within a block, it uses bidirectional in-block attention. It stays causal with respect to past clean blocks.

The towers connect layer-by-layer. Denoiser layer i cross-attends to context tower layer i. This layer-aligned cross-attention gives multi-scale access to the backbone’s representations, whereas prior approaches broadcast only the last hidden state.

Two more denoiser modifications matter. Mamba-2 layers seed their initial state from the context tower’s Mamba state. The diffusion timestep modulates each layer through adaLN-single time conditioning. That adaLN module adds only ~1.5M parameters.

Generation runs block by block. Each block starts as S [MASK] tokens. The denoiser refines it over T steps, then commits it. The context tower then processes committed tokens to update its caches. Multiple denoising steps can still beat one-token decoding because autoregressive decoding commits exactly one token per step, while TwoTower commits multiple tokens per step early in refinement.

Benchmarks
Evaluations use BF16 on 2×H100 GPUs. The default operating point is confidence unmasking, threshold γ=0.8, block size S=16. The table compares the AR baseline against TwoTower diffusion decoding.

| Task | Nemotron-3-Nano-30B-A3B (AR) | Nemotron-Labs-TwoTower (diffusion) |
|---|---|---|
| MMLU (5-shot, acc) | 78.56 | 78.24 |
| MMLU-Pro (5-shot, CoT EM) | 62.59 | 60.93 |
| ARC-Challenge (25-shot, acc_norm) | 91.72 | 92.66 |
| WinoGrande (5-shot, acc) | 76.09 | 76.09 |
| RACE (0-shot, acc) | 88.90 | 88.90 |
| HumanEval (0-shot) | 79.27 | 75.58 |
| MBPP-Sanitized (3-shot) | 74.71 | 74.28 |
| GSM8K (8-shot, acc) | 92.49 | 90.14 |
| MATH-500 (4-shot) | 84.40 | 80.60 |
| MMLU Global Lite (5-shot) | 73.97 | 73.94 |
| MGSM (8-shot, avg acc) | 80.80 | 80.40 |
| Quality retained | 100% | 98.7% |
| Generation throughput (× AR) | 1.0× | 2.42× |

General knowledge stays within about two points of the AR baseline (MMLU drops 0.32, MMLU-Pro 1.66). Code and math show modest degradation, up to roughly four points on HumanEval and MATH-500. Commonsense scores match or slightly exceed the baseline, and multilingual scores stay within half a point of it.
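The "quality retained" row can be sanity-checked from the per-task scores in the table above. The article does not say how the 98.7% figure is aggregated, so the sketch below assumes two plausible definitions: an unweighted mean of per-task ratios, and a ratio of summed scores. The function names are illustrative and not taken from NVIDIA's code.

```python
# (task, AR baseline score, TwoTower diffusion score), copied from the table.
SCORES = [
    ("MMLU", 78.56, 78.24),
    ("MMLU-Pro", 62.59, 60.93),
    ("ARC-Challenge", 91.72, 92.66),
    ("WinoGrande", 76.09, 76.09),
    ("RACE", 88.90, 88.90),
    ("HumanEval", 79.27, 75.58),
    ("MBPP-Sanitized", 74.71, 74.28),
    ("GSM8K", 92.49, 90.14),
    ("MATH-500", 84.40, 80.60),
    ("MMLU Global Lite", 73.97, 73.94),
    ("MGSM", 80.80, 80.40),
]


def mean_ratio(scores):
    """Unweighted average of diffusion/AR ratios (assumed aggregation)."""
    return sum(diff / ar for _, ar, diff in scores) / len(scores)


def ratio_of_sums(scores):
    """Total diffusion score over total AR score (alternative aggregation)."""
    return sum(diff for _, _, diff in scores) / sum(ar for _, ar, _ in scores)


def largest_drops(scores, n=3):
    """Names of the n tasks with the biggest absolute score loss."""
    ranked = sorted(scores, key=lambda row: row[2] - row[1])
    return [name for name, _, _ in ranked[:n]]


if __name__ == "__main__":
    print(f"mean of ratios: {mean_ratio(SCORES):.2%}")
    print(f"ratio of sums:  {ratio_of_sums(SCORES):.2%}")
    print("largest drops: ", largest_drops(SCORES))
```

Under either assumed definition the result lands within 0.1 percentage points of the reported 98.7%, and the largest drops fall on the math and code tasks.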
Lowering γ commits more tokens per step and raises throughput, at the cost of reduced quality.

Running It: Three Generation Modes
The checkpoint exposes three inference paths. Full two-tower diffusion uses 2 GPUs, about 59GB per GPU in BF16. AR-only mode runs on a single 80GB GPU.

    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM

    model_name = "nvidia/Nemotron-Labs-TwoTower-30B-A3B-Base-BF16"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
    )

    # context tower -> GPU 0, denoiser tower -> GPU 1
    model.place_towers_on_devices("cuda:0", "cuda:1")
    model.eval()

    prompt = "France is a country "
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")

    # Default operating point from the benchmarks: S=16, γ=0.8
    outputs = model.generate_mask_diffusion(
        inputs["input_ids"],
        max_new_tokens=128,
        block_size=16,
        steps_per_block=16,
        mask_token_id=3,
        temperature=0.1,
        confidence_threshold=0.8,
        eos_token_id=tokenizer.eos_token_id,
    )

    # Decode only the newly generated tokens, skipping the prompt
    print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

The three modes are generate_mask_diffusion(), generate_mock_ar(), and generate_ar(). Mask diffusion commits up to block_size tokens per step. Mock-AR and AR commit one token per step.

Where It Fits: Use Cases
The most direct use case is faster batch generation. A data team producing synthetic text can trade a small quality drop for throughput. At γ=0.8, that trade is 1.3% quality for 2.42× speed.

A second use case is tuning the quality–throughput trade-off. Raising γ preserves more quality, according to NVIDIA’s paper. Lowering γ commits more tokens per step for speed.

A third use case is drop-in adaptation. The context tower keeps its LM head for speculative decoding, verification, or AR scoring. Teams can run AR and diffusion from one checkpoint.
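The γ trade-off can be illustrated with a toy simulation of confidence-threshold unmasking on a single block. This is a minimal sketch, not NVIDIA's implementation: `toy_denoiser` is a hypothetical stand-in that returns random tokens and random confidences, and the fallback of committing the single most confident position when nothing clears the threshold is an assumption made here so the loop always makes progress.

```python
import random

MASK = None  # stand-in for the [MASK] token


def toy_denoiser(block, rng):
    """Hypothetical stand-in for the denoiser tower.

    Returns a (token, confidence) proposal for every still-masked position.
    A real model would derive these from logits; here they are random.
    """
    return {
        i: (rng.randrange(1000), rng.random())
        for i, tok in enumerate(block)
        if tok is MASK
    }


def denoise_block(block_size, gamma, max_steps, rng):
    """Refine one block with confidence-threshold unmasking.

    Each step commits every masked position whose confidence is >= gamma.
    If none qualifies, the most confident position is committed instead
    (an assumption of this sketch, not a documented rule).
    Returns the finished block and the tokens committed at each step.
    """
    block = [MASK] * block_size
    committed_per_step = []
    for _ in range(max_steps):
        proposals = toy_denoiser(block, rng)
        if not proposals:  # every position is already committed
            break
        chosen = [i for i, (_, conf) in proposals.items() if conf >= gamma]
        if not chosen:
            chosen = [max(proposals, key=lambda i: proposals[i][1])]
        for i in chosen:
            block[i] = proposals[i][0]
        committed_per_step.append(len(chosen))
    return block, committed_per_step


if __name__ == "__main__":
    for gamma in (0.5, 0.8, 0.95):
        rng = random.Random(0)
        _, steps = denoise_block(16, gamma, 16, rng)
        print(f"gamma={gamma}: {len(steps)} steps, commits per step = {steps}")
```

With uniformly random confidences, a lower threshold trivially finishes the block in fewer steps. In the real model the confidences come from the denoiser, so the actual steps-per-block and the quality cost depend on how well calibrated those confidences are.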
Strengths and Weaknesses
Strengths:
- Open weights under the NVIDIA Nemotron Open Model License; ready for commercial use
- 98.7% of AR quality retained at 2.42× throughput at the default operating point
- One checkpoint supports diffusion, mock-AR, and AR decoding
- Denoiser trained on ~2.1T tokens, not a full re-pretrain
- Sequence-length cache memory scales like the AR baseline

Weaknesses:
- Full two-tower diffusion needs 2 GPUs and ~59GB per GPU in BF16
- Code and math degrade more than general knowledge (HumanEval 79.27 → 75.58)
- Keeping both towers resident raises the fixed model-weight memory footprint
- Released checkpoint is a base model, before instruction tuning or alignment
- Throughput past 3× comes with larger quality loss

This article first appeared on MarkTechPost.