MarkTechPost AI · AI Hardware

DFlash Speculative Decoding Drafts Whole Token Blocks in Parallel for Up to 15x Higher Throughput on NVIDIA Blackwell

June 24, 2026, 07:21

Key Summary

AI-compiled summary (site-generated)

Autoregressive large language models generate text one token at a time, and each token waits for the one before it. This serial loop leaves modern GPUs underused and keeps inference slow. The cost is worse for long Chain-of-Thought reasoning models, whose lengthy outputs make latency the dominant part of generation.

Speculative decoding is the standard fix. A small draft model proposes future tokens, and the large target model verifies them in parallel. Only tokens the target accepts are kept, so the output stays lossless. But most methods, including the state-of-the-art EAGLE-3, still draft autoregressively, and that serial drafting caps real-world speedups near 2–3×.

DFlash, introduced by a research team at UC San Diego (z-lab), takes a different route. It is a lightweight block diffusion model built for drafting. Instead of drafting tokens one at a time, it proposes a whole block in a single forward pass, and the target model then verifies that block in parallel. The research team reports over 6× lossless acceleration across a range of models and tasks, and up to 2.5× higher speedup than EAGLE-3. On NVIDIA Blackwell, NVIDIA's engineering team reports up to 15× higher throughput for gpt-oss-120b at the same user interactivity target.

https://developer.nvidia.com/blog/boost-inference-performance-up-to-15x-on-nvidia-blackwell-using-dflash-speculative-decoding/

What block diffusion drafting changes

Block diffusion models denoise a block of masked tokens at once, blending parallel generation with an autoregressive block structure. DFlash applies this idea only to the drafting stage. Verification stays with the trusted autoregressive target model.

This split matters for quality. Standalone diffusion LLMs often trail autoregressive models on accuracy. They also need many denoising steps, which slows their raw inference speed. DFlash sidesteps both problems. The draft only needs to be good enough to be accepted.
The target's parallel verification guarantees the final output distribution.

A second benefit is drafting cost. An autoregressive drafter's cost grows linearly with the number of speculative tokens. A diffusion drafter generates all tokens in one parallel pass, so drafting latency stays largely flat as the block grows. This frees DFlash to use deeper, more expressive draft models without adding latency.

It also separates DFlash from earlier diffusion-drafter work. Methods like DiffuSpec and SpecDiff-2 used massive 7B drafters, capping speedups near 3–4×. DFlash instead uses a small five-layer drafter (eight layers for Qwen3-Coder).

The "target knows best" insight

DFlash's core idea is simple: the target knows best. The hidden features of large autoregressive models encode information about multiple future tokens. DFlash extracts hidden states from several target layers and fuses them into one compact target context feature. This feature then conditions the draft model.

DFlash injects this feature differently than EAGLE-3. EAGLE-3 fuses target features into the draft's input embeddings only, so the signal gets diluted as draft depth grows. DFlash instead injects the feature into the Key and Value projections of every draft layer. The projected features sit in the draft's KV cache and persist across drafting iterations.

This KV injection lets acceptance length scale with draft depth. A five-layer DFlash drafter generating 16 tokens beats EAGLE-3 generating 8 tokens: in the paper's tests it is both lower-latency and higher-acceptance. The draft model effectively becomes a diffusion adapter on top of the target.

Two speedup numbers, measured differently

The DFlash paper's 6× is single-stream lossless acceleration. On Qwen3-8B with greedy decoding (Transformers backend), DFlash averages a 4.86× speedup, while EAGLE-3 averages 1.76× at tree size 16 and 2.02× at tree size 60. DFlash peaks at 6.08× on MATH-500 (average acceptance length τ = 7.87) and averages τ = 6.49 across tasks.
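To make the relationship between speedup and τ concrete, here is a minimal arithmetic sketch. It assumes a simplified cost model that is not taken from the paper: the baseline emits one token per decode step, each draft-and-verify cycle emits τ tokens on average, and therefore speedup = τ / cycle cost. The function name `implied_cycle_cost` is illustrative only.

```python
def implied_cycle_cost(speedup: float, tau: float) -> float:
    """Cost of one draft-and-verify cycle, measured in baseline decode steps.

    Simplified model (an assumption, not the paper's formula):
    speedup = tau / cycle_cost, so cycle_cost = tau / speedup.
    """
    return tau / speedup


# (speedup, tau) pairs quoted above for Qwen3-8B with greedy decoding.
reported = {
    "MATH-500": (6.08, 7.87),
    "Average": (4.86, 6.49),
}

for name, (speedup, tau) in reported.items():
    cost = implied_cycle_cost(speedup, tau)
    print(f"{name}: one cycle costs about {cost:.2f} baseline decode steps")
```

Under this reading, each cycle costs roughly 1.3 baseline steps in both cases, and τ acts as a ceiling on speedup that would only be reached if drafting and verification added no overhead beyond a single target step.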
NVIDIA's 15× is throughput at a fixed interactivity target. It applies to gpt-oss-120b on eight NVIDIA Blackwell GPUs in a DGX B300 system, using TensorRT-LLM. In the range of 500–600 tokens/sec per user, DFlash serves more than 15× the throughput of autoregressive decoding. That is about 1.5× more than EAGLE-3 at the same point.

The table below shows the paper's per-task speedups on Qwen3-8B at temperature 0 (Transformers backend).

| Task (Qwen3-8B, temp=0) | Baseline | EAGLE-3 (16) | DFlash (16) | DFlash τ |
|---|---|---|---|---|
| GSM8K | 1.00× | 1.94× | 5.15× | 6.54 |
| MATH-500 | 1.00× | 1.81× | 6.08× | 7.87 |
| AIME25 | 1.00× | 1.79× | 5.62× | 7.08 |
| HumanEval | 1.00× | 1.89× | 5.14× | 6.50 |
| MBPP | 1.00× | 1.69× | 4.65× | 5.95 |
| LiveCodeBench | 1.00× | 1.57× | 5.51× | 7.27 |
| MT-Bench | 1.00× | 1.63× | 2.75× | 4.24 |
| Average | 1.00× | 1.76× | 4.86× | 6.49 |

A separate NVIDIA Speed-Bench comparison measures interactivity speedups at matched concurrency. On gpt-oss-120b, DFlash averages 2.3× versus EAGLE-3's 1.7×. On Llama 3.1 8B Instruct, DFlash averages 2.8× versus EAGLE-3's 2.2×.

Use cases with examples

DFlash targets latency-sensitive serving where token-by-token generation hurts. Three patterns fit well:

- Coding agents: Code generation needs fast, interactive responses. On Gemma 4 31B with vLLM, NVIDIA reports up to 5.8× on MATH-500 at concurrency 1, and HumanEval reaches 5.6×. Faster drafts mean shorter wait times inside agent loops.
- Reasoning models: Long Chain-of-Thought traces dominate generation time. With thinking mode enabled, DFlash holds roughly 4.5× under greedy decoding on Qwen3-4B and Qwen3-8B, and about 3.9× under sampling. This cuts the cost of long reasoning outputs.
- Serving and throughput: DFlash also raises serving throughput. On SGLang with a B200 GPU, it reaches up to 5.1× on Qwen3-8B (MATH-500, concurrency 1). Gains taper as concurrency rises but stay positive, so serving cost still drops.

Running DFlash

DFlash ships with checkpoints and framework support, so adoption needs little code. On vLLM, you swap an EAGLE-3 config for a DFlash one. No application refactoring is required.
    vllm serve Qwen/Qwen3.5-27B \
      --speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.5-27B-DFlash", "num_speculative_tokens": 15}' \
      --attention-backend flash_attn \
      --max-num-batched-tokens 32768

The Transformers backend supports Qwen3 and LLaMA-3.1 models. It exposes a spec_generate call that pairs a draft model with a target model.

    from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

    # The DFlash drafter loads custom modeling code, hence trust_remote_code=True.
    draft = AutoModel.from_pretrained(
        "z-lab/Qwen3-8B-DFlash-b16", trust_remote_code=True,
        dtype="auto", device_map="cuda:0").eval()

    # The target is the unmodified autoregressive model that verifies each block.
    target = AutoModelForCausalLM.from_pretrained(
        "Qwen/Qwen3-8B", dtype="auto", device_map="cuda:0").eval()

    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
    messages = [{"role": "user",
                 "content": "How many positive whole-number divisors does 196 have?"}]
    input_ids = tokenizer.apply_chat_template(
        messages, return_tensors="pt", add_generation_prompt=True,
        enable_thinking=False).to(draft.device)

    # temperature=0.0 selects greedy decoding, matching the paper's main table.
    output = draft.spec_generate(
        input_ids=input_ids, max_new_tokens=2048, temperature=0.0,
        target=target, stop_token_ids=[tokenizer.eos_token_id])
    print(tokenizer.decode(output[0], skip_special_tokens=False))

Key Takeaways

- DFlash drafts an entire token block in one forward pass, not one token at a time.
- It injects target hidden features into every draft layer's KV cache, scaling acceptance length with depth.
- Paper: up to 6.08× lossless speedup on Qwen3-8B. NVIDIA test: up to 15× throughput on Blackwell at fixed interactivity.
- A lightweight five-layer drafter replaces the 7B drafters that capped earlier diffusion methods near 3–4×.
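The draft-then-verify cycle behind the first takeaway can be sketched with toy stand-ins. This is a minimal illustration of greedy speculative verification in general, not the DFlash, vLLM, or Transformers implementation. The callables `toy_target` and `toy_draft` are hypothetical, and the target is queried in a loop for clarity, whereas a real system scores every drafted position in one parallel forward pass.

```python
from typing import Callable, List, Tuple

TargetFn = Callable[[List[int]], int]            # greedy next token after a sequence
DraftFn = Callable[[List[int], int], List[int]]  # k proposed tokens from one call


def verify_block(prefix: List[int], block: List[int], target_next: TargetFn) -> List[int]:
    """Keep the longest drafted prefix the target agrees with, plus one target token."""
    emitted: List[int] = []
    for tok in block:
        expected = target_next(prefix + emitted)
        if tok != expected:
            emitted.append(expected)  # the target's token replaces the first mismatch
            return emitted
        emitted.append(tok)
    emitted.append(target_next(prefix + emitted))  # bonus token when all are accepted
    return emitted


def speculative_generate(prompt: List[int], max_new_tokens: int, block_size: int,
                         draft_block: DraftFn, target_next: TargetFn) -> Tuple[List[int], int]:
    """Run draft-and-verify cycles; return the sequence and the number of cycles."""
    seq, cycles = list(prompt), 0
    while len(seq) - len(prompt) < max_new_tokens:
        seq.extend(verify_block(seq, draft_block(seq, block_size), target_next))
        cycles += 1
    return seq[: len(prompt) + max_new_tokens], cycles


def greedy_generate(prompt: List[int], max_new_tokens: int, target_next: TargetFn) -> List[int]:
    """Plain one-token-at-a-time decoding, used as the reference output."""
    seq = list(prompt)
    for _ in range(max_new_tokens):
        seq.append(target_next(seq))
    return seq


def toy_target(seq: List[int]) -> int:
    return (seq[-1] + 1) % 10  # toy "model": count upward modulo 10


def toy_draft(seq: List[int], k: int) -> List[int]:
    out, last = [], seq[-1]
    for _ in range(k):
        guess = (last + 1) % 10
        last = 0 if guess == 7 else guess  # deliberately wrong whenever 7 is due
        out.append(last)
    return out


spec_out, cycles = speculative_generate([0], 20, 4, toy_draft, toy_target)
assert spec_out == greedy_generate([0], 20, toy_target)  # lossless under greedy decoding
print(f"{cycles} cycles for 20 new tokens")
```

Because every emitted token is either confirmed or supplied by the target, the output matches plain greedy decoding exactly. The drafter's quality only changes how many cycles are needed.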
Check out the Project page, Paper (arXiv 2602.06036), GitHub, Hugging Face checkpoints and NVIDIA blog.
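The KV injection described under the "target knows best" insight can be pictured with a small pure-Python sketch. Everything here is an assumption made for illustration: the dimensions, the random weights, the single-head attention without residuals or MLPs, the absence of a mask inside the block, and all function names. It only shows the data flow the article describes, where hidden states from several target layers are fused into one context feature and that feature passes through the Key and Value projections of every draft layer.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def fuse_target_features(layer_states, w_fuse):
    """Concatenate per-position hidden states from several target layers, then project."""
    positions = range(len(layer_states[0]))
    concat = [[v for layer in layer_states for v in layer[t]] for t in positions]
    return matmul(concat, w_fuse)


def draft_layer(block, context, w_q, w_k, w_v):
    """One attention layer: queries from the block, keys/values from context + block."""
    kv_input = context + block  # injected context rows first, then the block's rows
    q, k, v = matmul(block, w_q), matmul(kv_input, w_k), matmul(kv_input, w_v)
    scale = 1.0 / math.sqrt(len(q[0]))
    weights = [softmax([scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]) for qi in q]
    return matmul(weights, v)


def run_drafter(block, context, layers):
    """Pass the same target context into every draft layer, not only the first."""
    for w_q, w_k, w_v in layers:
        block = draft_layer(block, context, w_q, w_k, w_v)
    return block


rng = random.Random(0)
d, prompt_len, block_len, n_target_layers, n_draft_layers = 8, 3, 4, 3, 5
layer_states = [rand_matrix(prompt_len, d, rng) for _ in range(n_target_layers)]
w_fuse = rand_matrix(n_target_layers * d, d, rng)
context = fuse_target_features(layer_states, w_fuse)
layers = [tuple(rand_matrix(d, d, rng) for _ in range(3)) for _ in range(n_draft_layers)]
masked_block = rand_matrix(block_len, d, rng)  # stand-in for masked-token embeddings

drafted = run_drafter(masked_block, context, layers)
print(len(drafted), "positions x", len(drafted[0]), "features")
```

The sketch recomputes the context's key and value projections on each call. The article notes that in DFlash the projected features sit in the draft's KV cache and persist across drafting iterations, so a real implementation would compute them once per layer and reuse them.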

Original source: MarkTechPost AI


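36Kr · AI Hardware

Optical Chips: The Photonic Revolution in the Age of AI Compute, with Optical Communication and Optical Computing as Twin Drivers of a New Journey

This item focuses on "Optical Chips: The Photonic Revolution in the Age of AI Compute, with Optical Communication and Optical Computing as Twin Drivers of a New Journey". The original lede notes that optical chips are entering the early stage of large-scale commercial use, with opportunities emerging in domestic substitution and investment. From an AI intelligence perspective, content like this is worth watching for the underlying technical progress, product rollout, industry competition, and subsequent market impact.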


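ITHome · AI Hardware

Biostar Launches MS-NAT5000 Edge AI System with NVIDIA Jetson Thor Module

This item focuses on "Biostar Launches MS-NAT5000 Edge AI System with NVIDIA Jetson Thor Module". The original lede notes that the device delivers up to 2070 TFLOPS of FP4 AI compute and 128GB of LPDDR5X shared memory in a compact form factor. From an AI intelligence perspective, content like this is worth watching for the underlying technical progress, product rollout, industry competition, and subsequent market impact.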


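TechWeb · AI Hardware

From Digital-Human Livestreams to Real-Time Recommendations: Akamai's Li Wentao Decodes the Compute Behind AI-Driven Shopping Festivals

AI digital-human hosts are still pitching products at 3 a.m., AI recommendation engines match each user with the best product in real time during traffic peaks, and AI customer service answers millions of inquiries at once. Any stutter in an AI digital-human livestream interrupts the purchase decision chain. "The Asia-Pacific region is moving from the AI experimentation stage to the AI execution stage," says Li Wentao, Director of Cloud Computing Architects for Asia-Pacific at Akamai. In his view, the core of the "latency wall" problem is not that there is too little compute, but that compute is not close enough to users. This is also the underlying logic behind Akamai's shift from CDN giant to the world's largest distributed AI inference platform.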


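ITHome · AI Hardware

Musk Announces Starmind as the Name of His Space AI Compute Project, Plans 1 Million Computing Satellites in Orbit

This item focuses on "Musk Announces Starmind as the Name of His Space AI Compute Project, Plans 1 Million Computing Satellites in Orbit". The original lede notes that Elon Musk posted on X today (June 24), confirming that SpaceX's planned orbital AI data center project is named Starmind. From an AI intelligence perspective, content like this is worth watching for the underlying technical progress, product rollout, industry competition, and subsequent market impact.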


TechWeb · AI Hardware