DeepSeek Launches DSpark Speculative Decoding Framework, Speeding Up DeepSeek-V4 Generation by 60–85%

June 27, 2026, 16:59

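Key Takeaways

DeepSeek has released the DSpark speculative decoding framework and open-sourced its checkpoints and training code. This is a serving optimization, not a new model. The checkpoints DeepSeek-V4-Pro-DSpark and DeepSeek-V4-Flash-DSpark reuse the V4 weights and attach a draft module. The research team also open-sourced DeepSpec (MIT license), a codebase for training and evaluating speculative decoding drafters. The goal is faster large-model inference in busy production environments. DSpark pairs a parallel draft backbone with a small sequential head to reduce suffix decay. A confidence head and a load-aware scheduler verify more tokens when GPUs are idle and fewer when they are busy. Offline, accepted length improves by 26–31% over Eagle3 and by 16–18% over DFlash.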

On-site AI-compiled article

DeepSeek released DSpark, a speculative decoding framework, with open-source checkpoints and training code. It is a serving optimization, not a new model. The checkpoints DeepSeek-V4-Pro-DSpark and DeepSeek-V4-Flash-DSpark reuse the existing V4 weights, with a draft module attached. The DeepSeek research team also open-sourced DeepSpec, an MIT-licensed codebase for training and evaluating speculative decoding drafters. The work targets one problem: faster large-model inference in busy production serving.

TL;DR

DSpark pairs a parallel draft backbone with a tiny sequential head to cut suffix decay. A confidence head and load-aware scheduler verify more tokens when GPUs are idle, fewer when busy. Offline, accepted length rises 26–31% over Eagle3 and 16–18% over DFlash. In production on DeepSeek-V4, per-user generation runs 60–85% faster than the MTP-1 baseline. Output stays lossless, and the checkpoints plus DeepSpec training code are open-source.

What is DSpark?

Speculative decoding splits generation into two roles. A small draft model proposes a block of tokens. The full target model then verifies that block in one forward pass. Rejection sampling accepts the longest valid prefix and appends one bonus token. Because the rule preserves the target distribution exactly, there is no quality loss. DSpark keeps this guarantee. It changes how tokens are drafted and how many get verified.

The Latency Math It Optimizes

Per-token latency follows one equation from the paper: L = (T_draft + T_verify) / τ. Here τ is the number of tokens accepted per cycle. Speedup comes from three levers only. You can draft faster, lowering T_draft. You can draft better, raising τ. Or you can verify smarter, reducing wasted T_verify. DSpark pulls all three levers at once.

How It Works: Semi-Autoregressive Generation

Earlier drafters force a trade-off. Autoregressive drafters like Eagle3 condition each token on prior ones. That gives strong acceptance, but drafting cost grows with block size.
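To make the trade-off concrete, the latency equation L = (T_draft + T_verify) / τ can be evaluated for a few scenarios. The sketch below is only an illustration: the function name and every timing value are hypothetical and do not come from the paper or from DeepSpec.

```python
def per_token_latency(t_draft: float, t_verify: float, tau: float) -> float:
    """Per-token latency L = (T_draft + T_verify) / tau.

    t_draft and t_verify are per-cycle times; tau is the number of
    tokens accepted per cycle.
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    return (t_draft + t_verify) / tau


# Hypothetical timings in milliseconds, chosen only for illustration.
T_VERIFY = 30.0

# Cheap drafting, few accepted tokens per cycle.
baseline = per_token_latency(t_draft=2.0, t_verify=T_VERIFY, tau=2.0)

# Same drafting cost, but a better draft doubles tau.
better_draft = per_token_latency(t_draft=2.0, t_verify=T_VERIFY, tau=4.0)

# Same tau, but drafting cost has grown with block size.
slower_draft = per_token_latency(t_draft=10.0, t_verify=T_VERIFY, tau=4.0)

print(baseline, better_draft, slower_draft)
```

With these made-up numbers, raising τ helps the most, while a drafter whose cost grows with block size gives part of that gain back. That is the tension the semi-autoregressive design is meant to resolve.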
Parallel drafters like DFlash produce the whole block in one pass. Drafting stays cheap, but each position ignores its neighbors. The result is ‘multi-modal collision’ and rapid acceptance decay along the suffix.

DSpark splits drafting into two stages. A heavy parallel backbone, DFlash in their setup, produces base logits for every position. Then a lightweight sequential head adds a prefix-dependent bias before sampling each token.

The default sequential head is a Markov head. It only looks at the immediately preceding token. A low-rank factorization (rank 256) keeps it cheap, even with large vocabularies. Once position one samples ‘of’, the head boosts ‘course’ and suppresses ‘problem’. An optional RNN head tracks the full block prefix. It adds only marginal gains, so the Markov head ships as the default.

The payoff shows up position by position. DSpark inherits the parallel backbone’s high first-token accuracy. The sequential head then holds acceptance steady deep into the block.

Training freezes the target model and reuses its embedding and output head. A total-variation loss is the key term. Minimizing that distance directly maximizes the draft’s acceptance rate.

How It Works: Confidence-Scheduled Verification

More draft tokens do not always mean more speed. Verifying tokens that will be rejected wastes batch capacity under heavy load. DSpark adds two parts to fix this.

A confidence head outputs a score for each draft position. The score estimates the chance that token survives verification, given accepted predecessors. It is supervised by the analytical per-step acceptance rate. Raw neural confidence is usually overconfident. So the research team applies Sequential Temperature Scaling, a post-hoc calibration step. It cuts expected calibration error from 3–8% down to about 1%.

A hardware-aware prefix scheduler then sets the verification length per request. It uses a profiled throughput curve, SPS(B), measured once at startup.
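Returning to the Markov head from the drafting section: its low-rank, previous-token bias can be sketched in a few lines. This is a toy illustration under stated assumptions, not DeepSeek's implementation. The four-word vocabulary, the rank of 2, the factor values, and all function names are hypothetical; the article only states that the head conditions on the preceding token and uses a rank-256 factorization.

```python
import math
from typing import List

VOCAB = ["of", "course", "problem", "the"]
RANK = 2  # toy rank; the article reports rank 256

# Hypothetical low-rank factors: bias[u][v] = sum_k LEFT[u][k] * RIGHT[k][v]
LEFT = {
    "of": [1.0, 0.0],
    "course": [0.0, 0.0],
    "problem": [0.0, 0.0],
    "the": [0.0, 1.0],
}
RIGHT = [
    [0.0, 2.0, -2.0, 0.0],  # dimension 0: after "of", favor "course"
    [0.0, 0.0, 1.5, 0.0],   # dimension 1: after "the", favor "problem"
]


def markov_bias(prev_token: str) -> List[float]:
    """Bias over the next token, computed from the previous token only."""
    left = LEFT[prev_token]
    return [sum(left[k] * RIGHT[k][v] for k in range(RANK))
            for v in range(len(VOCAB))]


def softmax(logits: List[float]) -> List[float]:
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def most_likely(base_logits: List[float], prev_token: str) -> str:
    """Add the Markov bias to the backbone's logits and take the argmax."""
    biased = [b + d for b, d in zip(base_logits, markov_bias(prev_token))]
    probs = softmax(biased)
    return VOCAB[probs.index(max(probs))]


def param_counts(vocab_size: int, rank: int) -> tuple:
    """Parameters for a dense V x V bias table vs. two V x r factors."""
    return vocab_size * vocab_size, 2 * vocab_size * rank


# Base logits from the parallel backbone slightly prefer "problem".
base = [0.0, 1.0, 1.2, 0.0]
print(most_likely(base, "of"), most_likely(base, "the"))
```

In this toy setup the backbone alone would pick ‘problem’, but once the previous token is ‘of’ the bias flips the choice to ‘course’. The `param_counts` helper shows why the factorization matters: for a hypothetical 100,000-token vocabulary, a dense table needs 10 billion entries while two rank-256 factors need about 51 million.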
When GPUs are idle, the scheduler verifies more tokens. When GPUs are busy, it verifies fewer. The scheduler uses an early-stopping rule to stay lossless. The paper's appendix gives a counterexample showing why a naive global search would leak information.

Metrics

Offline tests cover math, code, and daily chat. Targets include Qwen3-4B, 8B, 14B, and Gemma4-12B. DSpark beats both baselines on accepted length across every domain. Against Eagle3, macro-average accepted length rises 30.9%, 26.7%, and 30.0% on the three Qwen3 sizes. Against DFlash, gains are 16.3%, 18.4%, and 18.3%. A 2-layer DSpark even beats a 5-layer DFlash.

The sequential head adds little cost. Scaling draft length from 4 to 16 adds only 0.2–1.3% per-round latency. In return, accepted length improves by up to 30%.

Production results come from DeepSeek-V4-Flash and V4-Pro under live traffic. The baseline is MTP-1, the prior single-token setup. At matched throughput, per-user speed rises 60–85% on Flash and 57–78% on Pro. The shipped configuration is DSpark-5, a five-token draft block with the Markov head.

Drafter | Drafting style             | Block cost            | Suffix acceptance | Verification length
Eagle3  | Autoregressive             | Grows with block size | High, stable      | Fixed
DFlash  | Parallel                   | Near-constant         | Decays fast       | Fixed (full block)
MTP-1   | Single-token (MTP)         | Low                   | n/a               | Static 2 tokens
DSpark  | Parallel + sequential head | Near-constant         | High, stable      | Dynamic, load-aware

Use Cases With Examples

Structured workloads gain the most from longer verification. In code generation, acceptance is naturally high. The scheduler can verify long prefixes with little waste, so coding agents stream output faster.

Open-ended chat behaves differently. A confidence-threshold sweep raised chat acceptance from 45.7% to 95.7%. The confidence head flags uncertain suffix tokens so they can be pruned.

Math reasoning sits between the two. Its acceptance rose from 76.9% to 92.5% in the same sweep. Long step-by-step traces benefit from steady deep-block acceptance.
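The pruning effect behind these threshold sweeps can be illustrated with a small sketch. It assumes, as the article describes, that each confidence score approximates the chance a draft token survives verification given that its predecessors were accepted. The simple cut-at-first-low-score rule, the function names, and the scores are hypothetical; this is not the paper's early-stopping rule and says nothing about its losslessness argument.

```python
from typing import List


def prune_by_confidence(confidences: List[float], threshold: float) -> int:
    """Number of leading draft tokens to verify.

    Stops at the first position whose confidence falls below the threshold.
    """
    keep = 0
    for score in confidences:
        if score < threshold:
            break
        keep += 1
    return keep


def expected_accepted(confidences: List[float], length: int) -> float:
    """Expected count of accepted tokens when verifying `length` tokens.

    Position k is accepted only if positions 1..k all survive, so its
    acceptance probability is the running product of the scores.
    """
    total, survive = 0.0, 1.0
    for score in confidences[:length]:
        survive *= score
        total += survive
    return total


def acceptance_rate(confidences: List[float], length: int) -> float:
    """Expected fraction of verified tokens that end up accepted."""
    if length == 0:
        return 0.0
    return expected_accepted(confidences, length) / length


# Hypothetical calibrated scores for a five-token draft block.
scores = [0.95, 0.90, 0.60, 0.30, 0.20]

full_block = prune_by_confidence(scores, threshold=0.0)  # verifies all 5
pruned = prune_by_confidence(scores, threshold=0.5)      # verifies 3

print(full_block, acceptance_rate(scores, full_block))
print(pruned, acceptance_rate(scores, pruned))
```

In this toy example, dropping the two low-confidence suffix tokens removes most of the wasted verification work while giving up only a small part of the expected accepted length. That is the direction of the effect the sweep reports.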
High-concurrency serving is the headline case. At moderate load, the scheduler runs roughly 4–6 verified tokens per request. As concurrency rises, it trims that budget to protect throughput.

Try It

DeepSpec runs in three stages: data preparation, training, then evaluation. A config selects the algorithm and target model. Evaluation benchmarks a trained draft checkpoint across nine datasets.

    # Install dependencies
    python -m pip install -r requirements.txt

    # Train a DSpark draft against a Qwen3-4B target.
    # The algorithm and target are chosen by the config, e.g.
    # config/dspark/dspark_qwen3_4b.py
    bash scripts/train/train.sh

    # Evaluate the trained draft across the 9 benchmark datasets.
    # Set in the eval config:
    #   target_name_or_path = Qwen/Qwen3-4B
    #   draft_name_or_path  = ~/checkpoints/deepspec/dspark_block8_qwen3_4b/step_latest
    bash scripts/eval/eval.sh

The default configs assume one node with 8 GPUs. Restrict CUDA_VISIBLE_DEVICES to run on fewer. Note that the target cache can be large, near 38 TB for the Qwen3-4B setting.

For the production checkpoints, the draft module attaches to the existing V4 weights. The Hugging Face cards include a minimal inference example in the inference folder. No retraining of the target model is required.

Original source: MarkTechPost AI ↗


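TMTPost · Model Updates

GPT-5.6: The Strongest Model, the Narrowest Door

This item focuses on "GPT-5.6: The Strongest Model, the Narrowest Door." The original lead asks: why can't GPT-5.6 simply go live? From an AI-intelligence perspective, content like this is worth watching for the technical progress, product rollout, industry competition, and downstream market impact behind it.

51 minutes ago · Read analysis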

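36Kr · Model Updates

Liang Wenfeng Co-Authors Paper as DeepSeek Makes a Big Move After Its First Funding Round: Generation Speed Jumps 85%

This item focuses on "Liang Wenfeng Co-Authors Paper as DeepSeek Makes a Big Move After Its First Funding Round: Generation Speed Jumps 85%." The original lead says: DeepSeek and Peking University jointly open-source a new result. From an AI-intelligence perspective, content like this is worth watching for the technical progress, product rollout, industry competition, and downstream market impact behind it.

1 hour ago · Read analysis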

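Zhidx (智東西) · Model Updates

Liang Wenfeng Co-Authors Paper! DeepSeek Makes a Big Move After Its First Funding Round: Generation Speed Jumps 85%

This item focuses on "Liang Wenfeng Co-Authors Paper! DeepSeek Makes a Big Move After Its First Funding Round: Generation Speed Jumps 85%." The original lead says: DeepSeek has just open-sourced new work, and the blade falls on inference! From an AI-intelligence perspective, content like this is worth watching for the technical progress, product rollout, industry competition, and downstream market impact behind it.

1 hour ago · Read analysis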

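36Kr · Model Updates

Just In: DeepSeek V4 Gets a DSpark Update, Inference Speed Up 80%

This item focuses on "Just In: DeepSeek V4 Gets a DSpark Update, Inference Speed Up 80%." The original lead says: the new mechanism can also speed up Qwen and Gemma. From an AI-intelligence perspective, content like this is worth watching for the technical progress, product rollout, industry competition, and downstream market impact behind it.

1 hour ago · Read analysis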

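TMTPost · Model Updates

[Digital Intelligence Weekly] DeepSeek Plans to at Least Double the Size of Every Department; Jensen Huang Tells Shareholder Meeting This AI Infrastructure Cycle Will Last Decades; ByteDance's Doubao Seedance 2.5 to Launch Officially in Early July

(May 22–27) AI is driving an HBM shortage, and Micron expects tight supply to persist beyond 2027; Huawei's Wang Tao: AI agents worldwide will exceed 100 billion by 2030 and may reach a trillion by 2040; supply-chain details of ByteDance's next-generation Doubao phone have surfaced, and its launch may be delayed; NVIDIA announces that the Vera Rubin NVL4 system will ship starting in Q4; Groq closes a US$650 million funding round to accelerate its AI inference cloud build-out, targeting 200 MW by the end of 2027; IDC: inference will account for more than 70% of intelligent computing demand by 2027...

1 hour ago · Read analysis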

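ITHome (IT之家) · Model Updates

Peking University and DeepSeek Jointly Open-Source DSpark: Breaking the High-Concurrency Inference Bottleneck for Large AI Models, With Speeds Up 60% to 85%

Targeting the high inference latency and poor concurrency efficiency of large models, the DSpark framework introduces two innovations, semi-autoregressive candidate generation and confidence-scheduled verification, that raise per-user generation speed by 60% to 85% while preserving generation quality. The framework is already deployed in the preview service of the DeepSeek-V4 series, and the code and models are open-sourced on GitHub. #LargeModels #AIInference #OpenSource

2 hours ago · Read analysis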
