
Shipping a Trillion Parameters With a Hub Bucket: Delta Weight Sync in TRL

May 27, 2026 00:00


Published May 27, 2026

Authors: Amine Dirhoussi (aminediroHF), Quentin Gallouédec (qgallouedec), Kashif Rasul (kashif), Lewis Tunstall (lewtun), Edward Beeching (edbeeching), Albert Villanova del Moral (albertvillanova), Leandro von Werra (lvwerra)

TL;DR, because you have models to train and we respect that:

Async RL has a dirty secret: every step, the trainer has to ship the whole model to the inference engine. For a 7B in bf16 that is 14 GB. For a frontier 1T model checkpoint that is on the order of a terabyte. Per step.

It turns out you do not have to. Between two consecutive RL optimizer steps, roughly 99% of bf16 weights are bit-identical (and never less than 98% in the worst case). The actual delta is tiny.

We landed a TRL PR that encodes just the changed elements as a sparse safetensors file, uploads it to a Hugging Face Bucket, and tells vLLM to fetch it. On Qwen3-0.6B, the per-step payload drops from 1.2 GB to 20 to 35 MB.

The cherry on top: we ran a full disaggregated training where the trainer was on one box, vLLM lived in a Hugging Face Space, the Wordle environment lived in another Space, and weights flowed through a single Hub bucket. No shared cluster, no RDMA, no VPN.

Async RL just got a lot cheaper. Read on.

[Figure: Two ways to ship the same weights. Red is wall-clock time during which no tokens are being generated.]

1. The One Terabyte Problem

If you read our previous post on the landscape of async RL training, you already know the punchline. Every async RL library, regardless of how it spells "actor model" or which color its NCCL backend is painted, eventually trips over the same root: weight synchronization. The inference engine speaks the policy of step N. The trainer just finished step N+1. The fresh weights have to get from one side to the other before the inference engine starts drifting hopelessly off-policy.
This sits on the critical path whether you are running sync or async: a blocking transfer means idle GPUs that are not generating tokens. With a sparse delta path you collapse that idle time into seconds, and the trainer does not even have to wait for the inference engine to be ready: it just publishes "weights ready" and uploads the weights to the shared bucket the moment its optimizer step finishes, while the inference engine fetches on its own time.

Fireworks put a very memorable number on this in their post Frontier RL Is Cheaper Than You Think: for a frontier 1T-parameter checkpoint at fp8 (their setting), a full snapshot is 1024 GiB, and that is what conventional wisdom says you have to ship every time you update your rollout fleet. That is the kind of number that gets people to start drawing diagrams with mega-clusters, RDMA fabrics, and dedicated cross-region links. Their measured average delta between adjacent checkpoints lands at 20.3 GiB, or 1.98% of the full model, and "more than 98% of weights in bf16 format remain bit-equivalent between consecutive checkpoints".

Cursor's Composer 2 report tells a parallel story. They run training and inference in different regions and stitch them together with a shared S3 bucket (their exact words), into which the trainer uploads compressed weight diffs every training step. Each cluster independently downloads and reconstructs from the shared delta chain, "requiring no direct connectivity to the training cluster". The two sides never speak to each other about parameters directly. The bucket is the wire.

Both papers agree on three things, and we want to repeat them slowly, because the rest of this post is essentially a faithful open source translation:

- Most of the weights have not actually changed between two adjacent RL steps.
- If you send only the parts that changed, your bandwidth bill collapses by roughly two orders of magnitude.
- If you route those tiny diffs through a shared object store, you no longer need the trainer and the inference cluster to live in the same data center.

The only thing missing was a version of this story that you can pip install. So we wrote one.

2. Why bf16 RL Weights Are Almost Always Sparse

Before we wire anything up, it is worth understanding why this whole game is even winnable. The "98% of weights do not change" claim sounds suspiciously like one of those numbers that works in the demo and falls apart in the wild. It is not. It falls out of how bf16 arithmetic works at the learning rates RL uses.

A bf16 number has 7 mantissa bits. Between two consecutive powers of two, there are exactly $2^7 = 128$ representable values, so the spacing between adjacent bf16 numbers around $|w|$ is roughly $|w| \cdot 2^{-7}$. An update gets absorbed by the bf16 cast whenever it sits below half of that spacing, i.e., when $|\Delta w| < |w|/256$. This is the "bf16 visibility threshold" PULSE plots in their Figure 3.

Now look at what Adam does. At an RL learning rate of, say, $3 \times 10^{-6}$, the update to a single weight is:

$$\Delta w = -\eta \cdot \frac{\hat{m}}{\sqrt{\hat{v}} + \epsilon}$$

The normalized step $\hat{m}/(\sqrt{\hat{v}}+\epsilon)$ is roughly order one, so $|\Delta w| \approx \eta \approx 3 \times 10^{-6}$. For most weights, $|w|$ sits somewhere around $10^{-2}$ to $10^{-1}$ (PULSE reports a median of 0.019 for representative LLM weights). The threshold $|w|/256$ at that magnitude is around $4 \times 10^{-5}$ to $4 \times 10^{-4}$, which is bigger than the update.

In other words: the optimizer is whispering, and bf16 cannot hear it. The update gets absorbed by rounding, the byte representation of $w$ does not change, and from the inference engine's perspective, this weight did not move.
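The absorption argument above can be checked on a single weight. The following is a minimal sketch, assuming round-to-nearest-even when casting float32 to bf16 and ignoring NaN/inf handling. The helper names (`to_bf16_bits`, `from_bf16_bits`, `is_absorbed`) are hypothetical and are not part of TRL, and the sketch uses only the standard library rather than a tensor framework.

```python
import struct

ETA = 3e-6  # RL-scale learning rate used in the text


def to_bf16_bits(x: float) -> int:
    """Round x to bf16 (nearest-even) and return its 16-bit pattern.

    Assumption: x is finite; NaN/inf are not handled in this sketch.
    """
    # Reinterpret the float32 encoding of x as an unsigned 32-bit integer.
    (bits32,) = struct.unpack("<I", struct.pack("<f", x))
    # bf16 keeps the top 16 bits; add a bias so truncation rounds to nearest even.
    rounding_bias = 0x7FFF + ((bits32 >> 16) & 1)
    return ((bits32 + rounding_bias) >> 16) & 0xFFFF


def from_bf16_bits(bits16: int) -> float:
    """Expand a 16-bit bf16 pattern back to a Python float."""
    (value,) = struct.unpack("<f", struct.pack("<I", bits16 << 16))
    return value


def is_absorbed(w: float, delta: float) -> bool:
    """True if adding delta and casting back to bf16 leaves the stored bits unchanged."""
    before = to_bf16_bits(w)
    after = to_bf16_bits(from_bf16_bits(before) + delta)
    return before == after


w = 0.019  # median weight magnitude quoted from PULSE in the text
print("effective bound (eta):     absorbed =", is_absorbed(w, -ETA))
print("absorption bound (10*eta): absorbed =", is_absorbed(w, -10 * ETA))
print("large update (|w|/100):    absorbed =", is_absorbed(w, -w / 100))
```

For a weight of this magnitude, the two RL-scale updates round back to the same bit pattern, while the deliberately oversized update does not. Every weight in the typical magnitude range behaves like the first case.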
Multiply that by a few hundred million parameters, and you get the >99% sparsity number, for free, with zero approximation.

This is exactly the argument made formal in the PULSE paper (Mihai & Belilovsky, 2026). They define two thresholds. The absorption bound $10\eta$ is the conservative worst case for an Adam update, and the effective bound $\eta$ is the regime you actually live in. The bf16 visibility threshold is $|w|/256$. Whenever the update sits below the visibility threshold, it gets absorbed, and the bf16 byte does not change. Their Figure 3 plots both bounds against a cloud of representative LLM weights, and the conclusion is unambiguous: at $\eta = 3 \times 10^{-6}$, the absorption bound itself already sits below the visibility threshold for almost every weight in the model.

They measure this empirically across Qwen2.5 (0.5B/1.5B/7B), Llama-3.2-3B, and Gemma-3-4B, and consistently find a mean per-step sparsity of ~99%, with a standard deviation of 0.2 to 0.4% over 400 training steps. The worst-case step stays above 98%. So <1% changed is not a lucky measurement; it is what the arithmetic guarantees.

We do not have to predict this analytically (and indeed, we tried predicting the change mask from Adam's $m$ and $v$ statistics, but recall was a sad 30%, more on that later). We just need to observe which bytes flipped. That is a tiny boolean tensor per parameter, computed right around the optimizer step.

[Interactive figure: Drag the learning rate down to RL territory and watch the cast-back-to-bf16 marker snap to the original tick. The 256-element grid on the bottom left is the aggregate effect across a tiny model.]

3. HF Buckets and the Architecture

Here is where the second piece of the story comes in, and where this post stops being a translation of Fireworks/Cursor and starts being a Hugging Face thing.

3.1 What is a Bucket?

A Bucket is a repo type on the Hub designed for high-frequency object storage.
No commit ceremony, no PR workflow, no LFS quirks. You add files, you list files, you download files. The Python interface is two functio
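
Separately from the Bucket interface, Section 2's point that the trainer only has to observe which bytes flipped can be illustrated with a small encode-and-apply round trip. This is a rough sketch under stated assumptions, not the TRL implementation: standard-library `array` objects of 16-bit patterns stand in for bf16 tensors, the function names `encode_delta` and `apply_delta` are hypothetical, and the sketch neither writes a safetensors file nor talks to a Bucket.

```python
from array import array


def encode_delta(old_bits, new_bits):
    """Return (indices, values) for every element whose 16-bit pattern changed."""
    if len(old_bits) != len(new_bits):
        raise ValueError("old and new must have the same number of elements")
    indices = array("I")  # flat positions of changed elements
    values = array("H")   # new 16-bit patterns at those positions
    for i, (old, new) in enumerate(zip(old_bits, new_bits)):
        if old != new:
            indices.append(i)
            values.append(new)
    return indices, values


def apply_delta(bits, indices, values):
    """Receiver side: overwrite only the changed positions, in place."""
    for i, v in zip(indices, values):
        bits[i] = v


# Toy "parameter": 1000 elements, of which only two change after the step.
old = array("H", [0x3C9C] * 1000)
new = array("H", old)
new[3] = 0x3C9D
new[512] = 0x3C9B

indices, values = encode_delta(old, new)

# The inference side starts from the old weights and applies the delta.
replica = array("H", old)
apply_delta(replica, indices, values)

full_bytes = len(new) * new.itemsize
delta_bytes = len(indices) * indices.itemsize + len(values) * values.itemsize
print("changed elements:", len(indices))
print("full payload bytes:", full_bytes, "delta payload bytes:", delta_bytes)
print("replica matches trainer:", replica == new)
```

Storing an index next to each changed value costs extra bytes per element, so a scheme like this only pays off when the changed fraction is small, which is the regime the sparsity measurements above describe.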
