MarkTechPost AI · AI Agent

Prime Intellect Releases prime-rl 0.6.0 to Train Trillion-Parameter MoE Models on Agentic RL Workloads

June 23, 2026, 07:20

Key Summary

AI-compiled draft prepared by this site

Prime Intellect has released prime-rl version 0.6.0. The framework targets reinforcement learning on trillion-parameter Mixture-of-Experts (MoE) models, with a focus on heavy agentic workloads such as long-horizon software-engineering (SWE) tasks. The research team trained GLM-5 on SWE tasks at sequence lengths up to 131k. With a batch size of 256 rollouts, step times stayed under five minutes, and the run used only 28 H200 nodes.

TL;DR

- prime-rl 0.6.0 trains trillion-parameter MoE models on agentic RL workloads.
- GLM-5 was trained on SWE tasks at 131k sequence length, with sub-5-minute steps, on 28 H200 nodes.
- Asynchronous RL disaggregates the trainer and inference so that each can be optimized independently.
- Inference uses FP8, Wide EP, P/D disaggregation, KV offloading, and router replay.
- Training uses 3-D parallelism (FSDP, EP, CP) plus block-scaled FP8.

What is prime-rl 0.6.0?

prime-rl is an open framework for asynchronous reinforcement learning. It post-trains large open-source models on agentic tasks, and version 0.6.0 extends this to trillion-parameter MoE scale. The example model in the announcement is zai-org/GLM-5.1. The optimizations also apply to other large MoE models, such as moonshotai/Kimi-K2.7-Code and nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16. A full GLM-5.1 run starts with one command on a Slurm cluster:

    uv run rl @ examples/glm5_llmd/rl.toml --output-dir /shared/outputs/glm5-llmd

Role of asynchronous RL

Agentic tasks have long-tail outliers: some coding rollouts run for hours. Waiting for them before each policy update would leave GPUs idle. Asynchronous RL avoids this by disaggregating the trainer and inference systems, which run and scale independently. The only synchronization point is the policy update: prime-rl pushes new weights to inference as soon as the optimizer step finishes. Already-dispatched rollouts keep their active prefix cache, so a single rollout may mix tokens from several policy versions.

New rollouts behave differently. A KV-cache salt forces them to repopulate their own KV cache, even when prefixes match. Requests that come from too old a policy are dropped; the max_off_policy_steps value sets that threshold.

Inference optimizations

Inference is usually the throughput bottleneck in an RL system. prime-rl optimizes for throughput while keeping latency bounded.

- FP8 inference: Lower precision speeds up prefill and decode. prime-rl uses FP8 with DeepEP and DeepGEMM kernels.
- Wide Expert Parallelism: Wide EP spreads experts across 32 or more GPUs and is paired with a large data-parallel size, for example 32. Each GPU holds separate experts and serves as an endpoint. Synchronization happens per layer, through dispatch and combine operations.
- Prefill and Decode Disaggregation: Some model-environment pairs hit a 4:1 prefill:decode token ratio. With shared workers, that would inflate end-to-end latency and reduce the benefits of PipelineRL. P/D disaggregation separates prefill and decode workers, so long tool outputs stop throttling decode workers.
- KV cache management: High concurrency needs large KV cache space. prime-rl supports tiered offloading to CPU and disk. vLLM native offloading creates one pool per worker, whereas Mooncake Store pools RAM and disk centrally across all nodes.
- Request routing: prime-rl ships a fork of vllm-router by default and also supports the NVIDIA Dynamo router as a drop-in replacement. Routers score workers using KV cache reuse, queue depth, and live load.
- Router replay (R3): Trainer-inference mismatch silently kills training. Router replay captures the routing decisions made during inference and replays them directly on the trainer, which cuts KL mismatch by roughly an order of magnitude. The routed-expert record has shape [num_layers, top_k, seq_len]. This payload can grow to hundreds of GB, and at scale the data rate reaches tens of Gbps, so prime-rl treats it as an opaque payload and processes it with optimized PyTorch operations.
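The two asynchronous-RL rules described earlier, the KV-cache salt and the max_off_policy_steps threshold, can be illustrated with a small toy model. This is only a sketch under stated assumptions, not prime-rl's code: it assumes the salt is the policy version and that staleness is measured from the oldest policy version that contributed tokens to a rollout. The names prefix_cache_key, Rollout, and keep_fresh are hypothetical; only max_off_policy_steps comes from the announcement.

```python
"""Toy model of two asynchronous-RL rules: KV-cache salting and staleness drops."""
import hashlib
from dataclasses import dataclass


def prefix_cache_key(prompt_tokens: list[int], policy_version: int) -> str:
    """Hash a prompt prefix together with a salt (assumed here: the policy version).

    Identical prefixes under different policy versions get different keys,
    so a new rollout cannot reuse KV entries written by older weights.
    """
    payload = f"{policy_version}:" + ",".join(map(str, prompt_tokens))
    return hashlib.sha256(payload.encode()).hexdigest()


@dataclass
class Rollout:
    rollout_id: str
    # Assumption: we track the oldest policy version that generated any token.
    oldest_policy_version: int


def keep_fresh(rollouts: list[Rollout], trainer_step: int,
               max_off_policy_steps: int) -> list[Rollout]:
    """Drop rollouts whose oldest tokens lag the trainer by more than the limit."""
    return [
        r for r in rollouts
        if trainer_step - r.oldest_policy_version <= max_off_policy_steps
    ]


if __name__ == "__main__":
    prompt = [101, 7592, 2088]
    # Same prefix, different salt: the cache keys differ.
    assert prefix_cache_key(prompt, 11) != prefix_cache_key(prompt, 12)

    batch = [Rollout("a", 12), Rollout("b", 9), Rollout("c", 4)]
    kept = keep_fresh(batch, trainer_step=12, max_off_policy_steps=4)
    print([r.rollout_id for r in kept])  # ['a', 'b']
```

In this toy run, rollout "c" lags the trainer by eight steps and is dropped, while "a" and "b" stay within the four-step budget.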
Training optimizations

The trainer builds on torchtitan, a PyTorch-native training codebase. It relies on 3-D parallelism: FSDP, CP, and EP. The GLM-5 case study uses all three.

| Strategy | What it shards | Primary use | Key detail |
| --- | --- | --- | --- |
| FSDP (FSDP2) | Parameters, gradients, optimizer states | Baseline memory amortization | Gathers weights on demand per layer via fully_shard |
| Expert Parallelism (EP) | Experts within a layer | Shrinks active layer memory | all-to-all dispatch/combine; torch-native or DeepEP |
| Context Parallelism (CP) | The sequence dimension | Long-context activation memory | Ulysses (default) or Ring Attention |

EP exists because layers stay huge after FSDP. With 78 layers and 800B parameters in float32, the all-gather for a single layer needs roughly 40 GB. Overlapping one more layer's gather with compute pushes that near 80 GB. Setting EP=8 dispatches tokens to the experts instead of gathering full expert weights. The torch-native all-to-all is slightly faster within one node; DeepEP wins when EP spans multiple nodes.

CP matters at sequence lengths of 131k and beyond, where activations, not parameters, dominate memory. GLM-5 uses DSA, which neither Ulysses nor Ring Attention parallelizes directly, so prime-rl ships a custom context-parallel implementation for it.

FP8 training: prime-rl uses DeepGEMM block-scaled FP8, as proposed by DeepSeek V3. This rarely raises throughput, because of quantization overhead. Its real value is matching trainer and inference precision, which reduces KL mismatch and stabilizes training.

Use cases with examples

- Long-horizon SWE agents: Train a model on real repository issues. Rollouts can span hundreds of turns and tool calls. P/D disaggregation keeps decode latency predictable here.
- 1T-scale post-training on fewer nodes: The GLM-5 run fit on 28 H200 nodes. Wide EP and KV offloading raise concurrency and throughput.
- Stable agentic RL at scale: Router replay and FP8 training both reduce trainer-inference KL mismatch. Lower mismatch means steadier training.
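The per-layer memory figures quoted in the training section (roughly 40 GB for one gathered layer, near 80 GB with one more layer overlapped) can be checked with back-of-the-envelope arithmetic. The sketch below is an illustration only. It assumes the 800B parameters are spread evenly over the 78 layers and ignores embeddings and activations. The function names are hypothetical and not part of prime-rl. The EP estimate further assumes that the whole layer consists of expert weights split evenly across ranks, which overstates the saving.

```python
"""Back-of-the-envelope check of the per-layer all-gather memory figures."""

BYTES_PER_FLOAT32 = 4
GB = 1e9  # decimal gigabytes


def layer_gather_gb(total_params: float, num_layers: int,
                    bytes_per_param: int = BYTES_PER_FLOAT32) -> float:
    """Memory to materialize one full layer, assuming evenly sized layers."""
    return total_params / num_layers * bytes_per_param / GB


def overlapped_gather_gb(per_layer_gb: float, layers_in_flight: int = 2) -> float:
    """Memory when the next layer is gathered while the current one computes."""
    return per_layer_gb * layers_in_flight


def ep_resident_gb(per_layer_gb: float, ep_degree: int) -> float:
    """Assumption: with EP, each rank keeps 1/ep_degree of the layer's weights
    and receives tokens instead of gathering the remaining experts."""
    return per_layer_gb / ep_degree


if __name__ == "__main__":
    one_layer = layer_gather_gb(total_params=800e9, num_layers=78)
    print(f"one layer gathered:   {one_layer:.1f} GB")                        # 41.0 GB
    print(f"two layers in flight: {overlapped_gather_gb(one_layer):.1f} GB")  # 82.1 GB
    print(f"per rank with EP=8:   {ep_resident_gb(one_layer, 8):.1f} GB")     # 5.1 GB
```

Under these simplifications the numbers come out at about 41 GB and 82 GB, in line with the article's "roughly 40GB" and "near 80GB".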

Original source: MarkTechPost AI

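TMTPost · AI Agent

Loop Engineering Takes Off: AI Agents Are Starting to Work on Their Own. Are Companies Ready to Take the Blame?

Loop Engineering has recently drawn attention. At its core, it redefines the boundaries of authority and responsibility among product, testing, engineering, and project management. As AI agents begin to carry out tasks autonomously, companies must face the question of accountability and prepare for the risks in advance.

Just now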

36Kr · AI Agent

Home Assistant Made an AI Blunder, but the Smart Home Really Is About to Change

Just now

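TechWeb · AI Agent

Amazon Web Services' Chu Ruisong: The Agentic AI Inflection Point Has Arrived, and It Is Business Transformation as Well as Technological Innovation

As this flywheel formally started up and accelerated, it brought on the inflection point of the Agentic AI boom. In response, Chu Ruisong laid out a map of the five-layer AI technology stack that enterprises need for Agentic business transformation. The fourth layer is the Agentic platform layer. The fifth layer is the agent and application layer, which is where Agentic AI truly creates value for enterprises and delivers business results. The only yardstick that really decides whether an Agentic project succeeds or fails is measurable business output. Chu Ruisong stressed that Agentic AI is not only a technological innovation but also a business transformation.

Just now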

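TMTPost · AI Agent

BAIC Foton's "Shrimp-Raising" Story: From Banning Excel to Letting AI Operate Systems

In the course of "raising shrimp", BAIC Foton started by banning the use of Excel and gradually introduced an AI operating system to make farming management more efficient. The technology took AI from passively answering questions to actively driving workflows. In the end, the AI system fully replaced traditional management tools and became the core of the shrimp-raising operation.

1 hour ago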

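Leiphone · AI Agent

Exclusive | The CEOs of Microsoft and Ping An Once "Plotted" to Build a Cloud Together

Recently, mass layoffs at Microsoft's cloud R&D team in China have once again put geopolitical "decoupling" and "severing" in the industry spotlight, but few remember how bold a localization attempt this giant once made in the Chinese market. One corner of this long-buried piece of tech history lies in a closed-door meeting held before the pandemic at the Ping An Finance Centre in Shenzhen. According to a person at Microsoft with knowledge of the matter who spoke to Leiphone, Jessica Tan (陳心穎), then co-CEO of Ping An, together with more senior management, briefed Microsoft CEO Satya Nadella, Azure president Scott Guthrie, and their delegation on Ping An's progress in its "finance + technology" strategic transformation. Satya was impressed. He remarked that Ping An's model for technology businesses is to decide first what it wants to do, then "have a child" and raise it to do that job, which often works out well. Microsoft's model is to have the "child" first and then find it something to do, which often turns out badly. Ping An executives later cited and praised this vivid summary at several internal meetings.

At the time, the two Chinese and American giants were exploring a partnership. Microsoft's cloud business intended to replace its agent in China and have Ping An serve as its China agent, while Ping An had big ambitions in cloud computing and was preparing to make its move. For Microsoft, Ping An was a giant that enjoyed very strong trust among local government and enterprise customers, and having it act as the China agent for Microsoft's cloud would open up a larger market. For Ping An, a latecomer to the cloud market, carving out a place in a fiercely competitive red ocean was not easy, so the olive branch from Microsoft was very attractive. It was, in short, a case of mutual interest.

01 Microsoft's hidden pain: its cloud business in China fell short of expectations

The talks grew out of the deep anxiety of Microsoft's cloud business in China at the time and the urgency of Microsoft's company-wide cloud transformation under Satya. In 2014, Satya Nadella took over as Microsoft CEO and put forward the well-known "mobile-first, cloud-first" strategy, with high hopes placed on Azure. Yet when Microsoft turned to China, a huge growth market, it ran straight into a rigid compliance wall: foreign public clouds could not operate directly inside the country. To get around the restrictions on value-added telecom services (IDC/ISP), Microsoft adopted in early 2013 a model in which "the foreign company provides the technology and a local company operates it", handing operating rights to 21Vianet (世紀互聯) and enlisting a number of established IT distributors as channel agents. After a while, however, Microsoft still felt things were "too slow". According to several veterans of foreign cloud vendors, at the time

7 hours ago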

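36Kr · AI Agent

Process Matters More Than Results: A Tuning Framework That Gives No Reference Answers Lets the Agent Squeeze Out Database Performance on Its Own

A parameter-tuning framework that provides no reference answers does more to push an AI agent to extract database performance on its own. It stresses process over results: giving the agent problem-solving ability is more effective than handing it a hundred reference answers. With this approach, the agent explores optimization paths by itself, without any preset solution.

13 hours ago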
