NVIDIA 推出 Polar:專為 Codex、Claude Code 與 Qwen Code 設計的 GRPO 訓練保留 Token 推出框架

重點摘要
語言代理的強化學習正變得日益複雜,代理現在需處理多輪工具使用、長時間運行的上下文以及多代理協調。主要的工程挑戰在於如何將現有代理軟體連接到訓練管線中,同時不破壞這些工具的運作方式。NVIDIA 研究團隊推出了 Polar,這是一個推出框架(rollout framework),讓研究人員能在不修改任何代理工具介面的情況下,對其執行強化學習。Polar 解決的核心問題在於,諸如 Codex CLI、Claude Code、Qwen Code 或 Pi 等代理工具介面(agent harness)負責管理系統提示、工具格式、上下文工程以及代理提交補丁的方式,這些細節會直接影響代理在評估階段的表現,而傳統 RL 基礎設施則需要重新實作這些邏輯。
Reinforcement learning for language agents is growing more complex. Agents now manage multi-turn tool use, long-running contexts, and multi-agent orchestration. The main engineering challenge is connecting existing agent software to training pipelines without breaking how those tools work. NVIDIA’s research team introduced Polar, a rollout framework that lets researchers run reinforcement learning over any agent harness without modifying that harness. The Core Problem Polar Solves An ‘agent harness’ is a tool like Codex CLI, Claude Code, Qwen Code, or Pi. These harnesses manage system prompts, tool formatting, context engineering, and how the agent submits patches. These details directly affect agent behavior at evaluation time. Traditional RL infrastructure requires harness logic to be rewritten behind a framework-owned environment API — typically env.init(), env.step(), env.reset() in the OpenAI Gym style. Every new harness requires new integration code. That integration can also lose execution details specific to the native harness path. Polar’s key observation is that every LLM-based agent must call a model. That model API boundary is a common interface outside the agent itself. Instead of integrating inside the harness, Polar places a proxy at that boundary. How the Proxy Works For each incoming model request, the gateway proxy performs four steps: Detect the provider API — using the request path and headers, it distinguishes Anthropic Messages, OpenAI Chat Completions, OpenAI Responses, and Google generateContent-style calls. Normalize the request — converts roles, content parts, tool definitions, and generation parameters into the OpenAI Chat Completions shape used by the local inference server. Capture token-level data — stores request messages, response messages, prompt token IDs, sampled response token IDs, finish reason, and log probabilities. Return the provider shape — transforms the response back into the schema the harness expects. For streaming requests, Polar obtains a non-streaming upstream response and emits a synthetic provider-shaped stream. This preserves compatibility with harnesses that expect server-sent events while ensuring complete token capture. The only required change to an existing harness is pointing its model base URL at the gateway. https://arxiv.org/pdf/2605.24220 Architecture: Rollout Server and Gateway Nodes Polar has two core components: The rollout server accepts a TaskRequest and expands it into num_samples independent sessions. Each session carries a session ID, task ID, timeout budget, runtime specification, agent specification, trajectory builder, evaluator, and callback URL. The server dispatches sessions to gateway nodes and accepts callbacks when sessions complete. Gateway nodes own the lifecycle of each session — starting the runtime, running the harness, building trajectories, evaluating output, and teardown. The gateway also hosts the proxy endpoint for that session’s model calls, keeping completion capture tied to the session registry. Within each gateway, isolated worker pools handle INIT, RUNNING, and POSTRUN stages. A bounded READY buffer holds initialized runtimes until a run slot is available. CPU-heavy runtime preparation and evaluator prewarm proceed off the critical path, without blocking active GPU-bound agent execution. If a harness times out after model calls have been captured, the gateway still enters POSTRUN so partial traces can be recovered. Built-in evaluators include a session-completion reward, a configurable test-on-output evaluator, and a SWE-Bench/SWE-Gym harness evaluator. Custom evaluators can be added through a registry interface. Polar currently supports Docker and rootless Apptainer runtimes. Built-in harness shortcuts include codex, claude_code, gemini_cli, qwen_code, opencode, and pi. Trajectory Reconstruction: Per Request vs. Prefix Merging After a session completes, Polar reconstructs trainable trajectories from captured model calls. Two strategies are available: The per_request builder treats every model call as one independent trace. It is lossless per individual call but fragments multi-turn sessions. A single coding problem can produce hundreds of per-request traces, increasing the burden on downstream trainers. The prefix_merging builder reconstructs longer traces where the harness session preserves append-only conversation histories. It partitions completions into ordered chains by verifying a strict token-prefix relation between adjacent completions. Sub-agents, context compaction boundaries, and parallel agent branches naturally form separate chains. Within each merged trace, only sampled assistant tokens are marked trainable. Canonical interstitial tokens receive a loss mask of zero. Ablation Results The research team benchmarks both strategies on the same model, hardware, and topology over three training steps. Metricper_requestprefix_mergingTrainer updates1,185218Wall-clock time189.5 min35.2 minSpeedup—5.39×Avg. rollout GPU utilization20.4%87.7% SWE-Bench Verified Results Training uses standard GRPO on the Qwen3.5-4B base model. The dataset is SkyRL-v0-293-data SWE-Gym (293 tasks, 1 epoch, rollout batch size 4, 16 samples per prompt) with the Slime trainer. All experiments use prefix_merging for trajectory construction. Training Rollout Reward Progress (pass@1) HarnessFirst 10 StepsLast 10 StepsCodex9.5%54.5%Claude Code28.8%67.0%Qwen Code61.6%66.0%Pi61.6%76.2% SWE-Bench Verified Final Scores HarnessBasePolar RLGainCodex3.8%26.4%+22.6 ptsClaude Code29.8%34.6%+4.8 ptsQwen Code34.6%35.2%+0.6 ptsPi34.2%40.4%+6.2 pts The largest gain is under Codex. Codex presents an unfamiliar action protocol and patch-submission style to a Qwen model not originally trained on that harness. Polar attaches the reward signal to the actual sampled tokens flowing through the Codex execution path, so GRPO optimizes the behavior the model uses at evaluation time. Under the native Qwen Code harness, where the base model is already well-aligned, Polar still delivers a 0.6 point gain. Offline SFT Data Generation Polar can also serve as a distributed offline data generation service with no changes to the runtime. The research team demonstrates this using Qwen3.5-122B-A10B on an 8×H100 server (TP=8, max_model_len=32,768) with the pi harness against 1,638 instances from seven SWE-Gym repositories. A trajectory is accepted into the SFT corpus only if the SWE-Bench evaluation harness confirms the agent’s patch resolves every FAIL_TO_PASS test and leaves every PASS_TO_PASS test green. RepositoryAttemptsAcceptedRategetmoto/moto34318453.6%python/mypy25710139.3%conan-io/conan712738.0%pydantic/pydantic812429.6%iterative/dvc2194520.5%pandas-dev/pandas4779819.7%dask/dask1412517.7%Total1,63850430.8% The run cost roughly 64 GPU-hours. Accepted trajectories average 104 messages per session and 51 assistant turns. Framework Comparison SystemAsync RLAsync Rollout StagingRollout as ServiceHarness AgnosticPolar✓✓✓✓ProRL Agent✓✓✓✗SkyRL-Agent✓✓✗partialPRIME-RL✓✗✗✗Agent Lightningpartial✗partialpartialrLLMpartial✗✗✗OpenClaw-RL✓✗✗partial Polar is the only system in this comparison with first-class support across all four properties. Strengths and Limitations Strengths No harness code changes required — the proxy intercepts at the model API boundary Provider-agnostic: supports Anthropic, OpenAI Chat, OpenAI Responses, and Google API formats natively prefix_merging reduces trainer updates from 1,185 to 218 and cuts wall-clock time 5.39× Works for both online RL and offline SFT data generation with the same runtime Harness-native RL delivers large gains for unfamiliar execution paths — 22.6 pts on Codex Partial traces are recovered when a harness times out mid-session Released as open source under NeMo Gym Limitations Reward design, evaluator quality, and distribution shift remain the researcher’s responsibility Requires the harness to support a configurable model base URL Token-level capture depends on the servi
Related
相關文章

Edge AI Daily 早報(6月19日)
AI Engineer World's Fair 2026規模再創新高,標誌AI工程從幕後走向舞臺中央。行業面臨結構性調整:楊立昆警示OpenAI年虧210億美元揭示商業模式脆弱性,Transformer之父轉投OpenAI反映人才爭奪白熱化。Anthropic多線佈局——語音支持七種語言、加入碳清除聯盟、落子首爾辦事處,展現生態擴張野心。監管壓力加劇,意大利依據DMA調查蘋果iCloud,巴西開放iOS側載佣金降至5%,蘋果圍牆花園持續崩塌。

今天起,Claude Design要把設計師和程序員變成同一種人了
猝不及防!Anthropic深夜甩出Claude Design大更新,設計系統一鍵導入,代碼雙向同步,9大平臺一鍵導出。Anthropic設計師親自下場錄屏:AI跑了八輪自查,才敢把設計稿給你看。

OpenAI 成為 Rust 基金會白金會員,合計贊助 60 萬美元
OpenAI 正式成為 Rust 基金會白金會員,將提供總計 60 萬美元資金,用於支持 Rust 開源項目維護者及 Rust 創新實驗室等計劃。這標誌著 AI 巨頭對安全、高效系統編程語言的重視。 #OpenAI #Rust #開源

Claude Design 上線首周用戶破百萬,和 Claude Code 共享 AI 配額
Anthropic 今天(6 月 18 日)發佈公告,在宣佈 Claude Design 上線首周用戶規模突破 100 萬後,進一步強化和 Claude Code 的雙向聯動,實現從設計到編程的無縫工作流。
谷歌時隔6年再發智能音箱,Gemini上桌,售價不到700元
智東西 編譯 | 劉煜 編輯 | 陳駿達 智東西6月18日消息,谷歌昨日宣佈,其首款搭載居家版Gemini語音助手的智能音箱(Google Home Speaker)已開啟預售,將於當地時間6月25日正式上市,售價為99.99美元(約合人民幣677.03元)。在此之前,谷歌已有6年沒有推出過獨立智能音箱產品。 谷歌這款智能音箱外觀近似球形,風格類似亞馬遜新一代Echo音箱與蘋果舊款音箱HomePod Mini。 ▲谷歌智能音箱(圖源:谷歌官網) 使用音箱時,用戶只需通過口令“Hey Google”或“OK Google”喚醒Gemini,就可以繼續下達相應指令。這與谷歌舊款音箱、智能顯示屏等喚醒語音助手的方式相同。此外,用戶只要按照日常說話習慣下達命令,Gemini便能理解用戶意圖,相比之前大大提升溝通效率。 一、加強短時對話記憶,會員可與Gemini不限次數對話 谷歌此次推出的全新音箱升級諸多功能。其中,音箱搭載的Gemini語音助手擁有10款全新擬人化語音音色,用戶可以根據喜好自行選擇聲線。音箱還可支持用戶一次性下達多條語音指令,即使指令未能說對、說完整,用戶中途改口Gemini也能識別。 Gemini還具備多鏈路推理能力,落地到實際生活場景中比較實用。例如,用戶問:“我支持的足球隊下場比賽天氣如何?”Gemini收到指令後,會自動查詢賽事時間、舉辦地點,同時匹配相應時段天氣,再給出答覆。 同時,Gemini加強了短時對話記憶,能承接上下文實現連續對話功能。即使用戶連續追問、甚至串聯多項任務、不重複交代前置條件,該語音助手也能實現來回連貫交流。 ▲谷歌Gemini對話場景(圖源:谷歌官網) 不僅如此,Gemini搭配的連續對話功能,能讓應答後的音箱麥克風保持短暫收音,用戶無需重複喊“OK Google”就能繼續提問。該功能現已全面支持所有Gemini原生適配的語言,包括

微軟,考慮接入DeepSeek
這篇消息聚焦「微軟,考慮接入DeepSeek」。原始導語提到:Copilot Cowork轉為按量計費。 從 AI 情報角度來看,這類內容值得關注其背後的技術進展、產品落地、產業競爭與後續市場影響。