MarkTechPost AI生成式AI

DeepReinforce 發布 Ornith-1.0:開源程式碼模型系列,可自行學習強化學習框架

2026年6月25日 17:11

重點摘要

DeepReinforce 發布了 Ornith-1.0,這是一個專為代理式程式碼任務打造的開源模型系列。該系列涵蓋四種尺寸,從 9B 密集模型到 397B 混合專家旗艦模型。所有檢查點皆以 MIT 授權條款在 Hugging Face 上發布。這些模型是在預訓練的 Gemma 4 和 Qwen 3.5 基礎上進行後訓練。多數程式碼代理會將模型與固定的人工設計框架搭配使用,而 Ornith-1.0 則學會自行編寫其框架。DeepReinforce 研究團隊報告指出,在同等規模的開放模型中,該模型達到了最先進的結果。摘要:Ornith-1.0 提供 9B、31B、35B-MoE 和 397B-MoE 四種尺寸,均採用 MIT 授權,基於 Gemma 4 和 Qwen 3.5。該模型在強化學習過程中自行學習其框架,同時優化框架與解決方案。Ornith-1.0-397B 在兩個主要基準測試中超越了 Claude Opus 4.7。

站內 AI 整理稿

DeepReinforce has released Ornith-1.0, an open-source model family built for agentic coding. The lineup spans four sizes, from a 9B dense model to a 397B mixture-of-experts flagship. Every checkpoint ships under the MIT license on Hugging Face. The models are post-trained on top of pretrained Gemma 4 and Qwen 3.5. Most coding agents pair a model with a fixed, human-designed harness. Ornith-1.0 instead learns to write its own. The DeepReinforce research team reports state-of-the-art results among open models of comparable size. TL;DR Ornith-1.0 ships in 9B, 31B, 35B-MoE, and 397B-MoE sizes under MIT, built on Gemma 4 and Qwen 3.5. The model learns its own scaffold during RL, jointly optimizing the harness and the solution. Ornith-1.0-397B tops Claude Opus 4.7 on both headline benchmarks, but not Opus 4.8 or the larger GLM-5.2-744B. Three layers — fixed trust boundary, deterministic monitor, frozen LLM judge — guard against reward hacking. What is Ornith-1.0? Ornith-1.0 is a set of reasoning models tuned for coding agents. The variants are 9B Dense, 31B Dense, 35B MoE, and 397B MoE. The 35B model is mixture-of-experts and activates roughly 3B parameters per token. FP8 and GGUF builds are also published for faster local serving. Each model is a reasoning model. Replies open with a <think> block before the final answer. The serving recipes enable a reasoning parser, so that trace returns in a separate reasoning_content field. The models also emit well-formed tool calls for agent loops. Deployment is straightforward. The 9B model is about 19GB in bf16 and serves on a single 80GB GPU. Serving recipes target vLLM, SGLang, and Transformers. Each model exposes an OpenAI-compatible endpoint. Standard agent frameworks therefore work without code changes. Interactive Explainer &lt;/button&gt; &lt;button class=&quot;btn gho&quot; id=&quot;resetBtn&quot;&gt;Reset&lt;/button&gt; &lt;/div&gt; &lt;div class=&quot;stepout&quot; id=&quot;stepOut&quot;&gt;Step 0 — untrained policy with a fixed, hand-written harness.&lt;/div&gt; &lt;/div&gt; &lt;!-- PANEL 2: BENCH --&gt; &lt;div class=&quot;panel&quot; data-panel=&quot;bench&quot;&gt; &lt;div class=&quot;lead&quot;&gt;Vendor-reported scores from DeepReinforce. Pick a model tier and a benchmark. Ornith is highlighted in green. Higher is better.&lt;/div&gt; &lt;div class=&quot;seg&quot;&gt;&lt;span class=&quot;lab&quot;&gt;Model tier&lt;/span&gt; &lt;div class=&quot;chip on&quot; data-tier=&quot;t397&quot;&gt;397B flagship&lt;/div&gt; &lt;div class=&quot;chip&quot; data-tier=&quot;t35&quot;&gt;35B MoE&lt;/div&gt; &lt;div class=&quot;chip&quot; data-tier=&quot;t9&quot;&gt;9B dense&lt;/div&gt; &lt;/div&gt; &lt;div class=&quot;seg&quot; id=&quot;benchChips&quot;&gt;&lt;span class=&quot;lab&quot;&gt;Benchmark&lt;/span&gt;&lt;/div&gt; &lt;div class=&quot;chart&quot; id=&quot;chart&quot;&gt;&lt;/div&gt; &lt;div class=&quot;foot-note&quot; id=&quot;benchNote&quot;&gt;&lt;/div&gt; &lt;/div&gt; &lt;!-- PANEL 3: DEFENSES --&gt; &lt;div class=&quot;panel&quot; data-panel=&quot;def&quot;&gt; &lt;div class=&quot;lead&quot;&gt;A model that writes its own scaffold could cheat the verifier. DeepReinforce describes three defense layers. Tap each to expand.&lt;/div&gt; &lt;div class=&quot;layers&quot;&gt; &lt;div class=&quot;layer open&quot;&gt;&lt;div class=&quot;lh&quot;&gt;&lt;span class=&quot;num&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;lt&quot;&gt;Fixed trust boundary&lt;/span&gt;&lt;span class=&quot;more&quot;&gt;tap&lt;/span&gt;&lt;/div&gt;&lt;div class=&quot;lb&quot;&gt;The environment, tool surface, and test isolation are immutable and outside the model's reach. The model evolves only its inner policy scaffold — memory, error-handling, and orchestration logic.&lt;/div&gt;&lt;/div&gt; &lt;div class=&quot;layer&quot;&gt;&lt;div class=&quot;lh&quot;&gt;&lt;span class=&quot;num&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;lt&quot;&gt;Deterministic monitor&lt;/span&gt;&lt;span class=&quot;more&quot;&gt;tap&lt;/span&gt;&lt;/div&gt;&lt;div class=&quot;lb&quot;&gt;A rule-based monitor flags any attempt to read withheld paths, modify verification scripts, or invoke unsanctioned tools. Such trajectories get zero reward and are excluded from the advantage computation.&lt;/div&gt;&lt;/div&gt; &lt;div class=&quot;layer&quot;&gt;&lt;div class=&quot;lh&quot;&gt;&lt;span class=&quot;num&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;lt&quot;&gt;Frozen LLM judge&lt;/span&gt;&lt;span class=&quot;more&quot;&gt;tap&lt;/span&gt;&lt;/div&gt;&lt;div class=&quot;lb&quot;&gt;Because intent-level gaming can happen inside the allowed tool surface, a frozen LLM judge acts as a veto on top of the verifier — not as the primary reward signal.&lt;/div&gt;&lt;/div&gt; &lt;/div&gt; &lt;/div&gt; &lt;div class=&quot;ftr&quot;&gt;&lt;span&gt;Source: &lt;a href=&quot;https://deep-reinforce.com/ornith_1_0.html&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;deep-reinforce.com&lt;/a&gt; · MIT licensed · numbers vendor-reported&lt;/span&gt;&lt;span&gt;&lt;b&gt;Marktechpost&lt;/b&gt; · AI Dev Signals&lt;/span&gt;&lt;/div&gt; &lt;script&gt; (function(){ var root=document.getElementById('mtp-ornith-demo'); /* tabs */ root.querySelectorAll('.tab').forEach(function(t){ t.addEventListener('click',function(){ root.querySelectorAll('.tab').forEach(function(x){x.classList.remove('on')}); root.querySelectorAll('.panel').forEach(function(x){x.classList.remove('on')}); t.classList.add('on'); root.querySelector('.panel[data-panel=&quot;'+t.dataset.p+'&quot;]').classList.add('on'); resize(); }); }); /* loop sim */ var step=0,reward=0.08,timer=null; var scaffs=[ 'Baseline harness: linear retries, no memory.', 'Adds scratchpad memory across tool calls.', 'Adds error-triage branch before re-edit.', 'Reorders: read tests, then plan, then patch.', 'Caches sub-results; prunes dead branches.', 'Task-specific orchestration emerges automatically.']; var outs=[ 'Fixed harness, no learning yet.', 'Fewer redundant file reads observed.', 'Recovers from failed edits more often.', 'Higher first-pass test success.', 'Shorter trajectories, same accuracy.', 'Stable high-reward scaffold selected.']; var nodes=root.querySelectorAll('.node'); function lightSeq(cb){ var i=0;nodes.forEach(function(n){n.classList.remove('act')}); var iv=setInterval(function(){ nodes.forEach(function(n){n.classList.remove('act')}); nodes[i].classList.add('act');i++; if(i&gt;=nodes.length){clearInterval(iv);setTimeout(function(){nodes.forEach(function(n){n.classList.remove('act')});cb&amp;&amp;cb();},260);} },220); } function doStep(){ if(step&gt;=5){return;} step++; lightSeq(function(){ reward=[0.08,0.27,0.43,0.58,0.69,0.77][step]; root.querySelector('#rFill').style.width=(reward*100)+'%'; root.querySelector('#rVal').textContent=reward.toFixed(2); root.querySelector('#scaffTxt').textContent=scaffs[step]; root.querySelector('#outTxt').textContent=outs[step]; root.querySelector('#stepOut').innerHTML='Step '+step+' — &lt;b&gt;scaffold mutated&lt;/b&gt;; reward propagated to both stages.'; resize(); }); } root.querySelector('#stepBtn').addEventListener('click',doStep); root.querySelector('#autoBtn').addEventListener('click',function(){ if(timer){clearInterval(timer);timer=null;this.textContent='Auto-run <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/25b6.png" alt="▶" class="wp-smiley" style="height: 1em; max-height: 1em;" />';return;} this.textContent='Pause <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/23f8.png" alt="⏸" class="wp-smiley" style="height: 1em; max-height: 1em;" />';var b=this; timer=setInterval(function(){if(step&gt;=5){clearInterval(timer);timer=null;b.textContent='Auto-run <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/25b6.png" alt="▶" class="wp-smiley" style="height: 1em; max-height: 1em;" />';}else{doStep();}},1400); }); root.querySelector('#resetBtn').addEventListener('click',function(){ if(timer){clearInterval(timer);timer=null;root.querySelector('#autoBtn').textCon

Related

相關文章

Hugging Face Blog生成式AI

一鍵在 HF Jobs 上啟動 vLLM 伺服器

你現在可以透過單一指令,在 Hugging Face 基礎架構上啟動一個私有的、相容於 OpenAI 的 LLM 端點——無需佈建伺服器、不需要 Kubernetes,按秒計費。啟用後,你可以從筆電、筆記本或任何地方對其進行查詢。這是為測試、評估或批次生成快速啟動模型的最快方式。(如果你需要的是受管理的、可立即上線的服務,Inference Endpoints 才是你的選擇——文末會說明何時該選哪一種。)以下是完整流程。 前置需求:需有付款方式或正向預付餘額(Jobs 按硬體使用量以每分鐘計費),以及 huggingface_hub >= 1.20.0。

剛剛
智東西生成式AI

ChatGPT不拼智商拼情商了?GPT-5.5 Instant更新,明天開始免費使用

這篇消息聚焦「ChatGPT不拼智商拼情商了?GPT-5.5 Instant更新,明天開始免費使用」。原始導語提到:GPT-5.5 Instant更新上線,重點提升建議、決策和日常對話體驗。 從 AI 情報角度來看,這類內容值得關注其背後的技術進展、產品落地、產業競爭與後續市場影響。

52 分鐘前

月之暗面黃震昕:Kimi的目標是和海外三家模型掰手腕

這篇消息聚焦「月之暗面黃震昕:Kimi的目標是和海外三家模型掰手腕」。原始導語提到:企業級AI難點並不在模型廠商這一側,在於如何去切入和推進企業完成AI轉型。 從 AI 情報角度來看,這類內容值得關注其背後的技術進展、產品落地、產業競爭與後續市場影響。

1 小時前
雷峰網生成式AI

這次是阿里!中國的大模型團隊快被 Anthropic 告完了

這是Anthropic迄今控訴的最大規模“模型蒸餾”案。 作者丨高允毅 編輯丨馬曉寧 01Anthropic已經告了四家中國AI公司短短四個月,四家中國頂級AI公司被Anthropic接連點名,且沒有停手的跡象。這一次,輪到阿里。2026年6月10日,Anthropic向美國參議院銀行委員會遞交了一封信,矛頭直指阿里Qwen團隊。報告披露了一串數字:從4月22日到6月5日,整整45天,阿里相關運營者利用2.5萬個賬號,完成了2880萬次交互。這是Anthropic迄今公開的最大規模“模型蒸餾”數據。2880萬次對話是什麼概念?放一個行業參照:目前主流的高質量SFT(監督微調)數據集,規模通常在數十萬到幾百萬條之間。2880萬次針對核心能力的定向交互,足以在特定任務域內,低成本“提純”出一個極具競爭力的專用模型。這引起了Anthropic的高度警惕。在他們看來,對方的行為目標極其精準,刀刀直指其最新旗艦模型 Mythos Preview 的核心底牌,軟件工程與智能體推理能力。Anthropic在信中將其定性為“迄今為止,中國公司試圖搭美國頂尖實驗室便車的最大規模嘗試”。梳理時間線可以發現,Anthropic的反擊正在顯著升級。2026年2月23日,Anthropic發佈了一篇博客文章《Detecting and Preventing Distillation Attacks》,公開點名三家中國AI實驗室:DeepSeek、月之暗面(Kimi)、MiniMax(稀宇科技)。報告顯示,約2.4萬個中國相關賬號對Claude發起了超過1600萬次交互,其中MiniMax超1300萬,月之暗面超340萬,DeepSeek超15萬。從1600萬次到2880萬次,規模在翻倍,Anthropic的反擊,也從2月份的“技術曝光”,升級到6月份“政治施壓”。而這次的收信人,銀行委員會主席蒂姆·

6 小時前