DeepReinforce 發布 Ornith-1.0：開源程式碼模型系列，可自行學習強化學習框架

2026年6月25日 17:11

重點摘要

DeepReinforce 發布了 Ornith-1.0，這是一個專為代理式程式碼任務打造的開源模型系列。該系列涵蓋四種尺寸，從 9B 密集模型到 397B 混合專家旗艦模型。所有檢查點皆以 MIT 授權條款在 Hugging Face 上發布。這些模型是在預訓練的 Gemma 4 和 Qwen 3.5 基礎上進行後訓練。多數程式碼代理會將模型與固定的人工設計框架搭配使用，而 Ornith-1.0 則學會自行編寫其框架。DeepReinforce 研究團隊報告指出，在同等規模的開放模型中，該模型達到了最先進的結果。摘要：Ornith-1.0 提供 9B、31B、35B-MoE 和 397B-MoE 四種尺寸，均採用 MIT 授權，基於 Gemma 4 和 Qwen 3.5。該模型在強化學習過程中自行學習其框架，同時優化框架與解決方案。Ornith-1.0-397B 在兩個主要基準測試中超越了 Claude Opus 4.7。

站內 AI 整理稿

DeepReinforce has released Ornith-1.0, an open-source model family built for agentic coding. The lineup spans four sizes, from a 9B dense model to a 397B mixture-of-experts flagship. Every checkpoint ships under the MIT license on Hugging Face. The models are post-trained on top of pretrained Gemma 4 and Qwen 3.5. Most coding agents pair a model with a fixed, human-designed harness. Ornith-1.0 instead learns to write its own. The DeepReinforce research team reports state-of-the-art results among open models of comparable size. TL;DR Ornith-1.0 ships in 9B, 31B, 35B-MoE, and 397B-MoE sizes under MIT, built on Gemma 4 and Qwen 3.5. The model learns its own scaffold during RL, jointly optimizing the harness and the solution. Ornith-1.0-397B tops Claude Opus 4.7 on both headline benchmarks, but not Opus 4.8 or the larger GLM-5.2-744B. Three layers — fixed trust boundary, deterministic monitor, frozen LLM judge — guard against reward hacking. What is Ornith-1.0? Ornith-1.0 is a set of reasoning models tuned for coding agents. The variants are 9B Dense, 31B Dense, 35B MoE, and 397B MoE. The 35B model is mixture-of-experts and activates roughly 3B parameters per token. FP8 and GGUF builds are also published for faster local serving. Each model is a reasoning model. Replies open with a <think> block before the final answer. The serving recipes enable a reasoning parser, so that trace returns in a separate reasoning_content field. The models also emit well-formed tool calls for agent loops. Deployment is straightforward. The 9B model is about 19GB in bf16 and serves on a single 80GB GPU. Serving recipes target vLLM, SGLang, and Transformers. Each model exposes an OpenAI-compatible endpoint. Standard agent frameworks therefore work without code changes. Interactive Explainer </button> <button class="btn gho" id="resetBtn">Reset</button> </div> <div class="stepout" id="stepOut">Step 0 — untrained policy with a fixed, hand-written harness.</div> </div>  <div class="panel" data-panel="bench"> <div class="lead">Vendor-reported scores from DeepReinforce. Pick a model tier and a benchmark. Ornith is highlighted in green. Higher is better.</div> <div class="seg"><span class="lab">Model tier</span> <div class="chip on" data-tier="t397">397B flagship</div> <div class="chip" data-tier="t35">35B MoE</div> <div class="chip" data-tier="t9">9B dense</div> </div> <div class="seg" id="benchChips"><span class="lab">Benchmark</span></div> <div class="chart" id="chart"></div> <div class="foot-note" id="benchNote"></div> </div>  <div class="panel" data-panel="def"> <div class="lead">A model that writes its own scaffold could cheat the verifier. DeepReinforce describes three defense layers. Tap each to expand.</div> <div class="layers"> <div class="layer open"><div class="lh"><span class="num">1</span><span class="lt">Fixed trust boundary</span><span class="more">tap</span></div><div class="lb">The environment, tool surface, and test isolation are immutable and outside the model's reach. The model evolves only its inner policy scaffold — memory, error-handling, and orchestration logic.</div></div> <div class="layer"><div class="lh"><span class="num">2</span><span class="lt">Deterministic monitor</span><span class="more">tap</span></div><div class="lb">A rule-based monitor flags any attempt to read withheld paths, modify verification scripts, or invoke unsanctioned tools. Such trajectories get zero reward and are excluded from the advantage computation.</div></div> <div class="layer"><div class="lh"><span class="num">3</span><span class="lt">Frozen LLM judge</span><span class="more">tap</span></div><div class="lb">Because intent-level gaming can happen inside the allowed tool surface, a frozen LLM judge acts as a veto on top of the verifier — not as the primary reward signal.</div></div> </div> </div> <div class="ftr"><span>Source: <a href="https://deep-reinforce.com/ornith_1_0.html" target="_blank" rel="noopener">deep-reinforce.com</a> · MIT licensed · numbers vendor-reported</span><span><b>Marktechpost</b> · AI Dev Signals</span></div> <script> (function(){ var root=document.getElementById('mtp-ornith-demo'); /* tabs */ root.querySelectorAll('.tab').forEach(function(t){ t.addEventListener('click',function(){ root.querySelectorAll('.tab').forEach(function(x){x.classList.remove('on')}); root.querySelectorAll('.panel').forEach(function(x){x.classList.remove('on')}); t.classList.add('on'); root.querySelector('.panel[data-panel="'+t.dataset.p+'"]').classList.add('on'); resize(); }); }); /* loop sim */ var step=0,reward=0.08,timer=null; var scaffs=[ 'Baseline harness: linear retries, no memory.', 'Adds scratchpad memory across tool calls.', 'Adds error-triage branch before re-edit.', 'Reorders: read tests, then plan, then patch.', 'Caches sub-results; prunes dead branches.', 'Task-specific orchestration emerges automatically.']; var outs=[ 'Fixed harness, no learning yet.', 'Fewer redundant file reads observed.', 'Recovers from failed edits more often.', 'Higher first-pass test success.', 'Shorter trajectories, same accuracy.', 'Stable high-reward scaffold selected.']; var nodes=root.querySelectorAll('.node'); function lightSeq(cb){ var i=0;nodes.forEach(function(n){n.classList.remove('act')}); var iv=setInterval(function(){ nodes.forEach(function(n){n.classList.remove('act')}); nodes[i].classList.add('act');i++; if(i>=nodes.length){clearInterval(iv);setTimeout(function(){nodes.forEach(function(n){n.classList.remove('act')});cb&&cb();},260);} },220); } function doStep(){ if(step>=5){return;} step++; lightSeq(function(){ reward=[0.08,0.27,0.43,0.58,0.69,0.77][step]; root.querySelector('#rFill').style.width=(reward*100)+'%'; root.querySelector('#rVal').textContent=reward.toFixed(2); root.querySelector('#scaffTxt').textContent=scaffs[step]; root.querySelector('#outTxt').textContent=outs[step]; root.querySelector('#stepOut').innerHTML='Step '+step+' — <b>scaffold mutated</b>; reward propagated to both stages.'; resize(); }); } root.querySelector('#stepBtn').addEventListener('click',doStep); root.querySelector('#autoBtn').addEventListener('click',function(){ if(timer){clearInterval(timer);timer=null;this.textContent='Auto-run <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/25b6.png" alt="▶" class="wp-smiley" style="height: 1em; max-height: 1em;" />';return;} this.textContent='Pause <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/23f8.png" alt="⏸" class="wp-smiley" style="height: 1em; max-height: 1em;" />';var b=this; timer=setInterval(function(){if(step>=5){clearInterval(timer);timer=null;b.textContent='Auto-run <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/25b6.png" alt="▶" class="wp-smiley" style="height: 1em; max-height: 1em;" />';}else{doStep();}},1400); }); root.querySelector('#resetBtn').addEventListener('click',function(){ if(timer){clearInterval(timer);timer=null;root.querySelector('#autoBtn').textCon

原始來源：MarkTechPost AI ↗

查看原始來源

Hugging Face Blog生成式AI

一鍵在 HF Jobs 上啟動 vLLM 伺服器

你現在可以透過單一指令，在 Hugging Face 基礎架構上啟動一個私有的、相容於 OpenAI 的 LLM 端點——無需佈建伺服器、不需要 Kubernetes，按秒計費。啟用後，你可以從筆電、筆記本或任何地方對其進行查詢。這是為測試、評估或批次生成快速啟動模型的最快方式。（如果你需要的是受管理的、可立即上線的服務，Inference Endpoints 才是你的選擇——文末會說明何時該選哪一種。）以下是完整流程。前置需求：需有付款方式或正向預付餘額（Jobs 按硬體使用量以每分鐘計費），以及 huggingface_hub >= 1.20.0。

剛剛閱讀分析

智東西生成式AI

ChatGPT不拼智商拼情商了？GPT-5.5 Instant更新，明天開始免費使用

這篇消息聚焦「ChatGPT不拼智商拼情商了？GPT-5.5 Instant更新，明天開始免費使用」。原始導語提到：GPT-5.5 Instant更新上線，重點提升建議、決策和日常對話體驗。從 AI 情報角度來看，這類內容值得關注其背後的技術進展、產品落地、產業競爭與後續市場影響。

52 分鐘前閱讀分析

36氪生成式AI

月之暗面黃震昕：Kimi的目標是和海外三家模型掰手腕

這篇消息聚焦「月之暗面黃震昕：Kimi的目標是和海外三家模型掰手腕」。原始導語提到：企業級AI難點並不在模型廠商這一側，在於如何去切入和推進企業完成AI轉型。從 AI 情報角度來看，這類內容值得關注其背後的技術進展、產品落地、產業競爭與後續市場影響。

1 小時前閱讀分析

36氪生成式AI

跟Claude談個戀愛怎麼了？Nature最新研究：真能給人聊傻了

這篇消息聚焦「跟Claude談個戀愛怎麼了？Nature最新研究：真能給人聊傻了」。原始導語提到：別把AI當老公，容易聊出精神病從 AI 情報角度來看，這類內容值得關注其背後的技術進展、產品落地、產業競爭與後續市場影響。

6 小時前閱讀分析

36氪生成式AI

Fable 5即將復活，代碼已曝光？Anthropic CEO被白宮踢出來了

剛被「封印」，Fable 5就要滿血復活？最近，Claude Fable 5代碼痕跡曝光，開發者圈一片歡呼，而外媒爆料，Anthropic最近一路順利，背後竟是因為CEO被白宮趕下談判桌！

6 小時前閱讀分析

雷峰網生成式AI

這次是阿里！中國的大模型團隊快被 Anthropic 告完了

這是Anthropic迄今控訴的最大規模“模型蒸餾”案。作者丨高允毅編輯丨馬曉寧 01Anthropic已經告了四家中國AI公司短短四個月，四家中國頂級AI公司被Anthropic接連點名，且沒有停手的跡象。這一次，輪到阿里。2026年6月10日，Anthropic向美國參議院銀行委員會遞交了一封信，矛頭直指阿里Qwen團隊。報告披露了一串數字：從4月22日到6月5日，整整45天，阿里相關運營者利用2.5萬個賬號，完成了2880萬次交互。這是Anthropic迄今公開的最大規模“模型蒸餾”數據。2880萬次對話是什麼概念？放一個行業參照：目前主流的高質量SFT（監督微調）數據集，規模通常在數十萬到幾百萬條之間。2880萬次針對核心能力的定向交互，足以在特定任務域內，低成本“提純”出一個極具競爭力的專用模型。這引起了Anthropic的高度警惕。在他們看來，對方的行為目標極其精準，刀刀直指其最新旗艦模型 Mythos Preview 的核心底牌，軟件工程與智能體推理能力。Anthropic在信中將其定性為“迄今為止，中國公司試圖搭美國頂尖實驗室便車的最大規模嘗試”。梳理時間線可以發現，Anthropic的反擊正在顯著升級。2026年2月23日，Anthropic發佈了一篇博客文章《Detecting and Preventing Distillation Attacks》，公開點名三家中國AI實驗室：DeepSeek、月之暗面（Kimi）、MiniMax（稀宇科技）。報告顯示，約2.4萬個中國相關賬號對Claude發起了超過1600萬次交互，其中MiniMax超1300萬，月之暗面超340萬，DeepSeek超15萬。從1600萬次到2880萬次，規模在翻倍，Anthropic的反擊，也從2月份的“技術曝光”，升級到6月份“政治施壓”。而這次的收信人，銀行委員會主席蒂姆·

6 小時前閱讀分析

相關文章

一鍵在 HF Jobs 上啟動 vLLM 伺服器

ChatGPT不拼智商拼情商了？GPT-5.5 Instant更新，明天開始免費使用

月之暗面黃震昕：Kimi的目標是和海外三家模型掰手腕

跟Claude談個戀愛怎麼了？Nature最新研究：真能給人聊傻了

Fable 5即將復活，代碼已曝光？Anthropic CEO被白宮踢出來了

這次是阿里！中國的大模型團隊快被 Anthropic 告完了