MarkTechPost AI | Generative AI

Anthropic Claude Sonnet 5 vs Sonnet 4.6 vs Opus 4.8: Agentic Coding Benchmarks, API Pricing, and Cost-Performance Tradeoffs Compared

June 30, 2026, 21:37


Site AI-compiled version

Anthropic has just shipped Claude Sonnet 5, which it calls its most agentic Sonnet model yet. The model plans, drives browsers and terminals, and runs autonomously across long tasks. Sonnet 5 is the default model for Free and Pro plans as of today; Max, Team, and Enterprise users can select it. It is also live in Claude Code and on the Claude Platform.

TL;DR

Sonnet 5 is Anthropic's most agentic mid-tier model, closing much of the gap to Opus 4.8.
It beats Sonnet 4.6 on every published benchmark: 63.2% on SWE-bench Pro, 81.2% on OSWorld-Verified, and 57.4% on HLE.
It is cheaper to run: $2/$10 per MTok (input/output) intro pricing through Aug 31, then $3/$15; Opus 4.8 is $5/$25.
It offers the best value at low or medium effort; at xhigh it can cost more than Opus 4.8 for similar quality.
It is safer than 4.6, with deliberately low cyber capability; Opus stays the pick for accuracy-critical work.

Claude Sonnet 5

Sonnet sits in the middle of Anthropic's lineup, above the cheaper Haiku 4.5 and below the flagship Opus 4.8. Sonnet 5 is an upgrade to Sonnet 4.6, which launched in February 2026.

Anthropic frames this release around agentic reliability, not one headline benchmark. In practice, that means longer task chains without losing context, better self-correction when a tool call fails, and steadier behavior across extended sessions inside Claude Code or Cowork.

The model exposes four effort levels: low, medium, high, and xhigh (extra high). Higher effort spends more tokens on reasoning, which raises both quality and cost.

Sonnet 5 uses an updated tokenizer, the same one introduced with Opus 4.7. The same text can map to roughly 1.0 to 1.35 times as many tokens, so the token count for an identical prompt, and with it the bill, can rise by up to about 35%.
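To make the tokenizer caveat concrete, here is a minimal Python sketch of how the token factor interacts with the list prices quoted above. The function name `relative_cost` and the constant names are hypothetical. The sketch assumes the same text is sent to both models and that output length in words is unchanged, which ignores any difference in effort level.

```python
# Input/output list prices in USD per million tokens, as quoted in the article.
SONNET_46 = (3.0, 15.0)
SONNET_5_INTRO = (2.0, 10.0)
SONNET_5_STD = (3.0, 15.0)


def relative_cost(new_price, old_price, token_factor):
    """Return (input_ratio, output_ratio): cost of the same text on the new
    model relative to the old one.

    token_factor is how many times as many tokens the new tokenizer emits
    for the same text (the article gives roughly 1.0 to 1.35).
    Illustration-only assumption: identical text in and out on both models.
    """
    new_in, new_out = new_price
    old_in, old_out = old_price
    in_ratio = token_factor * new_in / old_in
    out_ratio = token_factor * new_out / old_out
    return in_ratio, out_ratio


# Sweep the low end, a midpoint, and the high end of the quoted range.
for factor in (1.0, 1.15, 1.35):
    intro = relative_cost(SONNET_5_INTRO, SONNET_46, factor)
    std = relative_cost(SONNET_5_STD, SONNET_46, factor)
    print(
        f"factor {factor:.2f}: intro {intro[0]:.2f}x, "
        f"standard {std[0]:.2f}x the Sonnet 4.6 cost"
    )
```

Under these assumptions, intro pricing stays below the Sonnet 4.6 cost across the whole quoted range (about 0.90 times at a factor of 1.35). At standard pricing, any factor above 1.0 makes the same text cost more than it did on Sonnet 4.6.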
Interactive Explainer: Claude Sonnet 5 Cost & Capability Explorer

Estimate per-task cost across models and compare published benchmarks. All figures from Anthropic's June 30, 2026 launch.

Per-task cost estimator

Models and prices (input/output, USD per MTok): Sonnet 5 intro $2/$10; Sonnet 5 standard $3/$15; Opus 4.8 $5/$25; Sonnet 4.6 $3/$15.
Default workload: 20,000 input tokens per task, 6,000 output tokens per task, 500 tasks per day, Sonnet 5 tokenizer factor 1.15×.
The estimator reports cost per task, per day, and per month (30 days), and compares the selected model with the others at the same workload.
Sonnet 5 uses an updated tokenizer (same as Opus 4.7). The same text can map to roughly 1.0 to 1.35 times as many tokens, so the factor is applied to Sonnet 5 only.

Published benchmark comparison (scores in %)

Benchmark                          Sonnet 4.6   Sonnet 5   Opus 4.8
Agentic coding (SWE-bench Pro)     58.1         63.2       69.2
Terminal-Bench 2.1                 67.0         80.4       n/a
Computer use (OSWorld-Verified)    78.5         81.2       n/a
Humanity's Last Exam (tools)       46.8         57.4       57.9

On knowledge work (GDPval-AA v2), Sonnet 5 scores 1,618 and edges Opus 4.8's 1,615. That benchmark uses a different scale, so the explainer shows it as a note rather than a bar.

Interactive explainer by Marktechpost. Figures: Anthropic launch and system card, June 30, 2026.
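The explainer's per-task estimator reduces to simple arithmetic, sketched below in Python with its default workload (20,000 input tokens, 6,000 output tokens, 500 tasks per day, tokenizer factor 1.15). The model keys and the `cost_per_task` name are hypothetical labels chosen for this sketch. As in the explainer, the tokenizer factor is applied to Sonnet 5 only, and a month is taken as 30 days.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    input_per_mtok: float   # USD per million input tokens
    output_per_mtok: float  # USD per million output tokens


# Prices as quoted in the article; the keys are hypothetical labels.
PRICES = {
    "sonnet-5-intro": Price(2.0, 10.0),
    "sonnet-5-std": Price(3.0, 15.0),
    "opus-4.8": Price(5.0, 25.0),
    "sonnet-4.6": Price(3.0, 15.0),
}

# Following the explainer, only Sonnet 5 gets the tokenizer factor.
NEW_TOKENIZER = {"sonnet-5-intro", "sonnet-5-std"}


def cost_per_task(model, input_tokens, output_tokens, tokenizer_factor=1.15):
    """Estimated USD cost of one task for the given model."""
    factor = tokenizer_factor if model in NEW_TOKENIZER else 1.0
    price = PRICES[model]
    input_cost = (input_tokens * factor / 1e6) * price.input_per_mtok
    output_cost = (output_tokens * factor / 1e6) * price.output_per_mtok
    return input_cost + output_cost


TASKS_PER_DAY = 500
DAYS_PER_MONTH = 30  # the explainer multiplies the daily cost by 30

for model in PRICES:
    per_task = cost_per_task(model, 20_000, 6_000)
    per_day = per_task * TASKS_PER_DAY
    per_month = per_day * DAYS_PER_MONTH
    print(f"{model:16s} ${per_task:.4f}/task  ${per_day:,.2f}/day  ${per_month:,.2f}/month")
```

This counts list-price token costs only. It does not model effort levels, which the article says change how many reasoning tokens a task consumes, so real workloads at high or xhigh effort would need larger output-token inputs.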
