MarkTechPost AI | Generative AI

StepFun Releases Step 3.7 Flash: A 198B MoE Vision-Language Model for Coding Agents and Search Workflows

May 29, 2026, 21:25

Key Summary

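StepFun today released Step 3.7 Flash, a multimodal Mixture-of-Experts model focused on agentic use cases. It builds on Step 3.5 Flash by adding native vision input and more reliable tool use. What is Step 3.7 Flash? It is a 198B-parameter sparse Mixture-of-Experts (MoE) vision-language model that pairs a 196B-parameter language backbone with a 1.8B-parameter vision encoder (ViT) for native image understanding. The model activates roughly 11B parameters per token during inference. In an MoE architecture, each forward pass triggers only a subset of the "expert" sub-networks rather than the whole network, which keeps inference compute close to that of an 11B dense model while retaining a total parameter budget of 198B. Key specs: 198B total parameters (196B language + 1.8B ViT) and about 11B active parameters per token.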
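To make the sparse-activation idea concrete, here is a minimal sketch of top-k expert routing in plain Python. Only the 198B total and roughly 11B active figures come from the announcement. The expert count (8), the top-k value (2), the toy experts, and every function name are illustrative assumptions, since the article does not say how many experts Step 3.7 Flash has or how its router is built.

```python
import math

# Figures from the announcement (billions of parameters).
TOTAL_PARAMS_B = 198.0
ACTIVE_PARAMS_B = 11.0


def active_fraction(active_b, total_b):
    """Share of the total parameters that is used for a single token."""
    return active_b / total_b


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_scores, k):
    """Pick the k highest-scoring experts and renormalize their weights.

    Returns (expert_index, weight) pairs whose weights sum to 1.
    """
    probs = softmax(router_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(x, experts, router_scores, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    selected = route_top_k(router_scores, k)
    output = sum(weight * experts[i](x) for i, weight in selected)
    return output, selected


# Hypothetical toy experts: expert i multiplies its input by (i + 1).
EXPERTS = [lambda x, scale=i + 1: x * scale for i in range(8)]

if __name__ == "__main__":
    share = active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)
    print(f"Active share per token: {share:.1%}")
    scores = [0.0, 0.0, 5.0, 0.0, 0.0, 5.0, 0.0, 0.0]
    out, chosen = moe_forward(1.0, EXPERTS, scores, k=2)
    print("Experts used:", [i for i, _ in chosen], "output:", out)
```

In this toy setup only 2 of the 8 experts run for a given input, which is the same mechanism that lets a 198B-parameter model spend roughly 11B parameters' worth of compute per token.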

On-site AI-compiled draft

StepFun today released Step 3.7 Flash, a multimodal Mixture-of-Experts model targeting agentic use cases. It adds native vision input and improved tool-use reliability over Step 3.5 Flash.

What is Step 3.7 Flash?

Step 3.7 Flash is a 198B-parameter sparse Mixture-of-Experts (MoE) vision-language model. It pairs a 196B-parameter language backbone with a 1.8B-parameter vision encoder (ViT) for native image understanding. The model activates approximately 11B parameters per token during inference. In MoE architectures, only a subset of “expert” sub-networks fires per forward pass, not the full network. This keeps inference compute closer to that of an 11B dense model while maintaining a 198B total parameter budget.

Key specs:

| Spec | Value |
| --- | --- |
| Total parameters | 198B (196B language + 1.8B ViT) |
| Active parameters per token | ~11B |
| Context window | 256k tokens |
| Throughput | Up to 400 tokens/sec |
| Reasoning levels | Low, medium, high |
| License | Apache 2.0 |

Architecture Notes

The vision encoder runs as a separate 1.8B ViT module. It injects image representations into the language backbone’s context. Step 3.5 Flash had no multimodal support; this is a new addition in 3.7.

Three selectable reasoning depths (low, medium, and high) let developers trade latency for reasoning depth. Low is faster and cheaper; high applies more computation per response.

Agentic Coding Performance

On SWE-Bench Pro, Step 3.7 Flash scores 56.26%, up from Step 3.5 Flash’s 51.3%, a gain of roughly 5 percentage points. On Terminal-Bench 2.1, it scores 59.55%, up from 53.37%. On SWE-MTLG (a multi-task long-generation coding benchmark), it scores 72.42%.

Cross-harness consistency on StepFun’s internal Step-SWE-Bench:

| Scaffold | Step 3.7 Flash | Step 3.5 Flash |
| --- | --- | --- |
| Hermes Agent | 67.5% | 60.0% |
| OpenClaw | 67.0% | 47.0% |
| KiloCode | 67.5% | 59.0% |
| RooCode | 64.5% | 43.0% |
| Claude Code | 71.5% | 73.0% |
| OpenCode | 64.5% | 57.0% |

Step 3.5 Flash ranged from 43% to 73% across harnesses, a 30-point spread. Step 3.7 Flash ranges from 64.5% to 71.5%, a 7-point spread. Claude Code is the only scaffold in the table where Step 3.5 Flash scored higher.
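The consistency claim can be checked directly from the reported Step-SWE-Bench numbers. The sketch below recomputes the minimum, maximum, and spread for each model. The scores are copied from StepFun's figures as quoted here; the dictionary layout and the function names are illustrative choices.

```python
from statistics import pstdev

# Step-SWE-Bench scores (%) per scaffold, as reported by StepFun:
# scaffold -> (Step 3.7 Flash, Step 3.5 Flash)
SCORES = {
    "Hermes Agent": (67.5, 60.0),
    "OpenClaw": (67.0, 47.0),
    "KiloCode": (67.5, 59.0),
    "RooCode": (64.5, 43.0),
    "Claude Code": (71.5, 73.0),
    "OpenCode": (64.5, 57.0),
}


def summarize(values):
    """Return min, max, spread (max - min), and population std deviation."""
    lowest, highest = min(values), max(values)
    return {
        "min": lowest,
        "max": highest,
        "spread": highest - lowest,
        "stdev": pstdev(values),
    }


def regressions(scores):
    """Scaffolds where the newer model scored below the older one."""
    return [name for name, (new, old) in scores.items() if new < old]


step_37 = summarize([new for new, _ in SCORES.values()])
step_35 = summarize([old for _, old in SCORES.values()])

if __name__ == "__main__":
    for label, stats in (("Step 3.7 Flash", step_37), ("Step 3.5 Flash", step_35)):
        print(
            f"{label}: {stats['min']}% to {stats['max']}% "
            f"(spread {stats['spread']:.1f} points, stdev {stats['stdev']:.2f})"
        )
    print("Scaffolds where 3.5 Flash was higher:", regressions(SCORES))
```

Spread is the simplest measure of harness sensitivity; the standard deviation is included because a single outlier scaffold can dominate a min-to-max range.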
In production, coding agents often run inside heterogeneous scaffolds, each with its own prompting conventions and tool schemas. Narrower per-harness variance means more predictable behavior across different setups.

Advisor Mode

Step 3.7 Flash supports Advisor Mode, StepFun’s implementation of the advisor strategy described by Anthropic. The model runs the agentic loop end-to-end (calling tools, reading results, iterating) and escalates to a larger advisor model only at specific inflection points, such as planning or recovering from repeated failures. Most of the run stays at executor cost.

With Advisor Mode enabled on SWE-Bench Verified, StepFun reports that Step 3.7 Flash reaches 97% of Claude Opus 4.6’s coding performance at roughly one-ninth the per-task cost ($0.19 vs. $1.76 per task). These are StepFun’s internal figures.

Multimodal Capabilities

Step 3.7 Flash supports two visual tool pathways:

Visual Search Tool: For recognition tasks where the model’s parametric knowledge is insufficient (long-tail entities, recently emerged concepts), it invokes a visual search tool to retrieve and verify. On SimpleVQA (with Search), it scores 79.16%, comparable to GPT 5.5 (79.11%) and above Kimi K2.6 (78.24%) and GLM 5V Turbo (78.20%).

Python Tool: For fine-grained visual tasks (high-resolution images, visual probing, bounding-box analysis), it uses a code interface to crop, zoom, and draw pixels or bounding boxes. On V* (a self-tested score with Python), it scores 95.29%. On HR-Bench 4K and HR-Bench 8K, it scores 89.13% and 86.34% respectively.

StepFun notes a behavior observed during testing: the model combined visual tools with non-visual tools without being explicitly trained to do so. For example, after generating frontend code, it used the GUI to render and inspect the result before iterating. StepFun describes this as emergent compositional tool use.
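As a rough illustration of what crop, zoom, and box drawing through a code interface might look like, the sketch below operates on an image represented as a nested list of pixel values. The article does not describe the tool's actual interface, so `crop`, `zoom`, `draw_box`, and the bounding-box convention (left, top, right, bottom, with exclusive right and bottom edges) are all hypothetical. A real tool would presumably use an imaging library rather than nested lists.

```python
def _check_box(image, box):
    """Validate that box = (left, top, right, bottom) fits inside image."""
    left, top, right, bottom = box
    height, width = len(image), len(image[0])
    if not (0 <= left < right <= width and 0 <= top < bottom <= height):
        raise ValueError(f"box {box} does not fit a {width}x{height} image")


def crop(image, box):
    """Return the sub-image inside the box (right/bottom are exclusive)."""
    _check_box(image, box)
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def zoom(image, factor):
    """Upscale by an integer factor using nearest-neighbor repetition."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    zoomed = []
    for row in image:
        wide_row = [pixel for pixel in row for _ in range(factor)]
        # Repeat each widened row `factor` times, copying so rows stay independent.
        zoomed.extend(list(wide_row) for _ in range(factor))
    return zoomed


def draw_box(image, box, value):
    """Return a copy of image with the outline of the box set to `value`."""
    _check_box(image, box)
    left, top, right, bottom = box
    out = [list(row) for row in image]
    for x in range(left, right):
        out[top][x] = value
        out[bottom - 1][x] = value
    for y in range(top, bottom):
        out[y][left] = value
        out[y][right - 1] = value
    return out


if __name__ == "__main__":
    # A 4x4 test image whose pixel value encodes its position.
    image = [[row * 4 + col for col in range(4)] for row in range(4)]
    region = crop(image, (1, 1, 3, 3))
    for line in zoom(region, 2):
        print(line)
```

Cropping before zooming keeps the enlarged region small, which is the practical reason to pair the two operations when inspecting fine detail in a high-resolution image.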
On Android Daily (long-horizon phone UI task completion), Step 3.7 Flash scores 61.87%, ahead of Kimi K2.6 (53.36%) and GLM 5V Turbo (51.68%). Gemini 3 Flash (63.21%) leads this benchmark.

Search and Research Benchmarks

StepFun focused this model’s search design on planning, evidence filtering, and synthesis, integrating search into the reasoning loop rather than treating it as a separate add-on.

| Benchmark | Step 3.7 Flash | Notable comparison |
| --- | --- | --- |
| HLE with Tools (acc) | 47.20% | DeepSeek V4 Flash: 45.10% |
| BrowseComp (acc) | 75.82% | Claude Opus 4.7: 79.30% |
| DeepSearchQA (F1) | 92.82% | Kimi K2.6: 92.50% |
| ResearchRubrics (score) | 71.68% | GPT 5.5: 61.50% |

Note: The HLE with Tools score of 47.20% compares to Step 3.5 Flash’s text-only score of 35.68%. Step 3.5 Flash did not support tool-augmented evaluation on HLE.

General Agent Benchmarks

| Benchmark | Step 3.7 Flash | Description |
| --- | --- | --- |
| Toolathlon | 49.51% | Multi-tool coordination |
| ClawEval-1.1 | 67.07% | Daily autonomous task execution in realistic environments |
| GDPval (44 occupations) | 45.8% | General professional task execution |
| Tau2-bench Telecom | >98% | Across different reasoning difficulty tiers |

On ClawEval-1.1, Step 3.7 Flash (67.07%) leads DeepSeek V4 Flash (57.80%) and DeepSeek V4 Pro (59.80%) among the compared models.

Long-Context Performance

On AA-LCR (a long-context retrieval benchmark, avg@16/acc), Step 3.7 Flash scores 63.94%. This is comparable to DeepSeek V4 Flash (63.70%) and DeepSeek V4 Pro (66.30%).
Pricing

| Token Type | Price |
| --- | --- |
| Input (cache miss) | $0.20 / M tokens |
| Input (cache hit) | $0.04 / M tokens |
| Output | $1.15 / M tokens |
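To turn the price list into a per-request estimate, a small calculator along these lines may help. The three prices are the ones listed in the pricing section; the function name, the way input tokens are split into cache hits and misses, and the example token counts are illustrative assumptions rather than StepFun's billing rules.

```python
from decimal import Decimal

# USD per million tokens, from the pricing section.
PRICE_INPUT_CACHE_MISS = Decimal("0.20")
PRICE_INPUT_CACHE_HIT = Decimal("0.04")
PRICE_OUTPUT = Decimal("1.15")
MILLION = Decimal(1_000_000)


def estimate_cost(input_tokens, output_tokens, cached_input_tokens=0):
    """Estimate the USD cost of one request.

    `cached_input_tokens` is the part of `input_tokens` served from cache
    and billed at the cache-hit rate; the rest is billed as cache miss.
    """
    if min(input_tokens, output_tokens, cached_input_tokens) < 0:
        raise ValueError("token counts must be non-negative")
    if cached_input_tokens > input_tokens:
        raise ValueError("cached_input_tokens cannot exceed input_tokens")
    uncached = input_tokens - cached_input_tokens
    total = (
        uncached * PRICE_INPUT_CACHE_MISS
        + cached_input_tokens * PRICE_INPUT_CACHE_HIT
        + output_tokens * PRICE_OUTPUT
    )
    return total / MILLION


if __name__ == "__main__":
    # Hypothetical agent turn: 100k-token prompt, 80k of it cached, 5k output.
    cost = estimate_cost(100_000, 5_000, cached_input_tokens=80_000)
    print(f"Estimated cost: ${cost}")
```

`Decimal` is used instead of floats so that small per-request amounts add up exactly when summed over many calls. With a cache hit priced at one-fifth of a miss, long agent runs that resend a mostly unchanged context are where the cached rate matters most.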
