Hugging Face Blog: Model Updates

Introducing North Mini Code: Cohere's First Model for Developers

June 9, 2026, 15:56

Key Points

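Enterprise article, published June 9, 2026, by the Cohere Code Agents Team (coherecode, CohereLabs). Today, we are releasing North Mini Code, a 30-billion-parameter Mixture-of-Experts model with 3 billion active parameters and strong agentic coding capabilities, available on Hugging Face under the Apache 2.0 license. North Mini Code is the first in Cohere's new family of models, designed and trained specifically for agentic software engineering tasks. Figure 1: North Mini Code's performance on agentic coding tasks and complex code generation benchmarks, compared with leading open-source models of similar size. See our benchmarking methodology for details.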

Site AI-Compiled Version

Today, we are releasing North Mini Code, a 30B-parameter Mixture-of-Experts model with 3B active parameters and powerful agentic coding capabilities, available on Hugging Face under the Apache 2.0 license. North Mini Code is the first model in Cohere’s new family of models, and is specifically designed and trained for agentic software engineering tasks.

Figure 1: North Mini Code’s performance on agentic coding tasks and complex code generation benchmarks, compared to leading open-source models of similar size. Details are given in our benchmarking methodology.

North Mini Code is optimized for complex software engineering workflows, terminal-based agentic tasks, and high-quality code generation. On Artificial Analysis’ Coding Index, North Mini Code achieves a score of 33.4, outperforming Qwen3.5 (35B-A3B), Gemma 4 (26B-A4B), Devstral Small 2 (24B Dense), and even substantially larger models such as Nemotron 3 Super (120B-A12B), Mistral Small 4 (119B-A6B), and Devstral 2 (123B).¹ It ranks among the strongest open-source coding models in its size class.

Try North Mini Code in OpenCode

Real-world code agents depend on model quality and robustness across agent harnesses. We trained North Mini Code using multiple scaffolds rather than optimizing for a single one. This approach enables North Mini Code to serve as a reliable foundation for coding agents such as OpenCode.

Architecture

Figure 2: North Mini Code is a Mixture-of-Experts Transformer decoder with interleaved sliding-window self-attention and full self-attention.

North Mini Code is a decoder-only Transformer-based sparse Mixture-of-Experts model.
It uses our efficient attention implementation, which interleaves sliding-window attention with RoPE and global attention with no positional embeddings in a 3:1 ratio [1]. The feed-forward block is an MoE block with 128 experts, of which 8 are activated per token. Each expert is an FFN block with SwiGLU activation. The router applies a sigmoid activation function to the logits before the top-k selection. We also use a single dense layer before the sparse layers.

Post-Training for Coding Excellence

Figure 3: The post-training pipeline is made up of two phases of supervised fine-tuning (SFT) and a phase of agentic reinforcement learning with verifiable rewards (RLVR) targeting software engineering and terminal tasks.

We post-train North Mini Code using two-stage cascaded supervised fine-tuning (SFT) followed by reinforcement learning with verifiable rewards (RLVR), focusing on agentic coding.

Our first-stage SFT data focuses on coding capabilities integrated within a wider mix for robustness and usability. The data mix includes programming, reasoning, and instruction following across a large variety of domains. Code datasets account for 70% of trainable tokens: 43% agentic tool-use data and 27% single-turn competitive or scientific programming data. In the second SFT stage, we use a 4.5-billion-token data mixture drawn only from agentic and reasoning-driven samples, in which code data forms 61% of trainable tokens. This mixture comprises our highest-quality data across coding and wider agentic tasks, where tool calls and completions are verified as executable and correct.

Our internal data pipeline relies heavily on containerised agentic coding environments. We maintain a disjoint subset of these environments for use in synthetic SFT data generation and RLVR. The majority are based on software engineering tasks from real-world repositories, while the rest are terminal-based agentic tasks sourced from open-source and internal datasets.
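
Returning briefly to the Architecture section: the routing step it describes (a sigmoid applied to the router logits, then top-k selection of 8 out of 128 experts) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not Cohere's implementation. The function names `sigmoid` and `route` are hypothetical, the logits are random stand-ins for a learned router's output, and renormalising the selected scores so they sum to 1 is an assumption that the article does not confirm.

```python
import math
import random

NUM_EXPERTS = 128  # stated in the article
TOP_K = 8          # stated in the article


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def route(logits: list[float], k: int = TOP_K) -> tuple[list[int], list[float]]:
    """Pick the k highest-scoring experts for a single token.

    Returns the chosen expert indices (best first) and their mixing weights.
    """
    # Sigmoid is applied to every logit before the top-k selection.
    scores = [sigmoid(z) for z in logits]
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # ASSUMPTION: the selected scores are renormalised to sum to 1.
    # The article does not say how expert outputs are weighted.
    total = sum(scores[i] for i in chosen)
    weights = [scores[i] / total for i in chosen]
    return chosen, weights


if __name__ == "__main__":
    rng = random.Random(0)
    # Random logits stand in for the output of a learned router projection.
    token_logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
    experts, weights = route(token_logits)
    print("active experts:", experts)
    print("mixing weights:", [round(w, 3) for w in weights])
```

Because each sigmoid score depends only on its own logit, the experts do not compete through a shared softmax normaliser before selection; whether and how the selected scores are rescaled afterwards is a separate design choice.
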
In total, we used over 70k verifiable tasks across ~5k unique repositories. We deduplicate our environments against the repository sources from SWE-Bench [2] and SWE-Bench-Pro [3] to avoid source leakage during evaluation [4].

We used 64K and 128K context lengths for the first and second stages of SFT, respectively. This “long-to-longer” cascade approach (similar to [5, 6]) enables two-part training: first on valuable shorter data to establish a robust performance baseline, then targeted long-context training only on high-quality verified samples. Without multi-stage training, the 20B non-code tokens during the initial training stage often dominated the 1.5B tokens of high-quality code data in later training, producing poorer performance and more behavioral conflicts because data trends differed between stages. Anecdotally, training on a near-complete length distribution of samples produced shorter final trajectories during evaluation than training on a truncated distribution up to 64K only.

Instead of optimising North Mini Code towards quantitative metrics during SFT, we used SFT strictly as priming for RLVR. The data mixture optimises sampling diversity and pass@K (for high K) in downstream stages. We use sample-level filtering to remove pathologies such as invalid tool calls, erroneous whitespace generation, malformed special tokens, or hallucinated citations. Artifacts or hyperparameters producing undesirable RLVR behaviours (e.g., low entropy, invalid structured generations) were pruned via ablations. The final SFT model achieves 80.2% pass@10 on SWE-Bench Verified [2] and 55.1% pass@10 on Terminal-Bench v2 [7].

Robustness Across Harnesses

Harness robustness improves model usability in realistic software development settings, where agents encounter diverse and unpredictable tooling environments.
These environments differ not just in prompting but in fundamental tool-use modality. For instance, SWE-Agent [8] exposes a relatively rich agent-CLI interface with specialized commands (bash, str_replace_editor, and submit tools) and templated observations; mini-SWE-agent [9] strips this down to a single bash tool, with raw stdout from the shell as the only feedback; and OpenCode [10] uses fine-grained, individually typed tools (edit, grep, todowrite, task, etc.) returning structured JSON responses.

Figure 4: To power a variety of agentic coding harnesses, North Mini Code is exposed to a variety of coding harnesses during the second SFT stage.

We address cross-harness generalization by introducing a small amount of additional benchmark harness data (6% of the SFT mix, compared to 50% for the chosen SWE-Agent harness) during the second SFT stage. Specifically, this data mix yields a 10% gain on evaluation with the OpenCode harness while maintaining performance with SWE-Agent on SWE-Bench Verified, demonstrating that cross-harness transfer can be cheaply acquired without degrading benchmark performance. Notably, North Mini Code achieves 61.0% pass@1 using mini-SWE-agent, an improvement that emerged for free in the cross-task, cross-harness setting, suggesting that harnesses with overlapping tool capabilities share enough representational structure for positive transfer. We also observe minimal data conflict when training on hybrid harness data, indicating that skills required by different harnesses are usually complementary rather than contradictory.

Similarly, the official Terminal-Bench uses its own Terminus 2 harness, where all agent-CLI interactions are communicated via plain-text chat turns (instead of native tool calling). To prime our models on Terminus 2, we include a small amount of data (less than 20%) in a plain-text format in the data mixture, which has proved sufficient for the model to generalise naturally across formats.
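
The pass@1 and pass@10 figures quoted in this article are not accompanied by a description of how they were computed. A common choice is the unbiased pass@k estimator, which draws n samples per task, counts the c that pass, and computes 1 - C(n-c, k) / C(n, k). The sketch below assumes that estimator; the function names `pass_at_k` and `benchmark_pass_at_k` and the sample counts are hypothetical and are not taken from Cohere's evaluation code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task.

    n: samples drawn, c: samples that passed, k: attempt budget (k <= n).
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a success.
        return 1.0
    # Probability that k samples drawn without replacement all fail.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-task estimate over (n, c) pairs, one pair per task."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical counts: 10 samples per task, with 3, 0 and 10 passing.
    tasks = [(10, 3), (10, 0), (10, 10)]
    print("pass@1 :", benchmark_pass_at_k(tasks, 1))
    print("pass@10:", benchmark_pass_at_k(tasks, 10))
```

With k = 1 the estimate reduces to the plain success rate c / n, while a large k rewards a model that solves a task at least occasionally, which is why a high-K pass@K is a natural target for an SFT stage meant to prime RLVR.
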
Interestingly, we also find that it’s crucial to introduce sufficient variation across harnesses (akin to data augmentation) to force the model to properly establish the link between instructions and behaviours rather than simply regurgitating a fixed template without understanding, and this is especially impo
