Is it agentic enough? Benchmarking open models on your own tooling
重點摘要
Back to Articles Is it agentic enough? Benchmarking open models on your own tooling Published June 18, 2026 Update on GitHub Upvote 2 Lysandre lysandre Follow Nathan Habib SaylorTwift Follow Pedro Cuenca pcuenq Follow Benchmarking transformers revisions across different metrics This is a human-made, agent-focused blogpost. Coding agents increasingly work with our software instead of us: describe a task, and the agent picks the library, writes the calls, runs them, and debugs its own mistakes. When the library gets in the way, it will happily bypass it and rewrite the logic from scratch. This introduces a new concept in library development: the code should not only be correct and fast, but should be designed so that an agent can drive it effectively. A clunky API or stale docs annoy us deve
Back to Articles Is it agentic enough? Benchmarking open models on your own tooling Published June 18, 2026 Update on GitHub Upvote 2 Lysandre lysandre Follow Nathan Habib SaylorTwift Follow Pedro Cuenca pcuenq Follow Benchmarking transformers revisions across different metrics This is a human-made, agent-focused blogpost. Coding agents increasingly work with our software instead of us: describe a task, and the agent picks the library, writes the calls, runs them, and debugs its own mistakes. When the library gets in the way, it will happily bypass it and rewrite the logic from scratch. This introduces a new concept in library development: the code should not only be correct and fast, but should be designed so that an agent can drive it effectively. A clunky API or stale docs annoy us developers, but it now also sends the agent down a longer, more expensive path. Most benchmarks just look at the final answer. We wanted the whole process instead: not just whether the agent got it right, but how much work it took to get there, and how that shifts across models, library revisions, and tasks. We measured exactly that, using transformers as our case study. Here, we will introduce a tool specific benchmark focusing on how the answer was found, and provide a simple implementation of one such harness, running entirely on open models driven by the pi coding agent, with the full sweep of models × revisions × tasks fanned out across Hugging Face Jobs so every run sees identical hardware. But, how do you optimize software for agents? We're strong believers in the following two software principles: If it isn't tested, then it doesn't work If it isn't documented, then it doesn't exist This remains the same within the realm of agentic-optimized tooling, and, for once, the two are directly tied to each other. You want your tool to exist for an agent: it needs to be discoverable. The API needs to be clear and the docs need to be extensive. They need to be structured in a way that the agent has rapid access to the useful files and examples. If you want your tool to work for an agent, then you should test it for agentic-use. Testing software for agentic-use We'll use transformers as an example throughout this blogpost: agents using it to solve ML tasks (classifying text, captioning images, transcribing audio), not contributing code to it; though the harness was designed to work with any tool that can be operated from the command line. Our intuition on transformers was that usage could be dramatically simplified with a few changes: a CLI, a Skill, and self-contained, task-specific examples. This is the same recipe recently applied to the hf CLI, redesigned to be agent-optimized, where agents used 1.3–1.8× (and up to 6×) fewer tokens. We wanted to know whether that kind of win generalizes, and whether it could be useful for transformers as well. Intuition is a powerful tool, but we wanted more evidence before we opened PRs that add several thousand lines of code to such a widely used codebase as transformers. We set out to measure what success looks like. Not all successes are equal Two agents can both produce the correct label for a sentiment-classification task, but one: writes a 40-line Python script, imports transformers, debugs a shape error, re-runs twice, and finally prints the answer; while the other types transformers classify --model ... --text "..." and is done in one call. Both reach POSITIVE (0.9999), and here are the two paths an agent actually took on this exact task: # Task: classify the sentiment of "I absolutely loved the movie, it was fantastic!" - # one agent: pipe a script into python and parse the output - python - <<'PY' - from transformers import AutoTokenizer, AutoModelForSequenceClassification - import torch - import torch.nn.functional as F - - model = AutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased-finetuned-sst-2-english") - tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased-finetuned-sst-2-english") - inputs = tokenizer("I absolutely loved the movie, it was fantastic!", return_tensors="pt") - with torch.no_grad(): - logits = model(**inputs).logits - probs = F.softmax(logits, dim=1) - idx = torch.argmax(probs, dim=1).item() - print(model.config.id2label[idx], probs[0][idx].item()) - PY + # the other agent: one command + transformers classify \ + --model distilbert/distilbert-base-uncased-finetuned-sst-2-english \ + --text "I absolutely loved the movie, it was fantastic!" Both methods reach the same result. But they have very different profiles in cost, latency, token usage, and failures. If your evaluation only checks the final string, you're blind to these as well as whether a change you shipped to the library (a CLI improvement, better error messages, a Skill) actually helped agents. Our goal with this harness is to evaluate how much work an agent has to do to perform a given task, and whether changes to the library improve performance. How do we run evaluations? A few words on how we'll evaluate agents here. We run every task under three variants (or "tiers"); three different ways an agent can come at transformers: bare pip install transformers, and nothing else clone the full transformers source, checked out in the working directory skill a packaged Skill: the CLI's docs + task examples, loaded in context These aren't nested: skill doesn't contain clone (it ships curated docs, not the source tree), and neither strictly contains the other, each gives the agent a different kind of help. As we'll see, a model can sometimes do better on clone than on skill. A few more choices: For now we only focus on deterministic tasks which can provide an exact match, as they provide a very nice ground for experimentation. Model-as-a-judge and other schemes are the obvious next steps for other tasks. Every run is its own Hugging Face Job: one per (model × revision × task), so the whole sweep runs in parallel on identical hardware, which keeps the comparison fair at scale. Results and traces land in a Hugging Face Bucket: fast, no versioning needed, and handles very high write concurrency. Which models to benchmark against? Not all models driving agents are equal, and their difference changes what you should look at when running them. Large open models At one end, you have the largest, most capable open models. On reasonably common tasks, these should get the right answer, eventually. For them, task completion saturates near 100% and stops telling you much about your tool; a more relevant benchmark is the effort it took the agent to get there: how many turns, tokens and seconds it took, and whether they walked a clean path or used deprecated APIs. Local Local models vary widely in size, and so do their abilities. Metrics such as "match %" are more relevant than for their larger counterparts, as you can see how model sizes/capabilities affect results on your specific tool. This harness not only provides guidance to library maintainers on how to improve a repository for agent interactions, it also helps assess how different agents and models perform on the tasks users care about. The harness scores every run on several axes, so that you can ask what actually matters for each class of model: match %: did the final answer contain the expected result (per-task, case-insensitive substring / regex / exact, all explicit in the report); median time and median tokens (new vs. cached vs. generated); runs with error %: including a guard that flags runs which produced nothing (0 output tokens, no tool calls, no answer) so silent failures don't masquerade as "0"; marker adoption: tool-defined behavior markers; see below for an explanation of what this is. All of it lands in a report you can directly examine: The live report: Overview, Coverage, and Results, all client-side. (Not loading? open it in a new tab.) And because it captures the native agent trace of every run, numbers are just the beginning: yo
Related
相關文章
Liquid AI Introduces LFM2.5-Embedding-350M and LFM2.5-ColBERT-350M: Dense Bi-Encoder and Late-Interaction Models for Fast Multilingual Search Across 11 Languages
This week, Liquid AI released two new retrieval models. They are LFM2.5-ColBERT-350M and LFM2.5-Embedding-350M. Both hold 350M parameters. Both are the first bidirectional members of the LFM family. They build on LFM2.5-350M-Base, released in March. The pair targets fast multilingual and cross-lingual search across 11 languages. Their footprint is small enough to run almost anywhere. Both are available now on Hugging Face under the LFM Open License v1.0. LFM2.5 Retrievers The two models share one backbone but represent text differently. LFM2.5-Embedding-350M is a dense bi-encoder. It turns each document into a single vector. Pick it when you want the fastest search and the smallest, cheapest index. LFM2.5-ColBERT-350M is a late-interaction model. It converts each token into a vector rather
Perplexity Launches Brain, a Self-Improving Memory System That Builds a Context Graph of an Agent’s Work and Learns Overnight
Most AI memory remembers the user. It stores your preferences, your tastes, and your role. Perplexity is taking a different path. Today, Perplexity launched Brain, a self-improving memory system for its agent product, Computer. Brain does not focus on remembering you. It remembers what the agent did. That reframes what memory in AI is for. What is Perplexity‘s Brain Brain is a self-improving memory system. It builds a context graph of the work Computer performs. At set intervals, such as overnight, Brain reviews that graph. It then teaches itself how to do the work better. The idea is straightforward. The more work you do, the more efficient Brain makes your Computer. Brain is rolling out today to Perplexity Max and Enterprise Max subscribers in Research Preview. Two Axes of AI Memory Perp

智譜新高,MiniMax承壓,“大模型雙雄”命運殊途
這篇消息聚焦「智譜新高,MiniMax承壓,“大模型雙雄”命運殊途」。原始導語提到:大模型在被市場重新定價 從 AI 情報角度來看,這類內容值得關注其背後的技術進展、產品落地、產業競爭與後續市場影響。

華為昇騰 0 Day 支持智譜 GLM-5.2 模型,提供全面推理優化
華為昇騰 AI 宣佈在智譜開源 GLM-5.2 大模型當天即完成深度推理優化。通過 MOE 大融合算子、通信計算融合、高併發調度等七項關鍵技術,顯著提升編程和長程任務的處理效率,現已支持 A3 系列產品部署。#AI 大模型# #國產算力#
企業AI轉型再添利器:青雲科技算力雲接入 MiniMax-M3 模型
企業AI落地面臨高效低成本難題。青雲科技旗下基石智算平臺接入國產開源大模型MiniMax-M3,提供新算力支持。MiniMax-M3以卓越上下文處理能力等三大核心技術見長,依託自研架構,助企業便捷部署AI業務。
阿里開源統一科學大模型 LOGOS,僅用五十六分之一參數超越微軟
阿里 ATH-Token Foundry 聯閤中國人民大學高瓴人工智能學院開源科學基礎模型 LOGOS。該模型採用統一科學語法與純序列建模範式,在六大科學任務上匹配或超越傳統專用方法。其中 LOGOS-1B 僅 1B 參數,即展現出極高效率,性能超越參數量達 8×7B 的微軟模型。