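Is it agentic enough? Benchmarking open models on your own tooling
Key takeaways
This human-made, agent-focused article argues that coding agents are increasingly taking over our interactions with software: once a task is described, the agent chooses the library, writes the calls, runs them, and debugs on its own. This introduces a new concept in library development: code must not only be correct and fast, but also designed so that an agent can drive it effectively. If an API is clunky or the docs are stale, the agent will bypass the library and rewrite the logic from scratch.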
Published June 18, 2026
By Lysandre (lysandre), Nathan Habib (SaylorTwift), and Pedro Cuenca (pcuenq)

[Figure: Benchmarking transformers revisions across different metrics]

This is a human-made, agent-focused blogpost.

Coding agents increasingly work with our software instead of us: describe a task, and the agent picks the library, writes the calls, runs them, and debugs its own mistakes. When the library gets in the way, it will happily bypass it and rewrite the logic from scratch.

This introduces a new concept in library development: the code should not only be correct and fast, but should be designed so that an agent can drive it effectively. A clunky API or stale docs annoy us developers, but it now also sends the agent down a longer, more expensive path.

Most benchmarks just look at the final answer. We wanted the whole process instead: not just whether the agent got it right, but how much work it took to get there, and how that shifts across models, library revisions, and tasks. We measured exactly that, using transformers as our case study.

Here, we will introduce a tool-specific benchmark focusing on how the answer was found, and provide a simple implementation of one such harness, running entirely on open models driven by the pi coding agent, with the full sweep of models × revisions × tasks fanned out across Hugging Face Jobs so every run sees identical hardware.

But, how do you optimize software for agents?

We're strong believers in the following two software principles:

- If it isn't tested, then it doesn't work
- If it isn't documented, then it doesn't exist

This remains the same within the realm of agentic-optimized tooling, and, for once, the two are directly tied to each other. You want your tool to exist for an agent: it needs to be discoverable. The API needs to be clear and the docs need to be extensive.
They need to be structured in a way that the agent has rapid access to the useful files and examples. If you want your tool to work for an agent, then you should test it for agentic use.

Testing software for agentic use

We'll use transformers as an example throughout this blogpost: agents using it to solve ML tasks (classifying text, captioning images, transcribing audio), not contributing code to it; though the harness was designed to work with any tool that can be operated from the command line.

Our intuition on transformers was that usage could be dramatically simplified with a few changes: a CLI, a Skill, and self-contained, task-specific examples. This is the same recipe recently applied to the hf CLI, redesigned to be agent-optimized, where agents used 1.3–1.8× (and up to 6×) fewer tokens. We wanted to know whether that kind of win generalizes, and whether it could be useful for transformers as well.

Intuition is a powerful tool, but we wanted more evidence before we opened PRs that add several thousand lines of code to such a widely used codebase as transformers. We set out to measure what success looks like.

Not all successes are equal

Two agents can both produce the correct label for a sentiment-classification task, but one writes a 40-line Python script, imports transformers, debugs a shape error, re-runs twice, and finally prints the answer, while the other types transformers classify --model ... --text "..." and is done in one call.

Both reach POSITIVE (0.9999), and here are the two paths an agent actually took on this exact task:

    # Task: classify the sentiment of "I absolutely loved the movie, it was fantastic!"
    - # one agent: pipe a script into python and parse the output
    - python - <<'PY'
    - from transformers import AutoTokenizer, AutoModelForSequenceClassification
    - import torch
    - import torch.nn.functional as F
    -
    - model = AutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased-finetuned-sst-2-english")
    - tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased-finetuned-sst-2-english")
    - inputs = tokenizer("I absolutely loved the movie, it was fantastic!", return_tensors="pt")
    - with torch.no_grad():
    -     logits = model(**inputs).logits
    - probs = F.softmax(logits, dim=1)
    - idx = torch.argmax(probs, dim=1).item()
    - print(model.config.id2label[idx], probs[0][idx].item())
    - PY
    + # the other agent: one command
    + transformers classify \
    +   --model distilbert/distilbert-base-uncased-finetuned-sst-2-english \
    +   --text "I absolutely loved the movie, it was fantastic!"

Both methods reach the same result. But they have very different profiles in cost, latency, token usage, and failures. If your evaluation only checks the final string, you're blind to these differences, and also to whether a change you shipped to the library (a CLI improvement, better error messages, a Skill) actually helped agents.

Our goal with this harness is to evaluate how much work an agent has to do to perform a given task, and whether changes to the library improve performance.

How do we run evaluations?

A few words on how we'll evaluate agents here. We run every task under three variants (or "tiers"), three different ways an agent can come at transformers:

bare: pip install transformers, and nothing else
clone: the full transformers source, checked out in the working directory
skill: a packaged Skill (the CLI's docs + task examples), loaded in context

These aren't nested: skill doesn't contain clone (it ships curated docs, not the source tree), and neither strictly contains the other; each gives the agent a different kind of help.
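As a rough sketch of how the three tiers might combine with the sweep of models × revisions × tasks, the snippet below enumerates one run spec per combination. The `RunSpec` class, the `build_sweep` helper, and the placeholder model and revision names are all hypothetical, and treating the tier as its own axis is an assumption on our part, not a confirmed detail of the harness.

```python
from dataclasses import dataclass
from itertools import product

# The three tiers named in the text.
TIERS = ("bare", "clone", "skill")


@dataclass(frozen=True)
class RunSpec:
    """One cell of the sweep (hypothetical structure, for illustration only)."""
    model: str
    revision: str
    task: str
    tier: str


def build_sweep(models, revisions, tasks, tiers=TIERS):
    """Return one RunSpec per (model, revision, task, tier) combination."""
    return [
        RunSpec(model, revision, task, tier)
        for model, revision, task, tier in product(models, revisions, tasks, tiers)
    ]


# Placeholder names: these are not real model IDs or library revisions.
sweep = build_sweep(
    models=["model-a", "model-b"],
    revisions=["rev-1", "rev-2"],
    tasks=["sentiment-classification", "image-captioning", "audio-transcription"],
)
print(len(sweep))  # 2 models x 2 revisions x 3 tasks x 3 tiers = 36
```

Keeping each cell as an immutable, hashable record makes it easy to deduplicate specs or use them as keys when results come back.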
As we'll see, a model can sometimes do better on clone than on skill.

A few more choices:

- For now we only focus on deterministic tasks which can provide an exact match, as they provide a very nice ground for experimentation. Model-as-a-judge and other schemes are the obvious next steps for other tasks.
- Every run is its own Hugging Face Job: one per (model × revision × task), so the whole sweep runs in parallel on identical hardware, which keeps the comparison fair at scale.
- Results and traces land in a Hugging Face Bucket: fast, no versioning needed, and handles very high write concurrency.

Which models to benchmark against?

Not all models driving agents are equal, and their difference changes what you should look at when running them.

Large open models

At one end, you have the largest, most capable open models. On reasonably common tasks, these should get the right answer, eventually. For them, task completion saturates near 100% and stops telling you much about your tool; a more relevant benchmark is the effort it took the agent to get there: how many turns, tokens and seconds it took, and whether they walked a clean path or used deprecated APIs.

Local

Local models vary widely in size, and so do their abilities. Metrics such as "match %" are more relevant than for their larger counterparts, as you can see how model sizes/capabilities affect results on your specific tool.

This harness not only provides guidance to library maintainers on how to improve a repository for agent interactions, it also helps assess how different agents and models perform on the tasks users care about.

The harness scores every run on several axes, so that you can ask what actually matters for each class of model:

- match %: did the final answer contain the expected result (per-task, case-insensitive substring / regex / exact, all explicit in the report);
- median time and median tokens (new vs. cached vs. generated);
- runs with error %: including a guard that flags runs which produced nothing (0 output tokens, no tool calls, no answer) so silent failures don't masquerade as "0";
- marker adoption: tool-defined behavior markers; see below for an explanation of what this is.

All of it lands in a report you can directly examine:

[Figure: The live report: Overview, Coverage, and Results, all client-side.]

And because it captures the native agent trace of every run, numbers are just the beginning: you can read exactly what the agen
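The scoring axes listed above can be sketched roughly as follows. This is a minimal illustration, not the harness's actual code: the `Run` record, the `matches`, `is_silent_failure`, and `summarize` helpers, and the sample runs are all hypothetical. We also assume that case-insensitivity applies to all three match modes and that only output tokens are aggregated.

```python
import re
from dataclasses import dataclass
from statistics import median


@dataclass
class Run:
    """Hypothetical per-run record; field names are assumptions."""
    answer: str          # final answer text produced by the agent
    output_tokens: int   # tokens generated during the run
    tool_calls: int      # number of tool invocations
    seconds: float       # wall-clock duration
    errored: bool = False


def matches(answer, expected, mode="substring"):
    """Case-insensitive comparison in one of three explicit modes."""
    if mode == "exact":
        return answer.strip().lower() == expected.strip().lower()
    if mode == "substring":
        return expected.lower() in answer.lower()
    if mode == "regex":
        return re.search(expected, answer, flags=re.IGNORECASE) is not None
    raise ValueError(f"unknown match mode: {mode!r}")


def is_silent_failure(run):
    """Flag runs that produced nothing: no tokens, no tool calls, no answer."""
    return run.output_tokens == 0 and run.tool_calls == 0 and not run.answer.strip()


def summarize(runs, expected, mode="substring"):
    """Aggregate match %, error %, and medians over a non-empty list of runs."""
    if not runs:
        raise ValueError("cannot summarize an empty list of runs")
    total = len(runs)
    matched = sum(matches(r.answer, expected, mode) for r in runs)
    failed = sum(r.errored or is_silent_failure(r) for r in runs)
    return {
        "match_pct": 100.0 * matched / total,
        "error_pct": 100.0 * failed / total,
        "median_seconds": median(r.seconds for r in runs),
        "median_output_tokens": median(r.output_tokens for r in runs),
    }


# Illustrative placeholder runs, not measured results.
runs = [
    Run("POSITIVE (0.9999)", output_tokens=120, tool_calls=1, seconds=4.0),
    Run("The sentiment is positive.", output_tokens=900, tool_calls=6, seconds=31.0),
    Run("", output_tokens=0, tool_calls=0, seconds=2.0),  # silent failure
]
report = summarize(runs, expected="positive")
print(report)
```

Counting the empty run as an error, instead of letting it contribute a misleading zero, is the point of the guard. Whether such runs should also be excluded from the medians is a design choice this sketch leaves open.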
