Featuring Every Eval Ever Results on Hugging Face Model Pages

June 30, 2026, 00:00

Key takeaways

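Every Eval Ever (EEE) is now intercompatible with Hugging Face Community Evals. We enable cross-posting and interpretation of evaluation results, while linking to open models, leaderboards, and a unified standardized metadata store. EEE launched in February 2026 as a project of the EvalEval Coalition, the first cross-institutional effort to improve how AI evaluation results are reported by both first-party and third-party evaluators. Hugging Face, in turn, launched Co...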

AI-compiled on-site version

By Sree Harsha Nelaturu, Avijit Ghosh, Jan Batzner, Leshem Choshen, Irene Solaiman, and Julien Chaumond. Published June 30, 2026.

Every Eval Ever (EEE) and Hugging Face Community Evals are now intercompatible. We enable cross-posting and interpreting evaluation results, while linking to open models, leaderboards, and a unified standardized metadata store.

EEE launched in February 2026 as a project of the EvalEval Coalition, the first cross-institutional effort to improve how AI evaluation results get reported by both first-party and third-party evaluators. Hugging Face launched Community Evals in February 2026 to decentralize how benchmark scores get reported on the Hub. Combined, they patch gaps in how users, researchers, and policymakers trust, understand, and choose evaluations and models.

Evaluation results are how we measure model capabilities, compare models against each other, and reason about safety and governance, and yet they are scattered and hard to compare. They live in papers, leaderboards, blog posts, and harness logs, among others, each in its own format. The same model on the same benchmark often returns different scores depending on who ran it and how; LLaMA 65B, for one, has been reported at both 63.7 and 48.8 on MMLU. These gaps can arise from evaluation settings that we found are commonly unreported.

EEE is our fix for the reporting side. It's one JSON schema for an evaluation result that records:

- who ran it
- which model
- how it was accessed
- generation settings
- what the metric actually means
- [recommended] a companion JSONL file for per-sample outputs
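To make the list above concrete, here is a minimal Python sketch of what one such record might look like and how a completeness check over it could work. Only the four dotted paths `source_data.hf_repo`, `evaluation_name`, `score_details.score`, and `evaluation_timestamp` are field names this post mentions; every other key (`evaluator`, `model`, `generation_config`, `metric`), the helper names, and all values are hypothetical placeholders, not the actual EEE schema.

```python
# Illustrative record only. Keys marked "hypothetical" are NOT taken from
# the EEE schema; they stand in for the categories the post lists.
record = {
    "evaluation_name": "gsm8k",
    "evaluation_timestamp": "2024-07-16",  # timestamp format is an assumption
    "source_data": {"hf_repo": "openai/gsm8k"},
    "score_details": {"score": 96.8},
    "evaluator": {"name": "example-lab"},                      # hypothetical: who ran it
    "model": {"id": "example-org/example-model",               # hypothetical: which model
              "access": "open_weights"},                       # hypothetical: how accessed
    "generation_config": {"num_fewshot": 8},                   # hypothetical: settings
    "metric": {"description": "share of exact-match answers"}, # hypothetical: meaning
}

# Dotted paths a reviewer might insist on before accepting a record.
REQUIRED_PATHS = [
    "source_data.hf_repo",
    "evaluation_name",
    "score_details.score",
    "evaluation_timestamp",
    "evaluator.name",
    "model.id",
    "model.access",
    "generation_config",
    "metric.description",
]


def get_path(data, dotted):
    """Follow a dotted path through nested dicts; return None if any key is absent."""
    node = data
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def missing_fields(data, paths):
    """List every dotted path that does not resolve to a value in the record."""
    return [path for path in paths if get_path(data, path) is None]


print(missing_fields(record, REQUIRED_PATHS))
```

A check like this only verifies that the context is present, not that it is correct; the point is that a score without its settings is what makes numbers such as the two MMLU figures above hard to reconcile.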
The schema was built with feedback from researchers and policy researchers, and it takes in results from any source, so harness logs, leaderboard scrapes, and paper numbers all end up in the same shape. The GitHub repository has the converters, examples, and a contributor guide.

Since launching, the datastore on Hugging Face has grown to around 229,000 evaluation results across more than 22,000 models and 2,200 benchmarks, pulled from 31 different reporting formats. Reproducing just those runs from scratch would cost somewhere in the hundreds of thousands of dollars, which is a reasonable argument for not letting the data scatter once someone has paid to generate it. Learn more about the schema and how to contribute here.

Now, it comes with better integration and attribution.

Contributors can now send EEE results to Hugging Face Community Evals. We built a converter that takes your EEE records and writes the small YAML files Hugging Face expects, so you don't have to keep the same result in two formats by hand.

This is new functionality for everyone who reports or reads evaluations, not only existing EEE contributors. First-party evaluators reporting on their own models and third-party evaluators reporting on someone else's can both submit to Community Evals and to EEE, and anyone browsing the Hub gets results that trace back to a full record. When you submit your data through your organization's official Hugging Face account, your results show up with a verified checkmark on EvalEval, a signal to readers that the numbers come straight from the source.

The rest of this post walks through what Community Evals are and what the converter does.

How Hugging Face Community Evals works together with EvalEval

Hugging Face Community Evals has two sides. A benchmark lives in a dataset repo that registers itself by adding an eval.yaml. Once registered, that dataset page collects and displays a leaderboard of every score reported against it across the Hub.
The list of official benchmarks grows over time.

A model's scores live in .eval_results/*.yaml inside the model repo. They show up on the model card and feed into the matching benchmark leaderboard. Both the model author's own results and results submitted by anyone else through a pull request get aggregated, and each score carries a badge saying whether it was author-submitted, community-submitted, or independently verified. Anyone can add a score to any model by opening a PR with the right YAML file, and the model author can close PRs or hide results on their own repo.

Here is what one of these leaderboards looks like:

[Figure: Community Evals Leaderboard for Humanity's Last Exam on the Hub]

This is where EEE and Community Evals fit together. When you send a result to both, two things happen. First, your score appears on the Hugging Face model page and gets pulled into the benchmark's leaderboard. Second, it carries a source badge that links straight back to the full EEE record, where the generation config, the harness version, the reproducibility notes, and any instance-level data live.

[Figure: An Evaluation (MMLU-Pro) from EEE Datastore (a) cross-linked at the file level to a Hugging Face model card (b). The Source EvalEval badge links to the full JSON record.]

The two destinations do different jobs toward the same goal. Hugging Face puts your result where people look at models, with a link back to the source. EEE keeps the full structured record that makes the result interpretable, and powers Eval Cards on top of it. Send your data to both and the same evaluation ends up visible and legible at once, which is the point of reporting one at all.

You can see that cross-compatibility below. The same GPQA scores that surface on the model card above also render in Eval Cards, which composes the EEE run data with benchmark and model metadata into one interpretable record.
Same evaluation, a different surface:

How it works

Hugging Face stores eval scores in the model repo as a YAML under .eval_results/. The required fields are just the benchmark dataset, the task, and the value. The source block is the part that creates the backlink to EEE.

    - dataset:
        id: openai/gsm8k
        task_id: gsm8k
      value: 96.8
      date: '2024-07-16'
      notes: '8-shot CoT'
      source:
        url: https://huggingface.co/datasets/evaleval/EEE_datastore/blob/main/flat/objects/<xx>/<yy>/<uuid>.json
        name: EvalEval

The converter fills this in from your existing records. It maps source_data.hf_repo to dataset.id, evaluation_name to task_id, score_details.score to value, and evaluation_timestamp to date, then drops in the datastore object URL as the source link to the per-record EEE JSON. It currently handles four of the official benchmarks: MMLU-Pro, GPQA, HLE, and GSM8K.

The converter does more than reshape fields. You point it at one EEE datastore collection and it downloads that collection along with the records it references, checks the object hashes, and finds the scores that map to a supported benchmark. Before it writes anything live it audits what already exists: it reads every .eval_results YAML on the model's main branch and in open PRs, and compares by dataset and task rather than by filename. Each candidate score then gets one of four statuses:

- already_present: the same score is already there
- score_conflict: a different score is there
- missing_hf_model: the model repo doesn't resolve on the Hub
- ready: everything else

Nothing gets pushed without your sign-off. The tool writes local YAML previews and a review file you can inspect, shows a report of what is ready and what needs attention, and only opens PRs after you type OPEN PRS and enter a commit message. Reruns reuse the cached results for a collection unless you pass --force.

The converter's review step.
Excluded entries (here, models with no matching Hub repo) are listed with their EEE source URLs, and the ready PRs wait on an explicit OPEN PRS confirmation.

Start here

Submit your full records to the EEE datastore. Using EEE requires only one additional step, which the converter largely automates. The community eval converter tool can be found in the GitHub repository. To process a collection, run the following:

uv run tools/hf-comm
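The field mapping and audit statuses described under "How it works" can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the converter's actual code: the function names, the example record, the placeholder URL, the order in which the statuses are checked, the float tolerance, and the nesting of task_id under dataset are all assumptions. Only the four source field paths, the target keys, the status names, and the OPEN PRS phrase come from the post.

```python
import math

# Example EEE record restricted to the four field paths the post names.
record = {
    "source_data": {"hf_repo": "openai/gsm8k"},
    "evaluation_name": "gsm8k",
    "score_details": {"score": 96.8},
    "evaluation_timestamp": "2024-07-16",
}

# Placeholder link; the real datastore object path is not reproduced here.
SOURCE_URL = "https://example.invalid/eee-record.json"


def eee_to_hf_entry(eee_record, source_url):
    """Map one EEE record onto the keys of an .eval_results YAML entry."""
    return {
        "dataset": {
            "id": eee_record["source_data"]["hf_repo"],
            "task_id": eee_record["evaluation_name"],  # nesting is an assumption
        },
        "value": eee_record["score_details"]["score"],
        "date": eee_record["evaluation_timestamp"],
        "source": {"url": source_url, "name": "EvalEval"},
    }


def audit(entry, existing_entries, model_exists):
    """Assign one of the four statuses, comparing by dataset and task."""
    if not model_exists:
        return "missing_hf_model"
    key = (entry["dataset"]["id"], entry["dataset"]["task_id"])
    for other in existing_entries:
        other_key = (other["dataset"]["id"], other["dataset"].get("task_id"))
        if other_key != key:
            continue
        # Same benchmark and task: identical score or conflicting score.
        if math.isclose(other["value"], entry["value"], abs_tol=1e-9):
            return "already_present"
        return "score_conflict"
    return "ready"


def confirmed(typed):
    """Gate any push on the exact confirmation phrase."""
    return typed.strip() == "OPEN PRS"


entry = eee_to_hf_entry(record, SOURCE_URL)
print(audit(entry, [], model_exists=True))
```

Comparing on the (dataset, task) pair rather than on filenames is what lets a check like this catch a duplicate or conflicting score even when someone else submitted it under a different file name.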

Original source: Hugging Face Blog


