Introducing the FFASR Leaderboard: Benchmarking ASR in the Real World

June 24, 2026 00:00

Published June 24, 2026
Daniel Gert Nielsen, Shivam Saini, Alessia Milo, Georg Götz (Treble Technologies), Eric Bezzam

🚀 First open far-field ASR benchmark: community-driven evaluation across 14 simulated rooms, validated against real-world measurements: https://huggingface.co/spaces/treble-technologies/ffasr
📉 The gap is real and it is large: across all submitted models, far-field WER at low SNR is consistently several times higher than near-field WER on the same speech content
🔬 Methodology you can trust: hybrid wave-based simulation, sim-to-real validation, moving-source splits in beta, held-out audio, and standardized evaluation hardware across all submissions
⚡ Accuracy and speed together: the Pareto front plots average WER against RTFx so you can evaluate the tradeoff that is right for your deployment
👀 More is coming: multi-talker scenarios, microphone array support, and echo cancellation are on the roadmap

The gap between benchmark performance and real-world deployment is one of the more persistent frustrations in ASR development. Models that score well on standard evaluations often behave differently once real room acoustics are involved: reverberation, background noise, microphone distance. The complex interactions between these factors affect performance in ways that clean-speech benchmarks do not capture. The FFASR Leaderboard is our attempt to quantify that gap.

Treble Technologies and Hugging Face are launching the Far-Field ASR (FFASR) Leaderboard, the first open, community-driven benchmark designed to evaluate ASR models under realistic far-field acoustic conditions. It is live now, and we are inviting the community to submit models, explore the results, and help shape what comes next.

Why far-field evaluation matters

Voice interfaces have expanded well beyond the headset and the smartphone. AI voice agents, conference room transcription, in-car assistants, humanoid robots, smart glasses, and hands-free tools are all seeing rapid adoption. What they have in common is that they operate in acoustically complex environments: reverberation, background noise, overlapping sounds, and a microphone that may be anywhere from one to several meters from the speaker.

The dominant ASR evaluation paradigm has not caught up with this reality. Clean, close-microphone benchmarks remain the standard, and while they are useful for measuring core recognition quality, they do not predict far-field performance. A model that performs well on LibriSpeech or other near-field sets may degrade substantially once real room acoustics enter the picture.

While there have been several research efforts around far-field and noisy speech evaluation, including CHiME, URGENT, and NOIZEUS, the community has not had a standardized, open way to measure that degradation consistently across models in a continuously updated leaderboard format. That is what FFASR is built for.

A major challenge of far-field evaluation is the availability of data. Collecting far-field recordings across a representative range of room types, microphone distances, and noise conditions at scale is prohibitively expensive with physical measurements alone. Simulation makes it possible to cover that space systematically and to extend coverage over time without a corresponding increase in measurement cost.

Another goal of FFASR is to encourage the development of models that are explicitly robust to these conditions. Leaderboards have historically been effective at directing research effort. By making far-field performance visible and comparable, we hope to raise the priority of real-world acoustic robustness across the field.

How the benchmark is constructed

The FFASR Leaderboard evaluates models across nine conditions. The four that determine the primary ranking score are (as of 22 June 2026):

- Near-field (dry): clean speech measured in an anechoic chamber (similar to LibriSpeech but with minimal reverberation)
- Far-field high SNR (above 14 dB)
- Far-field mid SNR (8 to 12 dB)
- Far-field low SNR (below 6 dB)

To give a sense of what these conditions actually sound like, the audio samples in the original post present the same speech utterance as dry anechoic audio, then convolved with a room impulse response, and finally with noise added at each SNR tier. The difference between the dry recording and the low-SNR far-field condition is a reasonable proxy for the scale of the problem the leaderboard is measuring.

Two additional columns, Lab Measured and Lab Simulated, serve as a sim-to-real validation track. The leaderboard also includes moving-source splits, currently in beta, which evaluate models against audio where the speaker is in motion rather than stationary. This condition reflects use cases such as humanoid robots, in-car speech, and mobile voice assistants where the acoustic geometry between speaker and microphone changes continuously.

The acoustic data is generated with Treble's hybrid simulation engine, which combines a wave-based solver at low to mid frequencies with geometrical-acoustics modeling at higher frequencies. This approach captures physical phenomena that simpler simulation methods often miss: diffraction, scattering, interference, and modal behavior. The result is simulated data that closely matches measured acoustic conditions, which the Lab Measured and Lab Simulated columns confirm directly by running the same evaluation on both.

Fourteen fully furnished rooms are included in the benchmark, ranging from 20 to 470 m³ and covering bathrooms, living rooms with hallways, offices, classrooms, and restaurant spaces.
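
The dry, reverberant, and noisy stages described for the audio samples can be sketched in a few lines. This is a minimal illustration, not Treble's pipeline: the toy sine-wave "speech", the decaying toy impulse response, the white noise, and the function names are all hypothetical. It also assumes SNR is defined as the power ratio between the reverberant speech and the added noise, and it adds the noise directly instead of propagating it through the room; the post does not specify either detail.

```python
import math
import random


def convolve(signal, rir):
    """Direct-form linear convolution of a dry signal with a room impulse response."""
    out = [0.0] * (len(signal) + len(rir) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(rir):
            out[i + j] += s * h
    return out


def power(x):
    """Mean squared amplitude of a sequence of samples."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(speech, noise, snr_db):
    """Scale noise so 10*log10(P_speech / P_noise) == snr_db, then add it to speech.

    Returns the noisy mixture and the scaled noise (assumed SNR definition).
    """
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[: len(speech)]
    target_noise_power = power(speech) / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / power(noise))
    scaled = [gain * n for n in noise]
    return [s + n for s, n in zip(speech, scaled)], scaled


def measured_snr_db(speech, noise):
    """SNR in dB computed from the two separate components."""
    return 10.0 * math.log10(power(speech) / power(noise))


# Hypothetical stand-ins for real data: a 0.1 s tone at 16 kHz as "dry speech",
# an exponentially decaying toy impulse response, and white Gaussian noise.
rng = random.Random(0)
dry = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(1600)]
rir = [(0.6 ** k) * (1.0 if k % 2 == 0 else -0.5) for k in range(32)]
noise = [rng.gauss(0.0, 1.0) for _ in range(len(dry) + len(rir) - 1)]

reverberant = convolve(dry, rir)

# One illustrative value inside each far-field tier (above 14, 8 to 12, below 6 dB).
for snr_db in (15.0, 10.0, 5.0):
    noisy, scaled_noise = mix_at_snr(reverberant, noise, snr_db)
    print(snr_db, round(measured_snr_db(reverberant, scaled_noise), 6))
```

In practice the impulse response would come from a simulation or a measurement, and the convolution would use an FFT-based routine; the nested loop is kept here only so the sketch has no dependencies.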
Each acoustic scene contains one target speaker, recorded in an anechoic chamber to avoid reverberation artifacts from the recording environment, and up to three noise sources. Every scene includes both a transient noise source such as coughing and a continuous noise source such as HVAC, at three SNR levels. This coverage is designed to reflect the actual variety of spaces where deployed voice systems operate.

Alongside WER, the leaderboard reports RTFx (audio seconds per inference second) for every submission, evaluated on an NVIDIA L4 GPU under identical conditions. Accuracy and latency together are what matter in real deployments, and the Pareto front view in the Analysis tab makes that tradeoff explicit.

This benchmark is built on simulated acoustic spaces generated with Treble Technologies' proprietary simulation engine. An example of the engine's output can be found in the Treble10 dataset released last year, which established the simulation pipeline and made far-field RIRs available for training and research. FFASR extends that foundation into a standardized evaluation framework with a held-out test set, consistent normalization, and automated scoring.

What the data already shows

With the leaderboard live, a consistent pattern is emerging across all submitted models: the gap between near-field and far-field performance is large, and it grows significantly as SNR decreases. Near-field WER values, on clean dry speech, look comparable to what the same models achieve on established benchmarks. Far-field WER at low SNR tells a different story, often several times higher. The benchmark makes this degradation visible and comparable in a way that was previously difficult to do outside proprietary evaluation pipelines.

The Pareto front of average WER against RTFx is also revealing.
There is a genuine spectrum of approaches represented in the current submissions: models that prioritize speed at the cost of some accuracy, models that push accuracy at the cost of throughput, and a smaller number that achieve a competitive position on both axes. Visualizing these tradeoffs against far-field accuracy rather than clean-speech accuracy
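
The two axes of the Pareto view, and the front itself, can be sketched as follows. This is an illustration under stated assumptions, not the leaderboard's scoring code: WER is computed as word-level edit distance over text that is assumed to be already normalized, RTFx follows the definition given in the post (audio seconds per inference second), and the model names and scores are hypothetical placeholders, not leaderboard results.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def rtfx(audio_seconds, inference_seconds):
    """Inverse real-time factor: audio seconds processed per inference second."""
    return audio_seconds / inference_seconds


def pareto_front(models):
    """Names of models not dominated on (lower WER, higher RTFx).

    A model is dominated if another model is at least as good on both axes
    and strictly better on at least one.
    """
    front = []
    for name, w, r in models:
        dominated = any(
            w2 <= w and r2 >= r and (w2 < w or r2 > r)
            for _, w2, r2 in models
        )
        if not dominated:
            front.append(name)
    return front


print(wer("turn on the kitchen lights", "turn on kitchen light"))  # 0.4
print(rtfx(3600.0, 30.0))  # 120.0

# Hypothetical (name, average WER, RTFx) triples for illustration only.
models = [
    ("model_a", 0.30, 200.0),
    ("model_b", 0.20, 50.0),
    ("model_c", 0.35, 150.0),
    ("model_d", 0.18, 20.0),
]
print(pareto_front(models))  # ['model_a', 'model_b', 'model_d']
```

Here `model_c` drops off the front because `model_a` is both more accurate and faster, while the remaining three each represent a different speed and accuracy tradeoff.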

Original source: Hugging Face Blog
