...
Key Points
Interfaze, a young YC startup, has open-sourced a new speech recognition model called diffusion-gemma-asr-small.
The model transcribes audio through a diffusion decoder rather than an autoregressive one. It is described as the first multilingual audio diffusion ASR model, with one adapter handling six languages. The research team trained only about 42M parameters on top of a frozen 26B backbone, roughly 0.16% of the model’s weights.

Two terms matter up front. Autoregressive models generate text one token at a time. Diffusion models refine all tokens in parallel. This model uses the diffusion approach for speech-to-text.

TL;DR
- Claimed by the Interfaze team to be the first open-source multilingual diffusion ASR: six languages from a single ~42M-parameter adapter.
- Transcribes via DiffusionGemma’s diffusion decoder using uniform, random-token diffusion, not the absorbing <mask> scheme.
- Transcription cost scales with denoising steps, not transcript length.
- Leads diffusion peers on LibriSpeech (6.6% WER vs Whisfusion’s 8.3%) but trails autoregressive Whisper.
- The adapter ships under Apache-2.0; DiffusionGemma (Gemma terms) and whisper-small (MIT) load separately.

What is diffusion-gemma-asr-small?
diffusion-gemma-asr-small is an audio-native ASR model. It converts speech to text using a discrete diffusion decoder. That decoder belongs to DiffusionGemma, Google’s 26B mixture-of-experts model, which activates 4B parameters using 128 experts with top-8 routing and generates text by discrete diffusion instead of autoregression.

The diffusion detail is specific. Most diffusion LLMs use an absorbing <mask> scheme. DiffusionGemma uses uniform, random-token diffusion instead. It fills a fixed-length canvas with random vocabulary tokens. Each step keeps confident predictions and re-randomizes the rest. After a few steps the noise anneals into text.

Interfaze added audio input to this model. Out of the box, DiffusionGemma takes text, images, and video.
It does not take audio. The repo ships only the trained adapter, about 42M parameters. The frozen backbones download separately from their own repos.

How it works
The model does not feed raw waveforms to the LLM. An early attempt tried exactly that and failed: a frozen LLM has never seen a spectrogram, and its embedding space has no notion of formants or phonemes. The model learned to ignore audio and hallucinate fluent nonsense.

The working design uses a frozen whisper-small encoder. It acts only as a feature extractor, not a decoder. Whisper turns 30 seconds of audio into 1500 frames, each holding 768-dimensional acoustic features. A small trainable projector then compresses these frames with conv layers that subsample 8× plus a linear map. The output is 188 “audio tokens” at 2816 dimensions. These tokens scatter into the prompt’s reserved <|audio|> slots. LoRA adapters let the backbone attend to this new modality. The decoder then denoises a 192-token transcript canvas, running bidirectionally over roughly 16 steps.

The pipeline, from the model card, is compact:

    raw audio
      ─► whisper-small encoder (frozen)
      ─► projector (trained, ~19M)
      ─► scatter into <audio> token slots of DiffusionGemma's encoder
      ─► DiffusionGemma decoder denoises a 192-token canvas (bidirectional, cross-attends audio)
      ─► transcript

The training unlock
The first training runs stalled, with loss flatlining near 8. The failure was circular. The projector started random, so its output was noise. Attention then learned to ignore it, so almost no gradient reached the projector, and the model never learned.

The fix supervised the projector directly. The research team ran the 188 audio tokens through DiffusionGemma’s frozen lm_head and applied a CTC (Connectionist Temporal Classification) loss against the transcript. CTC aligns audio features to text without needing attention, which sidesteps the standoff.
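To make the CTC objective concrete, here is a minimal pure-Python sketch of the CTC forward algorithm on a toy vocabulary. It is not the team's training code: the function names `logsumexp` and `ctc_loss`, the choice of index 0 as the blank symbol, and the three-class vocabulary are assumptions for illustration, and the source does not say how the blank symbol was handled when scoring through lm_head.

```python
import math

NEG_INF = float("-inf")

def logsumexp(*xs):
    """Numerically stable log(sum(exp(x))) that tolerates -inf inputs."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))

def ctc_loss(log_probs, target, blank=0):
    """Negative log-likelihood of `target` under CTC.

    log_probs: T frames, each a list of per-class log-probabilities.
    target:    list of class ids without blanks.
    blank:     index of the blank class (assumed to be 0 here).
    """
    # Interleave blanks: [blank, y1, blank, y2, ..., blank]
    ext = [blank]
    for y in target:
        ext += [y, blank]
    S, T = len(ext), len(log_probs)

    # alpha[s] = log-probability of all partial alignments ending in ext[s]
    alpha = [NEG_INF] * S
    alpha[0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, T):
        prev, alpha = alpha, [NEG_INF] * S
        for s in range(S):
            terms = [prev[s]]                      # stay on the same symbol
            if s >= 1:
                terms.append(prev[s - 1])          # advance by one
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(prev[s - 2])
            alpha[s] = logsumexp(*terms) + log_probs[t][ext[s]]

    # Valid alignments end on the last label or on the trailing blank.
    total = alpha[-1] if S == 1 else logsumexp(alpha[-1], alpha[-2])
    return -total

# Two frames, three classes, uniform predictions, target = [1].
# Three alignments are valid (1,1), (blank,1), (1,blank), each with prob 1/9.
uniform = [[math.log(1 / 3)] * 3 for _ in range(2)]
print(round(ctc_loss(uniform, [1]), 4))  # 1.0986, i.e. ln 3
```

The loss sums over every frame-to-label alignment, so each audio frame receives gradient without relying on the decoder's attention, which is the property the paragraph above describes.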
The audio embeddings became linearly predictive of the right words. CTC loss then dropped from 24 to 8.6 in 300 steps. On LibriSpeech test-clean, English WER fell 90% → 52% → 14.6% → 6.6% over ten epochs.

Performance and benchmarks
WER means Word Error Rate, where lower is better. CER means Character Error Rate. The model trained on FLEURS, LibriSpeech, and VoxPopuli. All scores below use the Whisper text normalizer at 16 diffusion steps.

| benchmark | metric | score |
| --- | --- | --- |
| LibriSpeech test-clean (en) | WER | 6.6% |
| FLEURS English | WER | 15.7% |
| VoxPopuli English | WER | 18.5% |
| FLEURS Hindi | CER | 15.8% |
| FLEURS Mandarin | CER | 29.6% |

Against other diffusion or non-autoregressive ASR, it leads.

| model | approach | LibriSpeech test-clean |
| --- | --- | --- |
| TransFusion (2022) | multinomial diffusion | ~6–7% (proof-of-concept) |
| Whisfusion (Aug 2025) | Whisper-large-v3 + masked diffusion | 8.3% |
| diffusion-gemma-asr-small (2026) | Whisper-small + DiffusionGemma | 6.6% |

Against autoregressive Whisper, it trails. The team frames this gap as data, not architecture.

| benchmark | ours | Whisper-small | Whisper-large-v3 |
| --- | --- | --- | --- |
| LibriSpeech clean | 6.6% | ~3.4% | ~2.0% |
| FLEURS-en | 15.7% | ~9–10% | ~4–5% |
| VoxPopuli-en | 18.5% | ~9–11% | ~7–10% |

The denoising-step sweep shows a nearly flat curve.

| steps | FLEURS-en WER | speed (× real-time) |
| --- | --- | --- |
| 8 | 15.7% | 14.9× |
| 16 | 15.6% | 10.3× |
| 32 | 15.2% | 6.5× |
| 48 | 15.6% | 4.7× |

Going from 8 to 48 steps buys about 0.1 WER point and costs roughly 3× the latency. The model converges in about 8 parallel passes. That is around 0.7–1.5s of model time for a 10-second clip.

Use cases with examples
Batch transcription pipelines benefit from parallel decoding. Cost is set by denoising steps, not clip length, so a 10-second clip needs roughly the same number of passes as a shorter one.

Multilingual transcription runs from a single adapter. It covers English, German, French, Spanish, Hindi, and Mandarin, so teams avoid loading a separate model per language.

Non-autoregressive ASR research gains a reproducible baseline. The recipe grounds a frozen LLM with a small adapter, and researchers can extend it with more audio or a larger encoder.
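For readers who want to reproduce the WER and CER figures quoted in the benchmark tables, the metrics can be sketched as an edit distance divided by the reference length. This is a minimal illustration, not the team's evaluation script: the function names are hypothetical, and the lowercase-and-split step is only a stand-in for the Whisper text normalizer the article mentions. Dropping whitespace for CER is likewise an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists (single-row DP)."""
    row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        diag, row[0] = row[0], i
        for j, h in enumerate(hyp, start=1):
            diag, row[j] = row[j], min(
                row[j] + 1,        # deletion: reference token missing from hyp
                row[j - 1] + 1,    # insertion: extra token in hyp
                diag + (r != h),   # substitution, or free match
            )
    return row[-1]

def wer(reference, hypothesis):
    """Word Error Rate: word-level edits / number of reference words."""
    ref = reference.lower().split()   # stand-in for a real text normalizer
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)

def cer(reference, hypothesis):
    """Character Error Rate: character-level edits / reference characters."""
    ref = [c for c in reference if not c.isspace()]
    hyp = [c for c in hypothesis if not c.isspace()]
    if not ref:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref, hyp) / len(ref)

# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
print(round(wer("the cat sat on the mat", "the cat sit on mat"), 3))  # 0.333
```

CER is the usual choice for languages such as Mandarin, where word boundaries are not marked by spaces, which is consistent with the table reporting CER for Hindi and Mandarin.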
How to get started
The model lives on the Hugging Face Hub. It ships the adapter, model.py, audio.py, and a runnable inference.py. DiffusionGemma support needs transformers installed from main.

    pip install torch peft soundfile librosa huggingface_hub \
        "transformers @ git+https://github.com/huggingface/transformers.git"

Then transcribe in Python:

    import sys
    import soundfile as sf
    from huggingface_hub import snapshot_download

    repo = snapshot_download("interfaze-ai/diffusion-gemma-asr-small")  # adapter, ~170 MB
    sys.path.insert(0, repo)  # make the repo's inference.py importable
    from inference import load, transcribe

    # Loads frozen DiffusionGemma-26B + whisper-small + this adapter.
    model, tok, fe = load(f"{repo}/diffusion_asr_small.pt", device="cuda")
    wav, sr = sf.read("audio.wav")  # expects 16 kHz mono float32
    print(transcribe(wav, model, tok, fe, max_steps=16))

A command-line path also works from inside the downloaded repo:

    python inference.py audio.wav

The max_steps argument trades speed for accuracy. The team notes 8 is near-best and fastest. The default is 16. The base models load under their own licenses: DiffusionGemma under Gemma terms, whisper-small under MIT.
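The uniform, random-token denoising described earlier (start from random vocabulary tokens, keep confident predictions, re-randomize the rest) can be illustrated with a toy loop. This is a sketch under loud assumptions, not model inference: `toy_denoiser` is a hypothetical stand-in that already knows the target text and invents its confidences, and the linear threshold schedule is chosen only so the example terminates cleanly.

```python
import random

VOCAB = list("abcdefghijklmnopqrstuvwxyz ")

def toy_denoiser(canvas, target, rng):
    """Stand-in for one parallel model pass over the whole canvas.

    Returns a (predicted_token, confidence) pair per position. A real model
    would predict from audio; this toy cheats by reading `target`.
    """
    out = []
    for current, gold in zip(canvas, target):
        # Fully confident once a position is already right, else random.
        confidence = 1.0 if current == gold else rng.random()
        out.append((gold, confidence))
    return out

def denoise(target, steps=8, seed=0):
    """Run `steps` denoising passes and return the canvas after each pass."""
    rng = random.Random(seed)
    # Uniform diffusion: the canvas starts as random vocabulary tokens,
    # not as <mask> placeholders.
    canvas = [rng.choice(VOCAB) for _ in target]
    history = []
    for step in range(steps):
        # Assumed schedule: the bar for "confident" drops to 0 by the last step.
        threshold = 1.0 - (step + 1) / steps
        predictions = toy_denoiser(canvas, target, rng)
        canvas = [
            token if confidence >= threshold else rng.choice(VOCAB)
            for token, confidence in predictions
        ]
        history.append("".join(canvas))
    return history

for i, text in enumerate(denoise("hello world", steps=8), start=1):
    print(f"step {i}: {text!r}")
```

Each pass touches every position at once, so the number of model calls equals `steps` regardless of how long the canvas is, which mirrors the article's point that cost scales with denoising steps rather than transcript length.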