Hugging Face Blog | Model Updates

How to Fine-Tune Nemotron 3.5 ASR for Your Language, Domain, or Accent

June 4, 2026, 12:59

Key Takeaways

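NVIDIA has introduced Nemotron 3.5 ASR, a 600M-parameter streaming multilingual speech-to-text model that transcribes 40 language-locales in real time from a single checkpoint, with punctuation and capitalization built in. It succeeds Nemotron 3 ASR (English only), released earlier this year on Hugging Face and as a NIM, which has since been validated by independent benchmarks at Artificial Analysis.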

On-site AI-compiled version

Published June 4, 2026
By Maryam Motamedi, Adi Margolin, Francesco (fciannella), Myungjong Kim, and Enas Albasiri (NVIDIA)

Introducing NVIDIA Nemotron 3.5 ASR, streaming multilingual: a 600M-parameter speech-to-text model that transcribes 40 language-locales from a single checkpoint, in real time, with punctuation and capitalization built in. It is the successor to the popular Nemotron 3 ASR model (English only), which was released on Hugging Face and as a NIM earlier this year. Since its release, Nemotron 3 ASR has been validated by independent benchmarks at Artificial Analysis, where it ranks 2nd in latency among all streaming ASR models, with just 0.07 seconds to final transcript after end of speech. It also sits in the "most attractive quadrant" of the AA-WER Streaming Index vs. Time to Final Transcription leaderboard, placing it among the best models on the combined accuracy-latency tradeoff.

The model uses a Cache-Aware FastConformer-RNNT architecture that streams audio without the redundant recomputation that makes most streaming ASR slow, so you get low latency and high accuracy rather than one at the expense of the other.

Nemotron 3.5 ASR ships as open weights on Hugging Face: you can inspect, fine-tune, and deploy it without API dependencies or per-call billing. No data leaves your infrastructure unless you choose. And because it's a strong base model, you can fine-tune it for your own language, domain, or accent. The second half of this post walks through exactly how.

The problem with multilingual speech recognition today

If you've ever built a product that needs to transcribe speech, you've probably hit one of these walls:

The polyglot tax. You want to support multiple languages, so you stitch together 40 different models, or 40 different vendor APIs, each with its own quirks, latency profile, and billing. Your infrastructure becomes a museum of one-off integrations.

The streaming-vs-accuracy tradeoff. Real-time captioning needs low latency, but most "streaming" ASR systems fake it by re-processing overlapping windows of audio over and over. That burns compute and adds delay. Turn down the latency and accuracy falls off a cliff.

The post-processing pipeline. Raw ASR output is often an unpunctuated, lowercase wall of text. You bolt on a second model for punctuation and capitalization, adding yet another moving part.

The "known language" assumption. Many systems require you to tell them the language up front. But what about a customer-support line where callers switch between English and Spanish mid-sentence?

Nemotron 3.5 ASR was built to collapse all four of those problems into one model.

What it does

One model, 40 language-locales. A single 600M-parameter checkpoint transcribes English (US/GB), Spanish (US/ES), German, French (FR/CA), Italian, Arabic, Japanese, Korean, Portuguese (BR/PT), Russian, Hindi, Turkish, Vietnamese, Dutch, Ukrainian, Polish, Finnish, Mandarin, Czech, Bulgarian, Slovak, Swedish, Croatian, Romanian, Estonian, Danish, Hungarian, Norwegian Bokmål, Norwegian Nynorsk, Hebrew, Greek, Lithuanian, Latvian, Maltese, Slovenian, and Thai. No per-language deployment, no model-swapping.

Real-time streaming, done right. The model is built on a Cache-Aware FastConformer encoder. Traditional "buffered" streaming re-processes overlapping chunks of audio at every step, doing the same work many times over. This model instead caches the encoder's internal state and reuses it: every audio frame is processed exactly once, with no overlap. The result is dramatically lower compute and end-to-end latency, with no accuracy penalty.

Punctuation and capitalization, natively. The output is production-ready text (proper casing, commas, periods, question marks) straight from the model. No separate punctuation-restoration step.

Language conditioning, your choice. You can run it two ways:

Tell the model the input language (target_lang=en-US) when you know it. This typically gives the best accuracy.

Let the model detect the language (target_lang=auto) when you don't. The model detects the language and transcribes accordingly.

How it works (the 2-minute version)

The model has two main pieces:

A Cache-Aware FastConformer encoder (24 layers). FastConformer is an efficient evolution of the Conformer architecture with linearly scalable attention. The "cache-aware" part is the streaming magic: the encoder keeps a cache of its self-attention and convolution activations from previous frames, so as new audio arrives it only computes what's genuinely new. Nothing is recomputed.

An RNNT (Recurrent Neural Network Transducer) decoder. RNNT is the workhorse decoder for streaming ASR: it emits text as audio streams in, frame by frame, which is exactly what you want for live transcription.

On top of this, the model adds prompt-based language-ID conditioning: a language signal is fed alongside the audio, which lets one set of weights specialize its output to the target language or, in auto mode, infer the language itself. It was trained on a large speech corpus spanning all supported languages, using a blend of public and proprietary data normalized to punctuated, properly cased text.

A knob worth knowing: att_context_size

Streaming ASR is fundamentally a tradeoff between how soon you emit text and how much future audio the model gets to "peek at" before committing.
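To make the tradeoff concrete, here is a minimal sketch of how look-ahead could translate into waiting time. It assumes that each encoder output frame covers 80 ms of audio and that the second entry of att_context_size counts look-ahead frames. Both are assumptions rather than statements from this post, although they reproduce the chunk sizes listed for each setting in this section. The function name chunk_latency_ms is hypothetical.

```python
FRAME_MS = 80  # assumed duration of one encoder output frame, in milliseconds


def chunk_latency_ms(att_context_size, frame_ms=FRAME_MS):
    """Return the chunk duration implied by a [left, right] attention context.

    Assumption: the model waits for the current frame plus `right`
    look-ahead frames before it commits text for that chunk.
    """
    left, right = att_context_size
    if left < 0 or right < 0:
        raise ValueError("context sizes must be non-negative")
    return (right + 1) * frame_ms


if __name__ == "__main__":
    # The five operating points named in this post.
    for ctx in ([56, 0], [56, 1], [56, 3], [56, 6], [56, 13]):
        print(ctx, "->", chunk_latency_ms(ctx), "ms")
```

Under these assumptions the five settings give 80, 160, 320, 560, and 1120 ms, so each extra look-ahead frame costs one more frame of waiting.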
Nemotron ASR exposes this tradeoff directly through the attention context size:

Attention Context   Chunk Size (Latency)   Use Case
[56, 0]             80 ms (Ultra-Low)      Ultra-low-latency voice agents
[56, 1]             160 ms (Low)           Interactive voice agents, conversational AI
[56, 3]             320 ms (Balanced)      Conversational AI, live captioning
[56, 6]             560 ms (Medium)        High accuracy with reasonable latency
[56, 13]            1.12 s (High)          Highest accuracy with high latency

The same checkpoint covers the whole spectrum: you choose the operating point at inference time, with no retraining required.

Try it in minutes

The model ships as a NeMo checkpoint. Clone the NeMo repository and point the streaming inference script at your audio:

    git clone https://github.com/NVIDIA-NeMo/NeMo.git

Transcribe with a known language:

    python ${NEMO_ROOT}/examples/asr/asr_cache_aware_streaming/speech_to_text_cache_aware_streaming_infer.py \
        model_path=${MODEL_PATH} \
        dataset_manifest=${MANIFEST_PATH} \
        output_path=${OUTPUT_FOLDER} \
        target_lang=es-ES \
        att_context_size="[56,3]" \
        strip_lang_tags=true

Or let the model detect the language:

    python ${NEMO_ROOT}/examples/asr/asr_cache_aware_streaming/speech_to_text_cache_aware_streaming_infer.py \
        model_path=${MODEL_PATH} \
        dataset_manifest=${MANIFEST_PATH} \
        output_path=${OUTPUT_FOLDER} \
        target_lang=auto \
        att_context_size="[56,3]" \
        strip_lang_tags=true

Audio should be mono-channel .wav. The manifest is a standard NeMo JSON-lines file:

    {"audio_filepath": "/path/to/clip.wav", "duration": 4.27, "text": "reference transcript"}

The model automatically predicts a language tag at the end of each completed sentence, for example "This is a test sample. <en-US>". Setting strip_lang_tags=true removes the <xx-XX> language tag for better readability.

Deep Dive: Fine-Tuning Nemotron ASR for Your Language

Nemotron 3.5 ASR is strong out of the box, but it was trained on a mix where some languages have far more data than others. The long-tail locales have headroom, and a few hours of in-domain audio plus the right recipe closes a surprising amount of it.

To make this concrete, we ran a worked example: take the base model and sharpen it on two mid-resource European languages, Greek and Bulgarian, then measure honestly on held-out data. The results below are from that run. This section is a high-level overview; the coding example lives in the companion GitHub repo. When we publish an agentic SKILL.md covering the whole process, this blog will be updated accordingly.

Why fine-t
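Returning to the manifest format shown under "Try it in minutes": the following is a minimal sketch of one way to generate such a JSON-lines file from a list of .wav clips and transcripts, using only the Python standard library. The helper names read_wav_info and build_manifest are hypothetical and are not part of NeMo; the field names come from the example line in this post.

```python
import json
import wave
from pathlib import Path


def read_wav_info(path):
    """Return (channels, duration_seconds) read from a PCM .wav header."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnchannels(), wav.getnframes() / wav.getframerate()


def build_manifest(entries, manifest_path):
    """Write one JSON object per line for (wav_path, transcript) pairs.

    Returns the number of lines written. Non-mono clips are rejected,
    since the post asks for mono-channel audio.
    """
    count = 0
    with open(manifest_path, "w", encoding="utf-8") as out:
        for wav_path, text in entries:
            wav_path = Path(wav_path)
            channels, duration = read_wav_info(wav_path)
            if channels != 1:
                raise ValueError(f"{wav_path} has {channels} channels, expected mono")
            record = {
                "audio_filepath": str(wav_path.resolve()),
                "duration": round(duration, 2),
                "text": text,
            }
            # ensure_ascii=False keeps non-Latin transcripts readable in the file.
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

A call such as build_manifest([("/path/to/clip.wav", "reference transcript")], "manifest.json") would produce a file that can be passed as dataset_manifest. Reading durations from the header avoids decoding the audio, at the cost of supporting only the .wav variants that the standard wave module can open.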

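The post notes that the model appends a language tag such as <en-US> to each completed sentence. As a minimal sketch, assuming the raw output keeps those tags inline and that every tag follows the <xx-XX> shape shown in that example, the tags could be separated from the text after the fact. This is useful if you run without strip_lang_tags and want to know which language was detected. The function name split_lang_tags is hypothetical.

```python
import re

# Assumed tag shape, taken from the example in the post:
# two lowercase letters, a hyphen, two uppercase letters, e.g. <en-US>.
LANG_TAG = re.compile(r"<([a-z]{2}-[A-Z]{2})>")


def split_lang_tags(transcript):
    """Return (clean_text, tags) for a transcript that may contain language tags."""
    tags = LANG_TAG.findall(transcript)
    text = LANG_TAG.sub("", transcript)
    # Collapse the whitespace left behind where tags were removed.
    text = re.sub(r"\s+", " ", text).strip()
    return text, tags


if __name__ == "__main__":
    print(split_lang_tags("This is a test sample. <en-US>"))
```

Keeping the tags as a list, in order, preserves one detected locale per completed sentence, which may help when callers switch languages within a recording.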