Hugging Face Blog · Generative AI

Reachy Mini goes fully local

May 27, 2026, 00:00

Key summary

Compiled on-site by AI

By Amir Mahla (A-Mahla) and Andres Marafioti (andito). Published May 27, 2026.

After building your Reachy Mini, you'll install the conversation app and start talking to it. Until now, you had to send your audio to a server. Not anymore: today we'll walk you through running the whole stack locally.

This stack is powered by speech-to-speech, our cascaded VAD → STT → LLM → TTS pipeline that exposes a Realtime API-compatible /v1/realtime WebSocket. Once you launch the backend, point the robot at it from the UI.

Cascades are the most flexible option in the open-source landscape today, and with the right pieces they're also the fastest. We'll recommend the components we like best, but the whole point of a cascade is that you can swap them. New models drop every week.

TL;DR

- Deploy a local speech backend for your Reachy Mini.
- We use our speech-to-speech library, a cascade approach.
- Recommended: llama.cpp with Gemma 4, Silero VAD, Parakeet-TDT 0.6B v3 STT, Qwen3-TTS.

Quick start

This blog walks you through running conversations with Reachy Mini fully locally. No cloud, no API keys, no data leaving your machine.

[Video: Reachy Mini running a fully local conversation]

Locally serving the LLM

To serve the LLM, we'll use Hugging Face's llama.cpp. If you need to install it, the simplest way is brew install llama.cpp or winget install llama.cpp. For more help, check the docs. First, we'll run:

    llama-server -hf ggml-org/gemma-4-E4B-it-GGUF -np 2 -c 65536 -fa on --swa-full

And done! The first run downloads the model; subsequent launches are fast.

What do those flags do?

-hf ggml-org/gemma-4-E4B-it-GGUF: pulls the model straight from the Hub. The first run downloads it, subsequent runs use the cache.
-np 2: two parallel slots. Lets the server handle a second request (e.g. a quick interruption) without blocking on the first.
-c 65536: 64k context window, shared across slots. Plenty of headroom for long conversations.
-fa on: flash attention. Faster and lower memory, basically free on modern hardware.
--swa-full: keeps the full sliding-window attention cache instead of recomputing it. Trades a bit of RAM for noticeably faster prompt processing on Gemma.

Setting up speech-to-speech

We'll begin by installing the library:

    uv pip install speech-to-speech

Then, while the LLM is being served in another terminal, run:

    speech-to-speech --responses_api_base_url "http://127.0.0.1:8080" --responses_api_api_key "" --mode local

You can now talk to the model through your terminal! The first run needs to download Parakeet-TDT 0.6B v3 and Qwen3-TTS, but subsequent launches are fast.

[Video: the local conversation mode]

After you've tried --mode local, run the command again without that option to serve speech-to-speech to the robot.

Connecting Reachy Mini to speech-to-speech

Once llama.cpp and speech-to-speech are running, start the robot with the desktop app and launch the conversation app. In the conversation app's UI, choose the local mode by clicking "edit connection" in the HF backend.

[Video: selecting the local mode in the conversation app]

And you're done. You can start talking to your robot.

Every stage of the pipeline is a trade-off: there are faster TTS models with lower quality, and slower STT models with higher quality. We optimized for multilingual use; you might want to optimize for a single language. The rest of the blog covers how to customize.

Going deeper

Why run your own Speech-to-Speech server?

Hosted realtime backends are convenient, but running your own engine unlocks three things:

- Privacy. Audio never leaves your network; the entire pipeline runs on hardware you control.
- No API costs. No per-minute or per-token fees.
- Full control over the pipeline. Swap any piece (VAD, STT, LLM, TTS) whenever something better lands on the Hub 🤗.

The speech-to-speech repo gives you all of that in a single CLI.
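The connection step described earlier assumes the local backend is already up. A minimal sketch of a readiness check, using only the Python standard library, might look like the following. The helper name `port_open` is hypothetical, and port 8080 is taken from the quick-start base URL; the article does not state which port speech-to-speech listens on, so that entry is left for you to add.

```python
import socket


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # 8080 matches the llama-server base URL used in the quick start.
    # Assumption: add the speech-to-speech port yourself; it is not given in the article.
    services = {"llama-server": ("127.0.0.1", 8080)}
    for name, (host, port) in services.items():
        state = "ready" if port_open(host, port) else "not reachable"
        print(f"{name} at {host}:{port} is {state}")
```

A listening port only shows that the process has started, not that the model has finished loading, so treat this as a first sanity check before pointing the robot at the speech-to-speech CLI.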
The CLI boots a WebSocket server at /v1/realtime that speaks the same protocol Reachy Mini already knows how to talk to.

Our opinionated defaults: VAD, STT, TTS

A cascaded voice pipeline has four stages: VAD, STT, LLM, and TTS. For three of them, we pick solid defaults so you can focus on the LLM:

| Stage | Choice | Why |
|-------|--------|-----|
| VAD | Silero VAD v5 | Tiny, accurate, runs on CPU. The de-facto default in the open-source voice-agent world. |
| STT | Parakeet-TDT 0.6B v3 | Streaming-friendly, very fast, great quality on English. |
| TTS | Qwen3-TTS | Expressive, low-latency, multilingual, supports custom voices. |

We are opinionated about these choices, but feel free to swap them out if you have a preference.

Choosing your LLM

The LLM is the layer with the most impact on the latency and overall performance of the system. We support two options: run a model locally (llama.cpp, MLX, Transformers, vLLM), or use a server with a Responses API (OpenAI, Gemini, HF Inference Endpoints, llama.cpp, vLLM, etc).

The Responses API: decouple the brain from the voice loop

The main bottleneck in the system is LLM inference latency. To address that, the speech-to-speech engine supports a second mode where the LLM lives in a separate process, as long as that process speaks the Responses API protocol. You launch your model server in one terminal and the voice loop in another, and the two talk over HTTP.

Option 1: llama.cpp in one terminal, speech-to-speech in the other

Terminal 1, llama.cpp server:

    llama-server -hf ggml-org/gemma-4-E4B-it-GGUF -np 2 -c 65536 -fa on --swa-full

Terminal 2, speech-to-speech client:

    speech-to-speech \
      --mode realtime \
      --stt parakeet-tdt \
      --tts qwen3 \
      --llm_backend responses-api \
      --model_name "unsloth/Qwen3-4B-Instruct-2507-GGUF" \
      --responses_api_base_url "http://127.0.0.1:8080/v1"

Option 2: vLLM in one terminal, speech-to-speech in the other

Requires vLLM ≥ 0.21.0.
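Because the requirement is a minimum version, it may be worth checking the installed vLLM before launching anything. The sketch below is one possible check, assuming vLLM is installed in the current Python environment under the distribution name `vllm`. The helpers `version_tuple` and `meets_minimum` are hypothetical names, and the parsing is deliberately simple: it compares only the leading numeric components, so a pre-release such as `0.21.0rc1` is treated like `0.21.0`.

```python
import re
from importlib import metadata

MIN_VLLM = (0, 21, 0)  # minimum version stated in the article


def version_tuple(version: str) -> tuple:
    """Leading numeric components of a version string, padded to three parts."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    parts = tuple(int(p) for p in match.group().split("."))
    return parts + (0,) * (3 - len(parts))


def meets_minimum(version: str, minimum: tuple = MIN_VLLM) -> bool:
    """True if the numeric part of version is at least minimum."""
    return version_tuple(version) >= minimum


if __name__ == "__main__":
    try:
        installed = metadata.version("vllm")
    except metadata.PackageNotFoundError:
        print("vLLM is not installed in this environment")
    else:
        verdict = "OK" if meets_minimum(installed) else "too old, upgrade first"
        print(f"vLLM {installed}: {verdict}")
```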
Full support for the Responses API protocol, including the tool-call streaming used by the speech-to-speech backend, landed in vLLM 0.21.0. Older versions will boot but trip up as soon as the assistant tries to call a tool.

When serving a model through vLLM for this pipeline, three flags are effectively required:

--enable-auto-tool-choice: lets the model decide on its own when to emit a tool call.
--tool-call-parser <tool_parser_name>: picks the per-family parser that turns the model's raw output into structured tool calls (e.g. qwen3_coder for Qwen3 instruct models, llama3_json for Llama 3, hermes for Hermes-style models, ...).
--default-chat-template-kwargs '{"enable_thinking":false}': disables the <think> reasoning channel for models that support it. For harder agentic tasks you can flip this to true and let the model reason, but for a natural-feeling conversation we strongly recommend keeping it off: every thinking token is latency the user hears as silence before the robot starts speaking.

Terminal 1, vLLM inference server (Qwen/Qwen3-4B-Instruct-2507):

    vllm serve Qwen/Qwen3-4B-Instruct-2507 \
      --port 8000 \
      --host 127.0.0.1 \
      --max-model-len 32768 \
      --enable-auto-tool-choice \
      --tool-call-parser qwen3_coder \
      --default-chat-template-kwargs '{"enable_thinking":false}' \
      --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":1}'

The --speculative-config line enables Multi-Token Prediction (MTP). It is optional, but it has a large impact on end-to-end latency. Leave it on whenever the model supports it.

Terminal 2, speech-to-speech client:

    speech-to-speech \
      --mode realtime \
      --stt parakeet-tdt \
      --tts qwen3 \
      --llm_backend responses-api \
      --model_name "Qwen/Qwen3-4B-Instruct-2507" \
      --responses_api_base_url "http://127.0.0.1:8000/v1"

Option 3: Hugging Face Inference Endpoints

Same protocol, but the model runs on a managed GPU on Hugging Face.
Deploy any chat model as an Inference Endpoint, then point the voice loop at the endpoint URL:

    speech-to-speech \
      --mode realtime \
      --stt parakeet-tdt \
      --tts qwen3 \
      --llm_backend responses-api \
      -
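The speech-to-speech client commands in Options 1 and 2 share every flag except the model name and the base URL. As an illustration only, that pattern could be captured in a small Python helper that assembles the argument list. The function name `speech_to_speech_args` is hypothetical, and the sketch uses only flags that appear in the commands above.

```python
import shlex


def speech_to_speech_args(model_name: str, base_url: str,
                          stt: str = "parakeet-tdt", tts: str = "qwen3") -> list:
    """Build the realtime client command line used in Options 1 and 2."""
    return [
        "speech-to-speech",
        "--mode", "realtime",
        "--stt", stt,
        "--tts", tts,
        "--llm_backend", "responses-api",
        "--model_name", model_name,
        "--responses_api_base_url", base_url,
    ]


if __name__ == "__main__":
    # Values copied from the vLLM option; swap in the llama.cpp ones as needed.
    args = speech_to_speech_args(
        model_name="Qwen/Qwen3-4B-Instruct-2507",
        base_url="http://127.0.0.1:8000/v1",
    )
    # Print a shell-ready command; pass `args` to subprocess.run to launch it instead.
    print(shlex.join(args))
```

Keeping the arguments as a list rather than one string avoids quoting problems if a model name or URL ever contains characters the shell treats specially.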

Original source: Hugging Face Blog


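TMTPost (鈦媒體) · Generative AI

Edge AI Daily Morning Briefing (June 19)

AI Engineer World's Fair 2026 reached a record scale, marking AI engineering's move from backstage to center stage. The industry faces structural adjustment: Yann LeCun warns that OpenAI's USD 21 billion annual loss exposes the fragility of its business model, and the "father of the Transformer" joining OpenAI reflects an intensifying talent war. Anthropic is expanding on several fronts, with voice support for seven languages, membership in a carbon-removal coalition, and a new Seoul office, showing its ambition to grow its ecosystem. Regulatory pressure is mounting: Italy is investigating Apple's iCloud under the DMA, and Brazil has opened iOS sideloading with commissions cut to 5%, as Apple's walled garden continues to crumble.

2 hours ago · Read analysis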

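36Kr · Generative AI

Starting today, Claude Design will turn designers and programmers into the same kind of person

Out of the blue, Anthropic dropped a major Claude Design update late at night: one-click design-system import, two-way code sync, and one-click export to nine platforms. An Anthropic designer recorded the demo personally: the AI ran eight rounds of self-checks before it dared to show you the design.

15 hours ago · Read analysis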

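ITHome (IT之家) · Generative AI

OpenAI becomes a Platinum Member of the Rust Foundation, sponsoring a total of USD 600,000

OpenAI has officially become a Platinum Member of the Rust Foundation and will provide a total of USD 600,000 to support Rust open-source project maintainers and programs such as the Rust Innovation Lab. This signals the importance AI giants place on a safe, efficient systems programming language. #OpenAI #Rust #OpenSource

18 hours ago · Read analysis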

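ITHome (IT之家) · Generative AI

Claude Design passes one million users in its first week and shares AI quota with Claude Code

Anthropic announced today (June 18) that, after Claude Design passed 1 million users in its first week, it is further strengthening two-way integration with Claude Code to create a seamless workflow from design to programming.

19 hours ago · Read analysis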

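Zhidx (智東西) · Generative AI

Google launches its first smart speaker in six years, with Gemini on board, priced under 700 yuan

Compiled by Liu Yu, edited by Chen Junda, Zhidx

Zhidx, June 18: Google announced yesterday that its first smart speaker with the home version of the Gemini voice assistant, the Google Home Speaker, is now available for pre-order and will go on sale on June 25 local time for USD 99.99 (about RMB 677.03). Before this, Google had not released a standalone smart speaker in six years. The speaker is roughly spherical, in a style similar to Amazon's new-generation Echo and Apple's older HomePod Mini.

▲ Google smart speaker (image: Google website)

To use the speaker, users wake Gemini with "Hey Google" or "OK Google" and then give their commands. This is the same wake method used on Google's older speakers and smart displays. Users can also phrase commands the way they normally speak and Gemini will understand the intent, which makes communication far more efficient than before.

1. Stronger short-term conversation memory; members can talk to Gemini without limits

Google's new speaker upgrades many features. The Gemini voice assistant offers 10 new human-like voices, and users can pick the one they prefer. The speaker also lets users give several voice commands at once, and Gemini can still recognize a command that is misspoken, incomplete, or corrected midway.

Gemini also has multi-step reasoning, which is practical in everyday scenarios. For example, if a user asks, "What's the weather for my football team's next match?", Gemini automatically looks up the match time and venue, matches the weather for that time slot, and then answers. Gemini's short-term conversation memory has also been strengthened, so it can carry context across a continuous conversation. Even when users ask a string of follow-up questions, chain several tasks together, or do not repeat earlier conditions, the assistant can keep up a coherent back-and-forth.

▲ Google Gemini conversation scenario (image: Google website)

Beyond that, Gemini's continued-conversation feature keeps the speaker's microphone listening briefly after a reply, so users can ask another question without repeating "OK Google". The feature now fully supports all languages that Gemini natively supports, including

22 hours ago · Read analysis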