Hugging Face Blog | Generative AI

Run a vLLM Server on HF Jobs in One Command

June 26, 2026, 00:00

Key points

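You can now spin up a private, OpenAI-compatible LLM endpoint on Hugging Face infrastructure with a single command: no servers to provision, no Kubernetes, pay-per-second. Once it is up, you can query it from your laptop, a notebook, or anywhere else. It is the quickest way to stand up a model for tests, evals, or batch generation. (If you need a managed, production-ready service instead, that is what Inference Endpoints are for; the end of the article covers when to pick which.) The full flow follows. Prerequisites: a payment method or a positive prepaid credit balance (Jobs is billed per minute by hardware usage), and huggingface_hub >= 1.20.0.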

AI-compiled article text

Published June 26, 2026 · Quentin Gallouédec (qgallouedec)

You can spin up a private, OpenAI-compatible LLM endpoint on Hugging Face infrastructure with a single command: no servers to provision, no Kubernetes, pay-per-second. Once it's up, you can query it from your laptop, a notebook, or anywhere else. It's the quickest way to stand up a model for tests, evals, or batch generation. (If you're after a managed, production-ready service instead, that's what Inference Endpoints are for; more on when to pick which at the end.) Here's the whole thing end to end.

Prerequisites

- A payment method or a positive prepaid credit balance (Jobs is billed per-minute by hardware usage).
- huggingface_hub >= 1.20.0: pip install -U "huggingface_hub>=1.20.0".
- Logged in locally: hf auth login.

Launch the server

hf jobs run is docker run for HF infrastructure. We use the official vllm/vllm-openai image, ask for a GPU with --flavor, and expose vLLM's port with --expose:

    hf jobs run --flavor a10g-large --expose 8000 --timeout 2h \
      vllm/vllm-openai:latest \
      vllm serve Qwen/Qwen3-4B --host 0.0.0.0 --port 8000

--expose 8000 routes the container's port through HF's public jobs proxy (see the Serve Models guide for the full reference). The command prints the URL your server is reachable at:

    ✓ Job started
      id: 6a381ca1953ed90bfb947332
      url: https://huggingface.co/jobs/qgallouedec/6a381ca1953ed90bfb947332
    Hint: Exposed ports are reachable at (requires an HF token with read access to the job):
      https://6a381ca1953ed90bfb947332--8000.hf.jobs

6a381ca1953ed90bfb947332 is your job ID. Keep track of it; we'll need it. We'll use <job_id> as a placeholder for it in the rest of the post.

Give it a couple of minutes to download weights and boot. When the logs show Application startup complete, you're live.

Query it from anywhere

vLLM speaks the OpenAI API, and every request just needs your HF token as a bearer token.
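
As a rough illustration of the URL pattern and bearer-token requirement described above, the sketch below builds the proxy URL from a job ID and polls the OpenAI-compatible /v1/models route until the server answers. The helper names (exposed_url, auth_headers, wait_until_ready) and the timeout values are assumptions made for this example and are not part of the hf CLI or huggingface_hub; only the <job_id>--<port>.hf.jobs pattern and the Authorization header come from the post.

```python
import json
import time
import urllib.request


def exposed_url(job_id: str, port: int = 8000) -> str:
    """Build the proxy URL for an exposed port (pattern printed by `hf jobs run`)."""
    return f"https://{job_id}--{port}.hf.jobs"


def auth_headers(token: str) -> dict:
    """Every request through the jobs proxy carries the HF token as a bearer token."""
    return {"Authorization": f"Bearer {token}"}


def wait_until_ready(base_url: str, token: str,
                     timeout_s: float = 600.0, poll_s: float = 10.0) -> list:
    """Poll /v1/models until it responds; return the served model IDs.

    Hypothetical helper: the timeout and polling interval are arbitrary choices.
    """
    request = urllib.request.Request(f"{base_url}/v1/models",
                                     headers=auth_headers(token))
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(request, timeout=10) as response:
                payload = json.load(response)
            return [model["id"] for model in payload.get("data", [])]
        except (OSError, ValueError):
            # Not reachable yet (still booting) or a non-JSON reply: retry later.
            time.sleep(poll_s)
    raise TimeoutError(f"{base_url} did not become ready within {timeout_s} s")
```

Calling wait_until_ready(exposed_url("<job_id>"), get_token()), with get_token imported from huggingface_hub, would then block until the model shows up in the list, which is one way to avoid watching the logs by hand.
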
The quickest way to hit it is curl:

    curl https://<job_id>--8000.hf.jobs/v1/chat/completions \
      -H "Authorization: Bearer $(hf auth token)" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "Qwen/Qwen3-4B",
        "messages": [{"role": "user", "content": "Hello!"}],
        "chat_template_kwargs": {"enable_thinking": false}
      }'

which returns the usual OpenAI-style JSON, with choices[0].message.content holding "Hello! How can I assist you today? 😊".

Or, from Python, point the OpenAI client at the exposed URL and pass the token as the API key:

    from huggingface_hub import get_token
    from openai import OpenAI

    client = OpenAI(
        base_url="https://<job_id>--8000.hf.jobs/v1",
        api_key=get_token(),
    )
    resp = client.chat.completions.create(
        model="Qwen/Qwen3-4B",
        messages=[{"role": "user", "content": "Hello!"}],
        extra_body={"chat_template_kwargs": {"enable_thinking": False}},
    )
    print(resp.choices[0].message.content)

Output:

    Hello! How can I assist you today? 😊

Quick health check before you start:

    curl https://<job_id>--8000.hf.jobs/v1/models -H "Authorization: Bearer $(hf auth token)"

should list the model.

🔐 The endpoint is gated, not public. Every request must carry an HF token with read access to the job's namespace. A plain browser visit will be rejected. In effect, the jobs proxy is your API gate: access is scoped to you (and your org). That's fine for private use, but treat the URL accordingly: don't share it expecting it to be open, and don't paste your token into untrusted places. If you need finer-grained or public access, put a proper gateway in front instead, or see "HF Jobs or Inference Endpoints?" below.

Clean up

Jobs are billed per second, so stop the server when you're done:

    hf jobs cancel <job_id>

The --timeout you set is a safety net (it'll auto-stop), but cancelling explicitly is cheaper. An a10g-large runs at $1.50/hour; check hf jobs hardware for the full price list and pick the smallest flavor that fits your model.
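
To make the cost of a forgotten server concrete, here is a minimal sketch of the arithmetic, assuming billing is strictly proportional to runtime with no minimum charge or rounding (the post mentions both per-second and per-minute billing, so actual invoices may differ slightly). The function name estimate_cost_usd is hypothetical; the $1.50/hour default is the a10g-large rate quoted above.

```python
def estimate_cost_usd(runtime_seconds: float, hourly_rate_usd: float = 1.50) -> float:
    """Approximate job cost, assuming purely linear billing (an assumption)."""
    return round(hourly_rate_usd * runtime_seconds / 3600, 4)


# A 20-minute test session on a10g-large versus running until the 2h timeout.
print(estimate_cost_usd(20 * 60))    # 0.5
print(estimate_cost_usd(2 * 3600))   # 3.0
```

Under these assumptions, cancelling after 20 minutes costs about a sixth of what letting the 2h timeout expire would.
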
Going further: bigger models

The same command scales to much larger models: pick a beefier --flavor and tell vLLM to shard the model across the GPUs with --tensor-parallel-size. For example, the 122B Qwen3.5 mixture-of-experts model on 2× H200:

    hf jobs run --flavor h200x2 --expose 8000 --timeout 2h \
      vllm/vllm-openai:latest \
      vllm serve Qwen/Qwen3.5-122B-A10B \
      --host 0.0.0.0 --port 8000 --tensor-parallel-size 2 \
      --max-model-len 32768 --max-num-seqs 256

--tensor-parallel-size should match the number of GPUs in the flavor (h200x2 → 2, h200x8 → 8). Run hf jobs hardware to see what's available and give bigger models a longer --timeout, since they take longer to download and load. For large models, H200 flavors are usually the best value.

The --max-model-len 32768 --max-num-seqs 256 flags are specific to this model: Qwen3.5-122B is a hybrid Mamba/attention architecture with a 256K-token default context, which doesn't leave enough memory for vLLM's default batch settings. Capping the context length and concurrent-sequence count keeps it within the GPUs' memory. If a model fails to start with an out-of-memory or cache-block error, dialing these two down is the first thing to try.

Everything else (the exposed URL, the OpenAI client, the token auth) stays exactly the same.

Going further: Chat with it in a UI

Prefer a chat window over curl? A few lines of Gradio point at the same endpoint.
Add --reasoning-parser deepseek_r1 to the vllm serve command so Qwen3's thinking comes back as a separate field (not necessary, but helpful), then run this code locally (you'll just need the job ID):

    import gradio as gr
    from gradio import ChatMessage
    from huggingface_hub import get_token
    from openai import OpenAI

    client = OpenAI(base_url="https://<job_id>--8000.hf.jobs/v1", api_key=get_token())

    def chat(message, history):
        # Drop earlier "thinking" panels (they carry metadata) before resending history.
        messages = [{"role": m["role"], "content": m["content"]} for m in history if not m.get("metadata")]
        messages.append({"role": "user", "content": message})
        stream = client.chat.completions.create(model="Qwen/Qwen3-4B", messages=messages, stream=True)
        thinking, answer = "", ""
        for chunk in stream:
            delta = chunk.choices[0].delta
            # The reasoning field may be missing or null on a given chunk, so fall back to "".
            thinking += (delta.model_extra or {}).get("reasoning") or ""
            answer += delta.content or ""
            out = []
            if thinking.strip():
                status = "done" if answer.strip() else "pending"
                out.append(ChatMessage(role="assistant", content=thinking, metadata={"title": "💭 Thinking", "status": status}))
            if answer.strip():
                out.append(ChatMessage(role="assistant", content=answer))
            yield out

    gr.ChatInterface(chat).launch()

Run it, open http://127.0.0.1:7860, and chat: reasoning streams into the collapsible panel, the answer below.

Going further: SSH into the running server

Need to debug a startup failure, watch GPU memory, or tail logs interactively? You can open a shell straight into the running job. Launch it with --ssh and make sure your public key is registered at huggingface.co/settings/keys:

    hf jobs run --flavor a10g-large --expose 8000 --timeout 2h --ssh \
      vllm/vllm-openai:latest \
      vllm serve Qwen/Qwen3-4B --host 0.0.0.0 --port 8000

then connect with the job ID:

    hf jobs ssh <job_id>

You're now inside the container, where you can run nvidia-smi, inspect the process, or poke at the model directly, which makes debugging and monitoring much easier than reading logs from the outside. SSH support requires huggingface_hub >= 1.20.0.
Going further: Use it as a coding-agent backend with Pi

The same endpoint can back a terminal coding agent. Pi is a provider-agnostic agent harness. Point it at the job and you get a Read/Write/Edit/Bash agent running on your own self-hosted model.

One thing to set up first: agents drive the model through tool calls, and vLLM only accepts those if the server is launched with tool calling enabled. So relaunch with --enable-auto-tool-choice and a --tool-call-parser matching the model family (hermes for Qwen3). Agents also benefit from a stronger model, so this is a good place to bring in the bigger one:

    hf jobs run --flavor h200x2 --expose 8000 --timeout 2h \
      vllm/vllm-openai:lat

Original source: Hugging Face Blog


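Zhidx (智東西) | Generative AI

Is ChatGPT now competing on EQ rather than IQ? GPT-5.5 Instant update arrives, free to use starting tomorrow

The original lede says the GPT-5.5 Instant update is live, with a focus on better advice, decision support, and everyday conversation. From an AI-intelligence perspective, items like this are worth watching for the underlying technical progress, product rollout, industry competition, and downstream market impact.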

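36Kr (36氪) | Generative AI

Moonshot AI's Huang Zhenxin (黃震昕): Kimi aims to go head-to-head with the three leading overseas models

The original lede says the hard part of enterprise AI does not lie with the model vendors; it lies in finding a way into enterprises and driving their AI transformation through to completion.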

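MarkTechPost AI | Generative AI

DeepReinforce releases Ornith-1.0: an open-source coding model family that learns its own harness during reinforcement learning

DeepReinforce has released Ornith-1.0, an open-source model family built for agentic coding tasks. The family spans four sizes, from a 9B dense model to a 397B mixture-of-experts flagship. All checkpoints are released on Hugging Face under the MIT license. The models are post-trained on top of pretrained Gemma 4 and Qwen 3.5. Most coding agents pair a model with a fixed, hand-designed harness, whereas Ornith-1.0 learns to write its own. The DeepReinforce research team reports state-of-the-art results among open models of comparable size.

Summary: Ornith-1.0 comes in four sizes (9B, 31B, 35B-MoE, and 397B-MoE), all MIT-licensed and based on Gemma 4 and Qwen 3.5. The model learns its harness during reinforcement learning, optimizing the harness and the solution together. Ornith-1.0-397B surpasses Claude Opus 4.7 on two major benchmarks.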

36Kr (36氪) | Generative AI

What's wrong with dating Claude? New Nature study: the chats really can addle your mind

The original lede says: don't treat an AI as your husband; the conversations can easily tip into mental illness.

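36Kr (36氪) | Generative AI

Fable 5 about to be revived, with code already exposed? Anthropic's CEO was kicked out by the White House

Only just "sealed away", and Fable 5 is already set for a full-strength comeback? Traces of Claude Fable 5 code surfaced recently, to cheers across the developer community, while foreign media report that Anthropic's recent smooth run is actually down to its CEO being pushed away from the negotiating table by the White House.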

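Leiphone (雷峰網) | Generative AI

This time it's Alibaba! Anthropic has accused nearly every major Chinese LLM team

This is the largest "model distillation" case Anthropic has alleged to date.

Author: Gao Yunyi (高允毅) | Editor: Ma Xiaoning (馬曉寧)

01 Anthropic has already accused four Chinese AI companies

In just four months, four top Chinese AI companies have been named by Anthropic one after another, with no sign of it stopping. This time it is Alibaba's turn. On June 10, 2026, Anthropic submitted a letter to the U.S. Senate Banking Committee aimed squarely at Alibaba's Qwen team. The report disclosed a string of numbers: from April 22 to June 5, a full 45 days, Alibaba-linked operators used 25,000 accounts to complete 28.8 million interactions. This is the largest-scale "model distillation" figure Anthropic has made public to date.

What do 28.8 million conversations amount to? As an industry reference, today's mainstream high-quality SFT (supervised fine-tuning) datasets typically range from a few hundred thousand to a few million examples. 28.8 million interactions targeted at core capabilities are enough to cheaply "distill" a highly competitive specialized model within a specific task domain.

This put Anthropic on high alert. In its view, the activity was precisely targeted, aimed directly at the core strengths of its latest flagship model, Mythos Preview: software engineering and agentic reasoning. In the letter, Anthropic characterized it as "the largest attempt to date by a Chinese company to free-ride on a leading U.S. lab".

The timeline shows that Anthropic's pushback is escalating markedly. On February 23, 2026, Anthropic published a blog post, "Detecting and Preventing Distillation Attacks", publicly naming three Chinese AI labs: DeepSeek, Moonshot AI (Kimi), and MiniMax (稀宇科技). The report showed that about 24,000 China-linked accounts initiated more than 16 million interactions with Claude, of which MiniMax accounted for more than 13 million, Moonshot AI for more than 3.4 million, and DeepSeek for more than 150,000. From 16 million to 28.8 million interactions, the scale has nearly doubled, and Anthropic's response has escalated from "technical exposure" in February to "political pressure" in June. And the recipient this time, Banking Committee Chairman Tim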
