Hugging Face Blog · Generative AI

How an AI Agent Chained Two Hugging Face Spaces to Build a 3D Paris Gallery

June 9, 2026, 10:46

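Key takeaways

An AI agent built a 3D Gaussian splat website showcasing Paris landmarks from scratch by chaining two Hugging Face Spaces. Along the way, I never opened an image generator or a 3D reconstruction tool; the agent called two Hugging Face Spaces directly to produce every asset (the images and the 3D splats), then assembled them into a cinematic viewer. The result is live as a static Space: 👉 mishig/monuments-de-paris. This article explores how that became possible, and why I think it previews a new direction for multimedia software development: building from modular components.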

Site AI-compiled text

How an Agent Built a 3D Paris Gallery by Chaining Two Hugging Face Spaces

Community Article by Mishig Davaadorj (mishig), published June 9, 2026

I asked a coding agent to build a beautiful website showcasing the monuments of Paris as 3D Gaussian splats. I never opened an image generator. I never touched a 3D reconstruction tool. The agent produced every asset (the images and the 3D splats) by calling two Hugging Face Spaces directly, then wired them into a cinematic viewer.

Here's the result, live as a static Space: 👉 mishig/monuments-de-paris

This post is about how that's possible now, and why I think it's a preview of how a lot of multimedia software gets built from here on.

The building-block economy comes for multimedia

Mitchell Hashimoto recently described a shift he calls the building block economy: the most effective path to software is no longer a polished monolith, but small, well-documented components that others (increasingly agents) can assemble. His key observation: AI is okay at building everything from scratch, but it is really good at gluing together proven pieces.

That thesis has mostly been told with code libraries. But the same forces are hitting multimedia AI. The hard part of using a state-of-the-art image model, a video model, a TTS model, or a 3D reconstruction model was never the model. It was the integration: SDKs, weights, GPUs, input formats, polling. If each model were instead a documented, callable block, an agent could glue them together the same way it globs together npm packages.

That's exactly what Hugging Face Spaces have quietly become.

Every Space is a building block, via agents.md

The Hub hosts thousands of state-of-the-art models (a huge share of them open-weights), and most are deployed as interactive Spaces.
As of now, every Gradio Space also exposes a plain-text agents.md that tells an agent exactly how to call it:

    curl https://huggingface.co/spaces/VAST-AI/TripoSplat/agents.md

returns everything needed in one shot: the schema URL, the call and poll templates, how to upload files, and the auth hint:

    API schema:    GET  .../gradio_api/info
    Call endpoint: POST .../gradio_api/call/v2/{endpoint}  {"param_name": value, ...}
    Poll result:   GET  .../gradio_api/call/{endpoint}/{event_id}
    File inputs:   POST .../gradio_api/upload  -F "[email protected]"
    Auth:          Bearer $HF_TOKEN

No client library. No hardcoded integration. An agent reads that, and it can drive the Space end to end. Set an HF_TOKEN and you're going.

You can find these instructions on any Gradio Space via its Agents button.

The real unlock is chaining: the output of one Space becomes the input to the next. Prompt → image → 3D. That's the whole pipeline behind this gallery.

The worked example: Paris monuments → splats

The agent chained two Spaces:

Image: an image-generation Space turned each monument into a clean, dark-background "specimen" shot (and the Eiffel Tower into a little diorama on a plinth). Prompt in, image out.

Splat: VAST-AI/TripoSplat reconstructed a 3D Gaussian splat (.ply) from each single image. Image in, 3D out.

The agent generated six source images, all isolated on black, ready for single-image 3D reconstruction.

From there the agent did the "glue" work too. It noticed TripoSplat outputs are Y-down and flipped them upright, auto-framed each monument, compressed the .ply files to .ksplat (~3× smaller, so they load fast), built a Three.js viewer with a scroll-to-switch and drag-to-rotate UI, and deployed the whole thing as a static Space. The only human inputs were taste-level: "make it zoomed out," "replace the obelisk with something better for splatting," "the transition lingers too long."

Several of those steps were the agent reacting to reality.
A wide glass pyramid splats poorly. A thin obelisk is dull. A single-view reconstruction infers the back. That is exactly the "outsourced R&D, fast iteration" loop the building-block economy predicts, except the R&D was a conversation.

Two prompts, a whole new gallery

The real test of a building block is how cheaply you can reuse it. Once this pipeline existed, spinning up entirely new galleries cost about one sentence each. "Create a similar Space with splats for Japan," then the same for Egypt, and the agent did the rest: six monument images, six splats, compression, a viewer, and a deployed Space, per country.

🏛️ Monuments of Egypt: the Great Pyramid, the Sphinx, Abu Simbel, the mask of Tutankhamun, Karnak, the Colossi of Memnon.

⛩️ Monuments of Japan: Tokyo Tower, Himeji Castle, Kinkaku-ji, Osaka Castle, the Great Buddha of Kamakura, the Itsukushima torii.

Same two Spaces, same agents.md, only the prompts changed. That is the building-block economy in one line: the marginal cost of a new multimedia app falls toward the cost of describing it.

Why this matters

Models become composable. A SOTA splat model and a SOTA image model, from different orgs, chained with zero integration code. The Hub's open-weights catalog turns into a library of callable multimedia primitives.

Agents prefer what's documented and reachable. agents.md makes a Space trivially reachable, so an agent will pick it over a model it has to set up by hand. That is the same dynamic Hashimoto flags for open-source libraries.

The barrier was integration, and it's largely gone. "Turn a prompt into a rotating 3D monument" used to be a project. Here it was a step in a pipeline.
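
The call-then-poll pattern from the agents.md templates quoted earlier can be sketched in Python. This is a minimal illustration under stated assumptions, not the pipeline from the Space repo: the function names and the injected `post`/`get` transports are hypothetical, the base URL is a placeholder because the templates elide it, and the sketch assumes the POST reply is JSON carrying an `event_id` and that the poll endpoint streams server-sent events ending in a `complete` event, as Gradio's documented curl workflow does. Check a Space's own agents.md and schema before relying on any of this.

```python
import json


def call_url(base, endpoint):
    # "Call endpoint" template: POST <base>/gradio_api/call/v2/{endpoint}
    return f"{base.rstrip('/')}/gradio_api/call/v2/{endpoint.strip('/')}"


def poll_url(base, endpoint, event_id):
    # "Poll result" template: GET <base>/gradio_api/call/{endpoint}/{event_id}
    return f"{base.rstrip('/')}/gradio_api/call/{endpoint.strip('/')}/{event_id}"


def auth_headers(token):
    # "Auth" hint from agents.md: Bearer $HF_TOKEN
    return {"Authorization": f"Bearer {token}", "Content-Type": "application/json"}


def parse_sse(text):
    """Split a server-sent-events body into (event, data) pairs (assumed format)."""
    events = []
    for block in text.replace("\r\n", "\n").strip().split("\n\n"):
        event, data = None, []
        for line in block.splitlines():
            if line.startswith("event:"):
                event = line[len("event:"):].strip()
            elif line.startswith("data:"):
                data.append(line[len("data:"):].strip())
        if event is not None or data:
            payload = json.loads("\n".join(data)) if data else None
            events.append((event, payload))
    return events


def final_result(text):
    """Return the payload of the first 'complete' event; raise on 'error'."""
    for event, data in parse_sse(text):
        if event == "complete":
            return data
        if event == "error":
            raise RuntimeError(f"Space reported an error: {data!r}")
    raise RuntimeError("stream ended without a 'complete' event")


def run(base, endpoint, payload, token, post, get):
    """Call one endpoint, then poll for its result.

    post(url, headers, body) -> str and get(url, headers) -> str are
    caller-supplied transports (hypothetical), so no HTTP library is assumed.
    """
    headers = auth_headers(token)
    reply = json.loads(post(call_url(base, endpoint), headers, json.dumps(payload)))
    # Assumption: the POST reply carries the id used in the poll URL.
    return final_result(get(poll_url(base, endpoint, reply["event_id"]), headers))
```

Chaining two Spaces then amounts to calling `run` twice and feeding the first result into the second payload. The parameter names in each payload are not guessed here; they would come from each Space's schema (`GET .../gradio_api/info`).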

Try it yourself

Point your own agent at a Space's agents.md and let it cook:

    # image generation
    curl https://huggingface.co/spaces/ideogram-ai/ideogram4/agents.md

    # single-image to 3D gaussian splat
    curl https://huggingface.co/spaces/VAST-AI/TripoSplat/agents.md

Paste either link into your coding agent (Claude Code, etc.), set your HF_TOKEN, and ask it to build something. The full, reproducible pipeline for this gallery, the scripts that hit those two agents.md endpoints, lives in the Space repo.

The building blocks are sitting right there on the Hub. The agents already know how to glue.
