Hugging Face Blog · AI Agent

Five labs, five minds: building a multi-model finance drama on small models

June 6, 2026, 19:02

Summary

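A second Build Small Hackathon field report: what happens when each agent in an emergent economy is driven by a different lab's small model, and the player becomes the financier pulling the strings behind the scenes? The first version of Thousand Token Wood was a weather-god sandbox: five woodland creatures traded goods on one fine-tuned 0.5B model, and you intervened in the world with shocks and watched bubbles and crashes emerge. It was a fun toy, but more something you watched than played. v2 rebuilds it into a game you operate. You become the Patron of the Wood, a shadow financier: you lend at interest, pass along tips…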

On-site AI-compiled text

Five labs, five minds: building a multi-model finance drama on small models

Team Article, published June 6, 2026. By Lester Leong (AdmiralTaco), build-small-hackathon.

A second Build Small Hackathon field report: what happens when each agent in an emergent economy runs on a different lab's small model, and the player becomes the financier pulling the strings.

The first version of Thousand Token Wood was a weather-god sandbox: five woodland creatures on one fine-tuned 0.5B model traded goods, and you poked the world with shocks and watched bubbles and crashes emerge. It was a nice toy. It was also something you watched rather than played.

v2 rebuilt it into a game you operate. You are the Patron of the Wood, a shadow financier: you lend at interest, whisper tips that may be true or planted, short the market, bribe, and broker alliances, while a magistrate hunts you for trading on what you should not know. The creatures remember how you treated them and scheme back. And the biggest change is under the hood: every creature now thinks with a different lab's small model. This is the engineering report.

Heterogeneity is the product, not a constraint

The obvious way to run a council of agents is one model, many prompts. v2 runs four: gpt-oss-20b (OpenAI), MiniCPM3-4B (OpenBMB), Nemotron-Mini-4B (NVIDIA), and a fine-tuned Qwen 0.5B of my own. The point is not novelty for its own sake. A market is interesting when the participants genuinely differ, and four labs' models trained on different data with different post-training are about as different as small models get. The owl hoards differently than the fox speculates. The council is a live argument, not a script.

Standing four distinct models up on one platform surfaced the real lesson: the friction is almost entirely at the serving layer, not the modeling layer.

Current vLLM (0.22.1) JIT-compiles kernels at load and needs the CUDA toolkit (nvcc) present.
A lean base image does not ship it, so all four models failed identically with "could not find nvcc" until I based them on a CUDA devel image. This was not a gpt-oss quirk; it was universal to the vLLM version. One image fix unblocked everything.

gpt-oss-20b runs in its native MXFP4 quantization and fits a 24GB L4 with room to spare; no high-end GPU needed. It also speaks a channel format that wraps the answer in an analysis preamble, so the consumer has to extract the final channel.

MiniCPM3 needed trust_remote_code; Nemotron loaded clean. Per-model footguns, each a one-line config.

The thing that made four heterogeneous models tractable was the same primitive that made one model tractable in v1: a tolerant JSON parse-and-repair layer that every model's output flows through. Different tokenizers and formatting habits produce different malformations; the parser drops what it cannot salvage and the simulation never crashes. Build that layer once and adding a model is a config entry, not a refactor.

Information asymmetry needs a firewall

The dramatic core of v2 is the insider tip. You can whisper a tip to a creature that is true (a real forecast of the next market mania the deck will draw, your genuine edge) or false (bait). Acting on a true tip and profiting raises your heat; cross a threshold and the magistrate opens an investigation that ends in a fine, frozen assets, or exile.

For that to be a real game, the truth of a tip must be hidden from the creatures. They see the rumor text; they must never see the flag. This is a security property, not a UI nicety, and small-model agents make it sharp: everything the model could repeat back is whatever you put in its prompt. So the hidden flag lives off-prompt entirely (on the player's ledger), it is stripped from the public event record at construction, and the only thing the narrator ever summarizes is public events. A single test scans every creature's full prompt, every turn, for the banned tokens.
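
A prompt scan of this kind can be sketched in a few lines. The following is a minimal illustration, not the project's actual code: the class names, the prompt wording, and the choice of banned tokens (`is_true`, `tip_truth`) are all assumptions made for the example. It shows the two halves of the idea: the truth flag is stored only on the ledger and never enters the public event, and a scan over built prompts reports any banned token it finds.

```python
from dataclasses import dataclass, field

# All names here are hypothetical; the article does not publish its data model.
BANNED_TOKENS = ("is_true", "tip_truth")  # assumed markers of the hidden flag


@dataclass(frozen=True)
class PublicEvent:
    """The only tip record creatures and the narrator ever see."""
    turn: int
    target: str
    rumor: str  # rumor text only; there is no truth field to leak


@dataclass
class PatronLedger:
    """Off-prompt store on the player's side; the truth flag lives only here."""
    tip_truth: dict = field(default_factory=dict)  # (turn, target) -> bool

    def whisper(self, turn: int, target: str, rumor: str, is_true: bool) -> PublicEvent:
        self.tip_truth[(turn, target)] = is_true
        # The flag is dropped at construction: PublicEvent cannot carry it.
        return PublicEvent(turn=turn, target=target, rumor=rumor)


def build_prompt(creature: str, events: list) -> str:
    """Build a creature's prompt from the public events addressed to it."""
    lines = [f"You are {creature}, a trader in the wood."]
    for e in events:
        if e.target == creature:
            lines.append(f"Turn {e.turn}: a rumor reaches you: {e.rumor}")
    return "\n".join(lines)


def find_leaks(prompts: list, banned: tuple = BANNED_TOKENS) -> list:
    """Return (prompt_index, token) for every banned token found in any prompt."""
    return [
        (i, tok)
        for i, prompt in enumerate(prompts)
        for tok in banned
        if tok in prompt.lower()
    ]
```

In a test suite, one would build every creature's prompt each turn and assert that `find_leaks` returns an empty list. A substring scan like this only catches serialized forms of the flag, so it complements, rather than replaces, keeping the flag out of the public record by construction.
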
That test is the most important one in the suite. When you give an agent secret information, assume it will leak unless a test proves it cannot.

Memory is cheap drama if you bound it

Creatures carry persistent relationships: a signed sentiment toward the Patron and toward each other, nudged by events (you shorted my crop, you repaid your loan, you allied me with a rival). A creature that turns hostile refuses your loans and quotes you worse; allied creatures stop undercutting each other and behave like a cartel.

The trap is prompt inflation. Raw history grows without bound and a small model drowns in it. The fix is to never put history in the prompt: the model sees a one-line bucketed summary ("you feel warmly toward Oona, wary of the Patron"), capped to the few strongest feelings, derived from integer sentiment. Notes are kept for traces but bounded and never shown. The behavioral bias is part emergent (the summary nudges the model) and part mechanical (a strongly hostile creature deterministically refuses), so it is observable and testable rather than a hope.

What actually happened

A representative council run, with the full v2 mechanics live:

Lever | Result
Models in the council | 4 labs, all under the 32B cap, served on Modal
Fine-tuned 0.5B reliability | 0% self-buys, 100% valid offers (beats its 3B teacher)
Truth firewall | 0 leaks of a tip's hidden flag across every prompt scanned
Insider tip edge | a true-tip pre-position settles a positive P&L; a false tip does not
Heat to investigation | two clean suspicious wins cross the magistrate's line
Ruin | a margin call and a loan default banish a creature, who returns a chapter later

A single seeded run exercising the Patron, the information war, relationships, and leverage end to end.

Takeaways for building with small models

A small model is a reliable format generator and an unreliable reasoner; you close the gap with structure, prompting, and a small fine-tune, not with scale.
A heterogeneous council is more interesting than a homogeneous one and costs you only config once the serving layer is solid. Secret information given to an agent is a firewall problem, and the firewall belongs in the data flow, proven by a test, not in a prompt instruction. And persistent memory is the cheapest way to make agents feel alive, as long as the prompt only ever sees a bounded summary.

Small models, big adventures. The whole council is open, and so are the traces.
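
The tolerant JSON parse-and-repair layer discussed under "Heterogeneity is the product, not a constraint" could look roughly like the sketch below. It is an illustration under stated assumptions, not the author's parser: the action schema (`action`, `item`, `qty`), the allowed action names, and both function names are hypothetical. It pulls the first JSON object out of surrounding chatter, applies two cheap repairs (closing a truncated object and dropping trailing commas), validates the fields, and returns `None` for anything it cannot salvage so the caller can skip the turn instead of crashing.

```python
import json
import re

# Hypothetical schema: the article does not list its action fields.
ALLOWED_ACTIONS = {"buy", "sell", "hold"}


def extract_json_object(text: str):
    """Return the first {...} span in text, closing it if truncated; None if absent."""
    start = text.find("{")
    if start == -1:
        return None
    depth = 0
    in_string = False
    escaped = False
    for i in range(start, len(text)):
        ch = text[i]
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return text[start:i + 1]
    # Output was cut off: close any open string and braces as a best-effort repair.
    return text[start:] + ('"' if in_string else "") + "}" * depth


def parse_action(raw: str):
    """Parse a model reply into a validated action dict, or None if unsalvageable."""
    candidate = extract_json_object(raw)
    if candidate is None:
        return None
    # Drop trailing commas before a closing brace or bracket (crude: ignores strings).
    candidate = re.sub(r",\s*([}\]])", r"\1", candidate)
    try:
        data = json.loads(candidate)
    except json.JSONDecodeError:
        return None
    action = str(data.get("action", "")).lower()
    if action not in ALLOWED_ACTIONS:
        return None
    try:
        qty = int(data.get("qty", 0))
    except (TypeError, ValueError):
        return None
    if qty < 0:
        return None
    return {"action": action, "item": str(data.get("item", "")), "qty": qty}
```

Because every model's reply passes through the same `parse_action` call, a newly added model only needs its serving config; its particular formatting habits are absorbed here or dropped.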

Original source: Hugging Face Blog


TechWeb · AI Agent

NetEase Youdao pivots fully to AI; full-scenario Agent lineup debuts at the book fair


Just now · Read analysis

Hugging Face Blog · AI Agent

MosaicLeaks: Can your research agent keep a secret?

Enterprise Article, published June 18, 2026. By Alexander Gurung (agurung, ServiceNow) and Rafael Pardinas (rafapi-snow, ServiceNow).

TL;DR: Deep research agents increasingly combine private local documents with external tools like web retrieval, creating a privacy risk: an agent's external queries may leak sensitive information. MosaicLeaks proposes a new deep-research task with multi-hop questions that interleave public and private information. Across the models we tested, agents frequently leaked private information, and training only for task performance made it worse. We propose a mosaic-leakage-aware RL training method, Privacy-Aware Deep Research (PA-DR), which raises strict chain success (the share of chains

17 hours ago · Read analysis