TinyFish Launches BigSet: An Open-Source Multi-Agent System That Builds Structured Live Datasets from Plain-English Descriptions

Key Summary
Building a structured dataset from the web is still a pipeline problem. You identify a data source, write or configure a scraper, design a schema, handle deduplication, schedule refreshes, and fix breakage when upstream sites change. That process stays roughly the same whether you do it once or a hundred times.

TinyFish is releasing BigSet to address that workflow directly. BigSet is an open-source multi-agent system licensed under AGPL-3.0. It takes a natural-language description as input and returns a structured, exportable dataset built from live web data. The full codebase is available on GitHub.

What is BigSet

BigSet positions itself as the layer between a data requirement and a usable table. You describe what you want in a sentence. The system infers the schema, dispatches agents to gather data, deduplicates results, and produces a downloadable CSV or XLSX file.

A practical example: you type “YC companies that are currently hiring engineers, with their funding stage, location, and number of open roles.” BigSet infers what columns that implies, finds the relevant entities on the web, and fills in the rows. You don’t specify a URL. You don’t configure selectors. You describe the data.

A scheduled refresh feature lets datasets update automatically. You set a cadence (30 minutes, 6 hours, 12 hours, daily, or weekly) and the agents re-run on that schedule. The table stays current without re-running the task manually.

One practical note: dataset generation takes 2–5 minutes. The agents are doing real web research: searching, fetching pages, and verifying data. It is not an instant result.

How the Multi-Agent Architecture Works

The architecture here is worth understanding concretely. BigSet is not a single LLM call with a web search tool attached. It runs a structured two-tier agent system.

Step 1 (Schema Inference): When you submit a description, Claude Sonnet (accessed via OpenRouter) infers the dataset schema. This includes column names, data types, primary keys, and where to look for the data. This happens before any web access. The default is anthropic/claude-sonnet-4.6, but it is set by the SCHEMA_INFERENCE_MODEL env var and can be pointed at any OpenRouter model slug.

Step 2 (Orchestrator Agent): A separate orchestrator agent runs broad discovery using TinyFish Search. It identifies which entities match your description and where to find them. The model defaults to Qwen (qwen/qwen3.7-max, via OpenRouter), configurable through POPULATE_ORCHESTRATOR_MODEL.

Step 3 (Sub-Agent Fan-Out): The orchestrator dispatches sub-agents in parallel. Each sub-agent handles exactly one entity, that is, one row in the final table. Each agent has a tool budget capped at 6 calls. It uses TinyFish Fetch to retrieve real page content, extracts the relevant fields, and inserts a row.

Step 4 (Deduplication and Source Attribution): The system applies primary key deduplication. Each row carries source attribution: a traceable link to the web page the data came from. Quota enforcement per user is also applied at this stage.

Step 5 (Export): The final result is a structured table available as CSV or XLSX download.

Tech Stack

| Layer | Technology |
| --- | --- |
| Frontend | Next.js 16, React 19, Tailwind 4 |
| Backend | Fastify, TypeScript |
| Auth | Clerk |
| Database | Convex (self-hosted) |
| AI Orchestration | Mastra workflows + Vercel AI SDK + OpenRouter |
| LLM: Schema Inference | Claude Sonnet via OpenRouter |
| LLM: Orchestrator Agent | Qwen via OpenRouter |
| Data Collection | TinyFish Search, TinyFish Fetch, TinyFish Browser |
| Table View | TanStack Table + react-window virtualization |
| Exports | CSV (built-in) + XLSX via SheetJS |

How to Set It Up and Use It

BigSet is self-hosted. You run it on your own infrastructure using Docker. Below is a complete walkthrough from clone to first dataset.

Created by Marktechpost team

Prerequisites

You need Docker and Make installed. You also need API keys from three services before running anything.
| Service | Purpose | Where to get it |
| --- | --- | --- |
| TinyFish | Web search and page fetching | agent.tinyfish.ai/api-keys |
| OpenRouter | LLM calls (schema inference and agents) | openrouter.ai/settings/keys |
| Clerk | User authentication | dashboard.clerk.com |

OpenRouter is pay-as-you-go. According to the README, $5–10 in credits is enough to start.

Step 1: Clone the repo and copy the env file

    git clone https://github.com/tinyfish-io/bigset.git
    cd bigset
    cp .env.example .env

Open .env in your editor. You will fill in the variables below.

Step 2: Add your TinyFish API key

TinyFish handles all web search and page fetching in BigSet.

1. Go to agent.tinyfish.ai/api-keys and create a key.
2. In your .env, set:

    TINYFISH_API_KEY=your_tinyfish_key_here

Step 3: Add your OpenRouter API key

OpenRouter routes LLM calls to Claude Sonnet (for schema inference) and Qwen (for the orchestrator agent).

1. Go to openrouter.ai/settings/keys and create a key.
2. Add $5–10 in credits.
3. In your .env, set:

    OPENROUTER_API_KEY=your_openrouter_key_here

Step 4: Set up Clerk for authentication

Clerk manages user sign-in. The setup takes approximately two minutes.

1. Go to dashboard.clerk.com and create a new application.
2. Choose a sign-in method (email, Google, or GitHub).
3. Go to Configure → API Keys and copy both keys:

    NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY=pk_...
    CLERK_SECRET_KEY=sk_...

4. Go to Configure → JWT Templates, click New template, select the Convex template, and save it.
5. Go to Configure → Settings (or Domains) and copy the Issuer URL, which looks like https://your-app-name.clerk.accounts.dev:

    CLERK_JWT_ISSUER_DOMAIN=https://your-app-name.clerk.accounts.dev

Step 5: Start everything

    make dev

make dev handles the full startup sequence: it validates your .env, installs dependencies, starts Postgres and Convex, waits for Convex to be healthy, auto-generates the CONVEX_SELF_HOSTED_ADMIN_KEY (no manual step needed), pushes the Convex schema, and starts the frontend, backend, and Mastra.

Once all services are ready, three URLs become available:

| Service | URL |
| --- | --- |
| BigSet app | localhost:3500 |
| Convex dashboard | localhost:6791 |
| Mastra Studio (workflow inspector) | localhost:4111 |

Open localhost:3500 and click Get started to sign in.

Step 6 (optional): Load the curated public datasets

BigSet ships with 9 curated datasets (AI companies hiring, GPU retail prices, frontier model pricing, and others). To load them:

    make seed-public-datasets

This command is idempotent, so it is safe to run more than once.

Your full .env reference

| Variable | Required | Source |
| --- | --- | --- |
| TINYFISH_API_KEY | Yes | agent.tinyfish.ai/api-keys |
| OPENROUTER_API_KEY | Yes | openrouter.ai → Settings → Keys |
| NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY | Yes | Clerk dashboard → API Keys |
| CLERK_SECRET_KEY | Yes | Clerk dashboard → API Keys |
| CLERK_JWT_ISSUER_DOMAIN | Yes | Clerk dashboard → Settings/Domains |
| CONVEX_SELF_HOSTED_ADMIN_KEY | Auto | Auto-generated by make dev on first run |
| RESEND_API_KEY | Optional | For dataset-ready email notifications |
| NEXT_PUBLIC_POSTHOG_KEY | Optional | For product analytics |

The .env.example also contains pre-filled local service URLs (CLIENT_ORIGIN, CONVEX_URL, NEXT_PUBLIC_CONVEX_URL) and optional model overrides (SCHEMA_INFERENCE_MODEL, POPULATE_ORCHESTRATOR_MODEL, INVESTIGATE_SUBAGENT_MODEL) that work as-is. Leave them at their defaults unless you have a reason to change them.
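The article says make dev validates the .env itself, so a separate check is not required. Purely as an illustration of which variables in the reference above are mandatory, here is a minimal Python sketch that reports the required keys still missing from a .env file. The function names (parse_env, looks_unfilled, missing_keys) and the placeholder heuristic are assumptions made for this example and are not part of BigSet.

```python
from pathlib import Path

# The five variables marked "Yes" in the .env reference.
REQUIRED_KEYS = [
    "TINYFISH_API_KEY",
    "OPENROUTER_API_KEY",
    "NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY",
    "CLERK_SECRET_KEY",
    "CLERK_JWT_ISSUER_DOMAIN",
]


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def looks_unfilled(value: str) -> bool:
    """Assumed heuristic: empty, 'your_...' or '...'-terminated values are placeholders."""
    return not value or value.startswith("your_") or value.endswith("...")


def missing_keys(text: str) -> list[str]:
    """Return required keys that are absent or still look like placeholders."""
    values = parse_env(text)
    return [key for key in REQUIRED_KEYS if looks_unfilled(values.get(key, ""))]


if __name__ == "__main__":
    env_path = Path(".env")
    if not env_path.exists():
        print("No .env found; run `cp .env.example .env` first.")
    else:
        missing = missing_keys(env_path.read_text())
        print("Missing or unfilled:", ", ".join(missing) if missing else "none")
```

Run from the repository root, this prints the keys that still need values. CONVEX_SELF_HOSTED_ADMIN_KEY is deliberately left out of the list because make dev generates it on first run.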
Useful commands during development

| Command | What it does |
| --- | --- |
| make dev | Start everything, or recover from any broken state |
| make down | Stop all containers (data is preserved) |
| make clean | Stop containers, delete all data, and clear the admin key |
| make convex-push | Deploy Convex schema changes after editing frontend/convex/ |
| make seed-public-datasets | Load the 9 curated public datasets |

If something breaks, run make dev again. It is designed to be self-healing. For a completely clean restart, run make clean and then make dev.

A Complete Worked Example: From One Sentence to a CSV

Theory is easier to trust when you c
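As a rough illustration of the flow described under How the Multi-Agent Architecture Works, the Python sketch below mimics the fan-out, tool budget, primary-key deduplication, source attribution, and CSV export steps. It is not BigSet's code: fake_fetch stands in for a real page fetch, every function and class name is hypothetical, and the keep-first deduplication policy is an assumption, since the article does not say how conflicting duplicates are resolved.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

TOOL_BUDGET = 6  # per-sub-agent cap on tool calls, as stated in the article


class BudgetExceeded(RuntimeError):
    """Raised when a sub-agent tries to exceed its tool budget."""


@dataclass
class ToolBudget:
    remaining: int = TOOL_BUDGET

    def spend(self) -> None:
        if self.remaining <= 0:
            raise BudgetExceeded("tool budget exhausted")
        self.remaining -= 1


def fake_fetch(url: str, budget: ToolBudget) -> dict:
    """Hypothetical stand-in for a page fetch; costs one tool call."""
    budget.spend()
    name = url.rstrip("/").rsplit("/", 1)[-1]
    return {"company": name.title(), "location": "Unknown"}


def run_sub_agent(entity_url: str) -> dict:
    """One sub-agent handles exactly one entity and returns one row."""
    budget = ToolBudget()
    row = fake_fetch(entity_url, budget)
    row["source_url"] = entity_url  # source attribution travels with the row
    return row


def dedupe(rows: list[dict], primary_key: str) -> list[dict]:
    """Keep the first row seen for each primary-key value (assumed policy)."""
    seen: dict[str, dict] = {}
    for row in rows:
        seen.setdefault(row[primary_key], row)
    return list(seen.values())


def build_dataset(entity_urls: list[str], primary_key: str = "company") -> list[dict]:
    """Fan out one sub-agent per entity in parallel, then deduplicate."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        rows = list(pool.map(run_sub_agent, entity_urls))
    return dedupe(rows, primary_key)


def to_csv(rows: list[dict], columns: list[str]) -> str:
    """Serialize rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    urls = [
        "https://example.com/acme",
        "https://example.com/globex",
        "https://example.org/acme",  # duplicate primary key, dropped by dedupe
    ]
    print(to_csv(build_dataset(urls), ["company", "location", "source_url"]))
```

The thread pool mirrors the one-agent-per-row design: each worker owns its own budget, so one expensive entity cannot consume calls meant for another.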