TinyFish Launches BigSet: An Open-Source Multi-Agent System That Builds Structured Live Datasets from Plain-English Descriptions
Building a structured dataset from the web is still a pipeline problem, but TinyFish's new open-source tool BigSet aims to simplify the process.

Building a structured dataset from the web is a cumbersome, multi-step pipeline problem, from identifying a data source to handling deduplication and scheduling refreshes. TinyFish is releasing BigSet, an open-source multi-agent system licensed under AGPL-3.0, to address this workflow directly. BigSet takes a natural-language description as input and returns a structured, exportable dataset built from live web data. The full codebase is available on GitHub.
BigSet positions itself as the layer between a data requirement and a usable table. You describe what you want in a sentence, and the system infers the schema, dispatches agents to gather data, deduplicates results, and produces a downloadable CSV or XLSX file.
For example, you can type 'YC companies that are currently hiring engineers, with their funding stage, location, and number of open roles,' and BigSet will infer which columns that implies, find the relevant entities on the web, and fill in the rows.
BigSet also offers scheduled refreshes, so datasets update automatically. You set a cadence (30 minutes, 6 hours, 12 hours, daily, or weekly), and the agents re-run on that schedule.
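As a rough illustration of how a refresh cadence like this could be turned into a next-run time, here is a minimal Python sketch. The cadence options come from the article; the label strings, the `CADENCES` mapping, and the `next_refresh` and `is_due` helpers are hypothetical and are not taken from BigSet's codebase.

```python
from datetime import datetime, timedelta

# Cadence options as listed in the article. The label strings and this
# mapping are assumptions for illustration, not BigSet's actual config.
CADENCES = {
    "30 minutes": timedelta(minutes=30),
    "6 hours": timedelta(hours=6),
    "12 hours": timedelta(hours=12),
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
}


def next_refresh(last_run: datetime, cadence: str) -> datetime:
    """Return when a dataset last refreshed at `last_run` is next due."""
    if cadence not in CADENCES:
        raise ValueError(f"unsupported cadence: {cadence!r}")
    return last_run + CADENCES[cadence]


def is_due(last_run: datetime, cadence: str, now: datetime) -> bool:
    """True once the full cadence interval has elapsed since the last run."""
    return now >= next_refresh(last_run, cadence)
```

A scheduler loop would call `is_due` for each dataset and re-dispatch the agents whenever it returns `True`.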
The table stays current without anyone re-running the task manually. One practical note: dataset generation takes 2-5 minutes, because the agents are doing real web research, which means searching, fetching pages, and verifying data.
BigSet's architecture is a structured two-tier agent system.
Step 1, Schema Inference: Claude Sonnet infers the dataset schema, including column names, data types, primary keys, and where to look for the data.
Step 2, Orchestrator Agent: the orchestrator runs broad discovery using TinyFish Search to identify which entities match your description and where to find them.
Step 3, Sub-Agent Fan-Out: the orchestrator dispatches sub-agents in parallel, one per entity.
Step 4, Deduplication and Source Attribution: the system deduplicates rows on the primary key, and each row carries source attribution.
Step 5, Export: the final result is a structured table available as a CSV or XLSX download.
BigSet is self-hosted: you run it on your own infrastructure using Docker.
To get started, you need Docker and Make installed, plus API keys from three services: OpenRouter, TinyFish, and Clerk. Setup involves configuring the environment, installing dependencies, and starting the services. BigSet ships with 9 curated datasets, which you can load with a single command.
Under the hood, the schema-inference model reads your sentence and decides what a row means, picking columns, types, and a primary key; that primary key is what deduplication later keys on. The orchestrator agent runs a broad web search to answer one question: which entities exist? Each entity then gets its own isolated sub-agent, running in parallel with the others under a hard tool budget.
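The fan-out, budget, and deduplication ideas described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not BigSet's implementation: `Row`, `ToolBudget`, `sub_agent`, `dedupe`, `build_dataset`, and the stubbed `fake_tool_call` are all hypothetical names, the "tool budget" is interpreted here as a cap on the number of tool calls, the keep-first policy for duplicates is an assumption, and the `example.com` URLs are placeholders.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass
class Row:
    key: str      # primary-key value (assumed chosen at schema inference)
    values: dict  # remaining column values
    source: str   # where the row came from (source attribution)


class ToolBudget:
    """Hard cap on the number of tool calls one sub-agent may make."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.used = 0

    def spend(self) -> bool:
        """Consume one call; return False once the cap is reached."""
        if self.used >= self.limit:
            return False
        self.used += 1
        return True


# Hypothetical stub data: tool calls each entity needs before data is found.
CALLS_NEEDED = {"Acme": 1, "Globex": 2, "Initech": 5}


async def fake_tool_call(entity: str, call_no: int) -> Optional[Row]:
    """Stand-in for a real search or page fetch."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    if call_no < CALLS_NEEDED[entity]:
        return None
    slug = entity.lower()
    return Row(key=slug, values={"name": entity},
               source=f"https://example.com/{slug}")


async def sub_agent(entity: str, budget: ToolBudget) -> Optional[Row]:
    """Research one entity, giving up when the budget is exhausted."""
    while budget.spend():
        row = await fake_tool_call(entity, budget.used)
        if row is not None:
            return row
    return None


def dedupe(rows: list[Row]) -> list[Row]:
    """Keep the first row seen for each primary key."""
    seen: dict[str, Row] = {}
    for row in rows:
        seen.setdefault(row.key, row)
    return list(seen.values())


async def build_dataset(entities: list[str], max_calls: int = 3) -> list[Row]:
    # Fan out: one sub-agent per entity, each with its own isolated budget.
    results = await asyncio.gather(
        *(sub_agent(e, ToolBudget(max_calls)) for e in entities)
    )
    return dedupe([r for r in results if r is not None])


if __name__ == "__main__":
    for row in asyncio.run(build_dataset(["Acme", "Globex", "Acme", "Initech"])):
        print(row.key, row.source)
```

In this sketch the duplicate "Acme" entry collapses to a single row, and "Initech" is dropped because its stubbed lookup needs more calls than the budget of three allows. Giving each sub-agent its own budget object means one slow entity cannot consume calls meant for another.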
Source: MarkTechPost