How to Use AgentTrove: Streaming 1.7M Agentic Traces and Building a Clean ShareGPT SFT Dataset in Python
This tutorial explores AgentTrove, one of the largest open-source collections of agentic interaction traces, and provides a hands-on guide on how to work with it efficiently.

['In this tutorial, we explore AgentTrove, one of the largest open-source collections of agentic interaction traces, and learn how to work with it efficiently. Instead of downloading the full dataset, we use streaming to inspect rows, detect the conversation schema, normalize agent turns, and understand how user, assistant, system, and tool messages are structured. We also build utilities to parse command-style assistant outputs, render complete trajectories in a readable format, and study how agents interact with tools across different tasks.
Additionally, we create a lightweight analytical workflow that samples thousands of traces, converts them into a DataFrame, summarizes turn-level statistics, visualizes important dataset patterns, and exports successful traces into a clean ShareGPT-style JSONL format for supervised fine-tuning.', 'We begin by installing the required libraries and importing the core tools needed for streaming, analysis, and visualization. We define the AgentTrove repository, open the dataset in streaming mode, and avoid downloading the full dataset locally. We then inspect the first row to understand the available columns and get an initial view of the dataset schema.
A defensive function is created to automatically detect the column that contains the conversation or trace data. We then normalize each turn into a consistent role-content format so that different dataset schemas can be handled smoothly.', 'We define a command-extraction utility that reads assistant responses and attempts to parse shell commands from JSON-style outputs. We clean possible code fences, load the content as JSON, and recursively search through common command-related fields.
This helps us identify tool-like actions inside agent trajectories and measure how often agents issue executable commands. A trace-rendering function is built to print the metadata and the full conversation trajectory in a readable format. We label each turn by role, truncate long messages for clarity, and show parsed commands under assistant messages.', 'We stream a sample of rows from AgentTrove and collect useful statistics, including turn counts, tool usage, total characters, and parsed command counts.
We store these lightweight features in a pandas DataFrame to make the dataset easier to summarize and analyze. We also print distribution tables for fields such as source, teacher model, model provider, and result to understand where the traces originate. Four visualizations are created to explore the sampled traces from different angles, plotting the top task sources, teacher models, turn-count distribution, and the relationship between assistant turns and parsed commands.', 'Finally, we define a success filter that retains traces marked as resolved, passed, correct, or positively rewarded.
Source: MarkTechPost