Building Supervised Fine-Tuning Data from NVIDIA Open-SWE-Traces: Trajectory Parsing, Patch Analysis, Token Budgets, and Tool-Use Metrics

In this tutorial, we explore the Open-SWE-Traces dataset as a practical resource for studying and preparing agentic software-engineering trajectories for fine-tuning. We stream the dataset directly from Hugging Face, so we can work with a large dataset efficiently in Google Colab without downloading everything locally. We inspect individual records, normalize multi-turn agent conversations, parse final code patches, extract useful metadata, and build an analysis DataFrame to understand trajectory length, tool usage, patch size, language distribution, and resolution outcomes. We then use these insights to create a curated supervised fine-tuning subset that keeps only high-quality trajectories based on success labels, token limits, language filters, and patch availability.
We start by installing and importing the core libraries needed for streaming, parsing, analysis, and visualization. We configure pandas and matplotlib to ensure our tables and plots remain readable in Google Colab. We also define the dataset name, agent/model combinations, sampling size, and SFT filtering settings that control the rest of the tutorial.
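A minimal sketch of such a setup cell follows. The repository id, config names, sample size, and filter thresholds are all assumptions chosen for illustration, so check them against the dataset card before use. The display helper imports pandas and matplotlib lazily and expects both to be installed (for example with `pip install datasets pandas matplotlib`).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TutorialConfig:
    # ASSUMPTION: repo id and config names are placeholders, not confirmed values.
    dataset_name: str = "nvidia/Open-SWE-Traces"
    configs: tuple = ("default",)      # one entry per agent/model combination
    split: str = "train"
    samples_per_config: int = 200      # records pulled from each stream
    # SFT filtering settings (illustrative defaults, not values from the article)
    max_tokens: int = 32_000           # upper bound on estimated trajectory tokens
    require_resolved: bool = True      # keep only trajectories labeled successful
    require_patch: bool = True         # drop records without a final patch
    languages: tuple = ()              # empty tuple means "keep every language"


def configure_display():
    """Make wide tables and plots readable in a notebook such as Colab."""
    import pandas as pd
    import matplotlib.pyplot as plt

    pd.set_option("display.max_columns", 50)
    pd.set_option("display.max_colwidth", 120)
    pd.set_option("display.width", 200)
    plt.rcParams["figure.figsize"] = (10, 4)
    plt.rcParams["axes.grid"] = True


CFG = TutorialConfig()
```

Keeping every tunable value in one frozen object means later cells read from `CFG` rather than from scattered constants, which makes it easier to rerun the analysis under a different token limit.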
We define helper functions that make the dataset easier to process, even when fields appear in different formats. We normalize trajectories, extract message text, count roles, detect tool usage, parse code patches, and estimate token lengths. We build these utilities defensively so that our analysis remains stable across schema variations in large streamed datasets.
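The helpers described above might look like the sketch below. The message keys (`role`, `content`, `tool_calls`) follow the common chat-message convention and are assumed rather than taken from the dataset schema. The patch parser expects git-style unified diffs, and the token estimate is a rough four-characters-per-token heuristic, not a real tokenizer.

```python
import json
import re
from collections import Counter


def normalize_trajectory(raw):
    """Return a list of message dicts from a list, a JSON string, or None."""
    if isinstance(raw, str):
        try:
            raw = json.loads(raw)
        except json.JSONDecodeError:
            return [{"role": "unknown", "content": raw}]
    if isinstance(raw, dict):
        # Some schemas wrap the list, e.g. {"messages": [...]} (assumed key).
        raw = raw.get("messages", [raw])
    if not isinstance(raw, list):
        return []
    return [m for m in raw if isinstance(m, dict)]


def message_text(msg):
    """Flatten content that is either a string or a list of typed parts."""
    content = msg.get("content")
    if isinstance(content, str):
        return content
    if isinstance(content, list):
        parts = []
        for part in content:
            if isinstance(part, str):
                parts.append(part)
            elif isinstance(part, dict):
                parts.append(str(part.get("text", "")))
        return "\n".join(p for p in parts if p)
    return "" if content is None else str(content)


def count_roles(messages):
    """Count messages per role, e.g. Counter({'assistant': 12, 'tool': 11})."""
    return Counter(str(m.get("role", "unknown")) for m in messages)


def count_tool_calls(messages):
    """Count structured tool calls plus messages that carry the 'tool' role."""
    calls = 0
    for m in messages:
        tool_calls = m.get("tool_calls")
        if isinstance(tool_calls, list):
            calls += len(tool_calls)
    results = sum(1 for m in messages if m.get("role") == "tool")
    return {"tool_calls": calls, "tool_results": results}


def parse_patch(patch):
    """Summarize a git-style unified diff: files touched, lines added/removed."""
    files, added, removed, in_hunk = set(), 0, 0, False
    for line in (patch or "").splitlines():
        if line.startswith("diff --git "):
            in_hunk = False  # file headers follow, not content lines
            match = re.match(r"diff --git a/(.+?) b/(.+)$", line)
            if match:
                files.add(match.group(2))
        elif line.startswith("@@"):
            in_hunk = True
        elif in_hunk and line.startswith("+"):
            added += 1
        elif in_hunk and line.startswith("-"):
            removed += 1
    return {"files": len(files), "added": added, "removed": removed,
            "churn": added + removed}


def estimate_tokens(text):
    """Rough estimate: about four characters per token (heuristic only)."""
    return len(text) // 4
```

Counting `+` and `-` lines only after a hunk header avoids treating the `---` and `+++` file headers as changes. For exact budgets, the estimate can be replaced by the tokenizer of the model being fine-tuned.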
We stream a small sample of Open-SWE-Traces directly from Hugging Face instead of downloading the full dataset. We collect examples across agent and model combinations, then inspect the structure of a single record in detail. We walk through the first few trajectory messages and preview the final patch to understand what each training example contains.
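One way to sketch the streaming and inspection step is shown below. `load_dataset(..., streaming=True)` is the Hugging Face `datasets` call for lazy iteration. The repository id in the usage comment and the `trajectory` and `patch` field names are assumptions, so adjust them to whatever the first record actually contains.

```python
import json
from itertools import islice


def open_stream(dataset_name, config=None, split="train"):
    """Open a lazily read Hugging Face dataset without a full download."""
    from datasets import load_dataset  # lazy import; needs `pip install datasets`
    return load_dataset(dataset_name, name=config, split=split, streaming=True)


def take_sample(stream, n):
    """Materialize only the first n records of any iterable stream."""
    return list(islice(stream, n))


def preview_record(record, traj_key="trajectory", patch_key="patch",
                   max_msgs=3, width=160):
    """Build a short text preview: field names, first messages, patch head.

    traj_key and patch_key are assumed field names, not confirmed ones.
    """
    out = ["fields: " + ", ".join(sorted(record))]
    messages = record.get(traj_key) or []
    if isinstance(messages, str):
        try:
            messages = json.loads(messages)
        except json.JSONDecodeError:
            messages = []
    if not isinstance(messages, list):
        messages = []
    for i, msg in enumerate(messages[:max_msgs]):
        if not isinstance(msg, dict):
            continue
        text = str(msg.get("content", "")).replace("\n", " ")
        out.append(f"[{i}] {msg.get('role', '?')}: {text[:width]}")
    patch = str(record.get(patch_key) or "")
    out.append("patch head:")
    out.extend(patch.splitlines()[:5])
    return "\n".join(out)


# Usage (requires network access; the repo id is a placeholder assumption):
# stream = open_stream("nvidia/Open-SWE-Traces")
# sample = take_sample(stream, 50)
# print(preview_record(sample[0]))
```

Because `take_sample` accepts any iterable, the same code can be exercised offline with a plain generator of dictionaries before pointing it at the real stream.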
We transform the raw streamed records into a structured pandas DataFrame for analysis. We extract trajectory-level features such as message counts, role counts, patch churn, token estimates, metadata fields, and tool-use counters. We also create resolution flags to compare successful and unsuccessful software-engineering trajectories.
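The feature extraction can be sketched as a function that turns each record into a flat row, with the DataFrame built in a final step. The field names (`trajectory`, `patch`, `language`, `resolved`) and the accepted spellings of a success label are assumptions for illustration. The helpers are repeated in compact form so the block runs on its own.

```python
import json
from collections import Counter


def as_messages(raw):
    """Accept a list of message dicts or a JSON string; otherwise return []."""
    if isinstance(raw, str):
        try:
            raw = json.loads(raw)
        except json.JSONDecodeError:
            return []
    if not isinstance(raw, list):
        return []
    return [m for m in raw if isinstance(m, dict)]


def as_bool(value):
    """Coerce bools, 0/1, and strings such as 'true' or 'resolved' to a flag."""
    if isinstance(value, str):
        return value.strip().lower() in {"true", "1", "yes", "resolved"}
    return bool(value)


def record_to_row(rec):
    """Flatten one raw record into trajectory-level features."""
    msgs = as_messages(rec.get("trajectory"))
    roles = Counter(str(m.get("role", "unknown")) for m in msgs)
    chars = sum(len(str(m.get("content") or "")) for m in msgs)
    patch = str(rec.get("patch") or "")
    # Skip the '---' / '+++' file headers so they are not counted as changes.
    body = [ln for ln in patch.splitlines() if not ln.startswith(("+++", "---"))]
    added = sum(ln.startswith("+") for ln in body)
    removed = sum(ln.startswith("-") for ln in body)
    return {
        "language": rec.get("language", "unknown"),
        "n_messages": len(msgs),
        "n_assistant": roles.get("assistant", 0),
        "n_tool": roles.get("tool", 0),
        "n_tool_calls": sum(len(m["tool_calls"]) for m in msgs
                            if isinstance(m.get("tool_calls"), list)),
        "patch_churn": added + removed,
        "has_patch": bool(patch.strip()),
        "est_tokens": chars // 4,  # rough 4-characters-per-token heuristic
        "resolved": as_bool(rec.get("resolved")),
    }


def build_rows(records):
    return [record_to_row(r) for r in records]


def to_dataframe(rows):
    """Wrap the rows in a pandas DataFrame (pandas must be installed)."""
    import pandas as pd
    return pd.DataFrame(rows)
```

Building plain dictionaries first keeps the extraction logic testable without pandas, and `to_dataframe(build_rows(sample))` then yields one row per trajectory for grouping and plotting.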
We explore the dataset through language counts, resolution rates, scaffold/model comparisons, message-length distributions, patch-size distributions, and token-budget analysis. We visualize how trajectory length, token size, and tool usage vary across the sampled records. We use these plots and summaries to determine which examples are practical to fine-tune under different context-window limits.
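The token-budget part of this analysis, and the filter it feeds into, can be sketched in plain Python as below. The row keys (`est_tokens`, `resolved`, `has_patch`, `language`) and the budget values are assumptions for illustration. The filter applies the four criteria named in the introduction: success label, token limit, language, and patch availability.

```python
from collections import defaultdict


def budget_coverage(rows, budgets=(8_000, 16_000, 32_000, 64_000, 128_000)):
    """Share of trajectories whose estimated tokens fit each context window."""
    total = len(rows)
    return {
        b: (sum(r["est_tokens"] <= b for r in rows) / total if total else 0.0)
        for b in budgets
    }


def resolution_rate_by(rows, key):
    """Fraction of resolved trajectories per group, e.g. per language."""
    totals, wins = defaultdict(int), defaultdict(int)
    for r in rows:
        totals[r[key]] += 1
        wins[r[key]] += bool(r["resolved"])
    return {k: wins[k] / totals[k] for k in totals}


def select_for_sft(rows, max_tokens, languages=(),
                   require_resolved=True, require_patch=True):
    """Keep rows that pass the success, patch, length, and language filters."""
    keep = []
    for r in rows:
        if require_resolved and not r["resolved"]:
            continue
        if require_patch and not r["has_patch"]:
            continue
        if r["est_tokens"] > max_tokens:
            continue
        if languages and r["language"] not in languages:
            continue
        keep.append(r)
    return keep
```

Reading the coverage table next to the resolution rates shows the trade-off directly: a tighter token limit shrinks the usable pool, so the limit is best chosen after checking how many resolved trajectories survive it. With a DataFrame, `(df["est_tokens"] <= budget).mean()` gives the same coverage figure for a single budget.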
Source: MarkTechPost