Task-Seeded Synthetic Q&A Generation for Nemotron Pretraining
A new approach to large-scale LLM development uses task-seeded synthetic Q&A generation to improve model performance on difficult reasoning and knowledge tasks.

In large-scale LLM development, the question is no longer simply how much data a model sees. It is also whether the data contains enough structured learning signals. General web, code, math, multilingual, and domain data provide a broad base.
Task-seeded synthetic Q&A complements them by adding compact, task-structured examples with a clear information need, a constrained response space, and explanations that connect evidence to an answer. In a 100B-token continuation experiment on the Nemotron-3 Nano model, task-seeded synthetic data generation (SDG) improved MMLU-Pro by +1.8, average code by +1.9, commonsense understanding by +1.6, and GPQA by +11.1, while average math remained stable. This post describes a task-seeded synthetic Q&A generation workflow developed for Nemotron-family training, including Ultra and Super training runs.
The workflow uses training splits from broad public task families as capability seeds, generates new task-aligned examples, enriches them with reasoning and relevant knowledge, and filters them into curated synthetic datasets. Held-out evaluation and test data are excluded from generation. Downstream training recipes can then decide how to mix those datasets with the broader corpus.
The generation workflow is a compact loop: collect training-split seeds, normalize heterogeneous task records, generate new examples, enrich answers, and filter the resulting data. In the internal pipeline, we used roughly 70 public task datasets from lm-eval-harness, covering about 700 subtasks. For each task, we used only suitable training splits as SDG seeds; held-out test data was not used for generation, and tasks without suitable training data were excluded from seed collection.
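The seed-collection and normalization steps can be sketched roughly as follows. This is a minimal illustration, not the internal pipeline: the record fields (`split`, `question`, `query`, `choices`, `answer`), the `Seed` schema, and both helper functions are hypothetical names chosen for the example, and the toy records are made up.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Seed:
    """Common schema for one seed example (hypothetical)."""
    task: str
    question: str
    choices: Tuple[str, ...]
    answer: str


def normalize(task: str, record: dict) -> Optional[Seed]:
    """Map a heterogeneous task record onto the common Seed schema.

    Assumes the question lives under 'question' or 'query', and that
    'answer' is either the answer text or an integer index into 'choices'.
    Returns None when the record cannot be mapped.
    """
    question = record.get("question") or record.get("query")
    if not question:
        return None
    choices = tuple(record.get("choices", ()))
    answer = record.get("answer")
    if isinstance(answer, int):
        if not 0 <= answer < len(choices):
            return None
        answer = choices[answer]
    if not isinstance(answer, str) or not answer.strip():
        return None
    return Seed(task, question.strip(), choices, answer.strip())


def collect_seeds(task: str, records: Iterable[dict]) -> List[Seed]:
    """Keep only training-split records; held-out splits are skipped."""
    seeds = []
    for record in records:
        if record.get("split") != "train":
            continue
        seed = normalize(task, record)
        if seed is not None:
            seeds.append(seed)
    return seeds


# Toy records in two different source formats, plus one held-out item.
records = [
    {"split": "train", "question": "What causes tides?",
     "choices": ["Wind", "The Moon's gravity"], "answer": 1},
    {"split": "train", "query": "2 + 2 = ?", "answer": "4"},
    {"split": "test", "question": "Held-out item",
     "choices": ["a", "b"], "answer": 0},
]
seeds = collect_seeds("toy_task", records)
```

Filtering on the split before normalization keeps held-out records out of every later stage, so generation, enrichment, and filtering only ever see training-split seeds.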
Public task datasets are imperfect, but their training splits contain compact examples of how information is requested, constrained, and resolved. They capture useful correlations among task framing, domain knowledge, reasoning depth, candidate answers, and final response form. A model may see abundant raw text during pretraining and still benefit from synthetic data that makes those correlations explicit.
Task-seeded synthetic data addresses this gap by turning public task training splits into data generation templates. Using only suitable training splits from broad task families, we generate new examples that preserve useful properties of the source interaction. This lets us expose the model to reusable reasoning and knowledge-use patterns across task families, without tying the dataset to the surface format of one data source.
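One way to picture a seed acting as a generation template is a prompt builder that passes along the seed's task framing and asks a generator model for a new example in the same shape. The sketch below is an assumption-heavy illustration: the prompt wording, the `build_generation_prompt` function, and the seed fields are hypothetical, and the call to a generator model is left out.

```python
def build_generation_prompt(seed: dict, n_choices: int = 4) -> str:
    """Build a generation prompt from one training-split seed.

    The wording is illustrative only; 'seed' is assumed to carry
    'task', 'question', and 'answer' keys.
    """
    lines = [
        "You are writing a new training example.",
        f"Task family: {seed['task']}",
        "Seed example (use it for format and difficulty only; do not copy it):",
        f"Question: {seed['question']}",
        f"Answer: {seed['answer']}",
        "",
        "Write a NEW question on a different topic that needs the same kind of reasoning.",
        f"Provide {n_choices} candidate answers, the correct answer as full text,",
        "and a short explanation that connects the evidence to the answer.",
    ]
    return "\n".join(lines)


prompt = build_generation_prompt(
    {"task": "toy_science", "question": "What causes tides?",
     "answer": "The Moon's gravity"}
)
```

Asking for a different topic with the same kind of reasoning is what keeps the output tied to the capability rather than to the surface form of a single source.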
The seed pool covered both knowledge-intensive and reasoning-intensive tasks.
For Nemotron Ultra and Super pretraining, we used a license-compatible subset of the generated data suitable for commercial model training.
One practical formatting choice is to store semantic answer text rather than only option labels when possible. For example, writing the answer as "dirt trapped under the fingernails" gives the model a clearer training signal than only writing "B".
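The answer-text choice can be illustrated with a small helper that resolves an option label to the text it points at. This is a sketch, assuming multiple-choice records with lettered options; the `answer_text` function is a hypothetical name, and every choice in the toy list except the quoted answer is a made-up placeholder.

```python
import string
from typing import List


def answer_text(choices: List[str], label: str) -> str:
    """Return the answer text for an option label such as 'B'.

    If the label is not a single letter that indexes into 'choices',
    the original label is returned unchanged.
    """
    key = label.strip().upper()
    index = string.ascii_uppercase.find(key) if len(key) == 1 else -1
    if 0 <= index < len(choices):
        return choices[index]
    return label


# Placeholder options; only the second one comes from the example above.
choices = [
    "placeholder option A",
    "dirt trapped under the fingernails",
    "placeholder option C",
    "placeholder option D",
]
stored_answer = answer_text(choices, "B")
```

Storing `stored_answer` in place of the bare label means the training example carries its meaning with it, even if the options are later shuffled or dropped.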
Source: Hugging Face