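Task-Seeded Synthetic Q&A Generation for Nemotron Pretraining
Key Takeaways
In large language model development, the question is no longer only how much data a model sees, but also whether that data contains enough structured learning signal. General web, code, math, multilingual, and domain data provide a broad base. Task-seeded synthetic Q&A generation (SDG) complements them by adding compact, task-structured examples with a clear information need, a constrained response space, and explanations that connect evidence to the answer. In a 100B-token continuation experiment on the Nemotron-3 Nano model, task-seeded SDG improved MMLU-Pro by 1.8 points and average code by 1.9 points, and commonsense reasoning also improved.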
Published June 4, 2026. By Markus Kliegl and Dan Su (NVIDIA).

In large-scale LLM development, the question is no longer simply how much data a model sees. It is also whether the data contains enough structured learning signals. General web, code, math, multilingual, and domain data provide a broad base. Task-seeded synthetic Q&A complements them by adding compact, task-structured examples with a clear information need, a constrained response space, and explanations that connect evidence to an answer. In a 100B-token continuation experiment on the Nemotron-3 Nano model, task-seeded SDG improved MMLU-Pro by +1.8, average code by +1.9, commonsense understanding by +1.6, and GPQA by +11.1, while average math remained stable.

This post describes a task-seeded synthetic Q&A generation workflow developed for Nemotron-family training, including Ultra and Super training runs. The workflow uses training splits from broad public task families as capability seeds, generates new task-aligned examples, enriches them with reasoning and relevant knowledge, and filters them into curated synthetic datasets. Held-out evaluation and test data are excluded from generation. Downstream training recipes can then decide how to mix those datasets with the broader corpus.

Figure 1. The task-seeded SDG pipeline ends at curated generated data. Training mixture design and reported evaluations happen downstream.

TL;DR

- We use public task training splits as capability seeds, not as examples to memorize.
- We frame the data through transfer learning across task families: a model can learn reusable behaviors from broad seed tasks, then apply them to related applications and evaluations.
- The pipeline generates similar questions and answer-enriched examples with reasoning and task-relevant context.
- Multiple-choice tasks are easier to verify; open generation tasks need task-specific extraction and filtering.
- In a 100B-token continuation experiment on the Nemotron-3 Nano model, task-seeded SDG improved MMLU-Pro by +1.8, average code by +1.9, commonsense understanding by +1.6, and GPQA by +11.1 while keeping average math stable.

At A Glance

| Element | Value |
|---|---|
| Seed source | Public task training splits available through lm-eval-harness |
| Scale | About 70 tasks and about 700 subtasks |
| Data types | Similar questions, answer-enriched samples, reasoning/context traces |
| Verification | Schema checks, format checks, deduplication, majority voted answer checks |
| Training use | Late-stage Nemotron-family training, including Ultra/Super workstreams |
| Main result | Gains on MMLU-Pro, code, commonsense, and GPQA in a 100B-token Nemotron-3 Nano continuation |

Generation Pipeline

The generation workflow is a compact loop: collect training-split seeds, normalize heterogeneous task records, generate new examples, enrich answers, and filter the resulting data.

In the internal pipeline, we used roughly 70 public task datasets from lm-eval-harness, covering about 700 subtasks. For each task, we used only suitable training splits as SDG seeds; held-out test data was not used for generation, and tasks without suitable training data were excluded from seed collection.

The seed pool covered both knowledge-intensive and reasoning-intensive tasks:

| Seed group | Approximate coverage | Purpose |
|---|---|---|
| Knowledge-intensive tasks | 39 tasks, about 300 subtasks, about 3M seed samples | Improve factual, scientific, multilingual, and domain-specific QA behavior |
| Reasoning-intensive tasks | 34 tasks, about 400 subtasks, about 1.5M seed samples | Improve analytical reasoning, logical reasoning, math, code, and commonsense reasoning |

For Nemotron Ultra and Super pretraining, we used a license-compatible subset of the generated data suitable for commercial model training.

The end-to-end process has five stages:

1. Collect seed tasks. Enumerate available lm-eval-harness tasks, group them by output type, and keep only tasks with suitable training splits.
2. Normalize records. Since each lm-eval-harness task defines its own fields and formatting in YAML, we convert task records into a unified JSONL-style schema. For multiple-choice tasks, the normalized record contains the question and candidate options. For generative tasks, it contains the question or prompt, plus context when the task provides it.
3. Generate similar examples. Given a seed example, the generator creates a new question that preserves the underlying capability while changing the content.
4. Enrich answers. The generator solves the generated questions and adds the final answer plus relevant reasoning, knowledge, or context.
5. Filter and package. The pipeline applies schema checks, format checks, deduplication, and task-specific answer validation where possible. Multiple-choice data is easier to verify directly; generation-style data requires more cautious task-specific handling.

One practical formatting choice is to store semantic answer text rather than only option labels when possible. For example, writing the answer as "dirt trapped under the fingernails" gives the model a clearer training signal than only writing "B".

Why Task-Seeded Data?

Public task datasets are imperfect, but their training splits contain compact examples of how information is requested, constrained, and resolved. They capture useful correlations among task framing, domain knowledge, reasoning depth, candidate answers, and final response form. A model may see abundant raw text during pretraining and still benefit from synthetic data that makes those correlations explicit.

Task-seeded synthetic data addresses this gap by turning public task training splits into data generation templates.
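As a rough illustration of treating a training-split seed as a generation template, the sketch below normalizes a multiple-choice seed, renders it into a prompt that asks for a new question of the same kind, and applies a simple majority-vote check to sampled answers. It is not the pipeline's actual code. The field names (`question`, `options`, `answer_label`, `answer_text`), the prompt wording, the `min_agreement` threshold, the helper names, and the example question are all assumptions made for illustration, and the call to a generator model is left out entirely.

```python
from collections import Counter

LABELS = "ABCDEFGHIJ"  # assumed label set; real tasks may use other conventions


def normalize_multiple_choice(question, options, answer_index):
    """Convert a raw multiple-choice seed into a hypothetical unified record."""
    if len(options) > len(LABELS):
        raise ValueError("too many options for the assumed label set")
    labels = LABELS[: len(options)]
    return {
        "task_type": "multiple_choice",
        "question": question.strip(),
        "options": [
            {"label": label, "text": text.strip()}
            for label, text in zip(labels, options)
        ],
        # Keep the semantic answer text alongside the option label.
        "answer_label": labels[answer_index],
        "answer_text": options[answer_index].strip(),
    }


def build_similar_question_prompt(seed):
    """Render a normalized seed as a prompt asking for a new, similar question."""
    option_lines = "\n".join(
        f"{opt['label']}. {opt['text']}" for opt in seed["options"]
    )
    return (
        "Write a new multiple-choice question that tests the same skill as the "
        "example below but uses different content. "
        "Keep the same number of options.\n\n"
        f"Example question: {seed['question']}\n{option_lines}\n"
    )


def majority_vote(candidate_answers, min_agreement=0.5):
    """Return the most common answer if its share exceeds min_agreement, else None."""
    if not candidate_answers:
        return None
    answer, count = Counter(candidate_answers).most_common(1)[0]
    return answer if count / len(candidate_answers) > min_agreement else None


# Made-up seed, used only to exercise the helpers.
seed = normalize_multiple_choice(
    "Which of these is most likely to carry germs into food?",
    ["freshly washed hands", "dirt trapped under the fingernails"],
    answer_index=1,
)
prompt = build_similar_question_prompt(seed)
# Pretend three independent solver samples answered a generated question.
voted = majority_vote(["B", "B", "A"])
```

Here a generated example would be kept only when `majority_vote` returns an answer rather than `None`. A real filter would likely combine this with the schema, format, and deduplication checks listed in stage 5.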
Using only suitable training splits from broad task families, we generate new examples that preserve useful properties of the source interaction:

- task framing, such as whether the example asks for selection, generation, classification, or explanation;
- answer structure, such as multiple-choice options, short answers, free-form responses, or format-constrained outputs;
- domain and context, such as science, commonsense, factual knowledge, math, code, multilingual QA, or reading comprehension;
- difficulty and reasoning depth, such as whether the example requires a direct fact, a comparison among alternatives, or several reasoning steps;
- explanatory signal, such as task-relevant knowledge, reasoning, or context that helps connect the question to the answer.

This lets us expose the model to reusable reasoning and knowledge-use patterns across task families, without tying the dataset to the surface format of one data source.

Why Use Broader Seed Tasks?

A useful way to interpret this pipeline is through transfer learning across task families. Many improvements do not come from learning a single task's surface format. They come from strengthening reusable behaviors that appear across many tasks: identifying the information need, applying relevant domain knowledge, separating plausible alternatives, following response constraints, doing multi-step reasoning, and grounding a final answer in the right context.

Because of this, we do not generate from a narrow set of task formats. Instead, we collect a broader set of training-split seed samples from lm-eval-harness and use them to cover many neighboring capability regions. A science QA seed can help with commonsense physical reasoning. A logical reasoning seed can help with careful alternative comparison. A math or code seed can help with multi-step planning even when the final application is not exactly the same task.
The goal is positive transfer learning across task families, while reducing the risk that the model simply learns the quirks of a single data source.

This motivation is also consistent with earlier evidence in Nemotron Nano pretraining. We found that using AGIEval training data improved MMLU-Pro, suggesting that structured Q&A data from one task family can improve behavior outside t