Dev Notes

Welcome to NeMo Data Designer Dev Notes! Here you'll find in-depth guides, tutorials, and insights about synthetic data generation.

May 28, 2026
14 min read

Mitigating Prompt Sensitivity: Manufacturing Robustness Through Diverse Preambles

Models behave differently based on how a question is phrased: a "cynical senior dev" and a "curious student" get different answers to the same problem. Using NeMo Data Designer, we built a pipeline that generates hundreds of diverse prompt preambles with controlled variation across tone, strictness, verbosity, and answer format, then validates each one for compliance. These preambles feed into a YAML-driven training mixture pipeline that prepends diverse instructions to existing SFT data at scale. This approach is now used in Nemotron training mixtures to address the prompt-format brittleness observed in internal testing.
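To make the idea of controlled variation concrete, here is a minimal sketch that crosses the four axes named above and prepends one resulting preamble to an SFT record. It is an illustration under assumptions, not the pipeline from the post: the axis values, the template in `build_preamble`, and the `prompt`/`response` record fields are all hypothetical, and the real pipeline generates and validates preambles with models rather than a fixed template.

```python
import itertools
import random

# Hypothetical axis values: the post names the axes, not their contents.
AXES = {
    "tone": ["cynical senior dev", "curious student"],
    "strictness": ["strict", "lenient"],
    "verbosity": ["terse", "detailed"],
    "answer_format": ["plain text", "JSON"],
}


def build_preamble(tone, strictness, verbosity, answer_format):
    """Render one preamble from a fixed template (an assumption for illustration)."""
    return (
        f"You are answering as a {tone}. Be {strictness} about instructions, "
        f"keep the answer {verbosity}, and respond in {answer_format}."
    )


def all_preambles():
    """Enumerate every combination of axis values as a preamble string."""
    keys = list(AXES)
    return [
        build_preamble(**dict(zip(keys, combo)))
        for combo in itertools.product(*AXES.values())
    ]


def prepend_preamble(record, preambles, rng):
    """Return a copy of the record with a randomly chosen preamble prepended."""
    preamble = rng.choice(preambles)
    return {**record, "prompt": f"{preamble}\n\n{record['prompt']}"}


preambles = all_preambles()
rng = random.Random(0)  # seeded so the mixture is reproducible
sample = prepend_preamble(
    {"prompt": "Reverse a linked list.", "response": "..."}, preambles, rng
)
```

Enumerating the full cross product keeps coverage of each axis explicit, which is one simple way to get variation that is controlled rather than incidental.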

May 19, 2026
8 min read

Retriever SDG Plugin: Turn Your Documents Into Retriever Training Data

The data-designer-retrieval-sdg plugin turns source documents into grounded retriever training and BEIR evaluation data: bundle and chunk a corpus, generate multi-hop QA pairs, deduplicate and judge them, then export AutoModel-compatible artifacts.

May 5, 2026
7 min read

Have It Your Way: Customizing Data Designer with Plugins

A plugin framework for the custom pieces every real project ends up needing

Data Designer is built around a simple idea: describe the dataset you want, and let the framework handle execution. A config points to seed data, defines generated columns, picks models, and shapes the final records — no orchestration code required. Data Designer plugins keep that promise when a project needs something custom.

As of Data Designer v0.6.0, plugins are out of experimental mode and stable. They are the supported path for turning reusable project-specific logic into normal Data Designer components.

April 28, 2026
20 min read

Training a VLM to Understand Long Documents: An Iterative SDG Story

How do you teach a VLM to read charts, cross-reference tables, and reason over 100+ page PDFs? We generated ~11.4M synthetic visual question-answer pairs (~45B tokens, including questions, answers, thinking traces, and vision tokens) with NeMo Data Designer to improve long-document visual reasoning in a multimodal model. We used MMLongBench-Doc as our main evaluation target throughout the project, tracking both overall progress and the specific document-reasoning capabilities the model was still missing. In this post, we cover what worked and what didn't.

April 16, 2026
6 min read

Push Datasets to Hugging Face Hub

You just generated 10k multilingual greetings (or some other cool dataset). Now what — email a parquet file? Nah. Call .push_to_hub() and you've got a live dataset page on Hugging Face. Done and dusted 🚢.

April 14, 2026
27 min read

Engineering an Enterprise-Grade Text-to-SQL Dataset with NeMo Data Designer

While LLMs have mastered generic coding, Text-to-SQL remains one of the most challenging frontiers in enterprise AI. This is largely because (i) SQL tasks rely on both code and data and (ii) real-world data and databases are quite messy. Focusing on careful data design that accounts for real-world diversity and complexity, we built a NeMo Data Designer pipeline that combines conditional sampling, three-stage LLM generation, code validators, and multi-dimensional judge scoring to generate reasoning-heavy text-to-SQL samples across PostgreSQL, MySQL, and SQLite, and automatically filters them down to the 96.5k highest-quality records. Each sample pairs a natural-language prompt and a fully synthetic database schema context with a target SQL query. To improve robustness and mimic the messiness of production databases, the pipeline injects distractor tables and columns into the schema context, forcing the model to learn to ignore irrelevant schema elements. The final dataset is validated and filtered through per-dialect syntax validators and five LLM-as-a-critic judges.
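The distractor idea can be sketched in a few lines. The example below is a simplified illustration, not the pipeline's implementation: the function name `inject_distractors`, the dict-of-column-lists schema representation, and the table names are all hypothetical, and it only handles distractor tables, not distractor columns.

```python
import random


def inject_distractors(schema, distractors, k, rng):
    """Return a new schema with up to k distractor tables mixed in.

    schema and distractors map table names to lists of column names.
    Table order is shuffled so position does not reveal relevance.
    """
    candidates = [name for name in distractors if name not in schema]
    chosen = rng.sample(candidates, min(k, len(candidates)))
    names = list(schema) + chosen
    rng.shuffle(names)
    return {
        name: schema[name] if name in schema else distractors[name]
        for name in names
    }


# Hypothetical schema that the target query actually needs.
relevant = {
    "orders": ["id", "customer_id", "total"],
    "customers": ["id", "name"],
}
# Hypothetical pool of irrelevant tables.
pool = {
    "audit_log": ["id", "event", "created_at"],
    "feature_flags": ["key", "enabled"],
    "shipping_zones": ["id", "region"],
}

noisy_schema = inject_distractors(relevant, pool, k=2, rng=random.Random(7))
```

The target SQL query stays unchanged; only the schema context grows, so the model has to select the relevant tables instead of using everything it is shown.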

April 2, 2026
13 min read

Async All the Way Down

Data Designer's execution engine now schedules work at the cell level rather than the column level. Instead of running each column to completion before starting the next, the async engine dispatches a cell as soon as its specific upstream dependencies complete. Multi-model pipelines keep every endpoint saturated, and single-model pipelines benefit from AIMD-based adaptive concurrency. The result is faster pipelines with no changes to your config.
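The difference between column-level and cell-level scheduling can be shown with a toy `asyncio` example. This is a sketch under assumptions rather than the engine's code: the two-column layout (`b` depends on `a` within the same row), the delays, and the helper names `generate_cell` and `run_row` are invented for illustration.

```python
import asyncio


async def generate_cell(column, row, delay, log):
    """Stand-in for a model call; records the order in which cells finish."""
    await asyncio.sleep(delay)
    log.append((column, row))
    return f"{column}[{row}]"


async def run_row(row, delay_a, log):
    """Cell b of a row waits only for cell a of the same row."""
    a = await generate_cell("a", row, delay_a, log)
    b = await generate_cell("b", row, 0.01, log)
    return {"a": a, "b": b}


async def main():
    log = []
    delays_a = [0.01, 0.2]  # row 1's "a" cell is deliberately slow
    rows = await asyncio.gather(
        *(run_row(i, d, log) for i, d in enumerate(delays_a))
    )
    return rows, log


rows, log = asyncio.run(main())
```

In this toy run, `b[0]` completes while `a[1]` is still in flight. A column-at-a-time scheduler would have held `b[0]` back until every `a` cell had finished.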

March 25, 2026
12 min read

Owning the Model Stack: Adaptive Concurrency FTW!

Picture this: you're generating a million-record dataset. Thirty-two concurrent requests per model, three models in the pipeline, two providers. Everything hums along for the first ten minutes, then one provider starts returning 429s, your retry logic kicks in, and suddenly you're in a feedback loop where retries cause more 429s. The run stalls. You restart with lower concurrency, waste throughput for hours, and wonder if there's a better way.

There is. This post is about the native model client layer we built with adaptive throttling (a system that discovers provider capacity at runtime), replacing our dependency on LiteLLM along the way.
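One well-known way to discover capacity at runtime is additive-increase/multiplicative-decrease (AIMD): raise the concurrency limit slowly while requests succeed, and cut it sharply on a 429. The sketch below illustrates that general technique only. The class name `AIMDLimit`, its parameters, and its defaults are assumptions, not the client layer's actual API.

```python
class AIMDLimit:
    """Concurrency limit that grows additively and shrinks multiplicatively."""

    def __init__(self, initial=8, minimum=1, maximum=64, increase=1, decrease=0.5):
        self.limit = initial
        self.minimum = minimum
        self.maximum = maximum
        self.increase = increase
        self.decrease = decrease

    def on_success(self):
        """Probe for more capacity one step at a time."""
        self.limit = min(self.maximum, self.limit + self.increase)

    def on_rate_limited(self):
        """Back off quickly when the provider signals overload (HTTP 429)."""
        self.limit = max(self.minimum, int(self.limit * self.decrease))


# Replay a hypothetical sequence of response codes against the limiter.
limiter = AIMDLimit(initial=32)
trace = [limiter.limit]
for status in [200, 200, 429, 200, 429, 429]:
    if status == 429:
        limiter.on_rate_limited()
    else:
        limiter.on_success()
    trace.append(limiter.limit)
# trace is now [32, 33, 34, 17, 18, 9, 4]
```

Halving on each 429 breaks the retry feedback loop quickly, while the slow additive climb lets the limit settle near whatever the provider can sustain.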

March 24, 2026
28 min read

Data Designer Got Skills

Lessons from building an agent-first CLI and skill for Data Designer

We just published the data-designer skill, which leverages agent-focused CLI commands in Data Designer to efficiently generate datasets. Just describe the dataset you want and your agent will craft the Data Designer configuration for you — schema design, validation, preview, generation — interactively or on full autopilot (just tell the agent to "be opinionated" or "surprise me").

March 12, 2026
19 min read

Search Agent SFT Data: Teaching LLMs to Browse the Web

Training search agents requires trajectory data: the full multi-turn interaction showing how a model searches, reads, reasons, and answers. We built a four-stage pipeline that generates synthetic search trajectories from Wikidata knowledge graph paths, converts them into BrowseComp-style riddles using NeMo Data Designer, generates multi-step search rollouts with live web search via Tavily, and post-processes the results into SFT-ready training data.