
How NVIDIA Builds Open Data for AI

Beginner
Hugging Face Blog · 3/10/2026

Key Summary

  • NVIDIA is building and sharing big, well-organized open datasets so anyone can train, test, and improve AI models faster and more safely.
  • They tackle the biggest AI bottleneck—getting enough high-quality, trustworthy data—by releasing datasets with clear recipes, licenses, and evaluation tools.
  • Real-world datasets span robotics, self-driving, biology, language, and safety, including a huge Physical AI Collection and globally diverse autonomous vehicle data.
  • Synthetic datasets like Nemotron Personas let teams practice with culturally authentic, privacy-safe profiles at national scale, dramatically boosting accuracy on real tasks.
  • Specialized sets like La Proteina and SPEED-Bench help scientists design proteins and engineers measure decoding speed with fair, realistic tests.
  • Nemotron training data evolved from general web text to higher-signal math, code, and STEM, teaching models to reason better and follow complex instructions.
  • The ClimbMix pretraining recipe uses clustering and iteration to pick better text, cutting compute time by about a third compared to older data mixes.
  • Open data is paired with open methods, licenses, and community feedback in an “open kitchen” so everyone can see the ingredients and how the dish is made.
  • Results include large accuracy jumps (for example, 50.7% to 90.4% on NL→CQL translation), safer behavior, and faster development cycles.
  • This approach, called extreme co-design, brings data, research, engineering, and policy together so bottlenecks get fixed end-to-end, not piecemeal.

Why This Research Matters

Open, high-quality datasets lower the cost and time it takes to build helpful AI tools we use every day, from chat assistants to translation and coding helpers. Better data for robotics and autonomous vehicles can make warehouses safer, deliveries faster, and streets more reliable. Multilingual and culturally grounded datasets reduce bias, so AI respects local languages and customs. Science-focused datasets like La Proteina can speed up discoveries in medicine and biology, potentially improving health outcomes. Realistic benchmarks prevent overhyped claims, helping teams choose the right settings for speed and quality. By sharing both ingredients and recipes, more schools, startups, and researchers can contribute, accelerating progress for everyone.


Detailed Explanation


01 Background & Problem Definition

🍞 Hook: Imagine building a giant LEGO castle, but most of the pieces are locked in other people’s rooms. You could design a brilliant castle, but without enough right pieces, you’d get stuck.

🥬 The Concept: Understanding AI Systems

  • What it is: An AI system is a computer program that learns patterns from data to make predictions, write text, control robots, and more.
  • How it works: 1) Feed it lots of examples, 2) let it practice predicting what comes next, 3) check how well it does, 4) adjust it with better data and feedback.
  • Why it matters: Without the right data, the AI learns the wrong things or not enough, like studying from a blurry, incomplete textbook.

🍞 Anchor: When you ask a chatbot for homework help, it answers well only if it has learned from clear, reliable examples of explanations and solutions.

🍞 Hook: You know how a public library lets everyone borrow books? Imagine if all the best school notes and science kits were also shared.

🥬 The Concept: Open Datasets

  • What it is: Open datasets are collections of information anyone can access, use, and build on (within a clear license).
  • How it works: 1) Collect data (text, images, robot sensor logs), 2) clean it (remove junk and duplicates), 3) label or structure it, 4) publish it with instructions and permissions.
  • Why it matters: Without open datasets, each team has to start from zero, which is slow, expensive, and sometimes impossible.

🍞 Anchor: NVIDIA posts datasets on platforms like Hugging Face, along with recipes on GitHub, so developers can start training models the same day instead of waiting months.

🍞 Hook: Have you tried baking cookies but had only half the ingredients? Even if you follow the recipe perfectly, the cookies flop.

🥬 The Concept: AI-Data Bottlenecks

  • What it is: AI-data bottlenecks are slowdowns caused by not having enough, the right kind, or the right quality of data.
  • How it works: 1) Data is hard to collect and label, 2) experts are busy, 3) rules and privacy matter, 4) teams keep data in silos, 5) models change, so you need to refresh data.
  • Why it matters: Without solving the bottleneck, training takes too long, costs too much, and models underperform or behave unsafely.

🍞 Anchor: Companies often spend months and millions preparing data before the first training run; NVIDIA’s open data shortens that wait dramatically.

🍞 Hook: Think of a championship relay race—runners only win if they pass the baton smoothly and plan the whole race together.

🥬 The Concept: Extreme Co-Design

  • What it is: Extreme co-design means data experts, AI researchers, engineers, and policy teams build the data, tools, and evaluation together from day one.
  • How it works: 1) Agree on goals (reasoning, safety, speed), 2) design data and benchmarks to match, 3) wire up compute and storage for scale, 4) publish and invite feedback, 5) iterate fast.
  • Why it matters: Without co-design, you fix one part (like the model) but another (like the data or benchmark) breaks, so progress stalls.

🍞 Anchor: NVIDIA and partners co-created datasets like SPEED-Bench and La Proteina, then refined them with community feedback to catch edge cases.

🍞 Hook: Picture a training gym with different stations—treadmills for endurance, weights for strength, ladders for agility. Balanced training makes a better athlete.

🥬 The Concept: Nemotron Training Datasets

  • What it is: Nemotron datasets are carefully built collections for pre-training (general knowledge) and post-training (behavior, reasoning, tools) of language models.
  • How it works: 1) Pre-training uses higher-signal text like math and code to teach reasoning, 2) Post-training adds instruction-following, safety, and step-by-step thinking, 3) Evaluation checks if the model improved.
  • Why it matters: Without the right mix, models may talk a lot but reason poorly, or be smart but unsafe.

🍞 Anchor: Moving from general web text to math+code boosted Nemotron’s reasoning; adding agent-style data helped it plan multi-step tasks safely.

  • The world before: Many teams treated data as an afterthought—grab whatever web text you can, hope for the best, and tweak the model. That worked for early gains but hit limits: messy, duplicated, biased, or narrow data led to brittle models.
  • The problem: Collecting, labeling, and validating high-quality, domain-specific data is slow and costly; datasets were siloed, and evaluation varied wildly, so comparing models was hard.
  • Failed attempts: Single-source mega-crawls (lots of noise), random-token benchmarks (unrealistic), and closed datasets (no reproducibility) each fell short.
  • The gap: A shared, open data layer with clear recipes, trustworthy licenses, realistic benchmarks, and tight feedback loops across the community.
  • Real stakes: Better robotics in warehouses and hospitals, safer self-driving, fairer chatbots across languages, faster drug discovery, and cheaper, greener training.

That’s why NVIDIA’s “open kitchen” model—share ingredients, show the cooking—matters now.

02 Core Idea

🍞 Hook: You know how a great recipe card lists the ingredients, steps, and tips so anyone can cook a tasty meal? Imagine AI had recipe cards too.

🥬 The Concept: The Aha! Moment

  • What it is: The key insight is that open, co-designed datasets plus open recipes and benchmarks create a shared “data layer” that speeds up trustworthy AI for everyone.
  • How it works: 1) Build high-signal, diverse datasets across domains, 2) publish with clear licenses and training/eval recipes, 3) co-design with community feedback, 4) iterate rapidly so models, data, and tests improve together.
  • Why it matters: Without a shared data layer, each team repeats the same expensive steps, progress is slow, and results are hard to compare.

🍞 Anchor: NVIDIA’s releases—like the Physical AI Collection, Nemotron Personas, La Proteina, SPEED-Bench, and ClimbMix—act like master recipe cards others can follow and improve.

Multiple analogies:

  1. Kitchen analogy: Ingredients (datasets), recipe cards (training scripts), taste tests (benchmarks). Share them and everyone cooks better, faster.
  2. Sports analogy: A league with standard rules (benchmarks) and shared training drills (datasets) makes athletes comparable and improves the whole sport.
  3. Library analogy: Curated shelves (high-signal data), study guides (recipes), and practice exams (benchmarks) help students learn faster and prove mastery.

Before vs After:

  • Before: Data was noisy, secretive, and inconsistent; benchmarks didn’t match real tasks; training was expensive and slow to repeat.
  • After: Datasets are open, structured, and targeted; benchmarks reflect real use; training can be reproduced and optimized, cutting costs and boosting quality.

Why it works (intuition, no equations):

  • Signal density: Concentrating math, code, and STEM raises the chance each training step teaches something useful, like studying key chapters instead of random pages.
  • Structure and diversity: Carefully crafted personas and retrieval triplets cover languages, regions, and question types, reducing bias and improving generalization.
  • Realistic evaluation: SPEED-Bench uses semantically meaningful text across lengths, so decoding speed and quality trade-offs are measured fairly.
  • Feedback loops: Publishing methods and data invites the community to find edge cases and propose fixes, accelerating improvement.

Building blocks (each with Sandwich explanations where introduced):

🍞 Hook: Ever sort your school notes so the most helpful ones are on top?

🥬 The Concept: Signal Density

  • What it is: Signal density means packing training data with examples that teach valuable skills (like math proofs or code), not fluff.
  • How it works: 1) Collect candidate text, 2) deduplicate globally, 3) rewrite or filter to keep math/code/LaTeX intact, 4) cluster to keep the highest-value slices.
  • Why it matters: Without it, the model wastes time on filler and learns slowly.

🍞 Anchor: Nemotron-CC-Math and Nemotron-CC-Code preserve math and code formatting, improving reasoning compared to generic web crawls.
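The curation steps above can be sketched as a toy filter. This is our illustration only, not the actual Nemotron-CC pipeline; the signal markers and length threshold are made up:

```python
import hashlib

def dedupe_and_filter(docs, min_len=20):
    """Toy sketch of signal-first curation: drop exact duplicates via
    content hashing, and keep short docs only if they carry math or
    code markers worth preserving."""
    seen, kept = set(), []
    for doc in docs:
        # Normalize lightly so trivially different copies hash the same.
        h = hashlib.sha256(doc.strip().lower().encode()).hexdigest()
        if h in seen:
            continue  # duplicate of something already kept
        seen.add(h)
        has_signal = "\\frac" in doc or "def " in doc or "\\begin{" in doc
        if len(doc) >= min_len or has_signal:
            kept.append(doc)
    return kept

corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "the quick brown fox jumps over the lazy dog.",  # duplicate after normalizing
    "\\frac{dy}{dx} = 2x",                           # short but high-signal math
    "lol ok",                                        # short, low-signal chat
]
print(len(dedupe_and_filter(corpus)))  # 2
```

Real pipelines do this at web scale with fuzzy (not just exact) deduplication, but the principle is the same: every kept document should teach the model something.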

🍞 Hook: Imagine practicing conversations with a friendly cast of classmates from all over the world.

🥬 The Concept: Synthetic Personas

  • What it is: Persona datasets create culturally authentic, privacy-safe profiles at population scale to train language models.
  • How it works: 1) Ground personas in real demographic distributions, 2) generate realistic language patterns, 3) cover many regions and languages, 4) use them to train and evaluate fairness and accuracy.
  • Why it matters: Without broad, authentic voices, models can be biased or weak outside a few regions.

🍞 Anchor: Using 2M personas, CrowdStrike boosted NL→CQL translation accuracy from 50.7% to 90.4%, nearly doubling correctness.

🍞 Hook: Think of a coach picking the best drills and dropping the weak ones after each practice.

🥬 The Concept: ClimbMix

  • What it is: ClimbMix is a 400B-token pretraining mixture built by clustering embeddings and iteratively refining which text to keep.
  • How it works: 1) Embed documents, 2) cluster similar ones, 3) sample higher-quality clusters, 4) retrain and re-score, 5) repeat to climb toward better data.
  • Why it matters: Without adaptive selection, you spend compute on low-value text.

🍞 Anchor: ClimbMix cut H100 training time by about a third compared to a previous setup, making strong models cheaper and greener to train.
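The climb loop can be sketched in a few lines of Python. This is a toy illustration only; the function name, the quality proxy, and the keep-half rule are our inventions, not NVIDIA's actual algorithm:

```python
def climb_mix(clusters, score_fn, rounds=3, keep_frac=0.5):
    """Toy sketch of an iterative data-mixing loop.

    clusters: dict mapping cluster_id -> list of documents.
    score_fn: stand-in for "train a probe model and measure downstream
    quality"; here, any callable from a doc list to a float.
    Each round keeps the top-scoring fraction of clusters, climbing
    toward a higher-signal mixture.
    """
    current = dict(clusters)
    for _ in range(rounds):
        ranked = sorted(current.items(),
                        key=lambda kv: score_fn(kv[1]), reverse=True)
        keep = max(1, int(len(ranked) * keep_frac))
        current = dict(ranked[:keep])
    return current

# Hypothetical quality proxy: fraction of docs containing code or math.
def quality(docs):
    return sum(("def " in d or "\\frac" in d) for d in docs) / len(docs)

clusters = {
    "code": ["def f(x): return x", "def g(): pass"],
    "math": ["\\frac{a}{b}", "plain prose"],
    "chat": ["hi", "lol"],
    "news": ["markets rose", "def abs2(x): ..."],
}
print(sorted(climb_mix(clusters, quality, rounds=1)))  # ['code', 'math']
```

The real system scores clusters by training proxy models and measuring downstream results, which is what makes the selection "learn" from model outcomes rather than from a fixed heuristic.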

🍞 Hook: When you test how fast you read, you use real stories, not random letters.

🥬 The Concept: SPEED-Bench (Speculative Decoding)

  • What it is: A benchmark to measure draft-and-verify decoding with realistic prompts across topics and lengths.
  • How it works: 1) A Qualitative Split spans 11 categories, 2) a Throughput Split buckets inputs from 1K–32K tokens, 3) you build Pareto curves to see speed–quality trade-offs.
  • Why it matters: Without realistic tests, speed claims don’t hold up in real applications.

🍞 Anchor: Teams use SPEED-Bench to compare Nemotron draft performance on long prompts, apples-to-apples.
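The Pareto-curve step can be sketched as follows. This is a minimal illustration; the run names and numbers are hypothetical, not SPEED-Bench data:

```python
def pareto_frontier(points):
    """Keep configurations not dominated on (throughput, quality).

    points: list of (tokens_per_sec, quality_score, config_name).
    A point is dominated if some other point is at least as good on
    both axes and strictly better on at least one.
    """
    frontier = []
    for i, (tp, q, name) in enumerate(points):
        dominated = any(
            tp2 >= tp and q2 >= q and (tp2 > tp or q2 > q)
            for j, (tp2, q2, _) in enumerate(points) if j != i
        )
        if not dominated:
            frontier.append((tp, q, name))
    return sorted(frontier)

# Hypothetical draft-model settings measured on a long-prompt split.
runs = [
    (120, 0.95, "draft-small,k=3"),
    (180, 0.90, "draft-small,k=5"),
    (150, 0.88, "draft-tiny,k=5"),   # dominated: the k=5 run above is faster AND better
    (210, 0.84, "draft-tiny,k=8"),
]
for tp, q, name in pareto_frontier(runs):
    print(f"{name}: {tp} tok/s, quality {q}")
```

Only the frontier points are worth deploying; every other setting gives up speed or quality for nothing, which is exactly what the benchmark's Pareto curves make visible.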

🍞 Hook: Picture a study guide with question, passage, and answer that teaches you to find facts quickly.

🥬 The Concept: Retrieval Triplets for RAG

  • What it is: Triplets (query, passage, answer) train and test embedding and retrieval-augmented generation (RAG) systems.
  • How it works: 1) Generate diverse questions, 2) pair with the right passages, 3) provide precise answers, 4) fine-tune embeddings and evaluate retrieval quality.
  • Why it matters: Without good triplets, the model can’t find the right info when context is huge.

🍞 Anchor: Fine-tuning on Retrieval-Synthetic-NVDocs-v1 raised NDCG@10 by 11%, meaning better ranking of the right passages.
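NDCG@10, the metric cited in the anchor, rewards rankings that place the right passage near the top. A minimal implementation of the standard formula:

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k ranked results."""
    return sum(rel / math.log2(rank + 2)   # rank 0 -> divide by log2(2) = 1
               for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """NDCG@k: the ranking's DCG divided by the DCG of an ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Binary relevance of the top-10 passages returned for one query:
# 1 = the gold passage, 0 = a distractor.
print(round(ndcg_at_k([1, 0, 0, 0, 0, 0, 0, 0, 0, 0]), 3))  # 1.0   (gold at rank 1)
print(round(ndcg_at_k([0, 1, 0, 0, 0, 0, 0, 0, 0, 0]), 3))  # 0.631 (gold at rank 2)
```

Because the discount is logarithmic, moving a correct passage from rank 5 to rank 1 helps the score far more than moving it from rank 10 to rank 6, which matches how users actually scan results.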

Together, these building blocks form a transparent, reusable data layer that others can adopt, inspect, and extend.

03 Methodology

At a high level: Real-world needs → (Co-design goals) → (Collect or generate data) → (Clean, deduplicate, structure) → (Specialize by domain) → (Publish with licenses + recipes) → (Evaluate on open benchmarks) → (Community feedback) → (Iterate and scale).

Step 1: Define goals with extreme co-design

  • What happens: Data strategists, researchers, infra, and policy agree on target skills (e.g., reasoning, tool use, multilingual safety) and constraints (privacy, licenses, cost).
  • Why this step exists: If the goal isn’t shared, you might build a fast dataset that teaches the wrong skills or violates policy.
  • Example: For Nemotron’s reasoning, teams prioritized math, code, and STEM; for agent use, they targeted multi-step traces and tool telemetry.

Step 2: Collect and/or synthesize data

  • What happens: Gather real data (robotics sensors, AV cameras, documentation) and make synthetic data (personas, science problems, retrieval triplets) to cover gaps safely.
  • Why: Real data anchors models to the world; synthetic data scales safely where PII or scarcity is a risk.
  • Example: Nemotron Personas generate millions of culturally grounded profiles (e.g., 21M for India) without exposing personal records.

Step 3: Clean, deduplicate, and raise signal density

  • What happens: Remove duplicates globally, filter low-quality content, preserve formats (LaTeX, code), rewrite for clarity when appropriate, and cluster embeddings.
  • Why: Reduces noise and makes each training step count.
  • Example: Nemotron-CC-Math/Code preserve structure so equations and functions stay intact; ClimbMix clusters and re-samples to keep high-value text.

Step 4: Structure and label for tasks

  • What happens: Turn raw text into usable formats—triplets for RAG, benchmarks for decoding, dialogs for instruction-following, proofs for math.
  • Why: Models learn best when examples match the task shape.
  • Example: Retrieval-Synthetic-NVDocs-v1 builds 110,000 (query, passage, answer) triplets from 15,000 NVIDIA docs to train embeddings and RAG.

Step 5: Specialize by domain (multimodal too)

  • What happens: Create domain suites: robotics trajectories and grasps; AV multi-sensor logs across 25 countries; biology structures (La Proteina); software engineering dialogs (Nemotron-SWE).
  • Why: Domain context teaches skills general text can’t, like sensor fusion or molecular geometry.
  • Example: The Physical AI Collection includes 500K+ robot trajectories and 57M grasps across grippers and sensors for training vision-language-action models like GR00T.

Step 6: Publish with clear licenses and recipes

  • What happens: Release datasets on Hugging Face with permissive licenses where possible (e.g., CC-BY-NC-4.0 for ClimbMix), plus training/eval scripts on GitHub.
  • Why: Clear permissions and ready-made code lower adoption friction.
  • Example: Teams can fine-tune nvidia/llama-nemotron-embed-1b-v2 on NVDocs triplets in ~2 hours on 8×A100 GPUs.

Step 7: Evaluate with realistic, standardized benchmarks

  • What happens: Use SPEED-Bench for speculative decoding across 11 categories and lengths; use RAG metrics like NDCG@10; use task-specific accuracy for legal QA or NL→CQL.
  • Why: Apples-to-apples comparisons prevent cherry-picking and guide optimization.
  • Example: SPEED-Bench’s Throughput Split (1K–32K tokens) lets teams build Pareto curves for speed vs. quality on real texts, not random tokens.

Step 8: Iterate with community feedback (open kitchen)

  • What happens: Partners and open-source users stress-test, report edge cases, and extend datasets to new regions and tasks.
  • Why: More eyes find more issues; diverse uses reveal blind spots.
  • Example: Runway built GWM-Robotics using GR00T data; Lightwheel used it to refine policies. Their findings inform the next release.

Step 9: Evolve pre- and post-training stacks

  • What happens: Pre-training adds higher-signal mixes (math, code, STEM); post-training adds structured instruction, proofs, agent traces, and safety RL.
  • Why: Pre-training builds raw capability; post-training shapes behavior and reliability.
  • Example: Nemotron-Science, Nemotron-Math-Proofs, Nemotron-Agentic, Nemotron-SWE, plus safety sets like Nemotron-Agentic-Safety (11K tool-use traces) and Nemotron-RL (900K tasks) create a full “training gym.”

Concrete mini-walkthroughs

A) Building a retrieval dataset (NVDocs)

  • Input: 15,000 public NVIDIA docs
  • Steps:
    1. Generate diverse questions (factual, relational, procedural, temporal, causal, visual)
    2. Match each question to a passage and produce an exact answer
    3. Filter for clarity and coverage
    4. Fine-tune an embedding model on 110,000 triplets
  • Output: Better retrieval quality; observed +11% NDCG@10 in-domain
  • Why it’s clever: Targets realistic query types rather than only simple fact lookups.
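Steps 1–2 of this walkthrough can be sketched with templates standing in for the LLM-generated questions. This is a toy illustration; the names and templates are ours, and the real pipeline generates far more varied question types than a template can:

```python
from dataclasses import dataclass

@dataclass
class Triplet:
    query: str
    passage: str
    answer: str

def make_triplets(doc_sections, question_templates):
    """Toy triplet construction: templates stand in for the LLM that
    generates factual/relational/procedural/... questions, and the
    first sentence stands in for an extracted answer."""
    triplets = []
    for title, passage in doc_sections.items():
        for template in question_templates:
            triplets.append(Triplet(
                query=template.format(topic=title),
                passage=passage,
                answer=passage.split(".")[0] + ".",  # crude extractive answer
            ))
    return triplets

sections = {
    "CUDA streams": "A stream is an ordered queue of GPU work. "
                    "Operations in one stream run in issue order.",
}
templates = ["What is {topic}?", "How does ordering work in {topic}?"]
for t in make_triplets(sections, templates):
    print(t.query, "->", t.answer)
```

Each triplet then serves double duty: the (query, passage) pair is a positive example for fine-tuning the embedding model, and the answer lets you check end-to-end RAG correctness.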

B) Crafting a high-signal pretraining mix (ClimbMix)

  • Input: Large candidate web/text pools
  • Steps:
    1. Compute embeddings for documents
    2. Cluster by semantic similarity
    3. Sample clusters with high estimated value
    4. Train and measure downstream quality and efficiency
    5. Iterate, updating which clusters to keep
  • Output: 400B-token mix that improved training efficiency and performance
  • Why it’s clever: It closes the loop—data selection learns from model outcomes.

C) Evaluating speculative decoding (SPEED-Bench)

  • Input: Real semantic prompts across 11 categories and lengths (1K–32K)
  • Steps:
    1. Run draft-and-verify decoding across splits
    2. Measure throughput and accuracy
    3. Build Pareto curves to see best trade-offs
  • Output: A fair, practical metric for deployment choices
  • Why it’s clever: Replaces random-token tests with realistic content, revealing true performance.
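The draft-and-verify loop being benchmarked can be sketched on toy "models". This is a simplification: real speculative decoding accepts or rejects draft tokens by comparing probabilities, not exact matches:

```python
def speculative_decode(draft_next, verify_next, prompt, k=4, max_new=5):
    """Minimal draft-and-verify sketch.

    draft_next / verify_next: callables mapping a token list to the
    next token. The cheap draft proposes k tokens; the strong verifier
    accepts the prefix it agrees with, then emits one correction.
    """
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        ctx, proposal = out[:], []
        for _ in range(k):                 # cheap model drafts k tokens
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        ctx, accepted = out[:], []
        for t in proposal:                 # strong model verifies them
            v = verify_next(ctx)
            accepted.append(v)
            ctx.append(v)
            if v != t:
                break                      # reject the rest of the draft
        out.extend(accepted)
    return out[len(prompt):]

# Toy integer "tokens": the draft counts up but slips every third step;
# the verifier always counts up by one.
draft  = lambda ctx: ctx[-1] + (2 if len(ctx) % 3 == 0 else 1)
verify = lambda ctx: ctx[-1] + 1
print(speculative_decode(draft, verify, [0], k=4, max_new=5))  # [1, 2, 3, 4, 5, 6]
```

The more often the draft agrees with the verifier, the fewer serial verifier steps are needed per token; SPEED-Bench measures how that payoff actually varies with prompt length and content.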

Secret sauce

  • Extreme co-design: Goals, data, infra, eval designed together.
  • Signal-first curation: Math/code/STEM and structure preservation.
  • Synthetic at scale, grounded in reality: Personas and proofs are large-scale yet culturally and formally anchored.
  • Open kitchen loop: Publish data and methods; accept community patches; iterate fast.

04 Experiments & Results

The test: NVIDIA and partners measured whether open, co-designed datasets improve accuracy, safety, and efficiency across realistic tasks and domains.

The competition: Prior baselines included generic web crawls, random-token benchmarks, small or region-limited datasets, and closed resources that were hard to reproduce. Community alternatives like FineWeb-Edu provided useful comparisons for pretraining recipes.

Scoreboard with context:

  • Scaling open access: Over 2 petabytes of AI-ready data across 180+ datasets and 650+ open models. Context: This is like moving from a single school library to a national network—researchers don’t start from scratch.
  • Robotics and AV: Physical AI Collection includes 500K+ trajectories, 57M grasps, and 15TB multimodal logs; AV data spans ~1,700 hours across 25 countries and 2,500+ cities. Context: Broad coverage improves robustness across weather, roads, and sensor setups.
  • Adoption: 10M+ downloads of GR00T-related data; used by Runway (GWM-Robotics) and Lightwheel to refine policies. Context: Wide reuse means the data is practically valuable, not just a demo.
  • Personas in production: CrowdStrike used 2M personas to raise NL→CQL translation accuracy from 50.7% to 90.4%. Context: That’s like going from guessing on half the test to getting an A.
  • Legal QA in Japan: NTT Data and APTO boosted accuracy from 15.3% to 79.3% and cut attack success rates from 7% to 0%. Context: Major leap in both correctness and safety.
  • Biology: La Proteina achieved a 73% structural diversity boost versus baselines. Context: More diverse structures broaden what protein models can learn and design.
  • Retrieval: Fine-tuning nvidia/llama-nemotron-embed-1b-v2 on NVDocs triplets increased NDCG@10 by 11%. Context: Better top-10 ranking means users see correct passages sooner.
  • Pretraining efficiency: ClimbMix became the default in NanoChat Speedrun after being highlighted for the largest Time-to-GPT-2 improvement; it reduced H100 compute time by roughly 33% compared to FineWeb-Edu. Context: Same or better model quality in two-thirds the time is a big budget and climate win.
  • Multilingual excellence: Nemotron-Nano-9B-v2-Japanese reached the top of the Nejumi leaderboard. Context: Shows that targeted data can lift small models to state-of-the-art in-language performance.
  • Ecosystem impact: Post-training blends contributed to ServiceNow’s Apriel Nemotron 15B / Apriel 1.6 Thinker surpassing Gemini 2.5 Flash and Qwen3 at the 15B scale, and to Hugging Face’s popular SmolLM3. Context: Open data improved not just NVIDIA models but partner models too.

Surprising findings and lessons:

  • Synthetic can be practical: Properly grounded synthetic personas can massively lift real enterprise tasks while avoiding PII.
  • Realistic benchmarks matter: SPEED-Bench’s semantic splits changed which decoding settings looked best, revealing that random-token speedups were misleading for real prompts.
  • Diversity drives durability: Geographically and semantically diverse data made models more reliable across unfamiliar conditions (e.g., different road signs, languages, and writing styles).
  • Open methods accelerate trust: Publishing not just data but the recipes and licenses made external validation and iteration much faster.

05 Discussion & Limitations

Limitations:

  • License boundaries: Some datasets (e.g., ClimbMix under CC-BY-NC-4.0) restrict commercial use; teams must check licenses carefully.
  • Synthetic bias: Even grounded personas and synthetic QA can reflect hidden modeling choices; ongoing audits are needed.
  • Coverage gaps: Despite global reach, some languages, dialects, or edge driving scenarios may still be underrepresented.
  • Maintenance load: High-quality open data requires continuous updates, quality checks, and de-duplication to stay fresh.
  • Benchmark drift: As models and tasks evolve, benchmarks like SPEED-Bench must expand categories and lengths to stay predictive of real performance.

Required resources:

  • Compute for fine-tuning and evaluation (e.g., 8×A100 for a 2-hour embedding fine-tune); storage and bandwidth for multi-terabyte datasets; MLOps for data/version tracking.
  • Domain expertise for labeling or validating (e.g., legal QA, protein structures, robotics policies).

When not to use:

  • Highly proprietary or regulated domains with data you cannot legally share or combine.
  • Ultra-specialized tasks where open data lacks the necessary rare patterns or device-specific telemetry.
  • On-device micro-models with tiny memory if datasets aren’t adapted to the target footprint.

Open questions:

  • How to quantify and minimize synthetic bias at population scale while preserving cultural authenticity?
  • What are the best measures of reasoning progress beyond task accuracy (e.g., trace faithfulness, tool-use safety)?
  • How to co-design data with energy efficiency targets so training stays green as models grow?
  • Can we standardize safety telemetry formats across agents to make cross-project comparisons easier?
  • How to sustain long-term stewardship—funding, governance, and curation—of large open datasets?

06 Conclusion & Future Work

Three-sentence summary: NVIDIA’s open data program treats datasets, methods, and benchmarks as shared recipe cards, creating a transparent data layer that speeds up trustworthy AI. By co-designing high-signal, domain-specific data with realistic evaluations, they cut costs, raise accuracy, and improve safety across language, robotics, biology, and more. Community feedback closes the loop, making each new release better than the last.

Main achievement: Demonstrating that extreme co-design plus openly released, high-signal datasets (ClimbMix, Personas, La Proteina, SPEED-Bench, NVDocs triplets, and more) can simultaneously improve model quality, fairness, efficiency, and reproducibility at ecosystem scale.

Future directions: Broaden multilingual and geographic coverage; expand agentic safety and reinforcement learning datasets; deepen biology and multimodal science; standardize telemetry for tool-using agents; and continue publishing methods so others can replicate and extend them. Expect tighter integrations between dataset selection, energy-efficient training, and deployment-time retrieval.

Why remember this: In AI, the model is only as good as its data. By opening the kitchen—ingredients, recipes, and taste tests—NVIDIA shows how to turn data from a bottleneck into a flywheel, helping the whole community build smarter, safer systems faster.

Practical Applications

  • Fine-tune an embedding model on Retrieval-Synthetic-NVDocs-v1 to immediately improve your company’s technical-document search.
  • Use Nemotron Personas to stress-test multilingual chatbots for fairness, tone, and instruction-following across regions before deployment.
  • Adopt ClimbMix as your pretraining data recipe to cut compute costs while maintaining model quality.
  • Benchmark speculative decoding settings with SPEED-Bench to select the best speed–quality trade-off for long-context applications.
  • Bootstrap domain-specific QA (e.g., legal or security) by mixing open Nemotron post-training data with a small amount of proprietary examples.
  • Train or refine robotics policies using the Physical AI Collection’s trajectories and grasps, then validate in simulation before real-world trials.
  • Prototype a sovereign AI assistant by starting with the appropriate national-scale persona dataset and adding local knowledge bases.
  • Accelerate drug discovery research by pretraining structure-aware models on La Proteina to learn diverse protein representations.
  • Improve retrieval-augmented generation systems by generating in-domain triplets and following NVIDIA’s open recipe for evaluation.
  • Stand up repeatable evaluations across teams by standardizing on SPEED-Bench and shared RAG metrics like NDCG@10.
Tags: open datasets, extreme co-design, Nemotron, ClimbMix, SPEED-Bench, synthetic personas, retrieval-augmented generation, robotics datasets, autonomous vehicle data, La Proteina, signal density, multilingual AI, evaluation benchmarks, sovereign AI, data curation