
DataChef: Cooking Up Optimal Data Recipes for LLM Adaptation via Reinforcement Learning

Intermediate
Yicheng Chen, Zerun Ma, Xinchen Xie et al. Ā· 2/11/2026
arXiv

Key Summary

  • DataChef teaches a large language model to be a smart data chef: it plans and codes full data pipelines that turn messy datasets into great training meals for other models.
  • Instead of waiting days to train a model and see if the data was good, DataChef uses a fast ā€œData Verifierā€ taste-tester that scores small samples instantly, guiding learning with rewards.
  • The system learns with reinforcement learning (GRPO), plus a cold-start phase where it studies good plans and code so it doesn’t write broken pipelines.
  • DataChef builds full recipes: it picks sources, filters junk, standardizes formats, synthesizes chain-of-thought, de-duplicates, mixes datasets, and outputs clean training dialogs.
  • Across six unseen tasks (math, code, physics, finance, climate, language), DataChef’s recipes match a top closed model (Gemini-3-Pro) and beat strong open baselines.
  • On math, a DataChef recipe adapts a 1.7B-base model to score 66.7 on AIME’25—better than industry-tuned baselines with expert recipes.
  • A key insight is using a proxy reward from an LLM-as-a-judge that strongly correlates with real downstream scores, making online RL practical and cheap.
  • Ablations show all parts matter: cold-start avoids reward hacking, dense verifier scores beat binary rewards, and end-to-end training outperforms splitting planning and coding.
  • This shifts AI from hand-made data pipelines to self-cooking, self-improving systems that can adapt to new domains faster and cheaper.
  • The approach opens doors to safer, more accurate, and domain-specialized models by continuously optimizing the data they learn from.

Why This Research Matters

Good data makes good models, and DataChef automates how to make that good data fast. This means smaller teams can build strong, domain-specific AIs—like reliable math tutors, medical assistants, or climate advisors—without months of manual data wrangling. Because the fast, rubric-based Data Verifier correlates with real performance, teams can iterate quickly and confidently. Automated recipes also reduce human error and make pipelines reproducible, safer, and easier to audit. As AI shifts from model-centric to data-centric improvement, tools like DataChef become essential for scaling quality, fairness, and adaptability across fields. Finally, this work points to self-improving systems that refine not just their answers but the very training experiences that teach them.

Detailed Explanation


01Background & Problem Definition

You know how a great meal depends on both the chef’s skill and the quality of the ingredients? For big AI models, the ā€œingredientsā€ are the training data, and the ā€œcookingā€ is the data pipeline that cleans, mixes, and shapes that data. Before this work, many teams believed the secret to strong models was mostly about building bigger ovens (larger models) or cooking longer (more compute). But as models matured, it became clear: the menu (what data you feed and how you prepare it) decides a lot of the taste.

šŸž Top Bread (Hook) Imagine a school science fair where everyone brings raw materials, but only some kids know how to turn cardboard, glue, and paint into a prize-winning volcano. 🄬 Filling (The Actual Concept)

  • What it is: A data processing pipeline is a step-by-step assembly line that turns messy raw data into clean training examples.
  • How it works: (1) Load raw data from different places, (2) filter out noise and mistakes, (3) standardize formats, (4) add helpful info like explanations, (5) mix sources wisely, (6) remove duplicates, (7) output in a training-ready format.
  • Why it matters: Without a pipeline, you feed the model junk or mismatched examples, and it learns the wrong things—like studying French to pass a math test. šŸž Bottom Bread (Anchor) A climate QA pipeline might pick climate-related questions, turn them into multiple choice, ensure answers are single letters (A–D), and save everything in the exact format the model needs.
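
To make those seven steps concrete, here is a minimal toy pipeline in Python. The function name and sample records are illustrative stand-ins, not DataChef’s actual code:

```python
# Minimal toy data pipeline: filter -> standardize -> dedup -> output.
# All names and records here are illustrative, not DataChef's code.

def run_pipeline(raw_records):
    # Steps 1-2: load raw records, filter out noise (missing fields).
    clean = [r for r in raw_records if r.get("question") and r.get("answer")]
    # Step 3: standardize into a dialog format the trainer expects.
    dialogs = [
        {"messages": [{"role": "user", "content": r["question"]},
                      {"role": "assistant", "content": r["answer"]}]}
        for r in clean
    ]
    # Step 6: remove exact duplicates by question text.
    seen, deduped = set(), []
    for d in dialogs:
        key = d["messages"][0]["content"]
        if key not in seen:
            seen.add(key)
            deduped.append(d)
    # Step 7: output in a training-ready form.
    return deduped

raw = [
    {"question": "What gas drives the greenhouse effect?", "answer": "Carbon dioxide"},
    {"question": "What gas drives the greenhouse effect?", "answer": "Carbon dioxide"},
    {"question": "", "answer": "noise"},
]
print(len(run_pipeline(raw)))  # noise and duplicates removed -> 1 record
```

The real pipelines also call LLMs for synthesis and mix multiple sources, but the filter–standardize–dedup–dump skeleton is the same.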

The problem: building these pipelines—the full ā€œdata recipeā€ā€”was mostly hand-made. Experts carefully chose datasets, ordered steps, wrote code, and iterated for weeks to get good results. Even when people used LLMs to help with single steps like filtering or synthesis, humans still designed the whole recipe. Exploring all possible combinations of steps, orders, and parameters is way too big to try manually.

Earlier attempts tried toolkits that let you plug blocks together or agents that follow static prompts. These helped, but they still: (1) leaned on humans to plan the pipeline, (2) needed expensive full model training to see if a recipe worked, and (3) couldn’t scale exploration to the huge space of choices. The feedback loopā€”ā€œcook data → train model for hours or days → see if it tasted goodā€ā€”was too slow and costly.

šŸž Top Bread (Hook) You know how recipes aren’t just ingredients? They’re the full plan: steps, timing, tools, and how to present the dish. 🄬 Filling (The Actual Concept)

  • What it is: A data recipe is the pipeline plus the resulting dataset—both the cooking plan and the finished meal.
  • How it works: You specify sources, write the processing code (filtering, formatting, synthesis), run it, and produce the final training data.
  • Why it matters: Without clear recipes, you can’t reproduce results, compare approaches, or improve systematically. šŸž Bottom Bread (Anchor) For math reasoning, a recipe might pick math datasets, generate step-by-step solutions, remove duplicates, and standardize to a dialogue format the model expects.

So what was missing? A way for an AI to (1) read a task description and available datasets, (2) plan and code the entire recipe end-to-end, and (3) learn from quick, cheap signals rather than full training runs. This paper fills that gap by turning recipe creation into a reinforcement learning problem with a fast ā€œtaste test.ā€

šŸž Top Bread (Hook) Imagine a kitchen robot that can cook a full dinner by itself, learning from quick bites instead of waiting till every dish is done. 🄬 Filling (The Actual Concept)

  • What it is: End-to-end data recipe generation is automatically creating the full plan and code for preparing training data from raw sources to final output.
  • How it works: The model reads the task + datasets, proposes a plan, writes executable code, runs it, and evaluates samples for quality—all in one loop.
  • Why it matters: Without this, you’re stuck hand-crafting pipelines or waiting on slow feedback, which blocks rapid improvement. šŸž Bottom Bread (Anchor) Given AIME’25 math as the goal and a list of math datasets, the system writes and runs a pipeline that outputs perfect-format math Q&A with chain-of-thought.

Real stakes: Faster, better recipes make better models—doctors’ assistants that answer accurately, finance bots that handle regulations, climate assistants that explain weather, and tutors that solve math carefully. Automating this recipe work means small teams can build domain experts quickly, instead of only giant labs with huge budgets. It also pushes us toward self-evolving AI: systems that improve the quality of the data they learn from, not just their parameters.

02Core Idea

The ā€œAha!ā€ moment in one sentence: Teach an LLM to be a data chef that plans and codes full data recipes, then learn which recipes work using a fast, reliable taste test (a Data Verifier) instead of waiting for expensive full trainings.

Explain it three ways:

  1. Kitchen analogy: The model is a chef. It chooses ingredients (datasets), follows cooking steps (filter, synthesize, standardize), plates the dish (final format), and asks a taste-tester for a quick score. The chef keeps improving recipes using those scores.
  2. Sports analogy: The model is a coach designing drills (data steps). A scrimmage judge gives instant grades on small plays (sample checks), so the coach can refine practice plans without playing a full season every time.
  3. City planner analogy: The model plans roads (pipeline), builds them (code), runs a few test drives (sample evaluation), and adjusts the map based on quick traffic feedback.

šŸž Top Bread (Hook) You know how puppies learn tricks faster with small treats for each good step, not just a big prize at the end? 🄬 Filling (The Actual Concept)

  • What it is: Reinforcement learning (RL) lets the model try recipe ideas and get rewards when the resulting data looks good.
  • How it works: (1) Propose a recipe plan + code, (2) execute to get data, (3) sample and score with a Data Verifier, (4) update the policy to make higher-scoring recipes more likely.
  • Why it matters: Without RL, the model can’t systematically improve from experience across many tasks. šŸž Bottom Bread (Anchor) If a recipe generates off-topic or malformed data, the score drops; if it’s clean, relevant, and correct, the score rises, so future recipes copy those winning moves.

Before vs. after: Before, humans guessed which steps mattered and waited for full model trainings to validate. After, the policy LLM writes and runs pipelines itself and learns from quick, well-correlated scores. This flips the bottleneck: data recipes evolve in hours, not weeks.

Why it works intuitively: The Data Verifier’s rubric-based judgments align with actual downstream performance across diverse tasks (strong positive correlations). Dense, fine-grained feedback (five rubric categories, each with its own score) teaches the policy which specific choices help: formatting outputs, staying on-topic, avoiding wrong answers, and using the right synthesis strategy.

Building blocks:

  • A diverse task pool (31 benchmarks across 19 domains) ensures the chef practices many cuisines.
  • Cold-start supervised fine-tuning gives the chef basic cooking and coding skills so it doesn’t write broken scripts.
  • Group Relative Policy Optimization (GRPO) focuses learning on better-than-average candidates within each batch, with KL regularization to stay stable.
  • Execution penalties discourage reward hacking (like returning empty datasets just to avoid mistakes).
  • End-to-end integration outperforms splitting planning and coding across separate models during training.

šŸž Top Bread (Hook) Think of a fair judge who can quickly tell if a quiz question and answer match the rules and are correct. 🄬 Filling (The Actual Concept)

  • What it is: A Data Verifier is an LLM-as-a-judge that grades small samples of the produced dataset using a clear rubric.
  • How it works: It checks validity, format, correctness, and task alignment, assigning each sample to one of five categories with fixed scores: Invalid (0), Format Error (0), Incorrect (0), Task Mismatch (0.4), or Pass (1.0).
  • Why it matters: Without this quick, reliable taste test, you’d have to fully train a model each time to see if the data was any good—too slow and costly. šŸž Bottom Bread (Anchor) If the benchmark wants a single-letter answer and the sample gives a paragraph, the Verifier flags Task Mismatch; if it’s on-topic and correct, it’s a Pass.
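
The rubric above can be sketched as a tiny scorer. The category scores follow the paper’s description; the `classify` stub is a hypothetical rule-based stand-in for the actual LLM-as-a-judge call:

```python
# Rubric scores as described for the Data Verifier:
# Invalid / Format Error / Incorrect -> 0, Task Mismatch -> 0.4, Pass -> 1.0.
RUBRIC = {
    "invalid": 0.0,
    "format_error": 0.0,
    "incorrect": 0.0,
    "task_mismatch": 0.4,
    "pass": 1.0,
}

def classify(sample, valid_choices=("A", "B", "C", "D")):
    """Toy stand-in for the LLM judge: expects a single-letter MCQ answer."""
    answer = sample.get("answer", "")
    if not answer:
        return "invalid"
    if answer not in valid_choices:
        # e.g. a paragraph-long answer when the task wants one letter
        return "task_mismatch"
    return "pass"

def verify(sample):
    return RUBRIC[classify(sample)]

print(verify({"answer": "B"}))                 # 1.0
print(verify({"answer": "The answer is B."}))  # 0.4 (Task Mismatch)
print(verify({"answer": ""}))                  # 0.0 (Invalid)
```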

šŸž Top Bread (Hook) Imagine getting a sticker chart where different stickers mean different levels of success; more detailed stickers teach faster. 🄬 Filling (The Actual Concept)

  • What it is: A proxy reward is a stand-in score that predicts real success (downstream accuracy) but is much cheaper to compute.
  • How it works: Sample a subset of the produced data, grade each with the Verifier, average scores, and apply penalties if execution failed or formats broke.
  • Why it matters: Without a good proxy, RL can’t scale; with a strong proxy, learning is fast and generalizes across domains. šŸž Bottom Bread (Anchor) On climate QA, a few dozen graded samples can predict which recipe will train a better model—no need to train and test the full model for every try.

03Methodology

At a high level: Task description + benchmark + available datasets → Policy LLM proposes a plan and code (the recipe) → Execute to produce training data → Sample and score with a Data Verifier → Use the score as reward to update the policy via RL → Deploy the best recipe for real model fine-tuning.

  1. Inputs and problem setup
  • Each task T has: (I) a natural language instruction describing the goal and evaluation protocol, (Ļ„) the benchmark metric, and (D) a set of available datasets. The policy’s job is to produce r = (g, d), where g is an executable pipeline (plan + Python code) and d is the resulting training data.
  2. Cold-start supervised fine-tuning (SFT)
  Why this step exists: Starting RL from scratch made the model write trivial or broken code to avoid penalties (reward hacking). Cold-start gives it solid basics.
  How it works:
  • The team used a strong reasoning model to write high-quality plans and a strong coding model to implement them. They executed and filtered out failures, keeping only valid plan+code pairs.
  • The policy LLM (Qwen3-32B) was fine-tuned on ~5K such examples, teaching it to separate planning and coding in its outputs and to adhere to the required tool APIs and output formats.
  Example: For math, a plan might select OpenR1-Math and MetaMathQA, add chain-of-thought synthesis for harder problems, deduplicate, and enforce a ShareGPT dialog format.
  3. Recipe generation: plan and code
  • The policy outputs: (a) a natural language plan that lists selected sources with justifications and the full workflow, and (b) an executable Python script implementing the plan.
  • Typical steps: load datasets, content filtering and subsetting (e.g., climate-related topics), standardization (dialog format), data augmentation (e.g., generate multiple choice or chain-of-thought), validation, deduplication, and final dump. What breaks without it: If code and plan aren’t coupled, the script may ignore constraints (like answer-only outputs), causing format or task mismatches.

šŸž Top Bread (Hook) You know how following a list of steps helps you bake cookies the same way every time? 🄬 Filling (The Actual Concept)

  • What it is: A data processing pipeline is the concrete, ordered code steps that transform raw inputs into clean training examples.
  • How it works: It calls utility functions (load_remote_dataset, select_by_filter, generate_dataset_with_llm, deduplicate, format_to_sharegpt, dump_dataset) in a sensible order.
  • Why it matters: It ensures repeatable, auditable results that match the benchmark’s required I/O format. šŸž Bottom Bread (Anchor) In ClimaQA, the pipeline converts free-form Q&A into 4-option multiple choice with single-letter answers, then saves standardized training dialogs.
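
As a sketch of the ClimaQA-style conversion, here is a toy function that turns a free-form QA pair into a four-option MCQ with a single-letter answer. In the real pipeline the distractors would come from an LLM synthesis step; here they are passed in directly, and all names are illustrative:

```python
import random

def to_mcq(question, correct, distractors, rng=random.Random(0)):
    """Convert a free-form QA pair into a 4-option MCQ with a single-letter answer.

    `distractors` would be LLM-synthesized in a real pipeline; here they are
    supplied by the caller (illustrative, not DataChef's actual code).
    """
    options = [correct] + list(distractors[:3])
    rng.shuffle(options)  # seeded for reproducibility
    letters = ["A", "B", "C", "D"]
    lines = [question] + [f"{l}) {o}" for l, o in zip(letters, options)]
    answer = letters[options.index(correct)]
    return {"prompt": "\n".join(lines), "answer": answer}

item = to_mcq(
    "Which gas contributes most to the enhanced greenhouse effect?",
    "Carbon dioxide",
    ["Nitrogen", "Argon", "Neon"],
)
print(item["answer"] in "ABCD")  # True: single-letter label, as the benchmark requires
```

A converter like this is exactly the kind of step the Verifier later checks: if the saved answers drift from single letters, samples get flagged as Task Mismatch.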
  4. Data Verifier: fast, rubric-based grading
  • The Verifier (LLM-as-a-judge) checks a random sample of produced data and classifies each item into: Invalid (0), Format Error (0), Incorrect (0), Task Mismatch (0.4), or Pass (1.0).
  • The final reward is the average over samples, with penalties for empty outputs (execution failure) or broken global format. What breaks without it: RL has no fast signal to improve; exploration stalls because full downstream trainings are too expensive per attempt.

šŸž Top Bread (Hook) Think of a math teacher who spot-checks a few problems to judge how well you learned, instead of grading every single worksheet. 🄬 Filling (The Actual Concept)

  • What it is: A proxy reward summarizes how good the dataset looks without training a model on it.
  • How it works: Sample K items, grade each via rubric, average scores, subtract penalties for failures, and feed this as reward to the policy.
  • Why it matters: It makes online RL practical and lets the policy learn quickly which choices improve downstream results. šŸž Bottom Bread (Anchor) If multiple choice answers are consistently single letters and correct, the reward climbs; if outputs drift into essays, the reward falls.
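
That reward computation fits in a few lines. The averaging follows the paper’s description; the exact penalty values are illustrative assumptions:

```python
def proxy_reward(sample_scores, executed_ok=True, format_ok=True, penalty=1.0):
    """Average rubric scores over K sampled items, with penalties for
    execution failure or a broken global format (penalty value is an
    illustrative assumption, not the paper's exact scheme)."""
    if not executed_ok or not sample_scores:
        return -penalty            # empty output or crashed pipeline
    reward = sum(sample_scores) / len(sample_scores)
    if not format_ok:
        reward -= penalty
    return reward

print(proxy_reward([1.0, 1.0, 0.4, 0.0]))   # (1 + 1 + 0.4 + 0) / 4 = 0.6
print(proxy_reward([], executed_ok=False))  # -1.0: no data, penalized
```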
  5. Reinforcement learning with GRPO
  • For each task, sample a group of G candidate recipes. Compute their rewards via the Verifier. Convert to group-relative advantages (each recipe’s score compared to the group’s mean and variance). Update the policy to prefer better-than-group recipes, with a KL term to stay near a reference policy.
  • This stabilizes training: the model doesn’t wildly drift, and it learns which operations (e.g., deduplication + chain-of-thought synthesis) are consistently helpful. What breaks without it: Plain policy gradients can be unstable; the model may overfit to oddities or collapse to trivial scripts.
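
The group-relative advantage at the heart of GRPO is easy to sketch (the KL term lives in the loss and is not shown; whether population or sample standard deviation is used varies by implementation):

```python
import statistics

def group_advantages(rewards, eps=1e-8):
    """Group-relative advantages as in GRPO: each recipe's reward is
    normalized by the group's mean and standard deviation."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)  # population std; some variants use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four candidate recipes for one task, scored by the Verifier:
adv = group_advantages([0.9, 0.6, 0.6, 0.3])
print(adv[0] > 0 > adv[-1])  # True: best recipe pushed up, worst pushed down
```

Because advantages are relative to the group, the policy only needs to beat its own siblings on each task, which keeps the signal meaningful even when tasks differ in absolute difficulty.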
  6. Task pool and training dynamics
  • 31 benchmarks across 19 domains; 25 used for training, 6 held out (physics, math, code as in-domain; finance, climate, CHID as out-of-domain). The pool ensures variety so the policy generalizes.
  • Training uses batch sampling of tasks, with 8 candidates per task per step, temperature sampling for diversity, and small learning rates to converge steadily.
  7. Inference-time use
  • At test time, the model generates 32 recipes; you can pick the best-scoring one (by the Verifier) or train on a random valid one for fair evaluation.
  • For real deployment, you’d likely pick the top-scoring recipe or combine a few.
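
Best-of-N selection is just an argmax over Verifier scores; a minimal sketch with made-up scores:

```python
def pick_best(recipes, scores):
    """Best-of-N selection: keep the recipe with the highest Verifier score."""
    best_idx = max(range(len(scores)), key=scores.__getitem__)
    return recipes[best_idx]

# Hypothetical candidate recipes and their proxy-reward scores:
recipes = ["recipe_a", "recipe_b", "recipe_c"]
print(pick_best(recipes, [0.55, 0.82, 0.61]))  # recipe_b
```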

The secret sauce

  • Dense, well-designed rubric rewards: More informative than a binary ā€œran vs. failed.ā€
  • Cold-start SFT: Prevents reward hacking and teaches reliable tool use.
  • End-to-end integration: Training the same model to both plan and code beat decoupled approaches during training.
  • Execution penalties: Encourage real, high-quality outputs—not empty or ultra-safe scripts.

04Experiments & Results

The test: Does a recipe that looks good to the Data Verifier actually lead to a better fine-tuned model? And can the policy generate such recipes reliably across new tasks?

What they measured and why

  • DVS_avg@32: The average Data Verifier Score across 32 independently generated recipes. This reflects how often and how well the policy produces valid, high-quality datasets (stability and expected quality).
  • DBS: The real downstream score after fine-tuning a base model (Qwen3-1.7B-Base, 3 epochs) on one randomly sampled valid recipe from the 32. This checks true end performance. Why both: DVS shows recipe quality without huge compute; DBS confirms that high DVS translates to real gains.

The competition

  • Parameter-matched open baseline: Qwen3-32B.
  • Open-source flagships: Kimi-K2-Instruct and a combo planner+coder system (Qwen3-Next-80B for planning + Kimi-K2 for coding).
  • Closed-source top-tier: Gemini-3-Pro.
  • References: SOURCE_avg and SOURCE_best (train directly on raw sources, averaged or best single), and EXPERT (industry-grade post-training on expert-curated recipes).

The scoreboard with context

  • Across six held-out tasks (physics, AIME’25 math, LiveCodeBench code; climate QA, finance QA, CHID language), DataChef-32B’s average DVS and DBS matched Gemini-3-Pro closely and surpassed strong open-source baselines.
  • Oracle setting (pick the best of the 32 generated recipes) shows the policy can craft exceptional pipelines: 66.7 on AIME’25 and 46.3 on ClimaQA—better than expert-curated baselines in those domains.
  • Compared to a powerful open combo (Qwen3-Next+Kimi-K2), DataChef-32B improved average DVS by about +8–9% and DBS by about +7–11% across in-domain and out-of-domain tasks. That’s like moving from a solid B to a low A while being cheaper to iterate.
  • Crucially, DataChef beats SOURCE_best on most tasks in Oracle mode, proving it isn’t just picking one good dataset—it’s making better recipes via mixing, filtering, and smart synthesis.

Surprising and insightful findings

  • Verifier correlation: The Data Verifier’s scores correlate strongly and consistently with real downstream performance (average Pearson rā‰ˆ0.59 across six tasks), often beating other data-quality metrics. This validates the proxy reward as a dependable stand-in.
  • RL matters most out-of-domain: Reinforcement learning especially boosted generalization to new domains (finance, climate, CHID) while keeping in-domain gains steady.
  • Cold-start is essential: Removing SFT caused the policy to generate simpler, safer, but low-utility scripts—classic reward hacking. SFT helped it confidently use advanced ops like synthesis and dedup.
  • Dense vs. sparse rewards: Replacing the fine-grained verifier scores with a flat ā€œsuccessā€ reward degraded performance, showing the importance of detailed feedback.
  • Plan/coder split: Using a strong external coder helps at inference, but training the policy to plan-and-code end-to-end produced the best results overall—tight coupling seems to teach the right biases.
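
To see what a Verifier-validation check like this looks like, here is a plain Pearson correlation over hypothetical (DVS, DBS) pairs; the data points are made up for illustration:

```python
import statistics

def pearson_r(xs, ys):
    """Plain Pearson correlation between verifier scores and downstream scores."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-recipe (verifier score, downstream benchmark score) pairs:
dvs = [0.3, 0.5, 0.6, 0.8, 0.9]
dbs = [20.0, 31.0, 35.0, 44.0, 50.0]
print(round(pearson_r(dvs, dbs), 3))  # close to 1.0 for strongly aligned pairs
```

A high r on held-out tasks is what licenses the proxy reward: if it were weak, optimizing DVS could drift away from real downstream gains.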

Concrete example in action

  • Climate QA: The generated recipe filtered climate-related questions, converted open answers into four-option MCQ with single-letter outputs, deduplicated, and standardized dialogs. The Verifier rewarded correct, on-topic, format-perfect samples. This recipe trained a model that scored competitively on ClimaQA.
  • Math reasoning: Recipes selected math datasets, synthesized chain-of-thought for harder problems, enforced exact output formats, and removed duplicates—contributing to the 66.7 AIME’25 result.

05Discussion & Limitations

Limitations

  • LLM-as-a-judge tradeoff: Using a general Verifier maximizes coverage but may miss fine-grained issues in niche fields (e.g., advanced medical sub-specialties). Specialized verifiers per domain could improve precision.
  • Still needs compute: While far cheaper than training per recipe, running code, LLM-based synthesis, and repeated verifier calls still require significant resources and careful sandboxing.
  • Data availability: If suitable sources don’t exist for a niche task, the system can’t conjure ground truth; synthesis helps, but quality may cap out.
  • Strict I/O contracts: Extremely rigid formats (e.g., constrained JSON schemas with edge cases) can be brittle; minor deviations cause Task Mismatch penalties.

Required resources

  • A capable policy LLM (here, 32B) with tool-using code generation skills.
  • A code execution environment with dataset utilities, dedup tools, and formatters.
  • Access to an LLM-as-a-judge for the Verifier (prompted with the rubric) and, optionally, an auxiliary coder.
  • Storage and bandwidth for pulling multiple datasets and saving processed outputs.

When not to use

  • Ultra-niche domains with tiny or proprietary datasets where external verification is unreliable or forbidden.
  • Tasks where the evaluation protocol is unknown or unstable, making alignment and format checks impossible.
  • Situations with extremely limited budget where even verifier calls and dataset processing are too costly.

Open questions

  • Better proxies: Can we build hybrid verifiers (symbolic + LLM) for higher precision and interpretability? Can we include safety/fairness or compute cost directly in multi-objective rewards?
  • Continual learning: How do we maintain and refresh recipes as benchmarks evolve, preventing drift and contamination?
  • Robustness: How to harden against reward hacking or silent format failures at scale?
  • Transfer: Can smaller student policies learn from a larger teacher policy’s recipe habits to reduce inference cost?
  • Human-in-the-loop: What’s the best way to combine expert nudges with automated exploration for specialized domains?

06Conclusion & Future Work

In three sentences: DataChef turns data preparation into an end-to-end learning problem: a policy LLM plans and codes full data recipes, and a fast Data Verifier scores samples so the policy improves with reinforcement learning. With a diverse task pool, cold-start SFT, dense rubric rewards, and GRPO, DataChef reliably generates practical recipes that match a top closed model and sometimes surpass expert-curated pipelines. This shows AI can now self-cook the data it learns from, accelerating adaptation across domains.

Main achievement: Replacing slow, manual, trial-and-error pipeline design with an automated, RL-driven system guided by a well-validated proxy reward that correlates with real downstream performance.

Future directions: Build specialized verifiers for niche areas, incorporate multi-objective rewards (quality, safety, cost), compress the policy for cheaper inference, and support continual recipe evolution without contamination. Explore teacher–student distillation so smaller models inherit strong recipe instincts.

Why remember this: As models saturate on raw scale, data quality and preparation become the main lever. DataChef shows we can automate that lever—teaching models not just how to answer questions, but how to prepare the right training experiences for themselves—pushing us toward self-improving AI ecosystems.

Practical Applications

  • Automatically curate and standardize domain-specific datasets (e.g., math, medicine, finance) for supervised fine-tuning.
  • Rapidly adapt a base LLM to a new benchmark by generating multiple candidate recipes and selecting the top one by Verifier score.
  • Build safer training sets by enforcing strict output formats (e.g., single-letter MCQ answers) and filtering task-mismatched or incorrect samples.
  • Scale data augmentation (e.g., chain-of-thought synthesis, MCQ conversion) to strengthen reasoning or format adherence.
  • Continuously refresh training data for continual learning, using the Verifier to guard quality over time.
  • Run low-cost A/B testing of data pipelines using the proxy reward instead of full fine-tunes for every variant.
  • Bootstrap low-resource or specialized domains by mixing best available sources and targeted synthesis under tight format rules.
  • Audit and reproduce pipelines end-to-end with executable code artifacts, improving compliance and traceability.
  • Combine human insight with automated search: experts set constraints; DataChef explores and refines recipes within guardrails.
  • Transfer winning recipe patterns across tasks (e.g., dedup + standardize + targeted synthesis) to speed up new project kickoffs.
#data recipe#data processing pipeline#reinforcement learning#LLM-as-a-judge#proxy reward#GRPO#data verifier#chain-of-thought synthesis#dataset deduplication#data mixing#AIME’25#ClimaQA#Qwen3#Gemini-3-Pro#online RL