Reasoning Core: A Scalable Procedural Data Generation Suite for Symbolic Pre-training and Post-Training
Key Summary
- Reasoning Core is a tool that automatically creates a huge variety of logic and math puzzles, checks every answer with real solvers, and lets you smoothly dial the difficulty up or down.
- It focuses on deep, foundational skills like planning, first-order logic, grammar parsing, causal reasoning with Bayesian networks, and solving equations.
- Each task can include a solver-made 'show your work' trace so models can learn step-by-step thinking from the start.
- The same API works for pre-training (feeding models lots of practice) and for reinforcement learning with verifiable rewards (giving models points only when provably correct).
- Unlike template puzzles, Reasoning Core samples from very broad distributions (e.g., many random PDDL domains), which prevents overfitting to a few examples.
- A new grammar framework (gramforge) adds 'bushiness' control and context sensitivity to generate rich, structurally diverse code, language, and logic data.
- Mixing Reasoning Core data into pre-training improved downstream reasoning while keeping, and sometimes slightly improving, general language modeling quality.
- Zero-shot tests show these tasks are challenging even for frontier models like GPT-5, especially at higher difficulty.
- Datasets (about 5B pre-training tokens and 2B post-training tokens) and code are released under MIT, with containerized external solvers for reliable checking.
- This suite provides a scalable, legally safe, and reproducible way to build neurosymbolic reasoning skills before and during post-training.
Why This Research Matters
Reasoning Core turns reasoning practice into something you can scale, verify, and pace: like a gym that never runs out of safe, effective workouts. It teaches models durable skills such as planning, logic, and structured parsing, which show up in everyday tools: better homework helpers, safer software assistants, and more reliable data analysis. Because every example is solver-checked and procedurally novel, you avoid noisy labels, licensing risks, and benchmark contamination. The single difficulty knob enables smooth curricula, so training matches a model's current level. Optional step-by-step traces help models learn how to think, not just what to answer. And the same interface works for both pre-training and RL with verifiable rewards, unifying today's training and tomorrow's advanced optimization.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook) Imagine learning soccer only by watching games online, without ever practicing drills like passing, dribbling, or shooting. You might pick up some moves, but your core skills would stay shaky.
🥬 The Concept: The World Before
- What it is: Before this work, language models mostly learned from web text, which is like watching games: lots of examples but not many targeted, verifiable drills for reasoning.
- How it works (before):
- Gather huge web datasets.
- Train models to predict the next word.
- Hope general reasoning "emerges" from patterns in text.
- Why it matters: This misses core practice on logic, planning, and math with guaranteed right-or-wrong checks. Models can sound smart but struggle on step-by-step reasoning or new problem shapes.
🍞 Bottom Bread (Anchor) A student who only reads math solutions (web text) might talk about algebra, but freeze when given a fresh system of equations to solve.
🍞 Top Bread (Hook) You know how puzzle books often repeat the same few patterns? After a while, you get good at those exact tricks but not at brand-new ones.
🥬 The Concept: The Problem (Lack of distributional breadth)
- What it is: Many procedural generators rely on fixed templates or classic puzzles (like only BlocksWorld for planning), so they don't cover enough variety to teach broad, transferable reasoning.
- How it works (problem):
- Pick a small set of puzzle templates.
- Fill in blanks with random numbers/words.
- Train the model.
- Why it matters: The model overfits to patterns. Change a small rule, and performance collapses.
🍞 Bottom Bread (Anchor) If you only practice solving mazes that are squares, a twisty circle maze might stump you, even if it's not harder, just different.
🍞 Top Bread (Hook) Imagine a teacher who can invent unlimited fresh practice problems, checks every answer with a calculator, and adjusts difficulty smoothly, like turning a dimmer knob.
🥬 The Concept: Procedural Generation
- What it is: Procedural generation creates new problems automatically using algorithms.
- How it works:
- Define rules for a domain (e.g., planning, logic).
- Randomly sample many different instances.
- Use an external solver to verify the correct answers.
- Why it matters: You get endless, diverse, and trustworthy practice data without hand-crafting each puzzle.
🍞 Bottom Bread (Anchor) Like a video game that builds a new level every time you play, and the game engine itself confirms that the level is beatable.
🍞 Top Bread (Hook) Think of a dimmer switch on a lamp: you don't just have off or on; you have smooth control from cozy to super bright.
🥬 The Concept: Curriculum Design with a Difficulty Knob
- What it is: A single continuous parameter that smoothly scales problem difficulty across tasks.
- How it works:
- Map the knob value to task parameters (e.g., plan length, proof depth).
- Use stochastic rounding for discrete parts (like number of variables).
- Adjust timeouts and sampling so generation stays efficient at all levels.
- Why it matters: Teachers can ramp up difficulty as the model improves, preventing boredom (too easy) or frustration (too hard).
🍞 Bottom Bread (Anchor) Start with a two-step plan like "pick up key, open door," then dial up to a dozen steps with detours and locked rooms.
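The knob mechanics described above (stochastic rounding for discrete parameters, timeouts that scale with level) can be sketched in a few lines. The function names and scaling constants here are illustrative assumptions, not the suite's actual API:

```python
import random

def stochastic_round(x, rng=random):
    # Round x up with probability equal to its fractional part, so the
    # expected value of the rounded parameter equals x exactly. This lets
    # a real-valued difficulty level drive discrete settings smoothly.
    base = int(x)
    frac = x - base
    return base + (1 if rng.random() < frac else 0)

def knob_to_params(level):
    # Map one continuous difficulty knob to several task parameters
    # (the scaling constants are made up for illustration).
    return {
        "plan_length": stochastic_round(2 + 1.5 * level),
        "num_variables": stochastic_round(1 + 0.8 * level),
        "timeout_s": 5.0 * (1 + level),  # timeouts grow with difficulty
    }
```

Averaged over many samples, `plan_length` at level 3 sits at 6.5 even though each individual instance gets an integer, which is exactly what makes the knob continuous.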
🍞 Top Bread (Hook) When you do a math worksheet, you want the answer key made by a real calculator, not a guess.
🥬 The Concept: External Solver Verification
- What it is: Each generated problem is checked by a trusted solver (like a theorem prover, planner, or algebra system).
- How it works:
- Generate a problem.
- Ask an expert tool (e.g., Vampire/E for logic, FastDownward for PDDL, Sympy for equations) for the ground-truth answer.
- Store the verified answer and a reward function to score model outputs.
- Why it matters: You get objective, unambiguous correctness, perfect for both supervised learning and RL with verifiable rewards.
🍞 Bottom Bread (Anchor) Like grading a math quiz by re-computing each result on a scientific calculator so you know exactly who's right.
🍞 Top Bread (Hook) You know how it helps to see the steps of a recipe, not just the final cake?
🥬 The Concept: Chain-of-Thought Traces
- What it is: Optional solver-derived or algorithmic step-by-step explanations included with some problems.
- How it works:
- While solving, log intermediate steps (e.g., bottom-up arithmetic evaluation).
- Store the sequence as a trace next to the final answer.
- Use these traces during training to model stepwise reasoning.
- Why it matters: Models don't just memorize answers; they learn how to think through problems.
🍞 Bottom Bread (Anchor) For (3+4.5)*min(8,12)-2**2, the trace shows: 3+4.5=7.5; min(8,12)=8; 7.5*8=60; 2**2=4; 60-4=56.
To make this concrete, Reasoning Core targets core domains that build lasting skills: planning (PDDL over many randomized domains), full first-order logic with equality, grammar parsing/generation for arbitrary CFGs, causal reasoning over random Bayesian networks, and systems of equations. Each has a built-in difficulty knob, external verification, and often includes step-by-step traces. Unlike suites that offer many narrow puzzles, Reasoning Core offers fewer, deeper generators covering very broad distributions, so improvement generalizes beyond a few fixed patterns.
Finally, the stakes: better reasoning in AI shows up in daily life, from more reliable assistants for homework (logic, equations) to safer planning for robots (PDDL), clearer code tools (grammar/regex tasks), sounder data analysis (tables, Bayesian reasoning), and models that can truly explain their steps. That's why a scalable, verifiable, curriculum-ready source of symbolic reasoning data matters.
02 Core Idea
🍞 Top Bread (Hook) Imagine teaching a student by giving them endless, freshly invented, well-checked puzzles that grow with them, so they practice the skill itself, not just a few tricks.
🥬 The Concept: The "Aha!" Moment
- What it is: The key insight is to pre-train and post-train language models on procedurally generated, solver-verified symbolic tasks drawn from broad distributions, with a single difficulty knob and optional step-by-step traces.
- How it works:
- Generate diverse instances across foundational domains (planning, logic, grammar, equations, causality).
- Verify each instance with external solvers; attach rewards and (when possible) reasoning traces.
- Mix this data into pre-training (and later RL) so models build reusable reasoning primitives.
- Why it matters: Without breadth plus verification, models overfit templates or learn noisy habits; with both, they develop durable, transferable reasoning skills.
🍞 Bottom Bread (Anchor) It's like practicing music with an app that makes new songs each day, knows the right notes, and records your fingering steps, so you learn technique, not just one tune.
Multiple analogies (three ways):
- Sports analogy: Instead of only scrimmages (web text), you drill core moves: passes, sprints, shots (symbolic tasks), with a coach whistle (verifier) and adjustable resistance (difficulty knob).
- Cooking analogy: Rather than copying a few recipes, you master techniquesāsaute, bake, seasonāby making endless variations while a smart oven (solver) checks temperature and timing.
- Video game analogy: Levels are always new (procedural), bosses are fair and test real skills (verified), and difficulty slides smoothly so you never plateau.
Before vs After:
- Before: Narrow templates, fixed puzzles, and unverified answers risk teaching shortcuts.
- After: Broad, randomized domains with rigorous checking and curriculum control teach the underlying reasoning, not the pattern.
🍞 Top Bread (Hook) You know how a toolbox lets you build many kinds of furniture, not just one chair?
🥬 The Concept: Distributional Generality
- What it is: Sampling from wide, principled distributions (e.g., random STRIPS planning domains, arbitrary first-order logic formulas) so tasks cover many shapes of the same core idea.
- How it works:
- Define domain-generating rules that can vary structure, size, and content.
- Sample across those choices to avoid repeating narrow patterns.
- Verify correctness so breadth doesn't break reliability.
- Why it matters: Skills learned this way transfer far better to new problems.
🍞 Bottom Bread (Anchor) Just as a runner trains on hills, tracks, and trails, not only treadmills, models train on many problem terrains.
🍞 Top Bread (Hook) Think of growing a tree: you don't want a single tall stick; you want many branches.
🥬 The Concept: Grammar-Based Generation (gramforge) with Bushiness
- What it is: A grammar framework that generates text/code/logic with control over depth and lateral branching ("bushiness"), plus context sensitivity for things like variable scope.
- How it works:
- Write grammars that can yield multiple synchronized outputs (e.g., English + logic form).
- Sample with a bushiness factor to expand structure sideways, not just deeper.
- Track state (like variables) to stay consistent beyond simple CFGs.
- Why it matters: You get rich, varied structures instead of long but skinny ones, improving learning.
🍞 Bottom Bread (Anchor) A generated Python function includes loops and conditionals that reference the right variables in the right places, like a real program, not a toy snippet.
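gramforge itself is not spelled out here, but the bushiness idea can be illustrated with a toy weighted-CFG sampler, where a bushiness parameter tilts rule choice toward branching expansions. The grammar, weighting scheme, and function names are all invented for illustration:

```python
import random

# Toy grammar: nonterminals map to alternative right-hand sides.
GRAMMAR = {
    "EXPR": [["NUM"], ["EXPR", "+", "EXPR"], ["(", "EXPR", ")"]],
    "NUM": [["1"], ["2"], ["3"]],
}

def sample(symbol, bushiness=0.5, depth=0, max_depth=6, rng=random):
    # Expand `symbol`; bushiness > 0 up-weights rules with more
    # nonterminals, widening the tree sideways rather than only
    # growing it deeper.
    if symbol not in GRAMMAR:
        return [symbol]  # terminal
    rules = GRAMMAR[symbol]
    if depth >= max_depth:
        rules = [r for r in rules if symbol not in r]  # cut recursion
    weights = [(1 + bushiness) ** sum(s in GRAMMAR for s in r) for r in rules]
    rule = rng.choices(rules, weights=weights, k=1)[0]
    out = []
    for s in rule:
        out.extend(sample(s, bushiness, depth + 1, max_depth, rng))
    return out
```

With a high bushiness, `" ".join(sample("EXPR", 2.0))` tends to yield nested, well-bracketed sums rather than a lone digit, which is the "many branches, not one tall stick" effect.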
Why it works (intuition):
- Curriculum: Starting easy and scaling difficulty lets models form stable building blocks before tackling complex compositions.
- Verification: External solvers remove label noise, giving clear signals (great for RLVR too).
- Breadth: Randomized domains prevent shortcut learning, nudging models to acquire algorithms (e.g., search, unification, parsing) in their weights.
- Traces: Seeing the steps helps models learn procedures, not just ends.
Building blocks:
- Generator modules for core domains (planning, FOL, CFGs, Bayesian nets, equations).
- External solver connectors with containerized tools for portability.
- A unified API: generate_example(level), answer, metadata.trace, score_answer.
- A difficulty scheduler to build curricula.
- Scalable data production: parallelism, timeouts, distribution balancing.
🍞 Bottom Bread (Anchor) Put together, it's like a school-in-a-box that can invent, grade, and pace infinite logic and math lessons tailored to the learner, every day.
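Putting the building blocks together, the unified interface might look roughly like this. The method names mirror the list above; the Example class, the stand-in equation task, and all constants are hypothetical, not the suite's real implementation:

```python
import random
from dataclasses import dataclass, field

@dataclass
class Example:
    prompt: str
    answer: str
    metadata: dict = field(default_factory=dict)

class LinearEquationTask:
    """Stand-in for one generator behind the unified API."""

    def generate_example(self, level=0, rng=random):
        # Difficulty widens the coefficient and solution ranges.
        a = rng.randint(1, 2 + level)
        x = rng.randint(-5 - level, 5 + level)
        b = rng.randint(-10, 10)
        c = a * x + b
        trace = [f"{a}*x = {c} - ({b}) = {c - b}",
                 f"x = {c - b}/{a} = {x}"]
        return Example(prompt=f"Solve for x: {a}*x + {b} = {c}",
                       answer=str(x),
                       metadata={"trace": trace})

    def score_answer(self, example, proposed):
        # Verifiable reward: 1.0 iff provably correct, else 0.0.
        return 1.0 if proposed.strip() == example.answer else 0.0

def get_task(name):
    return {"linear_equation": LinearEquationTask}[name]()
```

An RLVR loop would then call score_answer on the model's output, and a curriculum scheduler would simply raise level over time.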
03 Methodology
High-level pipeline: Input → [Task & difficulty] → [Procedural generation] → [External verification + optional trace] → [Training example + reward]
Step-by-step (with the Sandwich pattern for key steps):
🍞 Top Bread (Hook) Picture ordering from a menu: you pick the dish and how spicy you want it.
🥬 The Concept: Task & Difficulty Selection
- What it is: Choose a domain (e.g., planning, logic) and set a single difficulty knob.
- How it works:
- Call get_task(name) to select a generator.
- Pass level=k (real-valued) to control parameters (proof depth, plan length, variables).
- Internally, discrete parameters use stochastic rounding and timeouts scale with level.
- Why it matters: It standardizes curricula across very different domains.
🍞 Bottom Bread (Anchor) Example: t = get_task('planning'); ex = t.generate_example(level=3) yields a longer multi-step plan problem than level=0.
🍞 Top Bread (Hook) Imagine a factory that can stamp out endless shapes following blueprints.
🥬 The Concept: Procedural Instance Generation
- What it is: Algorithms sample fresh problems from broad distributions defined per task.
- How it works:
- Randomize structure (e.g., PDDL domains: actions, preconditions, effects).
- Randomize content (objects, names, constants) and representation (e.g., graph formats).
- Use grammar-based generators (gramforge) with bushiness and context sensitivity for language/code.
- Why it matters: Prevents overfitting to specific templates and ensures structural diversity.
🍞 Bottom Bread (Anchor) Two planning problems might have completely different action vocabularies and goal shapes, yet test the same planning skill.
🍞 Top Bread (Hook) When you bake, a thermometer keeps you honest about the cake's center.
🥬 The Concept: External Solver Verification & Reward
- What it is: Each instance is checked by a specialized tool, and score_answer() provides a verifiable reward.
- How it works:
- For logic: call Vampire/E to prove entailment/contradiction.
- For planning: use FastDownward to validate plans (accept any valid plan, with length penalty).
- For equations: use Sympy for exact solutions/simplifications.
- Why it matters: Reliable labels and reward signals reduce noise and enable RLVR.
🍞 Bottom Bread (Anchor) Give a proposed plan; the planner replays it to confirm it really reaches the goal, no guesswork.
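Plan replay itself is conceptually tiny. A minimal STRIPS-style sketch (hand-rolled for illustration, not FastDownward, with a made-up key-and-door domain):

```python
# Each action: preconditions, add-effects, delete-effects over a set of facts.
ACTIONS = {
    "pick_up_key": {"pre": {"at_door"}, "add": {"has_key"}, "del": set()},
    "open_door": {"pre": {"at_door", "has_key"}, "add": {"door_open"}, "del": set()},
}

def validate_plan(init, goal, actions, plan):
    # Replay the plan from the initial state, checking each action's
    # preconditions, then test whether the goal facts hold at the end.
    state = set(init)
    for name in plan:
        act = actions[name]
        if not act["pre"] <= state:
            return False  # precondition unmet: invalid plan
        state = (state - act["del"]) | act["add"]
    return goal <= state

# validate_plan({"at_door"}, {"door_open"}, ACTIONS,
#               ["pick_up_key", "open_door"])  -> True
```

Any valid plan passes this check, which is why the scoring can accept alternative plans and merely penalize length.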
🍞 Top Bread (Hook) It's easier to learn a magic trick when someone shows each move.
🥬 The Concept: Chain-of-Thought (CoT) Traces
- What it is: Optional reasoning steps alongside answers.
- How it works:
- Record intermediate reasoning from solvers or evaluators (e.g., BFS frontier for graph pathfinding).
- Attach as ex.metadata.cot for training.
- Include CoT in ~50% of examples to teach stepwise reasoning without over-reliance.
- Why it matters: Encourages procedural thinking in models.
🍞 Bottom Bread (Anchor) Arithmetic traces compute bottom-up: 3+4.5=7.5; min(8,12)=8; 7.5*8=60; 2**2=4; 60-4=56.
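A bottom-up evaluator that logs each intermediate value can produce exactly such a trace (a sketch; the suite's real trace format may differ):

```python
def eval_with_trace(node, trace):
    # Bottom-up evaluation of a small expression tree, appending each
    # intermediate result to `trace` the way a solver-derived CoT would.
    if isinstance(node, (int, float)):
        return node
    op, *args = node
    vals = [eval_with_trace(a, trace) for a in args]
    ops = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
           "*": lambda a, b: a * b, "**": lambda a, b: a ** b,
           "min": min}
    result = ops[op](*vals)
    trace.append(f"{op}{tuple(vals)} = {result}")
    return result

# (3 + 4.5) * min(8, 12) - 2**2
expr = ("-", ("*", ("+", 3, 4.5), ("min", 8, 12)), ("**", 2, 2))
trace = []
result = eval_with_trace(expr, trace)  # result == 56.0
```

The trace lists sub-results in evaluation order, ending with the final subtraction, which is the sequence a model is trained to reproduce.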
🍞 Top Bread (Hook) Think of multiple chefs cooking at once in a well-organized kitchen.
🥬 The Concept: Scalable Data Production
- What it is: Parallel generation with safeguards to keep throughput high and distributions healthy.
- How it works:
- Many single-thread workers coordinate via file locks to scale across CPUs.
- Timeouts auto-scale with difficulty; stalled processes are killed cleanly.
- A balancing key caps over-frequent labels to avoid skew.
- Why it matters: Keeps data diverse, correct, and fast to produce.
🍞 Bottom Bread (Anchor) Batches don't end up 90% "True" for set_equality; balancing keeps labels even.
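A minimal sketch of that balancing cap; the function name, threshold, and warmup are invented, and the suite's actual mechanism may differ:

```python
import random
from collections import Counter
from itertools import islice

def balanced_stream(generate, label_of, max_share=0.55, warmup=20):
    # Yield generated examples, skipping any whose label would push
    # that label's share above `max_share` of everything kept so far.
    kept = Counter()
    while True:
        ex = generate()
        lbl = label_of(ex)
        total = sum(kept.values())
        if total >= warmup and (kept[lbl] + 1) / (total + 1) > max_share:
            continue  # over-represented label: drop and regenerate
        kept[lbl] += 1
        yield ex

# A generator that emits True ~90% of the time still yields a
# roughly 55/45 stream after the cap kicks in:
rng = random.Random(0)
batch = list(islice(balanced_stream(lambda: rng.random() < 0.9,
                                    label_of=lambda x: x), 200))
```

Dropping and regenerating is cheap precisely because the data is procedural, so the cap costs throughput rather than coverage.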
Concrete data examples:
- PDDL planning: Prompt describes actions/initial state/goals; answer is a valid action sequence. Scoring accepts any valid plan (shorter preferred).
- First-order logic (logili): Natural-language premises/hypothesis mapped to FOL; solver verifies entailment/contradiction/neutral.
- CFG parsing: Given a grammar and a string, output a tree; Earley parser ensures unambiguity and correctness.
- Bayesian networks: Compute posteriors under observations (Rung 1) or interventions (Rung 2) and compare via Jensen-Shannon divergence.
- Equation systems: Solve for a variable; Sympy verifies exactness; also supports "No solution"/"Multiple solutions."
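The paper delegates exactness to Sympy; to make the "No solution"/"Multiple solutions" cases concrete, here is a dependency-free sketch using exact fractions and Cramer's rule (a stand-in, not the suite's verifier):

```python
from fractions import Fraction

def solve_2x2(a, b, c, d, e, f):
    # Solve a*x + b*y = e and c*x + d*y = f exactly via Cramer's rule,
    # reporting degenerate systems instead of a numeric answer.
    a, b, c, d, e, f = map(Fraction, (a, b, c, d, e, f))
    det = a * d - b * c
    if det == 0:
        if (a == b == 0 and e != 0) or (c == d == 0 and f != 0):
            return "no solution"  # an equation reads 0 = nonzero
        dependent = a * f == c * e and b * f == d * e
        return "multiple solutions" if dependent else "no solution"
    return ((e * d - b * f) / det, (a * f - c * e) / det)

# x + y = 5,  x - y = 1    ->  (3, 2)
# x + y = 3,  2x + 2y = 7  ->  "no solution"
```

Exact rational arithmetic is what makes the check a verifier rather than an approximation: no floating-point tolerance is needed.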
Secret sauce:
- Distributional generality: randomized domains beat fixed puzzles for transfer.
- Verified supervision: reduces label noise and unlocks RLVR.
- CoT seeding: early exposure to steps fosters algorithmic habits.
- Grammar bushiness + context sensitivity: richer, more realistic structures than plain PCFG sampling.
- One-knob difficulty: simple but powerful curricula across many domains.
Tiny bit of math intuition (with examples):
- Negative log-likelihood (NLL) measures how "surprised" the model is by the right answer: NLL = -log p(correct answer). For example, if a model assigns probability p = 1/2 to the correct answer, NLL = log 2 ≈ 0.693 (since -log(1/2) = log 2).
- Linear systems represent multiple equations at once. Example: x + y = 5 and x - y = 1; solving gives x = 3, y = 2 (check: 3 + 2 = 5 and 3 - 2 = 1).
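The NLL arithmetic above, checked in code (natural logarithm assumed):

```python
import math

def nll(p):
    # Negative log-likelihood of the correct answer: -ln(p).
    # p = 1 means zero surprise; halving p adds ln 2 of surprise.
    return -math.log(p)

print(round(nll(0.5), 3))   # 0.693 (= ln 2)
print(round(nll(0.25), 3))  # 1.386 (= ln 4)
```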
04 Experiments & Results
🍞 Top Bread (Hook) Imagine testing a new workout plan: first see if top athletes sweat with it, then check if regular players actually improve over a season.
🥬 The Concept: The Tests
- What it is: Two main evaluations: zero-shot difficulty checks on frontier models (GPT-5 family) and supervised fine-tuning where Reasoning Core data is mixed into training.
- How it works:
- Zero-shot: Prompt GPT-5 variants on many tasks at easy (0) and hard (5) levels; measure average reward (task-specific scoring with verifiers).
- Supervised fine-tuning: Train small models by mixing natural-language corpora (FineWeb, SYNTH for pre-training; Dolci for instruction-tuning) with Reasoning Core at ratios r ∈ {0, 0.1, 0.3, 0.5, 1.0}; evaluate Negative Log-Likelihood (NLL) on held-out sets and on PlatinumBench (a reasoning suite).
- Why it matters: Shows both that the tasks are genuinely hard and that the mixed data actually improves reasoning without hurting general language modeling.
🍞 Bottom Bread (Anchor) It's like confirming the drills challenge pros and also make students' game-day scores better.
The competition/baselines:
- Standard web-style pre-training only (no Reasoning Core).
- Reasoning Gym (complementary focus; many tasks but narrower distributions per task), referenced qualitatively.
- Internal ablations across mixing ratios r.
Scoreboard with context:
- Zero-shot GPT-5: Average rewards drop at higher difficulty for all assessed tasks, confirming the difficulty knob works and that even frontier models struggle, especially on planning, full FOL, CFG parsing, Bayesian inference, and graph reasoning.
- Supervised fine-tuning (small models, ~0.5B tokens baseline per run):
- Adding Reasoning Core consistently lowers PlatinumBench answer NLL versus no-RC baselines: equivalent to raising a GPA from a B to a solid A- across varied reasoning quizzes.
- Validation loss on the natural-language datasets also slightly improves or stays the same, so reasoning practice does not hurt general language skills; it can help them.
- Best tradeoff around r = 0.5 (roughly one symbolic token per two natural tokens), implying a third of tokens are symbolic after mixing.
A touch of math to read the metric:
- NLL: -log p(answer). Lower is better. For example, if a model moves from p = 0.25 to p = 0.5 on the right answer, NLL drops from log 4 ≈ 1.386 to log 2 ≈ 0.693, a sizable gain.
Surprising findings:
- Reasoning practice can slightly improve general language modeling loss, suggesting symbolic patterns (like structure and compositionality) may help next-token prediction.
- CoT traces in only ~50% of examples were enough to see benefits; full-time traces may not be necessary for small models.
- Despite breadth, procedural data is cheap to produce, so adding tokens can be net-positive even if total token count grows.
Caveats:
- Results are on small models (<100M) and modest token budgets; they are promising but not definitive for very large scales.
- No RLVR training curves are included here; the suite is RL-ready but large-scale RL was scoped to future work.
05 Discussion & Limitations
🍞 Top Bread (Hook) Even the best Swiss Army knife has tools it doesn't include, and you still need a steady hand to use it well.
🥬 The Concept: Honest Assessment
- Limitations (what this can't do yet):
- Scope: Focused on formal/symbolic domains; transfer to messy, real-world text or vision tasks is plausible but unproven.
- Scale: Experiments use small models/data; behavior at GPT-scale or multi-trillion token runs is an open question.
- RLVR: Although every task exposes verifiable rewards, no end-to-end RL results are reported here.
- Residual noise: Despite solver checks and audits, a small fraction of generated items may contain inconsistencies.
- Required resources:
- CPU-heavy generation with external solvers (containerized), parallelism across many cores is recommended; modest GPU for SFT experiments.
- Storage for billions of tokens if you replicate the released corpora.
- When not to use:
- If you need purely natural, stylistic prose data (e.g., creative writing) without formal structure.
- If your training budget forbids extra tokens and you can't swap out any web data.
- If you require domain-specific knowledge (e.g., medicine) not covered by symbolic tasks.
- Open questions:
- How to best schedule the difficulty knob over long training runs?
- What's the optimal symbolic-to-natural mixing ratio at larger scales?
- How well do these skills transfer to tool-use agents or multimodal settings?
- Which traces (algorithmic vs. proof-derived) help most, and when?
🍞 Bottom Bread (Anchor) Think of Reasoning Core as a strong gym for logic muscles; it won't teach you poetry, and we still need to learn the best workout plan for marathoners, but it reliably builds strength where it counts for reasoning.
06 Conclusion & Future Work
Three-sentence summary:
- Reasoning Core is a scalable suite that procedurally generates broad, verifiable symbolic tasks, with a smooth difficulty knob and optional step-by-step traces, for pre-training, post-training, and RL with verifiable rewards.
- By emphasizing distributional generality across foundational domains (planning, FOL, CFGs, Bayesian nets, equations) and pairing each with external solvers, it supplies endless clean practice that teaches genuine reasoning rather than template tricks.
- Mixing this data into training improved downstream reasoning while preserving, or slightly improving, general language modeling, and the tasks remain challenging even for frontier models.
Main achievement:
- Unifying broad procedural generation, rigorous solver verification, curriculum control, and trace supervision in one API and releasing it at scale (with code and billions of tokens) under a permissive license.
Future directions:
- Full-scale RLVR studies with massive rollouts and curriculum schedules; exploration of transfer to non-symbolic domains and multimodal tasks; automated difficulty pacing driven by online performance.
Why remember this:
- It's a practical, reproducible way to build neurosymbolic skills early: infinite, diverse, and verifiable puzzles that help models learn how to think step-by-step, not just talk like they can.
Practical Applications
- Pre-train small and medium LMs with a 1:2 mix of symbolic to natural tokens to boost downstream reasoning.
- Instruction-tune assistants with solver-verified logic and math problems to improve reliability on step-by-step tasks.
- Use RLVR with score_answer for tasks like planning and parsing to safely reward only provable correctness.
- Generate unlimited, license-safe datasets for classroom-style reasoning practice and benchmark creation.
- Seed chain-of-thought behavior early by including traces in about half of symbolic examples.
- Train code assistants with grammar-based code generation that respects variable scope and control flow.
- Build curriculum schedules by gradually increasing the difficulty knob as validation accuracy plateaus.
- Stress-test frontier models zero-shot with broad, randomized versions of planning, FOL, and Bayesian tasks.
- Create robust data for table question answering and format conversion with exact structured scoring.
- Rapidly prototype new symbolic tasks by extending gramforge grammars and plugging in external solvers.