Reasoning Core: A Scalable Procedural Data Generation Suite for Symbolic Pre-training and Post-Training
Key Summary
- Reasoning Core is a tool that automatically creates a huge variety of logic and math puzzles, checks every answer with real solvers, and lets you smoothly dial the difficulty up or down.
- It focuses on deep, foundational skills like planning, first-order logic, grammar parsing, causal reasoning with Bayesian networks, and solving equations.
- Each task can include a solver-made 'show your work' trace so models can learn step-by-step thinking from the start.
- The same API works for pre-training (feeding models lots of practice) and for reinforcement learning with verifiable rewards (giving models points only when provably correct).
- Unlike template puzzles, Reasoning Core samples from very broad distributions (e.g., many random PDDL domains), which prevents overfitting to a few examples.
- A new grammar framework (gramforge) adds 'bushiness' control and context sensitivity to generate rich, structurally diverse code, language, and logic data.
- Mixing Reasoning Core data into pre-training improved downstream reasoning while keeping, and sometimes slightly improving, general language modeling quality.
- Zero-shot tests show these tasks are challenging even for frontier models like GPT-5, especially at higher difficulty.
- Datasets (about 5B pre-training tokens and 2B post-training tokens) and code are released under MIT, with containerized external solvers for reliable checking.
- This suite provides a scalable, legally safe, and reproducible way to build neurosymbolic reasoning skills before and during post-training.
Why This Research Matters
Reasoning Core turns reasoning practice into something you can scale, verify, and pace: like a gym that never runs out of safe, effective workouts. It teaches models durable skills such as planning, logic, and structured parsing, which show up in everyday tools: better homework helpers, safer software assistants, and more reliable data analysis. Because every example is solver-checked and procedurally novel, you avoid noisy labels, licensing risks, and benchmark contamination. The single difficulty knob enables smooth curricula, so training matches a model's current level. Optional step-by-step traces help models learn how to think, not just what to answer. And the same interface works for both pre-training and RL with verifiable rewards, unifying today's training and tomorrow's advanced optimization.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook) Imagine learning soccer only by watching games online, without ever practicing drills like passing, dribbling, or shooting. You might pick up some moves, but your core skills would stay shaky.
🥬 The Concept: The World Before
- What it is: Before this work, language models mostly learned from web text, which is like watching games: lots of examples but not many targeted, verifiable drills for reasoning.
- How it works (before):
- Gather huge web datasets.
- Train models to predict the next word.
- Hope general reasoning "emerges" from patterns in text.
- Why it matters: This misses core practice on logic, planning, and math with guaranteed right-or-wrong checks. Models can sound smart but struggle on step-by-step reasoning or new problem shapes.
🍞 Bottom Bread (Anchor) A student who only reads math solutions (web text) might talk about algebra, but freeze when given a fresh system of equations to solve.
🍞 Top Bread (Hook) You know how puzzle books often repeat the same few patterns? After a while, you get good at those exact tricks but not at brand-new ones.
🥬 The Concept: The Problem (Lack of distributional breadth)
- What it is: Many procedural generators rely on fixed templates or classic puzzles (like only BlocksWorld for planning), so they don't cover enough variety to teach broad, transferable reasoning.
- How it works (problem):
- Pick a small set of puzzle templates.
- Fill in blanks with random numbers/words.
- Train the model.
- Why it matters: The model overfits to patterns. Change a small rule, and performance collapses.
🍞 Bottom Bread (Anchor) If you only practice solving mazes that are squares, a twisty circle maze might stump you, even if it's not harder, just different.
🍞 Top Bread (Hook) Imagine a teacher who can invent unlimited fresh practice problems, checks every answer with a calculator, and adjusts difficulty smoothly, like turning a dimmer knob.
🥬 The Concept: Procedural Generation
- What it is: Procedural generation creates new problems automatically using algorithms.
- How it works:
- Define rules for a domain (e.g., planning, logic).
- Randomly sample many different instances.
- Use an external solver to verify the correct answers.
- Why it matters: You get endless, diverse, and trustworthy practice data without hand-crafting each puzzle.
🍞 Bottom Bread (Anchor) Like a video game that builds a new level every time you play, and the game engine itself confirms that the level is beatable.
🍞 Top Bread (Hook) Think of a dimmer switch on a lamp: you don't just have off or on; you have smooth control from cozy to super bright.
🥬 The Concept: Curriculum Design with a Difficulty Knob
- What it is: A single continuous parameter that smoothly scales problem difficulty across tasks.
- How it works:
- Map the knob value to task parameters (e.g., plan length, proof depth).
- Use stochastic rounding for discrete parts (like number of variables).
- Adjust timeouts and sampling so generation stays efficient at all levels.
- Why it matters: Teachers can ramp up difficulty as the model improves, preventing boredom (too easy) or frustration (too hard).
🍞 Bottom Bread (Anchor) Start with a two-step plan like "pick up key, open door," then dial up to a dozen steps with detours and locked rooms.
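The knob mechanics described above (stochastic rounding for discrete parameters, timeouts that scale with level) can be sketched in a few lines. The function names and scaling constants here are illustrative assumptions, not the suite's actual API:

```python
import random

def stochastic_round(x, rng=random):
    # Round x up with probability equal to its fractional part, so the
    # expected value of the rounded parameter equals x exactly. This lets
    # a real-valued difficulty level drive discrete settings smoothly.
    base = int(x)
    frac = x - base
    return base + (1 if rng.random() < frac else 0)

def knob_to_params(level):
    # Map one continuous difficulty knob to several task parameters
    # (the scaling constants are made up for illustration).
    return {
        "plan_length": stochastic_round(2 + 1.5 * level),
        "num_variables": stochastic_round(1 + 0.8 * level),
        "timeout_s": 5.0 * (1 + level),  # timeouts grow with difficulty
    }
```

Averaged over many samples, `plan_length` at level 3 sits at 6.5 even though each individual instance gets an integer, which is exactly what makes the knob continuous.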
🍞 Top Bread (Hook) When you do a math worksheet, you want the answer key made by a real calculator, not a guess.
🥬 The Concept: External Solver Verification
- What it is: Each generated problem is checked by a trusted solver (like a theorem prover, planner, or algebra system).
- How it works:
- Generate a problem.
- Ask an expert tool (e.g., Vampire/E for logic, FastDownward for PDDL, Sympy for equations) for the ground-truth answer.
- Store the verified answer and a reward function to score model outputs.
- Why it matters: You get objective, unambiguous correctness, perfect for both supervised learning and RL with verifiable rewards.
🍞 Bottom Bread (Anchor) Like grading a math quiz by re-computing each result on a scientific calculator so you know exactly who's right.
🍞 Top Bread (Hook) You know how it helps to see the steps of a recipe, not just the final cake?
🥬 The Concept: Chain-of-Thought Traces
- What it is: Optional solver-derived or algorithmic step-by-step explanations included with some problems.
- How it works:
- While solving, log intermediate steps (e.g., bottom-up arithmetic evaluation).
- Store the sequence as a trace next to the final answer.
- Use these traces during training to model stepwise reasoning.
- Why it matters: Models don't just memorize answers; they learn how to think through problems.
🍞 Bottom Bread (Anchor) For (3+4.5)*min(8,12)-2**2, the trace shows: 3+4.5=7.5; min(8,12)=8; 7.5*8=60; 2**2=4; 60-4=56.
To make this concrete, Reasoning Core targets core domains that build lasting skills: planning (PDDL over many randomized domains), full first-order logic with equality, grammar parsing/generation for arbitrary CFGs, causal reasoning over random Bayesian networks, and systems of equations. Each has a built-in difficulty knob, external verification, and often includes step-by-step traces. Unlike suites that offer many narrow puzzles, Reasoning Core offers fewer, deeper generators covering very broad distributions, so improvement generalizes beyond a few fixed patterns.
Finally, the stakes: better reasoning in AI shows up in daily life, from more reliable assistants for homework (logic, equations) to safer planning for robots (PDDL), clearer code tools (grammar/regex tasks), sounder data analysis (tables, Bayesian reasoning), and models that can truly explain their steps. That's why a scalable, verifiable, curriculum-ready source of symbolic reasoning data matters.
02 Core Idea
🍞 Top Bread (Hook) Imagine teaching a student by giving them endless, freshly invented, well-checked puzzles that grow with them, so they practice the skill itself, not just a few tricks.
🥬 The Concept: The "Aha!" Moment
- What it is: The key insight is to pre-train and post-train language models on procedurally generated, solver-verified symbolic tasks drawn from broad distributions, with a single difficulty knob and optional step-by-step traces.
- How it works:
- Generate diverse instances across foundational domains (planning, logic, grammar, equations, causality).
- Verify each instance with external solvers; attach rewards and (when possible) reasoning traces.
- Mix this data into pre-training (and later RL) so models build reusable reasoning primitives.
- Why it matters: Without breadth plus verification, models overfit templates or learn noisy habits; with both, they develop durable, transferable reasoning skills.
🍞 Bottom Bread (Anchor) It's like practicing music with an app that makes new songs each day, knows the right notes, and records your fingering steps, so you learn technique, not just one tune.
Multiple analogies (three ways):
- Sports analogy: Instead of only scrimmages (web text), you drill core moves: passes, sprints, shots (symbolic tasks), with a coach whistle (verifier) and adjustable resistance (difficulty knob).
- Cooking analogy: Rather than copying a few recipes, you master techniquesāsaute, bake, seasonāby making endless variations while a smart oven (solver) checks temperature and timing.
- Video game analogy: Levels are always new (procedural), bosses are fair and test real skills (verified), and difficulty slides smoothly so you never plateau.
Before vs After:
- Before: Narrow templates, fixed puzzles, and unverified answers risk teaching shortcuts.
- After: Broad, randomized domains with rigorous checking and curriculum control teach the underlying reasoning, not the pattern.
🍞 Top Bread (Hook) You know how a toolbox lets you build many kinds of furniture, not just one chair?
🥬 The Concept: Distributional Generality
- What it is: Sampling from wide, principled distributions (e.g., random STRIPS planning domains, arbitrary first-order logic formulas) so tasks cover many shapes of the same core idea.
- How it works:
- Define domain-generating rules that can vary structure, size, and content.
- Sample across those choices to avoid repeating narrow patterns.
- Verify correctness so breadth doesn't break reliability.
- Why it matters: Skills learned this way transfer far better to new problems.
🍞 Bottom Bread (Anchor) Just as a runner trains on hills, tracks, and trails, not only treadmills, models train on many problem terrains.
🍞 Top Bread (Hook) Think of growing a tree: you don't want a single tall stick; you want many branches.
🥬 The Concept: Grammar-Based Generation (gramforge) with Bushiness
- What it is: A grammar framework that generates text/code/logic with control over depth and lateral branching ("bushiness"), plus context sensitivity for things like variable scope.
- How it works:
- Write grammars that can yield multiple synchronized outputs (e.g., English + logic form).
- Sample with a bushiness factor to expand structure sideways, not just deeper.
- Track state (like variables) to stay consistent beyond simple CFGs.
- Why it matters: You get rich, varied structures instead of long but skinny ones, improving learning.
🍞 Bottom Bread (Anchor) A generated Python function includes loops and conditionals that reference the right variables in the right places, like a real program, not a toy snippet.
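gramforge itself is not spelled out here, but the bushiness idea can be illustrated with a toy weighted-CFG sampler, where a bushiness parameter tilts rule choice toward branching expansions. The grammar, weighting scheme, and function names are all invented for illustration:

```python
import random

# Toy grammar: nonterminals map to alternative right-hand sides.
GRAMMAR = {
    "EXPR": [["NUM"], ["EXPR", "+", "EXPR"], ["(", "EXPR", ")"]],
    "NUM": [["1"], ["2"], ["3"]],
}

def sample(symbol, bushiness=0.5, depth=0, max_depth=6, rng=random):
    # Expand `symbol`; bushiness > 0 up-weights rules with more
    # nonterminals, widening the tree sideways rather than only
    # growing it deeper.
    if symbol not in GRAMMAR:
        return [symbol]  # terminal
    rules = GRAMMAR[symbol]
    if depth >= max_depth:
        rules = [r for r in rules if symbol not in r]  # cut recursion
    weights = [(1 + bushiness) ** sum(s in GRAMMAR for s in r) for r in rules]
    rule = rng.choices(rules, weights=weights, k=1)[0]
    out = []
    for s in rule:
        out.extend(sample(s, bushiness, depth + 1, max_depth, rng))
    return out
```

With a high bushiness, `" ".join(sample("EXPR", 2.0))` tends to yield nested, well-bracketed sums rather than a lone digit, which is the "many branches, not one tall stick" effect.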
Why it works (intuition):
- Curriculum: Starting easy and scaling difficulty lets models form stable building blocks before tackling complex compositions.
- Verification: External solvers remove label noise, giving clear signals (great for RLVR too).
- Breadth: Randomized domains prevent shortcut learning, nudging models to acquire algorithms (e.g., search, unification, parsing) in their weights.
- Traces: Seeing the steps helps models learn procedures, not just ends.
Building blocks:
- Generator modules for core domains (planning, FOL, CFGs, Bayesian nets, equations).
- External solver connectors with containerized tools for portability.
- A unified API: generate_example(level), answer, metadata.trace, score_answer.
- A difficulty scheduler to build curricula.
- Scalable data production: parallelism, timeouts, distribution balancing.
🍞 Bottom Bread (Anchor) Put together, it's like a school-in-a-box that can invent, grade, and pace infinite logic and math lessons tailored to the learner, every day.
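Putting the building blocks together, the unified interface might look roughly like this. The method names mirror the list above; the Example class, the stand-in equation task, and all constants are hypothetical, not the suite's real implementation:

```python
import random
from dataclasses import dataclass, field

@dataclass
class Example:
    prompt: str
    answer: str
    metadata: dict = field(default_factory=dict)

class LinearEquationTask:
    """Stand-in for one generator behind the unified API."""

    def generate_example(self, level=0, rng=random):
        # Difficulty widens the coefficient and solution ranges.
        a = rng.randint(1, 2 + level)
        x = rng.randint(-5 - level, 5 + level)
        b = rng.randint(-10, 10)
        c = a * x + b
        trace = [f"{a}*x = {c} - ({b}) = {c - b}",
                 f"x = {c - b}/{a} = {x}"]
        return Example(prompt=f"Solve for x: {a}*x + {b} = {c}",
                       answer=str(x),
                       metadata={"trace": trace})

    def score_answer(self, example, proposed):
        # Verifiable reward: 1.0 iff provably correct, else 0.0.
        return 1.0 if proposed.strip() == example.answer else 0.0

def get_task(name):
    return {"linear_equation": LinearEquationTask}[name]()
```

An RLVR loop would then call score_answer on the model's output, and a curriculum scheduler would simply raise level over time.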
03 Methodology
High-level pipeline: Input → [Task & difficulty] → [Procedural generation] → [External verification + optional trace] → [Training example + reward]
Step-by-step (with the Sandwich pattern for key steps):
🍞 Top Bread (Hook) Picture ordering from a menu: you pick the dish and how spicy you want it.
🥬 The Concept: Task & Difficulty Selection
- What it is: Choose a domain (e.g., planning, logic) and set a single difficulty knob.
- How it works:
- Call get_task(name) to select a generator.
- Pass level=k (real-valued) to control parameters (proof depth, plan length, variables).
- Internally, discrete parameters use stochastic rounding and timeouts scale with level.
- Why it matters: It standardizes curricula across very different domains.
🍞 Bottom Bread (Anchor) Example: t = get_task('planning'); ex = t.generate_example(level=3) yields a longer multi-step plan problem than level=0.
🍞 Top Bread (Hook) Imagine a factory that can stamp out endless shapes following blueprints.
🥬 The Concept: Procedural Instance Generation
- What it is: Algorithms sample fresh problems from broad distributions defined per task.
- How it works:
- Randomize structure (e.g., PDDL domains: actions, preconditions, effects).
- Randomize content (objects, names, constants) and representation (e.g., graph formats).
- Use grammar-based generators (gramforge) with bushiness and context sensitivity for language/code.
- Why it matters: Prevents overfitting to specific templates and ensures structural diversity.
🍞 Bottom Bread (Anchor) Two planning problems might have completely different action vocabularies and goal shapes, yet test the same planning skill.
🍞 Top Bread (Hook) When you bake, a thermometer keeps you honest about the cake's center.
🥬 The Concept: External Solver Verification & Reward
- What it is: Each instance is checked by a specialized tool, and score_answer() provides a verifiable reward.
- How it works:
- For logic: call Vampire/E to prove entailment/contradiction.
- For planning: use FastDownward to validate plans (accept any valid plan, with length penalty).
- For equations: use Sympy for exact solutions/simplifications.
- Why it matters: Reliable labels and reward signals reduce noise and enable RLVR.
🍞 Bottom Bread (Anchor) Give a proposed plan; the planner replays it to confirm it really reaches the goal, no guesswork.
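Plan replay itself is conceptually tiny. A minimal STRIPS-style sketch (hand-rolled for illustration, not FastDownward, with a made-up key-and-door domain):

```python
# Each action: preconditions, add-effects, delete-effects over a set of facts.
ACTIONS = {
    "pick_up_key": {"pre": {"at_door"}, "add": {"has_key"}, "del": set()},
    "open_door": {"pre": {"at_door", "has_key"}, "add": {"door_open"}, "del": set()},
}

def validate_plan(init, goal, actions, plan):
    # Replay the plan from the initial state, checking each action's
    # preconditions, then test whether the goal facts hold at the end.
    state = set(init)
    for name in plan:
        act = actions[name]
        if not act["pre"] <= state:
            return False  # precondition unmet: invalid plan
        state = (state - act["del"]) | act["add"]
    return goal <= state

# validate_plan({"at_door"}, {"door_open"}, ACTIONS,
#               ["pick_up_key", "open_door"])  -> True
```

Any valid plan passes this check, which is why the scoring can accept alternative plans and merely penalize length.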
🍞 Top Bread (Hook) It's easier to learn a magic trick when someone shows each move.
🥬 The Concept: Chain-of-Thought (CoT) Traces
- What it is: Optional reasoning steps alongside answers.
- How it works:
- Record intermediate reasoning from solvers or evaluators (e.g., BFS frontier for graph pathfinding).
- Attach as ex.metadata.cot for training.
- Include CoT in ~50% of examples to teach stepwise reasoning without over-reliance.
- Why it matters: Encourages procedural thinking in models.
🍞 Bottom Bread (Anchor) Arithmetic traces compute bottom-up: 3+4.5=7.5; min(8,12)=8; 7.5*8=60; 2**2=4; 60-4=56.
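A bottom-up evaluator that logs each intermediate value can produce exactly such a trace (a sketch; the suite's real trace format may differ):

```python
def eval_with_trace(node, trace):
    # Bottom-up evaluation of a small expression tree, appending each
    # intermediate result to `trace` the way a solver-derived CoT would.
    if isinstance(node, (int, float)):
        return node
    op, *args = node
    vals = [eval_with_trace(a, trace) for a in args]
    ops = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
           "*": lambda a, b: a * b, "**": lambda a, b: a ** b,
           "min": min}
    result = ops[op](*vals)
    trace.append(f"{op}{tuple(vals)} = {result}")
    return result

# (3 + 4.5) * min(8, 12) - 2**2
expr = ("-", ("*", ("+", 3, 4.5), ("min", 8, 12)), ("**", 2, 2))
trace = []
result = eval_with_trace(expr, trace)  # result == 56.0
```

The trace lists sub-results in evaluation order, ending with the final subtraction, which is the sequence a model is trained to reproduce.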
🍞 Top Bread (Hook) Think of multiple chefs cooking at once in a well-organized kitchen.
🥬 The Concept: Scalable Data Production
- What it is: Parallel generation with safeguards to keep throughput high and distributions healthy.
- How it works:
- Many single-thread workers coordinate via file locks to scale across CPUs.
- Timeouts auto-scale with difficulty; stalled processes are killed cleanly.
- A balancing key caps over-frequent labels to avoid skew.
- Why it matters: Keeps data diverse, correct, and fast to produce.
🍞 Bottom Bread (Anchor) Batches don't end up 90% "True" for set_equality; balancing keeps labels even.
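A minimal sketch of that balancing cap; the function name, threshold, and warmup are invented, and the suite's actual mechanism may differ:

```python
import random
from collections import Counter
from itertools import islice

def balanced_stream(generate, label_of, max_share=0.55, warmup=20):
    # Yield generated examples, skipping any whose label would push
    # that label's share above `max_share` of everything kept so far.
    kept = Counter()
    while True:
        ex = generate()
        lbl = label_of(ex)
        total = sum(kept.values())
        if total >= warmup and (kept[lbl] + 1) / (total + 1) > max_share:
            continue  # over-represented label: drop and regenerate
        kept[lbl] += 1
        yield ex

# A generator that emits True ~90% of the time still yields a
# roughly 55/45 stream after the cap kicks in:
rng = random.Random(0)
batch = list(islice(balanced_stream(lambda: rng.random() < 0.9,
                                    label_of=lambda x: x), 200))
```

Dropping and regenerating is cheap precisely because the data is procedural, so the cap costs throughput rather than coverage.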
Concrete data examples:
- PDDL planning: Prompt describes actions/initial state/goals; answer is a valid action sequence. Scoring accepts any valid plan (shorter preferred).
- First-order logic (logili): Natural-language premises/hypothesis mapped to FOL; solver verifies entailment/contradiction/neutral.
- CFG parsing: Given a grammar and a string, output a tree; Earley parser ensures unambiguity and correctness.
- Bayesian networks: Compute posteriors under observations (Rung 1) or interventions (Rung 2) and compare via Jensen-Shannon divergence.
- Equation systems: Solve for a variable; Sympy verifies exactness; also supports "No solution"/"Multiple solutions."
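The paper delegates exactness to Sympy; to make the "No solution"/"Multiple solutions" cases concrete, here is a dependency-free sketch using exact fractions and Cramer's rule (a stand-in, not the suite's verifier):

```python
from fractions import Fraction

def solve_2x2(a, b, c, d, e, f):
    # Solve a*x + b*y = e and c*x + d*y = f exactly via Cramer's rule,
    # reporting degenerate systems instead of a numeric answer.
    a, b, c, d, e, f = map(Fraction, (a, b, c, d, e, f))
    det = a * d - b * c
    if det == 0:
        if (a == b == 0 and e != 0) or (c == d == 0 and f != 0):
            return "no solution"  # an equation reads 0 = nonzero
        dependent = a * f == c * e and b * f == d * e
        return "multiple solutions" if dependent else "no solution"
    return ((e * d - b * f) / det, (a * f - c * e) / det)

# x + y = 5,  x - y = 1    ->  (3, 2)
# x + y = 3,  2x + 2y = 7  ->  "no solution"
```

Exact rational arithmetic is what makes the check a verifier rather than an approximation: no floating-point tolerance is needed.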
Secret sauce:
- Distributional generality: randomized domains beat fixed puzzles for transfer.
- Verified supervision: reduces label noise and unlocks RLVR.
- CoT seeding: early exposure to steps fosters algorithmic habits.
- Grammar bushiness + context sensitivity: richer, more realistic structures than plain PCFG sampling.
- One-knob difficulty: simple but powerful curricula across many domains.
Tiny bit of math intuition (with examples):
- Negative log-likelihood (NLL) measures how "surprised" the model is by the right answer: NLL = -log p(correct answer). For example, if a model assigns probability p = 1/2 to the correct answer, NLL = log 2 ≈ 0.693 (since -log(1/2) = log 2).
- Linear systems represent multiple equations at once. Example: x + y = 5 and x - y = 1; solving gives x = 3, y = 2 (check: 3 + 2 = 5 and 3 - 2 = 1).
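The NLL arithmetic above, checked in code (natural logarithm assumed):

```python
import math

def nll(p):
    # Negative log-likelihood of the correct answer: -ln(p).
    # p = 1 means zero surprise; halving p adds ln 2 of surprise.
    return -math.log(p)

print(round(nll(0.5), 3))   # 0.693 (= ln 2)
print(round(nll(0.25), 3))  # 1.386 (= ln 4)
```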
04 Experiments & Results
🍞 Top Bread (Hook) Imagine testing a new workout plan: first see if top athletes sweat with it, then check if regular players actually improve over a season.
🥬 The Concept: The Tests
- What it is: Two main evaluations: zero-shot difficulty checks on frontier models (GPT-5 family) and supervised fine-tuning where Reasoning Core data is mixed into training.
- How it works:
- Zero-shot: Prompt GPT-5 variants on many tasks at easy (0) and hard (5) levels; measure average reward (task-specific scoring with verifiers).
- Supervised fine-tuning: Train small models by mixing natural-language corpora (FineWeb, SYNTH for pre-training; Dolci for instruction-tuning) with Reasoning Core at ratios r ∈ {0, 0.1, 0.3, 0.5, 1.0}; evaluate Negative Log-Likelihood (NLL) on held-out sets and on PlatinumBench (a reasoning suite).
- Why it matters: Shows both that the tasks are genuinely hard and that the mixed data actually improves reasoning without hurting general language modeling.
🍞 Bottom Bread (Anchor) It's like confirming the drills challenge pros and also make students' game-day scores better.
The competition/baselines:
- Standard web-style pre-training only (no Reasoning Core).
- Reasoning Gym (complementary focus; many tasks but narrower distributions per task), referenced qualitatively.
- Internal ablations across mixing ratios r.
Scoreboard with context:
- Zero-shot GPT-5: Average rewards drop at higher difficulty for all assessed tasks, confirming the difficulty knob works and that even frontier models struggle, especially on planning, full FOL, CFG parsing, Bayesian inference, and graph reasoning.
- Supervised fine-tuning (small models, ~0.5B tokens baseline per run):
- Adding Reasoning Core consistently lowers PlatinumBench answer NLL versus no-RC baselines: equivalent to raising a GPA from a B to a solid A- across varied reasoning quizzes.
- Validation loss on the natural-language datasets also slightly improves or stays the same, so reasoning practice does not hurt general language skills; it can help them.
- Best tradeoff around r = 0.5 (roughly one symbolic token per two natural tokens), implying a third of tokens are symbolic after mixing.
A touch of math to read the metric:
- NLL: -log p(answer). Lower is better. For example, if a model moves from p = 0.25 to p = 0.5 on the right answer, NLL drops from log 4 ≈ 1.386 to log 2 ≈ 0.693, a sizable gain.
Surprising findings:
- Reasoning practice can slightly improve general language modeling loss, suggesting symbolic patterns (like structure and compositionality) may help next-token prediction.
- CoT traces in only ~50% of examples were enough to see benefits; full-time traces may not be necessary for small models.
- Despite breadth, procedural data is cheap to produce, so adding tokens can be net-positive even if total token count grows.
Caveats:
- Results are on small models (<100M) and modest token budgets; they are promising but not definitive for very large scales.
- No RLVR training curves are included here; the suite is RL-ready but large-scale RL was scoped to future work.
05 Discussion & Limitations
🍞 Top Bread (Hook) Even the best Swiss Army knife has tools it doesn't include, and you still need a steady hand to use it well.
🥬 The Concept: Honest Assessment
- Limitations (what this can't do yet):
- Scope: Focused on formal/symbolic domains; transfer to messy, real-world text or vision tasks is plausible but unproven.
- Scale: Experiments use small models/data; behavior at GPT-scale or multi-trillion token runs is an open question.
- RLVR: Although every task exposes verifiable rewards, no end-to-end RL results are reported here.
- Residual noise: Despite solver checks and audits, a small fraction of generated items may contain inconsistencies.
- Required resources:
- CPU-heavy generation with external solvers (containerized), parallelism across many cores is recommended; modest GPU for SFT experiments.
- Storage for billions of tokens if you replicate the released corpora.
- When not to use:
- If you need purely natural, stylistic prose data (e.g., creative writing) without formal structure.
- If your training budget forbids extra tokens and you can't swap out any web data.
- If you require domain-specific knowledge (e.g., medicine) not covered by symbolic tasks.
- Open questions:
- How to best schedule the difficulty knob over long training runs?
- What's the optimal symbolic-to-natural mixing ratio at larger scales?
- How well do these skills transfer to tool-use agents or multimodal settings?
- Which traces (algorithmic vs. proof-derived) help most, and when?
🍞 Bottom Bread (Anchor) Think of Reasoning Core as a strong gym for logic muscles; it won't teach you poetry, and we still need to learn the best workout plan for marathoners, but it reliably builds strength where it counts for reasoning.
06 Conclusion & Future Work
Three-sentence summary:
- Reasoning Core is a scalable suite that procedurally generates broad, verifiable symbolic tasks, with a smooth difficulty knob and optional step-by-step traces, for pre-training, post-training, and RL with verifiable rewards.
- By emphasizing distributional generality across foundational domains (planning, FOL, CFGs, Bayesian nets, equations) and pairing each with external solvers, it supplies endless clean practice that teaches genuine reasoning rather than template tricks.
- Mixing this data into training improved downstream reasoning while preserving, or slightly improving, general language modeling, and the tasks remain challenging even for frontier models.
Main achievement:
- Unifying broad procedural generation, rigorous solver verification, curriculum control, and trace supervision in one API and releasing it at scale (with code and billions of tokens) under a permissive license.
Future directions:
- Full-scale RLVR studies with massive rollouts and curriculum schedules; exploration of transfer to non-symbolic domains and multimodal tasks; automated difficulty pacing driven by online performance.
Why remember this:
- It's a practical, reproducible way to build neurosymbolic skills early: infinite, diverse, and verifiable puzzles that help models learn how to think step-by-step, not just talk like they can.
Practical Applications
- Pre-train small and medium LMs with a 1:2 mix of symbolic to natural tokens to boost downstream reasoning.
- Instruction-tune assistants with solver-verified logic and math problems to improve reliability on step-by-step tasks.
- Use RLVR with score_answer for tasks like planning and parsing to safely reward only provable correctness.
- Generate unlimited, license-safe datasets for classroom-style reasoning practice and benchmark creation.
- Seed chain-of-thought behavior early by including traces in about half of symbolic examples.
- Train code assistants with grammar-based code generation that respects variable scope and control flow.
- Build curriculum schedules by gradually increasing the difficulty knob as validation accuracy plateaus.
- Stress-test frontier models zero-shot with broad, randomized versions of planning, FOL, and Bayesian tasks.
- Create robust data for table question answering and format conversion with exact structured scoring.
- Rapidly prototype new symbolic tasks by extending gramforge grammars and plugging in external solvers.