CHIMERA: Compact Synthetic Data for Generalizable LLM Reasoning
Key Summary
- CHIMERA is a small (about 9,000 examples) but very carefully built synthetic dataset that teaches AI to solve hard problems step by step.
- It fixes three big data problems: getting started without long examples (cold start), covering many subjects beyond math, and avoiding slow, expensive human grading.
- The data is made in three stages: expand subjects into many topics, write clear problems with checkable answers, and synthesize long, detailed solutions.
- Two different strong models double-check every problem and answer, so only high-confidence items stay in the dataset.
- A 4-billion-parameter model fine-tuned on CHIMERA scores close to, or even better than, much larger models on tough benchmarks like GPQA-Diamond, AIME, HMMT, and Humanity's Last Exam.
- Even supervised fine-tuning alone on CHIMERA (without RL) gives most of the gains; RL adds a bit more.
- Compared to other synthetic datasets, CHIMERA is harder and longer, which gives stronger learning signals.
- Multiple-choice training hurt performance, but CHIMERA's free-form, step-by-step problems improved real reasoning.
- The pipeline is fully automated and scalable, so new subjects and topics can be added without human annotation.
- This shows that quality, coverage, and verified long reasoning matter more than just having a giant pile of data.
Why This Research Matters
Better reasoning means safer, smarter AI helpers in school, work, and science. CHIMERA shows that with the right kind of data (long, verified, and broadly covered), even small models can think through hard problems reliably. This lowers the cost for labs, startups, and schools to build strong reasoning systems without huge annotation teams. It also encourages transparent thinking because solutions include the full chain of steps, making it easier to audit and improve. The automated pipeline lets communities expand to new subjects quickly as needs change. Ultimately, this shifts AI from guessers to genuine problem-solvers across many fields.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're learning to solve tricky puzzles. If your workbook only shows the final answer and not the steps, it's really hard to learn how to think it through.
The Concept: Data-Centric Challenges are the real-world roadblocks that stop AI from getting the right practice data for learning to reason. How it works:
- Cold start: models need long step-by-step examples to learn a reasoning style, but such seeds are scarce.
- Narrow coverage: most open datasets are math-heavy while real life spans physics, chemistry, biology, CS, and more.
- Annotation bottleneck: expert problems are so hard that reliable human solutions and explanations are slow and costly. Why it matters: Without solving these, even great training tricks (like RL or special decoders) won't help much, because the model never sees the right kind of practice. Anchor: It's like trying to learn chess only from end positions. You need move-by-move tutorials across many openings and game types.
Hook: You know how a good teacher doesn't just say the answer, but shows every step on the board?
The Concept: Chain-of-Thought (CoT) Trajectories are long, explicit step-by-step solutions that show how to think, not just what to think. How it works:
- The model reads a hard problem.
- It writes a multi-step plan, calculations, and checks.
- It arrives at a final, verifiable answer. Why it matters: Without CoT, models learn to guess or use shortcuts; with CoT, they learn robust methods they can reuse on new problems. Anchor: When someone shows you each algebra step, you can do a new problem by following the same pattern.
Hook: Imagine a library that only has math books. Helpful... until you face a biology quiz.
The Concept: Domain Coverage means including many subjects and subtopics so the model practices reasoning in many ways. How it works:
- Start from a few big subjects (like math, physics, CS).
- Expand into a huge topic tree (over 1,000 topics).
- Make problems that touch different skills and styles of thinking. Why it matters: Without broad coverage, models overfit to one style (e.g., math tricks) and stumble on other sciences or mixed problems. Anchor: If you only train to sprint, you'll struggle in a marathon; variety builds general strength.
Hook: Picture two referees watching the same game to be sure the call is fair.
The Concept: Automated Validation Pipeline is a robot referee team that checks whether each synthetic problem is clear and its answer correct, without humans in the loop. How it works:
- Draft problem + answer.
- Two independent models verify: Is the question well-posed? Is the answer right?
- Another strong model writes a long solution; a checker compares its final answer to the official one. Why it matters: Without auto-checking, errors pile up; noisy data teaches bad habits. Anchor: It's like spell-check and fact-check for every math and science question before it enters the workbook.
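The steps above can be sketched in a few lines. Note that `verify_a` and `verify_b` here are toy rule-based stand-ins for calls to two independent judge models; the names and checks are purely illustrative assumptions, not the paper's actual verifiers.

```python
# Minimal sketch of the dual-verifier filter: keep an item only if BOTH
# (hypothetical) independent judges approve it; reject any disagreement.

def verify_a(item):
    # Hypothetical judge 1: requires a well-posed question and a non-empty answer.
    return item["problem"].endswith("?") and bool(item["answer"])

def verify_b(item):
    # Hypothetical judge 2: requires a concise, checkable answer.
    return len(item["answer"].split()) <= 5

def cross_validate(items):
    """Keep only items that both verifiers approve."""
    return [it for it in items if verify_a(it) and verify_b(it)]

items = [
    {"problem": "What is gcd(12, 8)?", "answer": "4"},
    {"problem": "Explain entropy", "answer": "a long and rambling non-answer here"},
]
kept = cross_validate(items)  # only the first item survives
```

In the real pipeline each verifier would be an independent strong model; requiring agreement trades recall for precision, which is the point of the filter.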
Hook: Before calculators, doing big arithmetic by hand was slow. Before CHIMERA, getting lots of expert reasoning data was slow too.
The Concept: Synthetic Data Generation means using strong models to create new, realistic training examples at scale. How it works:
- Expand subjects into many topics.
- Generate self-contained, verifiable problems with answers.
- Produce long step-by-step solutions and keep the ones that check out. Why it matters: Without synthetic data, we can't get enough high-quality reasoning practice for modern models. Anchor: It's like an automatic worksheet factory that prints only clean, correct, step-by-step problems across many subjects.
The world before: LLMs could already impress us with reasoning, mostly thanks to special post-training (like supervised fine-tuning and reinforcement learning) on curated examples. But progress hit a wall because the right kind of data (hard, long, clean, and broad) wasn't available at scale. Many open datasets were short, math-centric, or multiple-choice, which encouraged guessing and elimination rather than genuine multi-step reasoning. And asking humans to write frontier-level solutions with full explanations was too slow and costly.
The problem: Three connected pains made it hard to train smaller open models that could generalize: cold start (few long CoTs to learn the style), limited domain coverage (reasoning doesn't transfer well), and annotation bottlenecks (expert time is expensive).
Failed attempts: 1) Using small explanations or just final answers taught models too little process. 2) Scaling up multiple-choice questions made models better at test-taking tricks, not reasoning. 3) Naive synthetic data (without strict validation) leaked errors and inconsistencies that confused training.
The gap: A compact, difficult, multi-subject dataset with long CoTs and fully automated quality control.
Real stakes: Better reasoning powers safer assistants, stronger tutoring, scientific help, and decision support. Crucially, it puts strong reasoning within reach of smaller teams that can't hire large annotation crews.
02 Core Idea
Hook: Think of a tiny box of tools that's so well chosen you can fix almost anything.
The Concept: The Aha! is that a small, carefully crafted set of long, verified, and broadly covered synthetic problems can train a small model to reason like a much bigger one. How it works:
- Build a broad topic tree across 8 disciplines and 1,000+ topics.
- Generate self-contained, checkable problems with concise answers.
- Add long, step-by-step solutions from a strong reasoner; verify them automatically. Why it matters: Quality, coverage, and verified long steps beat sheer size. This makes strong reasoning affordable and reproducible. Anchor: A 4B model trained on CHIMERA rivals or approaches giants on GPQA-Diamond, AIME, HMMT, and HLE.
Three analogies:
- Training wheels: Long CoTs are like steady hands on the bike. With them, the small model learns balance (reasoning steps) it can reuse anywhere.
- Recipe book: Each problem is a detailed recipe with ingredients (facts), steps (logic), and a taste test (verification). Following many recipes builds a great chef.
- Sports practice: Instead of only taking penalty shots (multiple choice), the team runs full drills (free-form reasoning), learns plays (strategies), and reviews film (validation), so they perform in any game.
Before vs After:
- Before: Big, mixed-quality datasets or multiple-choice training led to shortcutting; small models lagged far behind huge ones.
- After: With CHIMERA's long, verified, multi-domain CoTs, a 4B model jumps to near-large-model performance on elite tests, and benefits more from sampling (pass@k) at inference.
Hook: You know how seeing a full worked solution makes the logic click?
The Concept: Why It Works is about teaching the process, not just the answer. How it works:
- Long CoTs give reusable patterns (planning, checking, abstraction).
- Broad subjects prevent overfitting to one trick.
- Dual verifiers and answer-checking control noise.
- Hard problems avoid saturation, so the model keeps learning. Why it matters: Without these four, models plateau, overfit, or learn to guess. Anchor: On GPQA-Diamond, CHIMERA-trained models improve more as you sample more solutions (pass@k), showing deeper coverage of valid solution paths.
Building blocks:
- Subject Expansion: grow a hierarchical taxonomy to ensure structured breadth.
- Problem Generation: make self-contained, verifiable, PhD-level questions with unique answers.
- Solution Synthesis: produce long reasoning trajectories and label correctness.
- Automated Cross-Validation: two independent judges vet problems/answers; a checker confirms solution consistency.
- Training: use SFT on correct CoTs; use RL with LLM-based rewards on harder items.
Anchor: This combo lets a compact 9K dataset act like a precision training kit rather than a random problem pile.
03 Methodology
At a high level: Seed subjects → Topic expansion → Problem + answer drafting → Dual-model validity check → Long solution synthesis → Answer correctness check → Curate dataset → SFT then RL → Evaluate.
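The staged flow above can be written as a runnable skeleton. Every stage function here is a deterministic stub standing in for an LLM call; the names (`expand_topics`, `draft_problem`, and so on) and the toy behaviors are illustrative assumptions, not the paper's actual prompts or models.

```python
# Skeleton of the end-to-end pipeline: expand topics, draft problems,
# validate, synthesize solutions, and store labeled records.
from dataclasses import dataclass

@dataclass
class Record:
    subject: str
    topic: str
    problem: str
    answer: str
    reasoning: str
    correct: bool

def expand_topics(subject):            # subject -> fine-grained topics (stub)
    return [f"{subject}/topic-{i}" for i in range(3)]

def draft_problem(topic):              # topic -> (problem, answer key) (stub)
    return (f"Toy problem about {topic}?", "42")

def validate(problem, answer):         # dual-model validity check (stub)
    return problem.endswith("?") and answer != ""

def synthesize_solution(problem):      # long CoT + extracted final answer (stub)
    return ("step 1 ... step n", "42")

def run_pipeline(subjects):
    dataset = []
    for subject in subjects:
        for topic in expand_topics(subject):
            problem, key = draft_problem(topic)
            if not validate(problem, key):
                continue               # reject ill-posed drafts
            reasoning, final = synthesize_solution(problem)
            dataset.append(Record(subject, topic, problem, key,
                                  reasoning, final == key))
    return dataset

data = run_pipeline(["physics"])
```

Because the stages are decoupled, swapping in real model calls (or adding a subject) only touches one function, which mirrors the extensibility claim made later in the section.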
Hook: Imagine planting a few trees and ending up with a whole forest of trails to explore.
The Concept: Subject Expansion builds a hierarchical map of topics from a few big subjects. How it works:
- Start with broad subjects (math, physics, CS, chemistry, biology, literature, history, linguistics).
- Use a strong model to list many fine-grained topics, sample more for math, and deduplicate.
- Save a clean taxonomy covering over 1,000 topics. Why it matters: Without a map, you'll wander and miss key areas; with it, coverage is systematic and adjustable. Anchor: From "Physics," you branch to "Topological Quantum Field Theory," where a lens space partition function problem can live.
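The "list, sample, deduplicate" step can be sketched as follows. `propose_topics` stands in for the strong model's topic listing, and case-insensitive deduplication is an illustrative choice rather than the paper's exact rule.

```python
# Sketch of topic-tree expansion with deduplication.

def propose_topics(subject):
    # Hypothetical model output, deliberately containing near-duplicates.
    return {
        "Physics": ["Topological Quantum Field Theory",
                    "topological quantum field theory",
                    "Statistical Mechanics"],
        "Math": ["Sieve Theory", "Sieve Theory", "Algebraic Topology"],
    }[subject]

def build_taxonomy(subjects):
    taxonomy = {}
    for subject in subjects:
        seen, topics = set(), []
        for t in propose_topics(subject):
            key = t.lower()
            if key not in seen:        # drop duplicates, keep first spelling
                seen.add(key)
                topics.append(t)
        taxonomy[subject] = topics
    return taxonomy

tax = build_taxonomy(["Physics", "Math"])
```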
Hook: A good riddle tells you everything you need, but no extra fluff.
The Concept: Problem Generation creates self-contained, verifiable, hard problems with unique answers. How it works:
- For each topic, draft a PhD-level problem and concise answer.
- Enforce clarity and verifiability via prompts.
- Use two independent models to validate well-posedness and correctness; keep only those both approve. Why it matters: Without clarity and unique answers, training becomes noisy and confusing. Anchor: A math sieve problem specifies all variables and asymptotics, ending with a compact, checkable expression.
Hook: Watching a full magic trick exposes the method behind the wow.
The Concept: Solution Synthesis adds the long Chain-of-Thought trajectory for each validated problem. How it works:
- A state-of-the-art reasoning model writes detailed steps.
- The final answer is compared to the official key; mark correct/incorrect.
- Correct CoTs feed supervised fine-tuning; unanswered or incorrectly solved items still help RL. Why it matters: Without long steps, models memorize answers instead of learning methods. Anchor: The TQFT example walks through homomorphism counts and lands exactly on gcd(n,p)/n.
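The correctness check and routing step might look like this. The "Final answer:" marker and the whitespace-stripping normalization are assumptions made for illustration; the real pipeline's extraction format is not specified in the text.

```python
# Sketch of the correctness check: extract the final answer from a long
# trajectory, normalize, compare against the official key, and route the
# item to the SFT pool (correct) or the RL pool (incorrect/unsolved).
import re

def extract_final(trajectory):
    m = re.search(r"Final answer:\s*(.+)", trajectory)
    return m.group(1).strip() if m else None

def normalize(s):
    return s.replace(" ", "").lower() if s else None

def label_and_route(items):
    sft_pool, rl_pool = [], []
    for traj, key in items:
        if normalize(extract_final(traj)) == normalize(key):
            sft_pool.append((traj, key))   # verified-correct CoT -> SFT
        else:
            rl_pool.append((traj, key))    # unsolved/incorrect -> RL
    return sft_pool, rl_pool

sft, rl = label_and_route([
    ("step 1 ... Final answer: gcd(n, p)/n", "gcd(n,p)/n"),
    ("step 1 ... Final answer: n/2", "gcd(n,p)/n"),
])
```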
Hook: Two referees reduce bad calls.
The Concept: Cross-Model Verification keeps the dataset clean by requiring agreement. How it works:
- Verifier A and Verifier B check problem quality and answer correctness.
- A correctness checker aligns synthesized solutionās final answer with the official one.
- Reject disagreements or ambiguities. Why it matters: Without agreement, subtle model biases slip through and pollute training. Anchor: If one model is lenient but the other flags ambiguity, that item doesn't make the cut.
End-to-end recipe with example data:
- Input: Subject = Physics → Topic = Topological Quantum Field Theory.
- Generate problem: Compute Z(L(p,q)) in a Dijkgraaf-Witten TQFT for G = Z_n.
- Validate: Two models confirm the problem is clear, answer is unique and correct.
- Synthesize solution: A long derivation uses group homomorphisms and properties of lens spaces.
- Correctness check: Extract final answer, compare to key gcd(n,p)/n → mark True.
- Store: (subject, topic, problem, answer, reasoning, correctness).
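The worked example above can be spot-checked numerically: for an untwisted Dijkgraaf-Witten theory with finite group G, Z(M) = |Hom(π1(M), G)| / |G|, and since π1(L(p,q)) = Z_p, the count |Hom(Z_p, Z_n)| equals gcd(n, p). (Treat this as a sanity check under those standard assumptions, not a reproduction of the dataset's derivation.)

```python
# Numerical spot-check of the lens-space answer: Z(L(p, q)) = gcd(n, p)/n
# for G = Z_n. A homomorphism Z_p -> Z_n is fixed by the image x of the
# generator, which must satisfy p*x = 0 (mod n); brute-force counting
# should match the closed form gcd(n, p).
from math import gcd

def dw_partition_function(n, p):
    homs = sum(1 for x in range(n) if (p * x) % n == 0)
    return homs, homs / n

for n, p in [(6, 4), (5, 7), (8, 8)]:
    homs, z = dw_partition_function(n, p)
    assert homs == gcd(n, p)   # brute force agrees with the closed form
    assert z == gcd(n, p) / n
```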
Training setup:
- SFT: Train on verified-correct CoTs (batch 256, LR 1e-5).
- RL: CISPO for one epoch (batch 256, LR 1e-6, 8 rollouts), mixing SFT data with previously unsolved but now solvable problems. Rewards from an LLM judge score the rollouts.
- Decoding for eval: long context, temperature 0.6, top-p 0.95, top-k 20; multiple samples per problem to compute pass@k.
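Collected as a config sketch, the training and decoding settings read as follows. The numeric values mirror the text; the field names are illustrative and not tied to any particular training framework.

```python
# Hyperparameters from the training setup above, as plain config dicts.
sft_config = {"batch_size": 256, "learning_rate": 1e-5}

rl_config = {
    "algorithm": "CISPO",
    "epochs": 1,
    "batch_size": 256,
    "learning_rate": 1e-6,
    "rollouts_per_prompt": 8,   # rollouts scored by an LLM-judge reward
}

decode_config = {
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
    # multiple samples per problem are drawn to estimate pass@k
}
```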
Secret sauce:
- Long CoTs (teach process),
- Structured breadth (hierarchical taxonomy),
- Dual verifiers (quality control),
- Decoupled stages (easy to extend/modify),
- Use of both correct CoTs (SFT) and challenging leftovers (RL) to cover skill gaps.
04 Experiments & Results
Hook: If you study with better problems, you don't just ace one test; you do better on many.
The Concept: The Test Suite checks general reasoning with hard, diverse benchmarks. How it works:
- Scientific depth: GPQA-Diamond (graduate-level, retrieval-resistant).
- Competition math: AIME 2024/25/26; HMMT Feb/Nov 2025.
- Broad, tough questions: Humanity's Last Exam (text-only subset).
- Measure with pass@1 and pass@k by sampling multiple solutions per problem. Why it matters: Without rigorous, diverse tests, we can't trust improvements to generalize. Anchor: It's like testing a runner on sprints, middle distance, and hills to see real fitness.
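For reference, pass@k is usually estimated from n samples of which c are correct via the standard unbiased estimator 1 - C(n-c, k)/C(n, k). The text does not say which estimator CHIMERA's evaluation uses, so treat this as one common choice rather than the paper's method.

```python
# Unbiased pass@k estimator: the probability that at least one of k
# solutions drawn (without replacement) from n samples is correct,
# given that c of the n samples are correct.
from math import comb

def pass_at_k(n, c, k):
    if n - c < k:          # too few incorrect samples to fill k slots
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

p1 = pass_at_k(8, 2, 1)    # with 2 of 8 samples correct, pass@1 = 0.25
```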
Competition settings:
- Base: Qwen3-4B-Thinking-2507 without extra training.
- OpenScience fine-tuned model.
- CHIMERA fine-tuned model.
- Context from larger models (e.g., DeepSeek-R1, Qwen3-235B) shows how close a 4B model can get.
Scoreboard with context:
- CHIMERA vs Base: On GPQA-Diamond, pass@1 rises about +4.3 points; on AIME24, +5.3; HMMT Feb +6.5; HMMT Nov +9.7; HLE +1.7. That's like turning steady Bs into As across several classes.
- CHIMERA vs OpenScience: Multiple-choice fine-tuning made things worse than the base model, while CHIMERA's free-form CoTs made things better: evidence that guessing strategies don't teach real reasoning.
- Inference-time scaling: As we sample more solutions (pass@k), the CHIMERA model pulls further ahead (e.g., GPQA-Diamond pass@8 of about 90.7% vs 81.5% base). That's like finding the right solution path more often because you know more ways to solve it.
- SFT-only ablation: Most of the gains come from SFT on correct CoTs; RL adds extra polish.
Surprising findings:
- Small but smart beats big but noisy: A 9K-example dataset, carefully built, lifted a 4B model toward 70B-235B performance on elite benchmarks.
- Difficulty matters: The base model nearly saturates older synthetic sets (~76-89% accuracy) but scores much lower on CHIMERA (~37.5%), leaving room to learn, which is exactly what we want from training data.
- Multiple-choice can backfire: It encourages elimination tactics over step-by-step derivations, which hurts transfer to free-form reasoning tasks.
Anchor: After training with long, verified solutions across many subjects, the model not only answers more questions right but also benefits more from extra tries, just like a student who has learned many solution methods.
05 Discussion & Limitations
Limitations:
- Generator dependence: If the strong model that writes solutions has biases or gaps, those can slip in (though dual checks reduce this).
- Coverage skew: Math is nearly half the set; some areas may still be underrepresented.
- English-only, text-only: No multilingual or multimodal problems here.
- Long-context compute: Very long CoTs need more memory and time to train and evaluate.
- Frontier drift: As benchmarks get harder, data may need refreshing to stay challenging.
Required resources:
- Access to strong models for expansion, generation, verification, and rewards.
- GPUs with long-context capability to handle 10K-token-plus sequences.
- Sampling budget for fair pass@k evaluation.
When not to use:
- Tasks that need fresh web retrieval or tools (this is pure reasoning supervision).
- Multimodal tasks (images, math diagrams) without extension.
- Safety-critical deployments without human review.
- Non-English targets without translation/adaptation.
Open questions:
- How to measure "difficulty" beyond base-model accuracy: can we predict which samples teach the most?
- Can active selection or curriculum design squeeze even more from compact datasets?
- How to blend human minimal oversight with automated pipelines for edge cases?
- Extending beyond text: code execution traces, theorem provers, or tool-augmented solutions.
- Bias/fairness audits for synthetic reasoning and topic selection.
- Smarter verifiers that check step-by-step logic, not just final answers.
06 Conclusion & Future Work
Three-sentence summary: CHIMERA is a compact, synthetic, long-explanation dataset spanning eight disciplines and over a thousand topics, built with fully automated cross-model validation. Training a 4B model on CHIMERA yields strong gains on GPQA-Diamond, AIME, HMMT, and HLE, rivaling or approaching far larger models. This shows that high-quality, diverse, verified long reasoning, more than raw scale, drives generalizable reasoning ability.
Main achievement: Proving that a small but carefully engineered dataset with long, verified CoTs and broad coverage can unlock big-model-level reasoning in a small model.
Future directions: Add more subjects and languages, tighten step-by-step verification, incorporate tools (solvers, code, retrieval) for hybrid reasoning, and design curricula that adapt difficulty over time. Explore multimodal reasoning and new reward models that grade intermediate steps.
Why remember this: It reframes the recipe for reasoning: prioritize process supervision, structured breadth, and robust validation over sheer size. With CHIMERA, strong reasoning becomes more accessible, reproducible, and efficient for the entire community.
Practical Applications
- Bootstrapping small open models into strong reasoners using a compact, high-quality dataset.
- Creating course-quality practice sets with full solutions for science and math tutoring AIs.
- Training RL reasoning policies with verified final answers and long CoTs as guidance.
- Rapidly adding new subjects/topics via the expansion stage to target niche domains.
- Auditing and improving synthetic data quality with dual-verifier checks and answer matching.
- Designing curricula that start with SFT on correct CoTs and then use RL on harder leftovers.
- Benchmarking data difficulty by testing base-model accuracy before training.
- Replacing multiple-choice fine-tuning with free-form reasoning to avoid shortcut learning.
- Building contamination-aware training corpora by measuring n-gram overlaps with test sets.
- Prototyping multilingual or multimodal extensions by swapping generators/verifiers in the pipeline.
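The n-gram contamination check from the list above can be sketched simply: measure what fraction of a training document's n-grams also appear anywhere in the test set. Whitespace tokenization and n = 3 are illustrative choices, not the paper's stated settings.

```python
# Sketch of an n-gram overlap contamination check between a training
# document and a collection of test-set documents.

def ngrams(text, n=3):
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def overlap_ratio(train_doc, test_docs, n=3):
    train = ngrams(train_doc, n)
    if not train:
        return 0.0
    test = set().union(*(ngrams(d, n) for d in test_docs))
    return len(train & test) / len(train)

ratio = overlap_ratio(
    "compute the partition function of the lens space",
    ["what is the partition function of a lens space"],
)
# two of the six training trigrams also occur in the test document
```

Documents whose ratio exceeds a chosen threshold would be dropped from training to keep benchmark results trustworthy.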