
Learn Hard Problems During RL with Reference Guided Fine-tuning

Intermediate
Yangzhen Wu, Shanda Li, Zixin Wen et al. · 3/1/2026
arXiv

Key Summary

  • ReGFT is a simple pre-RL step that shows the model partial human hints, then makes it solve problems in its own words, creating correct, model-style solutions for hard questions.
  • This helps fix reward sparsity in RL, where the model often gets zero points on tough problems because it never samples a correct solution.
  • Unlike directly training on full human solutions (which the model struggles to imitate), ReGFT keeps the model’s natural reasoning style while still benefiting from expert guidance.
  • Starting RL (DAPO) from a ReGFT checkpoint speeds up learning and reaches a higher final accuracy on AIME 2024, AIME 2025, and Beyond-AIME.
  • ReGFT unlocks extra problems that normal sampling never solves (about 5.85% more on OmniMath), giving RL more positive signals to learn from.
  • ReFT (training on self-generated correct traces only) improves early speed but hits a lower ceiling than ReGFT, especially on the hardest benchmarks.
  • ReGFT improves inference-time scaling (pass@k) across both small and large k, meaning more shots at test time reliably find correct answers.
  • The method is orthogonal to the RL algorithm; even with a strong system like DAPO, ReGFT still adds gains.
  • ReGFT trains only on hard problems (low pass rate) to avoid overfitting and maximize learning where it’s needed.
  • Overall, ReGFT turns human references into model-aligned trajectories, solving the zero-reward stall and enabling stronger math reasoning with RL.

Why This Research Matters

Better math reasoning means smarter, more trustworthy assistants for education, science, and engineering. ReGFT reduces wasted compute and time by preventing RL from stalling on hard problems with zero rewards. Schools and learners can benefit from AI tutors that solve and explain tougher problems, while still showing their work in a natural, understandable style. Scientists and developers gain models that are more robust to difficulty spikes and make better use of extra test-time attempts (stronger pass@k). Because ReGFT is a pre-RL enhancement, it plays nicely with many RL algorithms and can be added to existing pipelines. This approach could generalize to coding, data analysis, and planning—anywhere partial hints exist and final answers can be checked.

Detailed Explanation


01Background & Problem Definition

Let’s take a quick tour of the key ideas, sandwich-style, so the rest of the story makes sense.

🍞 Hook: You know how in video games you learn faster when you get points for good moves? 🥬 The Concept (Reinforcement Learning): RL is training by trial and error with rewards for good outcomes.

  • How it works: 1) Try an action. 2) Get a reward or not. 3) Do more of what earns rewards.
  • Why it matters: Without rewards, the model doesn’t know what’s good or bad, so it can’t improve. 🍞 Anchor: Solving a math puzzle and getting a point only if your final answer is correct.

🍞 Hook: Imagine a treasure hunt with hardly any gold stars to find. 🥬 The Concept (Reward Sparsity): Reward sparsity means positive feedback is rare.

  • How it works: The model tries many solutions but rarely hits the exact correct one, so it gets almost no rewards.
  • Why it matters: With too few rewards, RL stalls and stops learning. 🍞 Anchor: If you do 100 tries and get 0 stickers, it’s hard to know what to change next.
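The stall can be made concrete in a few lines. This is an illustrative sketch (not from the paper): given 0/1 verifier rewards for each sampled rollout, it measures what fraction of problems give RL no learning signal at all.

```python
def zero_reward_fraction(rewards_per_problem):
    """Fraction of problems where every sampled rollout got reward 0.

    rewards_per_problem: one list of 0/1 verifier rewards per problem
    (one entry per sampled solution). On all-zero problems the policy
    gradient carries no signal, so RL cannot improve there.
    """
    stalled = sum(1 for rs in rewards_per_problem if sum(rs) == 0)
    return stalled / len(rewards_per_problem)
```

If two out of three problems never produce a correct sample, two thirds of the batch is silent: that is the regime ReGFT is designed to shrink.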

🍞 Hook: Think of studying from a teacher’s answer key. 🥬 The Concept (Supervised Fine-Tuning, SFT): SFT teaches the model by showing it correct solutions.

  • How it works: Show input + correct output and train the model to imitate.
  • Why it matters: Without SFT, the model might not learn basic patterns before RL. 🍞 Anchor: Practicing long division by copying the worked solution first.

🍞 Hook: Imagine a step-by-step recipe to bake a cake. 🥬 The Concept (Chain-of-Thought, CoT): CoT is a step-by-step reasoning path to the answer.

  • How it works: Write down each reasoning step toward the final answer.
  • Why it matters: Without steps, the model may guess instead of understanding. 🍞 Anchor: A recipe that says: mix, pour, bake, cool, frost—one step at a time.

🍞 Hook: Picture taking notes in your own handwriting so they’re easier for you to use. 🥬 The Concept (Self-Generated Trajectories): These are solutions the model writes itself.

  • How it works: The model solves problems and records its own step-by-step paths.
  • Why it matters: If it only copies others, it may not truly learn or generalize. 🍞 Anchor: Your own math scratchwork versus copying a friend’s.

🍞 Hook: Think of practicing only the moves you already did correctly. 🥬 The Concept (ReFT): Fine-tuning on the model’s own correct solutions.

  • How it works: Sample multiple tries, keep the correct ones, train on those.
  • Why it matters: If there are zero correct tries on hard problems, there’s nothing to train on. 🍞 Anchor: You can’t practice “what worked” if nothing worked yet on the toughest trick.

🍞 Hook: Imagine a coach giving a hint, then you finish the problem in your own words. 🥬 The Concept (ReGFT): Use partial human solutions as hints but make the model generate its own reasoning.

  • How it works: 1) Show a partial reference. 2) Model solves from scratch in its style. 3) Train on these aligned, correct traces.
  • Why it matters: Without staying in the model’s own style, direct copying doesn’t generalize; without hints, hard problems stay unsolved. 🍞 Anchor: A teacher circles key steps, and you fill in the full solution yourself.

🍞 Hook: Think of a strict answer checker that only gives a point if the final answer is right. 🥬 The Concept (Verifier in RLVR): A rule-based checker gives reward 1 for a correct final answer, 0 otherwise.

  • How it works: The model’s whole reasoning is judged only by the final answer.
  • Why it matters: If no correct final answers appear, rewards are zero and learning stalls. 🍞 Anchor: A scantron machine that only counts the bubble you fill, not your scratchwork.
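A minimal sketch of such a checker, assuming final answers are wrapped in `\boxed{...}` as in the paper's prompts. A production verifier would also normalize equivalent forms (e.g., `1/2` vs. `0.5`) and handle nested braces, which this simple regex does not.

```python
import re

def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in a response.

    Note: this simple pattern does not handle nested braces
    (e.g., \\boxed{\\frac{1}{2}}); it is a sketch only.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def verify(response, ground_truth):
    """Rule-based 0/1 reward: 1 iff the final boxed answer matches."""
    answer = extract_boxed(response)
    return int(answer is not None and answer == ground_truth.strip())
```

The scratchwork is never inspected: a long, mostly correct proof with a wrong final number earns exactly 0, which is why hard problems can yield no reward at all.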

🍞 Hook: Imagine a coach who tweaks both drills and how points count to keep practice effective. 🥬 The Concept (DAPO): An RL method that improves stability and sample efficiency with special clipping and dynamic sampling.

  • How it works: 1) Decouple how much to trust each trajectory. 2) Group by reward diversity to get useful updates.
  • Why it matters: Without stability and diversity, learning can collapse or be noisy. 🍞 Anchor: Rotating drills and fair scoring so everyone improves and no one trick dominates practice.
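One of DAPO's ingredients, dynamic sampling, can be sketched in a few lines (an illustrative simplification, not the authors' code): prompt groups whose rollouts all received the same reward contribute zero advantage under a group-relative baseline, so they are filtered out of the batch.

```python
def dynamic_sampling_filter(reward_groups):
    """Keep only prompt groups with mixed (non-uniform) rewards.

    reward_groups: one list of 0/1 rewards per prompt. Groups that are
    all 0 or all 1 have zero group-relative advantage and would waste
    the update, so dynamic sampling drops them.
    """
    return [g for g in reward_groups if 0 < sum(g) < len(g)]
```

Note the connection to reward sparsity: a problem the model never solves is always an all-zero group, so without something like ReGFT it is filtered out forever and never learned.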

🍞 Hook: Think of taking k shots at the hoop instead of just one. 🥬 The Concept (pass@k): The chance that at least one of k tries is correct.

  • How it works: Sample many answers; success if any are right.
  • Why it matters: Shows how well a model benefits from more attempts at test time. 🍞 Anchor: If you miss the first shot but sink one of the next five, you still “score at k=6.”
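pass@k is commonly estimated with the unbiased formula from Chen et al. (2021): sample n answers, count the c correct ones, and compute the probability that a random size-k subset contains at least one correct answer. A sketch:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples with c correct.

    Probability that a uniformly random size-k subset of the n
    samples contains at least one correct answer.
    """
    if n - c < k:
        return 1.0  # every size-k subset must include a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With n=2 samples and c=1 correct, pass@1 is 0.5; as k grows toward n, pass@k approaches 1 whenever any correct sample exists, which is why broader coverage (not just first-shot luck) drives the curves reported later.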

The world before: RL for math reasoning (RLVR) depends on finding at least some correct model-generated solutions so the verifier can give positive rewards. On truly hard problems, base models often produce zero correct solutions, so RL gets no signal—classic reward sparsity. People tried two main things: (1) Direct SFT on full human references (easy to copy, hard to generalize because it doesn’t match the model’s natural style), and (2) ReFT (train on the model’s own correct traces), which helps only where the model already succeeds at least sometimes. The gap: we need correct, model-aligned traces for problems the model currently cannot solve at all.

The real stakes: If we can unlock hard reasoning, we get better math tutors, more reliable scientific assistants, and safer decision tools that reason clearly instead of guessing. We also waste less compute on fruitless RL rollouts and make progress possible in domains where exact correctness is rare and precious.

02Core Idea

The “aha!”: Give the model partial teacher hints but make it solve in its own voice, then fine-tune on those model-written, hint-guided solutions so RL later sees more positives—even on previously unsolvable problems.

Three analogies:

  1. Training wheels: The teacher’s hint is a steady hand on the bike; you still pedal and balance yourself, so you truly learn to ride.
  2. Treasure map: The hint draws landmarks but leaves gaps; you navigate the path, so you actually discover the treasure.
  3. Cooking class: The chef shows which spices matter; you do the cooking, so the dish matches your style but still turns out right.

Before vs After:

  • Before: RL on hard problems often sees zero rewards. Direct SFT on full references doesn’t stick. ReFT speeds early learning but doesn’t raise the final ceiling.
  • After: ReGFT raises baseline competence by creating correct, model-style solutions on hard items. RL (e.g., DAPO) now frequently sees positive rewards, learns faster, and reaches a higher plateau.

Why it works (intuition, no equations):

  • Distribution alignment: The model best learns from examples that look like its own writing. Reference hints nudge structure without forcing foreign phrasing.
  • Reward densification: By unlocking more correct final answers on hard items, we feed RL many more positive signals.
  • Coverage expansion: Hints unlock reasoning paths the model wouldn’t discover alone, widening the space of solvable problems.
  • Stability for RL: A stronger start reduces the zero-reward regime and lowers the chance of bad updates.

Building blocks (sandwich-style for the key piece):

🍞 Hook: Imagine your teacher highlights a few key steps but asks you to do the whole problem. 🥬 The Concept (Reference-Guided Trajectories): Model-written solutions produced while seeing partial references as hints.

  • How it works: 1) Pick hard problems (low pass rate). 2) Show partial human solution (e.g., first 80% sentences). 3) Model solves from scratch; verifier checks final answer. 4) Keep correct, model-styled traces.
  • Why it matters: Without hints, hard problems remain unsolved; without model-style writing, learning won’t generalize. 🍞 Anchor: The teacher circles the strategy; you write the full proof that passes the answer checker.

Put together, ReGFT gives RL something to grab onto: a steady flow of correct, model-aligned solutions for the very problems that used to produce silence. That’s the simple shift that changes the entire learning curve.

03Methodology

At a high level: Problem + (partial reference hint) → model generates its own reasoning → verifier keeps correct ones → fine-tune on a mix of self-correct and hint-guided traces → start RL (DAPO) from this stronger checkpoint.

Step-by-step recipe:

  1. Select the hard set.
  • What: Identify problems the base model rarely solves (e.g., less than 25% accuracy under 16 samples).
  • Why: Focus training where competence is missing; avoid overfitting on easy problems.
  • Example: From 4,428 OmniMath problems, flag those with pass@16 below 25%.
  2. Two samplers for data creation: normal and reference-guided.
  • What: For each hard problem, run normal sampling and also reference-guided sampling (show partial human solution as a hint, e.g., first 80% of sentences).
  • Why: Normal sampling may find a few correct paths the model already knows; hints unlock extra paths the model couldn’t reach alone.
  • Example: On a geometry problem, the hint mentions using similar triangles; the model then writes its own full proof leveraging that idea.
  3. Verify correctness with a rule-based checker.
  • What: Keep only trajectories whose final boxed answer matches the ground truth.
  • Why: RLVR uses final-answer correctness as the sole reward signal; we want clean positives.
  • Example: If the model writes a long proof but the final number is wrong, it’s discarded.
  4. Build the fine-tuning mix.
  • What: Combine (a) self-generated correct traces (as in ReFT) and (b) reference-guided correct traces.
  • Why: Self-generated traces preserve the model’s style; guided traces expand competence on hard items. The mix balances familiarity and new skills.
  • Example: 60% guided, 40% self-generated (illustrative; tune as needed).
  5. Supervised fine-tuning (pre-RL).
  • What: Train the model on the mixed dataset so it increases the likelihood of these successful trajectories.
  • Why: This raises baseline pass rates on hard problems before RL begins.
  • Example: After SFT, the model now solves a chunk of previously impossible combinatorics questions.
  6. Start reinforcement learning with DAPO.
  • What: From the ReGFT checkpoint, run RL with dynamic sampling and decoupled clipping; sample many responses per prompt (e.g., 64) to explore.
  • Why: The stronger start produces more positive rewards, so RL learns faster and reaches higher accuracy.
  • Example: Training curves on AIME 2024 rise quicker and end higher than starting from the raw model.
  7. Evaluate inference-time scaling (pass@k).
  • What: At test time, sample many answers per problem to estimate pass@k.
  • Why: Shows how well the model uses extra attempts to find a correct solution.
  • Example: ReGFT + DAPO keeps improving as k grows, indicating broader coverage of correct solutions.
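The recipe can be sketched end to end. Everything here is illustrative: `model.sample`, the problem fields (`statement`, `reference`, `answer`), and the exact thresholds are hypothetical stand-ins for the paper's setup, not its actual code.

```python
def build_regft_dataset(problems, model, verifier, n_samples=16,
                        hard_threshold=0.25, hint_fraction=0.8):
    """Sketch of the ReGFT data-creation pipeline (illustrative).

    Returns a mix of self-generated correct traces (ReFT-style) and
    reference-guided correct traces for hard problems, to be used
    for supervised fine-tuning before RL (e.g., DAPO).
    """
    dataset = []
    for p in problems:
        # 1) Normal sampling: keep verified traces, measure pass rate.
        traces = [model.sample(p.statement) for _ in range(n_samples)]
        correct = [t for t in traces if verifier(t, p.answer)]
        dataset.extend(correct)
        # 2) Hard problems only: show a partial reference as a hint.
        if len(correct) / n_samples < hard_threshold:
            sentences = p.reference.split(". ")
            hint = ". ".join(sentences[: int(len(sentences) * hint_fraction)])
            prompt = (f"{p.statement}\nHint: {hint}\n"
                      "Given the partial reference, solve it by yourself.")
            guided = [model.sample(prompt) for _ in range(n_samples)]
            # 3) Keep only verified, model-written guided traces.
            dataset.extend(t for t in guided if verifier(t, p.answer))
    return dataset
```

The key design choice is visible in step 2: the hint enters only through the prompt, so every kept trajectory is still written in the model's own voice, and only verified traces ever reach the fine-tuning set.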

What breaks without each step:

  • Skip hard-set selection: You waste capacity on easy items, gain less on truly hard ones.
  • Skip reference-guided sampling: Many hard problems remain unsolved pre-RL; RL still sees zeros.
  • Skip verification: Noisy positives hurt both SFT and RL.
  • Skip mixing with self-generated traces: Over-reliance on human phrasing may not transfer to the model’s style.
  • Skip DAPO: You miss stability and sample-efficiency advantages; gains shrink.

Concrete data example:

  • Prompt: “Please reason step by step, and put your final answer within \boxed{}.”
  • Guided prompt adds: “Hint: {partial reference solution}. Given the partial reference… you must solve it by yourself.”
  • If the model outputs a correct final boxed answer, keep it as a training example; otherwise discard.

The secret sauce:

  • Partial hints preserve autonomy: The model must still traverse the reasoning path, so examples stay in-distribution for the model.
  • Competence first, RL second: Pre-RL competence prevents zero-reward stalls and makes downstream RL updates meaningful.
  • Diversity via two samplers: Normal plus guided sampling broadens the set of solvable paths, supporting better pass@k scaling.

Implementation notes (lightweight):

  • Base model: Qwen3-4B-2507-Instruct.
  • Data: OmniMath (4,428 problems with references).
  • Generation: temperature 0.7, top-p 0.9, max length 16,384 tokens.
  • RL: DAPO with dynamic sampling and decoupled clipping; response size per prompt commonly 64. These choices aren’t the core of ReGFT—they show ReGFT plays nicely with a strong, modern RL stack.

04Experiments & Results

The tests and why:

  • Supervised accuracy on hard training items: shows whether ReGFT actually creates and learns from more correct trajectories before RL.
  • RL training curves (AIME 2024, AIME 2025, Beyond-AIME): checks speed of learning and final plateau.
  • Inference-time scaling (pass@k): tests if more attempts at test time reliably produce correct answers, indicating broader solution coverage.

Competition (baselines):

  • Raw model (no pre-RL finetuning).
  • ReFT (only self-generated correct traces, no reference hints).
  • Direct SFT on full human references (no model-style generation).

Scoreboard with context:

  • Pre-RL supervised pass rate on OmniMath hard problems rises with ReGFT (e.g., around 50.1% vs. lower for raw), which is like moving from a shaky C+ to a solid B on the toughest homework.
  • RL training from ReGFT consistently outpaces raw across AIME 2024, AIME 2025, and Beyond-AIME: faster early gains and higher final accuracy—think finishing the race minutes ahead while others still improve but plateau lower.
  • ReFT vs. ReGFT: ReFT helps early speed but ReGFT wins at the finish. On Beyond-AIME (the toughest), ReFT can even trail the raw+DAPO final, while ReGFT keeps a clear margin. That’s like two students both cramming well, but only one truly understands and aces the final.
  • Direct SFT on human references underperforms both during SFT and after RL. This shows that copying a teacher’s full solution doesn’t stick if it doesn’t match the model’s natural style.

Surprising (and important) findings:

  • Reference-guided sampling solved about 5.85% extra problems that normal sampling never solved, proving hints can unlock new solution paths the model wouldn’t stumble upon alone.
  • ReGFT’s advantage persists as k increases in pass@k. The gains aren’t just “lucky first-shot hits”; they reflect better coverage of correct solutions even when you take many shots.
  • Increasing RL response size (e.g., from 16 to 64) helps everyone, but ReGFT+DAPO still leads, showing scaling exploration and better initialization are complementary, not substitutes.

Numbers made meaningful (examples drawn from reported trends):

  • AIME 2024: ReGFT starts faster and ends higher than raw—like going from a B- to an A- while raw climbs to a B+.
  • AIME 2025 and Beyond-AIME: ReGFT keeps the top spot; ReFT improves early but can plateau lower, especially on the hardest benchmark.
  • OmniMath: normal sampling solves ~68.6%; reference-guided reaches ~70.8% and, crucially, recovers problems normal sampling never touches.

Takeaway: ReGFT densifies rewards on hard items, which accelerates RL and raises the ceiling. It’s not just faster learning—it’s learning more things.

05Discussion & Limitations

Limitations:

  • Needs reference solutions: ReGFT assumes access to human-written solutions for training problems, which may not exist in every domain.
  • Verifier brittleness: Rule-based checkers can misjudge open-ended or proof-style answers, causing false negatives that hide real progress.
  • Hint quality and length: Too little hint may not help; too much can risk shallow imitation. Tuning matters.
  • Domain transfer: What works in math with exact checkers may be harder in fuzzy domains (e.g., law, open-ended science writing).
  • Compute: Generating many trajectories (normal + guided) and verifying them is resource-intensive.

Required resources:

  • A dataset with problems and reference solutions (for hints).
  • A verifier for final answers (ideally automatic and reliable).
  • Compute to sample multiple trajectories per problem in both SFT and RL.

When not to use:

  • No references available: If you lack any teacher-style solutions, ReGFT can’t form guided trajectories.
  • Unverifiable tasks: If you can’t check correctness automatically, you can’t safely filter positives.
  • Style-critical domains: Where strict imitation of human prose is required, pushing model-style solutions might be less desired than exact copying.

Open questions:

  • How to choose the “right” partial hint (which sentences, how long) automatically?
  • Can smarter verifiers reduce false negatives on proof-like outputs?
  • How well does ReGFT generalize beyond math—code, science QA, planning—where verifiers are imperfect?
  • What’s the best recipe for mixing guided and self-generated traces across stages of training?
  • Can we learn to generate synthetic hints when human references are scarce?

06Conclusion & Future Work

Three-sentence summary: ReGFT fixes RL’s zero-reward stall on hard math problems by turning partial human hints into model-written, correct solutions before RL starts. This raises baseline competence, so RL sees more positive signals, learns faster, and reaches a higher final accuracy. The gains hold across AIME 2024/2025 and Beyond-AIME and improve inference-time scaling (pass@k).

Main achievement: Showing that reference guidance plus model-style generation (not direct copying) creates the right kind of trajectories to densify rewards and unlock harder problems for RL.

Future directions: Automate hint selection; improve verifiers to handle proofs; explore domains with weaker checkers (code, science, planning); tune the guided/self-generated mix across training; combine with smarter RL sampling strategies.

Why remember this: A small pre-RL shift—partial hints + model-written solutions—changes the whole game for reasoning: it transforms silence (no rewards) into learning (plenty of positives), letting RL actually teach models new, harder skills instead of just amplifying what they already knew.

Practical Applications

  • Build stronger AI math tutors that can tackle Olympiad-style problems with clear, step-by-step explanations.
  • Precondition reasoning models for RL training in other domains (e.g., coding) by using partial hints from documentation or comments.
  • Reduce RL compute waste by ensuring models see more positive rewards on hard items from the very start.
  • Create curriculum-learning pipelines that first unlock hard problems with guided hints, then consolidate with RL.
  • Improve competitive benchmark performance (AIME, Beyond-AIME) for research or academic competitions.
  • Develop better automated graders/verifiers by analyzing where guided solutions still fail verification.
  • Enable safer decision-support systems that rely on verifiable end results (e.g., exact numeric answers).
  • Design hint-generation tools that extract partial steps from references to seed ReGFT automatically.
  • Enhance inference-time strategies by pairing ReGFT with multi-sampling (higher pass@k) for critical tasks.
  • Transfer the approach to proof-heavy domains by mixing guided steps with model-style proofs to improve generalization.
#Reference-Guided Fine-Tuning#ReGFT#ReFT#Reinforcement Learning#Reward Sparsity#RLVR#Verifier#Chain-of-Thought#DAPO#Pass@k#Inference-time Scaling#Model-Derived Trajectories#Mathematical Reasoning#Qwen3#OmniMath