TAROT: Test-driven and Capability-adaptive Curriculum Reinforcement Fine-tuning for Code Generation with Large Language Models

Intermediate
Chansung Park, Juyong Jiang, Fan Wang et al., 2/17/2026
arXiv

Key Summary

  • TAROT teaches code-writing AI the way good teachers teach kids: start at the right level and raise the bar at the right time.
  • Each coding problem gets four kinds of tests (basic, intermediate, complex, edge) so the AI can learn step by step within the same problem.
  • Training is guided by a curriculum that matches the model’s current ability, not just the raw reward score.
  • Two dials control learning: how often each test tier is used (allocation) and how much each tier’s success is worth (reward weight).
  • Less-capable models learn best from easy-to-hard, while stronger models learn best from hard-first.
  • Across many models and benchmarks, TAROT improved pass@1, with gains of up to about +4.26 percentage points on strong baselines.
  • Compared to standard reward schemes that ignore curriculum, TAROT gives more stable learning and better results.
  • The method works across different model families and sizes, but needs careful tuning for weaker models to avoid reward sparsity.
  • A key surprise: shorter, more concise code during training often predicts better final benchmark scores.
  • All code and data are released, enabling reproducible research and practical adoption.

Why This Research Matters

When code-writing AI understands both easy paths and tricky edge cases, software breaks less often and gets built faster. TAROT’s tiered, test-driven approach helps models learn robust solutions that stand up to real-world conditions, not just toy examples. This can save time for developers, reduce bugs reaching users, and improve safety in systems that need to be reliable. Matching the training challenge to each model’s skill level also means smaller, cheaper models can still learn effectively with the right pacing. Over time, these ideas can extend beyond coding to any task with checkable steps, making AI learning more efficient and dependable across domains.

Detailed Explanation

01Background & Problem Definition

🍞 Hook: Imagine learning piano. If every lesson gave you either Twinkle Twinkle (too easy) or a concert piece (too hard), you’d get bored or stuck. Now imagine lessons that match your level and step up just when you’re ready—that’s how you really improve.

🥬 The Concept (Curriculum Learning): What it is: Curriculum Learning (CL) is teaching a model with problems ordered from the right level to harder ones so it learns smoothly instead of randomly. How it works: 1) Sort practice by difficulty. 2) Start where the learner can succeed. 3) Gradually raise difficulty as skills grow. 4) Keep feedback meaningful at every step. Why it matters: Without a good curriculum, models waste time on too-easy tasks or fail on too-hard ones, learning very slowly or not at all. 🍞 Anchor: Just like math class builds from addition to multiplication, a well-planned training schedule helps AI build complex coding skills.
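The four steps above can be sketched in a few lines of Python. This is a toy illustration, not the paper's implementation; the `difficulty` field is an assumed pre-computed score:

```python
# Minimal curriculum-learning sketch: sort practice by difficulty, then build
# progressively harder training phases. Easier material stays in later phases
# so earlier skills keep getting reinforced.
def curriculum_phases(tasks, num_phases=3):
    """Split difficulty-sorted tasks into progressively harder phase pools."""
    ordered = sorted(tasks, key=lambda t: t["difficulty"])
    size = len(ordered) // num_phases
    # Phase k trains on everything seen so far plus the next difficulty slice.
    return [ordered[: size * (k + 1)] for k in range(num_phases)]
```

A scheduler like this only orders *whole* problems; as discussed below, TAROT's contribution is applying the same idea *inside* each problem.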

🍞 Hook: You know how a dog learns faster with treats after doing the right trick? AI can learn from rewards too.

🥬 The Concept (Reinforcement Fine-Tuning, RFT): What it is: RFT improves a model by rewarding better answers during training. How it works: 1) Model writes code. 2) We run tests. 3) Passing tests earn reward, failing tests don’t. 4) The model shifts toward choices that earned more reward. Why it matters: Without useful rewards, the model can’t tell what to improve, like a game with no score. 🍞 Anchor: If code passes more tests, the model gets a higher “score,” and it keeps the habits that led to that score.
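The reward signal described above, in miniature. Names like `run_tests` are illustrative, not the paper's API:

```python
# Toy RFT reward: run a candidate solution against unit tests and return the
# fraction that pass. Zero passes means zero signal (the "reward sparsity"
# problem discussed later).
def run_tests(candidate_fn, tests):
    """tests is a list of ((args...), expected_output) pairs."""
    passed = 0
    for args, expected in tests:
        try:
            if candidate_fn(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crash counts as a failed test
    return passed / len(tests)
```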

🍞 Hook: Think of a science fair project checklist you make before you start building. If you can pass each check, your project is likely solid.

🥬 The Concept (Test-Driven Development, TDD): What it is: TDD is writing tests first, then writing code to pass them. How it works: 1) Write simple tests. 2) Write code to pass them. 3) Add tougher tests. 4) Improve code to pass everything. Why it matters: Without tests, you don’t know if changes break things; tests keep you honest and robust. 🍞 Anchor: Like a recipe that checks taste at each step, tests help catch mistakes early.
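A tiny TDD example following the convention above. `slugify` is an invented function for illustration, not from the paper:

```python
# TDD in miniature: the checks exist before the implementation, and the code
# is then written to satisfy them.
def test_slugify():
    assert slugify("Hello World") == "hello-world"
    assert slugify("  extra   spaces ") == "extra-spaces"

def slugify(text):
    # lower-case, split on any whitespace, re-join with hyphens
    return "-".join(text.lower().split())

test_slugify()  # passes once the implementation satisfies the pre-written tests
```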

The world before TAROT: Code LLMs could turn English into code, boosting productivity. But they often struggled with “algorithmically sophisticated and robust” solutions. Two big issues in reinforcement learning for code were “reward sparsity” (few signals when nothing passes) and “reward flatness” (treating easy and hard wins as equal). Meanwhile, most curricula were set between problems (easy problems first, hard problems later). That missed the rich difficulty inside each single problem—exactly what developers use in real life with TDD.

Failed attempts: Prior works sorted by rough complexity scores (like cyclomatic complexity) or split tasks into smaller subtasks. Helpful, but they mainly looked at the data’s difficulty, not the learner’s changing capability. A one-size-fits-all schedule could bore strong models or overwhelm weaker ones.

🍞 Hook: Picture giving the same PE drills to both beginners and varsity athletes. The beginners get hurt; the athletes get bored.

🥬 The Concept (Capability-Conditioned Evaluation): What it is: Checking progress with goals matched to the current skill level. How it works: 1) Estimate the learner’s ability. 2) Pick tests at the right tier. 3) Value successes that prove real growth. 4) Re-check and adjust if needed. Why it matters: If you measure a rookie with pro drills, you get zeros; if you measure a pro with rookie drills, you learn nothing. 🍞 Anchor: A 5th-grader’s math quiz looks different from a college midterm—both are fair, but each is for the right level.

The gap TAROT fills: We needed a test-driven curriculum that (1) recognizes different difficulty within the same problem, and (2) adapts to the model’s capability so rewards stay informative. Most importantly, training should not depend blindly on the accidental mix of test difficulties in a dataset.

Real stakes: Better code generation means fewer bugs, safer systems, and time saved for developers. Imagine a model that not only solves simple tasks but also handles edge cases and tricky logic—this is what keeps apps from crashing, websites from breaking, and homework from failing hidden tests.

02Core Idea

The “Aha!” in one sentence: TAROT builds four-tier tests for every problem and separates “how often to practice each tier” from “how much to reward each tier,” so training can match the model’s current ability and grow it efficiently.

🍞 Hook: Think of climbing a staircase matched to your leg length. Too big steps? You trip. Too small? You waste time.

🥬 The Concept (Intra-Problem Difficulty Gradient via Tiered Tests): What it is: Each problem comes with basic, intermediate, complex, and edge tests that rise in challenge. How it works: 1) Generate candidate tests. 2) Verify them with a reference solution. 3) Sort them into four tiers. 4) Use these tiers to pace learning. Why it matters: Without tiers, rewards blur together, and the model can’t tell if it’s mastering harder skills. 🍞 Anchor: Like video game levels within the same world: you pass Level 1 before 2, but it’s still the same quest.

🍞 Hook: A coach decides both how often you practice each drill and how many points each drill earns—two different dials.

🥬 The Concept (Decoupling Allocation and Reward Weighting): What it is: TAROT separates (A) how much training time to spend on each tier and (B) how much success on that tier is worth. How it works: 1) Pick an allocation (e.g., mostly basic early). 2) Pick weights (e.g., value complex wins more for strong models). 3) Compute a combined return from pass rates by tier. 4) Optimize the model to maximize that return. Why it matters: If effort and value are tied together, you can’t truly tailor learning; separating them keeps training stable and targeted. 🍞 Anchor: Studying 70% vocabulary and 30% grammar (allocation), while grading essays more heavily (weighting) focuses growth precisely.
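The two dials can be sketched in plain Python. Tier names follow the paper; everything else here (dictionary layout, function names) is an assumption for illustration:

```python
import random

TIERS = ["basic", "intermediate", "complex", "edge"]

def sample_tier(alloc, rng=random):
    """Dial A (allocation): how often each tier appears in training."""
    return rng.choices(TIERS, weights=[alloc[t] for t in TIERS], k=1)[0]

def tier_weighted_return(alloc, weights, pass_rates):
    """Dial B (reward weights) combined with allocation into the return
    used for optimization: R = sum over tiers of alloc * weight * pass_rate."""
    return sum(alloc[t] * weights[t] * pass_rates[t] for t in TIERS)
```

Because the two dictionaries are separate, you can practice basics often (high allocation) while still valuing complex wins more (high weight), exactly the vocabulary/essay split in the anchor above.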

🍞 Hook: A beginner rock climber needs small holds close together; a pro needs overhangs and crimps.

🥬 The Concept (Capability-Adaptive Curriculum): What it is: The schedule of which tiers to emphasize depends on the model’s current capability (size, specialization, baseline performance). How it works: 1) Assess capability. 2) Choose from a portfolio: easy-to-hard for weak models, hard-first for strong ones. 3) Set tier allocations and weights. 4) Train with reinforcement learning. Why it matters: Wrong pacing causes collapse (too hard) or stagnation (too easy). 🍞 Anchor: In the paper, small models improved most with basic→complex; strong models did best starting with complex.

Three analogies for the main idea:

  • School tracks: Students are placed in classes that fit their level and then advanced at the right pace—TAROT does this for AI code learning.
  • Gym programming: You don’t max out deadlifts on Day 1; you scale weights to your strength and adjust reps and rest—allocation and reward weights are those knobs.
  • Video games: Difficulty scales with player skill; win streaks unlock harder bosses that give more points—TAROT’s tiered tests and weights act like that scaling.

Before vs After:

  • Before: Rewards often didn’t tell hard wins from easy wins; curricula ignored the learner’s capability and mostly sorted whole problems at once.
  • After: Rewards are tier-aware; pacing is matched to the model’s skill; the same problem teaches basics first or stress-tests edges, as needed.

Why it works (intuition):

  • Rich signal: Passing a complex test should teach the model more than passing a basic one; TAROT encodes this difference directly.
  • Stable steps: By controlling both practice frequency and point values, training avoids reward spikes or deserts.
  • Right challenge zone: Keeping the model inside its “Zone of Optimal Difficulty” makes gradients informative and learning efficient.

Building blocks:

  • Four-tier test suite per problem (basic, intermediate, complex, edge).
  • Curriculum allocator (how often each tier shows up in training).
  • Reward weights (how much each tier’s success counts).
  • Capability estimate to pick the right curriculum from a portfolio.
  • Reinforcement optimization (e.g., GRPO) to push the model toward higher tier-weighted returns.

03Methodology

At a high level: Problem + Four-tier tests → Choose curriculum (by capability) → Generate code → Run tiered tests → Compute tier-weighted return → Update the model (GRPO) → Better code.

Step 1: Build the tiered, test-driven dataset

  • What happens: For each coding problem (statement + reference solution), TAROT builds four test tiers: basic (happy path), intermediate (moderate inputs), complex (algorithmic stress), edge (boundaries/extremes). Frontier LLMs propose tests; any test that the reference solution fails is discarded.
  • Why this step exists: It creates a clear difficulty staircase inside each single problem, turning flat rewards into a rich, structured signal.
  • Example: For a string function, basic might check a short normal input, complex might use large nested structures, and edge might include empty strings or max-length inputs.
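The verification filter in this step can be sketched as follows. Names are illustrative, and the real pipeline runs candidates in a sandbox rather than in-process:

```python
# Sketch of Step 1's filter: LLM-proposed tests are kept only if the trusted
# reference solution passes them; anything the reference fails is discarded.
def verify_tests(reference_fn, candidate_tests):
    """candidate_tests is a list of ((args...), expected_output) pairs."""
    kept = []
    for args, expected in candidate_tests:
        try:
            if reference_fn(*args) == expected:
                kept.append((args, expected))
        except Exception:
            pass  # reference crashed: the generated test is invalid, drop it
    return kept
```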

🍞 Hook: You know how puzzles come in beginner, intermediate, advanced, and expert boxes?

🥬 The Concept (Tiered Test Suite): What it is: Four buckets of tests per problem that increase in difficulty and cover different failure modes. How it works: 1) Generate many tests. 2) Verify with the right answer. 3) Sort into B/I/C/E. 4) Use them to pace and measure learning. Why it matters: Without these buckets, all passes look the same, and training can’t find the right challenge. 🍞 Anchor: In experiments, metrics like token diversity shift rightward from basic to complex, confirming real difficulty increases.

Step 2: Assess capability and pick a curriculum policy

  • What happens: Estimate the model’s effective capability (size, code specialization, baseline scores). Choose from a portfolio: Forward (basic→edge), Reversed (edge→basic), or Static (e.g., Basic Only). Set reward weights: Uniform, Basic/Intermediate-weighted, or Complex/Edge-weighted.
  • Why this step exists: To land the model in its Zone of Optimal Difficulty and avoid both boredom and burnout.
  • Example: A code-specialized 3B model behaves like a much larger general model, so it may start with complex-focused training.
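A hedged sketch of the portfolio choice in this step. The thresholds and labels below are invented for illustration; the paper selects curricula from baseline capability assessments rather than a single scalar score:

```python
# Capability-conditioned curriculum pick: weaker models get easy-to-hard with
# basic-weighted rewards; stronger ones get hard-first with complex-weighted
# rewards, matching the paper's qualitative findings.
def pick_curriculum(capability_score):
    if capability_score < 0.3:
        return {"order": "forward", "weights": "basic_heavy"}    # easy-to-hard
    if capability_score < 0.7:
        return {"order": "forward", "weights": "uniform"}
    return {"order": "reversed", "weights": "complex_heavy"}     # hard-first
```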

🍞 Hook: A running coach decides both mileage (frequency) and race points (importance) based on your current fitness.

🥬 The Concept (Allocation vs Reward Weights): What it is: Allocation says how often a tier appears; reward weights say how much a tier’s pass is worth. How it works: 1) Fix α (allocation across tiers). 2) Fix w (weights across tiers). 3) During training, compute return as sum over tiers α_l * w_l * pass_rate_l. 4) Optimize to increase this return. Why it matters: If you don’t separate effort from value, training becomes tangled and fragile. 🍞 Anchor: You might practice basics 50% of the time but grade complex wins more heavily to push higher reasoning.

Step 3: Train with reinforcement learning (GRPO)

  • What happens: For each problem, the model generates several code candidates. We run each against tiered tests to compute pass rates per tier r_i,l. We combine them using R_TAROT = sum_l α_l * w_l * r_i,l. GRPO then updates the policy to increase expected R_TAROT while staying near the base model (via KL regularization β).
  • Why this step exists: Pass@1 alone doesn’t tell where the learning should go; tier-weighted returns tell the optimizer exactly what improvements to chase.
  • Example: If the curriculum prioritizes complex/edge tiers for a strong model, small gains there move the objective more than gains on basic tests.

🍞 Hook: Think of practicing free throws. You take many shots (samples), count your hits (pass rate), and then adjust your form to improve future shots (policy update).

🥬 The Concept (GRPO, a Stable RL Optimizer): What it is: A policy optimization method that nudges the model toward better actions while keeping it close to the original behavior. How it works: 1) Sample multiple candidates. 2) Estimate which are better than average (advantage). 3) Update the policy to favor better ones. 4) Use β to prevent over-shooting changes. Why it matters: Without stability, the model might forget useful skills or chase noisy rewards. 🍞 Anchor: It’s like improving your essay style step by step without suddenly writing in a totally different voice.
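The group-relative scoring at the heart of GRPO can be sketched as below. This is a simplified illustration; the full algorithm also clips policy updates and applies the KL penalty β mentioned above:

```python
# GRPO-style advantages: sample several candidates for the same problem,
# score each, and treat "better than the group" as positive advantage by
# normalizing each reward against the group mean and standard deviation.
def group_advantages(rewards):
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    if std == 0:
        return [0.0] * n  # all candidates scored the same: no learning signal
    return [(r - mean) / std for r in rewards]
```

Note the zero-signal branch: if every candidate fails every test (reward sparsity), all advantages are zero and nothing is learned, which is why pacing matters for weak models.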

Step 4: Evaluate and analyze

  • What happens: Measure pass@1 on HumanEval, MBPP, their “+” versions, and accuracy on LiveCodeBench, CodeForces, and CruxEval. Also study training dynamics: rewards, completion lengths, and sensitivity to β and temperature.
  • Why this step exists: To verify that the curriculum truly improves real coding ability and robustness, not just training scores.
  • Example: On Qwen3-4B-Instruct-2507, the Complex/Edge-weighted curriculum yields +2.44pp to +4.26pp across HumanEval/HumanEval+/MBPP/MBPP+.

The secret sauce

  • Test-driven tiers create a clear difficulty map within each problem.
  • Decoupling allocation and weights keeps optimization stable and targeted.
  • Capability-adaptive scheduling matches the learner’s level, finding that weaker models thrive on easy→hard, while stronger ones leap ahead with hard-first.
  • This combination avoids reward flatness and focuses gradient updates where they teach the most.

04Experiments & Results

The test: Do tier-aware, capability-adaptive curricula make models write more correct and robust code? The team evaluated many models (Qwen2.5 Instruct/Coder, Gemma2, Qwen3-4B) on popular benchmarks (HumanEval, MBPP, HumanEval+, MBPP+), plus out-of-distribution sets (LiveCodeBench v5, CodeForces, CruxEval).

The competition: Baselines included the un-finetuned base checkpoints and standard RL reward schemes that use all tests but no curriculum scheduling: Avg-reward (average pass across tiers) and Pass@All (reward only if all tiers pass).

The scoreboard (with context):

  • TAROT vs base checkpoints: Consistent gains across sizes and families. For Qwen3-4B-Instruct-2507, Complex/Edge-weighted curriculum improved HumanEval from 89.02% to 91.46% (+2.44pp) and HumanEval+ from 78.66% to 82.92% (+4.26pp); MBPP rose from 52.60% to 55.20% (+2.60pp) and MBPP+ from 56.61% to 58.73% (+2.12pp). That’s like moving from a solid A- to a clear A when already near the top.
  • Capability matters: On Qwen2.5-Instruct, 1.5B did best with basic-focused progression, while 7B excelled with complex-focused strategies. A code-specialized 3B model acted like a larger general model, peaking with complex-first—so specialization boosts effective capability beyond parameter count.
  • TAROT vs standard reward schemes: On both small and large models, TAROT beat Avg-reward and Pass@All across benchmarks, confirming that curriculum-aware rewards (and scheduling) outperform treating all tests the same.
  • Architectural generalization: On Gemma2, the 9B model often preferred simpler strategies (e.g., Basic Only) depending on the task, and the 2B model was fragile—hard-first schedules could cause collapse due to sparse rewards. A fundamentals-first approach was essential for weaker models.

Surprising findings:

  • Not one-size-fits-all: On OOD tasks, the best curriculum varied by benchmark. For example, LiveCodeBench sometimes favored Basic Only, while CodeForces and CruxEval benefited from Complex/Edge emphasis. Translation: pick curricula with the target domain in mind.
  • Conciseness predicts success: Shorter average completion lengths during training correlated better with downstream scores than raw training rewards. That suggests concise solutions reflect clearer reasoning.
  • Token limits matter: Allowing very long generations helped MBPP/MBPP+ (more room to code), but hurt HumanEval/HumanEval+ (more verbosity, more errors). Choose length caps by task type.
  • Hyperparameters are task-specific: Lower KL strength (β=0.01) and higher training temperature (1.0) helped HumanEval(+), while MBPP liked slightly stronger regularization and moderate temperatures. There isn’t a single best setting across all tasks.

Big picture: TAROT’s capability-adaptive, tier-aware design consistently lifts performance and makes training more informative. It highlights a key principle: the right challenge at the right time beats flat rewards and one-speed-fits-all schedules.

05Discussion & Limitations

Limitations

  • Synthetic tiered tests: The four-tier suites are generated by strong LLMs and then filtered by a reference solution. Biases or blind spots in generation could reduce coverage or overfit certain patterns.
  • Python-only focus: Results use Python tasks; extending to other languages (C++, Java, Rust) or multi-language settings needs validation.
  • Static portfolio choice: Curricula are selected from a pre-defined menu using a baseline assessment. Continuous, dynamic scheduling that adapts mid-training remains future work.
  • Reward dependence on tests: If test tiers are mis-specified (e.g., mislabeled difficulty), the curriculum signal weakens.

Required resources

  • Data: A large set of problems with verified tiered tests (TAROT provides ~15k problems × 4 tiers).
  • Compute: Multi-GPU training (the paper used up to 8×A100 80GB) for GRPO and evaluation; inference servers (e.g., vLLM) to run many test executions.
  • Tooling: Sandboxed execution for safety/timeouts; RL libraries (TRL), transformers, and evaluation frameworks (EvalChemy).

When NOT to use TAROT

  • Very small or fragile models without basic code ability: Hard-first or mixed curricula can lead to reward sparsity and collapse—start with basics or use supervised warm-up.
  • Domains without reliable tests: If you cannot define meaningful tiered checks, the curriculum signal won’t guide learning.
  • Non-executable tasks: When there’s no clean pass/fail (e.g., purely stylistic outputs), tier-aware unit-test rewards don’t apply.

Open questions

  • Auto-curation: How to generate and validate cross-language, security-aware, and adversarially robust tiered tests at scale?
  • Dynamic curricula: Can we learn α and w online based on live capability estimates (bandits, meta-RL) instead of choosing from a static portfolio?
  • Transfer and OOD: How to predict the best curriculum for a new benchmark before full training? Can we learn task signatures that map to curriculum recipes?
  • Richer rewards: How to blend process rewards (line-level hints) with tier-aware outcome rewards for even denser, safer learning?
  • Capability metrics: What’s the best quick test to place a model into its Zone of Optimal Difficulty without long warm-ups?

06Conclusion & Future Work

Three-sentence summary: TAROT gives each coding problem a four-level test suite and then trains models with a curriculum that matches their current capability. By separating how often to practice each test level from how much each level is worth, TAROT creates stable, informative rewards that guide models to robust, algorithmically stronger code. Across model families and sizes, this approach reliably improves pass rates and reveals that the best curriculum depends more on effective capability than on parameter count alone.

Main achievement: A practical, test-driven, capability-adaptive curriculum framework that turns flat rewards into rich, tier-aware signals and consistently improves code LLMs via reinforcement fine-tuning.

Future directions: Automate curriculum selection during training, expand to multiple programming languages and domains, combine with process rewards, and design task-specific tiering for OOD targets. Better capability estimators and adaptive token-length controls could further boost stability and performance.

Why remember this: TAROT shows that “how we teach” code LLMs matters as much as “what we teach.” Matching challenge to ability—with clear, tiered tests—turns vague reinforcement into focused progress and pushes code generation toward correctness, robustness, and real-world reliability.

Practical Applications

  • Create four-tier unit tests (basic/intermediate/complex/edge) for your own coding tasks to debug models more effectively.
  • For weaker models, start training with a basic→complex schedule; for stronger or code-specialized models, try complex-first.
  • Separate how often each test tier appears (allocation) from how much it counts (weights) to stabilize and target learning.
  • Use tier-aware rewards in RL fine-tuning so complex/edge passes contribute more for capable models.
  • Monitor average completion length during training; aim for concise solutions as a proxy for better reasoning.
  • Tune GRPO β and sampling temperature per task (e.g., lower β and higher temperature for HumanEval-like tasks).
  • Choose inference token limits by benchmark type (shorter for concise tasks like HumanEval, longer for MBPP-style tasks).
  • For OOD targets, pilot multiple curricula (e.g., Basic Only vs Complex/Edge-weighted) and keep the best-performing one.
  • Warm up very small models with supervised basics or Basic Only curricula to avoid reward sparsity collapse.
  • Adopt sandboxed execution with timeouts for safe, consistent, and reproducible code evaluation.
#TAROT#curriculum learning#reinforcement fine-tuning#code generation#tiered test suites#capability-adaptive training#reward shaping#GRPO#HumanEval#MBPP#reward flatness#test-driven development#zone of optimal difficulty#pass@1#out-of-distribution evaluation