Code2Math: Can Your Code Agent Effectively Evolve Math Problems Through Exploration?
Key Summary
- This paper shows that code-writing AI agents can take an existing math problem and automatically turn it into a new, harder one while keeping it solvable.
- The system is a team of three agents: one evolves the problem, one checks that it’s solvable, and one checks that it’s truly harder in a smart way (not just messier).
- Agents actively run Python code (like SymPy and Z3) to experiment, test ideas, and verify parts of solutions during problem creation.
- Across 100 seed problems, the evolved problems stayed mathematically sound, with high agreement (up to about 96%) from an external judge model.
- Solvers’ accuracy dropped on the evolved problems and they used more tokens, meaning the new problems really were harder and needed longer reasoning.
- Interestingly, some models could create problems that were harder than what they themselves could solve, showing a 'make harder than you' effect.
- Making good hard problems took several tries: typically 1.56 to 6.55 failed rollouts per success, with a long tail of even more retries for tough cases.
- The key idea is 'Burden of Discovery': hiding the crucial insight so that finding the first step is the main challenge, not doing long calculations.
- Code-driven exploration helps find fresh structures and counterexamples and keeps the evolved problems non-trivial and elegant.
- This approach could scale up high-quality math data for training and testing advanced reasoning models without needing endless human-written problems.
Why This Research Matters
Strong math AIs need a steady supply of fresh, fair, and challenging problems to keep learning. This paper shows how to build such problems automatically, while checking they are solvable and truly harder in a thoughtful way. That means better training data, better tests, and faster progress without overloading human experts. Teachers and students can benefit from curated sets that reward insight instead of busywork. And research communities get reproducible pipelines that use code to explore ideas, not just text to describe them.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook) You know how puzzles get boring if you only have easy ones, but making brand-new, excellent hard puzzles takes a lot of time from expert humans? AI math models are in that same boat: they’re getting very good, but they need a steady supply of new, challenging problems to keep improving.
🥬 Filling (New Concept 1: Code Agents)
- What it is: A code agent is an AI that can think and also write and run code to check its ideas.
- How it works:
- Reads a task
- Writes Python code to test patterns or try examples
- Runs the code and sees exact results
- Uses those results to make better next steps
- Why it matters: Without code agents, the AI guesses in words only and can get lost; with code, it can verify ideas quickly and explore far more possibilities. 🍞 Bottom Bread (Anchor): Imagine a student who can use a calculator and a lab. They don’t just guess; they try, measure, and improve. That’s a code agent.
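That "write code, check, refine" loop can be sketched in a few lines of plain Python. The conjecture tested here (Euler's famous n² + n + 41 "prime generator") is an illustrative choice, not an example from the paper:

```python
# A minimal sketch of a code agent's check-your-hunch loop: test whether
# n^2 + n + 41 is prime for every n, and report the first failure.

def is_prime(m: int) -> bool:
    if m < 2:
        return False
    d = 2
    while d * d <= m:
        if m % d == 0:
            return False
        d += 1
    return True

def first_counterexample(limit: int = 100):
    """Return the first n where n^2 + n + 41 is composite, or None."""
    for n in range(limit):
        if not is_prime(n * n + n + 41):
            return n
    return None

# The pattern holds for n = 0..39 but breaks at n = 40:
# 40^2 + 40 + 41 = 1681 = 41^2, which is composite.
print(first_counterexample())  # → 40
```

A word-only guesser might extrapolate from forty successes; the code agent finds the exact breaking point in milliseconds.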
🥬 Filling (New Concept 2: Mathematical Problem Solving)
- What it is: Turning a math question into steps you can prove.
- How it works:
- Understand the goal
- Look for key structures (like symmetry or parity)
- Test smaller cases
- Build a general argument
- Why it matters: Without clear steps, you can’t be sure your answer is correct. 🍞 Anchor: Proving the sum of two odd numbers is even: check examples, spot the pattern, then write a proof with 2k+1 and 2m+1.
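The "check examples, then trust the algebra" routine from that anchor looks like this in plain Python (a toy illustration, not code from the paper):

```python
# Step 1: test many cases. Step 2: confirm the 2k+1 / 2m+1 argument
# numerically for a sample pair.
odds = [2 * k + 1 for k in range(10)]

# Empirical pass: every pairwise sum of odd numbers is even.
assert all((a + b) % 2 == 0 for a in odds for b in odds)

# The proof in one line of algebra: (2k+1) + (2m+1) = 2(k+m+1).
k, m = 7, 12
assert (2 * k + 1) + (2 * m + 1) == 2 * (k + m + 1)
print("sum of two odds is even on all tested cases")
```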
🥬 Filling (New Concept 3: Mathematics)
- What it is: A language of patterns, quantities, and structures that allows precise reasoning.
- How it works:
- Define objects (numbers, shapes, sets)
- Set rules (axioms)
- Prove statements using logic
- Why it matters: Without this structure, results become opinions, not certainties. 🍞 Anchor: If 3x=12, then x=4 isn’t a guess; it follows rules everyone agrees on.
🥬 Filling (New Concept 4: Cognitive Science)
- What it is: The study of how thinking works (attention, memory, planning).
- How it works:
- Observe how thinkers solve tasks
- Model strategies and mistakes
- Improve learning and problem design
- Why it matters: Without understanding how we think, we can’t design problems that truly challenge understanding. 🍞 Anchor: A teacher spaces homework to match attention and memory—cognitive science in action.
The world before: Big AI models got better at math by training on lots of problems. But crafting truly challenging, fair, and novel math problems is hard and slow for humans. Many prior automatic methods either nudged numbers (like changing 30 to 300) or applied simple rules, which often made problems longer but not deeper. Other approaches tried to generate problems from scratch but sometimes produced ones that were shaky, unsolvable, or only superficially hard.
The problem: We need a scalable way to create new math problems that are both solvable and genuinely harder in an insightful way, not just tedious.
🥬 Filling (New Concept 5: Burden of Discovery)
- What it is: How hard it is to find the key idea (the 'Aha!') that unlocks the solution.
- How it works:
- Hide or twist the usual entry point
- Demand a fresh observation or structure
- Reward insight over brute force
- Why it matters: Without a burden of discovery, problems feel like routine worksheets. 🍞 Anchor: A puzzle where the trick is noticing a hidden symmetry—once you see it, everything falls into place.
Failed attempts: Simple text-only edits often inflate numbers or algebra but leave the same solution template. Fully automatic generations may slip into contradictions or unintentional hints that make the problem easy again. And without strong checking, some 'new' problems aren’t actually solvable.
The gap: We need an engine that can explore ideas like a scientist, verify logic like a mathematician, and purposefully design that 'Aha!' moment like a great teacher—at scale.
🥬 Filling (New Concept 6: Test-Time Exploration)
- What it is: Letting the agent run many code-powered experiments while crafting a problem.
- How it works:
- Generate a hypothesis (e.g., an inequality bound)
- Write code to test many cases
- Keep promising leads; discard counterexamples
- Refine until a crisp, solvable challenge emerges
- Why it matters: Without exploration, you either miss great ideas or accept broken ones. 🍞 Anchor: Like trying different Lego builds until one is sturdy and cool, then turning it into the official set.
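A minimal sketch of that explore-and-discard loop, with two illustrative hypotheses (AM-GM, and a deliberately over-strong sharpening of it) standing in for the agent's real candidates:

```python
# Propose a bound, hammer it with random cases, and keep it only if no
# counterexample appears. The hypotheses are illustrative, not the paper's.
import random

def survives(hypothesis, trials=10_000, seed=0):
    rng = random.Random(seed)
    for _ in range(trials):
        a, b = rng.uniform(0.01, 100), rng.uniform(0.01, 100)
        if not hypothesis(a, b):
            return False  # counterexample found: discard this lead
    return True

# AM-GM, true (small tolerance guards against float rounding):
am_gm = lambda a, b: (a + b) / 2 + 1e-9 >= (a * b) ** 0.5
# An over-strong sharpening, false whenever a and b are close:
too_strong = lambda a, b: (a + b) / 2 >= (a * b) ** 0.5 + 1

print(survives(am_gm))       # → True
print(survives(too_strong))  # → False
```

Leads that survive thousands of random trials graduate to a proof attempt; leads with a counterexample are pruned immediately.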
Real stakes: Better math problems mean better training and fairer tests for AI models. Teachers, students, and researchers benefit from a growing library of high-quality challenges. As AI becomes a study partner, it must learn from—and be measured by—problems that reward insight, not just speed. This paper’s approach uses code-driven exploration and team-based verification to make that possible at scale.
02 Core Idea
The 'Aha!' moment in one sentence: Use a team of code-savvy AI agents to evolve a seed math problem into a new one that is provably solvable and measurably harder by increasing the Burden of Discovery.
🍞 Top Bread (Hook) Imagine three friends building a maze: one designs twisty hallways, one checks every path really connects to an exit, and one makes sure the maze is trickier in a clever way, not just longer.
🥬 Filling (New Concept 7: Multi-Agent Framework)
- What it is: A system where multiple specialized code agents collaborate, each handling a different job in problem evolution.
- How it works:
- Evolution Agent proposes new problems and draft solutions
- Solvability Verification Agent checks logic and consistency
- Difficulty Verification Agent judges if the new problem truly raises the Burden of Discovery
- Why it matters: Without roles, one agent tries to do everything and makes more mistakes; specialization raises reliability and quality. 🍞 Anchor: Like a kitchen with a chef, a taster, and a nutritionist—delicious, safe, and balanced meals.
Multiple analogies for the main idea:
- Chef analogy: The head chef invents the recipe (Evolution), the food safety inspector checks it won’t make you sick (Solvability), and the food critic ensures it’s not just spicy but genuinely gourmet (Difficulty).
- Game design analogy: A level designer makes a new stage, a QA tester confirms it can be beaten, and a design lead ensures the fun comes from smart puzzles, not just more enemies.
- Science lab analogy: A researcher proposes a new experiment, a peer reviewer checks methods for errors, and a panel ensures it advances knowledge rather than adding busywork.
Before vs After:
- Before: Problem tweaks often changed surface details; some new items were unsolvable or only longer to compute.
- After: Problems are constructed with code-backed exploration, then double-checked for solvability and insight-based hardness.
- What changes: Reliability (fewer broken problems), depth (anti-template design), and transferability (hard for many solvers, not just one).
Why it works (intuition, not equations):
- Exploration turns hunches into evidence by testing many examples quickly.
- Separation of duties catches common failure modes: one agent to create, another to verify logic, another to score conceptual difficulty.
- A shared standard (an external judge model) calibrates solvability beyond the team’s internal bias.
🥬 Filling (New Concept 8: Theory of Mind)
- What it is: The skill of imagining how someone else would try to solve the problem.
- How it works:
- Predict typical solver strategies
- Place gentle traps for those templates
- Design a route that rewards a fresh insight
- Why it matters: Without it, you make problems that experts solve instantly using familiar tricks. 🍞 Anchor: If you know your friend always tries the longest word first in Hangman, you pick a sneaky short word instead.
Building blocks of the idea:
- Seed problem and solution: a starting map and a known path.
- Evolution Agent’s search: pushes bounds, reworks structures, and checks cases with code.
- Solvability checks: surface consistency (no illegal constraints) and deep logic (step-by-step proof validation).
- Difficulty checks: a rubric that penalizes tedium and rewards true 'Aha!' depth.
- Test-time scaling: multiple rollouts give the agent chances to refine and succeed.
- External judge: an independent, strong model to settle solvability and evaluate solver answers objectively.
In short, the system builds harder—but fair—mazes by exploring with code, validating with logic, and measuring whether the challenge lies in discovery rather than in drudgery.
03 Methodology
At a high level: Seed Problem and Solution → Evolution Agent (explores with code) → Solvability Verification Agent (logic check) → Difficulty Verification Agent (insight check) → Evolved Problem and Reference Solution.
Step 1: Evolution Agent (Design and Exploration)
- What happens: The agent reads the seed problem and its official solution, identifies the key 'trick' that solved it, and then explores new versions that hide or reshape that trick. It writes Python code to test candidate constructions, simulate bounds, search counterexamples, and ensure the new statement is plausible.
- Why this step exists: Without guided exploration and self-checking, many proposals would be unsolvable or trivial.
- Example with data: Starting from a list-sum puzzle (sum 30, mode 9, special median), the agent generalizes to sum 323, mode 10, and asks for the maximum possible length. It writes code to pack small numbers around a fixed mode under median constraints, enumerates feasible configurations, and confirms the maximum length attains the sum exactly.
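A hedged sketch of the kind of feasibility checker such an agent might write, using the seed puzzle's constraints (sum 30, unique mode 9, integer median absent from the list); the helper `feasible` is hypothetical, not the paper's code:

```python
# Given a candidate list, verify the sum, unique-mode, and median
# constraints exactly, using rationals to avoid float medians.
from collections import Counter
from fractions import Fraction

def feasible(xs, total=30, mode=9):
    if sum(xs) != total:
        return False
    counts = Counter(xs)
    top = max(counts.values())
    # unique mode: `mode` must be strictly more frequent than any other value
    if counts[mode] != top or sum(1 for v in counts.values() if v == top) != 1:
        return False
    s = sorted(xs)
    n = len(s)
    med = Fraction(s[n // 2]) if n % 2 else Fraction(s[n // 2 - 1] + s[n // 2], 2)
    # median must be a positive integer that does not appear in the list
    return med.denominator == 1 and int(med) > 0 and int(med) not in counts

print(feasible([5, 7, 9, 9]))  # → True  (median 8, unique mode 9, sum 30)
print(feasible([9, 9, 12]))    # → False (median 9 appears in the list)
```

With a checker like this in hand, the agent can enumerate or greedily pack configurations for the evolved constraints (sum 323, mode 10) and search for the maximum length.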
🥬 Filling (New Concept 9: Solvability Verification)
- What it is: A strict audit to ensure the new problem is well-defined and that the proposed solution is logically correct.
- How it works:
- Static check: Are constraints legal and consistent?
- Logic audit: Verify each solution step with symbolic tools (e.g., SymPy), solve equations, and test edge cases
- If any step fails, reject and send back for rework
- Why it matters: Without this, broken or contradictory problems slip through. 🍞 Anchor: Like checking a maze has at least one exit and that every doorway connects properly.
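The static half of that audit can be as simple as intersecting interval constraints on a variable; `consistent` here is a hypothetical stand-in for the agent's real checks:

```python
# Flag an "impossible inequality" by intersecting open interval bounds
# on the same variable before any deep logic check runs.
import math

def consistent(intervals):
    """intervals: list of (lo, hi) open bounds on one variable."""
    lo = max(l for l, _ in intervals)
    hi = min(h for _, h in intervals)
    return lo < hi

ok = consistent([(0, 10), (2, math.inf)])          # x > 2 and x < 10: fine
bad = consistent([(2, math.inf), (-math.inf, 1)])  # x > 2 and x < 1: impossible
print(ok, bad)  # → True False
```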
Step 2: Solvability Verification Agent (Two-Phase Audit)
- What happens: Phase 1 flags illegal domains (e.g., division by zero, impossible inequalities). Phase 2 re-derives each step with code to catch hidden assumptions, missed cases, or invalid transformations.
- Why this step exists: To prevent pretty-looking but incorrect math from becoming a 'final' problem.
- Example with data: In an inequality evolution, the agent confirms the claimed extremal case by setting up variables, running symbolic optimization, and ensuring constraints hold at boundaries.
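A SymPy sketch of that kind of confirmation on a toy problem (maximizing xy subject to x + y = 1 with x, y ≥ 0); the instance is illustrative, not one of the paper's:

```python
# Phase-2-style audit: re-derive the claimed extremal case symbolically
# and check the boundary of the constraint region.
from sympy import symbols, diff, solve, Rational

x = symbols('x')
f = x * (1 - x)            # substitute the constraint y = 1 - x

crit = solve(diff(f, x), x)                 # stationary points
assert crit == [Rational(1, 2)]             # unique interior critical point
assert f.subs(x, Rational(1, 2)) == Rational(1, 4)  # extremal value 1/4

# boundary check: the endpoints of [0, 1] give strictly smaller values
assert f.subs(x, 0) == 0 and f.subs(x, 1) == 0
print("extremal case x = y = 1/2 confirmed, max value 1/4")
```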
🥬 Filling (New Concept 10: Difficulty Verification)
- What it is: A referee that checks if the new problem is genuinely harder in a conceptual way, not just longer.
- How it works:
- Compare original vs evolved solution paths
- Score on a 1–5 scale, penalizing 'artificial complexity' and rewarding 'Aha!'
- Accept only if the score clears the rubric's acceptance threshold
- Why it matters: Without this, we’d flood the pool with tedious exercises instead of insightful challenges. 🍞 Anchor: It’s the coach who says, 'Don’t just add more laps—change the drill so it builds real skill.'
Step 3: Difficulty Verification Agent (Rubric-Based Judgment)
- What happens: Using a detailed rubric, the agent distinguishes between 'longer algebra' (fail) and 'new insight needed' (pass). High scores go to problems that flip standard heuristics into traps or connect concepts in surprising ways.
- Why this step exists: To keep the bar high for intellectual depth and to shape the curriculum towards discovery.
- Example with data: Turning a local inequality into an extremal-moment optimization forces a shift from a one-shot trick to analyzing discrete distributions and deriving structural constraints—this earns a strong score.
Test-Time Scaling via Multiple Rollouts
- What happens: The Evolution Agent gets up to a fixed number of tries (e.g., 20 rollouts) to pass both verification gates.
- Why this step exists: Hard problems often need iteration; multiple attempts improve odds of finding a solvable, elegant construction.
- Example: Some seeds converge in 1–2 attempts; others require 10+ due to subtle contradictions caught by the solvability audit.
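The rollout loop itself is simple; in this sketch, `propose`, `solvable`, and `harder` are hypothetical stand-ins for the three agents, not the paper's implementation:

```python
# Keep proposing evolved problems until both verification gates pass
# or the rollout budget runs out.
def evolve(seed, propose, solvable, harder, max_rollouts=20):
    for attempt in range(1, max_rollouts + 1):
        candidate = propose(seed, attempt)
        if solvable(candidate) and harder(seed, candidate):
            return candidate, attempt       # success: both gates passed
    return None, max_rollouts               # budget exhausted

# Toy run: the "agent" only produces a valid candidate on its 3rd try.
result, tries = evolve(
    seed="seed problem",
    propose=lambda s, i: f"candidate-{i}",
    solvable=lambda c: c.endswith("3"),
    harder=lambda s, c: True,
)
print(result, tries)  # → candidate-3 3
```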
Code Tools and Environment
- What happens: Agents run in a sandbox with libraries like SymPy (symbolic math), Z3 (constraints), NetworkX (graphs), itertools (combinatorics), NumPy/SciPy (numerics), and mpmath (high-precision).
- Why this step exists: These tools enable precise checks, exhaustive searches of small spaces, and quick falsification of bad ideas.
- Example: For graph constraints (e.g., forbidding certain subgraphs), the agent constructs candidate graphs with NetworkX and searches for forbidden patterns.
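A stdlib-only stand-in for that kind of forbidden-pattern search (the paper's agents use NetworkX; here the forbidden subgraph is a triangle, and the representation is a plain edge list):

```python
# Detect a forbidden triangle: an edge (u, v) closes a triangle exactly
# when u and v share a common neighbor.
def has_triangle(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # adjacency sets never contain the vertex itself, so a non-empty
    # intersection means a genuine third vertex completes a triangle
    return any(adj[u] & adj[v] for u, v in edges)

square = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(has_triangle(square))             # → False (a 4-cycle is triangle-free)
print(has_triangle(square + [(0, 2)]))  # → True  (the chord closes 0-1-2)
```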
Secret Sauce: Dual Verification + Code-First Exploration
- Clever part 1: Separation of creation and critique reduces bias and catches more errors.
- Clever part 2: Code-driven exploration finds structures humans might miss quickly and prunes dead ends early.
- Clever part 3: The difficulty rubric aligns the system with beauty and insight rather than grind, nudging the generator toward elegant constructions.
Putting it together on a toy flow
- Input: AIME-style list problem
- Evolution: Formulate a maximization with constraints (mode fixed, median excluded), search integer configurations by code, detect optimal median choices, and pack numbers economically.
- Solvability: Verify constraints produce at least one valid configuration and the reasoning is airtight.
- Difficulty: Confirm the shift from 'case-check small sum' to 'optimize under coupled constraints' requires a new insight about density and trade-offs.
- Output: A crisp, solvable problem with a reference solution and a clearly higher Burden of Discovery.
04 Experiments & Results
The test: The authors gathered 100 seed problems (algebra, combinatorics, calculus, sequences, graph theory) and had several strong LLMs act as the Evolution Agent. Each attempted to evolve one new problem per seed under a budget of steps and rollouts. An external, very strong judge model evaluated solvability and graded solver attempts.
What was measured and why:
- Solvability Agreement Rate: Do the internal solvability checks match the external judge? This shows reliability.
- Solve Rates on original vs evolved sets: Do solvers get worse on evolved problems? Lower accuracy means harder.
- Average Token Consumption: Do solutions take longer? More tokens suggest deeper reasoning chains.
- Efficiency (rollouts per success): How many attempts are needed to pass both gates?
The competition: Evolution agents included DeepSeek-Chat, DeepSeek-Reasoner, Gemini-3-Pro-Preview-Thinking, Kimi-K2-Thinking, and Seed-2.0-Pro. Solvers included several open and closed models, with GPT-5.2-High as the external judge.
The scoreboard with context:
- High solvability agreement: For instance, DeepSeek-Reasoner achieved about 96% agreement (94/98) with the external judge. This is like two referees independently making the same call almost every time.
- Difficulty escalation: On evolved problems, many solvers’ accuracies dropped. Example: GPT-5.2-High fell from roughly 70% on originals to around 61% on evolved problems (a meaningful dip for a top model). Gemini-3-Flash-Thinking dropped even more (e.g., 56% to 35% in one setting), which is like moving from a comfy B to a struggling D on the new test—a strong sign the problems really got harder.
- Token usage rose: The distribution of average tokens per problem shifted right, often with a fat tail for evolved sets (e.g., median tokens jumping by thousands). That’s like students writing much longer, more detailed solutions, not because they’re waffling, but because the path requires more exploration.
- Efficiency and cost: On average, success took multiple attempts—about 1.56 to 6.55 failures per success depending on the model, with some problems needing 10+ retries. This shows the process is powerful but computationally hungry.
Surprising findings:
- Make-harder-than-you effect: Some models created problems they—and even stronger peers—found tougher. That means generation can outrun immediate solving capability, a promising sign for building future curricula.
- Reasoning strength matters in the evolver: Evolution driven by stronger reasoning models tended to induce bigger and more transferable difficulty increases across different solvers.
- Main bottleneck: Most failed attempts were caught by the solvability checker (logical consistency), not by the difficulty checker. So making something hard is easy; making it hard and correct is the real trick.
Concrete case snippets:
- From a bounded-variance inequality to an extremal moment problem: The evolved task demanded characterizing discrete distributions achieving maxima, lifting a local trick into a full structural analysis.
- From a small-sum list puzzle to a packing optimization: The evolved problem needed balancing mode frequency, median constraints, and budget allocation—requiring code-backed enumeration and careful reasoning.
Overall, the numbers and examples say the pipeline reliably crafts mathematically sound, insight-heavy problems that push solvers out of template comfort zones—and it does so via code-driven exploration plus dual verification.
05 Discussion & Limitations
Limitations:
- Computational cost: Multiple rollouts and code checks add up. Some seeds need 10+ iterations to pass both gates.
- Solvability is the bottleneck: Most rejected attempts fail due to subtle logical or consistency issues caught by the solvability audit.
- Judge dependence: The external model as a judge is strong but not infallible; disagreements, while rare, can happen.
- Risk of artificial tweaks: Without the difficulty rubric, it’s easy to bloat numbers or algebra rather than raise conceptual depth.
- Domain breadth: While algebra/combinatorics/graph theory work well with available libraries, certain geometry or analysis niches may need specialized tooling.
Required resources:
- A controlled Python sandbox with math libraries (SymPy, Z3, NetworkX, NumPy/SciPy/mpmath).
- Access to strong LLMs for evolution and an even stronger model as an external judge.
- Compute budget for multiple rollouts and long reasoning chains by both generators and solvers.
When not to use:
- If you need instant generation at tiny compute budgets.
- If your domain requires specialized verifiers the sandbox doesn’t have (e.g., advanced geometry theorem provers) and you can’t add them.
- If your evaluation setting prizes speed or short answers over depth and elegance.
Open questions:
- Can we cut rollouts with better search (e.g., learned proposal priors, bandit allocation, or constraint-guided synthesis)?
- How can we provide stronger solvability guarantees (e.g., certified provers, proof assistants integration) without losing flexibility?
- Can the difficulty rubric be extended with automated features (e.g., measuring template-trap strength) to reduce subjective variance?
- Will this recipe generalize to other reasoning domains (physics word problems, program synthesis specs, logic puzzles)?
- Can self-evolution loops stabilize, where solvers learn from the generated curriculum and then evolve even tougher, yet still solvable, challenges?
06 Conclusion & Future Work
Three-sentence summary: The paper builds a code-powered, multi-agent system that evolves seed math problems into new ones that are logically sound and measurably harder by increasing the Burden of Discovery. It does this by separating creation from verification, combining code exploration with strict solvability and difficulty checks, and validating results with an external judge. Experiments show high solvability agreement, lower solver accuracies, and longer reasoning chains on evolved problems, confirming real difficulty gains.
Main achievement: Demonstrating that code agents, via test-time exploration and dual verification, can reliably synthesize high-quality, insight-driven math problems that challenge even strong models.
Future directions: Improve search efficiency to reduce rollouts, integrate certified proof tools for stronger guarantees, enrich the difficulty rubric with automated signals, and expand to new domains like geometry with specialized engines. Iterative self-evolution—train on the evolved set, then evolve again—could create a virtuous cycle of ever-deeper curricula.
Why remember this: It shows a practical path to scaling the 'hard problem supply'—not by making math messier, but by making it smarter. As AI learns from richer challenges, we move closer to systems that value insight, elegance, and proof—hallmarks of real mathematical thinking.
Practical Applications
- Auto-generate practice sets that emphasize insight for math clubs or Olympiad prep.
- Continuously refresh benchmarks so models can’t overfit to familiar templates.
- Create personalized difficulty ladders for students by evolving problems they’ve solved.
- Stress-test solver models with anti-template problems to reveal reasoning blind spots.
- Prototype new competition problems and filter them with solvability and difficulty checks.
- Build curriculum-aligned problem banks that scale from seed examples in textbooks.
- Use code exploration to find counterexamples and repair flawed problem drafts.
- Measure real reasoning gains by comparing solve rates and token usage on evolved sets.
- Bootstrap research datasets in domains like graph theory with automatic structure checks.
- Run self-evolution loops: train on evolved problems, then evolve tougher ones.