CADEvolve: Creating Realistic CAD via Program Evolution
Key Summary
- AI models that generate CAD designs have mostly learned from simple “draw-then-extrude” examples, so they struggle with real, complex parts.
- CADEvolve grows CAD programs step by step, like evolution: propose a change, build it, check it, and keep only the good ones.
- It starts from 46 simple hand-written seeds and expands into 7,945 validated parametric generators that cover the full CadQuery operation set (extrude, revolve, loft, sweep, fillet, chamfer, shell, booleans, patterns).
- From these generators, the team produced a multi-level dataset with millions of executable scripts and matching shapes, then cleaned and standardized them (centered, scaled, and number-quantized).
- They also made many safe code rewrites so models don’t just memorize one template but truly learn how to build shapes.
- Training a vision-language model on this dataset and then fine-tuning it with reinforcement learning sets a new state of the art on Image2CAD across DeepCAD, Fusion 360, and MCB.
- The approach uses strict checks (does the code run? is the shape watertight? do the renders match the text?) to keep only solid, realistic designs.
- This pipeline creates high-quality, diverse CAD data without relying on hard-to-get industrial histories, and it can help future Text2CAD, PC2CAD, and editing tasks too.
- CADEvolve is released as code, a three-tier dataset (G, P, C), and a trained model so others can build on it.
Why This Research Matters
Better CAD datasets mean AI can help designers create stronger, lighter, and safer products faster. With full operation coverage, models can finally learn real multi-step workflows instead of just boxes and extrudes. This can shorten design cycles for things we use every day, from appliances to bikes, reducing costs and mistakes. It also makes designs more editable, so engineers can tweak details without starting over. Because CADEvolve is open and standardized, researchers and companies can build on it fairly and reproducibly. Over time, the same ideas can boost other tasks like Text2CAD and CAD editing, making design tools more natural and powerful.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine building with LEGO. If your pieces are only basic blocks, you can make boxes and towers, but cool cars with smooth curves and fancy wheel wells are hard. AI that learns CAD from only simple building steps has the same problem.
🥬 The Concept: Before this paper, the field relied on public CAD datasets that mostly show simple 2D sketches extruded into 3D. How it works (before):
- Collect user histories or auto-generated programs that mainly do sketch→extrude.
- Train models to turn pictures or text into similar code.
- Get good boxes and plates, but struggle with complex shapes (revolve, sweep, fillet, chamfer, shell, loft, and local patterns). Why it matters: Without richer steps, AI can’t learn real engineering workflows or compose many operations safely.
🍞 Anchor: Think of trying to make a bike helmet using only stacking squares. You’ll get a blocky lump, not a smooth, safe helmet shape.
🍞 Hook: You know how following only one recipe makes your cooking dull? CAD models trained on one recipe (sketch→extrude) end up bland.
🥬 The Concept: The problem researchers faced: a data bottleneck. Public CAD histories either lack complex operations or don’t release multi-operation build logs. How it works:
- Existing corpora focus on prismatic parts and extrusions.
- Mesh-only sets (like ShapeNet/ABC) don’t have the step-by-step recipe.
- One-shot VLM prompting often returns simple or invalid code because it lacks deep 3D grounding. Why it matters: With limited ingredients and no instructions, models can’t learn to cook “real meals” (industrial-grade parts).
🍞 Anchor: If you only see how to make pancakes, you won’t learn soufflés. Models only seeing extrudes won’t learn lofts or sweeps.
🍞 Hook: Imagine asking a friend for a whole perfect cake in one go, versus baking layer by layer and checking taste in between.
🥬 The Concept: Prior failed attempts tried single-pass code generation using frozen VLMs or rule-based generators. How it works:
- Prompt a big model once → get simple code.
- Or write rigid rules → avoid crashes but stay simple.
- Retrieval helps a bit but still narrow operator sets. Why it matters: Without iteration and checks, code stays basic or breaks.
🍞 Anchor: Asking for a castle in one click rarely works; building room by room with inspections does.
🍞 Hook: Imagine a science fair project that tests, fixes, and repeats, getting better each round.
🥬 The Concept: The missing piece was evolution with verification—a loop that proposes edits, builds them, and filters by strict tests. How it works:
- Start with a few good seeds.
- Propose changes using a VLM.
- Keep only designs that compile, are watertight, and match their descriptions. Why it matters: This grows complexity safely and teaches models full CAD toolboxes.
🍞 Anchor: Like a gardener selecting the healthiest sprouts, the pipeline keeps only the strongest CAD programs.
🍞 Hook: Why should you care? Because everyday things—from phone cases to bike parts—are CAD-made.
🥬 The Concept: Real stakes: better data → better AI CAD tools. How it works:
- Complex, validated scripts teach models reliable multi-step design.
- Standardized code and sizes prevent memorization of quirks.
- Results transfer to real benchmarks and can aid future tasks (Text2CAD, editing). Why it matters: Faster design cycles, fewer errors, and more creative, safe parts.
🍞 Anchor: Imagine an app that turns a few pictures of a toy car into editable, high-quality build steps you can tweak—wheels bigger, body smoother—without starting from scratch.
02 Core Idea
🍞 Hook: You know how you perfect a drawing by sketching, checking, erasing, and improving, instead of nailing it in one stroke?
🥬 The Concept: The “Aha!” is to evolve CAD programs through propose→execute→filter, keeping only those that compile, make valid solids, and visually match their intent. How it works:
- Begin with 46 simple, hand-made CadQuery seeds.
- A vision–language model proposes edits or new children.
- Each child is built and tested in stages; failures trigger targeted fixes.
- Survivors become parents for the next round, growing complexity. Why it matters: This escapes the “extrude-only” trap and builds realistic multi-operation recipes.
🍞 Anchor: Like leveling up a video game character by passing challenges, CAD programs “level up” by surviving strict checks.
🍞 Hook: Imagine learning to cook by gradually adding techniques—chopping, sautéing, baking—while tasting after each step.
🥬 The Concept: Multiple analogies for the same idea:
- Gardening: plant seeds → select healthiest sprouts → cross-pollinate → richer garden.
- LEGO city: start with blocks → add roads, arches, curves → test stability → expand.
- Writing: draft → proofread → revise → publish only what passes checks. Why it matters: Iteration with testing builds mastery, not just lucky one-shots.
🍞 Anchor: You wouldn’t publish a book without editing; CADEvolve doesn’t “publish” a CAD program without thorough checks.
🍞 Hook: Before vs. after is like having only straight rulers vs. owning a full toolbox with compasses, files, and clamps.
🥬 The Concept: Before vs After:
- Before: Mostly sketch→extrude, low variety, brittle code, poor multi-step composition.
- After: Full CadQuery operation coverage (extrude, revolve, loft, sweep, fillet, chamfer, shell, booleans, patterns) with validated, editable programs. Why it matters: Models learn richer strategies and handle real-world parts better.
🍞 Anchor: A model trained only on boxes makes boxy mugs; with fillet/sweep, it makes comfy, curved handles.
🍞 Hook: Think of why the method works: safe practice with feedback beats guessing.
🥬 The Concept: Why it works:
- Tight feedback loop: compile check, geometry validity, visual–text agreement.
- Parametric generators: one program spans many shapes, teaching general rules.
- Canonicalization: unify code style, center/scale shapes, round numbers → model learns construction logic, not noise. Why it matters: Clear signals + variety = robust learning.
🍞 Anchor: Like learning music on a tuned instrument with clean sheet music, not a warped guitar and smudged notes.
🍞 Hook: Big ideas are built from small blocks you can stack confidently.
🥬 The Concept: The building blocks:
- Seeds: 46 hand-written generators to cover key operations.
- Propose–Execute–Filter loop to grow CADEvolve-G (7,945 generators).
- Sampling to get concrete programs (CADEvolve-P), with diversity.
- Rewrites and predictions to expand variety, then canonicalize to CADEvolve-C.
- Train a VLM first with supervision, then improve with RL rewards. Why it matters: A pipeline from simple starts to state-of-the-art Image2CAD models.
🍞 Anchor: It’s like starting with the alphabet, making words, then stories, and finally publishing a book series that readers love.
03 Methodology
🍞 Hook: Imagine building a robot: first gather parts, then assemble, test, fix, and repeat until it works great.
🥬 The Concept: High-level recipe: Input (seed CAD programs and a VLM) → Propose edits → Build and validate → Keep the good ones → Sample many instances → Standardize code/shape → Train a model → Fine-tune with rewards. Why it matters: Each stage removes noise and adds reliable complexity, so the final model learns the real craft of CAD.
🍞 Anchor: Like baking cookies, you pick ingredients, mix, taste the dough, adjust, bake, and frost—each step has a purpose.
🍞 Hook: You know how you start drawings from simple shapes? Same here.
🥬 The Concept: Seeds (CadQuery, parametric generators). What it is: 46 hand-written, parametric programs that produce single, watertight solids and already use many operations (extrude, revolve, loft, sweep, fillet, chamfer, shell, booleans, patterns). How it works:
- Each function takes named parameters (like width, radius) and builds a part.
- They are diverse anchors covering the full operator toolbox.
- They set style for parameter exposure and safe defaults. Why it matters: Without strong seeds, evolution might get stuck making only boxes.
🍞 Anchor: A “gear generator” seed can make many gears by just changing tooth count or radius.
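The “gear generator” idea can be sketched in plain Python. This is an illustrative stand-in, not actual CADEvolve code: a real seed would return a CadQuery solid, and the parameter names and defaults here are assumptions.

```python
import math

def gear_generator(tooth_count: int = 12, outer_radius: float = 20.0,
                   thickness: float = 5.0) -> dict:
    """Sketch of a parametric seed: named parameters with safe defaults.

    In the real pipeline each seed is a CadQuery function producing a
    single watertight solid; here we only derive and validate the numbers
    such a seed would use (illustrative stand-in, not CADEvolve code).
    """
    # Validation plus safe defaults keep evolved children from being broken.
    if tooth_count < 3:
        raise ValueError("a gear needs at least 3 teeth")
    if outer_radius <= 0 or thickness <= 0:
        raise ValueError("dimensions must be positive")
    tooth_angle = 2 * math.pi / tooth_count          # angular pitch per tooth
    tooth_width = 2 * outer_radius * math.sin(tooth_angle / 4)
    return {
        "tooth_count": tooth_count,
        "outer_radius": outer_radius,
        "thickness": thickness,
        "tooth_angle": tooth_angle,
        "tooth_width": tooth_width,
    }
```

Calling it with different parameters (`gear_generator(24, 30.0)`) yields a different gear from the same seed, which is exactly what makes one generator span many shapes.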
🍞 Hook: Think of kids suggesting changes to a LEGO car, then testing it on a ramp.
🥬 The Concept: Propose–Execute–Filter loop. What it is: An evolution cycle where a VLM proposes children, they are built, checked, fixed if needed, and accepted only if they pass. How it works:
- Sample K parents (to recombine ideas).
- The VLM proposes child metadata (name, short/long description, parents) and code via retrieval-augmented synthesis (uses similar examples to guide code).
- Validation happens in stages: compile/run → geometry integrity (single, watertight solid) → visual–text agreement (multi-view renders checked against descriptions). Targeted repairs are attempted if a stage fails. Why it matters: This guards against broken or trivial designs and steadily increases complexity.
🍞 Anchor: Only the LEGO cars that roll straight, don’t fall apart, and match their blueprint get to enter the next race.
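The staged validation above can be sketched as a short helper. The three checks below are toy stand-ins for the real compile, watertightness, and visual-text tests, which execute CadQuery and render multi-view images:

```python
def staged_validate(program: str, checks) -> tuple[bool, str]:
    """Run validation stages in order; stop at the first failure.

    `checks` is a list of (name, fn) pairs, where fn(program) -> bool.
    The stage name of the first failure is returned so a targeted
    repair can be attempted, mirroring the pipeline's staged gating.
    """
    for name, check in checks:
        if not check(program):
            return False, name       # which stage failed -> targeted repair
    return True, "accepted"

# Toy stand-in checks (the real pipeline runs code, meshes, and renders).
checks = [
    ("compile", lambda p: "result" in p),
    ("watertight", lambda p: "open_shell" not in p),
    ("visual_match", lambda p: len(p) > 10),
]
```

Only programs that clear every stage become parents for the next round.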
🍞 Hook: Imagine one recipe that can bake many cookies by changing chocolate chip count or size.
🥬 The Concept: Parametric generators → concrete programs. What it is: Turn each general recipe into many specific, executable scripts. How it works:
- For each generator, search for 15 diverse, valid parameter sets using a quality–diversity objective and CMA-ES (a black-box optimizer).
- Trace the executed code once (resolve branches/loops) and emit a flat, minimal CadQuery script that reproduces the same solid and exposes the build history.
- Standardize imports and the final result variable. Why it matters: Training needs many, varied, runnable examples that share the same underlying logic.
🍞 Anchor: One “box with holes” generator becomes many different tool blocks by changing sizes and hole patterns.
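The quality-diversity parameter search can be sketched with plain random search standing in for CMA-ES (an assumed simplification for brevity; the parameter names, feature descriptor, and spacing threshold are illustrative, not from the paper):

```python
import random

def sample_diverse_params(build, n_sets=15, trials=500, seed=0):
    """Search for valid, mutually diverse parameter sets for one generator.

    The paper optimizes a quality-diversity objective with CMA-ES; this
    sketch uses random search instead. `build` returns a scalar feature
    (e.g. a bounding-box aspect ratio) or raises on invalid parameters.
    """
    rng = random.Random(seed)
    kept = []                                  # (params, feature) pairs
    for _ in range(trials):
        params = {"width": rng.uniform(1, 50), "height": rng.uniform(1, 50)}
        try:
            feature = build(params)            # validity check + descriptor
        except ValueError:
            continue                           # invalid parameters: skip
        # Diversity: keep only if far from every already-kept feature.
        if all(abs(feature - f) > 0.2 for _, f in kept):
            kept.append((params, feature))
        if len(kept) == n_sets:
            break
    return [p for p, _ in kept]
```

The diversity filter is what turns “15 samples” into “15 meaningfully different samples” rather than near-duplicates.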
🍞 Hook: You know how students copy notes in different handwriting? If the teacher grades by handwriting, not content, learning fails.
🥬 The Concept: Code-level augmentation and pretraining. What it is: Make many meaning-preserving rewrites so the model can’t just memorize one template. How it works:
- Ask a small LLM to rewrite scripts into different but equivalent forms; keep only those that execute to the same solid.
- Train an Image2CAD model on multi-view renders → code pairs.
- Run this model on mesh-only sets (ABC, ShapeNet) to harvest more paired (render, predicted-code) samples for coverage. Why it matters: This breaks template bias and expands variety without touching external datasets.
🍞 Anchor: Same recipe written by different chefs still bakes the same cake—so students must understand the steps, not just the font.
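The keep-only-equivalent-rewrites rule can be sketched like this. It is a stand-in that compares plain Python values; the real pipeline compares the solids produced by executing CadQuery scripts:

```python
def same_solid(original_src: str, rewrite_src: str) -> bool:
    """Keep a rewrite only if it executes to the same result.

    Stand-in: each script must assign a variable named `result`; in
    CADEvolve the comparison is between the solids the scripts build.
    """
    def run(src):
        ns = {}
        exec(src, ns)          # in CADEvolve this runs a CadQuery script
        return ns["result"]
    try:
        return run(original_src) == run(rewrite_src)
    except Exception:
        return False           # rewrites that crash are discarded
```

A rewrite that changes style but not the final solid passes; one that changes the geometry, or fails to run, is thrown away.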
🍞 Hook: Imagine sorting your toolbox, centering your work on the bench, and using a ruler with whole numbers so mistakes are fewer.
🥬 The Concept: Canonicalization and normalization. What it is: Clean and standardize code and shape scale so models learn construction, not quirks. How it works:
- Unify code style (flat, macro-like, minimal Python—only geometry-affecting calls).
- Center each shape at (0,0,0) and scale so the longest side is a fixed length.
- Quantize numbers: remove tiny epsilons and round to integers. Why it matters: Without this, the model overfits random naming, poses, and floating-point noise.
🍞 Anchor: It’s easier to learn geometry when all drawings are centered, same size, and use neat, whole-number measurements.
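The center/scale/round steps can be sketched on a list of vertices. This is a minimal illustration; the target size of 100 is an assumed value, not taken from the paper:

```python
def canonicalize(points, target=100.0):
    """Center a shape's vertices at the origin, scale the longest
    bounding-box side to `target`, and round coordinates to integers.

    Stand-in for the paper's shape normalization (target is assumed).
    """
    xs, ys, zs = zip(*points)
    mins = (min(xs), min(ys), min(zs))
    maxs = (max(xs), max(ys), max(zs))
    center = [(lo + hi) / 2 for lo, hi in zip(mins, maxs)]
    longest = max(hi - lo for lo, hi in zip(mins, maxs)) or 1.0
    scale = target / longest
    return [tuple(round((c - m) * scale) for c, m in zip(p, center))
            for p in points]
```

After this pass, every shape sits at the origin at a common size with whole-number coordinates, so two scripts that build the same geometry also look numerically alike.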
🍞 Hook: Think of training wheels that come off once you can balance.
🥬 The Concept: Training pipeline with reinforcement learning. What it is: Two-stage learning: first supervised fine-tuning (SFT) on paired data, then RL to push geometric accuracy. How it works:
- SFT: show multi-view images (6 orthographic + 1–2 isometric) and ask the VLM to generate CadQuery code.
- RL: run the generated code; if it compiles, compute IoU between predicted and target meshes and reward accordingly; if it fails, give a penalty. Optimize with GRPO/CPPO-style objectives. Why it matters: SFT teaches the language; RL teaches precision and robustness.
🍞 Anchor: First you learn the rules of chess; then you play games with feedback to get really good.
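The RL reward described above can be sketched as a function of the generated code. The voxel-set IoU and the -1.0 failure penalty below are illustrative choices, not the paper's exact reward:

```python
def cad_reward(code: str, target_voxels: set, execute) -> float:
    """IoU-style reward for generated CAD code.

    `execute` runs the code and returns a voxel set (or raises on
    failure); the voxelization and the -1.0 penalty are assumptions.
    """
    try:
        pred = execute(code)
    except Exception:
        return -1.0                    # non-compiling code is penalized
    inter = len(pred & target_voxels)
    union = len(pred | target_voxels)
    return inter / union if union else 0.0
```

The key design choice is that the reward is computed on the *executed geometry*, not the code text, so the model is pushed toward shapes that match, not strings that look right.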
🍞 Hook: What’s the special trick here?
🥬 The Concept: The secret sauce is evolution offline + strict validation + canonicalization + RL polish. How it works:
- Grow a diverse, validated corpus without needing proprietary histories.
- Remove spurious variability so models learn the essence.
- Use RL to close the last gap in geometric fidelity. Why it matters: This combination is what lifts Image2CAD to state-of-the-art.
🍞 Anchor: Like building a library of clean, well-edited books and then training speed-readers who also check their answers against the text.
04 Experiments & Results
🍞 Hook: Imagine a school contest where robots must rebuild a mystery toy from eight photos. We judge how close they got, how cleanly they worked, and how often they broke.
🥬 The Concept: The test is Image2CAD. What it is: Feed a grid of 6 orthographic + 1–2 isometric views into a VLM, and ask it to emit CadQuery code that reconstructs the target shape. How it works:
- All target shapes are centered and scaled for fairness.
- The model’s code is executed to get a predicted mesh.
- We compare predicted vs target using shape-matching scores. Why it matters: This isolates the “program-understanding-from-images” skill without text or extra encoders.
🍞 Anchor: It’s like copying a sculpture after viewing it from set angles and seeing how close your copy is.
🍞 Hook: You know how races compare against last year’s champion? We need a strong baseline.
🥬 The Concept: The competition is against a leading method called “cadrille.” What it is: A state-of-the-art Image2CAD system that also uses online RL and reports results on DeepCAD, Fusion 360 Gallery, and MCB. How it works:
- Same family of RL objectives and similar multi-view evaluation.
- We follow their protocols to keep comparisons fair. Why it matters: Beating a strong baseline shows the dataset and method truly help.
🍞 Anchor: If your new running shoes help you beat the school’s fastest runner, they probably work.
🍞 Hook: Scores are clearer with good yardsticks.
🥬 The Concept: The scoreboard uses three metrics. What it is and how it works:
- Chamfer Distance (CD): average distance between points on the two shapes (lower is better).
- IoU (Intersection over Union): how much the volumes overlap (higher is better).
- Invalid Rate (IR): fraction of generations that don’t compile or are non-watertight (lower is better). Why it matters: Together, they tell us accuracy, faithfulness, and reliability.
🍞 Anchor: Think of CD as “how far off are you on average,” IoU as “how much you colored inside the lines,” and IR as “did your pencil even work.”
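Minimal sketches of the three metrics, assuming point lists for Chamfer Distance and voxel sets for IoU (brute-force, for illustration only; real evaluations use sampled mesh surfaces and proper voxelization):

```python
import math

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two point sets (brute force)."""
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return one_way(a, b) + one_way(b, a)

def iou(vox_a: set, vox_b: set) -> float:
    """Volume overlap of two voxelized solids (higher is better)."""
    union = vox_a | vox_b
    return len(vox_a & vox_b) / len(union) if union else 0.0

def invalid_rate(outcomes) -> float:
    """Fraction of generations that failed to compile or were
    non-watertight (lower is better); `outcomes` is a list of bools."""
    return sum(1 for ok in outcomes if not ok) / len(outcomes)
```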
🍞 Hook: What did they actually try?
🥬 The Concept: Experimental ladder. What it is: A progression of training sets and regimes. How it works:
- Start with traced generator scripts (CADEvolve-P), then add code rewrites to fight template bias.
- Pretrain Image2CAD and harvest model predictions on ABC/ShapeNet to broaden variety.
- Canonicalize to CADEvolve-C (center, scale, round) and mix in diverse early sketches (from CAD-Recode) to enrich beginnings.
- Train the VLM on ever-bigger, cleaner sets; then apply RL with IoU-based rewards. Why it matters: Each rung reduces a weakness (bias, scale drift, brittle code) and boosts performance.
🍞 Anchor: It’s like practicing music with simple songs, then variations, then clean sheet music, and finally live performances with a coach.
🍞 Hook: So, who won and by how much?
🥬 The Concept: Main findings. What it is: CADEvolve-trained models achieve state-of-the-art Image2CAD across DeepCAD, Fusion 360, and MCB after RL. How it works:
- Without canonicalization, models improve but lag far behind—showing style/scale noise hurts.
- Adding ABC/ShapeNet prediction-derived data gives a big jump, showing coverage matters.
- Canonicalized large set + RL beats the strong baseline on key metrics (lower CD, higher IoU) with a slightly higher invalid rate, consistent with using richer, collision-prone operations.
- Adding MCB to the RL pool adapts to its “softer silhouette” rendering domain and notably improves MCB results while keeping other datasets competitive. Why it matters: The dataset quality, not just model tricks, drives the gains.
🍞 Anchor: It’s like switching from guessing to practicing with a well-made workbook and then getting a coach—your test scores jump across classes.
🍞 Hook: Any surprises?
🥬 The Concept: Surprising/insightful observations. What it is: A few non-obvious lessons. How it works:
- Code rewrites alone help, but not enough—canonicalization is crucial.
- Enriching first primitives with diverse sketches (from CAD-Recode) noticeably helps composition.
- Domain shifts in rendering (like MCB’s smoother edges) can be handled by targeted RL exposure. Why it matters: Data curation choices strongly shape downstream success.
🍞 Anchor: Using clearer sheet music and practicing on the actual concert piano beats practicing on a toy keyboard.
05 Discussion & Limitations
🍞 Hook: Even a great map doesn’t show every road.
🥬 The Concept: Limitations. What it is: Honest places where the approach may fall short. How it works:
- Synthetic distribution: The evolved dataset is not a perfect match to any one company’s private CAD world; operator frequencies may differ.
- CadQuery dialect: Programs target CadQuery; porting to other kernels/tools may need care and won’t always be 1:1.
- Invalids vs richness: Using complex ops can increase invalid attempts; strict validation helps, but some failure remains.
- View protocol: Image2CAD with fixed views is clean but doesn’t test text understanding or point-cloud inputs. Why it matters: Knowing boundaries helps plan safe, effective use.
🍞 Anchor: A hiking trail app with great mountain maps might be less detailed for deserts—you’d pack accordingly.
🍞 Hook: Good tools need the right shop space and power.
🥬 The Concept: Required resources. What it is: What you need to run or extend this pipeline. How it works:
- A reliable CadQuery environment and geometric validators.
- Access to a capable VLM for proposing and repairing code.
- Compute for rendering, tracing, optimization (CMA-ES), and multi-epoch SFT/RL training. Why it matters: Under-provisioning leads to slow or noisy growth.
🍞 Anchor: Baking lots of cakes needs a proper oven, not just a toaster.
🍞 Hook: When should you not use it?
🥬 The Concept: When not to use. What it is: Situations to avoid or adjust. How it works:
- If you must exactly mirror a proprietary distribution (materials, tolerances, company-specific feature semantics), synthetic evolution may mismatch.
- If your downstream CAD tool can’t map CadQuery features well, direct reuse is tricky.
- If your task requires natural language reasoning (Text2CAD) but you only train on images, add text data. Why it matters: Right tool for the job saves time.
🍞 Anchor: Don’t use a hammer to turn a screw.
🍞 Hook: Big questions spark the next adventure.
🥬 The Concept: Open questions. What it is: What we still don’t know. How it works:
- How to better steer the evolution to match specific industrial distributions?
- Can we unify CadQuery with other CAD kernels in a portable IR for broader compatibility?
- How to bring richer constraints (tolerances, mating, assemblies) into the evolution while keeping validity high?
- Can multi-modal training (images+text+point clouds) yield even bigger gains without heavy architectures? Why it matters: Solving these deepens usefulness and reach.
🍞 Anchor: Next, the gardeners want to breed roses with exact colors, not just healthy blooms.
06 Conclusion & Future Work
🍞 Hook: Picture starting with a handful of simple LEGO pieces and ending with a whole city, because you added blocks carefully, tested bridges, and fixed what wobbled.
🥬 The Concept: Three-sentence summary. What it is:
- CADEvolve grows CAD programs by proposing edits, checking execution and geometry, and keeping only valid, well-matched results.
- This creates a three-tier dataset—generators (G), concrete programs (P), and canonicalized scripts (C)—with millions of executable examples covering full CadQuery operations.
- Training a VLM on this data, then polishing with RL, achieves state-of-the-art Image2CAD on multiple public benchmarks. Why it matters: It shows that smart data growth plus strong validation beats one-shot guesses.
🍞 Anchor: Like a music student who practices clean pieces, gets feedback, and then wins the recital.
🍞 Hook: What’s the #1 achievement?
🥬 The Concept: Main achievement. What it is: An evolution-and-validation pipeline that finally delivers open, executable, multi-operation CAD histories at scale—unlocking better training for CAD code generation. Why it matters: Community models can now learn the full craft, not just draw-and-extrude.
🍞 Anchor: Giving everyone a complete cookbook—not just pancake recipes—makes better chefs.
🍞 Hook: What’s next?
🥬 The Concept: Future directions. What it is: Broaden tasks and portability. How it works:
- Extend to Text2CAD and PC2CAD with the same clean supervision base.
- Improve portability across CAD kernels via a shared intermediate representation.
- Evolve programs with constraints for assemblies and design intent. Why it matters: Wider reach and deeper realism.
🍞 Anchor: From building single houses to planning whole neighborhoods with roads and utilities.
🍞 Hook: Why remember this?
🥬 The Concept: Lasting impact. What it is: A practical recipe to create realistic CAD data when real logs are scarce—iterate, verify, standardize, and then train+RL. Why it matters: It turns limited data into a rich tutor, lifting CAD automation closer to real engineering needs.
🍞 Anchor: When you can grow your own garden of good examples, you don’t have to wait for perfect seeds to arrive.
Practical Applications
- Automated Image2CAD: Turn factory photos or drawings into editable CAD programs for reverse engineering.
- Design suggestion assistants: Propose plausible next operations (like filleting risky edges) during manual modeling.
- Rapid variant exploration: Use parametric generators to sample many safe design options (e.g., bracket sizes).
- Educational tools: Teach students full CAD workflows with step-by-step, executable examples.
- Quality control: Validate that generated CAD code compiles and produces watertight solids before manufacturing.
- Data bootstrapping: Create training corpora for Text2CAD or PC2CAD when real histories are unavailable.
- Cross-checking mesh libraries: Attach executable build recipes to mesh-only datasets for editability.
- Template de-biasing: Use code rewrites and canonicalization to ensure models learn construction logic, not style quirks.
- RL fine-tuning pipelines: Plug IoU-based rewards into CAD code generation to improve geometric fidelity.
- Operator coverage testing: Evaluate whether a model truly uses revolve/loft/sweep rather than collapsing to extrudes.