
ParEVO: Synthesizing Code for Irregular Data: High-Performance Parallelism through Agentic Evolution

Intermediate
Liu Yang, Zeyu Nie, Andrew Liu et al., 3/3/2026
arXiv

Key Summary

  • ParEVO teaches AI to write fast, safe parallel code for messy, irregular data like big graphs and uneven trees.
  • Instead of juggling low-level threads, it uses simple, reliable building blocks (ParlayLib primitives) that fit together like LEGO.
  • A special training set (Parlay-Instruct) only keeps code that compiles, passes tests, and runs fast, so the model learns good habits.
  • An Evolutionary Coding Agent (ECA) improves code by trial and error, using compilers, race detectors, and timers as strict judges.
  • On the ParEval benchmark, ParEVO gets big speedups, averaging about 106 times faster on many tasks, and up to 1103 times on some.
  • It also performs very well on hard graph problems, getting around 13.6 times faster, and sometimes even beating expert human code.
  • There is a real trade-off: safer code is more likely to be correct but can be slightly slower than risky, ultra-optimized tricks.
  • ParEVO focuses on shared-memory multicore machines (like big CPUs), not clusters of separate computers yet.
  • The approach makes parallel programming more accessible by hiding tricky details and letting AI learn from real machine feedback.

Why This Research Matters

Many real-world problems—finding friends-of-friends on social networks, recommending videos, planning delivery routes—live on huge, messy graphs. Making these programs safe and fast can save massive time and money and reduce energy use. ParEVO shows that AI can build reliable, parallel solutions by using safe building blocks and learning from real hardware feedback. This helps more developers harness many-core CPUs without getting lost in tricky thread bugs. Faster, safer code means snappier apps, more accurate analytics, and greener data centers. It also opens the door to new discoveries by quickly exploring algorithm variants that humans might overlook.

Detailed Explanation


01Background & Problem Definition

🍞 Top Bread (Hook): You know how group projects go faster when everyone has a clear job and doesn’t step on each other’s toes? Computers can do that too: it’s called parallel computing, where many cores work together.

🥬 Filling (The Actual Concept):

  • What it is: Parallel computing is when a program splits work among many cores so multiple parts run at the same time.
  • How it works: (1) Break a big task into pieces; (2) Give pieces to different cores; (3) Make sure they don’t collide when sharing data; (4) Combine the results.
  • Why it matters: Without parallelism, modern programs would be slow, since single-core speed isn’t growing much anymore.

🍞 Bottom Bread (Anchor): Imagine sorting a giant list: 32 kids each sort a chunk, then a teacher merges the pieces—much faster than one kid doing everything.

🍞 Top Bread (Hook): Now imagine the list isn’t neat; it’s messy and keeps changing—like a playground game where kids join and leave unpredictably.

🥬 Filling (The Actual Concept):

  • What it is: Irregular data (like sparse graphs or unbalanced trees) is data that doesn’t split evenly; how much work each piece needs depends on surprises discovered while running.
  • How it works: (1) Read a piece; (2) Discover it has few or many neighbors; (3) Rebalance who does what on the fly; (4) Keep threads from bumping into each other.
  • Why it matters: If you assign work before you know how hard each part is, some cores sit idle while others are overloaded—wasting time.

🍞 Bottom Bread (Anchor): In a graph of friends, some kids are super popular (many edges) while others are not—if one core gets all the popular kids, it gets swamped.

🍞 Top Bread (Hook): You know how two people trying to write on the same whiteboard spot at once can smudge each other’s writing?

🥬 Filling (The Actual Concept):

  • What it is: A race condition is when two threads change or check the same data at almost the same time, making results unpredictable.
  • How it works: (1) Threads read data; (2) A thread updates it; (3) Another thread reads before or after in the wrong order; (4) The program gets wrong answers.
  • Why it matters: Races cause sneaky bugs that are hard to see and can crash or corrupt results, especially with irregular data.

🍞 Bottom Bread (Anchor): Two kids both erasing and rewriting a math answer at once leads to nonsense on the chalkboard.

🍞 Top Bread (Hook): Imagine a hallway where two kids block each other by standing in each other’s way forever.

🥬 Filling (The Actual Concept):

  • What it is: A deadlock is when threads wait on each other in a loop, so nobody can move.
  • How it works: (1) Thread A holds Lock 1 and wants Lock 2; (2) Thread B holds Lock 2 and wants Lock 1; (3) Both wait; (4) Program freezes.
  • Why it matters: Deadlocks stop progress completely and are a common trap in hand-written concurrent code.

🍞 Bottom Bread (Anchor): Two kids both saying “You go first!” and neither moving—forever.

🍞 Top Bread (Hook): If chores are unpredictable, a smart parent lets kids grab the next chore when they finish, instead of assigning everything in advance.

🥬 Filling (The Actual Concept):

  • What it is: Dynamic scheduling and work-stealing are ways to keep cores busy by letting them snatch new jobs (or “steal” from busy neighbors) when they finish.
  • How it works: (1) Start with tasks in multiple queues; (2) Each worker takes from its own queue; (3) If idle, steal tasks from others; (4) Repeat until done.
  • Why it matters: It balances uneven work, which is crucial for irregular problems.

🍞 Bottom Bread (Anchor): If one kid gets a super-long math sheet, others help by taking some pages so everyone finishes together.

🍞 Top Bread (Hook): Instead of teaching every kid how to build a clock from gears, give them reliable LEGO blocks that snap together.

🥬 Filling (The Actual Concept):

  • What it is: High-level parallel primitives (like map, filter, scan, reduce, sort) are safe building blocks that hide tricky scheduling and synchronization details.
  • How it works: (1) Choose a primitive (e.g., map applies a function to every item); (2) The library runs it in parallel correctly; (3) You combine primitives to make full algorithms; (4) The system scales across cores.
  • Why it matters: Fewer low-level mistakes and easier, provably scalable programs.

🍞 Bottom Bread (Anchor): Using ParlayLib primitives is like snapping together LEGO Technic pieces to quickly build a sturdy machine.

The world before ParEVO looked like this: LLMs learned mostly from sequential code and regular patterns. When asked to parallelize irregular problems, they often sprinkled naive parallel-for loops everywhere or used big locks that slowed everything down. They tripped on race conditions, created deadlocks, or produced code that compiled but didn’t scale. Researchers tried teaching models low-level threads and locks, adding OpenMP hints, or translating sequential code to GPU code, but they lacked a strong safety net: models weren’t getting reliable, real-machine feedback telling them when they made subtle concurrency mistakes. What was missing was a way to (a) teach the model safe, composable building blocks, and (b) close the loop with hard, truthful feedback from compilers, race detectors, and performance profilers.

ParEVO fills this gap by (1) building a carefully verified training set that only includes code that compiles, passes tests, and runs well; (2) fine-tuning models to think in terms of high-level primitives (ParlayLib for C++ and safe Rust patterns); and (3) adding an Evolutionary Coding Agent that tries, checks, times, and repairs code repeatedly based on real feedback. Why should you care? Because so many everyday systems—from social networks to maps to recommendation engines—are giant, irregular graphs. Making these run faster means quicker answers, lower costs, and greener computing, all while avoiding those sneaky parallel bugs that can break systems in surprising ways.

02Core Idea

🍞 Top Bread (Hook): Imagine teaching a robot chef to cook safely with pre-approved tools, then having them taste-test and improve each dish, round after round, until it’s both delicious and quick to make.

🥬 Filling (The Actual Concept):

  • What it is: The key insight is to train AI to build parallel programs from safe, high-level primitives and then evolve its code using strict, real-machine feedback until it’s both correct and fast.
  • How it works: (1) Curate a dataset where code must compile, pass tests, and be efficient; (2) Fine-tune models to think in ParlayLib/Rust primitives; (3) Run an Evolutionary Coding Agent (ECA) that generates many versions, compiles them, checks for data races, times them, and uses the results to make the next versions better; (4) Pick the best-performing, correct program.
  • Why it matters: Without safe primitives and hard feedback, LLMs often write code that looks smart but breaks under parallel pressure.

🍞 Bottom Bread (Anchor): Instead of guessing a perfect recipe in one try, the chef iterates with tasting and feedback, ending up with a reliably great, fast dish.

Three analogies for the same idea:

  1. City traffic planner: Use proven building blocks (one-way streets, roundabouts) to prevent crashes, then simulate rush hour and tweak lights until traffic flows smoothly.
  2. Sports team: Teach core plays (map, reduce, scan) everyone knows by heart, then review game footage (profilers and race detectors) to fix mistakes and improve speed.
  3. LEGO engineering: Snap together safe components that never jam, test under load, replace weak parts, and repeat.

Before vs After:

  • Before: LLMs pushed parallel-for everywhere or used big locks, causing slowdowns, races, or deadlocks—especially on irregular data.
  • After: LLMs compose ParlayLib primitives that are safe by design, then evolve code with compilers, sanitizers, and profilers acting as truth-tellers, yielding correct and fast solutions.

🍞 Concept Sandwich: Work-span parallel primitives

  • Hook: Like chore charts that split total work fairly and keep the longest part short.
  • What: A design style using primitives (map, filter, scan, reduce, sort) that control both the total amount of work and how long the longest chain of steps is.
  • How: (1) Pick primitives that split work; (2) Combine them to keep parallel steps balanced; (3) Avoid long dependency chains; (4) Let the runtime schedule efficiently.
  • Why: If you don’t manage both total work and the longest chain, speedups stall, especially with irregular tasks.
  • Anchor: Building a line at a theme park with multiple ticket windows (work) and short processing chains (span) gets everyone through fast.

🍞 Concept Sandwich: Compiler feedback loop

  • Hook: Like a teacher who instantly marks wrong answers and explains what broke.
  • What: A process where compiled errors, test failures, race warnings, and timing numbers tell the AI exactly what to fix next.
  • How: (1) Generate code; (2) Compile and run; (3) Collect errors, race reports, and runtime; (4) Use feedback to repair and optimize.
  • Why: Without honest feedback, the AI keeps repeating invisible mistakes.
  • Anchor: If your model sees “data race at line 53,” it knows to guard that write properly before trying again.

🍞 Concept Sandwich: Fine-tuning LLMs for primitives

  • Hook: A piano student practices scales before complex songs.
  • What: Extra training teaches the model how ParlayLib/Rust primitives behave and how to combine them safely.
  • How: (1) Show verified examples; (2) Penalize incorrect or slow patterns; (3) Reward correct, fast solutions; (4) Repeat until those habits stick.
  • Why: Without this, the model reverts to unsafe, familiar patterns (like naive loops or giant locks).
  • Anchor: After fine-tuning, the model reaches for sort_inplace with the right comparator instead of guessing an API.

🍞 Concept Sandwich: Parlay-Instruct Corpus

  • Hook: Like a workbook where every answer has been checked by the teacher.
  • What: A dataset of over 13,800 parallel tasks where only code that compiles, passes tests, and runs efficiently is kept.
  • How: (1) Create tasks and code; (2) Compile and unit test; (3) Keep only verified pairs; (4) Include performance comparisons to teach speed.
  • Why: If you train on broken or slow code, you learn bad habits.
  • Anchor: An example asks to filter even numbers with a specific delayed primitive and includes a hidden test that must pass.

🍞 Concept Sandwich: Correctness-speed trade-off

  • Hook: Wearing a helmet makes you a bit heavier, but much safer.
  • What: Models fine-tuned for safety often avoid risky low-level tricks, making their code more reliable but sometimes slightly slower.
  • How: (1) Learn safe APIs; (2) Prefer stable primitives; (3) Get higher pass rates; (4) Occasionally lose peak speed vs. razor-thin optimizations.
  • Why: Without safety, you may get blazing speed one run and a crash the next.
  • Anchor: In graphs, safer patterns reached higher correctness rates even if raw atomics were sometimes faster for niche cases.

Why it works (intuition): ParlayLib’s primitives lock in correct parallel structure, so the model doesn’t need to reinvent synchronization. The ECA brings reality checks—compilers, race detectors, profilers—so the model can’t fool itself. Together, they reduce the problem from “write flawless low-level concurrent code” to “assemble safe blocks and tune them with real feedback,” which is much easier for LLMs to master.

03Methodology

High-level flow: Natural-language task → Stage 1 (Make and verify training data) → Stage 2 (Fine-tune models on safe primitives and performance) → Stage 3 (Evolve code with compilers, race detectors, and profilers) → Final fast, correct program.

Stage 1: Data Synthesis through Evolutionary Search

  • What happens: Build the Parlay-Instruct corpus and performance pairs so the model learns both correctness and speed.
  • Why this step exists: LLMs need clean, trusted examples. If you feed them broken or slow code, they copy those mistakes.
  • Example with actual data: Training examples cover ParlayLib primitives (map, filter, scan, reduce, sort), DMOJ problem variations, and slow-fast code pairs that clearly show which version is faster.

🍞 Concept Sandwich: Seed generation and mutation

  • Hook: Start with a few good recipes, then create many tasty variations.
  • What: Write hundreds of golden examples, then mutate types, constraints, or algorithms to grow coverage.
  • How: (1) Type changes (like int to string) stress templates; (2) Constraint changes (e.g., filter odds then sort) force composing primitives; (3) Algorithmic changes (e.g., reduce to scan) build breadth.
  • Why: Without diverse seeds, the model only learns a narrow slice and fails on new shapes of problems.
  • Anchor: Turning “sum numbers” into “prefix sums” trains the model to reach for scan when needed.

🍞 Concept Sandwich: The critic loop (accept only correct, tested code)

  • Hook: A bouncer only lets in people with proper tickets.
  • What: Only keep (problem, code) pairs that compile and pass unit tests.
  • How: (1) Generate code; (2) Compile with ParlayLib; (3) Run test; (4) Discard if anything fails.
  • Why: This guarantees the dataset teaches good habits.
  • Anchor: A filtering task that returns exactly 50,000 even numbers from 0 to 100,000 must pass a hidden test before it’s kept.

🍞 Concept Sandwich: Performance comparison pairs

  • Hook: Two runners race; you learn which form is faster.
  • What: Present two working solutions and label which one is faster so the model learns to prefer better patterns.
  • How: (1) Generate and optimize solutions; (2) Keep pairs where the optimized version is clearly faster; (3) Randomize order so there’s no bias.
  • Why: Without speed labels, the model might pick correct but slow code.
  • Anchor: Map-scan-write (preallocate, then fill) beats pushing into a vector with re-allocations.

Rust support: Because Rust parallel patterns are rarer, the system supplies Rust-Parlay-like primitives in the prompt and builds an evolutionary dataset from real execution logs (including compiler errors). That lets the model learn both how to fix errors and how to speed up safe Rust code.

Stage 2: Fine-Tuning Models for ParlayLib and Rust

  • What happens: Fine-tune DeepSeek (6.7B), Qwen3 (30B), and Gemini-2.5 models so they naturally speak the language of safe primitives and performance.
  • Why this step exists: Out-of-the-box models carry a “sequential bias” and guess APIs. Fine-tuning aligns them with ParlayLib/Rust semantics.
  • Example with actual data: The base model failed a complex-number sort by giving the wrong comparator; the fine-tuned model correctly used sort_inplace with a safe lambda and ran 17.5 times faster.

Training tricks:

  • Lightweight adapters (like LoRA/QLoRA) keep costs low while specializing the model.
  • Preference training pushes the model away from known-bad and slow patterns.
  • The result: Much higher rates of compiled, correct, efficient code.

Stage 3: Evolutionary Coding Agent (ECA)

  • What happens: At inference time, don’t trust a single guess. Generate, compile, test, profile, repair… then repeat. Keep the best survivor.
  • Why this step exists: Parallel code has subtle bugs and performance cliffs. Reality checks catch them; iteration fixes them.
  • Example with actual data: With 30 evolutionary rounds, speed improved by about 2.2 times over the first valid solution.

🍞 Concept Sandwich: Dynamic race detection

  • Hook: A real-time referee calls out fouls as they happen.
  • What: Tools that run your program and loudly flag when two threads touch memory unsafely.
  • How: (1) Run code under a detector; (2) If a race shows up, mark the candidate as failed; (3) Use the error to repair; (4) Try again.
  • Why: Races are invisible in text; only running on real inputs exposes them.
  • Anchor: If a BFS tries to mark “visited” without protection, the detector catches it and the agent switches to a safe pattern.

🍞 Concept Sandwich: MAP-Elites diversity

  • Hook: Don’t train only sprinters—also keep high-jumpers and marathoners.
  • What: Keep a diverse set of promising code versions across different shapes (length, complexity, synchronization style).
  • How: (1) Archive candidates into bins; (2) Keep top performers per bin; (3) Evolve from both best and diverse picks; (4) Avoid getting stuck.
  • Why: Without diversity, the search narrows too soon and misses better designs.
  • Anchor: One code path might be short and lock-free, another longer with careful atomics—keeping both can lead to a breakthrough hybrid.

Supported languages and benchmarks:

  • Languages: C++ with ParlayLib; Rust with Rayon plus a Parlay-like helper layer.
  • Benchmarks: ParEval (parallel code tasks), PBBS (expert C++ graph/geometry), RPB (expert Rust), and DMOJ problems (held-out tasks). All run on many-core CPUs to test scaling and irregular workloads.

Secret sauce in a sentence: Use safe parallel LEGO blocks, train on only-verified builds, and let evolution guided by real hardware feedback do the last-mile polishing.

04Experiments & Results

The Test: What did they measure and why?

  • They measured three things that matter in real life: (1) Does the code compile on the first try (Build@1)? (2) Does it pass tests (Pass@1)? (3) How fast does it run compared to known baselines (Speedup)? These map to developer happiness: fewer build errors, fewer bugs, and faster results.

The Competition: Who was compared?

  • ParEVO’s fine-tuned models (Gemini-2.5-Parlay, DeepSeek-Parlay, Qwen-Parlay/Rust) were compared to strong commercial and open-source models, including GPT-5-Thinking and Gemini-3-Pro. It also compared to expert-written human code from PBBS (C++) and RPB (Rust) to see if AI could match or beat handcrafted solutions.

The Scoreboard with context:

  • On ParEval, Gemini-2.5-Parlay achieved an average speedup of about 106 times, with a maximum of about 1103 times on specific tasks. Think of it like showing up to a bike race in a sports car: the average lap is wildly faster, and on one track it absolutely flies.
  • For hard graph problems (highly irregular), ParEVO achieved around 13.6 times speedup—still a huge gain even where parallelism is tricky. That’s like going from waiting minutes to getting your answer in seconds.
  • Build and pass rates improved strongly after fine-tuning, which means the model didn’t just imagine code—it produced code that actually compiled and worked. This is crucial for practical usability.
  • Against expert human baselines, ParEVO matched or beat performance in several cases. On Maximal Independent Set (a tough graph kernel), it achieved up to about 4.1 times speedup over an expert Rust baseline by choosing a better parallel strategy and tuning memory use.

Scaling behavior:

  • ParEVO-generated code showed strong scaling on up to 64 cores in regular tasks like a Discrete Fourier Transform, approaching near-linear speedup (e.g., close to 40 times). This shows the building-block approach and safe patterns let the runtime keep all cores busy without tripping over synchronization.

Surprising findings:

  1. Safety vs. peak speed: Fine-tuning for safety significantly improved correctness (more tasks passed), but the very fastest versions sometimes used risky low-level atomics. The result: safer code is steadier and more likely to be correct, but can be a bit slower at the extreme edge.
  2. Evolution helps a lot: Letting the ECA iterate and use real-machine feedback often doubled performance over the first valid solution. Iteration plus truth-telling tools beats one-shot guessing.
  3. Strategy discovery: The system rediscovered and applied sophisticated ideas like direction-optimizing BFS (switching between push and pull phases) and memory preallocation with map-scan-write to avoid lock contention and vector re-allocations.
  4. Domain pitfalls: In some geometry tasks, the fine-tuned model occasionally hallucinated non-existent APIs (like a fake convex hull function). The compiler feedback was essential to catch and correct these.

What this means in plain language:

  • ParEVO doesn’t just write code that looks right; it writes code that actually builds, runs, scales, and wins speed races—especially when the data is messy and unpredictable. When it stumbles (like making up a function), the compiler/race detectors blow the whistle, and evolution fixes it. The numbers—huge average speedups, strong build/pass rates, and head-to-head wins against expert code—show that safe building blocks plus evolutionary feedback make LLMs much more reliable engineers for parallel programs.

05Discussion & Limitations

Limitations:

  • Scope: ParEVO targets shared-memory multicore machines (like big many-core CPUs). It does not yet handle distributed systems where data lives across separate computers and messages must be sent around. Those worlds require extra tricks for data partitioning and communication.
  • Cost at inference time: The Evolutionary Coding Agent compiles and runs many candidates. That uses more compute up front. For long-lived HPC kernels that run millions or billions of times, this cost is usually worth it; for tiny, one-off scripts, it may not be.
  • Hallucinations in niche domains: In areas like specialized geometry, the model can confidently guess wrong APIs. That’s why the compiler and tests are always in the loop—but it still means you need a good toolchain to keep it honest.
  • Peak-performance trade-offs: Safety-focused fine-tuning leads to higher reliability but can leave a bit of raw speed on the table compared to hand-tuned, risky atomics in a few cases.

Required resources:

  • A proper build and run environment with ParlayLib (for C++) or Rust Rayon plus helper primitives, dynamic race detection tools, and a profiler. Multi-core hardware is needed to see speedups. For best fine-tuning results or larger searches, access to capable GPUs/CPUs is helpful.

When not to use it:

  • Very small tasks where compile-run-search overhead dwarfs runtime.
  • Strict real-time systems where you cannot afford any exploratory runs during development.
  • Fully distributed workloads (MPI/PGAS) where communication dominates; ParEVO doesn’t yet target that regime.

Open questions:

  • How to extend this to clusters and GPUs while keeping the same safety and composability benefits?
  • Can we blend formal verification tools with evolutionary search to prevent certain classes of mistakes before running any code?
  • How to automatically choose between the super-safe primitives and carefully controlled low-level atomics to get both reliability and more peak speed?
  • Can we generalize the performance-learning to new libraries and domains (e.g., GraphIt, Kokkos, or custom internal frameworks) without retraining from scratch?
  • How can we reduce the inference-time search cost while preserving big speedups—perhaps via smarter priors or caching learned fixes?

06Conclusion & Future Work

Three-sentence summary:

  • ParEVO teaches AI to write parallel programs using safe, high-level building blocks and then improves them by evolving code with real compiler, race detector, and profiler feedback.
  • A verified training corpus, specialized fine-tuning, and an Evolutionary Coding Agent together yield code that compiles more often, passes tests more reliably, and runs dramatically faster—especially on irregular data.
  • The system matches or beats strong models and expert baselines on major benchmarks and reveals a practical safety-versus-peak-speed trade-off.

Main achievement:

  • Showing that safe primitives plus evolutionary, execution-based feedback can turn general LLMs into dependable parallel programmers for messy, irregular workloads—achieving huge average speedups and robust correctness.

Future directions:

  • Expand beyond shared-memory CPUs to distributed systems and accelerators; integrate formal verification to prevent certain errors upfront; and learn smarter policies that balance safety with peak speed automatically.

Why remember this:

  • ParEVO demonstrates a blueprint for AI-driven performance engineering: teach with clean, verified examples, ground decisions in principled abstractions, and close the loop with real-machine truth. This turns parallel programming from a fragile art into a repeatable, teachable process that everyday developers—and their AI assistants—can actually use.

Practical Applications

  • Speed up graph analytics pipelines (BFS, connected components, spanning forests) in social network or fraud analysis.
  • Integrate an ECA-powered performance bot in CI to auto-tune hot kernels after each code change.
  • Use ParEVO as a teaching assistant that turns sequential student code into safe parallel versions using primitives.
  • Migrate legacy C++ graph code to ParlayLib-based implementations for predictable scaling on many-core CPUs.
  • Generate safe, parallel Rust versions of performance-critical sections to leverage memory safety with speed.
  • Optimize competitive programming or coding contest solutions on messy inputs by evolving faster variants.
  • Build internal code assistants that prefer safe primitives and auto-repair race conditions using sanitizer feedback.
  • Automate performance regression triage by proposing evolved patches that restore or improve speed.
  • Prototype multiple algorithmic strategies (e.g., push vs. pull BFS) and pick the best empirically.
  • Create performance-aware templates for common irregular tasks (map-scan-write patterns) for reuse across teams.
#ParEVO #ParlayLib #irregular-parallelism #evolutionary-coding-agent #dynamic-race-detection #work-span-primitives #LLM-fine-tuning #performance-profiling #MAP-Elites #parallel-algorithms #graph-processing #high-performance-computing #C++-parallelism #Rust-parallelism #code-synthesis