CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation
Key Summary
- CUDA Agent is a training system that teaches an AI to write super-fast GPU code (CUDA kernels) by practicing, testing, and getting rewards for correct and speedy results.
- It builds its own training problems by combining simple PyTorch operators into tougher fused tasks, then filters them so they are executable, fair, and non-trivial.
- The AI works in a safe, skill-based coding environment that lets it plan, write, compile, test, and profile CUDA code across many turns like a real engineer.
- A robust reward schedule gives clear points for passing correctness and beating both PyTorch eager and torch.compile speeds, avoiding noisy and unfair feedback.
- To keep training stable, the system warms up the model with single-turn RL, then uses Rejection Fine-Tuning for the actor and Value Pretraining for the critic before full multi-turn PPO.
- Strong anti-hacking protections (permissions, multiple-input checks, no fallbacks, careful profiling) ensure the AI earns rewards only for real speedups.
- On KernelBench, CUDA Agent achieves higher pass and faster rates than top proprietary models and shows big speedups over torch.compile, especially on fused multi-operator tasks.
- The key insight is that an agent that can act, observe real execution feedback, and learn from structured rewards can outperform static compiler heuristics for kernel optimization.
- This approach turns LLMs from code typists into performance engineers who reason about math, memory, and hardware to find faster implementations.
- The work suggests a path to automating more of GPU performance engineering, saving expert time and making AI systems run cheaper and faster.
Why This Research Matters
CUDA Agent shows that AI can learn to act like a real performance engineer, not just write code that looks right. This means faster, cheaper AI training and inference because kernels do more work per second. It reduces the need for rare, highly specialized expertise on every team by teaching an agent reusable GPU optimization skills. The approach also raises the bar for fair, reliable evaluation in AI systems that interact with the real world. As more apps require high-performance GPU compute—from vision to robotics—automated kernel optimization becomes a powerful accelerator. In the long run, this can make advanced AI capabilities more accessible, energy-efficient, and widely deployed.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook) Imagine you’re trying to make dinner faster. You could follow the recipe exactly every time (safe but not speedy), or you could learn clever shortcuts—like preheating the pan or chopping while the water boils—to serve the same meal much faster.
🥬 The Concept: CUDA kernel optimization is the practice of writing tiny, very fast programs that run on GPUs to do math for AI models. How it works (before this paper):
- Programmers or compilers translate high-level code (like PyTorch) into GPU kernels.
- Tools like torch.compile apply general-purpose rules to speed things up.
- Experts hand-tune the hardest parts by thinking about memory, threads, and GPU hardware.
Why it matters: Without strong kernel optimization, AI runs slower and costs more, because GPUs are powerful only when fed with well-optimized kernels.
🍞 Bottom Bread (Anchor) If you’ve seen a phone photo app blur a background instantly, that speed comes from optimized GPU code—without it, your app would lag.
The World Before:
- GPUs (especially NVIDIA’s CUDA) power most deep learning systems. But writing high-performance CUDA kernels requires deep knowledge of memory hierarchies, warps, blocks, and profiling tools. Most developers rely on frameworks (PyTorch) and compilers (like torch.compile) that provide solid performance but can struggle with complex operator sequences or unusual data shapes.
- Large language models (LLMs) can write general code quite well, yet when asked to produce GPU kernels that truly beat compiler baselines across many tasks, they typically fall short.
The Problem:
- Existing LLM approaches either do training-free refinement (use a few heuristics and feedback) or fine-tune within a fixed multi-turn loop. Training-free methods can only polish what the base model already knows, so gains are capped. Fixed-loop fine-tuning wastes context and restricts the agent’s freedom to search, debug, and profile like a real engineer.
- Raw speedup as a reward is noisy and biased toward easy wins. Unstable RL training leads to collapse if the model hasn’t been prepared for this niche domain.
Failed Attempts:
- Heuristic refinement pipelines improved some cases but didn’t lift the model’s core CUDA skill. It’s like giving someone better oven mitts without teaching them to cook.
- Multi-turn fine-tuning with rigid scripts taught models to follow a playbook but not to think flexibly, explore alternatives, or manage long debugging sessions.
- Some systems suffered reward hacking or unreliable timing—meaning the model “won” on paper without really running faster.
The Gap:
- What’s missing is a true agent that can: (a) practice on many high-quality, realistic tasks; (b) work in a tool-rich coding environment with safe execution and reliable timing; and (c) learn stably from rewards that reflect correctness and meaningful, comparable speed gains.
Real Stakes:
- Faster kernels mean training and serving AI gets cheaper and greener: fewer GPU-hours per job.
- Teams can ship features that feel snappier (like instant video effects), and research groups can iterate faster.
- Expert kernel engineers are rare; an AI that helps with performance engineering can amplify their impact and make advanced optimizations more accessible.
🍞 Top Bread (Hook) You know how a good teacher gives clear gold-star goals, not just “be better”? That keeps students motivated and fair across different assignments.
🥬 The Concept: Robust reward scheduling gives the AI simple, fair points for passing correctness and beating specific speed baselines. How it works:
- Check correctness on multiple random inputs.
- Compare runtime to both eager and torch.compile.
- Award milestone points only if improvements are significant (e.g., at least 5%).
Why it matters: Without robust rewards, the AI chases noisy timing or farms easy tasks, learning the wrong lessons.
🍞 Bottom Bread (Anchor) It’s like grading a race by clear tiers—bronze for finishing, silver for beating last year’s time, gold for beating the stadium record—so every runner knows what counts.
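As a rough sketch of this tiered scheme (the function name and exact point values here are illustrative, not the paper's), the milestone reward can be written as:

```python
def milestone_reward(correct: bool, t_kernel: float, t_eager: float,
                     t_compile: float, margin: float = 0.05) -> float:
    """Discrete milestone reward: negative for failing correctness, then
    extra points only for beating each baseline by at least `margin` (5%)."""
    if not correct:
        return -1.0
    reward = 1.0  # bronze: kernel compiles and is correct
    if t_kernel < t_eager * (1.0 - margin):
        reward += 1.0  # silver: significantly faster than eager
    if t_kernel < t_compile * (1.0 - margin):
        reward += 1.0  # gold: significantly faster than torch.compile
    return reward
```

Because the tiers are discrete, a 4.9% "win" and a lucky timing jitter both earn nothing extra, which is exactly what keeps the signal stable across tasks.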
🍞 Top Bread (Hook) Imagine practicing piano: short pieces first, then longer ones, then full concerts. Jumping straight to a concerto is frustrating and messy.
🥬 The Concept: Curriculum-style data synthesis creates a wide range of operator tasks—from simple to fused multi-operator ones—so the agent improves step by step. How it works:
- Crawl seed PyTorch operators.
- Use an LLM to compose up to five operators into fused tasks.
- Filter for executability, determinism, non-triviality, and fair workload.
Why it matters: Without enough diverse, well-structured practice, the agent either overfits simple cases or stalls on hard ones.
🍞 Bottom Bread (Anchor) Like building from scales to duets to symphonies, the agent learns to handle bigger, fused kernels for real speedups.
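A minimal sketch of the composition step, using plain Python callables as stand-ins for crawled PyTorch operators (the real pipeline composes torch modules such as nn.Conv2d and nn.ReLU; names below are hypothetical):

```python
import random

# Hypothetical stand-ins for crawled seed operators.
SEED_OPS = {
    "relu":   lambda x: max(x, 0.0),
    "double": lambda x: 2.0 * x,
    "square": lambda x: x * x,
    "shift":  lambda x: x + 1.0,
}

def synthesize_task(max_ops: int = 5, seed: int = 0):
    """Compose 1..max_ops randomly sampled operators into one fused task."""
    rng = random.Random(seed)
    k = rng.randint(1, min(max_ops, len(SEED_OPS)))
    names = rng.sample(sorted(SEED_OPS), k)

    def fused(x: float) -> float:
        for name in names:  # apply the chosen operators in sequence
            x = SEED_OPS[name](x)
        return x

    return names, fused
```

Each synthesized task is then filtered before entering training, so only executable, non-trivial compositions survive.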
02 Core Idea
🍞 Top Bread (Hook) You know how a great chef doesn’t just follow recipes but tastes, tweaks, and times everything in the kitchen to make dishes faster and tastier? That’s different from a cookbook that never changes.
🥬 The Concept: The key insight is to turn an LLM into an active CUDA engineer—an agent that writes, runs, profiles, and revises kernels across many turns, learning from robust rewards in a safe, skill-equipped environment plus stable RL training. How it works:
- Give the agent lots of realistic practice tasks (data synthesis).
- Let it use coding tools to iterate like a real developer (agent loop with skills).
- Provide reliable, milestone rewards for correctness and beating eager/compile.
- Stabilize learning with a multi-stage warm-up (single-turn RL, RFT actor init, critic value pretraining) and PPO.
Why it matters: Without agency, skills, and stable RL, models can’t systematically discover hardware-aware patterns that beat static compilers.
🍞 Bottom Bread (Anchor) It’s like training a robot chef who can try a dish, check a timer, read the oven’s heat, then improve the recipe—until it reliably cooks faster than the standard cookbook.
Multiple Analogies:
- Sports Team: The model (player) practices drills (tasks), uses gear (tools), gets fair scoring (rewards), and works with a coach (critic) so game-day performance (benchmarks) beats last season (torch.compile).
- Science Lab: Form a hypothesis (kernel change), run the experiment (compile and profile), record results (timings), and iterate—with safety protocols (sandbox) and clear success criteria (robust reward tiers).
- Treasure Hunt: The map (skills doc) shows good paths; milestones (reward levels) keep explorers on track; a compass (critic value function) prevents wandering; and practice on varied terrains (synthesized tasks) builds general skill.
Before vs After:
- Before: LLMs produced kernels that often ran but rarely beat compile baselines, especially for fused multi-operator tasks. Training was unstable; rewards were noisy; environments allowed hacks.
- After: An agent learns stable, reusable CUDA skills, handles long multi-turn debugging and tuning, and consistently beats torch.compile (especially on fused tasks) with trustworthy evaluation.
Why It Works (intuition, no equations):
- Real feedback closes the loop. When the model can see correctness failures and real timings, it learns patterns that matter, not just pretty code.
- Milestone rewards reduce noise and make success comparable across tasks of different difficulty.
- Warm-up stages align the model’s behavior distribution with the niche CUDA domain and teach the critic what good states look like, preventing collapse.
- Agent skills and a safe sandbox let the model act like an engineer: measure, hypothesize, change, and re-measure.
Building Blocks (each with a sandwich):
🍞 Top Bread (Hook) Imagine building with LEGO: start with simple bricks, then combine them into cool structures.
🥬 The Concept: Data synthesis pipeline creates many training problems by fusing operators. How it works:
- Crawl trusted PyTorch operators.
- Randomly compose up to five into a single fused task.
- Filter for executability, determinism, non-triviality, and fair runtime.
Why it matters: Without rich, scaled practice, the agent can’t generalize optimization skills.
🍞 Bottom Bread (Anchor) Like combining wheels, axles, and plates to make a car, fusing conv→relu→matmul creates realistic, harder tasks.
🍞 Top Bread (Hook) You know how a good workshop has the right tools neatly labeled?
🥬 The Concept: Skill-integrated agent loop gives the model structured tools and a step-by-step CUDA playbook (SKILL.md). How it works:
- Analyze performance using profiler.
- Implement custom kernels and bindings.
- Compile, verify correctness, and profile; iterate.
- Aim to beat torch.compile by at least 5%, then keep improving.
Why it matters: Without tools and a workflow, the agent can’t do true engineering.
🍞 Bottom Bread (Anchor) Like a maker space with saws, drills, and safety rules, the agent has Bash, edit tools, and profiling scripts inside a safe sandbox.
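The think→act→observe loop can be sketched as a few lines of Python; `policy` and `tools` here are hypothetical stand-ins for the model and the sandboxed skills, not the paper's actual interfaces:

```python
def agent_loop(policy, tools, max_turns=200):
    """Minimal ReAct-style loop: the policy picks an action, the sandboxed
    tool runs it, and the observation feeds back into the next decision.
    `policy(history)` returns (tool_name, args) or None to stop."""
    history = []
    for _ in range(max_turns):
        action = policy(history)          # "think": choose the next move
        if action is None:                # the agent decides it is done
            break
        name, args = action
        observation = tools[name](*args)  # "act": run in the sandbox, capture feedback
        history.append((name, args, observation))
    return history
```

The real environment adds strict permissions, a CPU/GPU sandbox split, and long-context bookkeeping, but the control flow is this simple loop.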
🍞 Top Bread (Hook) Think of a game that gives stars for clear milestones—1 star for finishing, 2 for beating your time, 3 for beating the champion.
🥬 The Concept: Robust reward scheduling assigns discrete points for correctness and beating eager/compile by 5%+. How it works:
- Fail correctness → negative reward.
- Pass and beat eager → more points.
- Also beat compile → highest points.
Why it matters: This removes timing noise and compares progress fairly across tasks.
🍞 Bottom Bread (Anchor) Like video game levels, the agent clearly knows what earns bronze, silver, or gold.
🍞 Top Bread (Hook) Before running a marathon, you jog; then do tempo runs; then race.
🥬 The Concept: Multi-stage warm-up (single-turn RL → RFT actor init → critic value pretraining) stabilizes training before full PPO. How it works:
- Single-turn RL teaches the basics of CUDA codegen.
- Rejection Fine-Tuning keeps only good multi-turn rollouts to shape the actor.
- Value pretraining teaches the critic what promising states look like, so PPO updates are sane.
Why it matters: Without warm-up, RL collapses because CUDA tokens are rare and reward signals are sparse.
🍞 Bottom Bread (Anchor) Like practicing scales, then short songs, then full pieces, the agent is prepared for long multi-turn sessions without losing structure.
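The Rejection Fine-Tuning step reduces to a simple selection rule; the dict fields below are illustrative, not the paper's rollout schema:

```python
def rft_select(rollouts):
    """Rejection Fine-Tuning data selection: keep only multi-turn rollouts
    that passed correctness and avoided tool misuse. The survivors become
    supervised fine-tuning data that shapes the actor's behavior prior."""
    return [r for r in rollouts
            if r["passed_correctness"] and not r["tool_misuse"]]
```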
03 Methodology
At a high level: PyTorch operator task → [Skill-integrated agent loop: think → code → compile → test → profile → revise] → Optimized CUDA kernels that beat torch.compile while staying correct.
Step 1: Scalable Data Synthesis Pipeline
🍞 Top Bread (Hook) Imagine a math workbook that automatically makes new mixed problems so you never run out of practice.
🥬 The Concept: The pipeline constructs many realistic CUDA-training tasks by composing well-tested PyTorch operators and filtering them. What happens:
- Seed crawl: Collect operator classes from PyTorch/transformers—clean, maintained code.
- Combinatorial synthesis: An LLM samples up to 5 torch operators and stacks them into a fused layer.
- Rigorous filtering: Keep only tasks that (a) run in both eager and compile, (b) are deterministic, (c) don’t output constants, and (d) have eager runtime in 1–100 ms. Also remove anything too similar to benchmark tests.
Why this step exists: Without broad, quality tasks, the agent overfits or learns shallow tricks.
🍞 Bottom Bread (Anchor) Like generating 6,000 varied practice problems—some single ops, many fused—so the agent sees the real world of kernels it must optimize.
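The four filtering criteria (plus benchmark decontamination) can be sketched as a single predicate; `task` here is a hypothetical record of trial-run measurements, not the paper's exact schema:

```python
def passes_filters(task: dict) -> bool:
    """Keep a synthesized task only if it survives all rigor checks."""
    return (
        task["runs_in_eager"] and task["runs_in_compile"]  # (a) executable in both modes
        and task["deterministic"]                          # (b) stable outputs across runs
        and not task["constant_output"]                    # (c) non-trivial computation
        and 1.0 <= task["eager_ms"] <= 100.0               # (d) fair workload size
        and not task["near_benchmark"]                     # decontaminate vs. KernelBench
    )
```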
Step 2: Skill-Integrated Agent Loop and Safe Sandbox
🍞 Top Bread (Hook) You know how a pilot trains in a simulator with real dashboards and strict rules?
🥬 The Concept: The agent loop lets the model interleave reasoning and actions with tools—edit files, compile, verify, and profile—inside a controlled environment. What happens:
- ReAct-style interaction: The model thinks, acts (e.g., edit, compile), observes outputs (errors, timings), and plans the next move.
- Skills: A CUDA playbook (SKILL.md) defines workflow—analyze, implement, test, iterate, beat compile by 5%+.
- Sandbox architecture: CPU sandbox for compilation; GPU sandbox pool for testing and profiling. Strict permission isolation and no web tools.
Why this step exists: Without tools and safe execution, the model can’t learn from real compile errors and timings or might exploit shortcuts.
🍞 Bottom Bread (Anchor) Like a flight sim where you can adjust controls, read instruments, and retry safely, the agent gets reliable feedback each turn.
Step 3: Robust Reward Scheduling and Anti-Hacking Protections
🍞 Top Bread (Hook) Picture a scoreboard that ignores lucky spins and only counts fair wins.
🥬 The Concept: Rewards are discrete milestones (fail correctness → -1; pass and beat eager/compile by 5% → higher points) with hardened evaluation. What happens:
- Multiple-input correctness checks on five random inputs.
- Timing with warmups, synchronization, and repeats to reduce noise.
- Forbid fallbacks to torch.nn.functional during evaluation; protect verification/profiling scripts via permissions.
Why this step exists: Without trustworthy rewards, the agent learns to game the system instead of getting truly faster.
🍞 Bottom Bread (Anchor) It’s like sealing test answers in a safe and timing races on a certified track—no cheating, clean measurements.
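A sketch of the noise-reduced timing idea: warm up first, synchronize before and after each run, and keep the best of several repeats. On a GPU the `sync` callable would be `torch.cuda.synchronize`; here it defaults to a no-op so the sketch runs anywhere, and the repeat counts are illustrative:

```python
import time

def timed_ms(fn, warmup=10, repeats=50, sync=lambda: None):
    """Return the best wall-clock time of `fn` in milliseconds, after
    warming caches/JIT and synchronizing around each measured run."""
    for _ in range(warmup):  # warm caches, JIT compilation, clock states
        fn()
    sync()
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        sync()  # without this, async GPU launches would return too early
        best = min(best, (time.perf_counter() - start) * 1e3)
    return best
```

Taking the minimum over repeats (rather than the mean) discards scheduler noise, which is one common way to make race timings reproducible.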
Step 4: Algorithmic Stability—Warm-ups and PPO
🍞 Top Bread (Hook) Before hiking a tough mountain, you acclimate so you don’t collapse halfway up.
🥬 The Concept: A multi-stage warm-up makes RL stable for this rare, specialized domain. What happens:
- Single-turn RL warm-up: Train the base model to produce workable CUDA kernels in a one-and-done setting.
- Actor initialization by Rejection Fine-Tuning (RFT): Run the full agent loop, keep only good rollouts (pass correctness, avoid tool misuse), and SFT the actor on them to shape a strong behavior prior.
- Critic value pretraining: Teach the critic to estimate which states are promising using GAE targets from outcome rewards, so advantage estimates are meaningful.
- Full agentic RL with PPO: Optimize the actor with clipped objectives, guided by the pretrained critic, across up to 200 turns and 128k tokens.
Why this step exists: Without warm-up, importance ratios explode, entropy drifts, and training collapses.
🍞 Bottom Bread (Anchor) It’s like practicing with shorter hikes and a map before the summit push—the final climb (PPO) becomes steady and safe.
Concrete Example Walkthrough (mini run):
- Input: A fused operator pipeline (e.g., conv2d → relu → matmul) with 12 ms eager time, 8 ms torch.compile time.
- The agent profiles the native code, identifies fusion/memory bottlenecks, writes a fused kernel (plus cuDNN GEMM where suitable), compiles, and tests on five random inputs.
- First attempt runs at 7.5 ms (passes correctness and just clears the 5% bar against the 8 ms compile time) → reward for beating compile.
- The agent then tunes block size and introduces shared-memory tiling, reaching 6.5 ms—higher reward tier.
- Final output: kernels/*.cu plus *_binding.cpp and model_new.py, which together achieve a >5% speedup and pass all checks.
Secret Sauce:
- The trio of scale (6K carefully filtered tasks), skills (clear workflow and tools), and stability (warm-ups + PPO) transforms an LLM from a code printer into a performance engineer.
- Strong anti-hacking and robust rewards mean every improvement reflects real, reproducible speed.
04 Experiments & Results
🍞 Top Bread (Hook) Imagine a science fair where everyone builds cars, but they all race on the same track with the same timer. That’s the only way to know who’s really fastest.
🥬 The Concept: KernelBench is a standardized benchmark of operator tasks (Levels 1–3) to test whether generated CUDA kernels are correct and faster than baselines. How it works:
- Provide 250 realistic tasks across difficulty levels.
- For each task, measure if the kernel compiles and is correct (Pass Rate), how often it’s faster (Faster Rate), and by how much on average (Geomean Speed-up) vs eager and torch.compile.
Why it matters: Without fair tests, speed claims don’t mean much.
🍞 Bottom Bread (Anchor) It’s like racing cars over sprints, middle distances, and long tracks—and counting wins and average lap times.
The Test (what and why):
- We evaluate on KernelBench Levels 1–3, covering single ops, composed sequences, and challenging fused/realistic cases. We measure:
- Pass Rate: Can the kernel build and produce correct outputs across random inputs?
- Faster Rate: In what fraction of tasks is it faster than eager or compile?
- Geometric Mean Speed-up: Average speed gain among correct solutions, which fairly handles ratios.
- Why these? They ensure correctness first, then celebrate consistent, meaningful speedups.
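The geometric mean is the right average for speed-up ratios because a 2× win and a 0.5× loss should cancel to 1×, which an arithmetic mean gets wrong. A small sketch (function name is ours, not the benchmark's):

```python
import math

def geomean_speedup(baseline_ms, kernel_ms):
    """Geometric mean of per-task speed-up ratios (baseline / kernel),
    computed in log space for numerical stability."""
    ratios = [b / k for b, k in zip(baseline_ms, kernel_ms)]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))
```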
The Competition:
- Baselines include torch.compile and strong proprietary/open LLMs: Claude Opus 4.5, Gemini 3 Pro, GLM 4.6, Kimi K2, plus the Seed 1.6 base model.
The Scoreboard (with context):
- Overall, CUDA Agent achieves ~98.8% Pass Rate and ~96.8% Faster Rate vs torch.compile, with ~2.11× geomean speed-up over compile (and ~2.60× vs eager). Interpreting this: think of getting not just an A, but the highest A+ in the class where others got strong Bs.
- Level 1 (simpler): 100% Pass, 97% Faster vs compile, ~1.87× speed-up over compile—strong consistency.
- Level 2 (operator sequences): 100% Pass, 100% Faster vs compile, ~2.80× speed-up—this is where learned strategies outshine static compiler heuristics.
- Level 3 (hardest, realistic): 94% Pass, 90% Faster vs compile, ~1.52× speed-up—still significantly ahead of competing LLMs.
- Compared to top proprietary models (Claude Opus 4.5, Gemini 3 Pro), CUDA Agent improves Faster Rate by a wide margin on Level 3 and shows higher overall speed-ups, indicating better mastery of fusion and low-level optimization.
Surprising Findings:
- Learned optimization policies can reliably outperform static compiler heuristics on complex fused tasks. This suggests that iterative, feedback-driven exploration discovers tilings, memory layouts, and fusion strategies that compilers miss.
- Reward design matters a lot: switching to raw speedup rewards hurt consistency, even when correctness was similar. Milestone-based rewards better aligned the model with the goal of being consistently faster.
- Stability stages (RFT and value pretraining) made the difference between steady improvement and catastrophic collapse. Removing them caused entropy blow-ups and wandering, ultra-long trajectories with little gain.
🍞 Top Bread (Hook) You know how an orchestra sounds better when the conductor and players warm up first?
🥬 The Concept: Ablations show each component’s role by removing it and observing performance drops. How it works:
- Remove agent loop → fewer correct and slower kernels (less learning from errors and timing).
- Remove robust reward → correctness ok, but speed gains drop notably.
- Remove RFT → training collapses, outputs become chaotic.
- Remove value pretraining → critic is blind; the agent wanders with long, inefficient interactions.
Why it matters: Each piece (the loop, rewards, RFT, and critic warm-up) is essential for top performance.
🍞 Bottom Bread (Anchor) Like trying to perform without tuning instruments, removing warm-ups and guidance makes the system fall out of harmony quickly.
05 Discussion & Limitations
Limitations:
- Domain Specialization: The system is tailored to CUDA on NVIDIA GPUs. Porting to other accelerators (like AMD or custom NPUs) needs new skills, libraries, and profiling.
- Infrastructure Heavy: Training uses large compute (e.g., a GPU sandbox pool) and a long context agent loop. Smaller teams may prefer distilled or smaller variants for deployment.
- Benchmark Scope: While KernelBench is broad, real-world models may involve even more complicated memory patterns, dynamic shapes, or operator semantics.
- Tool/Env Dependence: The approach leverages a carefully controlled environment. In the wild, missing dependencies or noisy systems could reduce reliability without similar safeguards.
Required Resources:
- Access to modern NVIDIA GPUs, Dockerized sandboxes, and profiling stacks.
- Ability to run long-context, multi-turn PPO with actor/critic models.
- Curated task sets (like CUDA-Agent-Ops-6K) and storage for logs/rollouts.
When NOT to Use:
- If your workload already saturates hardware via vendor libraries (e.g., standard GEMMs/convolutions) and offers little headroom.
- If you lack the sandbox and permissions setup needed to ensure clean, trustworthy timings.
- If your target platform isn’t CUDA/NVIDIA.
Open Questions:
- Generalization: How far do learned optimization policies transfer across GPUs (e.g., H100→A100→consumer RTX) and across very different operator mixes?
- Distillation: Can we compress the agent’s skill into smaller, faster models or static code generators for deployment?
- Auto-Safety and Debugging: Can the agent automatically detect and fix rare numerical stability issues or race conditions without manual guardrails?
- Beyond Operators: How well does this extend to end-to-end model graphs, scheduling across multiple kernels, or CUDA Graphs orchestration?
- Cross-Modal Feedback: Could richer hardware feedback (e.g., Nsight metrics) further improve learning without overwhelming the agent?
06 Conclusion & Future Work
Three-Sentence Summary:
- CUDA Agent turns an LLM into a real CUDA engineer by giving it skills, safe tools, and stable reinforcement learning that reward correctness and significant speed gains.
- With a large, carefully synthesized training set, robust reward milestones, and multi-stage warm-ups (RFT and value pretraining), the agent learns to write and optimize kernels that reliably beat torch.compile.
- On KernelBench, this approach surpasses strong proprietary LLMs, especially on fused tasks, proving that learned, agentic optimization can outdo static compiler heuristics.
Main Achievement:
- Demonstrating that an agent trained with execution-grounded feedback and stability-focused RL can discover hardware-aware optimizations at scale, consistently outperforming compiler baselines across diverse tasks.
Future Directions:
- Port the method to other accelerators and APIs, incorporate deeper profiling signals, and distill the agent’s skills into lightweight models or auto-tuners.
- Expand from operator-level to end-to-end graph optimization, including cross-kernel scheduling and memory planning.
- Explore safer, more autonomous debugging and self-correction for rare numerical or concurrency bugs.
Why Remember This:
- It marks a shift from LLMs as code typists to performance engineers: acting, measuring, and improving like experts do.
- It shows that with the right practice, tools, and rewards, learned policies can beat static compiler rules on real GPU workloads.
- It opens a path to faster, cheaper, and more accessible performance engineering for AI systems.
Practical Applications
- Speeding up custom CUDA extensions for model serving to cut latency in production APIs.
- Auto-fusing operator sequences in research prototypes to accelerate iteration cycles.
- Optimizing data preprocessing or augmentation kernels for faster training pipelines.
- Generating specialized kernels for uncommon tensor shapes or sparsity patterns.
- Automated performance tuning for edge devices with limited GPU memory bandwidth.
- Creating baseline-strong kernels before hand-tuning by human experts, saving expert time.
- Porting high-level PyTorch layers to optimized library calls (cuBLAS/cuDNN) plus targeted custom kernels.
- Building internal auto-tuner tools that search block sizes, tilings, and vectorization strategies safely.
- Using the sandbox to reproduce and diagnose performance regressions across GPU clusters.
- Teaching new engineers CUDA best practices via the agent’s skill playbook and examples.