Towards Automated Kernel Generation in the Era of LLMs
Key Summary
- AI programs called LLMs can now help write the tiny, super-fast pieces of code (kernels) that make GPUs run AI models efficiently.
- Before, top-speed kernels took experts weeks or months to hand-tune for each kind of hardware; this was slow and didn’t transfer well across devices.
- This survey organizes the fast-growing research into two big ideas: training LLMs with examples (SFT and preferences/RL) and using agent loops that try, test, profile, and improve code automatically.
- Agent systems boost results by adding planning, memory (RAG), hardware profiling, and teams of cooperating AI helpers (multi-agent).
- New datasets and benchmarks (like KernelBench, TritonBench, MultiKernelBench, and FlashInfer-Bench) make it possible to measure not just correctness but also speed and hardware efficiency.
- Results show that smart feedback, test-time thinking, and evolutionary search can match or even beat expert libraries for some tasks (e.g., GEMM).
- Key challenges remain: not enough high-quality training data, slow compile/run feedback loops, and limited evaluations beyond NVIDIA and fixed input shapes.
- The paper provides a roadmap of methods, data, and open problems, plus a living GitHub index to track rapid progress.
- If successful, automated kernel generation could make AI faster, cheaper, and greener by squeezing more performance out of today’s hardware.
Why This Research Matters
Every message you send to an AI runs through kernels; making those kernels faster means quicker replies and lower costs. Automated kernel generation shrinks the need for scarce experts, helping small teams build high-performance systems. Better kernels save energy, making large-scale AI greener and more affordable. Cross-hardware support means the same ideas can speed up phones, PCs, and data centers. Standardized datasets and benchmarks let the community measure progress fairly and improve together. Over time, tools that explain their optimizations will help engineers learn and trust the results, accelerating innovation.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine you’re trying to make the world’s fastest paper airplane. You can fold something that flies, but to win races you need a pro who knows every crease and angle—and even then, the perfect design for one kind of paper might flop with another.
🥬 The Concept: A GPU kernel is like that expert folding recipe for the GPU; it tells the hardware exactly how to run an operation super fast. How it works (recipe):
1) Take a high-level goal (like matrix multiply). 2) Break it into thousands of tiny tasks that can run in parallel. 3) Arrange memory reads/writes so data arrives right when needed. 4) Tune details (tile sizes, warps, cache use) for a specific GPU. Why it matters: Without great kernels, even powerful GPUs run like sports cars stuck in first gear. 🍞 Anchor: The attention step in LLMs spends most time in matrix math; a top-notch kernel makes that step zoom, cutting cost and latency for every chat response.
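To make the "folding recipe" concrete, here is a minimal sketch of a GPU kernel written in Triton (a Python DSL for kernels that the survey discusses alongside CUDA). It shows steps 2 to 4 in miniature: split the work into parallel tiles, mask the ragged boundary, and keep memory accesses coalesced. The operator (elementwise add) and the BLOCK_SIZE of 1024 are illustrative choices, not tuned values from any cited system, and the code assumes Triton and a CUDA-capable GPU are available.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                       # which tile this program owns
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                       # guard the ragged last tile
    x = tl.load(x_ptr + offsets, mask=mask)           # coalesced reads
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)     # coalesced writes

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)                    # one program per 1024-element tile
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```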
🍞 Hook: You know how some video games run better on one console than another? That’s because the code has to match the machine.
🥬 The Concept: GPUs are special computers for running many tiny jobs at once. How it works: 1) Split work across threads. 2) Group threads into warps/blocks. 3) Move data smartly through registers, shared memory, and caches. 4) Keep all parts busy with coalesced access and good tiling. Why it matters: If you don’t map work to the hardware well, you leave speed on the table. 🍞 Anchor: A badly written kernel can be 10–100× slower than a tuned one, like carrying groceries one apple at a time instead of using a cart.
🍞 Hook: You know how a great teacher can explain a tough idea in simple steps?
🥬 The Concept: Transformers are a type of AI model that understands context by paying attention to important pieces. How it works: 1) Read tokens. 2) Use attention to score what matters. 3) Mix information with layers. 4) Predict the next token. Why it matters: This makes LLMs good at reasoning about code and instructions. 🍞 Anchor: When asked to write a CUDA kernel, a Transformer-based LLM focuses on key words like “tile,” “coalesced,” and “shared memory.”
🍞 Hook: Imagine a giant library of how-to guides stuffed into one smart helper.
🥬 The Concept: Large Language Models (LLMs) are AI tools trained to read and write text, including code. How it works: 1) Train on huge text and code datasets. 2) Learn patterns and concepts. 3) Generate step-by-step plans and programs. Why it matters: LLMs can compress rare, expert know-how that’s hard to write into rules. 🍞 Anchor: Ask an LLM to implement 2D convolution in Triton; it drafts code and often remembers best practices like boundary checks and block sizes.
🍞 Hook: Think of a student who improves faster when shown correct answers.
🥬 The Concept: Supervised Fine-Tuning (SFT) teaches an LLM with paired examples of tasks and great solutions. How it works: 1) Collect high-quality kernel pairs (intent → code). 2) Train the LLM to imitate them. 3) Add reasoning steps so it explains choices. 4) Check results on new tasks. Why it matters: Without SFT, the LLM’s answers are generic and miss hardware tricks. 🍞 Anchor: KernelLLM and KernelCoder learn from curated PyTorch→Triton/CUDA pairs to write more reliable, faster kernels.
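As a hedged illustration of what one SFT example might look like, the sketch below pairs a high-level intent with a reasoning trace and a curated kernel, then splits it into a prompt and a target for next-token training. The field names and the prompt template are assumptions for illustration, not the actual schemas used by KernelLLM or KernelCoder.

```python
# A hypothetical intent -> kernel SFT record; fields and template are illustrative.
record = {
    "intent": "Rewrite torch.softmax(x, dim=-1) as a Triton kernel for fp16 inputs.",
    "reasoning": ("Assign one program per row; load with a boundary mask, subtract "
                  "the row max for stability, exponentiate, normalize, store."),
    "kernel": "@triton.jit\ndef softmax_kernel(...): ...",  # curated expert solution
}

# Standard SFT framing: the model learns to continue the prompt with the target;
# the loss is typically applied to target tokens only (prompt tokens are masked).
prompt = f"### Task\n{record['intent']}\n### Solution\n"
target = record["reasoning"] + "\n" + record["kernel"]
training_text = prompt + target
```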
🍞 Hook: Training a puppy works best with treats after good tricks.
🥬 The Concept: Reinforcement Learning (RL) improves the LLM by rewarding better code (faster, correct kernels). How it works: 1) Generate kernels. 2) Compile and run. 3) Measure correctness and speed. 4) Give rewards and update the model. Why it matters: Without feedback from the real hardware, the model can’t learn what truly runs fast. 🍞 Anchor: CUDA-L2 uses RL to learn GEMM settings that can beat cuBLAS in some cases.
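A minimal sketch of how compile, correctness, and runtime feedback could be folded into a single scalar reward is shown below. The gating constants and the log scaling are illustrative assumptions, not the reward definitions used by CUDA-L1/L2, AutoTriton, or TritonRL.

```python
import math

def kernel_reward(compiled: bool, correct: bool,
                  candidate_ms: float, baseline_ms: float) -> float:
    """Toy reward shaping: gate on compilation and correctness, then reward speedup."""
    if not compiled:
        return -1.0               # nothing to run: strong penalty
    if not correct:
        return -0.5               # runs, but a fast-wrong kernel is useless
    speedup = baseline_ms / candidate_ms
    return math.log(speedup)      # 0 at parity with the baseline, positive when faster

# Example: a 0.8 ms candidate vs a 1.0 ms cuBLAS baseline -> reward ~ log(1.25) ~ 0.22
print(kernel_reward(True, True, candidate_ms=0.8, baseline_ms=1.0))
```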
🍞 Hook: Picture a team with a planner, a coder, and a tester passing notes to improve a project.
🥬 The Concept: Agent-based systems wrap an LLM with tools for planning, memory, testing, and profiling. How it works: 1) Plan a kernel design. 2) Draft code. 3) Compile, test, and profile. 4) Reflect, retrieve docs, and iterate. 5) Repeat until it’s fast and correct. Why it matters: One-shot code often fails; loops with feedback drive big gains. 🍞 Anchor: Systems like STARK, GEAK, and Astra run plan–code–debug cycles guided by hardware logs.
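The skeleton below sketches such a plan-code-test-profile loop. Every helper (draft_kernel, compile_and_test, profile, reflect) is a hypothetical stub standing in for an LLM call, a compiler, a test harness, or a profiler; none of this is the API of STARK, GEAK, or Astra.

```python
def draft_kernel(spec: str, plan: str, feedback: list[str]) -> str:
    return "// LLM-generated CUDA/Triton source would go here"  # stub for an LLM call

def compile_and_test(source: str) -> tuple[bool, str]:
    return True, ""                                   # stub: (passed, error log)

def profile(source: str) -> dict:
    return {"latency_ms": 1.0, "l2_hit_rate": 0.45}   # stub for Nsight-style metrics

def reflect(error_log: str, metrics: dict) -> list[str]:
    return ["L2 hit rate is low; try a swizzled layout or smaller tiles."]  # stub critique

def optimize(spec: str, plan: str, max_rounds: int = 5) -> str:
    feedback: list[str] = []
    best_source, best_ms = "", float("inf")
    for _ in range(max_rounds):
        source = draft_kernel(spec, plan, feedback)   # steps 1-2: plan and draft
        passed, log = compile_and_test(source)        # step 3: compile and test
        metrics = profile(source) if passed else {}   # step 3: profile
        if passed and metrics["latency_ms"] < best_ms:
            best_source, best_ms = source, metrics["latency_ms"]
        feedback = reflect(log, metrics)              # step 4: critique, then iterate
    return best_source
```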
The world before: AI teams relied heavily on human experts to craft kernels for each hardware family (NVIDIA, AMD, NPUs, TPUs). Libraries like CUTLASS or FlashAttention helped, but customizing or fusing operators still took lots of time and didn’t always transfer to new GPUs. As LLMs grew, model size and costs soared, making every millisecond and watt matter.
The problem: Kernel engineering is hard, slow, and non-scalable. What runs great on one GPU may slump on the next. Most code models aim for correctness, not peak speed under real hardware rules.
Failed attempts: Generic code generation produced compiling programs but not fast ones. Fixed autotuners searched narrow spaces and missed cross-operator fusions or unusual memory patterns.
The gap: We needed a system that could read high-level intent, recall expert tactics, try code on real hardware, learn from results, and repeat—across many platforms.
Real stakes: Faster kernels cut cloud bills, reduce energy, enable snappier apps, and make advanced AI affordable to more people. As models scale, kernel efficiency becomes the biggest lever on cost and sustainability.
02 Core Idea
🍞 Hook: Imagine building a race car. You could read manuals forever, or you could build, test on the track, check lap times, tweak, and repeat until you win.
🥬 The Concept: The key insight is to combine LLMs (who know the manuals) with agent loops (who run track tests) so kernels are generated, profiled, and improved automatically. How it works: 1) Start from a high-level operator (like GEMM or attention). 2) The LLM drafts a kernel using learned expert tricks. 3) Compile, test for correctness, and profile for speed/efficiency. 4) Use feedback (numbers, logs, errors) to revise the code or plan. 5) Optionally use RL to reward faster versions. 6) Store good designs and keep learning. Why it matters: Without the loop, you get plausible but slow code; without the LLM, you explore blindly. 🍞 Anchor: CUDA-LLM, TritonRL, and KernelGen show that reflection, profiling, and multiple attempts push performance closer to expert-tuned libraries.
Explain it three ways:
- Chef analogy: The LLM is the chef who knows many recipes; the agent kitchen tastes each dish, measures spice and temperature, and suggests tweaks until the flavor (speed) is perfect.
- Pit-crew analogy: The LLM is the engineer designing the car; the agent is the pit crew timing laps, changing tires (tiling), and adjusting aerodynamics (memory layout) to beat records.
- Treasure map analogy: The LLM draws the first map; the agent sends scouts (profilers), marks traps (bank conflicts), and redraws paths (swizzling) to reach the gold (peak throughput).
Before vs after:
- Before: One-shot generation aimed for “it runs,” not “it flies.” Porting across hardware meant starting over.
- After: Iterative, hardware-in-the-loop optimization learns what works on each device, often transferring strategies via memory or meta-prompts.
Why it works (intuition, not math):
- Compression: Pretrained LLMs store heaps of tacit expert wisdom (tiling, unrolling, vectorization) that’s hard to encode as rules.
- Local feedback: Real compile errors, unit-test mismatches, and profiler stats give precise hints on what to fix now.
- Search: Trying multiple candidates (test-time scaling, population evolution) reduces the chance you get stuck with a weak design.
- Structure: Multi-agent roles and hierarchical planning split a messy task into bite-size wins.
Building blocks (the toolkit):
- SFT and preference learning: Teach the model with aligned intent→kernel pairs and good reasoning traces (e.g., ConCuR, KernelLLM).
- Reinforcement learning: Turn runtime and correctness into rewards (AutoTriton, TritonRL, CUDA-L1/L2).
- External memory (RAG): Retrieve CUDA/Triton docs, best-practice snippets, and hardware guides at generate-time (AI CUDA Engineer, KernelEvolve).
- Hardware profiling: Feed warp size, cache sizes, occupancy, and Nsight stats into the loop; translate numbers into natural-language advice (PRAGMA, TritonForge, KERNELBAND).
- Multi-agent orchestration: Planner, coder, tester, judge, and tuner collaborate; roles can target NVIDIA, AMD, or NPUs (STARK, GEAK, AKG, Astra).
- Benchmarks and datasets: KernelBench, TritonBench, MultiKernelBench, ROCm datasets, FlashInfer-Bench, plus training corpora like The Stack v2, HPC-Instruct, KernelBook.
🍞 Anchor: A practical flow: Ask for a Triton softmax kernel; the LLM proposes a tiled version; tests find a boundary bug; the profiler shows low L2 hit rate; the agent retrieves a swizzling pattern and adjusts tile sizes; pass@k rises, speedup improves, and the final kernel gets saved for future reuse.
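To ground that flow, here is a compact sketch of the kind of tiled Triton softmax such a loop might converge to, with the boundary masking that the hypothetical bug fix refers to. It assumes a contiguous 2-D input on a CUDA GPU; the one-row-per-program mapping is an illustrative layout, not the output of any cited system.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def softmax_kernel(out_ptr, in_ptr, row_stride, n_cols, BLOCK_SIZE: tl.constexpr):
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols                                   # the boundary guard
    x = tl.load(in_ptr + row * row_stride + cols, mask=mask, other=-float("inf"))
    x = x - tl.max(x, axis=0)                              # numerical stability
    num = tl.exp(x)
    y = num / tl.sum(num, axis=0)
    tl.store(out_ptr + row * row_stride + cols, y, mask=mask)

def softmax(x: torch.Tensor) -> torch.Tensor:
    n_rows, n_cols = x.shape                               # assumes contiguous 2-D input
    out = torch.empty_like(x)
    BLOCK_SIZE = triton.next_power_of_2(n_cols)            # one tile spans a full row
    softmax_kernel[(n_rows,)](out, x, x.stride(0), n_cols, BLOCK_SIZE=BLOCK_SIZE)
    return out
```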
03 Methodology
At a high level: Intent (e.g., PyTorch op) → Plan (tiling, memory) → Draft kernel (LLM) → Compile & test → Profile & analyze → Refine (LLM + tools/RL) → Select best → Archive and evaluate.
Step-by-step with what, why, and an example:
- Parse the high-level intent
- What: Convert an operator (e.g., torch.matmul) or a fusion pattern into a clear spec: shapes, dtypes, strides, tolerance, and platform (NVIDIA, AMD, NPU).
- Why: Ambiguity leads to wrong indexing or wasted memory traffic.
- Example: “GEMM: A[M×K] × B[K×N] → C[M×N], float16 in, float16 accumulate on NVIDIA SM80.”
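A small sketch of what such a parsed spec might look like as a data structure; the field names are assumptions for illustration, not a schema from the survey.

```python
from dataclasses import dataclass

@dataclass
class KernelSpec:
    op: str            # e.g. "gemm"
    shapes: tuple      # e.g. (M, K, N)
    dtype_in: str      # e.g. "float16"
    dtype_acc: str     # accumulation dtype
    tolerance: float   # max abs error vs. the reference
    platform: str      # e.g. "NVIDIA SM80"

spec = KernelSpec("gemm", (4096, 4096, 4096), "float16", "float16", 1e-3, "NVIDIA SM80")
```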
- Plan the computation (tiling, parallelism, memory)
- What: Choose tile sizes, thread/block layout, shared memory usage, vector widths; note reuse patterns.
- Why: Bad tiling kills coalescing and cache hit rate.
- Example: Choose 128×64×32 tiles, map warps to rows, prefetch K-slices into shared memory.
- Draft the kernel with the LLM
- What: The LLM writes CUDA or Triton code, with comments and reasoning steps.
- Why: Encodes expert templates (bounds checks, avoid bank conflicts) fast.
- Example: Generate Triton code with program_id-based block mapping and masking for edges.
- Compile and unit test for correctness
- What: Build the kernel; run randomized tests against a trusted reference (PyTorch/CUDA).
- Why: A fast-but-wrong kernel is useless.
- Example: Compare outputs under many shapes and seeds; assert max error < 1e-3.
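A minimal sketch of such a randomized differential test, assuming a hypothetical candidate kernel_fn and a trusted ref_fn (e.g. torch.matmul); the shape ranges and tolerance are illustrative.

```python
import torch

def check_correctness(kernel_fn, ref_fn, trials: int = 50, tol: float = 1e-3) -> bool:
    """Compare a candidate GEMM kernel against a trusted reference on random inputs."""
    torch.manual_seed(0)
    for _ in range(trials):
        m, k, n = (int(torch.randint(32, 1024, (1,))) for _ in range(3))
        a = torch.randn(m, k, device="cuda", dtype=torch.float16)
        b = torch.randn(k, n, device="cuda", dtype=torch.float16)
        err = (kernel_fn(a, b).float() - ref_fn(a, b).float()).abs().max().item()
        if err >= tol:
            return False
    return True

# Usage: check_correctness(my_gemm, torch.matmul)
```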
🍞 Hook: You know how throwing more darts gives you a better shot at hitting the bullseye? 🥬 The Concept: pass@k tells us the chance at least one of k tries is correct. How it works: Generate k kernels, run tests, and see if any pass. Why it matters: Kernel generation is variable; multiple shots boost success. 🍞 Anchor: If k=10 and 3 candidates pass, pass@10 counts that as a success for that task.
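For reference, the unbiased estimator commonly used in the code-generation literature for pass@k: generate n samples, count c correct ones, and compute the probability that a random size-k subset contains at least one correct sample.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0     # too few failures to fill a k-subset, so success is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=10, c=3, k=10))  # 1.0: drawing all 10 samples, at least one passes
```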
- Profile runtime and hardware efficiency
- What: Measure wall time, occupancy, memory bandwidth, and cache hit rates; parse compiler logs.
- Why: Helps pinpoint bottlenecks (e.g., memory-bound vs compute-bound).
- Example: Nsight Compute shows low L2 hits; shared-memory bank conflicts appear in SASS analysis.
🍞 Hook: Imagine racing two cars on the same track to see which is faster. 🥬 The Concept: Speedup compares your kernel’s time to a baseline. How it works: speedup = baseline_time / your_time; higher is better. Why it matters: It tells you how much real speed you gained. 🍞 Anchor: If cuBLAS GEMM takes 1.0 ms and yours takes 0.8 ms, speedup is 1.25×.
🍞 Hook: Think of how full a bus is—more seats filled means better use of resources. 🥬 The Concept: Efficiency measures how well your kernel uses the hardware’s potential. How it works: Compare achieved throughput to theoretical max; look at occupancy and utilization. Why it matters: Low efficiency means unused performance is being wasted. 🍞 Anchor: TritonBench reports efficiency so two kernels with similar times can be compared fairly across shapes.
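The sketch below shows one way to measure these two metrics with CUDA events, assuming the candidate and baseline are callables that launch the kernels; the peak-bandwidth figure is a device-specific number you must supply, and the helpers are illustrative rather than the exact methodology of any benchmark.

```python
import torch

def time_ms(fn, warmup: int = 10, iters: int = 100) -> float:
    """Average latency in milliseconds, timed with CUDA events."""
    for _ in range(warmup):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

def speedup(baseline_ms: float, candidate_ms: float) -> float:
    return baseline_ms / candidate_ms                # 1.0 ms vs 0.8 ms -> 1.25x

def bandwidth_efficiency(bytes_moved: int, latency_ms: float, peak_gb_per_s: float) -> float:
    """Fraction of theoretical memory bandwidth actually achieved."""
    achieved_gb_per_s = bytes_moved / (latency_ms * 1e-3) / 1e9
    return achieved_gb_per_s / peak_gb_per_s
```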
- Reflect, retrieve, and revise (agent loop)
- What: Convert errors and profiler numbers into natural-language critiques; pull relevant docs/snippets from a knowledge base; try fixes.
- Why: Raw logs are noisy; turning them into guidance accelerates improvement.
- Example: “L2 hit rate is low; consider swizzling B’s layout; try 64×64 tiles and wider vector loads.”
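A toy version of that translation step is sketched below; the metric names and thresholds are illustrative assumptions, not the heuristics used by PRAGMA, TritonForge, or KERNELBAND.

```python
def critique(metrics: dict) -> list[str]:
    """Map raw profiler numbers to natural-language hints for the next revision."""
    hints = []
    if metrics.get("l2_hit_rate", 1.0) < 0.5:
        hints.append("L2 hit rate is low; consider swizzling the B layout or "
                     "trying 64x64 tiles with wider vector loads.")
    if metrics.get("occupancy", 1.0) < 0.4:
        hints.append("Occupancy is low; reduce register or shared-memory pressure per block.")
    if metrics.get("bank_conflicts", 0) > 0:
        hints.append("Shared-memory bank conflicts detected; pad or permute the shared tile.")
    return hints

# Example: critique({"l2_hit_rate": 0.31, "occupancy": 0.55})
```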
- Explore candidates (test-time scaling, evolutionary search)
- What: Generate multiple variants by mutating tiles, unroll factors, memory layouts; keep the best.
- Why: Avoid local optima; diversity finds surprising wins.
- Example: FM Agent maintains multiple populations with different mutation rates; EvoEngineer decouples population control from traversal.
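A textbook keep-the-best evolutionary loop over kernel configurations (tile sizes, unroll factors, layouts) might look like the sketch below; it is a generic illustration, not the population scheme of FM Agent or EvoEngineer.

```python
import random

def evolve(seeds: list[dict], score, mutate, generations: int = 5, keep: int = 4) -> dict:
    """Evolve kernel configs: rank by score, keep survivors, refill with mutants."""
    population = list(seeds)
    for _ in range(generations):
        population.sort(key=score, reverse=True)          # higher = faster and correct
        survivors = population[:keep]
        offspring = [mutate(random.choice(survivors))
                     for _ in range(len(population) - keep)]
        population = survivors + offspring
    return max(population, key=score)

# A mutate() might, for example, randomly double or halve one tile dimension.
```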
- Learn from feedback (RL and preferences)
- What: Turn correctness + runtime into rewards; learn to propose faster drafts next time.
- Why: The system gets better with use, not just per-task tuning.
- Example: CUDA-L1 uses dense feedback with an LLM judge; CUDA-L2 refines to surpass cuBLAS on some GEMMs.
- Select, archive, and benchmark
- What: Keep the top kernels, store metadata (shapes, times, device info), and evaluate on public suites.
- Why: Reuse saves time; benchmarks enable fair comparison and progress tracking.
- Example: Save the winning attention kernel with input-shape ranges, then test on KernelBench and TritonBench.
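A minimal sketch of archiving a winning kernel with the metadata needed for reuse; the JSON schema is an illustrative assumption.

```python
import json
import time

def archive_kernel(path: str, source: str, meta: dict) -> None:
    """Persist a winning kernel together with its reuse metadata."""
    record = {"source": source, "meta": meta, "saved_at": time.time()}
    with open(path, "w") as f:
        json.dump(record, f, indent=2)

# Example:
# archive_kernel("attention_sm80.json", kernel_src,
#                {"op": "attention", "dtype": "fp16", "device": "SM80",
#                 "seq_len_range": [128, 8192], "latency_ms": 0.42})
```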
The secret sauce (what makes this clever):
- Knowledge compression: LLMs internalize hard-to-formalize expert heuristics.
- Hardware-in-the-loop: Real compile/run feedback beats paper plans.
- Structured search: Multi-agent roles, RAG, and evolution organize exploration.
- Continual learning: RL and archives turn every attempt into future skill.
Concrete mini-walkthrough (attention block on NVIDIA vs AMD):
- Intent: FlashAttention-like operator for seq_len up to 8K, fp16.
- Plan: Tile QK^T by blocks; use shared memory for K/V; schedule causal masking.
- Draft: Triton kernel with block-wise softmax, vectorized loads.
- Test: 50 random cases vs PyTorch; fix an out-of-bounds mask.
- Profile: Low L2 hit rate; L2 thrashes on long sequences.
- Revise: Retrieve SwizzlePerf patterns; add swizzled layout and prefetch distance.
- Explore: Try 5 tilings; 2 pass tests and 1 wins on speed.
- Learn: Reward higher speed on long sequences; cache the winner; re-evaluate on ROCm-tuned subset for AMD.
04 Experiments & Results
The test: Researchers evaluate two big things—does the kernel work (correctness), and how fast/efficient is it (performance). Because generation is stochastic, they try multiple candidates per task and report pass@k (chance that at least one is correct). They also track speedup and efficiency metrics. Some suites also consider portability across hardware.
The competition: Baselines include trusted references like PyTorch built-ins, cuBLAS/CUTLASS for GEMM, and expert Triton/CUDA kernels from libraries (FlashAttention, Transformer Engine, FlashInfer). New systems are tested on benchmarks such as KernelBench (PyTorch→CUDA tasks), TritonBench (real-world Triton operators and fusions), MultiKernelBench (multi-platform), ROCm/AMD suites, and FlashInfer-Bench (serving workloads).
Scoreboard with context:
- Supervised and curated data: ConCuR’s curation (short reasoning, real speedups, diverse tasks) led to KernelCoder producing CUDA kernels with strong reliability and speed—like a student who not only gets As but finishes tests fastest.
- Compiler-aligned corpora: KernelLLM’s PyTorch↔Triton instruction tuning improves structured mapping from intent to kernels—like learning with side-by-side translations.
- Reinforcement Learning: AutoTriton and TritonRL show that mixing structural checks with runtime rewards and verification boosts both pass@k and speedup@k—like combining practice quizzes with timed drills.
- Surpassing expert libraries in niches: CUDA-L2 refines GEMM choices to rival or surpass cuBLAS for certain shapes/dtypes—akin to beating the school record on specific race distances.
- Agentic loops and test-time scaling: Inference-time scaling, reflection, and population evolution (FM Agent, EvoEngineer) raise pass@k and efficiency—comparable to trying many car setups before race day.
- Hardware-aware prompting and meta-prompts: QiMeng-TensorOp/GEMM/Attention and SwizzlePerf show that injecting exact device specs and patterns guides the LLM toward better tiling and cache behavior—like giving the runner the exact track map and wind conditions.
- Cross-platform progress: GEAK and AKG target AMD/NPUs; MultiKernelBench and ROCm datasets broaden evaluation beyond NVIDIA—like holding a tournament in different stadiums and climates.
Representative findings:
- Correctness: pass@k jumps significantly when using multiple attempts, stronger test oracles, and explicit reasoning traces. Verification of intermediate steps (not just outputs) reduces silent errors.
- Speed and efficiency: Profiling-guided loops consistently trim latency; clustering runtime behaviors (KERNELBAND) narrows search and saves trials. Efficiency gains often come from memory layout fixes (swizzling), vector widths, and occupancy tuning rather than just unrolling.
- Generalization: Models trained on aligned operator–kernel pairs transfer to unseen but related ops; retrieval of domain docs reduces hallucinations of APIs.
- Serving workloads: FlashInfer-Bench shows that standardizing kernel definitions and workloads helps compare apples-to-apples across attention variants, batch shapes, and caching strategies.
Surprises and lessons:
- Test-time compute matters: Letting the model think longer and explore more candidates at inference (without retraining) can rival big training gains.
- Feedback shaping is crucial: Dense, well-structured rewards (contrastive RL, LLM-as-judge plus runtime) accelerate learning; sparse rewards stall progress.
- Small changes, big wins: A single swizzle or tile tweak can flip a kernel from memory-bound to compute-bound, unlocking large speedups.
- Data trails help: Capturing optimization trajectories (not just final code) teaches future models how to fix typical failure modes faster.
Overall: Across datasets like The Stack v2, HPC-Instruct, KernelBook, and real-world samples, the combination of SFT + RL + agentic loops reliably improves both correctness and speed, with early cases matching or beating hand-tuned baselines. Broader, more robust multi-hardware benchmarks are pushing the field toward practical, production-ready systems.
05 Discussion & Limitations
Limitations:
- Data scarcity: High-quality, hardware-aware kernel examples are rare; many corpora miss reasoning traces and optimization journeys.
- Hardware coverage: Most results concentrate on NVIDIA and limited shapes; AMD, NPUs, and TPUs still lag in training data and benchmarks.
- Infrastructure bottlenecks: Compiling and profiling many candidates is slow and costly, throttling RL and large search.
- Reliability: Agents can overfit to benchmarks, hallucinate APIs, or pass tests on fixed shapes but break in the wild.
Required resources:
- Access to multiple GPUs/NPUs for compile–run cycles; profilers like Nsight Compute; curated datasets (KernelBook, KernelBench samples); and retrieval-ready knowledge bases (CUDA Guide, PTX ISA, Triton docs).
- Orchestration frameworks for distributed, asynchronous evaluation to keep search loops fast.
When not to use:
- Tiny projects where a standard library already saturates hardware (cuBLAS/cuDNN/FlashAttention fit perfectly). The agent’s iteration overhead won’t pay off.
- Strictly regulated environments with no room for trial-and-error execution or where compilers/profilers are unavailable.
- Ultra-novel hardware with no docs or toolchains—retrieval and profiling won’t ground the LLM.
Open questions:
- How to scale trustworthy data: Can we automatically mine and verify large volumes of high-quality, cross-platform kernels and their optimization trajectories?
- Stronger reasoning: How do we turn scattered expert heuristics into structured knowledge that agents can query reliably?
- Formal guarantees: Can we blend formal verification with empirical profiling for both safety and speed claims?
- Portability: What abstractions and meta-prompts best transfer learned tactics across NVIDIA/AMD/NPUs/TPUs?
- Human–AI teamwork: What is the best mixed-initiative interface so experts steer constraints while agents do the heavy lifting?
Bottom line: The pieces—SFT, RL, RAG, profiling, and multi-agent orchestration—work, but making them robust, fast, and broadly portable is the next frontier.
06 Conclusion & Future Work
Three-sentence summary: This survey shows how LLMs plus agent loops can turn high-level operator intent into fast, correct GPU/accelerator kernels by drafting code, testing on real hardware, and iteratively improving. It organizes methods (SFT, RL, retrieval, profiling, multi-agent) and compiles datasets and benchmarks that enable fair progress tracking. It also maps out key challenges—data, infrastructure, evaluation, and collaboration—and points to where the field should go next.
Main achievement: A clear, structured roadmap of LLM-driven kernel generation—covering training strategies, agent designs, data/benchmark resources, and open problems—backed by a living GitHub index so researchers and practitioners can track rapid advances.
Future directions:
- Self-directed agents with dynamic memory and planning, not fixed workflows.
- Hardware-general skills via meta-prompts and cross-platform learning.
- Scalable, distributed “gym-like” environments to decouple model thinking from compile–run costs.
- Richer evaluations that stress shape generalization, robustness, and multi-vendor parity.
- Mixed-initiative tools where humans set constraints and agents explain and optimize.
Why remember this: As models get bigger and budgets tighter, kernels—not just chips—decide speed, cost, and energy. Automating kernel generation with LLMs and agents could unlock large, lasting efficiency wins across AI, making powerful systems faster, cheaper, and more accessible.
Practical Applications
- Auto-generate fast kernels for common DL ops (GEMM, attention, softmax) tailored to your exact shapes and dtypes.
- Fuse multiple PyTorch ops into a single custom kernel to remove overhead and improve locality.
- Port kernels across hardware (NVIDIA ↔ AMD ↔ NPUs) using retrieval-guided prompts and profiling-in-the-loop agents.
- Integrate an agentic kernel tuner into CI to test, profile, and accept only speedups that pass correctness gates.
- Use RL-based training to specialize an internal LLM on your workloads and devices, improving over time.
- Leverage swizzling and tiling recommendations from agent feedback to raise cache hit rates and occupancy.
- Build a kernel knowledge base (docs, examples, gotchas) and connect it with RAG so generations follow best practices.
- Adopt public benchmarks (KernelBench, TritonBench, FlashInfer-Bench) to track improvements and avoid overfitting.
- Capture and store optimization trajectories so future runs learn common fixes faster.
- Wrap Nsight or ROCm profilers with natural-language summaries to make performance diagnostics accessible to all engineers.