KernelEvolve: Scaling Agentic Kernel Coding for Heterogeneous AI Accelerators at Meta

Intermediate
Gang Liao, Hongsen Qin, Ying Wang et al. · 12/29/2025
arXiv · PDF

Key Summary

  • KernelEvolve is a smart, self-improving system that writes and tunes tiny but crucial programs (kernels) so AI runs fast on many kinds of chips.
  • It turns a weeks-long, expert-only process into hours by automating code generation, testing, profiling, and optimization.
  • The system uses a tree search (like exploring many paths in a maze) guided by a fitness score to keep only correct and faster kernels.
  • A universal operator with retrieval-augmented prompts adapts its strategy using a persistent knowledge base, including private details for Meta’s MTIA chips.
  • It supports multiple programming styles (like Triton and CuTe/TLX) and spans NVIDIA, AMD, and MTIA hardware.
  • On public and internal tests, it achieved 100% correctness across 160 PyTorch ATen operators on 3 platforms and all 250 KernelBench problems.
  • In real workloads, it delivers 1.25× to 17× speedups, especially by fusing steps to cut extra memory trips and by shape-specific tiling/autotuning.
  • By generating missing preprocessing kernels, it enables monolithic deployment and avoids 10–20 ms network overhead from disaggregated serving.
  • Unified profiling (Triton MPP, NCU, Proton, MTIA Insight) gives instruction-level clues so the system can fix bottlenecks precisely.
  • It scales evaluations via serverless (FaaS) hardware pools, so lots of candidates get tested in parallel without blocking local machines.

Why This Research Matters

If a single small step in an AI model is missing or slow on an accelerator, the whole system can be forced to split across machines, adding 10–20 ms of pure network delay that users feel. KernelEvolve automates writing and tuning those small steps across many chips, so more models can run end-to-end on one accelerator tier with lower latency. This directly improves experiences like ad relevance and feed ranking, which are highly sensitive to speed. It also cuts infrastructure costs by using hardware more efficiently and avoiding extra CPU tiers. By injecting private hardware knowledge at runtime, new accelerators like MTIA can be enabled faster without retraining giant models. In short: faster, cheaper, and more reliable AI services at worldwide scale.

Detailed Explanation

01 Background & Problem Definition

🍞 Hook: Imagine running a lemonade stand for a huge festival. Every second counts. If the person slicing lemons is slow or missing, the whole line backs up, customers leave, and you lose money.

🥬 The Situation (The World Before): In AI at Meta’s scale, tiny programs called kernels slice the “lemons”—they do the low-level math and data reshaping that every model needs. Ads ranking models run under strict time limits, across many different chips (NVIDIA GPUs, AMD GPUs, and Meta’s MTIA). There are also many different models and hundreds of data preprocessing steps (like bucketing, hashing, merging). Before this work, getting great performance meant hand-writing and hand-tuning thousands of kernels for each operator on each type of hardware—over and over again each hardware generation. That took experts weeks per kernel and didn’t scale.

🍞 Anchor: Just like a festival needs every station to run smoothly, AI needs every kernel fast and available on the target hardware to keep latency low and user experience great.

🍞 Hook: You know how some video games lag when your internet connection adds extra hops? That’s like AI systems when some steps can’t run on the accelerator and must hop to a different server.

🥬 The Problem: Missing or slow kernels—especially for data preprocessing—force a split (disaggregation) where part of the work runs on CPUs on other machines and the rest on accelerators. That adds 10–20 ms of pure network tax per request, which is huge when your budget is under 100 ms. Worse, a single missing operator can completely block deployment on new accelerators.

🍞 Anchor: If your smoothie shop must send fruit to a different building to get chopped and then bring it back, your drinks are late even if your blender is super fast.

🍞 Hook: Imagine your school has different classrooms with different rules. A trick that works in one class might not work in another.

🥬 Hardware Heterogeneity (why it’s hard): Different chips have different memory layouts, caches, and programming styles. Even generations of the same vendor can change a lot (e.g., H100 adds TMA and warp-group features). So code can’t just be copy‑pasted; it must be redesigned. Multiply this by the number of operators and you get a giant optimization space.

🍞 Anchor: What gets an A+ in Math class might not work in Science class—you have to adapt to each teacher’s rules.

🍞 Hook: Think of legos in many shapes. Some builds need long beams; others need tiny bricks.

🥬 Model and Kernel Diversity: Recommendation pipelines (retrieval → early ranking → late ranking) need very different kernels. Transformers for sequences add new patterns. Preprocessing has hundreds of operators, many with irregular memory access. Optimizing only GEMM (matrix multiply) isn’t enough—coverage for preprocessing often decides whether you can deploy at all.

🍞 Anchor: If your kit is missing tiny connectors, you can’t finish the model—no matter how nice your big pieces are.

🍞 Hook: Picture trying lots of recipes, but you’re only allowed to change one ingredient at a time and you can’t remember what worked last time.

🥬 Failed Attempts: Prior AI code agents could write some GPU kernels for benchmarks, but they: (1) focused on narrow problems, (2) tested on simple, fixed shapes, (3) mostly targeted one vendor, (4) lacked deep tool integration and memory of past attempts, (5) didn’t scale search to hundreds of iterations, and (6) had no checkpointing. So they weren’t production-ready.

🍞 Anchor: It’s like student chefs who can bake a cupcake at home, but can’t run a full bakery with different ovens and constant rush orders.

🍞 Hook: Imagine a super helper that learns the kitchen, knows each oven’s quirks, remembers past wins and fails, and tries lots of safe variations quickly.

🥬 The Gap Filled by This Paper: KernelEvolve is that helper. It automates kernel generation across chips, keeps a knowledge base of hardware rules (including private MTIA details), runs big searches guided by real profiling, and remembers everything so it gets better over time. It focuses not only on math-heavy ops but also on the many preprocessing kernels that unlock monolithic deployments.

🍞 Anchor: With a recipe book that updates itself and smart taste-testing, the kitchen serves faster meals to more people—reliably.

— New Concepts (Sandwich Explanations) —

🍞 You know how a school map helps you choose hallways to reach class fastest? 🥬 Graph-Based Search: It’s a way to explore many code ideas as nodes and keep the best. It works by: (1) picking promising versions, (2) generating improved ones, (3) testing speed and correctness, (4) repeating. Without it, we’d guess randomly and waste time. 🍞 Example: Trying 300 conv1d variants and keeping the quickest that stays correct.
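To make this concrete, here is a minimal Python sketch (not the paper's actual data structures) of how candidate kernels might be tracked as nodes in a search graph; the `KernelCandidate` and `SearchGraph` names are illustrative.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class KernelCandidate:
    """One node in the search graph: a kernel variant plus its evaluation."""
    source_code: str                  # generated Triton / CuTe / TLX source
    parent_id: Optional[int] = None   # node this variant was derived from
    fitness: float = 0.0              # speedup over the PyTorch baseline; 0 if incorrect
    notes: str = ""                   # profiler findings, compile errors, etc.

@dataclass
class SearchGraph:
    """The whole exploration: every attempt is kept, only the best is deployed."""
    nodes: List[KernelCandidate] = field(default_factory=list)

    def add(self, cand: KernelCandidate) -> int:
        self.nodes.append(cand)
        return len(self.nodes) - 1

    def best(self) -> KernelCandidate:
        # Incorrect kernels carry fitness 0, so they never win.
        return max(self.nodes, key=lambda n: n.fitness)
```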

🍞 Imagine asking a librarian for the right book just-in-time. 🥬 Retrieval-Augmented Prompting: The system pulls the most relevant hardware notes and code patterns into the LLM’s prompt right when needed. It does: (1) diagnose bottlenecks, (2) fetch matching guidance, (3) rewrite the next attempt. Without it, the model forgets crucial details. 🍞 Example: When an MTIA compile error appears, it auto-loads the MTIA docs to fix it.
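As a rough illustration of retrieval-augmented prompting (the real system queries a structured, per-platform index), the sketch below ranks knowledge-base snippets by keyword overlap with the profiler's diagnosis and folds them into the next prompt; the document contents and helper names are hypothetical.

```python
# Hypothetical knowledge-base snippets; the real base stores per-platform docs,
# including private MTIA material.
KNOWLEDGE_BASE = {
    "mtia/libdevice_gelu.md": "Use the SFU libdevice GELU call instead of a math-library GELU.",
    "nvidia/shared_memory.md": "Pad shared-memory tiles so consecutive threads hit different banks.",
    "general/fusion.md": "Keep intermediates on-chip and write the final result once.",
}

def retrieve(query: str, k: int = 2) -> list[str]:
    """Toy retrieval: rank docs by keyword overlap with the bottleneck diagnosis."""
    terms = set(query.lower().split())
    ranked = sorted(KNOWLEDGE_BASE.items(),
                    key=lambda kv: -len(terms & set(kv[1].lower().split())))
    return [text for _, text in ranked[:k]]

def build_prompt(kernel_src: str, diagnosis: str) -> str:
    """Fold the current code, the profiler diagnosis, and retrieved docs into one prompt."""
    guidance = "\n".join(retrieve(diagnosis))
    return (f"Current kernel:\n{kernel_src}\n\n"
            f"Observed bottleneck: {diagnosis}\n\n"
            f"Relevant hardware guidance:\n{guidance}\n\n"
            "Rewrite the kernel to remove the bottleneck while preserving correctness.")
```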

🍞 Think of a rulebook that says “No cheating!” 🥬 Persistent Knowledge Base: A long-term, organized folder of hardware constraints, tips, and examples. Steps: (1) store docs per platform, (2) index them, (3) retrieve on demand. Without it, the model guesses or relies on incomplete internet knowledge. 🍞 Example: Docs that say which MTIA libdevice call maps to fast SFU GELU.

🍞 Picture a fitness tracker score. 🥬 Fitness Function: A single number showing how good a kernel is (speedup over PyTorch, zero if incorrect). Steps: (1) run baseline, (2) run candidate, (3) compute ratio, (4) discard wrong ones. Without it, we can’t rank ideas. 🍞 Example: Score 2.3× for a conv1d that halves runtime versus baseline.
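A minimal sketch of the fitness idea, assuming a CUDA device, a candidate callable, and a PyTorch baseline; it gates on torch.allclose and times with CUDA events, which is a simplification of the paper's TritonBench-based harness.

```python
import torch

def time_ms(fn, *args, warmup=10, iters=100):
    """Average runtime of a CUDA callable in milliseconds, using CUDA events."""
    for _ in range(warmup):
        fn(*args)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

def fitness(candidate, baseline, *args, rtol=1e-3, atol=1e-3):
    """Speedup of the candidate over the PyTorch baseline; 0.0 if outputs disagree."""
    if not torch.allclose(baseline(*args), candidate(*args), rtol=rtol, atol=atol):
        return 0.0
    return time_ms(baseline, *args) / time_ms(candidate, *args)
```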

02 Core Idea

🍞 Hook: You know how a coach guides a team to try lots of plays, watches what works, studies the playbook, then tries better plays next time?

🥬 The “Aha!” in one sentence: Treat kernel writing like a guided, memory-full, play-by-play search—using an AI coach that retrieves the right playbook pages (hardware knowledge) at the right time, tests plays on the field (profiling), and keeps only the winners.

🍞 Anchor: That’s KernelEvolve: a coach that automates generating, testing, and improving kernels across very different chips.

Multiple Analogies (3 ways)

  1. Chef’s Kitchen: The AI chef tries many tiny changes to a recipe, tastes (benchmarks) each, reads the right cookbook page (retrieval), and serves the tastiest, safest dish (correctness first).
  2. GPS Navigator: It explores many routes (kernels), checks traffic (profiling), and reroutes quickly using a map (knowledge base).
  3. Science Fair: Make a hypothesis (candidate), run an experiment (evaluation), study notes (retrieval), and repeat until you get prize-winning results (speed + correctness).

Before vs After

  • Before: Manual, weeks-long expert tuning per operator per hardware; missing preprocessing kernels forced slow, split systems.
  • After: Automated search that generates correct kernels in hours, with coverage across NVIDIA/AMD/MTIA, enabling monolithic deployments and double-digit speedups.

Why It Works (intuition, not equations)

  • Feedback loops beat guessing: profiling and correctness checks give precise signals to climb toward faster code.
  • Memory matters: a persistent knowledge base injects hardware reality (like MTIA-specific instructions) that models didn’t learn during pretraining.
  • Big search finds hidden wins: exploring many tilings, fusions, and launch configs discovers combinations humans might skip.
  • Unified tooling removes blind spots: instruction-level profiling shows where time really goes, so fixes match true bottlenecks.

Building Blocks (Sandwich explanations)

🍞 Imagine one Swiss‑army tool that changes shape as needed. 🥬 Universal Operator: One adaptive generator that writes, debugs, and optimizes code by changing its prompting based on context. Steps: (1) gather errors & profiles, (2) retrieve matching docs, (3) craft the next prompt, (4) propose a better kernel. Without it, we need fixed, siloed operators that miss context. 🍞 Example: It sees bank conflicts, retrieves shared‑memory tips, and rewrites the kernel to fix them.

🍞 Think of picking which plants to water first in a big garden. 🥬 Selection Policy: A rule to choose which candidate kernels to try next (greedy, MCTS, evolutionary). Steps: (1) score nodes, (2) balance explore/exploit, (3) expand best branches. Without it, search wastes time on weak ideas. 🍞 Example: MCTS keeps a few diverse paths while favoring faster ones.
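Here is one possible selection policy, a simple epsilon-greedy rule over the fitness scores of the candidate nodes sketched earlier; the paper also uses MCTS and evolutionary variants, which are not shown.

```python
import random

def select_parents(nodes, num_parents=4, epsilon=0.25):
    """Epsilon-greedy selection: mostly expand the fastest correct kernels,
    but occasionally pick a random node to keep the search diverse."""
    alive = [n for n in nodes if n.fitness > 0.0] or list(nodes)
    ranked = sorted(alive, key=lambda n: n.fitness, reverse=True)
    parents = []
    for _ in range(min(num_parents, len(ranked))):
        if random.random() < epsilon:
            parents.append(random.choice(ranked))               # explore
        else:
            parents.append(ranked[len(parents) % len(ranked)])  # exploit the leaders
    return parents
```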

🍞 Picture timing every runner with the same stopwatch. 🥬 Unified Profiling (Triton MPP + tools): A standardized way to collect low‑level, reliable performance data across layers. Steps: (1) instrument, (2) run minimally invasive probes, (3) synthesize structured traces. Without it, we’d misread slowdowns. 🍞 Example: Seeing async copies and Tensor Core compute overlapping properly versus stalling.
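Triton MPP, NCU, Proton, and MTIA Insight are vendor or internal tools, so as a stand-in this sketch scripts the timeline layer with PyTorch's public profiler to show how a candidate's kernel-level costs can be surfaced automatically.

```python
import torch
from torch.profiler import ProfilerActivity, profile

def profile_candidate(fn, *args, iters=20):
    """Return a kernel-level timing table for one candidate (requires a CUDA device)."""
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        for _ in range(iters):
            fn(*args)
        torch.cuda.synchronize()
    # The sorted table shows which kernels dominate, e.g. layout conversions
    # costing more than the convolution itself, as in the conv1d case study.
    return prof.key_averages().table(sort_by="cuda_time_total", row_limit=10)
```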

🍞 Imagine a custom dictionary for a secret language. 🥬 MTIA Knowledge Injection: Add MTIA-specific docs (SFUs, inter‑PE ops, dual‑core sync) into retrieval so the model can write correct kernels for a platform it never learned in training. Without it, the model writes GPU‑centric code that fails. 🍞 Example: Swapping a math GELU for a fast SFU libdevice GELU on MTIA.

🍞 Think of trying shirt sizes before buying. 🥬 Autotuning: Systematically tests parameter sets (block sizes, warps, stages, MTIA cb_multiplier) to pick the fastest. Steps: (1) generate grid, (2) run quick benchmarks, (3) record best per shape. Without it, we’d lock into a mediocre fit. 🍞 Example: BLOCK_SIZE 512 beats 256 for a given conv1d shape.
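Autotuning is built into Triton via the @triton.autotune decorator; the toy vector-add below (not one of the paper's kernels) shows the mechanism of benchmarking several block-size/warp configurations and caching the winner per problem size.

```python
import torch
import triton
import triton.language as tl

@triton.autotune(
    configs=[
        triton.Config({"BLOCK_SIZE": 256}, num_warps=4),
        triton.Config({"BLOCK_SIZE": 512}, num_warps=4),
        triton.Config({"BLOCK_SIZE": 1024}, num_warps=8),
    ],
    key=["n_elements"],  # re-tune (and cache the winner) per distinct problem size
)
@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n)  # BLOCK_SIZE is chosen by the autotuner
    return out
```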

🍞 Like packing your backpack so the most-used items sit on top. 🥬 Tiling & Fusion: Cut data into tiles that fit in fast on‑chip memory and fuse operations to avoid extra memory trips. Steps: (1) pick tile sizes, (2) keep intermediates on‑chip, (3) write once at the end. Without it, memory traffic dominates. 🍞 Example: Fusing X^T Y and X·(X^T Y) in WuKong’s FM to skip writing the intermediate.
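To see why fusion pays off, consider the WuKong-style FM pattern in plain PyTorch; the shapes are illustrative, and the fused variant is only described in comments since the paper generates it as a Triton kernel.

```python
import torch

def fm_unfused(X: torch.Tensor, Y: torch.Tensor) -> torch.Tensor:
    """Two separate launches: the (d x d) intermediate XtY is written out by the
    first matmul and read back from global memory by the second."""
    XtY = X.transpose(-2, -1) @ Y   # launch 1: materializes the intermediate
    return X @ XtY                  # launch 2: re-reads it

# A fused kernel (as KernelEvolve emits in Triton) computes both products in one
# launch, tiling X and Y so XtY stays in on-chip memory and only the final result
# is written once; that is the source of the 2-4x wins reported for N <= 64.

X = torch.randn(64, 32)   # illustrative (N, d) shape
Y = torch.randn(64, 32)
out = fm_unfused(X, Y)    # shape (N, d)
```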

🍞 You know how teachers let students re‑submit improved work? 🥬 Checkpointed, Distributed Search: Every attempt is saved; many agents can try different paths in parallel and resume after failures. Steps: (1) log all nodes, (2) parallelize expansions, (3) restart from last good state. Without it, long runs are brittle and slow. 🍞 Example: Hundreds of conv1d variants explored safely over hours, not lost on a crash.
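A minimal sketch of checkpointing, assuming a simple JSONL log of every evaluated candidate; the file layout and helper names are assumptions, not the paper's implementation.

```python
import json
from pathlib import Path

CKPT = Path("search_checkpoint.jsonl")  # hypothetical location

def log_node(node_id: int, source_code: str, fitness: float) -> None:
    """Append every evaluated candidate so a crashed run loses nothing."""
    with CKPT.open("a") as f:
        f.write(json.dumps({"id": node_id, "code": source_code, "fitness": fitness}) + "\n")

def resume() -> list[dict]:
    """Rebuild the search state from the last good checkpoint."""
    if not CKPT.exists():
        return []
    return [json.loads(line) for line in CKPT.read_text().splitlines() if line]
```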

03 Methodology

High-Level Recipe: Input spec → State machine with tree search → Dynamic prompt + knowledge retrieval → Generate kernel(s) → Run correctness + profiling on target hardware → Update memory and scores → Repeat until done → Deploy
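Tying the recipe together, a heavily simplified orchestration loop might look like the following; it reuses the sketches from the previous sections (SearchGraph, KernelCandidate, select_parents, build_prompt, fitness, profile_candidate, log_node), while `generate_variant`, `compile_kernel`, and the `spec` object are hypothetical placeholders for the LLM call, the Triton build step, and the input specification.

```python
def kernel_evolve(spec, baseline, args, budget=300, target_speedup=2.0):
    """Simplified search loop: select, generate, evaluate, remember, repeat.
    Error handling for compile failures is omitted for brevity."""
    graph = SearchGraph()
    graph.add(KernelCandidate(source_code=spec.reference_source, fitness=1.0))
    for _ in range(budget):
        for parent in select_parents(graph.nodes):
            prompt = build_prompt(parent.source_code, parent.notes)
            candidate_src = generate_variant(prompt)      # hypothetical LLM call
            kernel_fn = compile_kernel(candidate_src)     # hypothetical Triton build step
            score = fitness(kernel_fn, baseline, *args)
            notes = profile_candidate(kernel_fn, *args) if score > 0 else "failed correctness"
            node_id = graph.add(KernelCandidate(candidate_src, fitness=score, notes=notes))
            log_node(node_id, candidate_src, score)       # checkpoint every attempt
        if graph.best().fitness >= target_speedup:        # stop by target, else by budget
            break
    return graph.best()
```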

Step-by-step (with Sandwich explanations and examples)

  1. Input and Setup 🍞 Hook: Imagine a cooking show where you’re given the dish name and pantry items. 🥬 What happens: The system receives a kernel spec (operator type, shapes, platform), then prepares a workspace and baseline PyTorch reference. Why it exists: We need a ground truth output and a clear target to compare against. Example: Conv1d with (B=2048, Cin=96, Cout=96, L=200) on H100. 🍞 Anchor: Now we can taste-test any recipe we produce.

  2. Tree Search (State Machine) 🍞 Hook: Think of a branching maze where you try many doors but keep track of the best halls. 🥬 What happens: The system maintains a graph of candidates. Each iteration selects nodes (greedy/MCTS/evolutionary), applies the universal operator to create new variants, and computes a fitness score (speedup; 0 if incorrect). Why it exists: Systematic exploration beats random guessing. Example: 300 conv1d variants explored; best score rises steadily. 🍞 Anchor: Like moving from slow sidewalks to faster express lanes.

  3. Universal Operator with Dynamic Prompting 🍞 Hook: Imagine a single multi-tool that swaps heads based on what you’re fixing. 🥬 What happens: The operator reads errors, profiles, and prior attempts; the context memory agent summarizes bottlenecks; the deep search agent retrieves matching docs; then a customized prompt is built to generate the next kernel. Why it exists: Fixed prompts can miss real, current bottlenecks. Example: A bank conflict triggers shared-memory layout guidance retrieval. 🍞 Anchor: The tool adapts its tip to tighten the right screw.

  4. Knowledge Base Retrieval 🍞 Hook: Like a librarian who brings the exact book chapter you need. 🥬 What happens: Structured folders hold constraints, general guidance, and platform-specific docs (NVIDIA/AMD/MTIA). The system queries the index to pull in precisely relevant pages. Why it exists: LLMs don’t memorize private MTIA rules; we must supply them. Example: Pulling MTIA libdevice GELU docs and inter‑PE broadcast notes. 🍞 Anchor: The right manual page shows the correct lever to pull.

  5. Code Generation Across Abstractions 🍞 Hook: Think of writing with different pens for different paper types. 🥬 What happens: The system primarily targets Triton (portable), with optional low‑level control via TLX/CuTe on NVIDIA, and MTIA‑specific Triton extensions. Why it exists: Portability + the ability to dive deeper when needed. Example: H100 warp‑group tiling via TLX; MTIA SFU calls via libdevice. 🍞 Anchor: Same essay, but sometimes you need a fine-tip pen for small print.

  6. Automated Evaluation (Correctness First) 🍞 Hook: A referee checks the rules before timing the race. 🥬 What happens: TritonBench runs PyTorch baseline and candidate, compares outputs (torch.allclose), and only then measures speedup. Why it exists: Fast but wrong is useless. Example: ATen suite: 160 ops × 3 platforms, 100% correct. 🍞 Anchor: You don’t award medals if someone runs outside the track.

  7. Deep Profiling and Feedback 🍞 Hook: Imagine a super slow‑mo camera showing exactly where time is lost. 🥬 What happens: Torch Profiler (timeline), NCU & Proton via Triton MPP (GPU instruction‑level), and MTIA Insight (PE/DPE/SFU/MLU, caches, bandwidth) collect structured signals. Why it exists: To locate memory bottlenecks, stalls, occupancy issues. Example: Seeing that layout conversions cost more than the convolution itself → fuse away conversions. 🍞 Anchor: It’s easier to fix a leak when you see the exact drip.

  8. JIT Debugging for MTIA (C++ Emission & Replay) 🍞 Hook: Like viewing the engine’s internal parts. 🥬 What happens: MTIA-Triton can emit C++ for the compiled kernel; the system edits and replays it quickly to test fixes without full recompilation. Why it exists: Rapid, precise diagnosis and correction of low-level issues. Example: Adjusting circular buffer pointers or dual‑core sync without waiting minutes to rebuild. 🍞 Anchor: Pop the hood, tweak, and test in seconds.

  9. Interpreter Environments & Continuous Deployment 🍞 Hook: Think of ready-to-use labs where all tools are pre-installed. 🥬 What happens: Bento-based interpreters (GPU/AMD/MTIA) bundle compilers, runtimes, and profilers. Conveyor auto-rebuilds them as dependencies change. Why it exists: Reproducible, up-to-date runs at scale. Example: New Triton backend version rolls out; all evaluations use it immediately. 🍞 Anchor: Every lab bench has clean, calibrated instruments waiting.

  10. Evaluation Code Generation (Static, Reliable Harness) 🍞 Hook: Like standardized test forms so every student is graded fairly. 🥬 What happens: A static generator turns kernels into profiler-ready scripts for each tool, keeping measurement consistent. Why it exists: Consistency across thousands of candidates avoids noisy, ad‑hoc setups. Example: Same Torch Profiler/NCU/Proton harness across platforms and shapes. 🍞 Anchor: Everyone takes the same test under the same rules.

  11. FaaS-Based Remote Evaluation (Scale-out) 🍞 Hook: Picture sending homework to a big grading center with many teachers. 🥬 What happens: Kernel tests run on remote GPU/MTIA pools via serverless functions, while local agents keep generating. Why it exists: Avoids local device bottlenecks; increases throughput dramatically. Example: Hundreds of candidates profiled in parallel instead of waiting for 8 local GPUs. 🍞 Anchor: Many graders finish the stack faster than one.

  12. Termination & Deployment 🍞 Hook: Like ending practice once the team repeatedly hits their best times. 🥬 What happens: Search stops by budget, progress, or target score. Best kernels are shipped with shape-aware dispatch and safe fallbacks to vendor libraries. Why it exists: Stable, practical gains without regressions. Example: Use fused conv1d on target shapes; fall back to cuDNN conv2d on out-of-distribution shapes. 🍞 Anchor: Use your championship play only in the games where it wins.

Secret Sauce (what’s truly clever)

  • The universal operator + retrieval makes prompts reflect live profiler truth, not static templates.
  • A persistent knowledge base injects missing, private platform knowledge (MTIA), unlocking correctness and speed.
  • Unified profiling (Triton MPP) gives instruction-level signals the agent can act on.
  • Checkpointed, distributed, serverless evaluation scales inference-time search to production size.
  • Treating preprocessing as first-class enables monolithic serving and eliminates 10–20 ms network tax.

04 Experiments & Results

The Tests (what, why)

  • Breadth/Correctness: 160 PyTorch ATen operators × 3 platforms (NVIDIA H100, AMD MI350, MTIA v3) to prove coverage; plus KernelBench’s 250 problems across levels. Why: If you can’t get basics correct everywhere, you can’t deploy.
  • Depth/Performance: Real production cases—conv1d in convolutional transformers, WuKong’s Optimized FM, InterFormer’s PFFN, and preprocessing ops (MapId, MBDT, Batch Event Truncate). Why: Show real speedups that matter to latency and TCO.

The Competition (baselines)

  • PyTorch native ops (conv1d, etc.) and torch.compile paths.
  • Vendor-optimized workarounds, e.g., conv2d (NHWC channels_last) mapping to cuDNN’s implicit GEMM.

Scoreboard (with context)

  • Correctness: 100% pass across all 480 ATen operator–platform pairs and 100% on all 250 KernelBench tasks. That’s like acing every quiz across three different classrooms.
  • Speedups (headline): 1.25×–17× across diverse workloads. That’s like jumping from a B to solid A/A+ or even valedictorian for some kernels.
  • Conv1d (production shape 2048×96×96×200, FP16): 2.30× faster than conv1d baseline and 1.62× faster than the optimized conv2d workaround on H100. On MTIA v3, up to 6.54× over conv1d and 4.71× over conv2d. Why: Kernel fusion avoids multiple layout conversions and cuts global memory traffic.
  • WuKong Optimized FM: 2–4× on typical shapes (N ≤ 64), by fusing X^T Y and X·(X^T Y) so the intermediate stays on‑chip.
  • InterFormer PFFN: Gains by fusing BMM + GELU + RMSNorm variants and by shape‑specific tiling.
  • Preprocessing Ops: MapId ≈ 4.1×, MBDT ≈ 9.3×, Batch Event Truncate ≈ 9.8×. These wins are doubly important: they speed things up and also enable monolithic serving by eliminating CPU‑tier preprocessing.

Surprising/Notable Findings

  • Preprocessing kernels can be kingmakers: a single missing one forces a whole architecture change (extra 10–20 ms network hop). Treating them as first-class targets is crucial.
  • Knowledge injection for MTIA is transformative: without private docs, LLMs write GPU‑style code that fails; with docs, they emit correct, fast MTIA kernels (e.g., SFU GELU, inter‑PE ops, dual‑core sync).
  • Fusion often beats raw compute: even if a vendor GEMM is blazing fast, avoiding extra reads/writes and layout conversions wins end‑to‑end.
  • Inference‑time scaling matters: exploring hundreds of candidates systematically (with checkpointing) reaches expert‑level implementations that manual loops would miss under time pressure.

Sandwich Explanations for Key Result Concepts

🍞 Hook: Carrying groceries in one trip is faster than several back‑and‑forths. 🥬 Operator Fusion: Combine multiple steps into one kernel to avoid extra memory trips. Steps: (1) analyze chain, (2) keep intermediates on‑chip, (3) write final once. Without it, you pay the price of each separate journey. 🍞 Anchor: Fusing conv1d stages to skip layout-conversion kernels.

🍞 Hook: Trying different shoe sizes to find the best fit. 🥬 Shape-Aware Dispatch: Use the fastest kernel for the current input shape, with safe fallbacks. Steps: (1) benchmark per shape, (2) record winners, (3) dispatch at runtime. Without it, one-size-fits-all leaves performance on the table. 🍞 Anchor: Use the fused conv1d for 2048×96×96×200; fall back to cuDNN for odd shapes.
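A minimal sketch of shape-aware dispatch, assuming a registry keyed on the conv1d problem shape with a vendor-library fallback; `fused_conv1d` is a placeholder for a KernelEvolve-generated kernel, not a real API.

```python
import torch
import torch.nn.functional as F

def fused_conv1d(x, weight, bias=None):
    # Placeholder for a KernelEvolve-generated, shape-specialized Triton kernel.
    return F.conv1d(x, weight, bias)

# Winners recorded during offline benchmarking, keyed by (B, C_in, C_out, L).
_DISPATCH = {
    (2048, 96, 96, 200): fused_conv1d,   # the production shape from the case study
}

def conv1d_dispatch(x, weight, bias=None):
    """Use the shape-tuned kernel when the input matches a benchmarked shape,
    otherwise fall back to the vendor-library path."""
    key = (x.shape[0], x.shape[1], weight.shape[0], x.shape[2])
    kernel = _DISPATCH.get(key, F.conv1d)
    return kernel(x, weight, bias)
```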

🍞 Hook: A magnifying glass reveals exactly where the ant trail goes. 🥬 Instruction-Level Profiling: Collect fine-grained timings and stalls to guide fixes. Steps: (1) minimally instrument, (2) attribute costs, (3) change code accordingly. Without it, you fix the wrong thing. 🍞 Anchor: Seeing TMA copies overlap with compute versus stall on H100 led to better pipelining.

05 Discussion & Limitations

Limitations (be specific)

  • Out-of-distribution shapes: A kernel specialized for common production shapes can underperform on rare or new shapes. Shape-aware dispatch mitigates this but adds maintenance.
  • Search cost and hardware time: Large search runs need many evaluations; even with FaaS, they consume accelerator cycles and scheduling time.
  • Knowledge upkeep: The persistent knowledge base must be curated as hardware evolves (new MTIA versions, AMD/NVIDIA features), or quality drifts.
  • Non-transferable tricks: An optimization for H100 (e.g., TLX warp specialization) may not help on AMD or MTIA; per-platform variants increase artifact count.
  • Debug complexity: While tooling is unified, interpreting multi-layer signals (IR, SASS, counters) still requires good heuristics in the agents.

Required Resources

  • Access to target accelerators (NVIDIA, AMD, MTIA) via interpreter environments or FaaS pools.
  • Up-to-date toolchains (Triton backends, profilers like Triton MPP, NCU, Proton, MTIA Insight).
  • Storage for metadata/object stores and the knowledge base.
  • Budget for search time (hours) per operator, especially for high-impact kernels.

When NOT to Use

  • Tiny, rarely called ops where PyTorch or vendor libraries are already optimal and the search cost won’t pay back.
  • Extremely dynamic operator graphs whose shapes change wildly per request (hard to specialize; fallback libraries may suffice).
  • Strictly regulated contexts where codegen artifacts can’t be audited or profiled as required (compliance constraints).

Open Questions

  • Can the universal operator learn an even better internal policy from accumulated histories (meta-optimization) without overfitting to past platforms?
  • How to auto-detect fusion opportunities safely across model graphs at scale while preserving numerical stability?
  • Can we generalize MTIA-style knowledge injection to new, unseen accelerators fast enough for day‑zero enablement?
  • What’s the best balance between portability (single Triton) and peak performance (platform‑specific TLX/CuTe/MTIA intrinsics) for long-term maintenance?
  • How to include energy/cost in the fitness function to jointly optimize latency and TCO?

06 Conclusion & Future Work

3-Sentence Summary
KernelEvolve turns kernel development from a manual art into an automated, search-driven service that retrieves the right hardware knowledge at the right time, tests many candidates, and keeps only fast, correct ones. By covering not only compute-heavy ops but also the many preprocessing operators, it enables monolithic accelerator deployments and avoids 10–20 ms network penalties from disaggregated serving. Deployed across NVIDIA, AMD, and MTIA, it achieves 100% correctness on broad suites and up to 17× speedups on real workloads.

Main Achievement
A production-grade, agentic kernel coding framework—with a universal operator, retrieval-augmented prompts, unified profiling, and checkpointed distributed search—that reliably generates high-performance kernels across heterogeneous accelerators in hours instead of weeks.

Future Directions

  • Add energy/TCO-aware fitness and multi-objective optimization.
  • Expand zero-day enablement for new accelerators via rapid knowledge injection.
  • Automate graph-level fusion discovery and safe integration with torch.compile.
  • Learn from history (meta-learning) to warm-start search with smarter priors.
  • Tighten reliability guarantees (formal checks) for numerical stability and determinism.

Why Remember This
It shows how to scale expert-level kernel performance across fast-changing hardware by combining agentic AI, live profiling, and an evolving knowledge base—turning a fragile, weeks-long craft into a robust, hours-long pipeline that directly saves latency and cost in real products.

Practical Applications

  • Enable monolithic serving by auto-generating missing preprocessing kernels on new accelerators.
  • Accelerate sequence-heavy models (Transformers) by fusing attention, normalization, and MLP steps.
  • Speed up convolutional transformers by removing layout conversions and fusing conv1d stages.
  • Optimize recommendation operators (e.g., WuKong Optimized FM) via on-chip tiling and fusion.
  • Provide shape-aware dispatch that picks the best kernel per input distribution with safe fallbacks.
  • Port kernels across NVIDIA, AMD, and MTIA with platform-specific knowledge injection.
  • Use unified profiling (Triton MPP, NCU, Proton, MTIA Insight) to identify true bottlenecks automatically.
  • Scale evaluations via FaaS so hundreds of kernel candidates can be tested in parallel.
  • Reduce developer time from weeks to hours for new operator–hardware combinations.
  • Maintain a growing knowledge base that continually improves future kernel generations.
#KernelEvolve · #agentic kernel coding · #graph-based search · #retrieval-augmented prompting · #Triton · #CuTe · #TLX · #MTIA · #operator fusion · #autotuning · #tiling · #Triton MPP · #heterogeneous accelerators · #serverless evaluation · #knowledge injection