PACEvolve: Enabling Long-Horizon Progress-Aware Consistent Evolution
Key Summary
- PACEvolve is a new recipe that helps AI agents improve their ideas step by step over long periods without getting stuck.
- It fixes three big problems: messy memories (context pollution), getting stuck on one kind of idea (mode collapse), and poor teamwork across parallel searches (weak collaboration).
- It organizes past attempts into a tidy 'idea library' and prunes low-value clutter so the model thinks more clearly.
- It watches progress with a simple, scale-aware momentum signal and backtracks to an earlier point when progress stalls.
- It lets parallel agents decide when to backtrack or borrow ideas from a stronger teammate using a self-adaptive sampling policy.
- On Symbolic Regression (LLM-SR), PACEvolve reached Log10 NMSE as low as -8.24, beating all baselines and lowering variance across runs.
- On KernelBench, it produced faster GPU kernels than prior methods on most tested tasks, with up to 17.38x speedup over PyTorch for LayerNorm.
- On Modded NanoGPT, it still found measurable speedups even after 40 rounds of strong human optimizations, showing real engineering value.
- Ablation tests show each piece (organized context, momentum backtracking, adaptive collaboration) matters, and together they deliver the best, most consistent results.
- PACEvolve offers a principled, progress-aware framework for long-horizon, self-improving LLM agents in science and engineering.
Why This Research Matters
PACEvolve makes AI agents better long-term problem solvers. It helps them remember the right things, notice when progress truly stalls, and share breakthroughs between parallel teams at the perfect moments. That means faster discovery in science (like finding formulas), better code for speed (like faster GPU kernels), and smarter engineering tweaks (like shaving precious seconds off model training). It also reduces wasted compute by avoiding repeated mistakes and by escaping dead-ends promptly. Most importantly, it turns ad hoc trial-and-error into a steady, reliable process that scales to tougher, real-world challenges.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine your classroom trying to build the tallest LEGO tower. If everyone keeps adding blocks randomly and never cleans up old designs, the pile becomes a mess, people copy the same idea over and over, and nobody notices when progress slows down. That’s how many AI searches used to feel.
🥬 The Concept: Evolutionary Algorithms (EAs)
- What it is: An evolutionary algorithm is a way to find great solutions by making many small changes, keeping the best ones, and mixing ideas—like evolution in nature.
- How it works:
- Make many candidate solutions (like many LEGO tower shapes).
- Test each one and score how good it is.
- Keep the best, slightly change them (mutate), and sometimes mix two good ones (crossover).
- Repeat until you find something great.
- Why it matters: Without evolution, you might stop after the first okay idea or waste time on random guesses, never reaching the really good stuff. 🍞 Anchor: Designing a paper airplane—try many folds, keep the best-flying design, tweak wings or nose, and repeat until it sails far.
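To make the loop concrete, here is a minimal sketch of a classic evolutionary algorithm in Python. The candidate representation (a short list of numbers), the toy fitness function, and all parameter values are illustrative assumptions, not details from the paper.

```python
import random

def fitness(candidate):
    # Toy objective (an assumption for this sketch): values close to 1.0 score higher.
    return -sum((x - 1.0) ** 2 for x in candidate)

def mutate(candidate, scale=0.1):
    # Small random tweak to one position of the candidate.
    child = list(candidate)
    i = random.randrange(len(child))
    child[i] += random.gauss(0.0, scale)
    return child

def crossover(a, b):
    # Mix two good parents position by position.
    return [random.choice(pair) for pair in zip(a, b)]

def evolve(pop_size=20, dims=5, generations=200):
    population = [[random.uniform(-2.0, 2.0) for _ in range(dims)] for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        survivors = ranked[: pop_size // 2]               # keep the best
        children = []
        while len(survivors) + len(children) < pop_size:  # refill with tweaks and mixes
            if random.random() < 0.5:
                children.append(mutate(random.choice(survivors)))
            else:
                children.append(crossover(*random.sample(survivors, 2)))
        population = survivors + children
    return max(population, key=fitness)

print(evolve())  # repeat until something great (or the budget) is reached
```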
🥬 The Concept: Large Language Models (LLMs) in Evolution
- What it is: LLMs are smart text tools that can read your history and suggest smarter next steps instead of random changes.
- How it works:
- Read past tries and results.
- Suggest improved code, math formulas, or parameters.
- Test and record the outcome.
- Use the new result to suggest the next try.
- Why it matters: LLMs can reason with context, so they need fewer guesses and can learn from failure. 🍞 Anchor: Like a coach who reviews past games and gives the team smarter plays for the next match.
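The same loop with an LLM guiding the search might look roughly like the sketch below. `propose_with_llm` is a hypothetical stand-in for a real model call; the point is simply that each new try is conditioned on the recorded history instead of being a random tweak.

```python
import random

def propose_with_llm(history):
    # Placeholder for a real LLM call. An actual system would format `history`
    # into a prompt and ask the model for an improved candidate; here we just
    # perturb the best candidate seen so far so the sketch runs end to end.
    if not history:
        return [random.uniform(-2.0, 2.0) for _ in range(5)]
    best, _ = max(history, key=lambda pair: pair[1])
    return [x + random.gauss(0.0, 0.1) for x in best]

def llm_guided_search(evaluate, budget=100):
    history = []                                # (candidate, score) pairs
    best_candidate, best_score = None, float("-inf")
    for _ in range(budget):
        candidate = propose_with_llm(history)   # read past tries, suggest the next step
        score = evaluate(candidate)             # test the suggestion
        history.append((candidate, score))      # record the outcome for the next round
        if score > best_score:
            best_candidate, best_score = candidate, score
    return best_candidate, best_score

# Example with a toy scoring function (higher is better).
best, score = llm_guided_search(lambda c: -sum((x - 1.0) ** 2 for x in c))
```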
🥬 The Concept: Context Pollution
- What it is: Context pollution is when the agent’s memory gets flooded with lots of failed attempts, which bias the LLM to repeat weak ideas.
- How it works:
- The agent stores summaries of many past tries.
- Wins are rare; failures are common.
- The memory fills with low-signal junk.
- The LLM keeps seeing the same flawed patterns and generates similar ideas.
- Why it matters: With a noisy memory, future ideas get worse, exploration shrinks, and the search slows down or stops improving. 🍞 Anchor: If your notebook is packed with wrong math steps and no highlights, you’ll likely repeat the same mistakes when studying.
🥬 The Concept: Mode Collapse
- What it is: Mode collapse happens when the agent keeps picking the same style of idea and won’t explore different directions.
- How it works:
- The agent likes ideas that resemble its history.
- It exploits small improvements near what it knows.
- It avoids big leaps to new regions that might be better.
- It gets stuck in a local minimum.
- Why it matters: Without diversity, you might miss the best solution because you never look far enough. 🍞 Anchor: Always ordering the same pizza topping—you never discover that a different combo is your true favorite.
🥬 The Concept: Weak Collaboration (in multi-island setups)
- What it is: Weak collaboration is when parallel searches don’t help each other at the right time or in the right way.
- How it works:
- Multiple agents search in parallel (like teams on different islands).
- Old systems copy winners on a fixed schedule.
- Timing is off: sometimes you copy too early or too late.
- Opportunities to share breakthroughs are missed.
- Why it matters: Without smart teamwork, you waste compute and progress slower. 🍞 Anchor: Two study groups never compare notes until the end—both miss shortcuts the other already found.
The World Before: Classical EAs used random tweaks and simple rules, which often needed huge numbers of tries. LLMs promised to be smarter by reading history and reasoning. But early LLM-in-the-loop systems were unstable: they stored too much noisy history (context pollution), clung to one idea family (mode collapse), and shared poorly across parallel searches (weak collaboration).
The Problem: How do we design a reliable agent scaffold that keeps context clean, explores broadly without getting stuck, and coordinates multiple searches well over long horizons?
Failed Attempts: Prior systems mostly summarized history and did periodic crossovers. But summaries still propagated biases; fixed-schedule resets and crossovers ignored whether progress was actually stalling or surging; and parallel teams didn’t adaptively decide when to backtrack versus collaborate.
The Gap: A principled, progress-aware controller was missing—one that manages memory quality, monitors real progress, and adapts actions (backtrack or crossover) based on that progress, not on a calendar.
Real Stakes: This matters for science, engineering, and coding—like designing faster GPU kernels, discovering equations in physics, or speeding up model training. Cleaner thinking, timely pivots, and smart teamwork can turn weeks of trial-and-error into days—and can surface ideas humans might miss.
02 Core Idea
🍞 Hook: You know how in a long group project, you need three things to succeed: clean notes, knowing when to undo a bad path, and smart sharing between teams? That’s exactly what AI searches need too.
🥬 The Concept: PACEvolve (Progress-Aware Consistent Evolution)
- What it is: PACEvolve is a framework that keeps AI evolution clean, alert, and collaborative so it can steadily improve for a long time.
- How it works:
- Keep a tidy, layered idea library and prune low-value clutter (Hierarchical Context Management).
- Track a scale-aware momentum of progress and backtrack when improvement stalls (Momentum-Based Backtracking).
- In parallel runs, adaptively choose backtracking or crossover with the best teammate using a fair, global progress score (Collaborative Evolution Sampling).
- Why it matters: Without this trio, LLM agents often get stuck, repeat themselves, and fail to cooperate when it counts. 🍞 Anchor: Like a sports team with organized playbooks, timeouts when the offense sputters, and perfect passes between players when another teammate is open.
The Aha! Moment in one sentence: Treat progress as the north star—manage memory to keep signal high, measure improvement with a scale-aware momentum, and let that momentum decide when to backtrack or collaborate across teams.
Three Analogies:
- Library + Compass + Walkie-Talkie: A clean library of ideas (HCM), a compass that shows whether you’re moving fast or slow (momentum), and walkie-talkies that help teams share the best route at the right time (CE).
- Garden + Weather + Neighbor Help: Prune weak branches (HCM), watch weather trend (momentum) to know when to replant (backtrack), and trade seeds with neighbors when they’ve grown something better (CE).
- Study Notes + Quiz Streak + Study Buddies: Keep neat notes (HCM), watch if your quiz scores stop improving (momentum), then either go back to basics (backtrack) or borrow tips from the top student (crossover).
Before vs After:
- Before: Agents hoarded messy history, clung to local ideas, and collaborated on a timer. Progress was bumpy and often stalled.
- After: Agents keep history high-signal, detect stalls early with momentum, and share breakthroughs exactly when useful. Progress becomes steadier and more consistent across runs.
🥬 The Concept: Hierarchical Context Management (HCM)
- What it is: A clean, layered memory that separates big ideas from specific experiments, with pruning when things get crowded.
- How it works:
- Split Macro Ideas (concepts) from Micro Hypotheses (specific tests).
- Classify new ideas: merge if similar; add if truly new.
- Cap how many hypotheses per idea; summarize when full.
- Cap how many active ideas; drop the least promising; archive failures to avoid repeats.
- Why it matters: Prevents context pollution so the LLM sees the best signals and keeps exploring diverse directions. 🍞 Anchor: Like organizing a binder with tabs (units), limiting pages per tab, writing a summary sheet, and storing old worksheets in a separate box so you don’t redo the same mistakes.
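A minimal sketch of this two-layer memory, assuming the agent tracks an error where lower is better. The class names, cap values, and the way the "least promising" idea is chosen are illustrative assumptions; the paper's actual data structures and prompts may differ.

```python
from dataclasses import dataclass, field

@dataclass
class Idea:                        # Macro Idea: a big concept (e.g., "add a sine term")
    name: str
    hypotheses: list = field(default_factory=list)  # Micro Hypotheses: specific tests + scores
    summary: str = ""              # distilled lessons once the hypothesis cap is reached
    best_score: float = float("inf")                # lower is better in this sketch

class IdeaRepo:
    def __init__(self, max_ideas=5, max_hypotheses=6):
        self.max_ideas = max_ideas
        self.max_hypotheses = max_hypotheses
        self.active = {}           # name -> Idea currently worth exploring
        self.archive = []          # dropped ideas, kept to avoid repeating known failures

    def record(self, idea_name, hypothesis, score, summarize):
        idea = self.active.setdefault(idea_name, Idea(idea_name))
        idea.hypotheses.append((hypothesis, score))
        idea.best_score = min(idea.best_score, score)
        # Hypothesis cap: when an idea is full, keep only a summary and its best result.
        if len(idea.hypotheses) >= self.max_hypotheses:
            idea.summary = summarize(idea.hypotheses)   # e.g., an LLM-written lesson
            idea.hypotheses = []
        # Idea cap: drop the least promising active idea, but archive it permanently.
        if len(self.active) > self.max_ideas:
            worst = max(self.active.values(), key=lambda i: i.best_score)
            self.archive.append(self.active.pop(worst.name))
```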
🥬 The Concept: Momentum-Based Backtracking (MBB)
- What it is: A trigger that rolls back to a healthier past state when improvement momentum gets too low.
- How it works:
- Define a target (like error → 0) and track the best score so far.
- Compute Relative Progress: how much of the remaining gap you just closed.
- Smooth it into Momentum with a moving average.
- If momentum falls below a threshold, jump back to an earlier checkpoint (sampled to favor earlier states) and continue.
- Why it matters: It’s a principled, state-aware escape from local minima—much better than resetting on a fixed schedule. 🍞 Anchor: Like hiking with a smartwatch: if your pace tanks on a bad trail, you hike back to the last fork that led to good progress.
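Here is a compact sketch of that trigger logic, assuming a lower-is-better score heading toward a target of zero. The decay factor, threshold, initial momentum, and the geometric bias toward earlier checkpoints are illustrative choices, not the paper's exact settings.

```python
import random

class MomentumTracker:
    def __init__(self, target=0.0, decay=0.9, threshold=0.02):
        self.target = target        # lower bound the score is pushing toward (error -> 0)
        self.decay = decay          # moving-average weight for the momentum
        self.threshold = threshold  # below this, the search counts as stalled
        self.best = None
        self.momentum = 1.0         # assumed optimistic start so early steps are not punished
        self.checkpoints = []       # snapshots of context we could revert to

    def update(self, score, checkpoint):
        if self.best is None:
            self.best = score
        gap_before = self.best - self.target
        improvement = max(self.best - score, 0.0)
        # Relative Progress: what fraction of the remaining gap this step just closed.
        rel_progress = improvement / gap_before if gap_before > 0 else 0.0
        self.best = min(self.best, score)
        # Momentum: exponentially smoothed Relative Progress.
        self.momentum = self.decay * self.momentum + (1 - self.decay) * rel_progress
        self.checkpoints.append(checkpoint)
        return self.momentum < self.threshold   # True => trigger a backtrack

    def sample_checkpoint(self, earlier_bias=0.7):
        # Favor earlier checkpoints: weights shrink geometrically with recency.
        weights = [earlier_bias ** i for i in range(len(self.checkpoints))]
        return random.choices(self.checkpoints, weights=weights, k=1)[0]
```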
🥬 The Concept: Collaborative Evolution Sampling (CE)
- What it is: A self-adaptive rule to pick either backtracking or crossover with the best teammate, based on a global progress score.
- How it works:
- Compute Absolute Progress for each island: how much gap it has closed since it began.
- If you’re lagging and another island is far ahead, prefer crossover.
- If you’re already the best (or everyone is equally stuck and low), prefer backtracking.
- If you and the best island are both high and similar, add a synergy bonus to crossover.
- Why it matters: It times teamwork well, speeding up the whole group instead of copying blindly. 🍞 Anchor: If your science fair group sees that Team B’s design is clearly better and compatible, you borrow; if no team is ahead, you rethink your plan instead.
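One way the action weights could be computed, assuming each island reports its Absolute Progress as a number between 0 and 1. The specific weight formulas, the "high"/"low"/"similar" cutoffs, and the synergy bonus value are assumptions for illustration, not the paper's exact policy.

```python
import random

def choose_action(my_progress, peer_progress,
                  high=0.7, low=0.3, similar=0.1, synergy_bonus=0.5):
    """Pick ('backtrack', None) or ('crossover', j) for a stalled island.
    Progress values are Absolute Progress in [0, 1]; peer_progress must be non-empty."""
    backtrack_w = 0.1                                   # small baseline chance of starting over
    crossover_w = {j: max(p - my_progress, 0.0)         # borrowing helps more the further ahead j is
                   for j, p in enumerate(peer_progress)}
    best_j = max(range(len(peer_progress)), key=lambda j: peer_progress[j])
    best_p = peer_progress[best_j]

    if best_p <= my_progress:
        backtrack_w += 1.0                              # already the leader: nobody to learn from
    elif abs(best_p - my_progress) < similar:
        if best_p < low:
            backtrack_w += 1.0                          # group-wide stagnation: explore afresh
        elif best_p > high:
            crossover_w[best_j] += synergy_bonus        # both strong and close: reward synergy

    actions = [("backtrack", None)] + [("crossover", j) for j in crossover_w]
    weights = [backtrack_w] + list(crossover_w.values())
    return random.choices(actions, weights=weights, k=1)[0]

# A lagging island (0.30) next to a strong peer (0.80) will usually pick crossover.
print(choose_action(0.30, [0.80, 0.25]))
```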
Why It Works (intuition, no equations):
- Relative Progress normalizes gains by how far you still have to go, so tiny late-stage wins count properly and early-stage big jumps don’t dominate forever.
- Momentum smooths noisy scores into a stable trend, reducing false alarms.
- Backtracking removes the harmful influence of bad recent context.
- Absolute Progress gives a fair, shared scoreboard across islands, so decisions to borrow or backtrack are globally sensible.
Building Blocks:
- Clean idea memory (HCM)
- Scale-aware progress signal (Relative Progress and Momentum)
- State-triggered escape (Backtracking)
- Fair team coordination (CE with Absolute Progress)
Together, these align the agent’s memory, movement, and teamwork around real, measured progress.
03 Methodology
High-level recipe: Input → HCM (organize/prune ideas) → Evaluate experiments → Update progress/momentum → If stalled, MBB triggers → CE picks backtrack or crossover → Output better candidates (repeat).
Step 1: Hierarchical Context Management (HCM)
- What happens:
- Split thinking into two layers: Macro Ideas (big concepts) and Micro Hypotheses (specific trials under each idea).
- Idea Generation: Brainstorm several ideas; classify each as new or a refinement of an existing idea.
- Idea Selection: Choose which idea to try next and specify a concrete hypothesis.
- Run the experiment and record the result under that idea’s history.
- Hypothesis cap: If an idea has too many hypotheses, summarize its key lessons and keep just the summary and best result.
- Idea cap: If there are too many active ideas, drop the least promising ones to force exploration of new directions.
- Permanent memory: Archive dropped ideas/hypotheses to avoid repeating known failures later.
- Why this step exists: Without HCM, the LLM’s context becomes noisy and biased, causing repeated weak ideas and poor exploration.
- Example with data: Suppose we’re doing Symbolic Regression. Ideas might be: (A) Add a sine term to capture periodic motion; (B) Use polynomial damping terms; (C) Try exponential decay. Under (B), you tried 6 different coefficient sets; you hit the hypothesis cap and summarize: “Higher cubic term helps early but plateaus; quadratic alone underfits.” Now the agent keeps that distilled lesson instead of pages of similar failures.
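As a tiny, self-contained illustration of the hypothesis cap in this example (the coefficient sets and scores are made up, and the summary string stands in for the LLM-written lesson):

```python
# Micro hypotheses recorded under Macro Idea (B): polynomial damping terms.
trials = [("b3=0.1", 0.42), ("b3=0.3", 0.38), ("b3=0.5", 0.37),
          ("b3=0.7", 0.37), ("b3=0.9", 0.38), ("b3=1.1", 0.39)]
CAP = 6
if len(trials) >= CAP:
    best = min(trials, key=lambda t: t[1])   # keep the best result (here 0.37)
    # In the real system an LLM would distill this lesson; here it is the example's text.
    summary = "Higher cubic term helps early but plateaus; quadratic alone underfits."
    idea_B = {"summary": summary, "best": best}   # the distilled lesson replaces the raw trials
```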
Step 2: Momentum-Based Backtracking (MBB)
- What happens:
- Define a target lower bound (like error target r = 0).
- Track the best score so far (s_t) and compute the remaining gap to target.
- When you improve, compute Relative Progress = fraction of the gap closed by the new best.
- Smooth Relative Progress with a moving average to form Momentum m_t.
- If m_t drops below a threshold, trigger backtracking: revert context to an earlier checkpoint (earlier steps are more likely), clearing the recent, unhelpful influence.
- Continue exploration from that earlier, healthier state.
- Why this step exists: Fixed resets ignore real-time progress; momentum-triggered backtracking is surgical, escaping exactly when the search truly stalls.
- Example with data: Best error goes 0.50 → 0.40 (relative progress 0.20), then tiny nudges 0.40 → 0.398 → 0.397 (relative progress ~0.005 then ~0.003). The moving momentum falls below the threshold. The agent jumps back to the checkpoint at 0.45 (a broader frontier), and now tries a new macro idea with a sine term, quickly reaching 0.35.
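Plugging those numbers into the definitions above shows why the momentum slides; the decay factor and the momentum's starting value of 1.0 are assumptions for this standalone illustration.

```python
best, target = 0.50, 0.0
momentum, decay = 1.0, 0.9                   # assumed smoothing factor and starting momentum
for new_best in [0.40, 0.398, 0.397]:
    gap = best - target
    rel_progress = (best - new_best) / gap   # fraction of the remaining gap just closed
    momentum = decay * momentum + (1 - decay) * rel_progress
    best = new_best
    print(f"best={best:.3f}  rel_progress={rel_progress:.4f}  momentum={momentum:.3f}")
# rel_progress comes out as 0.2000, then ~0.0050, then ~0.0025: the tiny nudges barely feed
# the moving average, so momentum keeps sliding toward the backtracking threshold.
```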
Step 3: Collaborative Evolution Sampling (CE)
- What happens:
- Run several islands (parallel searches). For each, compute Absolute Progress: the fraction of the overall gap each island has closed since it started.
- When MBB flags an island as stalled, it must choose an action: backtrack or crossover with island j.
- Assign weights to actions:
- Crossover weight for island j grows if j’s Absolute Progress is higher than i’s (knowledge transfer likely helps).
- Backtrack weight grows if i is already the leader (nobody to learn from) or if i and the best j are both low and similar (group-wide stagnation → explore afresh).
- If i and the best j are both high and similar, add a synergy bonus to crossing over with the best j.
- Sample action by these weights: higher weight → higher chance.
- Optional freeze at the very start to let momentum form before any backtracking or crossover.
- Why this step exists: Static, timed crossovers miss the right moments. CE coordinates teamwork based on real, comparable progress across islands.
- Example with data: Island A closed 80% of its gap, Island B closed 30%. B stalls—CE says B should likely crossover with A. Later, both A and C are ~90% and similar; when C stalls, CE prefers crossover with A (synergy) instead of backtracking.
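The first case (island B at 0.30 borrowing from island A at 0.80) matches the example call at the end of the earlier weighting sketch. The second case, where the synergy bonus keeps two strong islands collaborating, looks like this in numbers, with the same assumed weights as before:

```python
import random

my_progress = 0.90                         # island C, stalled
peer_progress = {"A": 0.92, "B": 0.35}     # Absolute Progress of the other islands
backtrack_w = 0.1                          # small baseline chance of starting over
crossover_w = {k: max(p - my_progress, 0.0) for k, p in peer_progress.items()}
best = max(peer_progress, key=peer_progress.get)             # "A"
if abs(peer_progress[best] - my_progress) < 0.1 and peer_progress[best] > 0.7:
    crossover_w[best] += 0.5               # both high and similar: synergy bonus

actions = ["backtrack"] + [f"crossover with {k}" for k in crossover_w]
weights = [backtrack_w] + list(crossover_w.values())
print(random.choices(actions, weights=weights, k=1)[0])
# Weights here: backtrack 0.1, crossover with A 0.02 + 0.5, crossover with B 0.0,
# so crossover with A is by far the most likely action.
```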
The Secret Sauce:
- Scale-aware progress signals (Relative Progress and Momentum) make decisions fair at any stage—early or late.
- Clean, layered memory focuses the LLM’s reasoning on signal, not noise.
- Unified policy (CE) treats backtracking and crossover as two sides of the same coin—pick the one that most likely boosts global progress.
Putting it all together (concrete walk-through):
- Input: Task spec + evaluator + initial candidate(s).
- HCM builds an idea repo with macro concepts (e.g., “add Nesterov momentum”) and micro tests (specific hyperparameters).
- Run a batch of experiments; record results; prune/summarize as caps are hit; archive failures.
- Update Relative Progress and Momentum on each improvement.
- If an island’s momentum dips below threshold, trigger MBB; then CE decides: revert or borrow from a better island.
- Output: steadily improving candidates; repeat until budget is used or target is met.
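A high-level sketch of how one island's loop could wire these pieces together. `IdeaRepo`, `MomentumTracker`, and the stall decision refer to the earlier sketches; `propose`, `run_experiment`, `snapshot_context`, `restore_context`, `decide`, and `crossover_with` are hypothetical hooks supplied by the task harness, not interfaces from the paper.

```python
def pacevolve_island(repo, tracker, propose, run_experiment,
                     snapshot_context, restore_context, decide, crossover_with,
                     budget=1000):
    """One island's outer loop: HCM keeps context clean, MBB watches momentum,
    and the CE-style `decide` hook picks between reverting and borrowing."""
    for _ in range(budget):
        idea, hypothesis = propose(repo)                 # HCM: organized context in, next test out
        score = run_experiment(hypothesis)               # evaluate the candidate
        repo.record(idea, hypothesis, score,
                    summarize=lambda hyps: f"best of {len(hyps)} trials")  # placeholder summarizer
        if tracker.update(score, snapshot_context()):    # MBB: momentum fell below the threshold
            action, partner = decide()                   # CE: backtrack or crossover, and with whom
            if action == "backtrack":
                restore_context(tracker.sample_checkpoint())   # favor earlier, healthier states
            else:
                crossover_with(partner)                  # borrow from a stronger island
    return repo
```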
04 Experiments & Results
🍞 Hook: Think of a science fair where teams test inventions. You don’t just ask, “Who won?” You ask, “Against whom, on what challenges, and by how much?”
🥬 The Concept: The Tests (What and Why)
- What it is:
- Symbolic Regression (LLM-SR): Recover hidden equations behind motion data; metric: Log10 NMSE (lower is better).
- KernelBench: Write super-fast GPU kernels; metric: runtime in microseconds (lower is better), reported vs PyTorch and leaderboard entries.
- Modded NanoGPT: Speed up training to reach a fixed validation loss; metric: time-to-target on 8×H100 GPUs (lower is better).
- How it works:
- Run up to 1000 iterations per task instance (LLM-SR and kernels), compare against strong baselines, use the same LLM family.
- Why it matters: These cover scientific discovery, code-generation for performance, and complex full-stack engineering. 🍞 Anchor: Like testing bikes on hills (physics), flat tracks (speed), and city streets (real-world systems) to prove overall strength.
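For the symbolic-regression scores, a common way to compute this kind of metric is the mean squared error normalized by the variance of the targets, reported on a log10 scale; treating that as the benchmark's exact definition is an assumption of this sketch.

```python
import numpy as np

def log10_nmse(y_true, y_pred, eps=1e-12):
    # Normalized MSE: squared error relative to the spread of the targets, so a
    # score of -8 means the residual error is roughly 10^-8 of the signal's variance.
    nmse = np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2) / (np.var(y_true) + eps)
    return float(np.log10(nmse + eps))
```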
🥬 The Concept: The Competition (Baselines)
- What it is: Prior strong methods—uDSR (symbolic regression), LLM-SR baseline, OpenEvolve, CodeEvolve, ShinkaEvolve.
- How it works: Each baseline runs under its recommended setup; PACEvolve uses single- or multi-island variants; all use comparable LLMs (e.g., Gemini 2.5 Pro).
- Why it matters: Beating multiple mature systems shows robustness, not just a lucky case. 🍞 Anchor: Winning one game is nice; winning a whole tournament against tough teams proves your playbook works.
Scoreboard with Context:
- LLM-SR (Nonlinear Oscillators):
- PACEvolve-Single reached best Log10 NMSE of -8.23; PACEvolve-Multi reached -8.24 and improved P75 and mean, discovering three solutions below -8.
- Compared to baselines (best results around -7.26 for CodeEvolve, -7.11 for OpenEvolve, -6.35 for ShinkaEvolve), PACEvolve’s best is about an order of magnitude tighter in error reduction (on a log scale).
- Think of -8.24 like scoring an A+ while others are around A- to B+.
- KernelBench (16 kernels):
- PACEvolve-Single and PACEvolve-Multi beat prior best kernels in most cases.
- Big wins include LayerNorm (17.38x faster than PyTorch) and strong gains on ConvTranspose2D and others; near-parity on MatMul (hard because vendor libraries are extremely optimized).
- Multi-island improved 13/16 kernels over single-island, showing CE’s teamwork benefits.
- Modded NanoGPT (already heavily optimized v40):
- Still found measurable training speed-ups (e.g., 142.8s → 140.2s to reach target loss) by sharding/preloading data, U-shaped skip initialization, better hyperparameters (e.g., softcapping constants, optimizer betas), and a smarter context window schedule.
- When a field has already shaved time from ~2700s to ~142s, further two-second wins are like squeezing extra drops from a nearly dry sponge—hard but real value.
Surprising Findings:
- Backtracking reduces worst-case outcomes: momentum-triggered resets eliminated runs that got stuck the entire time.
- CE amplified good news: once any island found a promising pocket, CE spread that progress effectively without derailing leaders.
- Clean memory mattered most early: HCM lifted average and median runs by removing noisy context and encouraging fresh directions.
- Even on super-optimized tasks (MatMul, NanoGPT v40), PACEvolve could match or eke out more gains through nuanced, system-level changes.
Reliability (Variance Story):
- Boxplots (ablations) show a steady improvement story from vanilla (append-only) → +HCM → +MBB → +CE. The combination not only boosts best scores but tightens spread—fewer dud runs and more consistent wins.
05 Discussion & Limitations
🍞 Hook: Even great playbooks have limits—knowing when not to use them is part of being wise.
Limitations (specific):
- Expensive evaluations: Some tasks (e.g., training large models) cost a lot per try; even with smart pruning/backtracking, budgets can be tight.
- Sensitivity to scaffolding: Thresholds (momentum), caps (ideas/hypotheses), and prompt styles can affect behavior; tuning may be needed across domains.
- Noisy or ultra-sparse rewards: If signals are extremely erratic or nearly always zero, momentum and pruning have little to anchor to.
- Safety/Hallucination: LLMs can propose risky code changes; sandboxes, tests, and guardrails are essential.
- Not a proof of global optimality: It’s a strong heuristic controller, not a guarantee of the best possible solution.
Required Resources:
- A capable LLM (e.g., Gemini 2.5 Pro or similar) and sometimes a second, cheaper model for classification/summarization.
- A reliable evaluator (accuracy metrics, latency timers with fixed GPU frequency, anti-reward-hacking checks).
- Orchestration to manage idea pools, archives, backtracking checkpoints, and multi-island communication.
- Compute (GPUs/CPUs) matching task demands, especially for code and model training tasks.
When NOT to Use:
- One-shot tasks where no iterative improvement is possible or evaluation is ill-defined.
- Domains with rewards too noisy/sparse to estimate progress (momentum) meaningfully.
- Settings where storing and reusing history is prohibited or infeasible (no memory allowed).
- Safety-critical code paths without strong testing/sandboxing.
Open Questions:
- Can we auto-tune momentum thresholds and decay to each task on the fly?
- How to blend curiosity-driven exploration with progress-aware control for even better diversity?
- Can CE incorporate uncertainty or confidence in evaluations to weigh crossover more safely?
- What are the best human-in-the-loop checkpoints for high-stakes changes (e.g., systems code)?
- Can we formalize regret bounds or convergence properties for PACEvolve-style controllers across classes of tasks?
06 Conclusion & Future Work
Three-sentence summary: PACEvolve is a progress-aware framework that keeps AI evolution clean, detects stalls with momentum, and coordinates teamwork adaptively across parallel searches. By combining Hierarchical Context Management, Momentum-Based Backtracking, and Collaborative Evolution Sampling, it delivers consistent, long-horizon self-improvement. It achieves state-of-the-art or better results on symbolic equations, GPU kernels, and complex model training pipelines.
Main Achievement: Turning ad hoc, unstable LLM-in-the-loop evolution into a principled, progress-governed process that reduces variance, escapes local minima, and times collaboration well.
Future Directions: Auto-tune momentum and sampling thresholds; integrate curiosity or uncertainty estimates; expand to more domains (robotics, compilers, materials); strengthen safety/testing harnesses; explore theoretical guarantees.
Why Remember This: It reframes how to run LLM-driven search—treat progress as the compass, keep memory tidy, and collaborate when it truly helps. That simple shift unlocks steadier gains, fewer dead-ends, and real-world wins even in already-optimized systems.
Practical Applications
- Automatic equation discovery in physics or biology using cleaner memory and stall-aware pivots.
- Generating faster custom GPU kernels for deep learning workloads and scientific computing.
- Speeding up model training pipelines by optimizing data loading, scheduling, and hyperparameters.
- Tuning complex systems (databases, compilers) with progress-aware resets and adaptive sharing.
- Designing better optimizers or architectures via structured idea pools and pruning.
- Improving robotics policies by escaping local minima and sharing successful maneuvers across agents.
- Game or simulation strategy search with consistent, long-horizon self-improvement.
- Industrial process optimization where evaluations are costly and progress must be tracked carefully.
- Auto-ML pipelines that keep exploration diverse while avoiding repeated failed configurations.
- Research assistants that maintain high-signal notes and know when to borrow or backtrack.