
GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning

Intermediate
Lakshya A Agrawal, Shangyin Tan, Dilara Soylu et al. · 7/25/2025
arXiv

Key Summary

  • GEPA is a new way to improve AI prompts by letting the AI read its own work, reflect in plain language on what went wrong, and then rewrite its instructions.
  • Instead of changing the model’s weights with reinforcement learning (which needs tons of tries), GEPA changes only the prompts and learns much faster from fewer tries (rollouts).
  • GEPA keeps a diverse set of “best so far” prompts using a Pareto frontier so it doesn’t get stuck on one idea; it mixes and matches good ideas across tasks.
  • On six benchmarks (like HotpotQA, IFBench, HoVer, PUPA, AIME-2025, LiveBench-Math), GEPA beat GRPO by up to 20% while using up to 35x fewer rollouts.
  • GEPA also beat the leading prompt optimizer MIPROv2 by over 10% and made prompts that are often much shorter (cheaper to run) but more effective.
  • GEPA’s reflection uses natural language feedback (including error messages) as rich clues, not just a final score, to decide how to edit prompts.
  • A “Merge” step (system-aware crossover) can combine the best modules from different lineages to get extra gains when the budget and timing are right.
  • Prompts tuned on a smaller open model (Qwen3-8B) transferred well to a closed model (GPT-4.1 Mini), still outperforming other optimizers.
  • GEPA can also be used as an inference-time search strategy (e.g., for code kernels), turning failure messages into targeted improvements.
  • Overall, the paper shows that learning from language (reflection) can be more sample-efficient and practical than standard RL for many real AI systems.

Why This Research Matters

GEPA lowers the cost and time needed to improve real AI systems by learning from language, not just from numeric rewards. That means teams can ship better assistants, search tools, and coding agents using fewer API calls and less engineering. Because GEPA builds shorter, rule-focused prompts, runtime costs and latency also drop—good for both wallets and user experience. The Pareto approach keeps diverse strategies alive, so systems resist getting stuck and generalize better. GEPA’s prompts even transfer across different models, helping organizations switch providers or scale without starting over. Finally, by turning error messages into lessons, GEPA strengthens reliability in settings with strict rules, like privacy, safety, or formatting.

Detailed Explanation


01Background & Problem Definition

Let’s set the stage by walking through how people made AI systems work before GEPA, why that was hard, and why this new idea matters.

🍞 Top Bread (Hook): You know how if you’re trying to get better at basketball, you don’t just look at your final score—you watch the replay, see which passes failed, and then change your strategy? That replay is way more helpful than only seeing “You lost by 3.”

🥬 Filling (The Actual Concept):

  • What it is: Large Language Models (LLMs) are powerful “text machines” that follow written instructions (prompts) to solve tasks. Many real systems use several LLM steps and tools together—this is called a compound AI system.
  • How it used to work: To improve performance, people often used Reinforcement Learning (RL). The model tried a task (a rollout), got a single score at the end, and then its weights were nudged to do better next time.
  • Why that was hard: RL usually needs thousands to hundreds of thousands of rollouts. That’s slow, costly, and many apps can’t afford it, especially when tool calls (like running code or querying the web) are expensive.

🍞 Bottom Bread (Anchor): Imagine a homework helper bot that needs 24,000 tries to learn how to follow strict instructions—that’s too expensive. Wouldn’t it be better if it could just read comments like “You ignored the word count” and fix its own instructions quickly?

— New Concept 1 — 🍞 Hook: Imagine building a LEGO robot with several moving parts. If the robot trips, you want to know which part failed—the leg, the arm, or the sensor—so you can fix the right piece, not rebuild the whole thing.

🥬 The Concept (Compound AI System):

  • What it is: A compound AI system is a team of modules (LLM calls and tools) wired together with instructions for each step.
  • How it works: Inputs go into module A (say, retrieve info), then module B (summarize), then module C (answer), and so on. Each module follows its own prompt.
  • Why it matters: If the final answer is wrong, you need to know which module to improve. Otherwise, you waste effort changing the wrong thing.

🍞 Anchor: In HotpotQA, a system first finds documents (module A), then reads them (module B), then answers (module C). If it misses a document, improving the answer-writer won’t help—you must fix the retriever’s prompt.
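The module idea above can be sketched in a few lines. This is a minimal illustration, not the paper’s implementation: the `llm` function is a hypothetical stand-in for a real model API, and the module names mirror the HotpotQA-style pipeline just described.

```python
def llm(prompt: str, inp: str) -> str:
    """Placeholder LLM call; a real system would hit a model API."""
    return f"[{prompt!r} applied to {inp!r}]"

class Module:
    """One step of a compound AI system, with its own editable prompt."""
    def __init__(self, name: str, prompt: str):
        self.name = name
        self.prompt = prompt  # GEPA edits this text; model weights stay frozen

    def __call__(self, inp: str) -> str:
        return llm(self.prompt, inp)

# A 3-module pipeline: retrieve -> summarize -> answer.
pipeline = [
    Module("retrieve", "Given the question, produce a search query."),
    Module("summarize", "Summarize the retrieved documents."),
    Module("answer", "Answer using the summary."),
]

def run(pipeline, question: str) -> str:
    out = question
    for module in pipeline:
        out = module(out)
    return out
```

Because each prompt lives on its own module, an optimizer can target exactly the step that failed — fix the retriever’s prompt without touching the answer-writer’s.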

— New Concept 2 — 🍞 Hook: You know how a coach’s comments (“your footwork was late on defense”) teach you more than just the final score? Textual comments are rich clues.

🥬 The Concept (Rollouts and Natural Language Feedback):

  • What it is: A rollout is one full attempt to solve a task; beyond the final score, the system and environment often produce text traces (reasoning steps, tool errors) that explain what happened.
  • How it works: The AI runs, produces reasoning and calls tools; the environment may return messages like “compiler error: missing semicolon.”
  • Why it matters: These text traces are gold. They point to exactly what went wrong and where, so the AI can fix the right instruction.

🍞 Anchor: When generating code, “error: undefined variable x” tells you a precise fix—much better than just “0/1 passed.”

— New Concept 3 — 🍞 Hook: Think of RL like practicing free throws by counting how many you make, but without watching the video. It helps, but you miss many lessons.

🥬 The Concept (Reinforcement Learning with Verifiable Rewards, e.g., GRPO):

  • What it is: RLVR gives a score you can trust (like tests or unit checks) and methods like GRPO adjust model weights to raise that score.
  • How it works: Try many times, get end scores, compute updates to the weights.
  • Why it matters: It works—but it can be sample-hungry and misses the rich lessons inside the language traces.

🍞 Anchor: If a math bot gets only “correct/incorrect” without seeing where its reasoning went off-track, it needs way more tries to improve.

— The Problem & Failed Attempts — Before GEPA, lots of RL training was used to adapt models to new tasks. It often worked but came with heavy rollout budgets and engineering effort. Other prompt optimizers tried random edits, greedy search, or optimizing few-shot examples, but they could get stuck on local tricks, make prompts very long and costly, or overfit to validation data.

— The Gap — What was missing was an optimizer that:

  • Fully reads and uses the text traces (reasoning, tool errors, rubrics),
  • Can aim updates at the exact module that needs them,
  • Keeps multiple promising strategies alive (not just one “best so far”),
  • And turns a few rollouts into big improvements.

— Real Stakes —

  • Faster iteration: Teams can improve quality without massive training runs.
  • Lower cost: Shorter, better instructions reduce API tokens and latency.
  • Better reliability: Using evaluator messages (like constraint checks) helps systems obey strict formatting, privacy rules, or safety constraints.
  • Practical reach: Even small orgs with limited budgets can optimize complex agents by evolving their prompts.

02Core Idea

Here’s the heart of the paper—what GEPA is and why it works so well.

— New Concept 4 — 🍞 Hook: Imagine you keep a diary after each game: what worked, what didn’t, and the rule you learned (“Don’t pass into traffic”). Next game, you apply the new rule.

🥬 The Concept (Reflective Prompt Evolution):

  • What it is: The AI reads its own traces and feedback in plain language and rewrites its instructions (prompts) to encode high-level rules it just learned.
  • How it works:
    1. Run the system on a small batch of tasks and collect traces + feedback.
    2. Pick which module’s prompt to improve.
    3. Ask a reflection LLM to write a better instruction that addresses the observed mistakes.
    4. Test the new instruction on the same mini-batch; if it’s better, keep it.
  • Why it matters: This mines maximum learning from each rollout, upgrading the “playbook” (prompts) instead of expensive weight finetuning.

🍞 Anchor: In PUPA (privacy), GEPA noticed leaks of names; it added rules like “abstract PII and explain your abstraction,” boosting the privacy score quickly.
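The four-step loop above can be sketched as code. This is a toy, assuming hypothetical `run_fn` and `score_fn` callbacks; the `reflect` function here just appends a distilled rule, where the real GEPA asks a reflection LLM to rewrite the instruction.

```python
def reflect(instruction: str, failure_traces: list[str]) -> str:
    """Stand-in for the reflection LLM: distill a rule from observed
    failures and fold it into the instruction (hypothetical logic)."""
    lesson = failure_traces[0] if failure_traces else ""
    return instruction + f" Rule learned: avoid '{lesson}'."

def reflective_mutation(instruction, minibatch, run_fn, score_fn):
    # Step 1: run the system on a small batch, collect traces + feedback.
    traces = [run_fn(instruction, task) for task in minibatch]
    failures = [t for t in traces if score_fn(t) < 1.0]
    # Steps 2-3: propose a better instruction that addresses the mistakes.
    new_instruction = reflect(instruction, failures)
    # Step 4: keep the edit only if it scores at least as well on the
    # same mini-batch; otherwise fall back to the original.
    old = sum(score_fn(run_fn(instruction, t)) for t in minibatch)
    new = sum(score_fn(run_fn(new_instruction, t)) for t in minibatch)
    return new_instruction if new >= old else instruction
```

The key design choice is the cheap mini-batch gate: risky edits are vetted on a handful of tasks before they cost any validation budget.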

— New Concept 5 — 🍞 Hook: Picture a science fair with many categories—best physics, best biology, best chemistry. No single project wins them all, but many are category leaders.

🥬 The Concept (Pareto Frontier Candidate Selection):

  • What it is: Keep every candidate prompt that’s best on at least one example; don’t throw away diverse winners.
  • How it works:
    1. For each training instance, mark which candidate achieved the top score.
    2. Collect all candidates that are “top for something.”
    3. Sample from these winners with higher chance for those that win more often.
  • Why it matters: Avoids getting stuck on one local trick; explores multiple promising strategies in parallel.

🍞 Anchor: One prompt might nail questions about albums; another excels at dates. GEPA keeps both until it learns a combined strategy.
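The selection rule above is concrete enough to sketch. A minimal version, assuming a `scores` table mapping each candidate to its per-instance validation scores (names are illustrative, not the paper’s API):

```python
import random

def pareto_candidates(scores):
    """scores[c][i] = score of candidate c on validation instance i.
    Keep every candidate that is best (or tied for best) on at least
    one instance, and count how many instances it leads."""
    n_instances = len(next(iter(scores.values())))
    wins = {c: 0 for c in scores}
    for i in range(n_instances):
        best = max(scores[c][i] for c in scores)
        for c in scores:
            if scores[c][i] == best:
                wins[c] += 1
    return {c: w for c, w in wins.items() if w > 0}

def sample_candidate(scores, rng=random):
    """Sample a frontier candidate; more instance-wins -> higher chance."""
    wins = pareto_candidates(scores)
    candidates, weights = zip(*wins.items())
    return rng.choices(candidates, weights=weights, k=1)[0]
```

A candidate that leads on 10 instances is sampled more often than one leading on 3, but the underdog is never discarded — that is what keeps diverse strategies alive.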

— New Concept 6 — 🍞 Hook: Think of evolution: creatures mutate, and sometimes two strong families mix their strengths.

🥬 The Concept (Genetic Mutation and System-Aware Crossover, aka Merge):

  • What it is: Mutation = reflectively edit one module’s prompt; Crossover (Merge) = combine complementary modules from different successful lineages.
  • How it works:
    1. Mutation proposes a new instruction for a chosen module based on observed errors.
    2. Merge, when conditions are right, picks the best version of each module from different candidates sharing an ancestor.
  • Why it matters: Mutation makes targeted improvements; Merge fuses separate strengths for bigger jumps in quality.

🍞 Anchor: If one lineage perfected the retriever and another perfected the answer-writer, Merge can assemble a system with both strengths.
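The Merge step can be sketched as a per-module choice between two lineages. This is a simplification: the real system uses shared ancestry and validation scores to decide which lineage’s module wins, whereas here that decision arrives precomputed in `winner_by_module` (a hypothetical input).

```python
def system_aware_merge(candidate_a, candidate_b, winner_by_module):
    """candidate_a / candidate_b: dict of module name -> prompt text,
    from two lineages sharing an ancestor. winner_by_module maps each
    module to 'a' or 'b', whichever lineage's version is stronger."""
    child = {}
    for module in candidate_a:
        if winner_by_module[module] == "a":
            child[module] = candidate_a[module]
        else:
            child[module] = candidate_b[module]
    return child
```

If lineage A perfected the retriever and lineage B perfected the answer-writer, the child inherits both strengths in one step — a bigger jump than any single mutation.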

— New Concept 7 — 🍞 Hook: When you debug code, the error message tells you exactly where to look.

🥬 The Concept (Evaluation Traces as Learning Signals):

  • What it is: The evaluator’s messages (e.g., compiler errors, failed rubric checks) are fed back as text for reflection.
  • How it works: GEPA’s feedback function collects both the score and the detailed text diagnostics; reflection uses them to propose precise prompt edits.
  • Why it matters: You’re no longer blind—each failure comes with a hint.

🍞 Anchor: In code-gen, “index out of bounds on line 42” becomes a concrete fix in the next prompt version (e.g., “validate array lengths before access”).
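The feedback function’s shape — a score plus text diagnostics — can be made concrete. A toy example for a code-gen task (the structure is an assumption for illustration; the paper defines the feedback function per task):

```python
from dataclasses import dataclass

@dataclass
class Feedback:
    score: float       # the bare metric, e.g., fraction of tests passed
    diagnostics: str   # the text the reflection step actually learns from

def feedback_fn(program_output: str) -> Feedback:
    """Toy evaluator: the text message carries far more signal
    than the numeric score alone."""
    if "out of bounds" in program_output:
        return Feedback(0.0, "index out of bounds on line 42: "
                             "validate array lengths before access")
    return Feedback(1.0, "all checks passed")
```

The reflection LLM sees `diagnostics`, not just `score`, so a 0.0 arrives with a precise hint about what to fix.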

— The Aha! Moment (One Sentence) — If you let an AI read and reason about its own language traces and evaluator messages, it can rapidly rewrite its instructions to capture general rules, and—by exploring a Pareto set of diverse winners—outperform RL with far fewer tries.

— Multiple Analogies —

  1. Coach & Playbook: Instead of making a player stronger (weight updates), rewrite the playbook using game footage and referee notes.
  2. Cooking: Taste the soup (trace), read the critic’s review (feedback), change the recipe step, and keep the best variations across dishes.
  3. Study Group: Different students master different topics; keep notes from all top students, then compile the best combined guide.

— Before vs After —

  • Before: Big RL runs, slow learning from sparse end scores; long, example-heavy prompts; risk of local optima.
  • After: Fast learning from rich text feedback; shorter, rule-focused prompts; diverse exploration via Pareto; occasional Merge boosts.

— Why It Works (Intuition) —

  • Language is a high-bandwidth teacher. Traces + feedback carry detailed, actionable lessons.
  • Prompts encode behavior directly; editing them can enact big, immediate changes.
  • Pareto sampling protects diversity, reducing premature convergence.

— Building Blocks —

  • Compound system with module-level prompts.
  • Feedback function that returns both scores and text feedback per rollout.
  • Reflective mutation to turn mistakes into better instructions.
  • Pareto-based selection to preserve diverse winners.
  • Optional system-aware Merge to fuse complementary strengths.

03Methodology

At a high level: Input tasks → Run current system on a small batch and collect traces + feedback → Reflect to propose a prompt edit for one module → Test quickly on the same mini-batch → If improved, keep and evaluate on validation (Pareto set) → Repeat until budget ends → Return the top candidate.

— Inputs —

  • A compound AI system (modules with prompts; model weights are frozen).
  • A training dataset split into: feedback set (for learning signals) and Pareto validation set (for selection).
  • An evaluation metric (e.g., exact match, pass rate) and a feedback function that also returns text diagnostics.
  • A rollout budget.

— Step-by-Step Recipe —

  1. Initialize the Candidate Pool

    • What happens: Start with the base system as the only candidate.
    • Why it exists: Provides a baseline to improve from and a parent for new variants.
    • Example: For HotpotQA, the base second-hop retriever prompt is simple: “Given question and summary_1, produce query.”
  2. Score on the Pareto Validation Set

    • What happens: Evaluate the base system on a held-out validation set to track which candidate is best on which instance.
    • Why it exists: Builds the Pareto map—needed to keep diverse winners.
    • Example: Instance A might favor candidate 0 (base), instance B might favor a future candidate 3.
  3. Select a Candidate to Evolve (Pareto-Based)

    • What happens: From all candidates that are “best on at least one instance,” sample one—more wins means higher chance.
    • Why it exists: Prevents tunnel vision; keeps exploring multiple promising ideas.
    • Example: If candidate 2 is best on 10 cases and candidate 5 is best on 3, candidate 2 is more likely to be chosen, but candidate 5 still has a shot.
  4. Choose a Module to Edit (e.g., round-robin)

    • What happens: Pick which module’s prompt to mutate next.
    • Why it exists: Focuses the update on where it’s needed; spreads attention across the system over time.
    • Example: In a 3-module pipeline (retrieve → summarize → answer), iteration 1 edits retrieval, iteration 2 edits summarization, etc.
  5. Run a Small Mini-Batch on the Feedback Set

    • What happens: Execute the chosen candidate on a small number of training tasks; collect reasoning, tool outputs, and evaluator text.
    • Why it exists: Cheap, fast feedback loop to test risky changes safely.
    • Example: For IFBench, collect which output constraints were satisfied or failed for three prompts.
  6. Reflective Prompt Mutation

    • What happens: Provide the reflection LLM with the current instruction, the mini-batch traces, and feedback text; ask it to write a new, clearer rule-focused instruction.
    • Why it exists: Converts concrete failures into general, reusable rules inside the prompt.
    • Example: In PUPA, add: “Abstract personally identifiable info (PII) and explain your abstraction; never include names or IDs; maintain task utility.”
  7. Quick Mini-Batch Re-Evaluation

    • What happens: Test the edited prompt on the same mini-batch; if it’s better, keep it as a new candidate and record its parent.
    • Why it exists: Gatekeeping—only keep promising edits; avoid wasting validation budget on weaker versions.
    • Example: If accuracy rises on that mini-batch, we promote the new candidate to the main pool.
  8. Pareto Validation Update

    • What happens: Evaluate the kept candidate on the full Pareto validation set; update which instances it now leads.
    • Why it exists: Refreshes the Pareto frontier so future selections reflect the latest winners.
    • Example: The new candidate might now be the best on instances A, C, and F, joining the Pareto set.
  9. (Optional) System-Aware Merge

    • What happens: When two candidates with a shared ancestor improved different modules, and both beat the ancestor overall, fuse their best modules into a single child.
    • Why it exists: Harvest complementary strengths—bigger jumps with fewer tries.
    • Example: Merge a candidate with the best retriever prompt and another with the best answer-writer prompt.
  10. Repeat Until Budget Ends; Return the Best by Validation Performance

  • What happens: Keep looping through selection, mutation, quick test, and Pareto updates.
  • Why it exists: Controlled exploration that spends most budget on validation to choose robust winners.
  • Example: After a few hundred rollouts, return the top-scoring candidate.
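The ten steps above can be tied together in one high-level loop. This is a skeleton under stated assumptions: each argument is a callback standing in for a step already described (Pareto-based selection, module choice, reflective mutation, mini-batch gating, Pareto bookkeeping), with names invented for illustration.

```python
def gepa_loop(base_candidate, budget, select, choose_module, mutate,
              quick_eval, pareto_update):
    """Skeleton of the GEPA optimization loop; callbacks are stand-ins."""
    pool = [base_candidate]
    pareto_update(base_candidate)                    # step 2: score base
    while budget > 0:
        parent = select(pool)                        # step 3: Pareto-based
        module = choose_module(parent)               # step 4: e.g., round-robin
        child = mutate(parent, module)               # steps 5-6: reflect + edit
        budget -= 1
        if quick_eval(child) >= quick_eval(parent):  # step 7: mini-batch gate
            pool.append(child)
            pareto_update(child)                     # step 8: refresh frontier
    return max(pool, key=quick_eval)                 # step 10: best candidate
```

A trivial instantiation (candidates as integers, mutation as `+1`) shows the control flow: each accepted child joins the pool, and the best by evaluation is returned when the budget runs out.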

— The Secret Sauce —

  • Rich Text Feedback: Uses evaluator comments and tool errors, not just a final score, so each failure is a lesson.
  • Prompt-Level Edits: Changes behavior directly and immediately, with frozen weights.
  • Pareto Diversity: Keeps multiple “winners” alive, escaping local optima.
  • Targeted Merge: Smartly combines strengths when lineages are complementary.

— Concrete In-Action Examples —

  • HotpotQA Second Hop: GEPA learned rules like “don’t rephrase the first-hop question; target the broader or related entity hinted by summary_1,” sharply improving retrieval.
  • PUPA Privacy: GEPA evolved strict, stepwise PII abstraction and required an explanation of privacy transformations, boosting score without harming utility.
  • Code Kernels: GEPA turned compiler/runtime errors into specific prompt rules (“check tensor shapes before indexing”), lifting kernel efficiency well beyond naive sequential refinement.

04Experiments & Results

— The Test — The authors tested GEPA on six benchmarks that stress different skills: multi-hop retrieval and reasoning (HotpotQA, HoVer), instruction-following under strict constraints (IFBench), privacy-aware delegation (PUPA), and math problem solving (AIME-2025, LiveBench-Math). They used both an open model (Qwen3-8B) and a closed model (GPT-4.1 Mini). They compared GEPA to strong baselines: GRPO (RL), MIPROv2 (prompt optimizer), TextGrad, and Trace (OptoPrime).

— The Competition —

  • GRPO: Reinforcement learning with verifiable rewards, typically run with very large rollout counts (e.g., 24,000).
  • MIPROv2: State-of-the-art joint instruction + few-shot optimizer using Bayesian search.
  • Trace/TextGrad: Use execution traces and language feedback differently; good modern baselines.

— The Scoreboard (with Context) —

  • Against GRPO (Qwen3-8B): GEPA outperformed by up to about 20 percentage points on final scores while using up to 35x fewer rollouts. On average across six tasks, GEPA gained about +6% over GRPO’s 24,000-rollout runs. Think of it like getting an A when the other method needed many more practice tests just to get a B.
  • Against MIPROv2: GEPA won by over 10% on aggregate across models and tasks. On AIME-2025, for instance, GEPA showed notable gains, and in several tasks it achieved +10–13% aggregate improvements.
  • Cross-Model Transfer: Prompts tuned on Qwen3-8B (weaker, open) still improved GPT-4.1 Mini by about +9% in aggregate—better than other optimizers that trained directly on GPT-4.1 Mini.
  • Prompt Length & Cost: GEPA’s evolved instructions were often up to about 9x shorter than MIPROv2’s few-shot heavy prompts, reducing token costs and latency while improving accuracy.
  • Pareto Sampling Wins: Replacing “always pick the current best” with Pareto-based selection increased aggregate performance by roughly +6–7% across tasks, avoiding local optima.

— Surprising Findings —

  1. Instruction-Only Can Beat Instruction+Few-Shot: Because modern LLMs are better at following instructions and reflecting, strong rule-based instructions (GEPA) sometimes beat long, example-heavy prompts.
  2. Validation Budget Dominates: Most of GEPA’s budget went to validation to manage selection; the actual number of train mini-batch rollouts needed to match GRPO’s best validation scores could be tiny (tens to hundreds), showing high sample efficiency.
  3. Merge Helps, But Timing Matters: System-aware crossover (Merge) added a few extra percent on some settings, especially with GPT-4.1 Mini, but could hurt if invoked at the wrong time or with the wrong budget split.
  4. Inference-Time Search for Code: Even without Retrieval-Augmented Generation, GEPA’s reflective updates turned raw compiler errors into better prompts, dramatically improving kernel quality (e.g., mean vector utilization jumping far beyond a naive baseline).

— Takeaway — If you treat text traces and evaluator feedback as first-class learning signals and evolve prompts with diversity (Pareto), you can beat heavyweight RL training on both speed (fewer rollouts) and final quality across varied, realistic tasks.

05Discussion & Limitations

— Limitations —

  • Feedback Quality Matters: If the evaluator’s text feedback is noisy or uninformative, reflections can propose the wrong rules.
  • Budget Split Sensitivity: Over-investing in validation vs. mini-batch learning (or invoking Merge too early) can waste budget or stall progress.
  • Module Credit Assignment Isn’t Perfect: Choosing which module to edit (e.g., round-robin) is simple; smarter policies might work better in some systems.
  • Some Tasks Favor Examples: A few domains may still benefit from carefully crafted few-shot demonstrations alongside instructions.

— Required Resources —

  • A capable reflection LLM to rewrite prompts from traces and feedback.
  • An evaluation harness that can return not only scores but also useful text diagnostics.
  • A dataset split into a feedback set (for learning) and a Pareto validation set (for selection).
  • Modest compute compared to RL: GEPA avoids massive gradient-based finetuning and heavy rollout counts.

— When NOT to Use —

  • If no meaningful text feedback exists (only a bare numeric score), GEPA loses much of its advantage.
  • If the system is a single black-box step with no modularity or traces, it’s harder to aim edits precisely.
  • If very long, detailed few-shots are absolutely required (e.g., complex structured outputs only learned via examples), instruction-only edits may underperform.

— Open Questions —

  • Adaptive Selection: Can we automatically adjust how much to spend on validation vs. training mini-batches over time?
  • Smarter Module Targeting: Can we learn a policy that chooses which module to edit based on trace signals rather than round-robin?
  • Merge Scheduling: When is the exact right moment (and budget share) to trigger crossover for maximum gains?
  • Human Feedback: How best to blend occasional human comments with automatic evaluator messages to steer reflection?
  • Robustness: How do reflectively evolved prompts behave under adversarial or distribution-shifted inputs over long deployments?

06Conclusion & Future Work

— 3-Sentence Summary — GEPA is a prompt optimizer that reads language traces and evaluator feedback to reflectively rewrite instructions, improving compound AI systems quickly. By maintaining a Pareto set of diverse winners and occasionally merging complementary lineages, it avoids local traps and scales learning from very few rollouts. Across six benchmarks and multiple models, GEPA outperformed strong RL and prompt-optimization baselines while often producing shorter, cheaper prompts.

— Main Achievement — The paper shows that natural-language reflection plus Pareto-guided evolution can outperform heavyweight reinforcement learning (like GRPO) in both sample efficiency and final quality on real, multi-module AI tasks.

— Future Directions —

  • Adaptive scheduling of validation vs. training mini-batches and smarter timing for Merge.
  • Learned policies for selecting which module to edit based on trace signals.
  • Wider applications in inference-time search (e.g., code, data pipelines) using domain-specific textual feedback.
  • Combining reflective instruction evolution with compact, well-chosen few-shots when domains require examples.

— Why Remember This — It reframes optimization for LLM systems: instead of “more gradients from more rollouts,” let the AI read rich language feedback and rewrite its own rules, while keeping many good ideas alive. That simple shift unlocks fast, frugal, and robust improvements that travel well across tasks and even across models.

Practical Applications

  • Speed up multi-hop QA systems by evolving second-hop retrieval rules that target missing entities.
  • Improve instruction-following agents (e.g., IFBench-style constraints) by adding precise format and wording rules learned from failed checks.
  • Harden privacy pipelines (e.g., PUPA) by evolving explicit PII-abstraction procedures and justification steps.
  • Optimize RAG workflows by teaching prompts to avoid redundant retrieval and fill knowledge gaps efficiently.
  • Boost code-generation agents by converting compiler/runtime errors into specific coding and validation rules in the prompt.
  • Reduce serving costs by replacing long few-shot examples with compact, high-quality instructions.
  • Create robust evaluation: evolve adversarial prompts that reveal brittle behaviors to guide safety or fine-tuning.
  • Transfer tuned prompts from open models to closed models as a quick-start baseline when switching providers.
  • Run inference-time search for hard tasks (kernels, ETL scripts) by overfitting GEPA to a finite task set and sharing lessons across tasks.
  • Design modular agent workflows where module-specific feedback guides targeted prompt edits for faster gains.
#GEPA · #reflective prompt evolution · #Pareto frontier · #prompt optimization · #natural language feedback · #evaluation traces · #compound AI systems · #genetic algorithms · #crossover merge · #GRPO · #reinforcement learning with verifiable rewards · #sample efficiency · #HotpotQA · #IFBench · #PUPA