SCOPE: Prompt Evolution for Enhancing Agent Effectiveness
Key Summary
- SCOPE lets AI agents rewrite their own instructions while they are working, so they can fix mistakes and get smarter on the next step, not just the next task.
- It turns an agent's running history (its execution trace) into short, clear rules called guidelines that get added to the agent's prompt.
- A Dual-Stream system separates quick, task-only fixes (tactical) from long-lasting best practices (strategic), so rules don't get mixed up.
- SCOPE looks at problems from two angles at the same time (Efficiency and Thoroughness) and keeps whichever one solves the task.
- On tough benchmarks, SCOPE more than doubled success rates compared to static agents (for example, 14.23% to 38.64% on HLE).
- It helps agents actually use error messages as instructions, avoiding loops and fake assumptions.
- It also learns from successes, adding smarter strategies (like trying synonyms in searches) even when no error happens.
- SCOPE updates the system prompt (the agent's "constitution") so guidance is steady and not bossy, which led to the best results.
- It filters many candidate rules down to the best ones and cleans up duplicates so the prompt stays sharp.
- Because SCOPE works with different models, teams can pick cheaper or faster models for its helper parts without losing much accuracy.
Why This Research Matters
SCOPE makes AI assistants more reliable in real life by letting them learn from what just happened, not only from past training. That means fewer time-wasting loops, fewer made-up answers, and more smart shortcuts like batching or using synonyms. It helps with schoolwork, coding, research, and browsing by turning error messages into instant lessons and successes into reusable strategies. Because SCOPE's helper parts work well with different models, teams can choose cost-effective setups without losing accuracy. The approach also encourages safer behavior by promoting verification rules and fallback plans (like Archive.org) rather than guessing. Overall, it upgrades AI from "static rule follower" to "self-improving problem solver" while it works.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): Imagine you're assembling a giant Lego set with hundreds of pieces. Instructions help, but sometimes you realize mid-build that a better way exists: sort the pieces into sets, label the colors, or check the picture before snapping anything together. If your instructions never change, you'll repeat the same slow steps and even miss obvious fixes when pieces don't fit.
🥬 Filling (The Actual Concept):
- What it is: This paper studies how AI agents can handle huge, constantly changing information ("context") while they work, and proposes SCOPE, a way for agents to evolve their own prompts in real time so they manage that context better.
- How it works (story of the field):
- The world before: AI got really good at pulling in more context, using tools like web search and long context windows, so agents could see a lot. But seeing isn't the same as using. Agents often treated error messages like noisy alarms instead of helpful fix-it notes and kept following the same sluggish strategies when no errors happened.
- The problem: Two failures showed up again and again: Corrective Failures (the agent ignores explicit fixes hidden in error logs and retries the same thing) and Enhancement Failures (no errors happen, but the agent keeps doing something suboptimal, like searching only one keyword).
- Failed attempts: Memory playbooks (like DC or ACE) learn after a task is over, so they can't help mid-task at step 5 of 30. In-context feedback methods append reminders into the chat but don't change the agent's instructions, so agents keep acknowledging mistakes while repeating them. Offline prompt optimizers make a great starting prompt, but it stays frozen once deployed.
- The gap: Agents need step-level learning that updates the actual instructions, not just the chat, and they need per-agent learning because the browser's problems (e.g., blocked pages) differ from the planner's (e.g., bad delegation).
- Real stakes: In daily life, this means fewer frustrating loops (like clicking the same broken link), less made-up information when files won't open, faster answers when many similar items are needed, and smarter searches that try synonyms.
- Why it matters: If an agent can't adapt its own rules while it works, it wastes steps, misses clues that literally say how to fix the problem, and sometimes fabricates data to push forward, which is risky for research, coding, and everyday assistance.
🍞 Bottom Bread (Anchor): Picture a student doing a web research project. The site blocks access (Error: 403). A static agent keeps hitting refresh. SCOPE instead writes a new rule mid-task: "If access is blocked, immediately switch to search or use Archive.org," and the student quickly finds an alternative source and finishes on time.
🍞 Top Bread (Hook): You know how a coach can whisper a quick tip mid-game ("stop dribbling into the corner"), and it changes the next play immediately? That beats giving feedback only after the game ends.
🥬 Filling (New Concept: Language Model Agent)
- What it is: A language model agent is an AI that reads, plans, and acts (like searching the web or running tools) by following a written instruction sheet called a prompt.
- How it works: 1) Read instructions; 2) Read the situation (context); 3) Decide the next action; 4) Observe what happened; 5) Repeat.
- Why it matters: If the instructions never improve, the agent keeps repeating unhelpful moves.
🍞 Bottom Bread (Anchor): A coding agent that keeps calling the wrong function name will keep failing unless its instructions change to say, "Only use the functions on this list."
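To make that read-decide-act-observe cycle concrete, here is a minimal Python sketch. The names (run_agent, call_model, the "FINAL:" convention, the tools dictionary) are assumptions made for illustration, not the paper's implementation.

```python
# Minimal sketch of a language model agent's loop (illustrative names, not the paper's code).
def call_model(system_prompt: str, context: str) -> str:
    """Placeholder for the underlying language model call."""
    raise NotImplementedError

def run_agent(task: str, system_prompt: str, tools: dict, max_steps: int = 30) -> str:
    context = [f"Task: {task}"]
    for _ in range(max_steps):
        # 1-2) Read the instructions and the situation so far.
        decision = call_model(system_prompt, "\n".join(context))
        if decision.startswith("FINAL:"):  # the model signals it is done
            return decision.removeprefix("FINAL:").strip()
        # 3) Decide and take the next action, e.g. "search: walks leaders 2023".
        tool_name, argument = decision.split(":", 1)
        observation = tools[tool_name.strip()](argument.strip())
        # 4) Observe what happened, then 5) repeat with the richer context.
        context.append(f"Action: {decision}\nObservation: {observation}")
    return "No answer within the step budget."
```

Note that the system_prompt stays fixed for the whole loop; that fixed instruction sheet is exactly what SCOPE later makes evolvable.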
🍞 Top Bread (Hook): Imagine a recipe card you never update, even after you discover your oven runs hot. You'd keep burning cookies.
🥬 Filling (New Concept: Prompt)
- What it is: A prompt is the agent's instruction sheet that tells it how to think, act, and prioritize.
- How it works: It sets goals, rules (like safety and style), and tool-usage tips the agent follows on every step.
- Why it matters: If the prompt stays static, the agent can't learn new tricks or avoid old traps.
🍞 Bottom Bread (Anchor): "Always try synonyms in search" belongs in the prompt so the agent doesn't forget next time.
🍞 Top Bread (Hook): When you keep a journal of what you tried in math class, it helps you learn what to do next time.
🥬 Filling (New Concept: Execution Trace)
- What it is: An execution trace is the step-by-step history of what the agent did and what happened.
- How it works: It logs actions, tool calls, errors, and successes.
- Why it matters: It's the best source of lessons: what to fix and what to repeat.
🍞 Bottom Bread (Anchor): "Tried 'walks' and got poor results; next time, include 'base on balls (BB)'."
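As a rough picture, an execution trace can be thought of as a list of per-step records. The field names below are assumptions for the sketch, not a schema from the paper.

```python
# Illustrative execution trace: what the agent did, what it observed, what failed.
trace = [
    {"step": 1, "action": "search('walks leaders 2023')",
     "observation": "few relevant results", "error": None},
    {"step": 2, "action": "search('base on balls (BB) leaders 2023')",
     "observation": "found the stats table", "error": None},
    {"step": 3, "action": "fetch_page('https://example.com/stats')",
     "observation": "", "error": "HTTP 403: access denied"},
]

# Downstream components mine this log for lessons: errors to correct, successes to repeat.
errors = [step for step in trace if step["error"]]
successes = [step for step in trace if not step["error"]]
```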
🍞 Top Bread (Hook): Sometimes a fire alarm tells you to leave the building, but it doesn't tell you where the exit is.
🥬 Filling (New Concept: Corrective Failure)
- What it is: A corrective failure is when the agent sees an error but doesn't use its message to fix the cause.
- How it works: The agent treats errors as generic alarms and retries the same action.
- Why it matters: It leads to loops, wasted steps, and even guesswork.
🍞 Bottom Bread (Anchor): The error lists valid arguments, but the agent keeps passing the wrong one anyway.
🍞 Top Bread (Hook): Getting a B on a test with no red marks doesn't mean you can't improve.
🥬 Filling (New Concept: Enhancement Failure)
- What it is: An enhancement failure is when the agent misses a chance to optimize even though nothing broke.
- How it works: The agent sticks to a one-track plan, ignoring hints to do better (like batching or synonyms).
- Why it matters: It wastes time and misses higher accuracy.
🍞 Bottom Bread (Anchor): Searching only "walks" instead of also "BB" misses the right stats even without any error.
02 Core Idea
🍞 Top Bread (Hook): Imagine your backpack checklist learns from your morning routine: after you forget your water bottle once, the list adds "Fill bottle" forever, and after you spend too long tying shoes, it adds "Double-knot quickly on busier days."
🥬 Filling (The Aha! Concept: SCOPE)
- What it is: SCOPE is a way for agents to evolve their prompts during the task by turning their own step-by-step history into short, useful rules.
- How it works:
- Watch what happened (successes and errors) in the execution trace.
- Write tiny guidelines that fix errors or improve strategy.
- Route each guideline to either a short-term bucket (tactical) or a long-term bucket (strategic).
- Clean up and combine long-term rules so they stay sharp.
- Update the agent's prompt right now and continue.
- Why it matters: Without live prompt updates, agents can't recover mid-task, so they loop, stall, or stay average.
🍞 Bottom Bread (Anchor): A browser agent hits "Access Denied." SCOPE immediately adds: "If blocked, switch to Search or use Archive.org," and the agent pivots instead of retrying forever.
Three analogies:
- Recipe notebook: You write a margin note the moment a cookie burns ("Reduce oven to 170°C"), so the next batch is correct.
- Coach on the sidelines: Quick whispers change the very next play, not the season's end.
- Road signs while driving: If a bridge is closed, a fresh detour sign appears right away.
Before vs After:
- Before: Agents could see more info but didn't know how to use it; errors were alarms, not instructions; strategies stayed rigid.
- After: Prompts evolve; errors teach fixes; successes teach optimizations; different agent roles learn different habits.
Why it works (intuition):
- Guidance goes into the system prompt, like a constitution, so it's internalized, not just temporarily noticed.
- Dual-Stream separation prevents mixing one-off tips with evergreen rules.
- Best-of-N picking and memory cleanup keep only high-quality guidance.
- Multiple perspectives (Efficiency vs Thoroughness) cover different kinds of tasks, and the system keeps the winning result.
Building blocks (each introduced with a mini-sandwich):
🍞 Top Bread (Hook): You know how a teacher writes class rules after seeing common mistakes? 🥬 Filling (New Concept: Guideline Synthesis)
- What it is: The agent turns traces into short, actionable rules.
- How it works: When an error happens, write a corrective rule; when a success looks improvable, write an enhancement rule; generate a few candidates and pick the best.
- Why it matters: Clear rules change behavior immediately instead of hoping the agent "gets the hint." 🍞 Bottom Bread (Anchor): "Always list plausible label synonyms when comparing figures."
🍞 Top Bread (Hook): Like packing two bags: one for today's trip, one for every trip ever. 🥬 Filling (New Concept: Dual-Stream Mechanism with Tactical vs Strategic Memory)
- What it is: Tactical rules help only this task; strategic rules become permanent best practices.
- How it works: Classify the new rule; low-confidence or specific rules stay tactical; high-confidence, general rules go strategic; combine them into the next step's prompt.
- Why it matters: You avoid cluttering long-term memory with one-off tips and avoid forgetting immediate fixes. 🍞 Bottom Bread (Anchor): Tactical: "This dataset has column X missing." Strategic: "Batch similar queries instead of running them one-by-one."
🍞 Top Bread (Hook): Like cleaning your locker: combine duplicates, remove overlaps, and settle disagreements. 🥬 Filling (New Concept: Memory Optimization)
- What it is: A cleanup step that resolves conflicts, removes rules already covered by broader ones, and merges similar rules.
- How it works: 1) Conflict resolution; 2) Subsumption pruning; 3) Consolidation.
- Why it matters: A messy rulebook slows the agent; a tidy one speeds it up. 🍞 Bottom Bread (Anchor): "Cache API results" and "Cache intermediate results" become one crisp caching rule.
🍞 Top Bread (Hook): Solving a mystery is easier if one detective works fast and another digs deep. 🥬 Filling (New Concept: Perspective-Driven Exploration)
- What it is: The agent runs two evolving prompts at once, one focused on Efficiency and one on Thoroughness, and keeps the better answer.
- How it works: Each perspective learns different rules; at the end, choose the successful path.
- Why it matters: Some tasks need speed; others need depth. Running both covers more ground. 🍞 Bottom Bread (Anchor): Efficiency says "fail fast and switch tools." Thoroughness says "find alternate sources and verify."
03 Methodology
High-level pipeline: Input task → Execute with current prompt → Trigger on error or subtask success → Generate rule candidates → Select best rule → Classify as tactical or strategic → Optimize strategic memory → Update prompt → Continue.
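The sketch below walks through one pass of that pipeline in Python. Everything here is an assumption made for illustration: the llm() placeholder, the Guideline fields, and the 0.85 confidence threshold (a value the open questions later mention as a tunable) stand in for components the paper describes only in prose.

```python
# One SCOPE update step, sketched end to end (illustrative names, not the authors' code).
from dataclasses import dataclass, field

@dataclass
class Guideline:
    text: str
    kind: str           # "corrective" (from an error) or "enhancement" (from a success)
    general: bool       # useful beyond the current task?
    confidence: float   # classifier's confidence in the rule

@dataclass
class Memory:
    strategic: list = field(default_factory=list)   # long-lived best practices
    tactical: list = field(default_factory=list)    # notes for the current task only

CONFIDENCE_THRESHOLD = 0.85  # assumed promotion threshold

def llm(prompt: str) -> str:
    """Placeholder for a call to whichever model powers SCOPE's helper parts."""
    raise NotImplementedError

def synthesize_candidates(trace_step: dict, n: int = 2) -> list:
    """Trigger + Guideline Generator: turn an error or improvable success into n candidate rules."""
    kind = "corrective" if trace_step.get("error") else "enhancement"
    return [Guideline(llm(f"Write one short {kind} guideline for this step:\n{trace_step}"),
                      kind, general=False, confidence=0.0) for _ in range(n)]

def select_best(candidates: list) -> Guideline:
    """Best-of-N Selector: keep the most actionable candidate."""
    listing = "\n".join(f"{i}: {c.text}" for i, c in enumerate(candidates))
    return candidates[int(llm(f"Reply with the index of the most actionable guideline:\n{listing}"))]

def classify(rule: Guideline) -> Guideline:
    """Classifier: score generality and confidence for dual-stream routing."""
    verdict = llm(f"Is this rule useful beyond the current task? Answer 'yes,<0-1>' or 'no,<0-1>':\n{rule.text}")
    answer, conf = verdict.split(",", 1)
    rule.general = answer.strip().lower() == "yes"
    rule.confidence = float(conf)
    return rule

def optimize(strategic: list) -> list:
    """Optimizer: resolve conflicts, prune subsumed rules, merge near-duplicates."""
    merged = llm("Merge, deduplicate, and resolve conflicts in these rules:\n"
                 + "\n".join(g.text for g in strategic))
    return [Guideline(line.strip(), "enhancement", general=True, confidence=1.0)
            for line in merged.splitlines() if line.strip()]

def build_prompt(base: str, memory: Memory) -> str:
    """Prompt update: base instructions + strategic memory + this task's tactical notes."""
    strategic = "\n".join(f"- {g.text}" for g in memory.strategic)
    tactical = "\n".join(f"- {g.text}" for g in memory.tactical)
    return f"{base}\n\nStrategic guidelines:\n{strategic}\n\nTactical notes (this task only):\n{tactical}"

def scope_update(trace_step: dict, memory: Memory, base_prompt: str) -> str:
    """Trigger -> generate -> select -> classify -> (optimize) -> rebuild the system prompt."""
    rule = classify(select_best(synthesize_candidates(trace_step)))
    if rule.general and rule.confidence >= CONFIDENCE_THRESHOLD:
        memory.strategic.append(rule)
        memory.strategic = optimize(memory.strategic)
    else:
        memory.tactical.append(rule)
    return build_prompt(base_prompt, memory)
```

In this sketch the tactical list would be cleared when the task ends while the strategic list persists across tasks; that split is the Dual-Stream idea in miniature.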
Step-by-step (with mini-sandwiches and examples):
- Triggers from the execution trace
- What happens: While the agent works, SCOPE watches for two sparks: an error (corrective mode) or a success that looks improvable (enhancement mode).
- Why this step exists: Without triggers, we'd never know when to learn.
- Example: The agent calls a function with the wrong parameter; the error message lists the valid ones. Or a search finds results but took three redundant queries.
- Guideline Synthesis (Generator + Best-of-N Selection) 🍞 Top Bread (Hook): Like writing a sticky note right after a mistake so you don't repeat it. 🥬 Filling (Concept)
- What happens: The Guideline Generator proposes two short candidate rules based on rubrics for either fixing errors or boosting efficiency/correctness; the Selector picks the best one.
- Why this step exists: Generating a few options reduces random bad tips; picking the best improves quality.
- Example data: From an error: "NameError: variable x not defined" → Candidate A: "Define all variables used in code snippets." Candidate B: "Avoid using x." The Selector keeps A. 🍞 Bottom Bread (Anchor): From a success: "Search 'walks' gave weak results" → "When stats are ambiguous, include synonyms like 'base on balls (BB).'"
- Dual-Stream Routing (Classifier to Tactical vs Strategic) 🍞 Top Bread (Hook): Pack a lunch (just for today) vs stock the pantry (for every day). 🥬 Filling (Concept)
- What happens: The Classifier scores generality and confidence, routing high-confidence general rules to Strategic Memory and specific or lower-confidence rules to Tactical Memory.
- Why this step exists: Mixing one-off instructions with universal ones causes confusion and bloat.
- Example: "This PDF is corrupted; skip it" (tactical). "If access is blocked, switch to search or Archive.org" (strategic). 🍞 Bottom Bread (Anchor): A browsing guideline about one site's broken button is tactical; a rule about a blocked-access fallback is strategic.
- Memory Optimization (Optimizer) 🍞 Top Bread (Hook): Like editing a study guide so it's short, clear, and non-repeating. 🥬 Filling (Concept)
- What happens: Resolve conflicts, prune specifics that a general rule already covers, and merge near-duplicates.
- Why this step exists: A clean, compact rulebook makes the agent faster and more reliable.
- Example data: 11 efficiency rules shrink to 5: batching, caching, local simple math, concise outputs, and early stop. 🍞 Bottom Bread (Anchor): "Cache API calls" + "Cache intermediate results" → "Cache results to avoid redundant work."
- Prompt Update (system prompt focus)
- What happens: The new prompt is built from the base instructions + strategic memory + the current task's tactical memory.
- Why this step exists: Placing rules in the system prompt works best: it acts like background guidance rather than bossy commands, which leads to higher success rates.
- Example: After adding "Try synonyms," the very next search includes "BB" automatically.
- Perspective-Driven Exploration (K=2: Efficiency and Thoroughness) 🍞 Top Bread (Hook): Two runners take different routes; you keep the one who finishes. 🥬 Filling (Concept)
- What happens: Run two parallel agents with different evolving prompts; each learns its own rules; keep the success.
- Why this step exists: One strategy can't cover all tasks; diversity catches more wins.
- Example: When a page is blocked, Efficiency escalates to Search; Thoroughness tries Archive.org and alternate sources. 🍞 Bottom Bread (Anchor): On hard GAIA tasks, Efficiency often wins; on mid-level tasks, Thoroughness shines; together they beat either alone.
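A minimal sketch of the K=2 idea follows, assuming placeholder run_scope_agent and check_success functions; the perspective hints are paraphrased from the examples above and are not the paper's actual prompts.

```python
# Perspective-Driven Exploration (K=2): run two independently evolving agents, keep the winner.
from concurrent.futures import ThreadPoolExecutor

PERSPECTIVES = {
    "efficiency":   "Prefer fast, low-cost actions; fail fast and switch tools.",
    "thoroughness": "Prefer complete, verified answers; seek alternate sources.",
}

def run_scope_agent(task: str, perspective_hint: str) -> str:
    """Placeholder: one agent whose prompt evolves via SCOPE under this perspective."""
    raise NotImplementedError

def check_success(answer: str, task: str) -> bool:
    """Placeholder: judge whether an answer actually solves the task."""
    raise NotImplementedError

def solve_with_perspectives(task: str) -> str:
    # Each perspective explores and learns its own rules in parallel.
    with ThreadPoolExecutor(max_workers=len(PERSPECTIVES)) as pool:
        futures = {name: pool.submit(run_scope_agent, task, hint)
                   for name, hint in PERSPECTIVES.items()}
        answers = {name: future.result() for name, future in futures.items()}
    # Keep whichever path succeeded; fall back to the deeper attempt otherwise.
    for name in ("efficiency", "thoroughness"):
        if check_success(answers[name], task):
            return answers[name]
    return answers["thoroughness"]
```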
The secret sauce:
- Step-level learning: Fixes appear within seconds, not after task completion.
- Direct instruction updates: Rules go into the system prompt (the "constitution"), not just the chat history.
- Quality filters: Best-of-N selection plus memory cleanup keep the rulebook precise.
- Perspective diversity: Efficiency + Thoroughness cover speed and depth simultaneously.
Concrete mini-examples:
- Corrective rule: "Use only allowed tool names from the provided list." Stops repeated invalid tool calls.
- Enhancement rule: "Batch similar queries (players A, B, C) in one call." Cuts steps and cost.
- Placement win: System-prompt guidance improved accuracy even when it tolerated more exploratory errors, because subtle background rules encouraged better choices without over-restricting the agent.
04 Experiments & Results
The test: Can evolving prompts improve real multi-step tasks? The team evaluated on three tough benchmarks where agents must plan, browse, retrieve, and reason:
- HLE (expert-level questions across STEM and humanities)
- GAIA (real-world assistant tasks with three difficulty levels)
- DeepSearch (multi-hop retrieval and synthesis)
The competition:
- Static baseline agent (no evolution)
- Dynamic Cheatsheet (DC) and Agentic Context Engineering (ACE), two strong prompt/memory learners
The scoreboard (Pass@2, like getting two tries and counting a win if either succeeds; a tiny sketch of the metric follows this list):
- HLE: Baseline 14.23% → SCOPE 38.64% (from a D to a strong B+, more than a 2× improvement)
- GAIA: Baseline 32.73% → SCOPE 56.97% (like jumping from a mid C to a solid A-)
- DeepSearch: Baseline 14.00 → SCOPE 32.00 (over 2×)
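For clarity, here is a small sketch of how a Pass@2 percentage can be computed; the data is made up purely to show the arithmetic, and the function name is my own.

```python
# Pass@2: a task counts as solved if either of two independent attempts succeeds.
def pass_at_2(results: list) -> float:
    """results: one (first_try_ok, second_try_ok) pair per task."""
    solved = sum(1 for first, second in results if first or second)
    return 100.0 * solved / len(results)

demo = [(True, False), (False, False), (False, True), (True, True), (False, False)]
print(f"Pass@2 = {pass_at_2(demo):.1f}%")  # 3 of 5 tasks solved -> 60.0%
```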
Ablations (what parts matter):
- Just adding a Guideline Generator beats baseline (+4.85%).
- Dual-Stream routing (tactical vs strategic) adds +3.63%.
- Best-of-N selection adds +3.03%.
- Memory optimization adds +1.82% (small but steady).
- Perspective-Driven Exploration (Efficiency + Thoroughness) is the single biggest jump: +10.91%.
Surprising findings:
- System vs User prompt placement: Putting rules in the system prompt worked best, even though it allowed more exploration and some extra minor errors, because it acted like calm background guidance rather than strict orders that can prematurely stop good searches.
- Model choice for SCOPE's helper parts: Using GPT-4.1, Gemini-2.5-Pro, or matching the base agent all scored within ~1% of each other, even though Gemini generated 46% more guidelines. Lesson: quality filtering beats guideline quantity.
- Baseline fragility: A plain agent wrapped with tools sometimes did worse than a raw model, because tool errors and loops tanked performance. SCOPE's corrective and enhancement rules reduced those loops and turned error messages into clear fixes.
- Enhancement dominates: About 61% of learned guidelines were enhancement rules, not just error fixes. SCOPE is more than a debugger; it's a strategy upgrader.
Behavioral metrics (GAIA):
- Baseline had the most errors and timeouts.
- User-prompt rules cut visible errors but led to over-compliance and lower accuracy.
- System-prompt rules encouraged smarter exploration with higher overall success.
Category gains:
- HLE Biology/Medicine and Chemistry saw huge jumps where strict protocols and specialized tools matter; SCOPE's domain rules helped agents recover and proceed correctly.
- GAIA Level 3 (hardest) improved notably, showing SCOPE's value in long, noisy, multi-turn tasks.
05 Discussion & Limitations
Limitations:
- Compute and latency overhead: Generating, selecting, classifying, and optimizing rules costs tokens and time; even if small per step, it adds up on long tasks.
- Garbage in, garbage out: Poor traces can produce poor rules; low-quality signals risk teaching the wrong habits.
- Rule drift and overfitting: If early rules are too specific, they can crowd the strategic memory before cleanup prunes them.
- Conflict resolution isn't perfect: Automated merging may oversimplify or drop rare-but-important cases.
Required resources:
- Access to capable LLMs for the main agent and SCOPE's helper parts (generator/selector/classifier/optimizer).
- Good logging of execution traces and a tool-using agent framework.
- Prompt capacity in the system prompt to host evolving rules.
When NOT to use:
- One-shot, short tasks where step-level learning won't kick in.
- Ultra-tight latency budgets where any meta-reasoning is too costly.
- Strictly regulated settings demanding fully fixed prompts and frozen behavior.
Open questions:
- How to add stronger guarantees (e.g., rule verification or formal safety checks) before promoting a rule to strategic memory?
- Can we auto-tune thresholds (like confidence 0.85) or pick the number of perspectives based on the task?
- How to measure and improve rule "coverage" so rare but critical situations still get good guidance?
- How to combine SCOPE with retrieval/compression so evolving rules also manage what to load and what to trim?
- Can we detect and reverse harmful rules quickly (rule rollback) with formal triggers?
06 Conclusion & Future Work
Three-sentence summary: SCOPE lets AI agents evolve their own prompts during a task by turning step-by-step traces into tiny, targeted guidelines. A Dual-Stream design separates short-term fixes from long-term best practices, while two perspectives, Efficiency and Thoroughness, expand strategy coverage and keep the best result. Across tough benchmarks, SCOPE more than doubled success rates over static agents, showing that live prompt evolution beats fixed instructions.
Main achievement: Reframing context management as online prompt evolution, teaching agents to write, test, route, and refine their own rules in real time so they adapt mid-task rather than only after they're done.
Future directions: Add stronger safety checks before promoting rules; auto-tune confidence and perspectives; integrate with retrieval/compression to manage both what the agent sees and how it acts; and develop rollback for harmful rules.
Why remember this: SCOPE marks a shift from crafting perfect, static prompts to building agents that improve their own instructions on the fly, turning every error and success into a durable skill.
Practical Applications
- Research assistants that automatically add rules for handling blocked pages and citing sources, improving completeness and trust.
- Coding agents that learn to define variables before use and to run quick local math instead of opening tools, reducing errors and cost.
- Customer support bots that generalize fixes from one ticket (like retrying with valid parameters) to future similar issues.
- Data analysts that adopt batching and caching rules to speed up repeated queries across many items.
- Education helpers that learn to try synonyms, native scripts, or alternate spellings to improve search recall for students.
- Compliance checkers that add verification steps (e.g., confirm statistical assumptions) when domain pitfalls are detected.
- Web-browsing agents that pivot to Archive.org or alternate sources when facing 403/404 blocks.
- Project planners that evolve better delegation rules (which tool for which subtask) after each success or failure.
- Operations dashboards that reduce alert loops by turning recurring error patterns into immediate, actionable runbook steps.
- Knowledge workers' copilots that keep the best long-term strategies in a compact "strategic memory" while ignoring one-off clutter.