SCOPE: Prompt Evolution for Enhancing Agent Effectiveness
Key Summary
- SCOPE lets AI agents rewrite their own instructions while they are working, so they can fix mistakes and get smarter on the next step, not just the next task.
- It turns an agent's running history (its execution trace) into short, clear rules called guidelines that get added to the agent's prompt.
- A Dual-Stream system separates quick, task-only fixes (tactical) from long-lasting best practices (strategic), so rules don't get mixed up.
- SCOPE looks at problems from two angles at the same time (Efficiency and Thoroughness) and keeps whichever one solves the task.
- On tough benchmarks, SCOPE more than doubled success rates compared to static agents (for example, 14.23% to 38.64% on HLE).
- It helps agents actually use error messages as instructions, avoiding loops and fake assumptions.
- It also learns from successes, adding smarter strategies (like trying synonyms in searches) even when no error happens.
- SCOPE updates the system prompt (the agent's "constitution") so guidance is steady and not bossy, which led to the best results.
- It filters many candidate rules down to the best ones and cleans up duplicates so the prompt stays sharp.
- Because SCOPE works with different models, teams can pick cheaper or faster models for its helper parts without losing much accuracy.
Why This Research Matters
SCOPE makes AI assistants more reliable in real life by letting them learn from what just happened, not only from past training. That means fewer time-wasting loops, fewer made-up answers, and more smart shortcuts like batching or using synonyms. It helps with schoolwork, coding, research, and browsing by turning error messages into instant lessons and successes into reusable strategies. Because SCOPE's helper parts work well with different models, teams can choose cost-effective setups without losing accuracy. The approach also encourages safer behavior by promoting verification rules and fallback plans (like Archive.org) rather than guessing. Overall, it upgrades AI from "static rule follower" to "self-improving problem solver" while it works.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): Imagine you're assembling a giant Lego set with hundreds of pieces. Instructions help, but sometimes you realize mid-build that a better way exists: sort the pieces into sets, label the colors, or check the picture before snapping anything together. If your instructions never change, you'll repeat the same slow steps and even miss obvious fixes when pieces don't fit.
🥬 Filling (The Actual Concept):
- What it is: This paper studies how AI agents can handle huge, constantly changing information ("context") while they work, and proposes SCOPE, a way for agents to evolve their own prompts in real time so they manage that context better.
- How it works (story of the field):
- The world before: AI got really good at pulling in more context, using tools like web search and long context windows, so agents could see a lot. But seeing isn't the same as using. Agents often treated error messages like noisy alarms instead of helpful fix-it notes and kept following the same sluggish strategies when no errors happened.
- The problem: Two failures showed up again and again: Corrective Failures (the agent ignores explicit fixes hidden in error logs and retries the same thing) and Enhancement Failures (no errors happen, but the agent keeps doing something suboptimal, like searching only one keyword).
- Failed attempts: Memory playbooks (like DC or ACE) learn after a task is over, so they can't help mid-task at step 5 of 30. In-context feedback methods append reminders into the chat but don't change the agent's instructions, so agents keep acknowledging mistakes while repeating them. Offline prompt optimizers make a great starting prompt, but it stays frozen once deployed.
- The gap: Agents need step-level learning that updates the actual instructions, not just the chat, and they need per-agent learning because the browser's problems (e.g., blocked pages) differ from the planner's (e.g., bad delegation).
- Real stakes: In daily life, this means fewer frustrating loops (like clicking the same broken link), less made-up information when files won't open, faster answers when many similar items are needed, and smarter searches that try synonyms.
- Why it matters: If an agent can't adapt its own rules while it works, it wastes steps, misses clues that literally say how to fix the problem, and sometimes fabricates data to push forward, which is risky for research, coding, and everyday assistance.
🍞 Bottom Bread (Anchor): Picture a student doing a web research project. The site blocks access (Error: 403). A static agent keeps hitting refresh. SCOPE instead writes a new rule mid-task: "If access is blocked, immediately switch to search or use Archive.org," and the student quickly finds an alternative source and finishes on time.
🍞 Top Bread (Hook): You know how a coach can whisper a quick tip mid-game ("stop dribbling into the corner"), and it changes the next play immediately? That beats giving feedback only after the game ends.
🥬 Filling (New Concept: Language Model Agent)
- What it is: A language model agent is an AI that reads, plans, and acts (like searching the web or running tools) by following a written instruction sheet called a prompt.
- How it works: 1) Read instructions; 2) Read the situation (context); 3) Decide the next action; 4) Observe what happened; 5) Repeat.
- Why it matters: If the instructions never improve, the agent keeps repeating unhelpful moves.
🍞 Bottom Bread (Anchor): A coding agent that keeps calling the wrong function name will keep failing unless its instructions change to say, "Only use the functions on this list."
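To make that read-decide-act-observe cycle concrete, here is a minimal Python sketch. The names (run_agent, call_model, the "FINAL:" convention, the tools dictionary) are assumptions made for illustration, not the paper's implementation.

```python
# Minimal sketch of a language model agent's loop (illustrative names, not the paper's code).
def call_model(system_prompt: str, context: str) -> str:
    """Placeholder for the underlying language model call."""
    raise NotImplementedError

def run_agent(task: str, system_prompt: str, tools: dict, max_steps: int = 30) -> str:
    context = [f"Task: {task}"]
    for _ in range(max_steps):
        # 1-2) Read the instructions and the situation so far.
        decision = call_model(system_prompt, "\n".join(context))
        if decision.startswith("FINAL:"):  # the model signals it is done
            return decision.removeprefix("FINAL:").strip()
        # 3) Decide and take the next action, e.g. "search: walks leaders 2023".
        tool_name, argument = decision.split(":", 1)
        observation = tools[tool_name.strip()](argument.strip())
        # 4) Observe what happened, then 5) repeat with the richer context.
        context.append(f"Action: {decision}\nObservation: {observation}")
    return "No answer within the step budget."
```

Note that the system_prompt stays fixed for the whole loop; that fixed instruction sheet is exactly what SCOPE later makes evolvable.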
🍞 Top Bread (Hook): Imagine a recipe card you never update, even after you discover your oven runs hot. You'd keep burning cookies.
🥬 Filling (New Concept: Prompt)
- What it is: A prompt is the agent's instruction sheet that tells it how to think, act, and prioritize.
- How it works: It sets goals, rules (like safety and style), and tool-usage tips the agent follows on every step.
- Why it matters: If the prompt stays static, the agent can't learn new tricks or avoid old traps.
🍞 Bottom Bread (Anchor): "Always try synonyms in search" belongs in the prompt so the agent doesn't forget next time.
🍞 Top Bread (Hook): When you keep a journal of what you tried in math class, it helps you learn what to do next time.
🥬 Filling (New Concept: Execution Trace)
- What it is: An execution trace is the step-by-step history of what the agent did and what happened.
- How it works: It logs actions, tool calls, errors, and successes.
- Why it matters: It's the best source of lessons: what to fix and what to repeat.
🍞 Bottom Bread (Anchor): "Tried 'walks' and got poor results; next time, include 'base on balls (BB)'."
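As a rough picture, an execution trace can be thought of as a list of per-step records. The field names below are assumptions for the sketch, not a schema from the paper.

```python
# Illustrative execution trace: what the agent did, what it observed, what failed.
trace = [
    {"step": 1, "action": "search('walks leaders 2023')",
     "observation": "few relevant results", "error": None},
    {"step": 2, "action": "search('base on balls (BB) leaders 2023')",
     "observation": "found the stats table", "error": None},
    {"step": 3, "action": "fetch_page('https://example.com/stats')",
     "observation": "", "error": "HTTP 403: access denied"},
]

# Downstream components mine this log for lessons: errors to correct, successes to repeat.
errors = [step for step in trace if step["error"]]
successes = [step for step in trace if not step["error"]]
```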
🍞 Top Bread (Hook): Sometimes a fire alarm tells you to leave the building, but it doesn't tell you where the exit is.
🥬 Filling (New Concept: Corrective Failure)
- What it is: A corrective failure is when the agent sees an error but doesn't use its message to fix the cause.
- How it works: The agent treats errors as generic alarms and retries the same action.
- Why it matters: It leads to loops, wasted steps, and even guesswork.
🍞 Bottom Bread (Anchor): The error lists valid arguments, but the agent keeps passing the wrong one anyway.
🍞 Top Bread (Hook): Getting a B on a test with no red marks doesn't mean you can't improve.
🥬 Filling (New Concept: Enhancement Failure)
- What it is: An enhancement failure is when the agent misses a chance to optimize even though nothing broke.
- How it works: The agent sticks to a one-track plan, ignoring hints to do better (like batching or synonyms).
- Why it matters: It wastes time and misses higher accuracy.
🍞 Bottom Bread (Anchor): Searching only "walks" instead of also "BB" misses the right stats even without any error.
02 Core Idea
🍞 Top Bread (Hook): Imagine your backpack checklist learns from your morning routine: after you forget your water bottle once, the list adds "Fill bottle" forever, and after you spend too long tying shoes, it adds "Double-knot quickly on busier days."
🥬 Filling (The Aha! Concept: SCOPE)
- What it is: SCOPE is a way for agents to evolve their prompts during the task by turning their own step-by-step history into short, useful rules.
- How it works:
- Watch what happened (successes and errors) in the execution trace.
- Write tiny guidelines that fix errors or improve strategy.
- Route each guideline to either a short-term bucket (tactical) or a long-term bucket (strategic).
- Clean up and combine long-term rules so they stay sharp.
- Update the agent's prompt right now and continue.
- Why it matters: Without live prompt updates, agents can't recover mid-task, so they loop, stall, or stay average.
🍞 Bottom Bread (Anchor): A browser agent hits "Access Denied." SCOPE immediately adds: "If blocked, switch to Search or use Archive.org," and the agent pivots instead of retrying forever.
Three analogies:
- Recipe notebook: You write a margin note the moment a cookie burns ("Reduce oven to 170°C"), so the next batch is correct.
- Coach on the sidelines: Quick whispers change the very next play, not the season's end.
- Road signs while driving: If a bridge is closed, a fresh detour sign appears right away.
Before vs After:
- Before: Agents could see more info but didn't know how to use it; errors were alarms, not instructions; strategies stayed rigid.
- After: Prompts evolve; errors teach fixes; successes teach optimizations; different agent roles learn different habits.
Why it works (intuition):
- Guidance goes into the system prompt, like a constitution, so it's internalized, not just temporarily noticed.
- Dual-Stream separation prevents mixing one-off tips with evergreen rules.
- Best-of-N picking and memory cleanup keep only high-quality guidance.
- Multiple perspectives (Efficiency vs Thoroughness) cover different kinds of tasks, and the system keeps the winning result.
Building blocks (each introduced with a mini-sandwich):
🍞 Top Bread (Hook): You know how a teacher writes class rules after seeing common mistakes? 🥬 Filling (New Concept: Guideline Synthesis)
- What it is: The agent turns traces into short, actionable rules.
- How it works: When an error happens, write a corrective rule; when a success looks improvable, write an enhancement rule; generate a few candidates and pick the best.
- Why it matters: Clear rules change behavior immediately instead of hoping the agent "gets the hint." 🍞 Bottom Bread (Anchor): "Always list plausible label synonyms when comparing figures."
🍞 Top Bread (Hook): Like packing two bags: one for today's trip, one for every trip ever. 🥬 Filling (New Concept: Dual-Stream Mechanism with Tactical vs Strategic Memory)
- What it is: Tactical rules help only this task; strategic rules become permanent best practices.
- How it works: Classify the new rule; low-confidence or specific rules stay tactical; high-confidence, general rules go strategic; combine them into the next step's prompt.
- Why it matters: You avoid cluttering long-term memory with one-off tips and avoid forgetting immediate fixes. 🍞 Bottom Bread (Anchor): Tactical: "This dataset has column X missing." Strategic: "Batch similar queries instead of running them one-by-one."
🍞 Top Bread (Hook): Like cleaning your locker: combine duplicates, remove overlaps, and settle disagreements. 🥬 Filling (New Concept: Memory Optimization)
- What it is: A cleanup step that resolves conflicts, removes rules already covered by broader ones, and merges similar rules.
- How it works: 1) Conflict resolution; 2) Subsumption pruning; 3) Consolidation.
- Why it matters: A messy rulebook slows the agent; a tidy one speeds it up. 🍞 Bottom Bread (Anchor): "Cache API results" and "Cache intermediate results" become one crisp caching rule.
🍞 Top Bread (Hook): Solving a mystery is easier if one detective works fast and another digs deep. 🥬 Filling (New Concept: Perspective-Driven Exploration)
- What it is: The agent runs two evolving prompts at once, one focused on Efficiency and one on Thoroughness, and keeps the better answer.
- How it works: Each perspective learns different rules; at the end, choose the successful path.
- Why it matters: Some tasks need speed; others need depth. Running both covers more ground. 🍞 Bottom Bread (Anchor): Efficiency says "fail fast and switch tools." Thoroughness says "find alternate sources and verify."
03 Methodology
High-level pipeline: Input task → Execute with current prompt → Trigger on error or subtask success → Generate rule candidates → Select best rule → Classify as tactical or strategic → Optimize strategic memory → Update prompt → Continue.
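The sketch below walks through one pass of that pipeline in Python. Everything here is an assumption made for illustration: the llm() placeholder, the Guideline fields, and the 0.85 confidence threshold (a value the open questions later mention as a tunable) stand in for components the paper describes only in prose.

```python
# One SCOPE update step, sketched end to end (illustrative names, not the authors' code).
from dataclasses import dataclass, field

@dataclass
class Guideline:
    text: str
    kind: str           # "corrective" (from an error) or "enhancement" (from a success)
    general: bool       # useful beyond the current task?
    confidence: float   # classifier's confidence in the rule

@dataclass
class Memory:
    strategic: list = field(default_factory=list)   # long-lived best practices
    tactical: list = field(default_factory=list)    # notes for the current task only

CONFIDENCE_THRESHOLD = 0.85  # assumed promotion threshold

def llm(prompt: str) -> str:
    """Placeholder for a call to whichever model powers SCOPE's helper parts."""
    raise NotImplementedError

def synthesize_candidates(trace_step: dict, n: int = 2) -> list:
    """Trigger + Guideline Generator: turn an error or improvable success into n candidate rules."""
    kind = "corrective" if trace_step.get("error") else "enhancement"
    return [Guideline(llm(f"Write one short {kind} guideline for this step:\n{trace_step}"),
                      kind, general=False, confidence=0.0) for _ in range(n)]

def select_best(candidates: list) -> Guideline:
    """Best-of-N Selector: keep the most actionable candidate."""
    listing = "\n".join(f"{i}: {c.text}" for i, c in enumerate(candidates))
    return candidates[int(llm(f"Reply with the index of the most actionable guideline:\n{listing}"))]

def classify(rule: Guideline) -> Guideline:
    """Classifier: score generality and confidence for dual-stream routing."""
    verdict = llm(f"Is this rule useful beyond the current task? Answer 'yes,<0-1>' or 'no,<0-1>':\n{rule.text}")
    answer, conf = verdict.split(",", 1)
    rule.general = answer.strip().lower() == "yes"
    rule.confidence = float(conf)
    return rule

def optimize(strategic: list) -> list:
    """Optimizer: resolve conflicts, prune subsumed rules, merge near-duplicates."""
    merged = llm("Merge, deduplicate, and resolve conflicts in these rules:\n"
                 + "\n".join(g.text for g in strategic))
    return [Guideline(line.strip(), "enhancement", general=True, confidence=1.0)
            for line in merged.splitlines() if line.strip()]

def build_prompt(base: str, memory: Memory) -> str:
    """Prompt update: base instructions + strategic memory + this task's tactical notes."""
    strategic = "\n".join(f"- {g.text}" for g in memory.strategic)
    tactical = "\n".join(f"- {g.text}" for g in memory.tactical)
    return f"{base}\n\nStrategic guidelines:\n{strategic}\n\nTactical notes (this task only):\n{tactical}"

def scope_update(trace_step: dict, memory: Memory, base_prompt: str) -> str:
    """Trigger -> generate -> select -> classify -> (optimize) -> rebuild the system prompt."""
    rule = classify(select_best(synthesize_candidates(trace_step)))
    if rule.general and rule.confidence >= CONFIDENCE_THRESHOLD:
        memory.strategic.append(rule)
        memory.strategic = optimize(memory.strategic)
    else:
        memory.tactical.append(rule)
    return build_prompt(base_prompt, memory)
```

In this sketch the tactical list would be cleared when the task ends while the strategic list persists across tasks; that split is the Dual-Stream idea in miniature.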
Step-by-step (with mini-sandwiches and examples):
- Triggers from the execution trace
- What happens: While the agent works, SCOPE watches for two sparks: an error (corrective mode) or a success that looks improvable (enhancement mode).
- Why this step exists: Without triggers, we'd never know when to learn.
- Example: The agent calls a function with the wrong parameter; the error message lists the valid ones. Or a search finds results but took three redundant queries.
- Guideline Synthesis (Generator + Best-of-N Selection) 🍞 Top Bread (Hook): Like writing a sticky note right after a mistake so you don't repeat it. 🥬 Filling (Concept)
- What happens: The Guideline Generator proposes two short candidate rules based on rubrics for either fixing errors or boosting efficiency/correctness; the Selector picks the best one.
- Why this step exists: Generating a few options reduces random bad tips; picking the best improves quality.
- Example data: From an error: "NameError: variable x not defined" → Candidate A: "Define all variables used in code snippets." Candidate B: "Avoid using x." The Selector keeps A. 🍞 Bottom Bread (Anchor): From a success: "Search 'walks' gave weak results" → "When stats are ambiguous, include synonyms like 'base on balls (BB).'"
- Dual-Stream Routing (Classifier to Tactical vs Strategic) 🍞 Top Bread (Hook): Pack a lunch (just for today) vs stock the pantry (for every day). 🥬 Filling (Concept)
- What happens: The Classifier scores generality and confidence, routing high-confidence general rules to Strategic Memory and specific or lower-confidence rules to Tactical Memory.
- Why this step exists: Mixing one-off instructions with universal ones causes confusion and bloat.
- Example: "This PDF is corrupted; skip it" (tactical). "If access is blocked, switch to search or Archive.org" (strategic). 🍞 Bottom Bread (Anchor): A browsing guideline about one site's broken button is tactical; a rule about a blocked-access fallback is strategic.
- Memory Optimization (Optimizer) 🍞 Top Bread (Hook): Like editing a study guide so it's short, clear, and non-repeating. 🥬 Filling (Concept)
- What happens: Resolve conflicts, prune specifics that a general rule already covers, and merge near-duplicates.
- Why this step exists: A clean, compact rulebook makes the agent faster and more reliable.
- Example data: 11 efficiency rules shrink to 5: batching, caching, local simple math, concise outputs, and early stop. 🍞 Bottom Bread (Anchor): "Cache API calls" + "Cache intermediate results" → "Cache results to avoid redundant work."
- Prompt Update (system prompt focus)
- What happens: The new prompt is built from the base instructions + strategic memory + the current task's tactical memory.
- Why this step exists: Placing rules in the system prompt works best: it acts like background guidance rather than bossy commands, which leads to higher success rates.
- Example: After adding "Try synonyms," the very next search includes "BB" automatically.
- Perspective-Driven Exploration (K=2: Efficiency and Thoroughness) 🍞 Top Bread (Hook): Two runners take different routes; you keep the one who finishes. 🥬 Filling (Concept)
- What happens: Run two parallel agents with different evolving prompts; each learns its own rules; keep the success.
- Why this step exists: One strategy can't cover all tasks; diversity catches more wins.
- Example: When a page is blocked, Efficiency escalates to Search; Thoroughness tries Archive.org and alternate sources. 🍞 Bottom Bread (Anchor): On hard GAIA tasks, Efficiency often wins; on mid-level tasks, Thoroughness shines; together they beat either alone.
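A minimal sketch of the K=2 idea follows, assuming placeholder run_scope_agent and check_success functions; the perspective hints are paraphrased from the examples above and are not the paper's actual prompts.

```python
# Perspective-Driven Exploration (K=2): run two independently evolving agents, keep the winner.
from concurrent.futures import ThreadPoolExecutor

PERSPECTIVES = {
    "efficiency":   "Prefer fast, low-cost actions; fail fast and switch tools.",
    "thoroughness": "Prefer complete, verified answers; seek alternate sources.",
}

def run_scope_agent(task: str, perspective_hint: str) -> str:
    """Placeholder: one agent whose prompt evolves via SCOPE under this perspective."""
    raise NotImplementedError

def check_success(answer: str, task: str) -> bool:
    """Placeholder: judge whether an answer actually solves the task."""
    raise NotImplementedError

def solve_with_perspectives(task: str) -> str:
    # Each perspective explores and learns its own rules in parallel.
    with ThreadPoolExecutor(max_workers=len(PERSPECTIVES)) as pool:
        futures = {name: pool.submit(run_scope_agent, task, hint)
                   for name, hint in PERSPECTIVES.items()}
        answers = {name: future.result() for name, future in futures.items()}
    # Keep whichever path succeeded; fall back to the deeper attempt otherwise.
    for name in ("efficiency", "thoroughness"):
        if check_success(answers[name], task):
            return answers[name]
    return answers["thoroughness"]
```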
The secret sauce:
- Step-level learning: Fixes appear within seconds, not after task completion.
- Direct instruction updates: Rules go into the system prompt (the "constitution"), not just the chat history.
- Quality filters: Best-of-N selection plus memory cleanup keep the rulebook precise.
- Perspective diversity: Efficiency + Thoroughness cover speed and depth simultaneously.
Concrete mini-examples:
- Corrective rule: "Use only allowed tool names from the provided list." Stops repeated invalid tool calls.
- Enhancement rule: "Batch similar queries (players A, B, C) in one call." Cuts steps and cost.
- Placement win: System-prompt guidance improved accuracy even when it tolerated more exploratory errors, because subtle background rules encouraged better choices without over-restricting the agent.
04 Experiments & Results
The test: Can evolving prompts improve real multi-step tasks? The team evaluated on three tough benchmarks where agents must plan, browse, retrieve, and reason:
- HLE (expert-level questions across STEM and humanities)
- GAIA (real-world assistant tasks with three difficulty levels)
- DeepSearch (multi-hop retrieval and synthesis)
The competition:
- Static baseline agent (no evolution)
- Dynamic Cheatsheet (DC) and Agentic Context Engineering (ACE), two strong prompt/memory learners
The scoreboard (Pass@2, like getting two tries and counting a win if either succeeds; a tiny sketch of the metric follows this list):
- HLE: Baseline 14.23% → SCOPE 38.64% (from a D to a strong B+, more than a 2× improvement)
- GAIA: Baseline 32.73% → SCOPE 56.97% (like jumping from a mid C to a solid A-)
- DeepSearch: Baseline 14.00 → SCOPE 32.00 (over 2×)
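For clarity, here is a small sketch of how a Pass@2 percentage can be computed; the data is made up purely to show the arithmetic, and the function name is my own.

```python
# Pass@2: a task counts as solved if either of two independent attempts succeeds.
def pass_at_2(results: list) -> float:
    """results: one (first_try_ok, second_try_ok) pair per task."""
    solved = sum(1 for first, second in results if first or second)
    return 100.0 * solved / len(results)

demo = [(True, False), (False, False), (False, True), (True, True), (False, False)]
print(f"Pass@2 = {pass_at_2(demo):.1f}%")  # 3 of 5 tasks solved -> 60.0%
```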
Ablations (what parts matter):
- Just adding a Guideline Generator beats baseline (+4.85%).
- Dual-Stream routing (tactical vs strategic) adds +3.63%.
- Best-of-N selection adds +3.03%.
- Memory optimization adds +1.82% (small but steady).
- Perspective-Driven Exploration (Efficiency + Thoroughness) is the single biggest jump: +10.91%.
Surprising findings:
- System vs User prompt placement: Putting rules in the system prompt worked best, even though it allowed more exploration and some extra minor errors, because it acted like calm background guidance rather than strict orders that can prematurely stop good searches.
- Model choice for SCOPE's helper parts: Using GPT-4.1, Gemini-2.5-Pro, or matching the base agent all scored within ~1% of each other, even though Gemini generated 46% more guidelines. Lesson: quality filtering beats guideline quantity.
- Baseline fragility: A plain agent wrapped with tools sometimes did worse than a raw model, because tool errors and loops tanked performance. SCOPE's corrective and enhancement rules reduced those loops and turned error messages into clear fixes.
- Enhancement dominates: About 61% of learned guidelines were enhancement rules, not just error fixes. SCOPE is more than a debugger; it's a strategy upgrader.
Behavioral metrics (GAIA):
- Baseline had the most errors and timeouts.
- User-prompt rules cut visible errors but led to over-compliance and lower accuracy.
- System-prompt rules encouraged smarter exploration with higher overall success.
Category gains:
- HLE Biology/Medicine and Chemistry saw huge jumps where strict protocols and specialized tools matter; SCOPE's domain rules helped agents recover and proceed correctly.
- GAIA Level 3 (hardest) improved notably, showing SCOPE's value in long, noisy, multi-turn tasks.
05 Discussion & Limitations
Limitations:
- Compute and latency overhead: Generating, selecting, classifying, and optimizing rules costs tokens and time; even if small per step, it adds up on long tasks.
- Garbage in, garbage out: Poor traces can produce poor rules; low-quality signals risk teaching the wrong habits.
- Rule drift and overfitting: If early rules are too specific, they can crowd the strategic memory before cleanup prunes them.
- Conflict resolution isn't perfect: Automated merging may oversimplify or drop rare-but-important cases.
Required resources:
- Access to capable LLMs for the main agent and SCOPE's helper parts (generator/selector/classifier/optimizer).
- Good logging of execution traces and a tool-using agent framework.
- Prompt capacity in the system prompt to host evolving rules.
When NOT to use:
- One-shot, short tasks where step-level learning won't kick in.
- Ultra-tight latency budgets where any meta-reasoning is too costly.
- Strictly regulated settings demanding fully fixed prompts and frozen behavior.
Open questions:
- How to add stronger guarantees (e.g., rule verification or formal safety checks) before promoting a rule to strategic memory?
- Can we auto-tune thresholds (like confidence 0.85) or pick the number of perspectives based on the task?
- How to measure and improve rule "coverage" so rare but critical situations still get good guidance?
- How to combine SCOPE with retrieval/compression so evolving rules also manage what to load and what to trim?
- Can we detect and reverse harmful rules quickly (rule rollback) with formal triggers?
06 Conclusion & Future Work
Three-sentence summary: SCOPE lets AI agents evolve their own prompts during a task by turning step-by-step traces into tiny, targeted guidelines. A Dual-Stream design separates short-term fixes from long-term best practices, while two perspectives, Efficiency and Thoroughness, expand strategy coverage and keep the best result. Across tough benchmarks, SCOPE more than doubled success rates over static agents, showing that live prompt evolution beats fixed instructions.
Main achievement: Reframing context management as online prompt evolution, teaching agents to write, test, route, and refine their own rules in real time so they adapt mid-task rather than only after they're done.
Future directions: Add stronger safety checks before promoting rules; auto-tune confidence and perspectives; integrate with retrieval/compression to manage both what the agent sees and how it acts; and develop rollback for harmful rules.
Why remember this: SCOPE marks a shift from crafting perfect, static prompts to building agents that improve their own instructions on the fly, turning every error and success into a durable skill.
Practical Applications
- Research assistants that automatically add rules for handling blocked pages and citing sources, improving completeness and trust.
- Coding agents that learn to define variables before use and to run quick local math instead of opening tools, reducing errors and cost.
- Customer support bots that generalize fixes from one ticket (like retrying with valid parameters) to future similar issues.
- Data analysts that adopt batching and caching rules to speed up repeated queries across many items.
- Education helpers that learn to try synonyms, native scripts, or alternate spellings to improve search recall for students.
- Compliance checkers that add verification steps (e.g., confirm statistical assumptions) when domain pitfalls are detected.
- Web-browsing agents that pivot to Archive.org or alternate sources when facing 403/404 blocks.
- Project planners that evolve better delegation rules (which tool for which subtask) after each success or failure.
- Operations dashboards that reduce alert loops by turning recurring error patterns into immediate, actionable runbook steps.
- Knowledge workers' copilots that keep the best long-term strategies in a compact "strategic memory" while ignoring one-off clutter.