
Reinforcement Learning for Self-Improving Agent with Skill Library

Intermediate
Jiongxiao Wang, Qiaojing Yan, Yawei Wang et al. · 12/18/2025
arXiv · PDF

Key Summary

  • This paper shows how to teach AI agents to learn new reusable skills and get better over time using reinforcement learning, not just prompts.
  • The key idea, called SAGE, trains on pairs of similar tasks so the agent can create a skill in the first task and reuse it in the second.
  • A new reward (Skill-integrated Reward) gives extra points when the agent both succeeds at a task and uses or creates a helpful skill.
  • Sequential Rollout lets skills made in earlier tasks carry over to later tasks during training, so learning sticks.
  • They first do supervised fine-tuning with expert examples to help the model follow the skill format reliably.
  • On the AppWorld benchmark, SAGE improves Scenario Goal Completion by 8.9 points over a strong RL baseline.
  • It also needs 26% fewer steps and 59% fewer tokens, so it is faster and cheaper to run.
  • Simply prompting the model to use a skill library didn't work well; RL with skill-aware rewards was needed to really improve.
  • Ablations show that longer task chains didn't help, skill retrieval choices matter, and the new reward outperforms simpler rewards.
  • Results suggest a practical path to self-improving agents that keep getting better after deployment.

Why This Research Matters

Self-improving agents that learn reusable skills can finish everyday digital tasks faster and cheaper, which lowers real-world costs. By rewarding both success and good skill habits, SAGE helps agents build reliable routines instead of reinventing steps each time. This means assistants can adapt to new but similar jobs after deployment without constant retraining. The approach also makes multi-step systems more interpretable: skills are explicit functions you can inspect, audit, and update. With fewer tokens and steps, the same hardware can serve more users. Over time, a shared skill library could raise the floor for quality across many applications.

Detailed Explanation


01Background & Problem Definition

You know how when you learn a new trick for solving math problems, you save it in your head so you can use it again later on a similar question? That’s what we want AI agents to do: learn a trick once, then reuse it.

šŸž Hook: Imagine you’re playing a video game where you learn special moves (skills) like double jump or dash. Once you learn them, you don’t want to relearn them every level—you just use them again and again.

🄬 The Concept (Reinforcement Learning): Reinforcement Learning (RL) is a way to train an AI by giving it rewards for good actions and not rewarding bad ones. How it works: 1) The agent tries something; 2) It gets a reward or not; 3) It changes its behavior to get more rewards next time. Why it matters: Without RL, the agent doesn’t have a clear signal about what worked well, so it won’t reliably improve. šŸž Anchor: Like training a puppy—sit gets a treat; jumping on the couch doesn’t.
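Here is a tiny Python sketch of that try-reward-adjust loop on a made-up two-action problem; everything in it (the actions, the reward rule, the update) is illustrative, not from the paper:

```python
import random

# Toy example: the "agent" picks one of two actions and learns from reward.
action_values = {"sit": 0.0, "jump_on_couch": 0.0}  # estimated value per action
learning_rate = 0.1

def reward_for(action):
    # Hypothetical environment: "sit" earns a treat, the other action does not.
    return 1.0 if action == "sit" else 0.0

for step in range(100):
    # 1) Try something (explore sometimes, otherwise pick the best-known action).
    if random.random() < 0.1:
        action = random.choice(list(action_values))
    else:
        action = max(action_values, key=action_values.get)
    # 2) Get a reward or not.
    r = reward_for(action)
    # 3) Change behavior to get more reward next time.
    action_values[action] += learning_rate * (r - action_values[action])

print(action_values)  # "sit" ends up with the higher estimated value
```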

šŸž Hook: Think about your school backpack. Inside, you keep tools—pencils, rulers, and calculators—so you don’t have to invent them every class.

🄬 The Concept (Skill Library): A skill library is a collection of small, reusable procedures (skills) that the agent can run like mini-programs to finish tasks faster. How it works: 1) When solving a task, the agent writes a small function (a skill); 2) If it works, the agent saves it; 3) In the next similar task, the agent calls that saved function instead of rebuilding every step. Why it matters: Without a skill library, the agent keeps repeating long action sequences, wasting time and tokens and making more mistakes. šŸž Anchor: Once you learn a ā€œsend_email(to, subject)ā€ function, you can reuse it in many email tasks without retyping every step.
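A minimal Python sketch of what a skill library could look like; the class, the storage format, and the naive retrieval rule are assumptions for illustration, not the paper's implementation:

```python
# Hypothetical skill library: skills are stored as source strings keyed by name,
# so they can be re-inserted into the agent's context and called again later.
class SkillLibrary:
    def __init__(self):
        self.skills = {}  # name -> Python source code of the skill

    def save(self, name: str, source: str):
        """Save a skill that worked, so later tasks can reuse it."""
        self.skills[name] = source

    def retrieve(self, query: str):
        """Very naive retrieval: return skills whose name overlaps the query words."""
        words = set(query.lower().split())
        return {n: s for n, s in self.skills.items()
                if words & set(n.lower().split("_"))}

library = SkillLibrary()
library.save("send_email", "def send_email(to, subject, body): ...")
print(library.retrieve("email Alex the meeting notes"))  # finds send_email
```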

The world before: LLM agents could already reason and act over many steps, like browsing websites or calling APIs. People used prompting tricks to get agents to write and reuse skills. But prompt-only methods depended heavily on the model’s instruction-following ability, which led to inconsistent skill quality and usage.

The problem: Agents trained with RL often got good at specific training tasks but didn’t keep learning well once deployed in new situations. They didn’t reliably turn new experiences into solid, reusable skills.

What was tried and why it failed:

  • Prompt-based skill libraries: The model might write skills sometimes, but quality and reuse were hit-or-miss.
  • Post-hoc skills after finishing a task: This made training harder (longer contexts) and split "doing the task" from "making skills," so the agent didn't learn a smooth habit of using skills.

šŸž Hook: Picture finishing a science fair project and only afterward writing down your cool techniques. That’s helpful, but it’s slow and hard to connect to what you’re doing next.

🄬 The Concept (Unified skill format): Instead of separating doing and documenting, the agent writes a function first and then calls it immediately to act. Why it matters: Without this, training gets messy—too long, too late, and less learnable. šŸž Anchor: Like writing a helper function right before you use it in your code.

The gap: There wasn’t a reinforcement learning approach that (1) makes the agent generate and sharpen reusable skills while (2) rewarding both success and smart skill use across similar tasks.

Real stakes:

  • Everyday assistants could finish multi-step jobs (like planning trips or handling emails) faster and cheaper.
  • Companies save compute costs if agents use fewer tokens and steps.
  • Safety and reliability improve when agents reuse proven steps instead of improvising every time.

02Core Idea

šŸž Hook: You know how athletes practice drills that they then reuse in real games? They don’t just play random matches; they practice, save the moves, then use them when it counts.

🄬 The Concept (SAGE): SAGE is a reinforcement learning method that trains an agent on chains of similar tasks so it learns a skill in one task and reuses it in the next, with extra rewards for doing so well. How it works: 1) Show the agent two similar tasks in a row; 2) The agent solves the first by creating a function (a skill) and saves it; 3) The second task can call that skill; 4) The reward gives points for winning the task and bonus points for generating and using helpful skills; 5) The policy updates to make this habit stronger. Why it matters: Without encouraging both success and skill reuse, agents won’t consistently learn to create, store, and apply good skills. šŸž Anchor: Learn a ā€œsearch_product(keyword)ā€ move on one shopping task, then win faster on the next one by calling that move again.

One-sentence aha: Don’t just reward single-task success; reward success that comes from creating and reusing skills across a short chain of similar tasks.

Three analogies:

  • School: Solve a math problem by inventing a mini-formula (skill), then reuse it on the next worksheet for extra credit.
  • Cooking: Write a sauce recipe while making dish #1, then reuse the sauce for dish #2 and get a chef's bonus.
  • Sports: Practice a free-throw routine in drill #1, then use the same routine during the next play and get extra points for consistency.

Before vs. after:

  • Before: RL agents optimized per task, often ignoring the value of building reusable steps. Prompt-only skills were fragile.
  • After: SAGE explicitly trains agents to build and reuse skills, leading to higher accuracy and much better efficiency in multi-step environments.

Why it works (intuition):

  • Credit assignment across tasks: If the second task succeeds using a skill from the first, SAGE can trace success back and reward the earlier skill creation.
  • Habit formation: By always writing and then calling a function, the agent learns a dependable routine.
  • Efficiency pressure: Calling a saved skill shortens future action sequences, lowering steps and tokens.

Building blocks:

  • šŸž Hook: Think of a toolbox you fill as you go, then use on the very next job. 🄬 The Concept (Sequential Rollout): Train on pairs of similar tasks so skills from task 1 are available for task 2. Why it matters: Without it, the agent can’t prove its skill works beyond the task it was born in. šŸž Anchor: Learn ā€œlogin(spotify)ā€ in task 1, use it instantly in task 2.
  • šŸž Hook: Extra credit stickers make you want to do the good behavior again. 🄬 The Concept (Skill-integrated Reward): Add bonus reward when success comes from good skill creation or use. Why it matters: Without bonus points, agents won’t strongly prefer reusable skills over ad-hoc steps. šŸž Anchor: Get 1 point for finishing the level, and an extra point if you used your special move well.
  • šŸž Hook: Following a practiced routine keeps you focused. 🄬 The Concept (Unified skill format): Always define a function first, then call it to act. Why it matters: Without a unified routine, learning breaks into disjointed parts. šŸž Anchor: Write def send_email(...) and call it immediately, instead of sprinkling raw API calls everywhere.
  • šŸž Hook: Training wheels help you learn the right motion. 🄬 The Concept (Supervised Fine-tuning, SFT): Start with expert trajectories so the model learns how to format and use skills before RL. Why it matters: Without SFT, open models struggle to follow the new skill pattern reliably. šŸž Anchor: Watch a coach do three perfect backhands, then try it yourself.

03Methodology

At a high level: Input (two similar tasks and a current skill library) → Step A: Retrieve any matching skills → Step B: Solve Task 1 by defining-and-calling a function (skill), possibly saving it → Step C: Solve Task 2 by reusing or refining the saved skill → Step D: Compute the Skill-integrated Reward for each task → Step E: Update the policy with a GRPO-style objective.

šŸž Hook: Think of a handy robot that solves chores using a set of tools. It can invent a new tool on Monday, then use it on Tuesday.

🄬 The Concept (Tool-using agent): The agent solves tasks by writing small pieces of code that call APIs and can be saved as reusable functions. How it works: 1) Read task + API docs; 2) Define a function (skill) that sequences APIs; 3) Call it; 4) If it works, save it; if not, fix it and try again. Why it matters: Without code-and-API tools, the agent can’t perform complex multi-step operations reliably. šŸž Anchor: To message someone, define send_message(user, text) using the right API, then use it for different friends.
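Here is a toy Python sketch of that define, call, fix, save loop. The "phone API" is faked with a dictionary and the fixed version of the skill is hard-coded; a real agent would regenerate the code with its language model:

```python
# Toy define -> call -> fix -> save loop. The phone API and the "fix" are faked
# so the snippet stays self-contained.
phone_api = {"Alex": [], "Jordan": []}   # pretend inbox per contact

buggy_skill = """
def send_message(user, text):
    phone_api[user].append(text.upper())   # bug: shouts every message
"""
fixed_skill = """
def send_message(user, text):
    phone_api[user].append(text)
"""

skill_library = {}
for source in [buggy_skill, fixed_skill]:
    namespace = {"phone_api": phone_api}
    exec(source, namespace)                                # define the skill
    namespace["send_message"]("Alex", "Meeting at 3 PM")   # call it
    if phone_api["Alex"][-1] == "Meeting at 3 PM":         # did it work?
        skill_library["send_message"] = source             # save it
        break
    phone_api["Alex"].pop()                                # undo the failed try, fix, retry

print(list(skill_library), phone_api)
```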

Step-by-step recipe:

  1. Build the task pair. - Sample two tasks from the same scenario (AppWorld gives trios of similar tasks). - Why this exists: Similar tasks make skill reuse meaningful. - What breaks without it: You can’t tell if the skill transfers.
  2. Retrieve relevant skills. - Pull skills saved from earlier in the scenario (or via a retriever in practical settings). - Why this exists: It gives the agent a head start. - What breaks without it: The agent might reinvent the wheel.
  3. Solve Task 1 with define-then-call. - The agent first writes a function (a skill) that bundles API calls; then it calls it. - If the function fails, the agent edits it and retries; if it works, it saves it. - Why this exists: It forms the habit of packaging actions into reusable skills. - What breaks without it: Action sequences stay long and fragile.
  4. Solve Task 2 by reusing the library. - The new task can call the just-saved skill, adjust arguments, or update the function if needed. - Why this exists: This tests if the skill truly generalizes. - What breaks without it: No proof of transfer learning.
  5. Compute rewards (Skill-integrated Reward). - Base reward: outcome-based success (0–1) from AppWorld's verifier. - Bonus: add +1 when a successful task used a saved skill well; add +1 to the earlier task if its skill was later used successfully. - Penalty: āˆ’1 if the agent refuses to write code and just stops. - Why this exists: It bakes in the habit of creating and using skills. - What breaks without it: The agent might succeed once but never learn reusable structure. (See the reward sketch right after this list.)
  6. Policy update (GRPO-style). - Group multiple rollouts, score them, and update the policy toward higher-reward behaviors. - Why this exists: Stable RL updates need comparisons within groups. - What breaks without it: Learning becomes noisy and unstable.
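To make step 5 concrete, here is a minimal Python sketch of the Skill-integrated Reward for one task pair. The bonus amounts follow the description above, but the data structure and exact conditions are simplifications for illustration, not the paper's code:

```python
from dataclasses import dataclass

@dataclass
class TaskOutcome:
    success: float          # 0-1 score from the environment's verifier
    used_saved_skill: bool  # did this task call a skill from the library?
    refused: bool = False   # did the agent give up without writing code?

def skill_integrated_reward(task: TaskOutcome, skill_helped_later: bool = False) -> float:
    """Base outcome reward plus skill bonuses, minus a refusal penalty (simplified)."""
    if task.refused:
        return -1.0
    reward = task.success                              # base: did the task succeed?
    if task.success >= 1.0 and task.used_saved_skill:  # bonus: succeeded using a saved skill
        reward += 1.0
    if skill_helped_later:                             # bonus: its skill helped a later task
        reward += 1.0
    return reward

# Example: task 2 succeeds by reusing the skill that task 1 created.
t1 = TaskOutcome(success=1.0, used_saved_skill=False)
t2 = TaskOutcome(success=1.0, used_saved_skill=True)
r2 = skill_integrated_reward(t2)                              # 1 (success) + 1 (skill use) = 2.0
r1 = skill_integrated_reward(t1, skill_helped_later=True)     # 1 (success) + 1 (skill helped) = 2.0
print(r1, r2)
```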

šŸž Hook: Like grading a two-problem quiz where problem #2 gets easier if you made a helpful formula on problem #1.

🄬 The Concept (GRPO): GRPO compares multiple answers from the same prompt in a group and pushes the policy toward the relatively better ones. How it works: 1) Sample several outputs; 2) Score each; 3) Push up the better ones and down the worse. Why it matters: Without group-relative comparisons, the update signal can be weak or drift. šŸž Anchor: Judge five attempts at the same question and learn from the best try.
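A tiny sketch of the group-relative scoring idea: rewards from one group of rollouts are turned into advantages by subtracting the group mean and dividing by the spread (the actual policy-gradient update is omitted):

```python
import statistics

def group_relative_advantages(rewards):
    """Turn raw rewards from one prompt's group of rollouts into relative advantages."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0   # avoid division by zero if all rewards are equal
    return [(r - mean) / std for r in rewards]

# Five attempts at the same task pair, scored by the skill-integrated reward.
rewards = [2.0, 0.0, 1.0, 2.0, -1.0]
print(group_relative_advantages(rewards))
# Positive advantages push the policy toward those rollouts; negative ones push it away.
```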

Concrete example (simplified AppWorld):

  • Task 1: "Text Alex: 'Meeting at 3 PM.'" The agent defines send_text(user, message) using the Phone API and calls send_text("Alex", "Meeting at 3 PM"). It works, so the skill is saved.
  • Task 2: "Text Jordan: 'Running late.'" The agent calls send_text("Jordan", "Running late"). Because the task succeeds and used the saved skill, Task 2 gets success + skill-use bonus. Task 1 also gets a bonus because the skill it created helped Task 2 succeed.
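Rendered as toy Python (the phone API is faked so the snippet runs; in AppWorld the agent would call the real Phone API), the two tasks might look like this:

```python
# Faked phone API so the example is self-contained.
outbox = []
def phone_api_send(contact: str, body: str):
    outbox.append((contact, body))

# Task 1: the agent defines a skill, calls it, and saves it because it worked.
def send_text(user, message):
    phone_api_send(user, message)

send_text("Alex", "Meeting at 3 PM")
skill_library = {"send_text": send_text}      # saved for later tasks

# Task 2: reuse the saved skill with new arguments instead of rebuilding the steps.
skill_library["send_text"]("Jordan", "Running late")

print(outbox)
# Task 2 earns success + a skill-use bonus; Task 1 earns a bonus because its skill helped.
```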

šŸž Hook: Sometimes you need a coach first.

🄬 The Concept (Supervised Fine-tuning, SFT): Before RL, train the model on expert trajectories that already follow the define-and-call pattern. How it works: 1) Collect expert examples; 2) Fine-tune the model; 3) Now the model can reliably write and use skills. Why it matters: Without SFT, open models often fail to follow the new skill format, breaking rollouts. šŸž Anchor: Watch a pro do three perfect passes, then practice them yourself.

Secret sauce:

  • The define-then-call format forces skills to exist as first-class code, so they are easy to save and reuse.
  • The two-task chain makes credit assignment across tasks possible.
  • The skill-integrated reward makes skills the star, not just a side effect.

04Experiments & Results

šŸž Hook: Imagine a school test made of mini-units, where each unit has three very similar questions. If you invent a helpful formula in Q1, you can ace Q2 and Q3.

🄬 The Concept (AppWorld): AppWorld is a simulated world of apps (like Gmail, Spotify, Venmo) with real APIs and a checker that grades whether you completed the task. How it works: 1) The agent reads API docs; 2) Writes code to call APIs; 3) The environment verifies success and returns a 0–1 score. Why it matters: Without a reliable checker, you can’t trust the reward. šŸž Anchor: It’s like a lab with instruments and an automatic grader that says if your experiment succeeded.

What they measured and why:

  • šŸž Hook: Report cards don't just have grades—they also show effort and time.
  • 🄬 The Concept (Task Goal Completion, TGC): TGC is the percent of single tasks solved. How it works: Count tasks that pass the checker. Why it matters: Without TGC, you don't know overall accuracy. šŸž Anchor: How many homework problems you got right.
  • 🄬 The Concept (Scenario Goal Completion, SGC): SGC is the percent of scenarios where all three related tasks are solved. How it works: A scenario is a trio of similar tasks; you must get all three. Why it matters: Without SGC, you don't know if the skill really transfers across similar tasks. šŸž Anchor: Three questions in a row—all must be correct to get the scenario star.
  • Efficiency: Avg. Steps and Avg. Tokens measure how many interactions and tokens the agent needs. Fewer often means cheaper and faster.
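A small sketch of how the two accuracy metrics could be computed from per-task pass/fail results, assuming each scenario groups exactly three similar tasks (a simplification of how AppWorld grades):

```python
# Each scenario is a trio of similar tasks; True means the checker passed that task.
results = [
    [True, True, True],    # scenario 1: all three solved
    [True, False, True],   # scenario 2: one miss, so the scenario fails
    [True, True, False],   # scenario 3: one miss, so the scenario fails
]

tasks = [passed for scenario in results for passed in scenario]
tgc = 100 * sum(tasks) / len(tasks)                       # per-task accuracy
sgc = 100 * sum(all(s) for s in results) / len(results)   # all three tasks must pass

print(f"TGC = {tgc:.1f}%, SGC = {sgc:.1f}%")   # TGC = 77.8%, SGC = 33.3%
```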

The competition: They compare against strong baselines, including GRPO (a solid RL training method) and prompting-based agents like ReAct with big models (GPT-4o, Claude). They also include LOOP from prior work.

The scoreboard (Test-Normal):

  • SAGE: 72.0% TGC, 60.7% SGC, 12.1 steps, 1,475 tokens.
  • Baseline GRPO: 69.2% TGC, 51.8% SGC, 16.4 steps, 3,613 tokens.

Context: That SGC jump (+8.9 points) is like moving from a B- to an A-, while also finishing the test faster and writing less. Tokens drop by 59%, a big cost savings.

On the harder Test-Challenge split:

  • SAGE: 50.1% TGC, 32.4% SGC, 17.3 steps, 1,807 tokens.
  • Baseline GRPO: 40.7% TGC, 26.9% SGC, 21.9 steps, 5,211 tokens.

Context: Still clearly ahead and much more efficient.

What changed along the way:

  • Prompt-only skill libraries underperformed (inconsistent skills and usage).
  • SFT with expert data made formatting and basic skill use reliable.
  • RL with SAGE delivered the big jump in both accuracy and efficiency.

Surprising or notable findings:

  • Longer task chains (using all three tasks in a scenario at once) didn't help, likely due to uneven reward distribution and gradient variance; they also cost more compute.
  • Retrieval matters: When you can't assume "same scenario," simple query-similarity retrieval (n-grams) nearly matched the ideal case; embedding-based methods traded some accuracy for efficiency.
  • Reward design matters: The Skill-integrated Reward beat outcome-only and chain-only bonuses, showing it's important to directly reward skill creation and use.
  • Skills help even without training: Evaluating with an empty library hurt SGC and increased steps/tokens, showing that skills are genuinely useful.

05Discussion & Limitations

Limitations:

  • Domain coverage: Experiments are only on AppWorld (a rich simulated environment). Results may differ on other platforms, tools, or action spaces.
  • Initialization needs: SFT on expert data was crucial; starting from scratch with open models was weak at following the skill format.
  • Skill quality reliance: If early skills are poor, they can mislead later tasks.
  • Chain length trade-offs: Longer chains didn't help performance and cost much more compute.
  • Retrieval in the wild: Using the perfect "same scenario" label is unrealistic; retrieval quality strongly affects results.

Required resources:

  • Significant compute for rollouts and training (multi-GPU nodes reported).
  • Access to an expert model (e.g., Claude 3.5) for high-quality SFT data collection.
  • A stable tool environment with verifiable rewards (like AppWorld) to make the RL signal dependable.

When not to use:

  • Tasks with little structural similarity (few reusable patterns) may not benefit much from a skill library.
  • Environments without reliable verifiers; noisy rewards can confuse learning.
  • Ultra-short tasks where writing a function costs more tokens than it saves.

Open questions:

  • Can we design better retrieval tuned for function/skill semantics, not just text similarity?
  • How do we scale beyond pairs to longer but stable chains with good credit assignment?
  • Can we reduce or remove the dependence on expert SFT through improved exploration?
  • How well does SAGE generalize to other domains (robotics, desktop control, web) with different API styles?
  • Safety and governance: How do we prevent saving harmful or brittle skills, and how should skill libraries be versioned and audited?

06Conclusion & Future Work

Three-sentence summary: This paper introduces SAGE, a reinforcement learning framework that teaches agents to create and reuse skills by training on pairs of similar tasks and rewarding both success and smart skill behavior. By unifying ā€œdefine then callā€ skills, using Sequential Rollout, and adding a Skill-integrated Reward, the agent learns to transfer knowledge across tasks efficiently. On AppWorld, SAGE significantly boosts accuracy and slashes steps and tokens compared to strong RL baselines.

Main achievement: Turning skills into first-class citizens in RL—rewarding the making and the using—so agents truly self-improve over time.

Future directions: Improve skill retrieval for open-world deployment, reduce reliance on expert SFT with better exploration, refine reward design for longer chains, and test across diverse tool-using domains. Investigate safety, versioning, and auditing for large shared skill libraries.

Why remember this: It shows a practical, scalable path to agents that don’t just solve one problem now but learn reusable tricks that help them solve the next one faster and better.

Practical Applications

  • Automated email and calendar workflows that reuse scheduling and messaging skills across similar requests.
  • Customer support agents that learn reusable troubleshooting procedures for recurring issues.
  • E-commerce helpers that reuse product-search, filter, and checkout skills to speed up shopping tasks.
  • Internal tools that automate HR or finance tasks by saving and reusing validated form-filling and approval skills.
  • Research assistants that build and reuse web-search and data-extraction skills for repeated information-gathering patterns.
  • DevOps bots that package common deployment and monitoring steps as reusable skills for faster operations.
  • Education tutors that save step-by-step solution patterns (skills) for similar math or science problems.
  • Healthcare admin agents that reuse appointment booking and insurance verification skills within policy constraints.
  • CRM workflows that reuse contact-update, lead-qualification, and follow-up skills to improve sales efficiency.
  • Desktop automation where copy-rename-move patterns become callable skills for file organization.
#Reinforcement Learning#Skill Library#Sequential Rollout#Skill-integrated Reward#GRPO#LLM Agents#Supervised Fine-tuning#AppWorld#Tool-using Agents#Policy Optimization#Reusable Skills#Experience Replay#Sample Efficiency#Code-as-Actions#Retrieval for Skills