
Evolving Programmatic Skill Networks

Intermediate
Haochen Shi, Xingdi Yuan, Bang Liu · 1/7/2026
arXiv · PDF

Key Summary

  • This paper teaches a computer agent to grow a toolbox of skills that are real, runnable programs, not just text ideas.
  • The agent fixes broken skills by reading its own ā€œwhat happenedā€ logs and pinpointing exactly where things went wrong.
  • It protects strong, reliable skills from being accidentally changed while keeping weaker skills flexible to learn.
  • It cleans up its toolbox by merging duplicates and creating reusable building blocks, so the toolbox stays neat and compact.
  • These learning moves look a lot like how neural networks learn: credit flows back, strong parts freeze, and the structure gets tuned.
  • In Minecraft and Crafter, the agent learned faster, forgot less, and solved longer, trickier tasks than popular baselines.
  • It unlocked diamond tools in Minecraft in far fewer tries and kept earlier skills working as new ones were added.
  • Online refactoring (while learning) beat one-time offline cleanup, proving that timing and feedback matter.
  • The big idea: treat skills as programs in a network that evolves through planning, reflection, and careful rewrites.
  • This could make future robots, game AIs, and web agents smarter, steadier learners who improve for years without getting messy or forgetful.

Why This Research Matters

As AI systems move into open-ended worlds, they must keep learning without forgetting. PSN shows a practical way to do this by making each skill executable code that can be traced, fixed, and reorganized. That means agents can grow libraries of dependable skills that compound over time, just like people develop expertise. The result is faster progress on long, multi-step problems and less time wasted reinventing the wheel. Because refactoring stays online and uses fresh feedback, the skill library remains compact and efficient. This approach could power steadier, safer learning in robotics, game AI, tutoring, and web automation.

Detailed Explanation


01 Background & Problem Definition

šŸž Hook: Imagine you’re building a LEGO city. At first, you make simple cars and houses. As the city grows, you reuse pieces, fix shaky parts, and replace clumsy builds with cleaner ones. You don’t throw everything away each time—you evolve your city. That’s how an AI should learn skills, too.

🄬 The world before: Many AI agents learned to solve tasks by writing plans or bits of code on the fly. But often those pieces were temporary (they vanished after use), or they piled up into a messy, flat list of skills with no good way to improve or reorganize them. Agents could solve simple things, but as tasks got longer and more connected (like building a diamond pickaxe in Minecraft), progress slowed or broke.

šŸž Anchor: ReAct, Reflexion, AutoGPT, and Voyager were strong early steps. They could reason, plan, or store skills. But they struggled to steadily improve old skills, decide who caused a failure in a multi-step plan, and clean up duplicates without starting over.

šŸž Hook: You know how a good teacher helps you figure out exactly which step of a math problem you got wrong, instead of saying, ā€œThe whole thing is wrongā€? AI needed that kind of step-by-step accountability.

🄬 The problem: Agents faced two big roadblocks. First, skills often lived in flat libraries or static graphs, so there was no principled way to keep upgrading them as new tasks arrived. Second, when a big task failed, there wasn’t a unified method to assign credit or blame across many nested skills—so fixes were guessy, and strong skills could get damaged.

šŸž Anchor: Picture trying to bake a cake. If it fails, was the problem the oven temp, the batter mix, or the frosting? Without a trace of what happened at each step, you might fix the wrong thing and ruin your perfectly fine frosting.

šŸž Hook: Think of cleaning your room: you don’t just shove more stuff into it. You combine duplicates, label boxes, and toss junk so you can find things later. AI skill libraries need that, too.

🄬 Failed attempts: Some agents stored many skills but didn’t know how to assign responsibility after a failure. Others tried refactoring once, offline, but missed feedback from real execution. Many generated new skills for everything, leading to bloat. And most had no strong way to protect already-reliable skills from being constantly rewritten.

šŸž Anchor: A one-time closet clean-up helps for a day. But if you keep tossing things in without a system, chaos returns.

šŸž Hook: Imagine if skills were actual code with clear inputs, outputs, and checks—like recipes with ingredients and results—connected so they could call each other.

🄬 The gap: The missing piece was a framework where skills are executable programs with clear preconditions and postconditions, linked into a network that keeps evolving. On top of that, it needed three powers: (1) pinpoint which step failed (fault localization), (2) stabilize mature skills while letting new ones learn (maturity-aware gating), and (3) keep the library tidy by merging and abstracting (refactoring with safety checks).

šŸž Anchor: That’s exactly what Programmatic Skill Networks (PSN) provide: a living network of code-skills that plans, reflects, and reorganizes itself.

šŸž Hook: Why should anyone care? Because real life is open-ended.

🄬 Real stakes: Robots, game agents, and web assistants must keep learning new tasks without forgetting old ones. They must diagnose what went wrong, fix it safely, and avoid code clutter. If they can do that, they’ll handle longer projects (like building a house, managing a farm, or booking complex travel) with rising confidence instead of constant resets.

šŸž Anchor: In Minecraft and Crafter, PSN showed exactly that—faster progress, stronger reuse, less forgetting, and a cleaner skill toolbox as the world got tougher.

02 Core Idea

šŸž Hook: You know how a sports team reviews game footage, then practices specific drills, and updates its playbook so the same mistake won’t happen again? That loop—review, fix, reorganize—is the heart of this paper.

🄬 Aha! Moment (one sentence): Treat every skill as a small, executable program in a network, then learn by (1) tracing failures to the exact skill step that went wrong, (2) protecting stable skills while adjusting shaky ones, and (3) refactoring the network structure so it stays compact and reusable.

šŸž Anchor: It’s like LEGO: build a model, test it, reinforce strong joints, swap wobbly parts, and eventually replace repeated chunks with a single neat sub-assembly.

šŸž Hook: Three analogies for the same idea:

  • Kitchen: Recipes (skills) call other recipes. If the cake flops, read the cooking log, fix the wrong step, keep the perfect steps unchanged, and rewrite the cookbook to combine repeated sub-recipes.
  • Library: Books (skills) cite other books. If a fact is wrong, correct that chapter only, lock trusted reference books, and merge duplicate pamphlets into one clear volume.
  • City planning: Streets (skills) connect. If traffic jams happen, adjust that intersection, keep smooth roads as-is, and redesign neighborhoods to remove redundant roads.

🄬 Before vs. After:

  • Before: Flat skill lists, unclear blame when big plans fail, constant rewriting of good parts, toolbox bloat.
  • After: A living network with execution traces for blame, guarded updates to mature skills, and compact structure via refactoring.

🄬 Why it works (intuitively, no equations):

  • Traces are like breadcrumbs: they show exactly which skills ran, in what order, with what preconditions and results. So fixes target the right spot.
  • Maturity-aware gating is like a coach saying, ā€œDon’t mess with our star player.ā€ Strong skills are updated rarely; weaker ones learn more often.
  • Structural refactoring turns repeated code into one reusable skill and merges duplicates. Rollback checks prevent harmful rewrites.
  • Together, this mirrors how neural nets learn: errors flow back along used paths, mature layers get lower learning rates, and architectures get tuned.

🄬 Building blocks (explained with Sandwich):

  • šŸž You know how a toolbox holds tools that can call for other tools? 🄬 Programmatic Skill Network (PSN): It’s a network where each skill is an executable program with clear ā€œwhen it’s safe to runā€ (preconditions) and ā€œwhat it achievesā€ (postconditions). How it works: (1) skills call subskills; (2) a planner chains skills backward from goals; (3) execution leaves a trace; (4) failures trigger targeted fixes; (5) successes can trigger structural cleanups. Why it matters: Without executable, linked skills, you can’t target the right fix or reuse powerfully. šŸž Anchor: In Minecraft, ā€œensurePickaxe(type, n)ā€ can reuse ā€œcraftSticksā€ and ā€œensurePlanks,ā€ instead of rewriting wood logic each time.
  • šŸž You know how detectives replay what happened to find the exact clue that broke the case? 🄬 REFLECT (fault localization): It finds which step (branch, parameter, or subskill) most likely caused the failure by reading the execution trace. Steps: (1) read feedback and trace; (2) push responsibility down to involved subskills; (3) propose concrete code edits bottom-up. Why it matters: Without precise blame, you either change too much or the wrong part. šŸž Anchor: If smelting fails, REFLECT may point to ā€œensureFuel,ā€ not ā€œmineIron,ā€ saving time.
  • šŸž You know how teachers don’t reteach what you already mastered every day? 🄬 Maturity-aware update gating: A reliability score per skill reduces how often strong skills are edited. Steps: (1) track success rate with uncertainty; (2) update often if shaky, rarely if solid; (3) always keep a tiny chance to revise if it breaks in a new combo. Why it matters: Prevents good skills from being ruined by noise. šŸž Anchor: Don’t rewrite perfect ā€œcraftPlanksā€ just because ā€œopenChestā€ failed.
  • šŸž You know how you convert many tiny helpers into one shared function when coding? 🄬 Structural refactoring with rollback: Finds patterns like duplicates or missing abstractions and applies safe rewrites, then tests recent tasks; revert if performance drops too much. Why it matters: Keeps the library compact and fast to plan with. šŸž Anchor: Merge ā€œmineOakLogsā€ and ā€œmineBirchLogsā€ into ā€œmineLogs(type).ā€

šŸž Bottom Bread: The result is a skill toolbox that grows, fixes itself, and stays tidy—strong enough for today’s tasks and ready for tomorrow’s surprises.

03 Methodology

šŸž Hook: Imagine a relay race. A planner picks runners, a manager watches each handoff, a coach reviews the replay if the baton drops, and a team lead reorganizes the lineup after wins. That’s PSN’s recipe.

🄬 High-level overview: Input (task in words) → Plan (reuse skills via backward-chaining; if stuck, ask LLM) → Execute (run skills, record a trace) → If fail: REFLECT + targeted patches; If succeed: structural refactor with safety checks → Output: an evolved, cleaner, more reliable skill network.
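As a rough map of that loop, here is a hedged Python sketch of one PSN iteration. The real components (planner, executor, REFLECT, refactorer) are far richer, so they are injected as callables and only the control flow is shown.

```python
def psn_iteration(task, plan, execute, reflect_and_patch, refactor):
    """One learning iteration, sketched. Assumed signatures:
    plan(task) -> list of skills (reuse first, LLM fallback inside);
    execute(skills) -> (trace, success);
    reflect_and_patch(trace): targeted fixes after a failure;
    refactor(): structural cleanup after a success."""
    skills = plan(task)
    trace, success = execute(skills)
    if success:
        refactor()                # slow loop: tidy the network after wins
    else:
        reflect_and_patch(trace)  # fast loop: precise fixes after failures
    return success
```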

— Step A: Network-aware hybrid planner — šŸž Hook: You know how you start with the goal and ask, ā€œWhat do I need just before that?ā€ and keep stepping backward? 🄬 What it is: A planner that tries to reuse existing skills by working backward from the goal (backward-chaining). How it works: (1) Find skills whose postconditions match the current subgoal; (2) expand unmet preconditions; (3) break ties using each skill’s reliability score; (4) if no known skill helps, call an LLM to propose a forward plan; (5) distill that plan into a new code-skill. Why it matters: Reuse beats reinvent—fewer new skills, more stability. šŸž Anchor: To ā€œmine diamond,ā€ it looks for ā€œensureDiamondPickaxe,ā€ which needs ā€œensureIron,ā€ which needs ā€œsmeltIron,ā€ and so on.
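A toy version of that backward-chaining search, under the simplifying assumption that pre/postconditions can be summarized as sets of symbolic facts (the paper's planner works over richer program-level conditions):

```python
def backward_chain(goal, skills, facts, seen=frozenset()):
    """Toy STRIPS-style sketch, not the paper's planner. Each skill is
    {"name": str, "pre": set, "post": set, "reliability": float}.
    Returns a list of skill names, or None when no known skill helps
    (at which point PSN falls back to an LLM-proposed forward plan)."""
    if goal in facts:
        return []
    # Candidates whose postconditions match the subgoal; break ties on reliability.
    for s in sorted((s for s in skills if goal in s["post"]),
                    key=lambda s: -s["reliability"]):
        if s["name"] in seen:
            continue                          # avoid cyclic expansions
        plan = []
        for pre in s["pre"]:                  # expand unmet preconditions
            sub = backward_chain(pre, skills, facts, seen | {s["name"]})
            if sub is None:
                break
            plan += sub
        else:
            return plan + [s["name"]]
    return None
```

A real planner would also deduplicate subgoals that an earlier plan step already achieves; this sketch omits that for brevity.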

— Step B: Execution and trace construction — šŸž Hook: Think of a video replay that shows every player’s move and the score at each moment. 🄬 What it is: When a plan runs, the system logs an execution trace: which skills ran, their preconditions, postconditions, status, and environment snapshots. How it works: (1) run the skill; (2) store tuples like <skill, pre-state, post-state, status>; (3) aggregate feedback. Why it matters: Without a trace, you can’t locate which step failed. šŸž Anchor: If ā€œcraftPickaxeā€ fails, the trace might show ā€œensureSticksā€ didn’t get enough planks.
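A minimal sketch of what a trace might store per step (the field names are assumptions based on the tuples described above):

```python
from dataclasses import dataclass, field

@dataclass
class TraceEntry:
    skill: str
    pre_state: dict   # environment snapshot before the skill ran
    post_state: dict  # snapshot after it finished (or failed)
    status: str       # "ok" or "fail"

@dataclass
class Trace:
    entries: list = field(default_factory=list)

    def record(self, skill_name: str, before: dict, after: dict, ok: bool):
        self.entries.append(TraceEntry(
            skill_name, dict(before), dict(after), "ok" if ok else "fail"))

    def failed(self) -> bool:
        return any(e.status == "fail" for e in self.entries)
```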

— Step C: Skill optimization via trace-based credit assignment — šŸž Hook: Like reviewing a group project to see which part needs edits. 🄬 What it is: A two-phase repair system that sends responsibility to the right subskills and then applies code fixes in a safe order. How it works: (1) Top-down: REFLECT pushes failure signals to only the skills that actually ran; (2) Bottom-up: apply localized patches starting from leaf skills, then update parents to stay consistent; (3) keep a short buffer of recent proposals to avoid back-and-forth contradictions. Why it matters: Fixes land exactly where needed; unrelated skills are left alone. šŸž Anchor: If ā€œensureFuelā€ is the real problem, only it and its parent get edits; ā€œmineCobblestoneā€ stays untouched.
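Here is a hedged skeleton of that two-phase flow, reusing the trace sketch above. The paper's REFLECT runs an LLM over the full trace; the crude stand-in below just suspects the deepest failing step, and the network topology and LLM edit step are injected as callables:

```python
def localize_fault(entries):
    """Top-down stand-in for REFLECT: only skills that actually ran can be
    blamed; as a crude heuristic, suspect the last (deepest) failing entry."""
    failing = [e for e in entries if e.status == "fail"]
    return failing[-1].skill if failing else None

def repair(entries, propose_patch, apply_patch, depth_of):
    """Bottom-up: patch leaf skills first, then parents, so parent/child
    code stays consistent. `propose_patch(skill, entries)` stands in for
    the LLM edit step; `depth_of(skill)` reflects network topology."""
    suspects = {e.skill for e in entries if e.status == "fail"}
    for skill in sorted(suspects, key=depth_of, reverse=True):  # leaves first
        apply_patch(skill, propose_patch(skill, entries))
```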

— Step D: Maturity-aware update gating — šŸž Hook: A coach won’t redesign the star player’s shot every practice. 🄬 What it is: A gate that decides how often to update a skill based on its reliability and confidence. How it works: (1) compute a value from success rate minus uncertainty; (2) reduce update chance as the skill matures; (3) keep a small floor so rare combinations can still trigger a fix. Why it matters: Prevents catastrophic forgetting and oscillation. šŸž Anchor: Once ā€œcraftPlanksā€ is near-perfect, it’s rarely changed, even if a different skill fails.
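A minimal sketch of such a gate, assuming reliability is tracked as success rate minus a simple uncertainty term; the paper's exact scoring may differ:

```python
import math
import random

def update_probability(successes: int, attempts: int, floor: float = 0.05) -> float:
    """High while a skill is shaky, low once it matures. The floor keeps
    even near-perfect skills revisable if a new combination breaks them."""
    if attempts == 0:
        return 1.0                                  # brand-new skills always learn
    rate = successes / attempts
    uncertainty = math.sqrt(rate * (1.0 - rate) / attempts)
    reliability = max(0.0, rate - uncertainty)      # confidence-adjusted score
    return max(floor, 1.0 - reliability)

def gate_allows_update(successes: int, attempts: int) -> bool:
    """Stochastic gate: e.g., 57/60 successes gives roughly an 8% update chance."""
    return random.random() < update_probability(successes, attempts)
```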

— Step E: Online structural refactoring — šŸž Hook: Like reorganizing your notebook so solutions are easy to find later. 🄬 What it is: A set of safe rewrite rules that spot five relationships: parametric coverage (special case → wrapper), behavioral coverage (replace reimplemented subgraph with a call), sibling specializations (create an abstract skill), common subskill extraction, and duplication removal. How it works: (1) after a successful run, look at parents/children plus top-5 similar skills; (2) detect a case; (3) apply the rewrite; (4) run quick regression tasks; (5) rollback if success drops too much. Why it matters: Keeps the network compact and boosts reuse, which makes planning faster and learning steadier. šŸž Anchor: Turn ā€œmineOakLogsā€ and ā€œmineBirchLogsā€ into wrappers of ā€œmineLogs(type).ā€
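The apply-validate-rollback pattern can be sketched as below; the five rewrite rules themselves are found and applied by the LLM in the paper, so here `rewrite` is an injected callable and the tolerance threshold is an assumption:

```python
import copy

def refactor_with_rollback(network, rewrite, regression_tasks, run_task,
                           max_drop: int = 0):
    """Apply a structural rewrite, re-run recent tasks, revert on regression.
    Assumed signatures: rewrite(network) -> modified network;
    run_task(network, task) -> True on success."""
    baseline = sum(run_task(network, t) for t in regression_tasks)
    candidate = rewrite(copy.deepcopy(network))   # never edit the live network
    score = sum(run_task(candidate, t) for t in regression_tasks)
    if score >= baseline - max_drop:
        return candidate                          # keep the cleaner structure
    return network                                # rollback: the rewrite hurt behavior
```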

— Secret sauce (what makes it clever) —

  • Execution traces power precise fixes.
  • Update gating balances stability and plasticity.
  • Refactoring keeps growth under control.
  • The three run at different speeds: fast fixes on failure, medium stabilization of reliable skills, slow structural cleanups after successes.

— Concrete example (Minecraft: craft a wooden pickaxe) —

  • Plan: ensurePlanks → craftSticks → ensureCraftingTable → craftWoodenPickaxe.
  • Execute: trace shows not enough planks because sticks also consume planks.
  • Optimize: REFLECT points to resource math; the patch adds the missing plank calculation and pre-checks (the arithmetic is sketched below this list).
  • Stabilize: after repeated success, gates reduce updates to this skill.
  • Refactor: later, if multiple tool-crafting flows repeat stick/plank logic, extract a shared ā€œensurePlanks(n)ā€ subskill.
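The failure here comes down to simple resource arithmetic. A sketch of the missing pre-check, using standard Minecraft recipes (a wooden pickaxe takes 3 planks plus 2 sticks, and 2 planks craft 4 sticks); the function name is illustrative:

```python
import math

def planks_needed_for_wooden_pickaxe() -> int:
    """The arithmetic the buggy skill missed: sticks also consume planks."""
    sticks_needed = 2
    planks_for_sticks = 2 * math.ceil(sticks_needed / 4)  # one craft: 2 planks -> 4 sticks
    return 3 + planks_for_sticks                          # 3 + 2 = 5, not just 3

assert planks_needed_for_wooden_pickaxe() == 5
```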

— What breaks without each step —

  • No planner reuse → skill bloat and slower progress.
  • No trace → random or harmful fixes.
  • No two-phase repair → parent/child inconsistencies.
  • No gating → good skills get worse over time.
  • No refactor/rollback → toolbox grows messy; planning slows and regressions sneak in.

04 Experiments & Results

šŸž Hook: Imagine a video game tournament where one player keeps learning new moves without forgetting old ones, cleans up their strategies, and wins more often as matches get harder. That’s what PSN did.

🄬 The tests and why:

  • Minecraft tech tree (MineDojo): measures long, chained tasks (e.g., wooden → stone → iron → diamond tools, then obsidian). It checks if the agent can reuse earlier steps reliably while pushing to deeper goals.
  • Crafter: a survival world with dense feedback. It tests steady, safe progress where early mistakes can snowball.
  • Continual learning metrics: Skill Retention Rate (SRR) shows if mastered skills stay strong after new ones are learned (one plausible way to compute it is sketched after this list).
  • Library growth: Does the toolbox stay trim via reuse, or does it balloon?
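The post doesn't spell out how SRR is computed; one plausible reading, sketched purely as an assumption, is the fraction of previously mastered skills that still pass when re-tested later:

```python
def skill_retention_rate(mastered_before, still_succeeds) -> float:
    """Assumed definition, not taken from the paper: of the skills mastered
    at an earlier checkpoint, what fraction still pass when re-run now?"""
    if not mastered_before:
        return 1.0
    kept = sum(1 for skill in mastered_before if still_succeeds(skill))
    return kept / len(mastered_before)
```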

🄬 The competition:

  • ReAct and Reflexion: strong reasoning and self-reflection, but no evolving program network.
  • AutoGPT: plans code/actions but treats them as disposable.
  • Voyager: keeps a skill library but lacks trace-based fault localization and principled refactoring.
  • PSN ablations: PSN without the optimizer (to see if the representation alone helps) and a ā€œPSN (Create New Skills)ā€ variant that skips reuse (to test bloat).

🄬 Scoreboard with context:

  • Minecraft tech tree: PSN unlocked higher-tier tools faster and more consistently. For diamond tools, PSN averaged about 51 iterations across all three runs (like getting an A when others got a C+ to B-), whereas Voyager succeeded in only one run and needed about 102 iterations. PSN even tackled obsidian via a multi-step, reusable composed skill (bucket crafting → water-lava interaction → diamond-pick mining), showing deep reuse.
  • Crafter cumulative reward: PSN’s curve stayed higher and steadier. Voyager was more stable than planning-only baselines but plateaued earlier. PSN’s policy of fix-then-stabilize-then-refactor led to fewer compounding mistakes and longer survival.
  • Forgetting (SRR): PSN kept earlier skills strong as new ones arrived. Voyager showed sharp drops (catastrophic forgetting). Maturity gating and targeted credit assignment prevented good skills from being unintentionally broken.
  • Library growth: PSN’s skill count flattened and even shrank later thanks to refactoring. The ā€œCreate New Skillsā€ variant kept growing, proving reuse beats proliferation. Online refactoring (PSN) also beat an offline cleanup (Voyager-R), which reduced duplication on paper but didn’t match PSN’s behavior-level robustness.

🄬 Surprising findings:

  • Online beats offline: Cleaning structure during learning, tied to fresh execution traces, worked better than one-time refactors done later.
  • Optimizer matters most at depth: PSN without the optimizer was okay early (like Voyager) but stumbled at diamond/obsidian. Precise fault localization enabled deep progress.
  • Stabilization is a quiet hero: Gating prevented oscillations—skills stopped getting rewritten back and forth.
  • Reuse composes power: The obsidian skill neatly snapped together earlier subskills, proving the network acts like LEGO for long tasks.

šŸž Anchor: Put simply, PSN learned faster, remembered better, and stayed tidy—turning long, wobbly adventures into repeatable victories.

05 Discussion & Limitations

šŸž Hook: Even great teams have limits: you can’t play everyone at once, and sometimes practice fields are small.

🄬 Limitations:

  • Batch-size-one learning: The current system updates online, one experience at a time. That limits parallelism and may slow large-scale training.
  • No formal convergence guarantees (yet): REFLECT and refactoring work well empirically, but we don’t have mathematical proofs about always finding the best symbolic fixes.
  • LLM dependence (modest): While the architecture drives learning, code synthesis and diagnosis still use an LLM backend. Weak LLMs could reduce patch quality.

🄬 Required resources:

  • A capable code LLM (e.g., gpt-5-mini-2025) for program synthesis and reflection.
  • Environment APIs (Mineflayer for Minecraft, a Crafter API) and logging to collect traces.
  • Modest compute for many small, frequent updates plus quick rollout validation during refactoring.

🄬 When not to use:

  • Ultra real-time control where even tiny delays are unacceptable (e.g., high-speed drones) or where continuous, differentiable control is essential.
  • Very small, single-shot tasks where a big evolving toolbox is overkill.
  • Domains with no stable APIs for executable skills (hard to make reliable pre/postconditions).

🄬 Open questions:

  • Theory: Can we formalize guarantees for symbolic projection, safety of refactors, and convergence of the two-phase optimizer?
  • Scale: How does PSN behave with large-batch or distributed training and richer domains (robotics, web workflows)?
  • Robustness: How to detect and handle rare but harmful refactors even faster? Can we auto-tune update gates?
  • Generality: How well do the same ideas transfer to other program spaces (e.g., robot scripts, web automation, spreadsheets)?

šŸž Anchor: The authors believe these are not brick walls—just milestones on the road to sturdier, larger, and more automated self-improving skill networks.

06 Conclusion & Future Work

šŸž Hook: Think of a student who keeps their notes as working code, fixes exact mistakes after every test, protects what they’ve mastered, and reorganizes their notebook to learn faster next time.

🄬 Three-sentence summary: This paper introduces Programmatic Skill Networks (PSN), where each skill is an executable program inside a compositional network that evolves through planning, reflection, and refactoring. Failures trigger trace-based fault localization and gated updates; successes invite safe structural cleanups—together mirroring neural network learning dynamics. In Minecraft and Crafter, PSN learned faster, reused better, forgot less, and kept its toolbox compact.

🄬 Main achievement: Showing that neural-style optimization principles (credit assignment, stabilization, architecture tuning) can be lifted into a symbolic world of programs to power continual, open-ended skill learning.

🄬 Future directions: Scale updates with more parallelism; add theory for guarantees; broaden to robotics and web domains; sharpen refactor discovery; and enhance safety/rollback policies.

🄬 Why remember this: It’s a blueprint for self-improving agents whose skills are real code you can inspect, fix, and reuse—growing steadily from simple tasks to complex missions without drowning in clutter or forgetting how they got there.

Practical Applications

  • Robotic assembly lines that evolve task skills (pick, place, fasten) and keep them reliable over months.
  • Game AIs that learn tech trees and quests, reusing and refining skills to master late-game goals.
  • Web agents that build a durable toolbox of login, scrape, transform, and fill-form routines with safe refactors.
  • Personal assistants that compose chores (shop → cook → clean) from reusable subskills without forgetting steps.
  • Educational tutors that grow and reorganize skill libraries for multi-topic problem solving.
  • Enterprise RPA bots that consolidate duplicate workflows and safely roll back risky changes.
  • Lab automation that turns protocols into reusable program-skills with execution traces and precise fixes.
  • Code assistants that maintain a living library of utility functions, refactoring wrappers and abstractions over time.
  • Warehouse bots that learn stable picking/packing routines while adding new product-specific skills.
  • Multi-agent systems that factor shared subskills (navigation, communication) for team-wide reuse.
#Programmatic Skill Network · #continual learning · #symbolic programs · #fault localization · #credit assignment · #backpropagation analogy · #maturity-aware update gating · #structural refactoring · #backward-chaining planner · #execution trace · #MineDojo · #Crafter · #skill reuse · #compositional generalization · #rollback validation · #LLM agents