Knowledge is Not Enough: Injecting RL Skills for Continual Adaptation

Intermediate
Pingzhi Tang, Yiding Wang, Muhan Zhang · 1/16/2026
arXiv · PDF

Key Summary

  • Big language models can learn new facts with simple tutoring (SFT), but that doesn’t automatically teach them how to use those facts well.
  • Reinforcement Learning (RL) teaches models how to think and act, but it is too expensive to redo every time the world changes.
  • The paper finds that SFT updates (facts) and RL updates (skills) change different, almost perpendicular parts of the model’s brain (nearly orthogonal).
  • Because of this, the authors extract a ‘Skill Vector’ from an RL-trained model and simply add it to a newly SFT-updated model in a new domain.
  • This plug-in approach is called Parametric Skill Transfer (PaST).
  • On SQuAD (closed-book), PaST boosts accuracy to 56.9%, beating the strong SEAL baseline by up to 9.9 points.
  • On LooGLE (very long documents), PaST adds an absolute +8.0 points, helping with long-range retrieval from memory.
  • On ToolBench (tool use), PaST raises zero-shot success by +10.3 points across 20 categories the RL never saw.
  • An iterative version of the Skill Vector makes it more general and less tied to any one dataset.
  • PaST is a fast, modular way to keep models updated with both new knowledge and reusable reasoning skills.

Why This Research Matters

Real-world knowledge changes fast, but retraining a model from scratch or re-running costly RL every time isn’t practical. PaST gives teams a simple, modular way to keep models both informed (new facts) and capable (reusable reasoning) without breaking the bank. It helps models not only ‘know’ the latest updates but also ‘use’ them under pressure, like handling errors or planning multi-step tool calls. This can cut latency and token costs compared to always fetching long contexts with RAG. It also enables organizations to share and reuse powerful reasoning skills across products and domains. In short, PaST makes continual adaptation faster, cheaper, and more reliable.

Detailed Explanation


01Background & Problem Definition

🍞 Hook: Imagine you studied a new science chapter last night (you memorized it), but during today’s quiz you still can’t solve the tricky questions that use those facts. You know, but you can’t use it well yet.

🥬 The Concept: Knowledge Cutoff (Parametric Memory)

  • What it is: Big language models store what they know inside their weights, and that stored knowledge stops at their last training date (their “knowledge cutoff”).
  • How it works:
    1. The model is trained on tons of data and the facts get baked into its parameters.
    2. After training ends, those parameters are frozen, so new events or tools aren’t inside.
    3. If you ask about new stuff, it guesses using old patterns and can be wrong.
  • Why it matters: Without a way to add new knowledge, the model falls behind the real world. 🍞 Anchor: If a model was trained in 2023, it might not know a 2026 rule change unless we update its weights.

🍞 Hook: You know how you can quickly look things up in a book instead of memorizing every page?

🥬 The Concept: Retrieval-Augmented Generation (RAG)

  • What it is: A method where the model fetches outside documents while answering.
  • How it works:
    1. Search for related passages.
    2. Stuff the best passages into the prompt.
    3. Generate an answer using both the prompt and the passages.
  • Why it matters: Without RAG, the model must rely only on its old memory. But RAG can be slow/expensive on very large libraries and long documents. 🍞 Anchor: Like searching a wiki each time you answer, which works but takes time and tokens.
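
To make the three-step loop concrete, here is a minimal, illustrative sketch in Python; the keyword retriever and the generate placeholder are toy stand-ins I introduce for illustration, not the paper’s system or any specific library API:

```python
# Minimal sketch of the RAG loop: search, stuff passages into the prompt, generate.

def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Rank documents by naive keyword overlap with the query (toy retriever)."""
    q_words = set(query.lower().split())
    scored = sorted(documents, key=lambda d: -len(q_words & set(d.lower().split())))
    return scored[:k]

def generate(prompt: str) -> str:
    """Placeholder for an LLM call (e.g., any chat-completion endpoint)."""
    return f"<answer conditioned on {len(prompt)} prompt characters>"

def rag_answer(query: str, documents: list[str]) -> str:
    passages = retrieve(query, documents)                       # 1. search for related passages
    prompt = "\n\n".join(passages) + "\n\nQuestion: " + query   # 2. stuff them into the prompt
    return generate(prompt)                                     # 3. generate with the extra context

docs = ["The 2026 rule change lowers the tax rate.", "Cats sleep most of the day."]
print(rag_answer("What is the 2026 rule change?", docs))
```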

🍞 Hook: Think of practicing with a tutor who gives you examples and correct answers so you can copy the pattern.

🥬 The Concept: Supervised Fine-Tuning (SFT)

  • What it is: A training method where the model learns from labeled examples to better mimic good outputs.
  • How it works:
    1. Gather input–answer pairs from the new topic.
    2. Nudge the model to produce those answers.
    3. The model memorizes patterns in the new data.
  • Why it matters: Without SFT, the model won’t internalize new facts into its weights. But SFT often teaches ‘what’ to say, not ‘how’ to reason. 🍞 Anchor: Like studying with answer keys; you learn the facts, but not always how to solve trickier problems.
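
As a rough illustration of a single SFT step (a sketch only, not the paper’s training setup), the snippet below nudges a small causal language model toward one labeled example; the model name "gpt2" and the example text are placeholders standing in for the paper’s Qwen2.5-7B-class models and domain data:

```python
# One supervised fine-tuning step: next-token cross-entropy on an input-answer pair.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; swap in your base model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

# A labeled example from the new domain (input plus desired answer).
text = "Q: What does the 2026 policy change? A: It lowers the filing threshold."
batch = tok(text, return_tensors="pt")

# Standard causal-LM loss: labels are the input ids themselves.
out = model(**batch, labels=batch["input_ids"])
out.loss.backward()
opt.step()
opt.zero_grad()
```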

🍞 Hook: Think of learning by trial and error with rewards—like a video game where you score points for good moves.

🥬 The Concept: Reinforcement Learning (RL)

  • What it is: A training method where the model explores actions and gets rewards for doing well, building reusable reasoning skills.
  • How it works:
    1. Try steps toward a goal (plan, act, observe).
    2. Get a reward signal (correct, incorrect, or quality score).
    3. Adjust the policy to make rewarded choices more likely.
  • Why it matters: Without RL, models can memorize facts but may not learn robust step-by-step problem solving and error handling. 🍞 Anchor: In tool use, RL helps a model recover from API errors instead of hallucinating a fake tool.
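
The sketch below shows the reward-weighted idea in its simplest, REINFORCE-style form; the paper’s actual RL algorithm, rewards, and judges are more involved, and the reward function here is a hypothetical placeholder:

```python
# REINFORCE-style sketch: sample an action sequence, score it, reinforce rewarded behavior.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # placeholder model
policy = AutoModelForCausalLM.from_pretrained("gpt2")
opt = torch.optim.AdamW(policy.parameters(), lr=1e-6)

prompt = tok("Call the weather API for Paris:", return_tensors="pt")
# 1. Try steps toward a goal: sample a continuation from the current policy.
sample = policy.generate(**prompt, do_sample=True, max_new_tokens=16,
                         pad_token_id=tok.eos_token_id)

# 2. Get a reward signal (placeholder: in practice, correctness or a successful tool call).
def reward_fn(token_ids) -> float:
    return 1.0

# 3. Make rewarded choices more likely: reward-weighted log-likelihood update
#    (simplified: the prompt tokens are included in the loss here).
out = policy(sample, labels=sample)
loss = reward_fn(sample) * out.loss
loss.backward()
opt.step()
```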

🍞 Hook: Imagine you updated your notes with new facts (SFT), but your problem-solving strategies (RL) live in a different notebook.

🥬 The Concept: The Gap (Knowledge vs. Skills)

  • What it is: SFT adds facts; RL builds reasoning. Doing only SFT often doesn’t teach the model how to use new knowledge well.
  • How it works:
    1. SFT lowers perplexity and memorizes domain text.
    2. But on tasks requiring planning or handling errors, performance can collapse.
    3. RL can fix this, yet it’s too costly to rerun for every new domain.
  • Why it matters: Without bridging this gap, models can ‘know’ but still fail to ‘do.’ 🍞 Anchor: A model might recall the Instagram downloader API’s name but panic when it sees a ‘private account’ error and start making up non-existent tools.

🍞 Hook: Think of two knobs on a machine that control different parts; turning one doesn’t affect the other.

🥬 The Concept: Why Past Work Struggled

  • What it is: Prior knowledge-updating methods improved what was stored but not how it was used under pressure.
  • How it works:
    1. Knowledge editing or SFT slipped in facts.
    2. Without RL skills, models were brittle in realistic tasks.
    3. RAG helped sometimes, but was slow/expensive at inference.
  • Why it matters: We needed a way to add skills once and reuse them many times while we keep adding new facts. 🍞 Anchor: We want a ‘skills plug-in’ that works even after updating the model with new textbooks.

02Core Idea

🍞 Hook: Imagine you have a ‘strategy badge’ you earned in one game. Now you can clip that badge onto your backpack for any new class—you don’t have to relearn the strategy every time.

🥬 The Concept: The Aha! Moment

  • What it is: The key insight is that SFT (facts) and RL (skills) change almost perpendicular parts of the model’s weights, so you can extract a Skill Vector from RL and simply add it to a newly SFT-updated model.
  • How it works:
    1. Train on a source domain with SFT (facts), then with RL (skills).
    2. Subtract: Skill Vector = (Source RL model) – (Source SFT model).
    3. In a new domain, do light SFT for new facts, then add the Skill Vector.
  • Why it matters: You get the ‘how-to-think’ upgrade without paying the price of new RL runs each time. 🍞 Anchor: Like plugging a universal ‘problem-solving module’ into a freshly updated knowledge base.

Multiple Analogies (3 ways):

  1. Lego analogy: SFT builds the castle (facts). RL builds a crane (skills). The crane snaps onto any new castle without rebuilding it.
  2. Sports analogy: You practiced teamwork (skills) with a soccer team. Now you join a new team (new facts about plays) and bring the teamwork ability with you.
  3. Kitchen analogy: You learned a reliable ‘taste-then-adjust’ cooking habit (skills) once. Now each new recipe (facts) benefits from that habit automatically.

🍞 Hook: Imagine two roads that never cross—one for facts, one for skills—so cars don’t crash.

🥬 The Concept: Orthogonality of Parameter Updates

  • What it is: The paper finds SFT updates and RL updates are nearly orthogonal (they point in different directions in weight space).
  • How it works:
    1. Measure cosine similarity layer-by-layer between SFT and RL weight changes.
    2. Find values near zero across modules and depths (unlike two SFT runs, which are correlated).
    3. Conclude skills and facts live in disentangled subspaces.
  • Why it matters: If the updates don’t interfere, we can add them like vectors without breaking either. 🍞 Anchor: Like tuning treble and bass separately—adjusting one doesn’t ruin the other’s sound.
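
A minimal sketch of how such a check can be run, assuming you have the base, SFT, and RL checkpoints as PyTorch state dicts; the toy tensors below just stand in for real model weights:

```python
# Per-layer cosine similarity between the SFT update and the RL update.
import torch
import torch.nn.functional as F

def update_direction(finetuned: dict, reference: dict) -> dict:
    """Weight delta per parameter tensor: finetuned - reference."""
    return {k: finetuned[k] - reference[k] for k in reference}

def layerwise_cosine(delta_a: dict, delta_b: dict) -> dict:
    """Cosine similarity of two update directions, computed layer by layer."""
    return {k: F.cosine_similarity(delta_a[k].flatten(),
                                   delta_b[k].flatten(), dim=0).item()
            for k in delta_a}

# Toy stand-ins; in practice these come from model.state_dict() of each checkpoint.
base = {"layer0.weight": torch.randn(8, 8)}
sft  = {"layer0.weight": base["layer0.weight"] + 0.01 * torch.randn(8, 8)}
rl   = {"layer0.weight": sft["layer0.weight"] + 0.01 * torch.randn(8, 8)}

cos = layerwise_cosine(update_direction(sft, base), update_direction(rl, sft))
print(cos)  # values near zero indicate nearly orthogonal updates
```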

🍞 Hook: Think of a small arrow that points exactly in the direction of ‘better reasoning.’

🥬 The Concept: Skill Vector (Task Vector)

  • What it is: A vector formed by subtracting the SFT-only model from the RL-refined model on the same source domain.
  • How it works:
    1. Train base → SFT on source docs.
    2. Continue training with RL to learn robust reasoning.
    3. Subtract parameters: v_skill = θ_S^rl − θ_S^sft.
  • Why it matters: This arrow captures ‘how to manipulate knowledge’ rather than the knowledge itself. 🍞 Anchor: It’s like the difference between ‘I know the recipe’ and ‘I know how to rescue a salty soup.’
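
In code, the extraction is just parameter-wise subtraction between two checkpoints of the same architecture; the state dicts below are toy stand-ins for the real source-domain checkpoints:

```python
import torch

def extract_skill_vector(rl_state: dict, sft_state: dict) -> dict:
    """Parameter-wise difference v_skill = θ_S^rl - θ_S^sft (same architecture required)."""
    return {name: rl_state[name] - sft_state[name] for name in sft_state}

# Toy stand-ins; in practice these are the state_dict()s of the two source checkpoints,
# e.g. sft_state = torch.load("source_sft.pt") and rl_state = torch.load("source_rl.pt")
# (hypothetical file names).
sft_state = {"mlp.weight": torch.zeros(4, 4)}
rl_state  = {"mlp.weight": 0.02 * torch.ones(4, 4)}
v_skill = extract_skill_vector(rl_state, sft_state)
```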

🍞 Hook: Picture sliding a reusable skill badge into a new backpack.

🥬 The Concept: Parametric Skill Transfer (PaST)

  • What it is: A plug-in method that adds the Skill Vector to a target model that has just learned new facts via SFT.
  • How it works:
    1. Extract v_skill from a source RL–SFT pair.
    2. Do lightweight SFT on target docs to internalize new facts.
    3. Add: θ_final = θ_T^sft + λ·v_skill (λ≈1).
  • Why it matters: You instantly boost reasoning and execution on the new knowledge, skipping expensive RL. 🍞 Anchor: Like attaching a universal ‘error-handling and planning’ module to any freshly updated knowledge base.
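
The injection step is equally small: add the scaled Skill Vector to the target model’s SFT weights. The function below is a sketch with λ = 1 as the default, matching the setting reported in the paper; the checkpoint names in the comments are hypothetical:

```python
import torch

def inject_skill(target_sft_state: dict, v_skill: dict, lam: float = 1.0) -> dict:
    """PaST composition: θ_final = θ_T^sft + λ·v_skill, applied tensor by tensor."""
    return {name: target_sft_state[name] + lam * v_skill[name]
            for name in target_sft_state}

# Toy stand-ins; in practice these are full model state dicts, e.g.
# target_sft_state = torch.load("target_sft.pt") and v_skill from the extraction step,
# after which you call model.load_state_dict(inject_skill(target_sft_state, v_skill)).
target_sft_state = {"mlp.weight": torch.ones(4, 4)}
v_skill = {"mlp.weight": 0.02 * torch.ones(4, 4)}
theta_final = inject_skill(target_sft_state, v_skill)
```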

Before vs After:

  • Before: Each new domain needed careful RL to get robust reasoning on top of fresh facts, which was slow and costly.
  • After: Do a quick SFT for new facts, then add the already-learned Skill Vector for reasoning—fast and effective.

Why It Works (Intuition, no equations):

  • The facts signal and the skills signal don’t overlap much inside the network. Because their directions are almost perpendicular, adding them doesn’t blur either one. High-dimensional geometry keeps the signals separate during processing, so they can peacefully coexist.

Building Blocks:

  • SFT (adds facts), RL (adds skills), Orthogonality (proves they’re separable), Skill Vector (captures skills), PaST (adds skills-once, reuse everywhere).

03Methodology

At a high level: Input → Stage I (Source SFT) → Stage I (Source RL) → Extract Skill Vector → Stage II (Target SFT) → Inject Skill Vector → Output (Target model with new facts + reusable skills)

Step-by-step (what, why, example):

  1. Source SFT (anchor knowledge)
  • What happens: Fine-tune the base model on source-domain documents to internalize their facts: θ_S^sft.
  • Why this step exists: We need a clean ‘facts-only’ version of the source to compare against after RL.
  • Example: Train on a set of long manuals so the model knows the content well (memorizes terms, definitions, and structures).
  • What breaks without it: If we skip it, subtracting from a non-anchored base mislabels which changes came from RL skills.
  2. Source RL (learn reasoning)
  • What happens: Continue training the same model with RL on tasks/trajectories from the source domain to learn robust reasoning and error recovery: θ_S^rl.
  • Why this step exists: RL distills ‘how to think/act’—planning, checking, handling failures.
  • Example: In tool use, the agent gets rewards for correct format, successful calls, and solving the user’s intent; in QA, it gets judged for correct answers.
  • What breaks without it: We’d have facts but no reusable skills to transfer.
  3. Extract Skill Vector
  • What happens: Compute v_skill = θ_S^rl − θ_S^sft.
  • Why this step exists: The subtraction cancels facts held in common and isolates the extra ‘procedural’ improvements from RL.
  • Example: Like taking two nearly identical instruction books and highlighting only the extra margin notes teaching tricks and safeguards.
  • What breaks without it: Trying to copy skills without subtracting risks copying back some source-domain facts (overfitting or interference).
  4. Target SFT (lightweight knowledge update)
  • What happens: Fine-tune the base (or compatible) model on new target documents to add fresh facts: θ_T^sft.
  • Why this step exists: We need the model to know the new domain’s content before adding reasoning skills.
  • Example: Feed a new set of FAQs or API schemas; the model learns what each item means.
  • What breaks without it: Injecting skills into a model that doesn’t know the target facts won’t help much.
  5. Inject Skill Vector (post-hoc composition)
  • What happens: Add θ_final = θ_T^sft + λ·v_skill (λ set to 1 in experiments).
  • Why this step exists: This is the ‘plug-in’ moment that equips the target model with reusable reasoning/execution patterns.
  • Example: The model now not only knows the new API names but can also handle errors and plan multi-step calls.
  • What breaks without it: The target model often stays brittle—knows terms but stumbles under pressure.
  6. Output (ready-to-use model)
  • What happens: You get a target-domain model that can both recall newly learned facts and use them with robust logic.
  • Why this step exists: The goal is practical performance without running RL again.
  • Example: Closed-book QA: answers correctly from weights; Tool use: calls the right API, handles failures, and finishes the task.
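
Putting steps 1–6 together, a compact orchestration sketch might look like the following; sft_train and rl_train are hypothetical callables standing in for whatever SFT/RL stack is used, and only the ordering and the two arithmetic operations reflect the method itself:

```python
def past_pipeline(base_state, source_docs, source_tasks, target_docs,
                  sft_train, rl_train, lam=1.0):
    """Compose PaST end to end. sft_train/rl_train are user-supplied callables
    that take (starting state dict, data) and return a trained state dict."""
    # Stage I: source domain (done once).
    theta_sft_src = sft_train(base_state, source_docs)        # step 1: anchor source facts
    theta_rl_src  = rl_train(theta_sft_src, source_tasks)     # step 2: learn skills with RL
    v_skill = {k: theta_rl_src[k] - theta_sft_src[k]          # step 3: extract Skill Vector
               for k in theta_sft_src}

    # Stage II: any new target domain (repeatable, no new RL run).
    theta_sft_tgt = sft_train(base_state, target_docs)        # step 4: lightweight SFT on new facts
    theta_final = {k: theta_sft_tgt[k] + lam * v_skill[k]     # step 5: inject skills (λ = 1 in the paper)
                   for k in theta_sft_tgt}
    return theta_final                                        # step 6: ready-to-use target model
```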

🍞 Hook: Imagine polishing a universal strategy badge over multiple mini-tournaments so it works anywhere.

🥬 The Concept: Iterative Skill Refinement

  • What it is: Learn and extract the Skill Vector across several different source subsets, refreshing it each time to make it more general.
  • How it works:
    1. Split the source data into K parts.
    2. For each part: do SFT, add the previous skill vector, run RL, re-extract an updated skill vector.
    3. Repeat, making the vector less tied to any one dataset’s quirks.
  • Why it matters: A single-shot vector can overfit; iteration finds content-agnostic reasoning patterns. 🍞 Anchor: Like practicing strategy with different teams so your teamwork works with anyone.
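
One reasonable reading of this loop as code (a sketch only, with hypothetical sft_train/rl_train callables and the source data pre-split into (docs, tasks) pairs):

```python
def refine_skill_vector(base_state, source_splits, sft_train, rl_train, lam=1.0):
    """source_splits: list of (docs, tasks) pairs; sft_train/rl_train return state dicts."""
    v_skill = {k: v * 0.0 for k, v in base_state.items()}      # start from a zero vector
    for docs, tasks in source_splits:                          # K parts of the source data
        theta_sft  = sft_train(base_state, docs)               # 1) SFT on this split's facts
        theta_init = {k: theta_sft[k] + lam * v_skill[k]       # 2) add the previous skill vector
                      for k in theta_sft}
        theta_rl   = rl_train(theta_init, tasks)               # 3) run RL from that starting point
        v_skill    = {k: theta_rl[k] - theta_sft[k]            # 4) re-extract an updated vector
                      for k in theta_sft}
    return v_skill
```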

The Secret Sauce:

  • Orthogonality: Facts and skills live in different subspaces, so vector addition works cleanly.
  • Post-hoc composition: First anchor facts in the target with SFT, then add skills. Ablations show this beats pre-injecting skills or sequentially fine-tuning an RL model on target facts.
  • Simple scaling: A single λ worked well (set to 1), hinting the method is robust; future work can auto-tune λ.

Concrete mini-examples:

  • Closed-book QA (SQuAD): The model reads a passage during SFT, then, with injected skills, pulls the exact answer from memory rather than giving a generic reply.
  • Long-context QA (LooGLE): After SFT encodes 21k-token docs, skills help the model navigate and retrieve the right details from its weights.
  • Tool use (ToolBench): After SFT internalizes API names/schemas, skills help it plan multi-step calls and recover from API errors instead of hallucinating tools.

04Experiments & Results

The Tests (what and why):

  • SQuAD (closed-book knowledge incorporation): Measures if the model can answer questions by retrieving from its updated weights, not by re-reading context. Why: Tests true parametric updates.
  • LooGLE (long-context QA): Documents average >21k tokens. Why: Checks if the method scales to massive information and reduces hallucinations when retrieving from memory.
  • ToolBench (closed-book tool use): Only API names are shown; the model must recall parameters and plan actions from memory. Why: Tests real ‘do the thing’ execution skills and error handling.

The Competition (baselines):

  • Base model (no adaptation).
  • Passage-only SFT; SFT with synthetic data; SFT with GPT-4.1 data.
  • SEAL (state-of-the-art self-editing SFT baseline).

Scoreboard and Meaning:

  • SQuAD (no context in prompt):
    • Target SFT + Synthetic baseline: 39.7%.
    • SEAL: 47.0%.
    • PaST (single-round): ~50.8% (+11.1 over its SFT baseline).
    • PaST (two-round/iterative): 56.9%, about a 9.9-point win over SEAL. This feels like moving from a solid B to an A- while others stay around B-.
  • LooGLE (Short Dependency QA):
    • Target SFT: 30.1%.
    • PaST after Round 1 (from 5 source docs): 35.0% (+4.9).
    • PaST after Round 2: 38.1% (total +8.0). That’s like finding the right answer an extra 8 times out of every 100 very long questions.
  • ToolBench (StableToolBench, 20 target categories, zero-shot):
    • Target SFT average: 21.9%.
    • PaST average: 32.2% (+10.3 points). In multiple categories where SFT scored 0%, PaST lights up non-zero success (e.g., Advertising 0→16.7%, SMS 0→11.1%). It wins in all 20 categories: broad, positive transfer.

Surprising Findings:

  • Post-hoc injection beats alternatives: Adding the Skill Vector after target SFT works best. Pre-injection (before SFT) helps but less; sequentially fine-tuning an RL model on target facts can even underperform, likely due to optimization conflicts.
  • Small source data can go far: Even a Skill Vector from just 5 long documents gave a big lift on LooGLE, revealing strong cross-domain generality.
  • Iteration matters: Iterative refinement outperforms simply using more data in a single round; it seems to scrub away content-specific quirks and sharpen reusable procedures.

Takeaway:

  • Across three very different settings—short QA, ultra-long QA, and multi-step tool use—PaST consistently turns ‘I know’ into ‘I can use what I know,’ and does so without re-running expensive RL in each new domain.

05Discussion & Limitations

Limitations (honest look):

  • Domain breadth: Experiments cover QA and tool use, but more domains (coding assistants, robotics, safety-critical workflows) need testing.
  • Fixed scaling (λ=1): A constant injection strength worked well here, but some domains might need tuning for best results.
  • Architecture range: Shown on Qwen2.5-7B(-Instruct). It is likely general, but should be validated on more model families and sizes.
  • Source RL still costs: You need at least one good source RL run (with rewards/judges/simulators). PaST saves you from repeating RL for every new target, but it doesn’t remove the initial RL investment.
  • Judge dependence: Some evaluations rely on model-based judges (e.g., GPT-4.1). While common in research, human or grounded auto-metrics would add confidence.

Required resources:

  • Base LLM checkpoints compatible across SFT/RL phases (e.g., Qwen2.5-7B family).
  • GPUs (A100-class) for efficient SFT and RL source training.
  • Optional reward models/judges (e.g., GPT-4.1) and, for tool use, an API simulator (e.g., GPT-4o-mini) or a sandbox.

When NOT to use PaST:

  • If simple RAG is fast, cheap, and sufficient for your workload (no need for parametric updates).
  • If the target domain is wildly different from the source skills (e.g., skills from math proofs applied to musical composition), where transfer could be weak.
  • In ultra safety-critical domains without extensive validation; composing vectors changes behavior globally, so careful auditing is required.
  • When you can’t afford the initial source RL at all.

Open questions:

  • Automatic λ selection: Can we learn the best per-layer or per-module scaling for skill injection?
  • Multi-skill composition: How well do several Skill Vectors (planning, math, coding) add up? Do they remain orthogonal?
  • Lifelong skill hubs: Could we build a public ‘skill library’ where teams share portable skill vectors for many tasks?
  • Safety and guardrails: How to ensure transferred skills respect domain-specific constraints and policies?
  • Theory: Can we more precisely predict where (layers/heads) skill vectors live and how they interact with attention routes?

06Conclusion & Future Work

Three-sentence summary:

  • The paper shows that updates from SFT (facts) and RL (skills) are nearly orthogonal, so their effects don’t clash.
  • Using this, the authors extract a reusable Skill Vector from a source RL-trained model and add it to a newly SFT-updated target model.
  • This PaST method improves closed-book QA and tool use without re-running RL in every new domain.

Main achievement:

  • A simple, modular, and empirically strong recipe, θ_final = θ_T^sft + λ·v_skill, that turns ‘I learned the facts’ into ‘I can use the facts,’ delivering large gains on SQuAD, LooGLE, and ToolBench.

Future directions:

  • Learn per-layer λ, compose multiple skill vectors, expand to more domains and architectures, and develop safer, audited deployment practices.

Why remember this:

  • PaST reframes adaptation: don’t just add facts—plug in skills you’ve already mastered. This could make continual learning cheaper, faster, and more reliable across fast-changing real-world tasks.

Practical Applications

  • Keep enterprise assistants updated with new company policies while reusing a single, proven reasoning Skill Vector.
  • Upgrade internal developer bots when APIs change, so they recall new schemas and still plan robust multi-step tool calls.
  • Improve customer support chatbots after product launches, enabling confident, closed-book answers without huge context windows.
  • Maintain long-document QA systems (handbooks, regulations) that can recall precise clauses from memory instead of re-reading everything.
  • Distribute a shared ‘reasoning plug-in’ across multiple business units, each with their own SFT-updated knowledge.
  • Rapidly adapt agents to new tool categories (finance, mapping, email) without running fresh RL, reducing time-to-deploy.
  • Build a library of Skill Vectors (planning, error recovery, math reasoning) and combine them for specialized tasks.
  • Lower inference costs vs. heavy RAG by relying more on parametric retrieval, especially for stable, frequently used knowledge.
  • Speed up A/B experimentation: swap in/out different Skill Vectors to test reasoning behaviors without retraining from scratch.
#Parametric Skill Transfer#Skill Vector#Task Arithmetic#Orthogonality of Updates#Supervised Fine-Tuning#Reinforcement Learning#Closed-Book QA#Tool Use Agents#Knowledge Updating#Continual Adaptation#SQuAD#LooGLE#ToolBench#Post-hoc Composition#Iterative Skill Refinement