Privileged Information Distillation for Language Models
Key Summary
- The paper shows how to train a language model with special extra hints (privileged information) during practice so it can still do well later without any hints.
- It introduces two training tricks, π-Distill and OPSD, that use the same model as both a teacher (with hints) and a student (without hints) to share skills.
- These tricks work even when the expert model hides its step-by-step thoughts and only shows the final actions it took.
- Across travel-planning and customer-support tool-use tasks, π-Distill and often OPSD beat the usual method of supervised fine-tuning plus reinforcement learning.
- They also generalize to new, out-of-domain tool-use tasks better than standard reinforcement learning and often better than methods that need full chain-of-thought.
- Success depends on how useful the hints are and how different the teacher’s behavior (with hints) is from the student’s behavior (without hints).
- Joint training (teacher and student together) is the most stable option and avoids common failure modes.
- A small KL penalty (a measure of how far two behaviors differ) helps keep training stable and prevents collapse.
- These methods reduce reliance on closed-source models’ hidden reasoning, making distillation from action traces practical.
- The authors release code and provide detailed analyses on when each method works best.
Why This Research Matters
Many real assistants must solve long, multi-step tasks like planning travel or fixing customer issues, and a single mistake can ruin the outcome. These methods let us learn robust behaviors from strong experts even when the experts hide their step-by-step thoughts, using only the actions they took. That makes training cheaper, more practical, and more privacy-friendly because we don’t need proprietary reasoning traces. Better training from action-only data means faster iteration for companies and safer models that avoid reward hacking and collapse. The approaches generalize to new tasks, so one round of training can boost performance across many tools and domains. In short, this work turns hidden expert behavior into real, dependable skills for everyday AI agents.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): You know how a coach may whisper special tips to a player during practice, but on game day the player has to play without those whispers? The goal is to practice with extra help so you can still win without it.
🥬 The Concept: Privileged Information (PI)
- What it is: Extra, helpful hints only available during training, not during real use.
- How it works (recipe):
- Gather expert runs that succeeded on a task (like a travel plan).
- Turn those runs into helpful training hints (for example, a list of tool calls or a tiny summary hint).
- Let a "teacher" model see the hints while a "student" model trains to act without them.
- Why it matters: Without PI, weaker models may never stumble onto successful solutions in long, tricky tasks, so they can’t learn what “good” looks like.
🍞 Bottom Bread (Anchor): A travel agent model gets a hint in practice: “First check the user’s dates, then call the hotel search with city=Paris.” Later, without the hint, it remembers the pattern and succeeds.
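As a concrete illustration, here is a minimal sketch of how an action-only expert run could be converted into the PI formats described later in the methodology; the trace contents, field names, and hint text are all made up for this example:

```python
# A minimal sketch (trace contents, field names, and hint text are hypothetical)
# of turning an action-only expert trace into privileged information (PI).

expert_trace = [
    {"tool": "search_hotels", "args": {"city": "Paris", "checkin": "2025-06-01"}},
    {"tool": "check_availability", "args": {"hotel_id": 42}},
    {"tool": "book_hotel", "args": {"hotel_id": 42, "nights": 3}},
]

# PI type 1: tool calls with arguments
pi_calls_and_args = [f'{step["tool"]}({step["args"]})' for step in expert_trace]

# PI type 2: tool calls only (argument values dropped)
pi_calls_only = [step["tool"] for step in expert_trace]

# PI type 3: a short strategy hint summarizing the successful run
pi_hint = "Check the user's dates, search hotels in the requested city, then book."

print(pi_calls_and_args)
print(pi_calls_only)
print(pi_hint)
```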
🍞 Top Bread (Hook): Imagine learning to ride a bike. You try, fall, and try again, getting better because good tries feel rewarding.
🥬 The Concept: Reinforcement Learning (RL)
- What it is: A way for models to learn by trying actions and getting rewards or penalties.
- How it works:
- The model tries a sequence of steps (like using tools in a conversation).
- The environment gives a score (reward) based on success.
- The model adjusts to make good scores more likely next time.
- Why it matters: Without RL, the model can’t easily learn long, multi-step tasks where success only appears at the end.
🍞 Bottom Bread (Anchor): Booking a trip needs many steps. RL rewards the model when the whole plan satisfies the user, nudging it toward better future plans.
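To make "success only appears at the end" concrete, here is a toy sketch (the action names and the success rule are invented) showing how a single end-of-episode score becomes the learning signal for every step in the episode:

```python
# Toy sketch: the environment scores nothing until the episode ends, and that
# single outcome score is what every step in the episode learns from.
# Action names and the success rule are invented for illustration.

good_episode = ["check_dates", "search_hotels", "book_hotel"]   # plan satisfied the user
bad_episode = ["search_hotels", "book_hotel", "apologize"]      # skipped the date check

def episode_reward(episode):
    # Stand-in for the environment's end-of-episode success check.
    return 1.0 if episode[0] == "check_dates" else 0.0

for episode in (good_episode, bad_episode):
    reward = episode_reward(episode)
    # Every action in the episode is nudged by the same final score.
    print([(action, reward) for action in episode])
```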
🍞 Top Bread (Hook): When you solve a math problem, you think step by step, even if you only write the final answer.
🥬 The Concept: Chain-of-Thought (CoT)
- What it is: The hidden, step-by-step reasoning a model uses before giving an action or answer.
- How it works:
- The model thinks internally (tokens of reasoning not always shown).
- It decides the next action (like a tool call).
- It repeats this across turns to finish the task.
- Why it matters: If we can’t see CoT (many closed models hide it), classic “copy the reasoning” training won’t work; we need another way.
🍞 Bottom Bread (Anchor): An expert model silently reasons: “Check restaurant hours → filter by vegan options → book at 7 pm,” but only shows the final booking action. We must learn from those actions alone.
🍞 Top Bread (Hook): Think of a board game where each move depends on the whole history so far and a bit of luck.
🥬 The Concept: Markov Decision Process (MDP)
- What it is: A formal way to describe decision-making over time with states, actions, and rewards.
- How it works:
- State = everything seen so far in the conversation.
- Action = the model’s next message or tool call.
- Reward = how well the final outcome matches the goal.
- Why it matters: Framing tool-use chats as an MDP lets us use RL reliably.
🍞 Bottom Bread (Anchor): In customer support, the “state” is the transcript so far, the “action” is the next tool call, and the “reward” is whether the customer’s issue is solved.
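A toy sketch of how the state, action, and reward pieces line up for a support conversation; the transcript contents and the resolution check are invented, not taken from the paper:

```python
# Toy mapping of a customer-support conversation onto the MDP pieces above.
# Transcript contents and the resolution check are illustrative only.

state = [  # state: everything seen so far in the conversation
    {"role": "user", "content": "My order never arrived."},
    {"role": "assistant", "content": "find_user(email='a@b.com')"},
]

action = {"role": "assistant", "content": "get_order(id=3)"}  # action: the next tool call

def reward(final_state):
    # reward: only computed at the end, when the environment checks the outcome
    return 1.0 if final_state and final_state[-1].get("resolved") else 0.0

state.append(action)  # taking an action extends the state (the transcript grows)
print(f"{len(state)} turns so far; reward so far: {reward(state)}")
```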
The World Before: Language models could talk, but many struggled in long, multi-turn tasks that need the right sequence of tool calls. Distillation from experts usually copied both what the expert did and how it reasoned (full CoT). But closed-source frontier models often hide their CoT.
The Problem: If you only see the expert’s actions (what buttons they pressed) and not their internal thoughts, how do you teach a smaller model to do the same? And if you give hints during training (PI), how do you make sure the model still performs when the hints are gone at test time?
Failed Attempts:
- Just fine-tuning on expert outputs without CoT often underperforms: the model mimics surface patterns and misses deeper strategy.
- Sequential pipelines (first train a "with-PI" teacher, then distill into a student) can be unstable, off-policy, and hard to tune (which checkpoint to copy?).
The Gap: We needed a training method that (1) learns from hints in practice, (2) works without hints at test time, and (3) doesn’t require seeing the expert’s hidden CoT.
Real Stakes:
- Travel planning and customer support agents must coordinate many steps; better training means fewer mistakes for users.
- Companies can benefit from action logs of strong (but closed) models without needing proprietary CoT.
- Safer, more stable learning methods reduce “reward hacking” and regressions as models train longer.
02 Core Idea
🍞 Top Bread (Hook): Imagine training wheels. They help you balance at first, but later you ride smoothly without them because you learned the right motions.
🥬 The Concept: Knowledge Distillation (as a teacher–student setup)
- What it is: A way to transfer skills from a stronger guide (teacher) to a learner (student).
- How it works:
- The teacher shows examples or provides guidance.
- The student practices matching good behaviors.
- Over time, the student internalizes the skill and no longer needs the teacher.
- Why it matters: Distillation is the bridge that moves skills from extra-help conditions (training) to normal conditions (testing).
🍞 Bottom Bread (Anchor): A piano teacher plays a tricky piece with tips; the student practices until they can play it solo.
Aha! Moment (one sentence): Train one shared model to act as both a PI-seeing teacher and a no-PI student at the same time, so skills learned with hints transfer into a policy that works without hints.
Three Analogies:
- Bike with training wheels: The teacher uses wheels (PI) while the student practices balance; because both are the same bike (shared model), balance skills transfer.
- Whispered rehearsal vs. live performance: In rehearsal (teacher), a pro whispers cues (PI). On stage (student), no whispers—but the timing sticks because the singer trained with both minds in sync.
- Cooking with recipe cards: You first cook with a detailed recipe (PI, teacher mode), then repeat from memory (student mode); practicing both together makes the memory strong.
Before vs After:
- Before: Distilling from closed experts needed CoT or separate teacher/student models and tricky pipelines; action-only traces weren’t enough to get top performance.
- After: π-Distill (and often OPSD) let you learn from action-only hints during training and perform strongly without hints, beating or matching methods that required full CoT.
🍞 Top Bread (Hook): You know how we compare two drawings to see how different they are?
🥬 The Concept: Reverse KL Divergence
- What it is: A number that says how different one behavior (distribution) is from another, measured in a way that prefers picking one good mode and avoiding spread-out errors.
- How it works:
- Look at the student’s choices.
- Compare how likely those choices are under the teacher.
- Penalize the student when it strays from high-quality teacher behavior.
- Why it matters: Without this, the teacher and student can drift apart, making transfer weak or unstable.
🍞 Bottom Bread (Anchor): If the teacher favors the action “call get_order(id=3)” often, reverse KL nudges the student to also favor that exact good call.
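A small numeric sketch of reverse KL may help; the action names and probabilities below are invented, and real training computes this per token over the model's whole vocabulary:

```python
# Numeric sketch of reverse KL divergence, KL(student || teacher): it averages
# log(p_student / p_teacher) under the *student's* own choices, so the student
# is punished most for putting mass where the teacher puts very little.
# Actions and probabilities are illustrative only.
import math

def reverse_kl(student, teacher):
    return sum(p * math.log(p / teacher[a]) for a, p in student.items() if p > 0)

teacher = {"call get_order(id=3)": 0.9, "ask for email": 0.08, "give up": 0.02}

focused_student = {"call get_order(id=3)": 0.8, "ask for email": 0.15, "give up": 0.05}
scattered_student = {"call get_order(id=3)": 0.2, "ask for email": 0.2, "give up": 0.6}

print(reverse_kl(focused_student, teacher))    # small: student tracks the teacher's good mode
print(reverse_kl(scattered_student, teacher))  # large: student spreads onto bad actions
```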
Building Blocks of the Core Idea:
- Shared-parameter teacher and student: One model, two modes—teacher sees PI, student does not.
- Joint optimization (π-Distill):
- Teacher mode: Learn to succeed with PI but stay close to the student so transfer is easy.
- Student mode: Learn from teacher’s successful trajectories, even without seeing PI.
- On-policy alternative (OPSD): The student acts; a reverse-KL penalty keeps it close to a PI-conditioned teacher, turning teacher agreement into a dense learning signal.
- Managing gap (distribution shift): KL regularization keeps teacher-with-PI behavior from straying too far from student-without-PI behavior.
Why It Works (intuition):
- The teacher can reach good solutions thanks to PI (like a spotlight on useful paths).
- Because teacher and student share parameters and are nudged close with KL, the student picks up the useful patterns.
- Over time, the student learns to find those good paths even when the spotlight (PI) is turned off.
🍞 Bottom Bread (Anchor): In Travel Planner, the teacher (with hints like “search restaurants, then verify dietary needs”) discovers good sequences. The student, trained alongside and kept close by KL, learns to do those sequences without seeing the hint.
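To ground the "one model, two modes" idea, here is a minimal sketch of how the same weights can be run as a PI-seeing teacher or a no-PI student simply by changing the prompt; the prompt wording, task, and hint are hypothetical:

```python
# Minimal sketch of shared-parameter teacher and student: same model, two
# prompts. Only the teacher's prompt contains the privileged information.
# Prompt wording, task, and hint are hypothetical.

def build_prompt(task, privileged_info=None):
    system = "You are a travel-planning assistant with access to tools."
    if privileged_info is not None:  # teacher mode: hints visible during training only
        system += f"\nHint (not available at test time): {privileged_info}"
    return [{"role": "system", "content": system},
            {"role": "user", "content": task}]

task = "Plan a 3-day Paris trip within the user's budget."
pi = "search_restaurants -> verify_dietary_needs -> book_hotel"

teacher_prompt = build_prompt(task, privileged_info=pi)  # sees the hint
student_prompt = build_prompt(task)                      # same weights, no hint
print(teacher_prompt[0]["content"])
print(student_prompt[0]["content"])
```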
03 Methodology
High-Level Recipe: Input (task + optional PI) → Teacher rollout (with PI) → Student learning (from teacher traces) → Output (a student policy that works without PI)
🍞 Top Bread (Hook): Imagine practicing a play. During rehearsal, a prompter feeds you lines (PI), but you also rehearse performing without the prompter so opening night goes smoothly.
🥬 The Concept: π-Distill (Privileged Information Distillation)
- What it is: A joint teacher–student training method using one shared model; teacher sees PI, student does not.
- How it works (step by step):
- Teacher rollout with PI: Sample multi-turn trajectories using the teacher mode (model gets the PI in the prompt/system message).
- Reward and stabilize: Score each trajectory by the environment’s reward; add a KL term pushing the teacher to stay near the student (prevents drifting too far).
- Student learning from teacher: Train the student on the teacher’s high-reward trajectories with off-policy RL, using importance weights to correct for sampling bias.
- Joint objective: Mix teacher and student updates with a knob α (0=only student focus, 1=only teacher focus, 0.5=both) and a KL weight β.
- Share parameters: Improvements in representations (like how to use tools) benefit both modes.
- Why it matters: Without joint training, you must pick a teacher checkpoint to copy, risk instability, and often get worse transfer.
🍞 Bottom Bread (Anchor): For a retail support chat, the teacher (with PI listing tool calls) navigates a shortest successful path. The student learns from those paths and later replicates them without PI.
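A minimal sketch of how the joint objective could combine the two modes. This is one plausible formulation consistent with the description above (α mixing the two losses, β weighting the KL term), not the paper's exact implementation, and the loss values are placeholders:

```python
# One plausible way to combine the pi-Distill pieces described above:
# alpha mixes the teacher-mode and student-mode losses, beta weighs the KL
# term keeping the PI-seeing teacher close to the no-PI student.
# The numeric loss values below are placeholders, not real measurements.

def pi_distill_joint_loss(teacher_loss, student_loss, kl_teacher_to_student,
                          alpha=0.5, beta=0.01):
    """alpha=1 focuses on the teacher mode, alpha=0 on the student mode."""
    return (alpha * teacher_loss
            + (1.0 - alpha) * student_loss
            + beta * kl_teacher_to_student)

# Example: joint training (alpha=0.5) with a small KL penalty.
print(pi_distill_joint_loss(teacher_loss=1.8, student_loss=2.3,
                            kl_teacher_to_student=4.0))
```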
Detailed Steps with Purpose and Examples:
- Collect action-only expert traces and turn them into PI formats:
- Tool calls & arguments (e.g., get_order(id=3)).
- Tool calls only (function names, arguments omitted).
- Self-generated hints (a short strategy summary made from an expert run).
- Why this step: Without usable PI, the teacher can’t get a strong head start in hard tasks.
- Example: “Look up user ID by email; then fetch order; then confirm shipping address.”
- Teacher rollout with PI:
- The model runs in teacher mode (the prompt contains PI) to sample trajectories.
- Compute the environment reward (success or failure) and add a reverse-KL penalty to keep the teacher near the student.
- Why this step: If the teacher drifts too far, the student can’t learn from those far-away behaviors.
- Example: The teacher solves a Travel Planner task in 5 steps; reverse KL nudges it away from wild moves the student would never produce.
- Compute advantages with GRPO (group-based policy optimization), as sketched after this list:
- Compare each trajectory’s reward to the group average; use clipped importance ratios per token for stable updates.
- Why this step: Stabilizes learning and reduces variance in long text sequences.
- Example: Among 4 sampled plans, the best one gets a positive advantage; the worse ones get negative advantages.
- Student learning from the teacher’s trajectories:
- Use off-policy RL with importance weights so the student can learn from teacher-sampled data.
- Add a forward KL (or reverse, depending on the objective) from student to teacher as needed for stability.
- Why this step: Transfers success patterns from the teacher (with PI) to the student (without PI).
- Example: The student increases the probability of the sequence “id_by_email → get_order → confirm” even without seeing the hint.
- Joint update:
- Combine the teacher objective and the student objective via α.
- Tune β (KL strength) to avoid collapse (teacher = student) or runaway drift.
- Why this step: Balances learning to use PI and learning to act without PI.
- Example: α=0.5 is often the most stable setting: both teacher and student improve together.
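As referenced in the GRPO step above, here is a minimal sketch of group-relative advantages in a common GRPO-style formulation (standardize each trajectory's reward within its sampled group); the clipped per-token importance ratios are noted in a comment rather than implemented, and the rewards are made up:

```python
# Minimal sketch of group-relative advantages in the GRPO style described
# above: each trajectory's reward is compared against its group's statistics.
# In full training these advantages multiply clipped per-token importance
# ratios; that part is omitted to keep the sketch short.

def group_relative_advantages(rewards):
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]

# Four sampled plans for the same task: one success, one partial, two failures.
rewards = [1.0, 0.5, 0.0, 0.0]
print(group_relative_advantages(rewards))  # the best plan gets a positive advantage
```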
The Secret Sauce:
- Shared parameters + KL: Because teacher and student are the same model in two modes, new skills the teacher learns with PI become weights the student also has. KL keeps both modes aligned so skills don’t get lost.
🍞 Top Bread (Hook): Think of scrimmage practice where the team plays real games and reviews how closely their plays match a coach’s ideal tape.
🥬 The Concept: OPSD (On-Policy Self-Distillation)
- What it is: The student samples its own trajectories; a reverse-KL penalty pushes it toward a PI-conditioned teacher view of the same model.
- How it works (step by step):
- Student acts on-policy (no PI in input) and collects trajectories.
- Compute reward from the environment.
- Add reverse-KL that compares the student’s choices to the teacher’s probabilities (teacher gets PI in its conditioning), creating a dense learning signal per token.
- Update the student to both get higher reward and stay close to the PI-teacher.
- Why it matters: Without on-policy sampling, distribution shift can make learning brittle; OPSD stays on the student’s actual paths and still learns from the teacher’s guidance.
🍞 Bottom Bread (Anchor): In τ-Bench retail, the student runs a conversation; at each step, reverse KL says, “lean more toward what the PI-teacher would have done,” improving success rates.
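A toy per-token view of the OPSD signal; the actions and probabilities are invented, and in practice this runs over every generated token rather than a single action choice:

```python
# Toy per-token OPSD signal: the student samples its own action on-policy, and
# a reverse-KL term pulls that choice toward what the same model predicts when
# conditioned on the PI. Probabilities below are made up for illustration.
import math

student_probs = {"get_order": 0.30, "small_talk": 0.50, "refund": 0.20}
teacher_probs = {"get_order": 0.80, "small_talk": 0.10, "refund": 0.10}  # PI-conditioned

sampled_action = "small_talk"  # what the student actually did on-policy

# Per-sample reverse-KL contribution: log p_student(a) - log p_teacher(a | PI).
# Positive values mean the student prefers an action the PI-conditioned
# teacher considers unlikely, so the update pushes the student back toward it.
penalty = math.log(student_probs[sampled_action]) - math.log(teacher_probs[sampled_action])
print(f"reverse-KL penalty for '{sampled_action}': {penalty:.2f}")
```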
Inputs → Steps → Output (both methods):
- Inputs: tasks (states), optional PI, environment reward.
- Steps: rollouts (teacher or student), compute rewards + KL, group-relative advantages, policy update.
- Output: a student policy that performs strongly without PI.
What breaks without each step:
- No PI: teacher can’t reach successful modes in hard tasks.
- No KL: teacher/student drift or collapse, transfer weak.
- No on-policy or importance weighting: unstable learning from mismatched data.
- No joint objective: hard-to-tune sequential pipelines and poorer results.
04 Experiments & Results
🍞 Top Bread (Hook): Imagine testing two basketball drills: one team trains with whispered tips in practice but must play games without them. Who wins more often?
🥬 The Concept: Evaluations and Benchmarks
- What it is: Careful tests on tool-use environments to see which training method wins.
- How it works:
- Train on Travel Planner and τ-Bench retail using PI built from expert action traces.
- Compare to strong baselines: plain RL, SFT with and without CoT, and SFT+RL.
- Check generalization on τ-Bench airline (OOD) and the 7 GEM tool-use QA environments.
- Why it matters: Numbers show whether action-only PI really substitutes for hidden CoT and whether skills transfer to new tasks.
🍞 Bottom Bread (Anchor): If π-Distill scores like getting an A while SFT+RL gets a B, then training with action-only hints truly pays off.
The Test and Competition:
- Models: Qwen3-4B, Qwen3-8B, and R1-Distill-Llama-8B.
- Baselines: Base model, SFT (with/without CoT), RL, SFT+RL, plus variants of π-Distill (α=0, 0.5, 1) and OPSD.
- PI Types: (1) Tool calls & arguments, (2) Tool calls only, (3) Self-generated hints.
Scoreboard with Context (highlights):
- Travel Planner (Qwen3-8B):
- π-Distill (teacher-only α=1) reached about 44.1%, while the commonly used SFT w/ CoT + RL was about 32.3%—that’s like jumping from a solid B to a strong A-.
- OPSD also performed strongly, reaching around 37.5%, beating SFT+CoT+RL.
- τ-Bench Retail (Qwen3-8B):
- π-Distill (student-only α=0) hit ~31.1%, and (α=1) ~29.7%, both edging past SFT w/ CoT + RL at ~29.1%.
- OPSD scored ~27.3%, also better than plain RL and SFT w/o CoT + RL.
- τ-Bench Airline (OOD, Qwen3-8B):
- OPSD reached ~14.0%, topping SFT w/ CoT + RL (~8.0%) and standard RL (~6.67%). π-Distill variants were competitive (7–12%).
Generalization (GEM suite, Pass@1/10):
- Using the best τ-Bench retail checkpoint, π-Distill and OPSD generally outperformed base and RL on new tool-use QA tasks.
- With larger models (Qwen3-8B), π-Distill (α=0 or 0.5) and OPSD often surpassed SFT w/ CoT + RL, suggesting stronger reasoners benefit from staying on-policy and from PI-guided training.
Surprising Findings:
- Action-only PI can outperform methods that assume full CoT access, especially with π-Distill.
- Joint training (α=0.5) is frequently the most stable, rarely the worst performer.
- OPSD shines more as model size grows; smaller models can overfit teacher guidance or suffer when student–teacher KL is too large.
- The usefulness of PI and the initial teacher–student divergence strongly predict success; too much divergence can harm transfer.
Bottom Line:
- π-Distill consistently beats industry-standard SFT+RL pipelines (even those using CoT) on multiple settings, and OPSD is often a strong second, especially for larger models and OOD tasks.
05 Discussion & Limitations
🍞 Top Bread (Hook): Even great recipes have limits—you still need the right ingredients and kitchen tools.
Limitations:
- PI Source: This study builds PI from frontier model action traces. If you lack a strong expert to mine, results may vary.
- Model Scale: Experiments used up to 8B parameters; behavior at much larger scales needs testing.
- Observational Analysis: The paper reports careful observations, but not every factor was isolated in controlled ablations.
- Failure Modes: If PI has low utility or causes a large teacher–student divergence, π-Distill/OPSD may struggle; with small models, OPSD can overfit to the teacher’s guidance or be hurt by a very large KL term.
Required Resources:
- Multi-GPU training (e.g., H100s) and long context windows for multi-turn tool use.
- Environments with reliable rewards and safeguards (like length penalties) to prevent reward hacking.
- A source of successful trajectories to convert into PI (tool calls, arguments, or hints).
When NOT to Use:
- When you have no way to create useful PI (e.g., no expert traces) and tasks are too hard for the teacher to bootstrap.
- When student–teacher distributions are wildly different and KL regularization can’t be tuned to stabilize training.
- For very small models that can’t leverage on-policy guidance, where OPSD may underperform.
Open Questions:
- How do these methods scale beyond 8B parameters and across more domains?
- Can we create strong, synthetic PI without relying on frontier models?
- What automated knobs can adapt α and β during training to avoid collapse or drift?
- How to best choose PI type per task to maximize utility and minimize divergence?
🍞 Bottom Bread (Anchor): Think of picking the right training wheels and tightening the screws just right—too tight or too loose, and the ride gets wobbly. The same careful balance is needed here.
06 Conclusion & Future Work
Three-Sentence Summary:
- This paper introduces π-Distill and OPSD, two ways to train with privileged information (PI) during practice so a language model still excels without PI later.
- By sharing parameters between a PI-seeing teacher and a no-PI student, and by using KL to keep them aligned, the student reliably absorbs useful multi-step behaviors.
- Across travel planning, customer support, and OOD tool-use QA, these methods often beat standard SFT+RL pipelines, even those assuming full chain-of-thought.
Main Achievement:
- Demonstrating that action-only PI (no CoT) can be distilled effectively into a strong test-time policy via joint teacher–student training, simplifying pipelines and improving results.
Future Directions:
- Scale to larger models and more domains; explore synthetic or self-generated PI without frontier models; develop adaptive schedules for α and β; design PI selectors that optimize utility vs. divergence.
Why Remember This:
- It shows we can learn from what strong models do—even when they hide how they think—and still build reliable, generalizing agents that perform well without extra hints at test time.
Practical Applications
- Distill closed-source expert agents into smaller open models using only action logs (no CoT).
- Improve customer-support bots to gather the right info and call the correct tools in fewer steps.
- Train planning agents (travel, shopping, scheduling) to handle multi-turn constraints reliably.
- Boost out-of-domain tool-use performance by selecting checkpoints trained with π-Distill or OPSD.
- Reduce dependence on supervised CoT datasets by converting expert trajectories into PI formats.
- Stabilize RL training with KL regularization to prevent drift or collapse in long-horizon tasks.
- Choose PI types (tool calls, arguments, or hints) to balance information content vs. divergence per task.
- Use OPSD for larger models to turn teacher agreement into a dense learning signal on-policy.
- Avoid reward hacking by incorporating length and leakage penalties in training rewards.
- Deploy more consistent agents without needing to ship extra hints or prompts at inference time.