Masking Teacher and Reinforcing Student for Distilling Vision-Language Models
Key Summary
- Big vision-language models are super smart but too large to fit on phones and small devices.
- Masters is a new way to teach a small model using a big one by first making the big teacher simpler and then slowly restoring its full power.
- This “mask-progressive” plan lets the student learn easy patterns first and harder ones later, so training stays stable.
- Masters also uses offline reinforcement learning with two rewards: one for being correct (accuracy) and one for being easy to learn from (distillation).
- Instead of expensive online think–answer training, Masters pre-generates multiple answers up front to save a lot of time and compute.
- Generating multiple responses from both teacher and student improves alignment and makes learning smoother.
- Gradually moving from a medium teacher to a big teacher (like 14B → 38B) beats jumping straight to the biggest teacher.
- Across many benchmarks, small models trained with Masters beat other compact VLMs and even challenge some larger ones.
- Masters is practical: it cuts training time and avoids slow inference, making on-device AI more realistic.
- The framework is unified and scalable, pointing to a reliable path for building compact, deployable VLMs.
Why This Research Matters
Masters makes it realistic to run smart multimodal AI on everyday devices by shrinking the compute bill without shrinking the brains. It stabilizes training for small models, so they can safely learn from big ones and still perform well on complex tasks like charts, documents, and math-in-vision. The offline design and dual rewards bring both speed and quality, reducing the need for costly online think–answer training. This means faster, cheaper updates and better accessibility, from education apps to assistive tools. With gradual teacher scaling, teams can build strong compact models step by step instead of relying on one giant leap. The approach also generalizes across model families, making it a versatile recipe rather than a one-off trick. In short, Masters helps turn cutting-edge research into usable, deployable AI.
Detailed Explanation
01 Background & Problem Definition
You know how kids learn to ride a bike with training wheels first, and only after they feel steady do they try riding without them? That gradual path keeps them from wobbling and falling. Big AI models that see images and read text together—called vision-language models (VLMs)—are like expert cyclists: powerful, but heavy and hard to carry around. We want smaller riders (models) that can still zip around safely on everyday devices like phones.
What it is (the situation): For the past few years, large VLMs have gotten really good at tasks such as describing pictures, reading charts, and answering open questions about images. They do this by combining vision (images) and language (text) understanding. But these models are huge and need lots of memory and compute. That makes them slow and expensive, especially for mobile or edge devices.
How it works (the old way): The classic fix is knowledge distillation, where a big “teacher” model teaches a smaller “student” model. The student looks at the teacher’s outputs (like answer probabilities) and tries to mimic them. If this works well, the small model becomes smart without needing to be giant.
Why it matters (what breaks without better methods): When the gap between teacher and student is very large, the student struggles to copy the teacher’s rich thinking patterns. Training can become unstable, and performance drops. It is like asking a beginner cyclist to keep up with a Tour de France pro on a steep mountain—too much, too soon.
🍞 Top Bread (Hook): Imagine you want to learn a new sport. If your coach shows you the most advanced moves on day one, you’ll probably feel lost. 🥬 Filling (Reinforcement Learning): Reinforcement learning (RL) is a way for a model to improve by trying things and getting rewards for better behavior.
- What it is: Learning by doing, guided by rewards.
- How it works: 1) Try an answer. 2) Get a score (reward). 3) Adjust to get higher scores next time.
- Why it matters: Without rewards, the model doesn’t know which answers are better, so it can’t improve directionally. 🍞 Bottom Bread (Anchor): Like a video game that gives you points for good choices, helping you discover the best strategy over time.
🍞 Top Bread (Hook): Think of a top student sharing neat notes with a classmate who missed the lecture. 🥬 Filling (Knowledge Distillation): Knowledge distillation is when a small model learns from a big model’s outputs.
- What it is: A small student copies the big teacher’s “soft answers.”
- How it works: 1) Teacher answers questions. 2) Student compares its answers with teacher’s. 3) Student adjusts to get closer to teacher.
- Why it matters: Without distillation, the small model would learn slower and need more data. 🍞 Bottom Bread (Anchor): Like practicing with an answer key that also shows how confident the teacher is about each choice.
🍞 Top Bread (Hook): If a puzzle has 1,000 tiny pieces, you don’t start by forcing someone to solve the hardest parts first. 🥬 Filling (Masking Teacher): Masking teacher means turning off the least important weights in the big model so it becomes simpler.
- What it is: Temporarily quieting less helpful parts of the teacher’s brain.
- How it works: 1) Measure weight sizes. 2) Mask (set to zero) the small ones. 3) Use this simpler teacher to guide the student.
- Why it matters: Without masking, the student faces overwhelming complexity and training can wobble. 🍞 Bottom Bread (Anchor): It’s like a coach who first teaches the basic moves and saves fancy tricks for later.
🍞 Top Bread (Hook): You don’t remove training wheels all at once—you loosen them step by step. 🥬 Filling (Mask-Progressive Strategy): Gradually unmask the teacher so it gets smarter over time during training.
- What it is: Start simple, then slowly restore the teacher’s full capacity.
- How it works: 1) Begin with a heavily masked teacher. 2) Reduce masking in stages. 3) Student learns coarse patterns first, then fine details.
- Why it matters: Without progression, the student jumps into deep water too soon and may sink. 🍞 Bottom Bread (Anchor): Like puzzles that start with big pieces and then add smaller pieces as you get better.
🍞 Top Bread (Hook): When you’re practicing, it helps to get two kinds of feedback: Did you get it right? And was your method easy to follow? 🥬 Filling (Dual Rewards): Two rewards guide learning: one for correctness and one for how easy the answer is to transfer from teacher to student.
- What it is: Accuracy reward + distillation reward.
- How it works: 1) Score for being semantically correct. 2) Score for being close to the teacher’s output style. 3) Combine both to train.
- Why it matters: Without both, you might be correct but hard to learn from, or easy to learn from but wrong. 🍞 Bottom Bread (Anchor): Like a teacher grading both your final answer and your clarity of explanation.
🍞 Top Bread (Hook): Studying with old quizzes is faster than taking a brand-new test every minute. 🥬 Filling (Offline RL): Learn from a fixed set of pre-generated answers instead of live, expensive “think–answer” training.
- What it is: RL that uses stored experiences.
- How it works: 1) Teacher and student pre-generate multiple answers. 2) Compute rewards offline. 3) Train efficiently.
- Why it matters: Without offline RL, compute costs explode and you can’t scale training. 🍞 Bottom Bread (Anchor): Like practicing from a workbook rather than running a live exam each time.
In short, the world before this paper had strong but heavy VLMs. We needed small, fast models that still understand images and text well. Previous fixes tried different loss functions or matching internal layers, but they didn’t directly solve the “capacity gap” problem. Masters fills that gap by simplifying the teacher first (masking), then turning the complexity back on slowly (progressive unmasking), and using offline RL with smart rewards so the student learns stably and efficiently. This matters for daily life because it makes powerful on-device AI more practical—helping with accessibility, education, safety, and productivity, without needing giant servers.
02 Core Idea
Aha! Moment in one sentence: Make the big teacher simpler at first and only gradually restore its full power while the student trains, then reinforce the student with two targeted rewards using fast, offline practice.
Three analogies to lock it in:
- Sports coach: Start with basic drills (masked teacher) before advanced plays (unmasked teacher), and give two scores: did you score (accuracy) and was your form copyable (distillation)?
- Puzzle builder: Begin with big, easy pieces (coarse patterns), then introduce tiny pieces (fine details), all while checking both if the picture is correct and if the steps are easy to imitate.
- Music teacher: Practice a simpler arrangement first, then add complexity, grading both correctness of notes and how teachable the fingering is.
Before vs After:
- Before: Distillation often used a giant, unchanged teacher. The student tried to swallow complex, high-dimensional knowledge in one gulp, causing unstable learning.
- After: Masters shrinks the teacher at first (masking), restores it gradually (progressive), and gives the student two helpful signals (dual rewards) using offline RL with multiple responses. Learning becomes smooth, stable, and efficient.
Why it works (intuition without equations):
- The capacity gap is the main reason small students wobble. If the teacher is too complex from the start, the student’s low-dimensional brain can’t mirror it well. Masking lowers the teacher’s active complexity so the student can match the teacher’s patterns at a similar scale. As the student grows more capable, unmasking raises the teacher’s complexity step by step, letting the student climb a staircase instead of a cliff.
- Dual rewards provide two guardrails: accuracy (so answers are right) and transferability (so the pattern is learnable). Accuracy alone can push toward stylistic answers that the student struggles to internalize; transferability alone can yield easy-to-mimic but wrong answers. Together, they push the student toward correct and learnable behavior.
- Offline RL with multiple responses lets you train on lots of diverse attempts without paying the huge cost of generating new long, chain-of-thought answers every step. This preserves efficiency and diversity—key for robust learning.
Building blocks (each with a sandwich):
🍞 Top Bread (Hook): You don’t start math class with calculus. 🥬 Filling (Masking Teacher): Temporarily zero-out small-magnitude weights in the teacher to make it simpler.
- What: A pruned version of the teacher used only during training.
- How: Rank weights by size, mask the small ones per layer, form a simpler teacher checkpoint.
- Why: Without this, the student faces too much detail too soon. 🍞 Bottom Bread (Anchor): Like using a shorter textbook before tackling the full edition.
🍞 Top Bread (Hook): Training wheels come off slowly. 🥬 Filling (Mask-Progressive Strategy): Reduce the masking ratio in stages as training goes on.
- What: A schedule that restores teacher capacity from simple to full.
- How: Set a maximum mask (e.g., 20%), then step down (20%→15%→10%→5%→0%).
- Why: Without a schedule, the jump from simple to complex is abrupt and destabilizing. 🍞 Bottom Bread (Anchor): Like raising the bar little by little in high jump practice.
🍞 Top Bread (Hook): Studying from a question bank is efficient. 🥬 Filling (Offline RL): Use pre-generated answers from both teacher and student.
- What: RL that learns from stored responses.
- How: Generate multiple responses per question once, then compute rewards and train repeatedly.
- Why: Without offline RL, training becomes too slow and costly. 🍞 Bottom Bread (Anchor): Like reusing flashcards instead of writing new ones every time.
🍞 Top Bread (Hook): A good grade checks both the answer and how you showed your work. 🥬 Filling (Dual Rewards): Combine two signals: correctness and ease of transfer.
- What: Accuracy reward + distillation reward.
- How: Use an LLM-judge for semantic correctness; measure logit similarity for transferability, normalized so small divergences get big rewards.
- Why: Without both, learning skews toward either brittle correctness or easy-but-wrong answers. 🍞 Bottom Bread (Anchor): Like scoring a science fair project on results and clarity.
What changes because of this idea:
- Stability: Training stops wobbling because the student always faces an appropriately complex teacher.
- Efficiency: Offline RL avoids slow online generation and uses more data diversity.
- Performance: The student not only catches up to other compact models but can sometimes rival larger ones.
- Scalability: You can scale teacher sizes gradually (e.g., 14B → 38B → 78B) for even smoother convergence and stronger generalization.
In essence, Masters turns distillation into a carefully staged lesson plan: simplify first, then deepen; check correctness and teachability; study efficiently from a rich, prebuilt workbook of examples.
03 Methodology
At a high level: Images and questions → masked teacher generates multiple responses; student also generates multiple responses → compute two rewards (accuracy and distillation) offline → train student to improve on both while progressively restoring the teacher’s capacity → distilled student model.
Step 1: Build and schedule masked teachers
- What happens: You take the big teacher and temporarily quiet its least important weights (the ones with small magnitudes) per layer. You save several teacher checkpoints with different masking ratios, such as 0.20, 0.15, 0.10, 0.05, and 0.
- Why this step exists: A too-powerful teacher from the start overwhelms the student. The masked teacher aligns better with the student’s current capacity.
- Example: Suppose the teacher is InternVL3.5-38B. At 20% masking, it acts like a simplified coach. As training proceeds, you change to 15%, 10%, 5%, and finally 0% masking—gradually revealing full expertise. (A small code sketch of this masking step follows below.)
🍞 Top Bread (Hook): Don’t jump straight to the hardest textbook. 🥬 Filling (Masking Teacher): Temporarily zero-out small weights to simplify.
- What: A simpler, temporary teacher.
- How: Rank weights; mask smaller ones layer-by-layer; store multiple masked versions.
- Why: Without it, the student is overwhelmed. 🍞 Bottom Bread (Anchor): Like practicing with a summary sheet before the full book.
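To make Step 1 concrete, here is a minimal PyTorch sketch of per-layer magnitude masking. It is an illustration under assumptions: the function name, the choice to skip 1-D parameters, and the exact thresholding rule are ours, not the paper's released code.

```python
import copy
import torch

def build_masked_teacher(teacher, mask_ratio: float):
    """Zero out the smallest-magnitude fraction of weights in each layer.

    Sketch only: the paper describes per-layer, magnitude-based masking;
    the exact selection rule below is an assumption.
    """
    masks = {}  # parameter name -> boolean mask (True = kept weight)
    with torch.no_grad():
        for name, param in teacher.named_parameters():
            if param.dim() < 2:              # skip biases/norms in this sketch
                continue
            k = int(param.numel() * mask_ratio)
            if k == 0:
                continue
            # Threshold at the k-th smallest absolute value in this layer
            threshold = param.abs().flatten().kthvalue(k).values
            keep = param.abs() > threshold
            param.mul_(keep)                 # zero the masked weights in place
            masks[name] = keep
    return masks

# Save one simplified teacher per masking ratio, used in later phases:
# for r in (0.20, 0.15, 0.10, 0.05):
#     t = copy.deepcopy(full_teacher)
#     build_masked_teacher(t, r)
#     torch.save(t.state_dict(), f"teacher_masked_{int(round(r * 100))}.pt")
```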
Step 2: Progressive unmasking schedule
- What happens: You decide a maximum masking ratio (like 20%) and a step size (like 5%). Every training phase, you reduce the mask to make the teacher a bit more complex.
- Why this step exists: Smooth difficulty ramps keep training stable and let the student learn coarse patterns first, then fine details.
- Example: If you have 5 phases (0.20→0.15→0.10→0.05→0), each phase uses its own pre-generated responses and targets. (See the schedule sketch below.)
🍞 Top Bread (Hook): Training wheels off, a little at a time. 🥬 Filling (Mask-Progressive Strategy): Restore capacity in stages.
- What: A schedule for unmasking.
- How: Choose r_max and decrement; step through masked checkpoints during training.
- Why: Without staging, complexity jumps and the student stumbles. 🍞 Bottom Bread (Anchor): Like moving from easy mode to normal to hard in a game.
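Here is a tiny sketch of the schedule in Step 2. The defaults (20% maximum mask, 5% steps) mirror the example above; the helper name and the checkpoint filenames are hypothetical.

```python
def unmask_schedule(r_max: float = 0.20, step: float = 0.05):
    """Return masking ratios from r_max down to 0 in fixed decrements."""
    n_steps = round(r_max / step)
    return [round(r_max - i * step, 4) for i in range(n_steps + 1)]

print(unmask_schedule())  # [0.2, 0.15, 0.1, 0.05, 0.0]

for phase, ratio in enumerate(unmask_schedule()):
    # Each phase pairs the teacher checkpoint masked at `ratio` with its own
    # pre-generated responses (Step 3) before the offline update (Steps 4-5).
    ckpt = f"teacher_masked_{int(round(ratio * 100))}.pt"  # hypothetical name
    print(f"phase {phase}: mask ratio {ratio} -> {ckpt}")
```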
Step 3: Multi-response generation from both teacher and student
- What happens: For each question-image pair, you ask both the masked teacher and the student to generate multiple answers (e.g., 8). You store these in a dataset for offline RL.
- Why this step exists: One answer per question is too limiting and can cause overfitting to a narrow style. Multiple responses increase diversity and robustness.
- Example: Question: “What type of vehicle is the dog sitting on?” Teacher responses might include “motorcycle,” “Harley-style motorcycle,” or “bike.” Student responses might vary from correct to slightly off. You keep them all. (A sampling sketch follows below.)
🍞 Top Bread (Hook): Practice many ways to say the right thing. 🥬 Filling (Multiple Responses): Generate several answers per question from both models.
- What: A diverse set of candidate answers.
- How: Sample with temperature/top-p to get 8 answers per question.
- Why: Without diversity, the student becomes brittle and less general. 🍞 Bottom Bread (Anchor): Like rehearsing several ways to explain the same math problem.
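A hedged sketch of the multi-response step, assuming a Hugging Face-style generate API; the sampling hyperparameters and the dataset record layout in the comment are placeholders, not the paper's settings.

```python
import torch

def sample_responses(model, processor, image, question, n: int = 8,
                     temperature: float = 0.8, top_p: float = 0.9):
    """Draw n diverse answers for one image-question pair.

    Assumes a Hugging Face-style VLM whose processor accepts images and
    text; the sampling hyperparameters are illustrative.
    """
    inputs = processor(images=image, text=question, return_tensors="pt")
    with torch.no_grad():
        sequences = model.generate(
            **inputs,
            do_sample=True,
            temperature=temperature,
            top_p=top_p,
            num_return_sequences=n,
            max_new_tokens=64,
        )
    return [processor.decode(s, skip_special_tokens=True) for s in sequences]

# Build the offline dataset once and reuse it across training:
# record = {
#     "question": question,
#     "teacher_answers": sample_responses(masked_teacher, processor, img, question),
#     "student_answers": sample_responses(student, processor, img, question),
# }
```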
Step 4: Offline RL with dual rewards
- What happens: For each stored answer, you compute two scores: accuracy and distillation. The accuracy reward uses an LLM-as-a-Judge to check if the answer is semantically correct (0 to 1). The distillation reward checks how close the student’s logits are to the teacher’s for that answer, then normalizes so more teacher-like answers get higher rewards.
- Why this step exists: Accuracy ensures correctness; distillation ensures the answer is easy for the student to learn to produce.
- Example: If “Harley-style motorcycle” is judged correct (1.0) and closely matches the teacher’s output distribution, it gets a high combined reward. A fancy but confusing answer, even if correct, may score lower on distillation; an easy-to-mimic but wrong answer scores low on accuracy. (A reward-computation sketch follows below.)
🍞 Top Bread (Hook): A good grade checks both the answer and the method. 🥬 Filling (Dual Rewards, Offline RL, Accuracy Reward, Distillation Reward):
- What: Two rewards, computed offline, guide training.
- How: Accuracy via LLM-judge; distillation via normalized teacher–student divergence.
- Why: Without both, you either learn brittle correctness or comfortable wrongness. 🍞 Bottom Bread (Anchor): Like scoring a lab report on results and clarity of procedure.
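Below is one plausible shape for the two rewards, written as a sketch: `judge` stands in for the LLM-as-a-Judge call, and both the exponential normalization of the distillation reward and the 50/50 weighting are assumptions about reasonable choices, not the paper's exact formulas.

```python
import torch
import torch.nn.functional as F

def accuracy_reward(answer: str, reference: str, judge) -> float:
    """Semantic correctness in [0, 1]; `judge` is a hypothetical callable
    that prompts an LLM-as-a-Judge and returns its score."""
    return float(judge(answer=answer, reference=reference))

def distillation_reward(student_logits, teacher_logits, tau: float = 1.0) -> float:
    """Map teacher-student divergence to a reward in (0, 1]: low KL
    (easy-to-mimic answers) gives a reward near 1. The exponential
    normalization is an illustrative choice."""
    kl = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
    return torch.exp(-kl / tau).item()

def combined_reward(answer, reference, judge, s_logits, t_logits, alpha=0.5):
    # Weighted mix of correctness and transferability; alpha is illustrative.
    return (alpha * accuracy_reward(answer, reference, judge)
            + (1 - alpha) * distillation_reward(s_logits, t_logits))
```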
Step 5: Unified training objective
- What happens: The student is trained to increase both rewards on the stored responses, while also minimizing the gap from the progressively less-masked teacher. In practice, this combines reinforcement (use rewards) and distillation (match teacher) in one loop.
- Why this step exists: Merging the two signals stabilizes learning: rewards push toward useful behavior, distillation anchors the student to teacher-like patterns.
- Example: With InternVL3.5-8B as student, you see steady score increases across benchmarks as you move through the masking schedule and reinforce with rewards.
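As a rough sketch of how the two signals could be merged in a single offline update (an assumed REINFORCE-style surrogate, not the paper's exact objective): weight the log-likelihood of each stored response by its reward and add a KL term toward the current, progressively less-masked teacher.

```python
import torch.nn.functional as F

def unified_loss(student_logits, teacher_logits, response_ids, reward, beta=0.5):
    """Offline, reward-weighted surrogate plus a distillation anchor.

    Sketch under assumptions: `beta` and the absence of a reward baseline
    are simplifications; shapes are (batch, seq, vocab) and (batch, seq).
    """
    # Reward-weighted negative log-likelihood of the stored response
    nll = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        response_ids.view(-1),
    )
    # KL between student and the progressively unmasked teacher
    kl = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
    return reward * nll + beta * kl
```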
Step 6: Teacher-size scaling (optional but powerful)
- What happens: Start with a mid-size teacher (e.g., 14B) and then move to a larger one (e.g., 38B or 78B) after the student is warmed up.
- Why this step exists: A mid teacher bridges the gap, improving stability and final performance compared to jumping straight to the largest teacher.
- Example: 14B → 38B improves convergence and generalization versus 38B alone.
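A small configuration sketch of the teacher ladder in Step 6; the checkpoint names and the reuse of the same mask schedule at each stage are illustrative assumptions.

```python
# Hypothetical two-stage ladder: warm the student up with a mid-size teacher,
# then repeat the mask-progressive schedule with a larger one.
TEACHER_LADDER = [
    {"teacher": "InternVL3.5-14B", "mask_schedule": [0.20, 0.15, 0.10, 0.05, 0.0]},
    {"teacher": "InternVL3.5-38B", "mask_schedule": [0.20, 0.15, 0.10, 0.05, 0.0]},
]

for stage in TEACHER_LADDER:
    for ratio in stage["mask_schedule"]:
        # Pre-generate responses with this (teacher, mask ratio) pair,
        # then run the offline dual-reward update from Steps 4-5.
        print(f'{stage["teacher"]} at mask ratio {ratio}')
```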
The secret sauce:
- Capacity alignment through mask-progressive unmasking: keep challenge appropriate.
- Dual rewards that value both correctness and teachability.
- Offline, multi-response training: big diversity at small compute cost.
- Optional teacher-size scaling: step-ladder between student and the biggest teacher.
What breaks without each piece:
- No masking: student faces too much complexity, unstable training.
- No progression: sudden jumps in difficulty cause regressions.
- No multi-response: narrow style, weaker generalization.
- No dual rewards: either correct-but-unlearnable or learnable-but-wrong behavior.
- No offline RL: training becomes slow and data-poor, limiting scale.
04 Experiments & Results
The test: The authors measured how well small students taught by Masters perform on many public VLM benchmarks that check skills like diagram reading (AI2D), chart understanding (ChartQA), math-in-vision reasoning (MathVista), broad multimodal abilities (MMB, MM-Vet, MMStar), and expert-level exams (MMMU, MMMU-Pro), plus others like BLINK, SEED, SEED2+, and RealWorldQA. These cover recognition, OCR, reasoning, and open-ended answers.
The competition: Masters was compared against three main baselines and many strong open/closed models.
- Baselines:
  - Naive distillation: direct teacher→student with a standard divergence loss.
  - Mask-progressive only: use the masking schedule but no RL rewards.
  - RL-applied: add dual-reward offline RL on top of mask-progressive.
- Models: Multiple student families (Qwen2.5-VL, Qwen3-VL, InternVL3, InternVL3.5) at sizes like 2B, 3/4/7/8B, and teachers up to 72B/78B.
The scoreboard with context:
- Mask-progressive distillation alone already beats naive distillation across averages. This is like moving from a shaky C to a steady B+ just by staging the difficulty.
- Adding reward feedback (accuracy + distillation) gives another jump, often pushing students into A-range performance compared to peers, and in some cases rivaling larger models. For example, InternVL3.5-8B with Masters shows clear gains across most benchmarks relative to its base and mask-only versions.
- Teacher-size scaling (e.g., 14B → 38B → 78B) further boosts stability and performance beyond one-shot large-teacher training. Think of it as taking two medium steps instead of one giant leap—fewer stumbles, better finish.
- Multi-response count: Performance rises as you increase the number of stored answers and plateaus around eight responses—good bang for the buck.
- Masking ratio: A grid search shows that many teacher families work best around a 20% max mask, though some prefer 40%. The key point is not the exact number but that a modest mask creates a sweet spot for stable learning.
- Efficiency: Because Masters uses offline RL without think–answer, it avoids huge compute bills and long inference times. The training and inference are much more practical for real deployments.
Surprising findings:
- Students sometimes surpass parts of larger models’ performance while being far more efficient. This suggests that a right-sized, well-taught student can punch above its weight.
- Using only teacher answers or only student answers is worse than mixing both 1:1. The balance keeps the student exposed to rich teacher guidance while also staying aligned with what it can produce.
- Removing the accuracy reward makes it hard for the student to beat the teacher; removing normalization in the distillation reward weakens its usefulness. The two rewards, designed just right, are both necessary.
- Masters helps other distillation frameworks too. Plugging Masters ideas into DistilLM, LLaVA-KD, VLsI, or RIL mitigates the instability they suffer under large teacher–student gaps, often matching or beating their default multi-step pipelines with a simpler single-step process.
Make the numbers meaningful:
- Think of averages like report cards across many classes. Naive distillation might be a B-, mask-progressive lifts to a solid B/B+, and adding dual-reward RL turns many lines into A-/A—especially on complex reasoning and open-ended tasks.
- In leaderboards featuring both open and closed models, Masters-trained small models frequently rise near the top for their size class and sometimes nip at the heels of much larger ones. That’s like a middle-schooler solving contest problems that stump some high schoolers.
Bottom line: Across diverse tests and model families, each ingredient—masking, progressive unmasking, multi-response offline RL, and dual rewards—adds up. The whole Masters recipe consistently delivers stronger, more stable, and more efficient students.
05 Discussion & Limitations
Limitations:
- Offline setup: Masters trains from pre-generated answers. This is great for speed, but it can’t adapt instantly to new data or real-time feedback like full online RL could.
- Judge dependency: The accuracy reward relies on an LLM-as-a-Judge. While robust, it can still make rare mistakes or carry biases.
- Data generation cost: You still need to pre-generate multiple responses from all masked teachers and the student. This is cheaper than online RL, but not free.
- Mask choice: Magnitude-based masking is simple and effective, but not data-aware. Smarter, data-driven masking might work even better.
- Not chain-of-thought focused: Masters avoids slow think–answer traces on purpose, which is efficient, but it may miss some benefits of explicit reasoning traces in certain domains.
Required resources:
- At least one strong teacher (often multiple sizes help).
- GPUs to pre-generate responses and to train offline RL (authors used A100 80GB with DeepSpeed/ZeRO-3).
- LLM-as-a-Judge capability (can be the same model family used for generation with careful prompts).
When not to use:
- If you must adapt online to fast-changing data streams where offline pre-generation won’t cut it.
- If your use case absolutely requires long, think–answer reasoning traces at training time (and you’re willing to pay the compute).
- If you have no access to a competent teacher model; Masters relies on teacher quality.
Open questions:
- Can we design data-driven masking that chooses which weights to mask based on examples, not just magnitude?
- Can we build a hybrid online–offline approach that retains Masters’ speed but gains partial real-time adaptability?
- How can we best de-bias or calibrate LLM-as-a-Judge, or replace it with more transparent evaluators?
- What’s the optimal ratio of teacher to student responses beyond the observed 1:1 mix? Does it vary by domain or model size?
- Can we extend dual rewards to include other properties like safety, style control, or factuality checks against external tools?
Overall, Masters is a practical, scalable step toward strong, deployable VLMs, with room to grow in adaptive training, smarter masking, and richer reward design.
06 Conclusion & Future Work
Three-sentence summary: Masters teaches small vision-language models by first simplifying the big teacher (masking), then gradually restoring its power (progressive unmasking), all while reinforcing the student with two offline rewards for correctness and transferability. This capacity-aligned, reward-guided approach keeps training stable, efficient, and scalable, beating standard distillation and often challenging larger models. By using multiple pre-generated responses and optional teacher-size scaling, Masters provides a reliable path to high-quality, on-device AI.
Main achievement: The paper shows that directly addressing the teacher–student capacity gap—via mask-progressive distillation—and coupling it with offline dual-reward RL delivers consistently better, more stable small models across many benchmarks, at a fraction of the compute of online think–answer methods.
Future directions: Explore data-driven masking, smarter reward shaping (including safety and factuality), hybrid online–offline training for mild real-time adaptation, and improved replay schemes (e.g., RL buffers that emphasize hard or diverse examples). Investigate domain-specific schedules for masking and teacher-size scaling tailored to tasks like OCR, charts, and math-in-vision.
Why remember this: Masters reframes distillation as a staged lesson plan with the right difficulty at the right time and the right feedback signals. It’s a simple, unified, and scalable recipe that turns big-model wisdom into small-model usefulness—bringing powerful multimodal intelligence closer to everyday devices.
Practical Applications
- On-device assistive reading: Help users read signs, menus, and documents via phone cameras with fast, offline responses.
- Educational tools: Explain diagrams, charts, and math problems in textbooks without needing a server connection.
- Field inspection: Enable technicians to identify parts, hazards, or meter readings on-site with low-latency guidance.
- Retail shelf analytics: Count products and read labels in stores using compact edge devices.
- Medical triage aids: Assist clinicians with quick, on-device visual question answering (privacy-friendly) for non-diagnostic support.
- Robotics vision: Provide lightweight perception and reasoning for home or warehouse robots where compute is limited.
- Document processing: Extract key facts from receipts, invoices, and forms at the edge for secure, fast workflows.
- Navigation assistance: Describe scenes and landmarks in real time for accessibility and AR experiences.
- Industrial dashboards: Read charts and detect anomalies on factory floors without high-bandwidth backhaul.
- Wildlife and environment monitoring: Identify species or conditions using battery-powered cameras with limited compute.