LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning
Key Summary
- LaViT is a new way to teach smaller vision-language models to look at the right parts of an image before they speak.
- It fixes the Perception Gap: students copy the teacher's words but not the teacher's gaze, so they guess from language instead of seeing.
- The student first generates a few latent visual thoughts (special hidden tokens) that reconstruct the teacher's visual concepts and attention trail.
- A curriculum sensory gate starts by hiding most raw pixels, forcing the student to rely on its visual thoughts, then gradually reopens vision.
- LaViT aligns both what to see (semantic concepts) and where to look (attention trajectory) using white-box signals from the teacher.
- On tough tests like BLINK Relative Depth, a compact 3B LaViT beats bigger open models and even GPT-4o (78.23% vs. 64.52%).
- It also reduces hallucinations and "CLIP-blindness" on fine-grained perception benchmarks like MMVP.
- Ablations show every piece matters: remove trajectory alignment or the curriculum gate and performance drops notably.
- Surprisingly, the student's gaze becomes more stable than the teacher's thanks to top-K attention sparsification.
- Bottom line: teaching how to look and think visually before answering is more effective than just scaling model size.
Why This Research Matters
LaViT helps AI truly "see before it speaks," which is crucial for safety and trust. In everyday tools that read homework diagrams, shopping apps that compare products, or navigation aids that interpret signs, grounded vision reduces silly and risky mistakes. In professional settings like manufacturing inspection or medical triage, focusing on the right pixels can be the difference between catching a defect or missing it. Because LaViT works well even on a compact 3B model, it brings high-quality multimodal reasoning to edge devices and cost-sensitive deployments. Its curriculum prevents shortcut learning, so improvements are not just numbers; they reflect more honest, evidence-based answers. That means fewer hallucinations, better alignment with human expectations, and AI that explains decisions more reliably.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're playing I-Spy. If you don't actually look at the red backpack your friend mentions, you might still blurt out an answer based on guesses, but you'll likely be wrong. You need to both look and think before you speak.
The Story So Far: For years, multimodal AI (models that use pictures and words) got better at describing images and answering questions. Early systems mostly converted pictures into text first, then did all the reasoning in language. Later, researchers realized that "thinking with images" (actively referring back to visual evidence during reasoning) helped on tricky problems like depth, geometry, or tiny attributes (like the difference between striped and dotted).
What was the problem? Even with this progress, many efforts still trained students (smaller models) to copy teachers (bigger models) mainly by matching final answers or text probabilities. That assumes if the student says the same words, it must have learned the same skill. But vision-language models can sometimes sound right without actually looking right: they lean on language patterns (like common sense or stereotypes) instead of grounded seeing. This is risky on tasks where details decide the answer.
What people tried before and why it fell short: Some methods add helper images, marked boxes, or use extra tools to push the model's attention. Others improve training with reinforcement learning to encourage multi-step visual CoT (chain-of-thought). These helped, but they often relied on external supervision (like annotated regions) or focused on the words the model writes rather than the internal visual process. Most did not directly align how the model's eyes (attention) moved across the image while it reasoned.
The key gap: Researchers suspected a hidden mismatch between what a model says and what it looks at. The paper proves this with two findings. First, when attention isn't focused on the right region, answers get worse and hallucinations appear. Second, even if a student's text matches the teacher's, their visual attention paths can be very different, especially for attribute-heavy words (like "red," "striped," or "on the left"). In short, students were learning what to say without learning where to look.
Why this matters in real life: Think of apps that read homework diagrams, driver-assist systems judging distances, or medical tools that must notice a tiny shadow in an X-ray. If the model copies the answer without grounding its gaze, it can fail in ways that look convincing but are dangerously wrong. Better grounding means fewer made-up details (hallucinations), more trustworthy decisions, and small models that punch above their weight.
Anchor: Picture a question: "Which point is closer to the camera, A or B?" A language-guessing model might default to the more centered point, but a grounded one compares visual cues like relative size and occlusion. If it didn't "look" at A and B, it can't judge depth reliably. That's exactly the kind of problem LaViT tackles: teaching the student not just to answer, but to actually look in the right place first.
02 Core Idea
Hook: You know how a good art teacher makes you do quick thumbnail sketches before painting the final piece? Those sketches lock in what to focus on before any grand brushstrokes.
The Aha! Moment in one sentence: Make the student model silently sketch its visual thoughts (where to look and what to see) before speaking the answer, and align those sketches with the teacher's own internal gaze and visual meaning.
Three analogies:
- Detective flashlight: Before announcing the culprit, sweep the flashlight over the key clues (attention trajectory) and jot down what they mean (visual semantics). Then give the verdict.
- GPS route: Don't just give the destination (answer). First plan and follow the route (attention over image), noting landmarks (semantics). The path shapes the final arrival.
- Cooking mise en place: Prep ingredients (latent visual thoughts) by matching the chef's picks and order (teacher alignment). Only then start cooking (text generation).
Before vs. After:
- Before: Students trained to match answers could sound right but look wrong, leaning on language priors. Visual errors like confusing left/right or color slipped through.
- After: Students are trained to recreate the teacher's internal where-to-look (attention path) and what-to-see (contextual visual concepts) as a few latent tokens, before any text. This yields sharper grounding, fewer hallucinations, and stronger performance on geometry and fine-grained details.
Why it works (the intuition):
- Vision is causal here: correct answers need the right evidence. By first reconstructing the teacher's visual semantics and attention trail, the student "earns the right" to answer.
- A curriculum sensory gate starts by dimming the direct view of raw pixels, so the student must rely on its latent visual thoughts. Later, the gate fully opens to let direct vision complement those thoughts. This prevents shortcuts and matches inference-time behavior.
- Aligning two streams, what to see (semantic features) and where to look (attention distribution), transfers the teacher's visual thinking, not just its words.
Building blocks (each with a mini-sandwich):
You know how you outline a story before writing the paragraphs? Latent Visual Thoughts are small hidden tokens the model writes first that capture the key visual ideas and gaze pattern. How: (1) See the image and question. (2) Autoregressively produce K visual-thought tokens. (3) Use them to guide the final text answer. Why: Without them, the model can jump straight to word-guessing. Example: Before answering "How many stripes on the shirt?", the model writes a few hidden "notes" that point to the shirt region and encode "count stripes" info.
Think of watching a coach demonstrate footwork step by step. White-box Trajectory Distillation lets the student observe the teacher's inner moves: the attention over image patches (where to look) and contextual visual features (what they mean). How: (1) Extract teacher attention maps and last-layer image features conditioned on the question. (2) Train the student's visual-thought tokens to match them. (3) Then generate the answer. Why: Without this, the student may copy the words but miss the seeing. Example: If the teacher looks at the left wheel to judge "which wheel is bigger?", the student learns to look there too.
Imagine your teacher first gives you sketch paper with dim lights, so you must feel the forms, then turns the lights up. Curriculum Sensory Gating gradually opens direct vision to the answer tokens. How: (1) Start with a near-closed gate so answer tokens can't directly attend to pixels, forcing reliance on visual-thought tokens. (2) Smoothly open the gate over training. (3) End fully open to match inference. Why: Without the gate, the model shortcuts by staring at pixels and ignoring its visual thoughts; with a hard permanent mask, training and inference don't match. Example: Early on, the model must route "Which point is closer?" through its visual thoughts; later, with full vision open, it fine-tunes the decision.
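To make the latent-visual-thought idea concrete, here is a minimal sketch of how the token sequence could be laid out. The token names (such as <v_think_k>) and the Python framing are illustrative assumptions, not the paper's released code.

```python
# Illustrative only: hypothetical token names, assuming K latent visual thoughts
# are emitted between the question and the answer.
K = 4  # the paper reports K = 4 working best in their setup

def build_sequence(image_tokens, question_tokens, answer_tokens):
    """Lay out the sequence so the K visual-thought tokens come before the
    answer and can act as a visual-evidence bottleneck during gated training."""
    visual_thoughts = [f"<v_think_{k}>" for k in range(1, K + 1)]
    return image_tokens + question_tokens + visual_thoughts + answer_tokens

print(build_sequence(
    image_tokens=["<img_patch>"] * 3,                      # placeholder patch tokens
    question_tokens=["Which", "point", "is", "closer", "?"],
    answer_tokens=["B"],
))
```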
Anchor: LaViT's student first sketches a tiny internal map: "look at A and B," "compare sizes and overlaps." Only after that sketch matches the teacher's does it say, "B is closer." That pre-speech visual thinking is the core idea.
03 Methodology
At a high level: Image + Question → (A) Student generates K latent visual-thought tokens → (B) These tokens are trained to match the teacher's visual semantics and attention trajectory → (C) Student produces the final text answer, with a curriculum gate regulating direct vision access during training.
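These pieces can be read as one combined training objective. Below is a minimal PyTorch sketch of such a step, assuming the student's latent features have already been projected into the teacher's space and that attention maps are normalized over image patches; the function names, loss choices (cosine similarity, KL divergence), and weights are our assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def lavit_step(answer_logits, answer_labels,
               student_latent_feats, teacher_visual_feats,
               student_attn, teacher_attn,
               w_sem=1.0, w_traj=1.0):
    # (C) Speak accurately: standard next-token prediction on the answer text.
    ntp = F.cross_entropy(answer_logits.flatten(0, 1), answer_labels.flatten())
    # (B, "what to see"): pull latent-thought features toward teacher visual semantics.
    sem = 1.0 - F.cosine_similarity(student_latent_feats, teacher_visual_feats, dim=-1).mean()
    # (B, "where to look"): match the student's attention over patches to the teacher's.
    traj = F.kl_div(student_attn.log(), teacher_attn, reduction="batchmean")
    return ntp + w_sem * sem + w_traj * traj

# Toy shapes: batch 2, 4 latent tokens, 256-dim features, 64 patches, 8 answer tokens.
B, K, D, P, T, V = 2, 4, 256, 64, 8, 32000
loss = lavit_step(
    answer_logits=torch.randn(B, T, V),
    answer_labels=torch.randint(0, V, (B, T)),
    student_latent_feats=torch.randn(B, K, D),
    teacher_visual_feats=torch.randn(B, K, D),
    student_attn=torch.softmax(torch.randn(B, P), dim=-1),
    teacher_attn=torch.softmax(torch.randn(B, P), dim=-1),
)
```

The curriculum gate is not part of this loss; it acts inside attention during training and is sketched after the worked example further below.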
Step-by-step with the Sandwich pattern for every core piece:
- You know how a student traces a teacher's pencil marks to learn the exact strokes? Teacher Signals (What to distill): The teacher provides two white-box signals: (a) contextual visual semantics (what the teacher "sees" in response to the question), and (b) an attention trajectory (where the teacher looks over the image). How: (1) Run the teacher on Image + Question. (2) Extract last-layer image-token features (semantics) after they've interacted with the question. (3) Aggregate cross-attention over layers/heads to form a normalized attention map (trajectory). Why: Without these, you can only match final words, not the path and meaning behind them. Example: For "Which cup is taller?", the teacher's attention map lights up both cups, and its semantic features encode shape and top-edge cues.
- Imagine writing a handful of bullet notes before an oral presentation. Student Latent Visual Thoughts (K tokens): The student first writes K hidden tokens that summarize where to look and what to see before answering. How: (1) Given Image + Question, autoregressively emit v-trace1 … v-traceK. (2) These tokens become a bottleneck that must carry the needed visual evidence. (3) Then the model generates the text answer using those tokens as context. Why: Without the bottleneck, the model may skip forming visual thoughts and rely on language guesswork. Example: Before saying "left cup," the student's tokens encode "focus left region, compare heights."
- Think of aligning your sketch to a teacher's overlaid tracing paper. Semantic Reconstruction Alignment (What to see): Align the student's latent hidden states to the teacher's contextual visual concepts. How: (1) Pass the student's latent states through a small projector to the teacher's feature space. (2) Maximize similarity between student and teacher features. (3) Keep the teacher fixed as a semantic anchor. Why: Without this, the latent tokens may not capture the right concepts (e.g., confusing texture vs. color). Example: If the teacher encodes "striped pattern on shirt," the student's latent features are nudged to match that meaning.
- Like practicing your eye movements over the music notes exactly as your piano teacher does. Trajectory Alignment (Where to look): Match the distribution of the student's attention over image patches to the teacher's attention trajectory. How: (1) Treat the teacher's normalized attention as the target map. (2) Train the student's visual-thought-induced attention to agree with it. (3) Use only the top-K strongest teacher attention spots to avoid noise. Why: Without this, the student might describe the right concept but still look in the wrong place, leading to fragile reasoning. Example: For "color of the car logo," the student learns to look right at the logo, not the windshield. (A minimal sketch of the projector and the top-K trajectory target appears right after this list.)
- Picture dimmer lights for rehearsal, full lights on show night. Curriculum Sensory Gating (Prevent shortcuts): Control how much the answer tokens can directly see image pixels during training. How: (1) Early phase: the gate is nearly closed, so answer tokens can hardly attend to raw pixels; they must use latent visual thoughts. (2) Gradual warm-up: smoothly open the gate to avoid shocks. (3) Later phase: gate fully open, matching inference time. Why: Without this schedule, the model can ignore its visual thoughts (too open) or face train-test mismatch (too closed forever). Example: Early "Which point is closer?" relies on the latent notes; later, the model can refine using direct pixels too. (A small gate-schedule sketch follows the worked example below.)
- Following a recipe, you still taste and adjust the final dish. Next-Token Prediction (Speak accurately): Standard training to produce the correct answer tokens. How: (1) Compute the usual next-token loss on the final text. (2) Early on, gradients flow mainly through the visual thoughts because the gate dims pixel access. (3) Later, both latent thoughts and direct pixels are optimized together. Why: Without this, the model might align to the teacher's gaze/semantics but fail to express the correct words. Example: After forming the visual thought "B is nearer," the model learns to actually say "B."
- Cleaning up a messy desk so you only keep what matters. Data and Stability Tricks: Use a curated 15k set with correct, visually grounded samples and filter out weak traces; sparsify teacher attention to top-8 hotspots. How: (1) Keep only samples where the teacher's attention matches ground-truth regions. (2) Use top-K attention to denoise supervision. (3) Freeze the vision encoder, fine-tune the language backbone. Why: Without clean signals and sparsification, students may inherit the teacher's indecision or noise. Example: The distilled attention becomes sharper and more consistent than the teacher's.
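As referenced in the list above, here is a minimal sketch of two of the supervision mechanics: a small projector that maps student latent states into the teacher's feature space, and top-K sparsification of the teacher's attention map used as the trajectory target. Module and function names, layer choices, and shapes are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class LatentProjector(nn.Module):
    """Map student latent hidden states (d_student) into the teacher's visual
    feature space (d_teacher) so the semantic-reconstruction loss can compare them."""
    def __init__(self, d_student: int, d_teacher: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(d_student, d_teacher),
            nn.GELU(),
            nn.Linear(d_teacher, d_teacher),
        )

    def forward(self, latent_states: torch.Tensor) -> torch.Tensor:
        # latent_states: (B, K, d_student) -> (B, K, d_teacher)
        return self.proj(latent_states)

def sparsify_topk(teacher_attn: torch.Tensor, k: int = 8) -> torch.Tensor:
    """Keep only the k strongest patches of the teacher's attention and renormalize,
    so the student imitates the hotspots rather than diffuse, noisy mass."""
    vals, idx = teacher_attn.topk(k, dim=-1)
    sparse = torch.zeros_like(teacher_attn).scatter_(-1, idx, vals)
    return sparse / sparse.sum(dim=-1, keepdim=True)

# Example: a 24x24 patch grid flattened to 576 positions per sample.
teacher_attn = torch.softmax(torch.randn(2, 576), dim=-1)
trajectory_target = sparsify_topk(teacher_attn, k=8)
```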
Worked example (concrete flow):
- Input: A photo with two dots labeled A and B; question: "Which is closer to the camera?"
- Step A: Student emits 4 latent visual-thought tokens that (i) point to A and B and (ii) encode the idea "compare relative size/occlusion."
- Step B: These tokens are trained to match the teacher's last-layer image features for A/B and the teacher's attention map focusing on both points. Early training dims direct pixel access.
- Step C: The student generates the answer "B" with stronger grounding. If trajectory alignment were missing, it might still say "B" but look elsewhere, which is fragile and less reliable.
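Here is the gate-schedule sketch referenced earlier: the gate warms up from nearly closed to fully open, and a closed gate is realized as a large negative bias on answer-to-image attention logits. The cosine warm-up shape and the specific bias value are assumptions; the paper may schedule the gate differently.

```python
import math
import torch

def gate_value(step: int, warmup_steps: int, g_min: float = 0.0) -> float:
    """Smoothly open the gate over training; after warm-up it stays at 1.0,
    matching the fully open setting used at inference time."""
    if step >= warmup_steps:
        return 1.0
    progress = step / warmup_steps
    return g_min + (1.0 - g_min) * 0.5 * (1.0 - math.cos(math.pi * progress))

def gated_attention_bias(num_answer_tokens: int, num_image_tokens: int,
                         g: float, closed_bias: float = -1e4) -> torch.Tensor:
    """Additive bias on answer->image attention logits: strongly negative while
    the gate is closed (pixels effectively hidden), zero when fully open."""
    return torch.full((num_answer_tokens, num_image_tokens), (1.0 - g) * closed_bias)

# Early training: answer tokens are nearly blind to pixels and must route through
# the latent visual thoughts; by the end of warm-up, training matches inference.
print(gate_value(0, 1000), gate_value(500, 1000), gate_value(1000, 1000))
```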
Secret sauce: Forcing the student to think visually first (latent tokens) and aligning both meaning (semantics) and movement (attention trajectory), while a curriculum gate prevents cheating. This combination transfers the teacher's visual reasoning behavior, not just its words.
04 Experiments & Results
The test: Researchers measured how well models handle fine-grained perception (like MMVP) and complex visual reasoning (BLINK: Relative Depth, IQ-Test, Relative Reflectance, Spatial Relations), plus a robustness suite (MMStar). These tasks punish language-guessing and reward grounded seeing: tiny attributes, depth judgments, and geometric logic.
The competition: LaViT-3B was compared to its own backbone (Qwen2.5-VL-3B), larger open models (Qwen2.5-VL-7B), state-of-the-art latent or reasoning methods (LVR-7B, DMLR, PAPO, R1-OneVision), and even a proprietary model, GPT-4o.
The scoreboard with context:
- BLINK Relative Depth: LaViT-3B reaches 78.23%. That's like getting an A when others hover around B/C: baseline 3B at 61.29% and GPT-4o at 64.52%. It even tops LVR-7B (76.61%).
- BLINK IQ-Test: 32.0% for LaViT-3B vs 30.0% for GPT-4o and 28-30% for many baselines. On puzzles requiring mental manipulation, the student shows grounded improvements.
- BLINK Relative Reflectance: 45.52% for LaViT-3B vs 29.85% for the baseline 3B (+15.67 points), and also above LVR-7B by about 3 points, strong evidence of better structural visual semantics.
- MMVP (fine-grained perception): 67.33% for LaViT-3B, beating DMLR (61.33%) and PAPO (50.0%). This tackles "CLIP-blindness," where models miss subtle visual differences.
- MMStar (robustness vs language priors): LaViT-3B scores 54.07% vs 50.2% baseline, suggesting gains come from real seeing, not better guessing.
Why these numbers matter: Depth and reflectance are classic "don't-guess" tasks; you must look at occlusions, shading, and geometry. Big jumps (+16.94 points on depth, +15.67 on reflectance) show the student truly learned where to look before speaking.
Surprising findings:
- Sharper and steadier gaze: LaViT's attention is more concentrated (lower entropy) and more consistent across samples than both its baseline and even the large teacher. Top-K sparsification and data filtering seemed to "denoise" the teacher's variability, yielding a student with pinpoint focus. (A tiny entropy probe sketch follows this list.)
- Real reliance on visual thoughts: If you mask out latent tokens at inference, performance drops, proving the model actually uses those visual sketches.
- Curriculum matters: Removing the gradual sensory gate or training in one always-open stage cuts accuracy, especially on reflectance and MMVP. The gate prevents shortcut learning and builds durable grounding.
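For the concentration claim, the underlying probe is simple: compute the Shannon entropy of the attention distribution over image patches, where lower entropy means a sharper gaze. The snippet below is a minimal illustration of that measurement, not the paper's evaluation code.

```python
import torch

def attention_entropy(attn: torch.Tensor, eps: float = 1e-9) -> torch.Tensor:
    """attn: (B, num_patches) with rows summing to 1. Returns entropy in nats per sample."""
    return -(attn * (attn + eps).log()).sum(dim=-1)

focused = torch.tensor([[0.90, 0.05, 0.05]])     # gaze locked onto one patch
diffuse = torch.tensor([[1 / 3, 1 / 3, 1 / 3]])  # gaze spread evenly
print(attention_entropy(focused), attention_entropy(diffuse))  # focused < diffuse
```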
Ablations (what breaks without each piece):
- Without trajectory alignment (where-to-look), depth and IQ-Test fall notably: students know the concepts but glance at the wrong places.
- Without semantic reconstruction (what-to-see), attribute-heavy and reflectance tasks drop: students look correctly but misinterpret what's there.
- Without curriculum gating, MMVP plunges: students learn to peek at pixels and skip forming visual thoughts, leading to brittle reasoning.
Takeaway: A small 3B student, if taught to think visually first and guided to share the teacherâs gaze and meaning, can outperform bigger peers and even proprietary heavyweights on tasks where seeing is believing.
05 Discussion & Limitations
Limitations (specific and honest):
- Teacher dependence: LaViT needs a capable, white-box teacher to extract attention maps and contextual visual features. If the teacher's gaze is biased or wrong, the student may inherit those issues (though top-K denoising helps).
- Attention as a proxy: We assume attention trajectories capture "where the model looks." While useful, attention is an imperfect window into causality; other probes (e.g., gradient-based attributions) might refine alignment.
- Data scope: The curated 15k distillation set is carefully filtered but relatively small; broader, more diverse scenarios (e.g., specialized medical or scientific images) may require extra traces.
- Modality focus: The method is designed for image-text. Extending to video (temporal attention), audio, or 3D could need new trajectory definitions and gates.
- Hyperparameters: K (number of latent tokens) and the warm-up schedule are important. Too many tokens can add noise; the paper found K=4 best in their setup, but other domains may differ.
Required resources:
- Access to a strong, instrumented teacher model (attention and last-layer features available).
- Modest fine-tuning compute for a 3B student; vision encoder can be frozen to save cost.
- Clean, grounded data where "correct where-to-look" can be validated for filtering.
When not to use:
- Purely textual tasks where images add little; the visual thoughts/bottleneck won't help.
- Closed-source teachers that don't expose attention/features; white-box distillation isn't possible.
- Ultra-low-detail or highly compressed images where attention maps become noisy.
Open questions:
- Can we align beyond attention (e.g., causal patches or counterfactual probes) to better capture "why" the teacher looks there?
- How to scale to video: align motion-aware trajectories and temporal visual thoughts.
- Can we distill from multiple teachers to reduce bias and improve coverage?
- Can we learn the gate schedule automatically or adapt it per task difficulty?
- How robust is LaViT to distribution shifts (cartoons, medical scans, satellite)?
06 Conclusion & Future Work
Three-sentence summary: LaViT teaches a student model to form and align latent visual thoughts (small hidden sketches of what to see and where to look) before it speaks. By distilling the teacher's contextual visual semantics and attention trajectory, and by using a curriculum sensory gate to prevent shortcuts, the student actually learns to observe, not just guess. This yields large gains on depth, reflectance, and fine-grained perception, letting a 3B model rival or beat bigger systems and even GPT-4o on key benchmarks.
Main achievement: Turning visual grounding into a first-class, learned prerequisite to text generation, aligning both meaning and gaze, so the student inherits the teacher's visual reasoning rather than just its words.
Future directions: Extend trajectories to video and 3D, explore causal or counterfactual alignment beyond attention, combine multiple teachers for diversity, and adapt the gate schedule automatically. Investigate task-aware numbers of latent tokens and integrate robustness checks against dataset or teacher bias.
Why remember this: LaViT flips the script from "say the right thing" to "see the right thing, then say it." That simple shift (sketch first, speak second) turns small models into careful observers, reducing hallucinations and improving trust where it matters most.
Practical Applications
- Document QA assistants that accurately read charts and diagrams by first focusing on relevant regions.
- Shopping and comparison tools that attend to fine-grained attributes (colors, textures, labels) before recommending.
- Driver-assist analytics that judge relative depth (e.g., which car is closer) more reliably in dashcam feeds.
- Educational tutors that solve geometry problems by grounding steps in the actual figure, not just text patterns.
- Quality control in factories where subtle defects (scratches, misalignments) must be visually confirmed.
- Medical pre-triage support that highlights and reasons about small anomalies under expert supervision.
- Robotics perception modules that plan actions after forming latent visual thoughts about target objects.
- AR assistants that identify and compare real-world markers (e.g., sign distances, object sizes) with fewer hallucinations.
- Satellite or drone image analysis where attention must lock onto key regions (flooded areas, damaged roofs).
- Customer support bots that ground troubleshooting steps in photos or videos users upload (e.g., appliance parts).