Latent Implicit Visual Reasoning
Key Summary
- •Large Multimodal Models (LMMs) are great at reading text and looking at pictures, but they usually do most of their thinking in words, which limits deep visual reasoning.
- •This paper introduces Latent Implicit Visual Reasoning (LIVR), which teaches models to invent their own ‘visual thinking notes’ (latent tokens) without being told exactly what those notes should look like.
- •A special attention ‘bottleneck’ forces all image information to flow through these latent tokens during training, so the model learns to pack the right visual details into them.
- •Training happens in two stages: first, strict bottlenecking to make latents carry visual info; second, normal attention so the model learns to combine original image tokens with its new latents.
- •Across nine vision-heavy tasks (like jigsaw completion, localization, correspondence, reflectance, and visual similarity), LIVR beats standard supervised fine-tuning by up to +13% on individual tasks and by +6.24% on average (on one backbone).
- •In multi-task training, LIVR also wins across all six tested tasks, showing it’s task-agnostic and scales beyond single tasks.
- •Ablations show the gains come from both parts: adding latent tokens alone isn’t enough, and the bottleneck mask alone isn’t enough; the combo is key.
- •Compared to a method that requires helper images (Mirage), LIVR performs better on both Jigsaw (+19.4%) and Visual Spatial Planning (+20%) without any extra annotated visuals.
- •Attention visualizations reveal the latent tokens focus on the right regions (like the dog box or the cow count), indicating meaningful, learned visual structure.
- •The main limitation is interpretability: the learned latents are powerful but not as easy to read as step-by-step text explanations.
Why This Research Matters
Many useful decisions in everyday tools depend on seeing, not just reading—like picking matching photos, finding the right part to click, or checking product quality. LIVR lets AI learn its own visual ‘notes’ without expensive extra labels, so it can improve on a wide mix of visual tasks. That means better photo organizers, smarter design assistants, and more reliable vision in robotics and AR. Because the method is task-agnostic, one recipe can help many jobs instead of building a new system each time. The attention visualizations show it really looks at the right places, building trust. Over time, this could reduce costs and speed up the development of visual AI that’s accurate, adaptable, and broadly useful.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: You know how sometimes you can solve a picture puzzle in your head without putting it into words? Like seeing where jigsaw pieces fit just by looking at their shapes. 🥬 The Concept: Before this paper, most Large Multimodal Models (LMMs) tried to turn what they see into words and then did almost all their thinking in text. That’s like trying to solve a jigsaw by writing essays about every piece. It works for some tasks, but it struggles when the problem is mostly visual. Why it matters: If the model can’t truly think in pictures, it will fumble on tasks like matching shapes, tracking parts, or judging fine visual differences. 🍞 Anchor: Imagine asking, “Which of these two patches completes the missing corner of this photo?” Describing the exact textures and edges in words is clumsy; it’s much easier to just “see” it.
🍞 Hook: You know how a kid might focus more on a picture than on the caption to figure out what’s happening? 🥬 The Concept (Large Multimodal Models, LMMs): LMMs are AI systems that take in images and text and usually answer with text. How it works: (1) An image goes through a vision encoder; (2) a projector turns those visual features into a form the language model can read; (3) the language model uses text tokens (and some image-derived tokens) to generate an answer. Why it matters: If the language part dominates, the model may ignore rich visual structure and rely on guesses from text patterns. 🍞 Anchor: An LMM can read “Count the cows,” but if it mostly thinks in words, it might miss cows hiding behind a fence.
🍞 Hook: Picture solving a maze: sometimes your brain traces paths silently rather than narrating it. 🥬 The Problem: LMMs often have a “language bias”—they prefer to reason in text—even for tasks that are mostly visual (like matching styles, aligning points, or picking the most similar photo). How it works: Visual info is projected once and then gets overshadowed by the text engine. Why it matters: The model can’t form flexible, detailed visual abstractions (like shapes, layouts, or subtle textures) that are tough to describe in words. 🍞 Anchor: Choosing which picture looks most like a reference is often a “see it” judgment, not a “say it” one.
🍞 Hook: Imagine a teacher giving you step-by-step hints for every picture puzzle—helpful, but time-consuming and not always the best hints. 🥬 Failed Attempts (Explicit Supervision): Many methods try to give the model hand-designed visual steps, like crops, boxes, or helper images. How it works: The model is trained to produce or reuse these intermediates. Why it matters: This is costly to annotate, bakes in human guesses about what’s “useful,” and doesn’t generalize when the right middle steps are unclear. 🍞 Anchor: For visual similarity or reflectance, it’s hard even for humans to define the perfect intermediate pictures or labels.
🍞 Hook: Think of a Swiss Army knife that works for many jobs without swapping blades. 🥬 The Gap: We need a task-agnostic way for models to discover and use their own visual reasoning steps—without handcrafted instructions or extra labels. How it works: Let the model invent internal “visual notes” it finds helpful, guided only by whether it answers correctly. Why it matters: This scales to many tasks and avoids costly, biased supervision. 🍞 Anchor: Whether the task is counting balloons or matching a paint style, the same mechanism should help.
🍞 Hook: Imagine self-invented ‘sticky notes’ where your brain writes down the important visual bits before answering. 🥬 The Stakes: In daily life—sorting photos, checking product defects, understanding diagrams—judgments depend on fine visual cues. AI that can truly “think in pictures” can make fewer silly mistakes and help in broader, more visual jobs. Why it matters: Better visual reasoning improves safety (finding the right part), accuracy (matching layouts), and trust (the AI focuses on what actually matters visually). 🍞 Anchor: A phone app that picks the best photo look-alike for your ID picture benefits from real visual reasoning, not just wordy guesses.
02 Core Idea
🍞 Hook: You know how you might circle important parts of a picture before answering a question about it? 🥬 The Aha! Moment: Give the model a small set of special, learnable “visual thinking notes” (latent tokens) and force all image information to pass through them first, so they become the place where useful visual knowledge is stored—without any extra labels. Why it matters: The model invents the right visual abstractions for each task on its own. 🍞 Anchor: If the question is “Which patch completes the photo?”, the latent notes will focus on edges and textures that decide the answer.
Multiple analogies for the same idea:
- Backpack pockets: The model gets new pockets (latent tokens). During training, the rule is “put all image clues in those pockets before answering.” Soon it learns which clues to pack.
- Flashlight cones: The model must shine its flashlights (latents) on the image to fetch info; the answer can only see what the flashlights find.
- Sticky notes: The model writes down the key visual facts on sticky notes; the final answer reads from the notes, not directly from the image.
Before vs After:
- Before: Image is projected once, then most reasoning happens in text; visual structure can get diluted.
- After: Dedicated latent tokens re-encode the image in a task-aware way; the model keeps a living, compact visual memory aligned with the question.
Why it works (intuition, no equations):
- Pressure to carry signal: The bottleneck mask blocks the answer from looking at raw image tokens, so the only way to do well is to stuff the needed visual info into the latent tokens.
- More expressive than words: Text tokens are about language; latents are free-form vectors that can encode shapes, layouts, and alignments.
- Reduced language bias: Since the answer can’t peek at image tokens, it must depend on visual latents, not just word patterns.
Building blocks (each with a Sandwich):
- 🍞 Hook: You know how you create your own mini legend for a map? 🥬 Latent Tokens: Special learnable tokens added to the vocabulary that hold task-relevant visual info. How it works: They are appended after the question; their embeddings are trainable; the model doesn’t output them—it uses them internally. Why it matters: They act as the model’s private visual notes. 🍞 Anchor: For counting cows, some latents light up over each cow-like region.
- 🍞 Hook: Imagine a doorway that everyone must pass through. 🥬 Bottleneck Attention Masking: An attention rule that prevents answer and prompt tokens from directly attending to image tokens during Stage 1. How it works: The answer can read the question and the latents, but not the image tokens; the question also can’t read image tokens, so all image info must flow through latents. Why it matters: This forces the latents to carry the visual facts (a minimal mask sketch follows this list). 🍞 Anchor: To choose the right bounding box, the answer needs edge/shape cues, so those cues must appear in the latents.
- 🍞 Hook: Practice with training wheels, then ride free. 🥬 Two-Stage Training: Stage 1 uses the bottleneck so latents learn to store visual info; Stage 2 removes the bottleneck so the model learns to blend latents with original image tokens. Why it matters: Stage 1 teaches latents to matter; Stage 2 teaches teamwork. 🍞 Anchor: After Stage 1, the model can both read its notes and also double-check the original picture.
- 🍞 Hook: Like grading answers, not the notes you wrote. 🥬 NLL Loss on Answers: The training score only checks if the final text answer is right, not what the latents look like. How it works: This end-to-end signal makes latents learn whatever helps the answer most. Why it matters: No extra labels or helper images are needed. 🍞 Anchor: The model figures out by itself which visual patterns help it win the quiz.
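To make the bottleneck rule concrete, here is a minimal PyTorch sketch of the Stage 1 attention mask described above. It assumes a simplified token order of [image | question | latent | answer], ignores the causal (left-to-right) masking a real decoder also applies, and uses an illustrative function name; it is not the paper's actual implementation.

```python
import torch

def stage1_bottleneck_mask(n_img: int, n_q: int, n_lat: int, n_ans: int) -> torch.Tensor:
    """Boolean attention mask for the Stage 1 bottleneck (True = may attend).

    Assumed token order: [image | question | latent | answer].
    Causal masking of a real decoder is omitted for brevity.
    """
    n = n_img + n_q + n_lat + n_ans
    img = slice(0, n_img)
    q = slice(n_img, n_img + n_q)
    lat = slice(n_img + n_q, n_img + n_q + n_lat)
    ans = slice(n_img + n_q + n_lat, n)

    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[img, img] = True   # image tokens encode among themselves
    mask[q, q] = True       # the question reads only itself, not the image
    mask[lat, img] = True   # latents pull information from the image...
    mask[lat, q] = True     # ...and from the question, so they are task-aware
    mask[lat, lat] = True
    mask[ans, q] = True     # the answer reads the question...
    mask[ans, lat] = True   # ...and the latents, but NOT the raw image tokens
    mask[ans, ans] = True
    return mask
```

In words: the latents may read the image and the question, while the answer may read only the question and the latents, which is exactly the Stage 1 routing described above.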
03 Methodology
At a high level: Input (images + question) → add K latent tokens → Stage 1: bottleneck masking (answers can’t see image tokens) → train on answer correctness → Stage 2: normal masking (answers can see image and latents) → output text answer.
Step-by-step with Sandwich explanations for new pieces:
- 🍞 Hook: Think of a smart assistant that can look and read. 🥬 Large Multimodal Model (LMM) Setup: There’s a vision encoder (turns image into tokens), a projector (fits vision features into language space), and a language model (generates the text answer). Why it matters: This is the engine we’re upgrading with visual notes. 🍞 Anchor: The model sees a photo, reads “Which box best outlines the dog?”, and then must answer A or B.
- 🍞 Hook: Imagine slipping blank index cards into your notebook. 🥬 Insert Latent Tokens: We introduce K special tokens and append them after the question. They start random and are trainable. The model never outputs them; it only reads/writes them inside. Why it matters: These are the model’s own visual scratchpad (a code sketch of the latent tokens and the two-stage loop follows this step list). 🍞 Anchor: For a jigsaw, some latents store corner lines; others store table texture.
- 🍞 Hook: A checkpoint gate where only certain paths are allowed. 🥬 Apply Bottleneck Attention Mask (Stage 1): Modify attention so answers and prompt tokens cannot attend to image tokens; they can attend to latents. Latents can attend to image tokens and the prompt. Why it matters: All visual info must pass through the latents, so they become meaningful. 🍞 Anchor: If the model picks the correct reflectance choice (A/B/C), it must have compared those two marked points through its latents.
- 🍞 Hook: Start strict, then relax. 🥬 Train Stage 1 with NLL on Answer Tokens: Only the correctness of the final answer is graded. No helper images or bounding boxes are provided. Why it matters: The model discovers for itself what to encode in the latents to get answers right. 🍞 Anchor: On semantic correspondence, latents learn to align the REF keypoint to the right place in the other image.
- 🍞 Hook: Now allow both the notes and the book. 🥬 Stage 2: Restore Standard Attention: The answer can now attend to both image tokens and latents. The loss is the same. Why it matters: The model learns to combine its learned notes with the raw image for best accuracy. 🍞 Anchor: For localization, it can use the latents’ summary while double-checking edges in the original image tokens.
- 🍞 Hook: Adjust a few knobs instead of rebuilding the machine. 🥬 Parameter-Efficient Tuning (LoRA): Fine-tune mainly low-rank adapters in attention/MLP plus the latent-token embeddings; keep the vision encoder and projector frozen. Why it matters: Efficient training that still teaches the language backbone to use the new notes. 🍞 Anchor: You upgrade the “brain’s habits” cheaply while letting the “eyes” stay the same.
- 🍞 Hook: Pop quiz scoring. 🥬 Task Setup and Metrics: Most tasks are multiple-choice (top-1 accuracy); counting is open-ended exact match. Why it matters: Clear scores show if visual reasoning improved. 🍞 Anchor: If the correct box is B and the model outputs B, that’s a hit.
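The sketch below pulls these steps together. It is a minimal sketch, assuming a causal LMM that accepts precomputed input embeddings and a custom attention mask; `build_inputs_with_latents` is a hypothetical helper that splices image, question, latent, and answer embeddings together and builds answer-only labels, and `stage1_bottleneck_mask` refers to the earlier mask sketch. K=16 and the 4+6 epoch split mirror the settings reported in the ablations; everything else is illustrative, not the paper's actual code.

```python
import torch
import torch.nn.functional as F

NUM_LATENTS = 16                      # K latent tokens (the ablation's best setting)
STAGE1_EPOCHS, STAGE2_EPOCHS = 4, 6   # the balanced schedule reported to work best


class LatentTokens(torch.nn.Module):
    """K trainable embeddings appended after the question (the 'visual notes')."""

    def __init__(self, k: int, hidden_size: int):
        super().__init__()
        self.embed = torch.nn.Parameter(torch.randn(k, hidden_size) * 0.02)

    def forward(self, batch_size: int) -> torch.Tensor:
        # Same K notes for every example; their content is learned end to end.
        return self.embed.unsqueeze(0).expand(batch_size, -1, -1)


def training_step(model, batch, latents: LatentTokens, use_bottleneck: bool):
    """One simplified step; `build_inputs_with_latents` is a hypothetical helper."""
    inputs_embeds, labels, sizes = build_inputs_with_latents(batch, latents)

    if use_bottleneck:
        # Stage 1: question/answer positions cannot attend to image tokens, so
        # all visual information must be routed through the latent tokens.
        attn_mask = stage1_bottleneck_mask(*sizes)   # from the earlier sketch
    else:
        # Stage 2: standard attention; the answer reads both latents and image.
        attn_mask = None

    logits = model(inputs_embeds=inputs_embeds, attention_mask=attn_mask).logits
    # NLL on answer tokens only; all other positions carry label -100 (ignored).
    # Causal label shifting is omitted here for brevity.
    return F.cross_entropy(logits.flatten(0, 1), labels.flatten(), ignore_index=-100)


def train(model, latents, loader, optimizer):
    for epoch in range(STAGE1_EPOCHS + STAGE2_EPOCHS):
        for batch in loader:
            loss = training_step(model, batch, latents,
                                 use_bottleneck=epoch < STAGE1_EPOCHS)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
```

The only switch between the two stages is the attention mask; the loss, data, and trainable parameters stay the same.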
Concrete example with data:
- Visual Similarity: Input has a reference image and two candidates (A, B). Question: “Which is more similar to the reference?” K=16 latent tokens sit after the question. Stage 1 masking blocks answer/prompt from image tokens, so the latents must capture textures, colors, and layouts that decide similarity. The model is trained only on whether it says A or B correctly. In Stage 2, the mask is lifted so the model can combine latents with image tokens and refine choices. The result: higher accuracy than direct fine-tuning without latents.
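As a toy illustration of this layout (not the paper's actual tokenizer or prompt template), the snippet below uses placeholder strings to show where the K=16 latent slots sit in the sequence; the helper names and token counts are made up for readability.

```python
K = 16  # number of latent 'visual notes'

def image_tokens(name: str, n: int = 4) -> list:
    # Toy stand-in: a real LMM produces hundreds of visual embeddings per image.
    return [f"<img:{name}:{i}>" for i in range(n)]

def latent_tokens(k: int) -> list:
    return [f"<latent:{i}>" for i in range(k)]

question = "Which candidate, A or B, is more similar to the reference image?"
answer = "A"  # only the answer positions are scored by the training loss

sequence = (
    image_tokens("reference") + image_tokens("cand_A") + image_tokens("cand_B")
    + question.split()
    + latent_tokens(K)  # appended AFTER the question, so the notes can condition on the task
    + [answer]
)
# Stage 1: the bottleneck mask stops the question/answer positions from attending to
# the <img:...> spans; Stage 2: the mask is lifted and the same answer-only loss is kept.
print(len(sequence), sequence[-K - 1:-1])  # the 16 latent slots sit right before the answer
```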
What breaks without each step:
- No Latents: The model would try to squeeze complex visual info through old text tokens, which is awkward and weaker.
- No Bottleneck: The model could ignore latents and rely on language bias; results drop.
- No Stage 2: The model might not learn to blend its new notes with raw image cues; performance lags.
- No End-to-End Loss: If we graded intermediate pictures, it would require labels and miss task-agnostic benefits.
Secret sauce:
- The truly clever part is the enforced information flow: by controlling who can “look” at what during Stage 1, the model is gently forced to invent and use an internal visual scratchpad tailored to each task—no hand-designed targets needed.
04 Experiments & Results
🍞 Hook: Imagine a decathlon for vision—counting, puzzles, matching, and more. 🥬 The Test: The authors evaluate on nine perception-heavy tasks (mostly BLINK-style): counting (PixMo-Count), jigsaw completion, object localization, visual and semantic correspondence, functional correspondence, art style classification, relative reflectance, and visual similarity. Why it matters: These tasks stress real visual reasoning, not just reading. 🍞 Anchor: It’s like asking, “Can you find the matching point? the right box? the most similar picture?”
Baselines and competitors:
- Zero-shot: Use the base instruction-tuned LMM without task training.
- Direct SFT: Standard supervised fine-tuning on 1,000 examples per task.
- Mirage: A latent-reasoning approach that relies on task-specific helper images (explicit supervision).
Scoreboard with context (highlights):
- Single-task results with Qwen2.5-VL-3B: LIVR beats Direct SFT by +6.24% average over nine tasks. Big wins on hard visual-abstraction tasks: +12.00% on Jigsaw, +13.02% on Functional Correspondence, and solid gains on Art Style, Visual Similarity, and Relative Reflectance.
- With Qwen3-VL-4B: +3.43% average improvement over Direct SFT across tasks, again showing generality.
- With LLaVA-OneVision-1.5-4B: +5.60% average improvement, including a huge +27.40% on Functional Correspondence. Think of it as going from a B to an A on tough, visual-heavy exams.
Multi-task training (Qwen3-VL-4B on six tasks):
- LIVR improves over Direct SFT on all six tasks, with mean +2.77%. That’s like consistently adding a few extra points across every event in a multi-sport meet—small per task, big overall. It shows the same task-agnostic method scales when tasks are mixed together.
Head-to-head with Mirage (uses helper images):
- Jigsaw: Zero-shot 49.33 → Mirage 48.60 → LIVR 68.00 (+19.40 vs Mirage).
- Visual Spatial Planning: Zero-shot 6.00 → Mirage 46.00 → LIVR 66.00 (+20.00 vs Mirage).
- Takeaway: Without any special helper images, LIVR wins clearly.
Surprising findings and ablations:
- Latents-only (no Stage 1 bottleneck): The model learns to ignore the extra tokens; removing latents doesn’t hurt it, and performance under a forced bottleneck drops to random.
- LIVR with bottleneck: Removing latents causes a clear accuracy drop, and attention maps show strong focus from latents to meaningful image regions.
- Mask-only (no new latents): Worse than LIVR—repurposing old text tokens as a bottleneck is harder; fresh latents learn better.
- Stage schedule: A balanced 4-epoch Stage 1 and 6-epoch Stage 2 works best; too little or too much Stage 1 hurts.
- Number of latents: K=16 outperforms 4 or 8 (too small) and 32 (too diffuse).
- Placement: Putting latents after the question works better than before it (they can condition on the task).
Visualizations:
- Attention overlays show the latents look at the right places: dog or motorcycle boxes, cow and balloon counts, and the toothbrush handle for correspondence.
- t-SNE plots suggest many latents live near image-token space (they learn visual-like features), with a subset forming a distinct cluster (specialized abstractions).
Bottom line with context:
- Across diverse backbones and tasks, LIVR consistently turns in higher scores than direct fine-tuning—often the difference between getting stuck on visual subtleties and nailing them. The strongest gains appear exactly where visual structure matters most and is hardest to describe in words.
05 Discussion & Limitations
🍞 Hook: Think of powerful but private notes—you know they help, but reading them isn’t easy. 🥬 Limitations: The latent tokens are less interpretable than step-by-step text rationales; we can see where they attend but not always “why” in words. Hyperparameters (like K and stage split) matter; too few latents or the wrong schedule hurts. Also, while the method is efficient (LoRA + a few embeddings), training still needs GPUs and curated data per task. Why it matters: For safety-critical settings, you may want clearer explanations; and tuning takes care. 🍞 Anchor: If a hospital wants a fully verbal chain-of-thought, these silent visual notes might not be enough.
Required resources:
- An instruction-tuned LMM backbone (e.g., ~3–4B parameters).
- Standard vision encoder/projector frozen; LoRA on the language backbone; memory for K latent embeddings (a hedged LoRA config sketch follows this list).
- Modest but specific training datasets (1k examples per task in the paper).
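For orientation, here is a hedged sketch of a parameter-efficient setup using the Hugging Face peft library. The rank/alpha/dropout values and the Qwen-style module names are illustrative assumptions, not the paper's exact configuration, and `language_backbone` stands in for whatever instruction-tuned LMM you load.

```python
from peft import LoraConfig, get_peft_model

def add_lora_adapters(language_backbone):
    """Attach LoRA adapters to the language backbone (hedged, illustrative settings)."""
    lora_config = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"],
        task_type="CAUSAL_LM",
    )
    # get_peft_model freezes all non-adapter weights, which matches keeping the
    # vision encoder and projector frozen; remember to keep the K latent-token
    # embeddings trainable alongside the adapters (see the LatentTokens sketch above).
    return get_peft_model(language_backbone, lora_config)
```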
When not to use:
- If the task is mostly textual (like reading long documents or solving algebra from typed equations), text-only reasoning may suffice.
- If you require human-readable intermediate steps for auditing (e.g., legal explanations), this latent approach won’t provide them directly.
- If you have abundant, high-quality explicit visual annotations that perfectly fit your task, supervised intermediates might still shine.
Open questions:
- Can we make latents more interpretable (e.g., auto-caption the latent space)?
- How do we best scale K with model size and dataset size?
- Can we adaptively allocate more/less latent compute per question?
- How will this approach perform on larger backbones and much bigger, messier datasets?
- Can we combine latent visual notes with concise, helpful text rationales for the best of both worlds?
06 Conclusion & Future Work
Three-sentence summary: This paper introduces LIVR, which gives a model learnable visual notes (latent tokens) and uses a bottleneck to force all image information through them during training, so the model learns visual abstractions by itself. After this, the model combines its notes with the image to answer, leading to strong gains across nine vision-heavy tasks and in multi-task training. It outperforms standard fine-tuning and even methods that require helper images—all without extra annotated intermediates.
Main achievement: A simple, task-agnostic recipe—latent tokens plus bottleneck attention masking—consistently improves visual reasoning by letting the model invent its own internal visual representations.
Future directions: Scale to larger models and datasets, explore adaptive or hierarchical latents, improve interpretability (e.g., visualizing or labeling latent concepts), and blend latent notes with brief textual rationales. Investigate dynamic schedules that adjust Stage 1/2 time per task and smarter placement/number of latents.
Why remember this: LIVR shows that gently steering how information flows (rather than telling the model exactly what to draw or crop) can unlock better visual reasoning. It’s a clean, general trick that travels across tasks, avoids annotation costs, and teaches models to truly “think in pictures” when pictures matter most.
Practical Applications
- •Photo library helpers that find the most similar image to a reference (e.g., best match for an ID or product).
- •Quality control in manufacturing that compares tiny visual differences and highlights correct boundaries.
- •Robotics perception that aligns keypoints across views to grasp or pour accurately without handcrafted hints.
- •AR/VR scene understanding that localizes objects and matches parts across frames with fewer labels.
- •Education tools that solve visual puzzles (like jigsaws or spot-the-difference) and explain focus regions via attention maps.
- •Design search engines that retrieve art or styles most similar to a reference, improving creative workflows.
- •Medical pre-screening aids that compare regions across images (e.g., correspondence) while highlighting attended areas.
- •Geospatial analysis to match landmarks across satellite images taken from different angles or times.
- •E-commerce tools that recommend visually similar products and verify style/fit consistency from photos.
- •Document and UI analysis that localizes elements and matches corresponding components across different layouts.