Forest Before Trees: Latent Superposition for Efficient Visual Reasoning
Key Summary
- This paper introduces Laser, a new way for vision-language models to think in their hidden space before speaking, so they see the whole "forest" before picking out the "trees."
- Instead of forcing the model to guess the very next word, Laser uses Dynamic Windowed Alignment Learning (DWAL) to keep a soft pool of future possibilities and gradually narrow down.
- Laser prevents premature semantic collapse (tunnel vision) by letting the model hold global context first and only later commit to specific details.
- A self-refined superposition target stabilizes training by using the model's own soft predictions inside a dynamic semantic window.
- An entropy-regularized intervention steps in with harder guidance when the model is very unsure and backs off when it gains confidence.
- Laser is highly efficient, reducing inference tokens by more than 97% while achieving state-of-the-art results among latent reasoning methods on six benchmarks.
- It beats a strong baseline (Monet) by an average of 5.03% and shows especially large gains on hallucination and perception-heavy tasks.
- Laser's hidden states are interpretable: you can decode them into readable tokens to see the model's reasoning steps.
- It generalizes well to out-of-distribution tasks and improves chart and web understanding without forgetting core spatial skills.
- Overall, Laser encourages a shift from explicit next-token rationales to compact, decodable, and efficient latent visual reasoning.
Why This Research Matters
Better visual reasoning means smarter apps that see and think more like people. Laser keeps rich scene context early and locks in details later, so it reduces silly mistakes and hallucinations. Because it does most thinking inside hidden states, it answers much faster and uses fewer tokens, which saves time and cost. It also stays interpretable: you can peek at its internal steps, helping build trust. This helps with reading charts and webpages, making safer vision decisions, and handling tricky illusions. The approach can spread to many domains, encouraging training methods that value global context before local details.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): Imagine you're looking at a big, busy photo of a skatepark. Before you spot the exact boy's T-shirt, your brain first takes in the whole scene: ramps, helmets, murals, and groups of kids. You see the forest before the trees.
🥬 Filling (The Actual Concept): Visual reasoning is the process of turning pixels into understanding step by step. Classic vision-language models take an image and a question, then produce words. Chain-of-Thought (CoT) made them better by asking them to "think out loud" with sentences. But there's a catch: squeezing rich, continuous visual details into chunky, discrete words wastes a lot of nuance, like saving a high-resolution photo as a tiny GIF. This is the information bandwidth bottleneck. On top of that, most models are trained to always predict the very next word. That works for writing text, but it can be harmful for visual reasoning, which is naturally hierarchical: it flows from global context to local details. Forcing a next-token guess too early can make the model lock onto the wrong thing, a problem called premature semantic collapse.
🍞 Bottom Bread (Anchor): Picture asking, "What is the boy to the right of the helmet wearing?" If the model locks onto the word "helmet" too soon, it may miss the bigger scene and pick the wrong boy. It needs to understand the whole skatepark first, then zoom into the right person, then name the clothing.
🍞 Top Bread (Hook): You know how telling a story slowly, step by step, helps you think? That's CoT. It's like explaining your homework aloud before writing the answer.
🥬 Filling (The Actual Concept): Chain-of-Thought is a strategy where models generate textual reasons, one token at a time, before giving the answer. It helps, but it still uses words as the thinking medium. For vision tasks, that means the model must constantly convert soft, continuous visual features into hard, discrete tokens. Important micro-details can be thrown away during this translation, especially early on when the model should still be exploring.
🍞 Bottom Bread (Anchor): If the model must write a sentence like "boy near helmet is wearing..." too early, it might commit to "boy near helmet" before confirming which boy is actually to the right of the helmet.
🍞 Top Bread (Hook): Imagine trying to guess the next word in a sentence even before you've read the whole paragraph. You might jump to the wrong conclusion.
🥬 Filling (The Actual Concept): Autoregressive next-token training makes a model minimize the error on the immediate next word at each step. For images, this can push the hidden state to become too specific too soon. The result is tunnel vision: the model narrows to one token before it has the big picture.
🍞 Bottom Bread (Anchor): While answering "Which image has the same style?", focusing too fast on "Cartoon" might miss that "Street Art" is a better global match.
🍞 Top Bread (Hook): Think of two ways to think: (1) write every thought as a sentence, or (2) keep a flexible mental picture and only speak at the end. Which keeps more detail?
🥬 Filling (The Actual Concept): Latent reasoning does computations inside hidden vectors instead of spelling everything out as text. It keeps information continuous and rich. But if trained with rigid next-token targets, those vectors can still collapse into single points too early.
🍞 Bottom Bread (Anchor): A model might internally hold a fuzzy map of the whole skatepark (ramps, kids, mural colors) before focusing on the boy's shirt. If forced to choose a word too soon, it loses that fuzzy map.
🍞 Top Bread (Hook): Imagine using a camera: you start wide (whole scene) and then zoom in (details). If you zoom instantly, you might miss where to look.
🥬 Filling (The Actual Concept): The paper's gap is this: we lacked a way to let the model carry a soft, global "superposition" of future possibilities and then narrow down naturally. Existing latent methods often still force a step-by-step, single-token target, inviting premature collapse.
🍞 Bottom Bread (Anchor): We need a training rule that says, "At first, it's OK to keep several future ideas in mind; later, pick one."
🍞 Top Bread (Hook): Why should you care? Think of apps you use: photo search, reading charts, safe self-driving, and spotting fakes online.
🥬 Filling (The Actual Concept): If models keep the forest before the trees, they make fewer silly mistakes (hallucinations), understand dense pages (charts, web layouts), and notice tricky visual cues (depth, shadows, textures). They can also answer faster by doing most thinking silently in the latent space.
🍞 Bottom Bread (Anchor): When scanning a crowded sports photo, a "forest-first" model is more likely to pick the correct seat section and person location, then give a precise answer, quickly and reliably.
In short, before this research, models were good at describing or guessing word-by-word but struggled to keep rich visual context as they reasoned. The problem was a training habit (predict the next token now) that conflicts with how visual understanding should flow. Past attempts helped but still collapsed too early. Laser fills this gap with a training rule that rewards holding a soft bundle of future meanings early on and shrinking it over time, which lines up with how humans scan scenes. This matters in daily life because it leads to smarter, faster, and more trustworthy visual AI.
02 Core Idea
🍞 Top Bread (Hook): You know how a good detective keeps several suspects in mind, then gradually rules them out as clues appear? They don't accuse someone at the first hint.
🥬 Filling (The Actual Concept): The key insight: train the model's hidden state to align with a dynamic window of valid future meanings (not just the next word), so it can carry a probabilistic superposition of possibilities and only commit later. This is Dynamic Windowed Alignment Learning (DWAL). Early on, the window is wide (many valid future tokens), keeping global context; as reasoning progresses, the window shrinks, pushing the model toward precise details. This prevents premature semantic collapse and preserves rich visual nuance in the latent space.
🍞 Bottom Bread (Anchor): When asked "What is the boy to the right of the helmet wearing?", the model first holds a soft set {skatepark, mural colors, multiple boys, clothing types}, then narrows to {short-sleeve, jeans} and finally answers.
Multiple analogies for the same idea:
- Map Zoom: Start with a city map (global), then zoom to the neighborhood, street, and house (local). You don't start at the house number before you know the city.
- Grocery Basket: Keep a small set of possible recipes in mind while shopping. As you see what's in stock, you drop some options and finalize one dish at checkout.
- Orchestra Tuning: First, the whole orchestra warms up (a mix of sounds). Then sections synchronize, and finally instruments lock into a crisp melody.
🍞 Top Bread (Hook): Imagine a telescope that tightens its focus ring as you look longer.
🥬 Filling (The Actual Concept): Dynamic Semantic Windows are moving "validity zones" over future tokens. Early steps allow many future concepts; later steps allow fewer. Training aligns the current hidden state with the soft distribution over that window. Why it matters: without windows, the model is forced to pick a single word too soon and loses the forest.
🍞 Bottom Bread (Anchor): Step 1: {Skatepark, Crowd, Helmet, Boys}; Step 2: {Boy-right-of-Helmet, Clothing}; Step 3: {Short-Sleeve, Jeans}.
🍞 Top Bread (Hook): Think of a smoothie where flavors are blended but still influence the final taste.
🥬 Filling (The Actual Concept): Latent Superposition means the hidden state carries a mix of likely future semantics rather than collapsing to one. It works by creating a soft target distribution inside the window (using the model's own logits with stop-gradient) so the hidden state learns to represent a balanced mixture. Why it matters: without superposition, the model can't keep multiple candidates alive and might lock onto the wrong detail.
🍞 Bottom Bread (Anchor): The hidden state may blend signals for "short-sleeve" and "jeans" before finally confirming both.
🍞 Top Bread (Hook): Imagine a coach who helps more when you look confused and steps back when you're doing fine.
🥬 Filling (The Actual Concept): Entropy-Regularized Intervention checks uncertainty (entropy). If uncertainty is high, it mixes in a stronger hard target for the immediate next token; if not, it sticks with the soft superposition. Why it matters: without it, the model could drift into a blurry, unfocused state.
🍞 Bottom Bread (Anchor): If the model is unsure whether the subject is the "spectator" or the "player," the coach briefly says, "Anchor on 'fence' next," then lets the model continue softly.
Before vs After:
- Before: Hidden states are trained to be sharp pointers to the very next word, risking tunnel vision and hallucinations.
- After: Hidden states are trained to be soft containers of several valid futures at first, then gracefully narrowed, mirroring human global-to-local perception.
Why it works (intuition, not equations):
- Early stages should capture lots of scene context; a soft window target rewards that.
- As windows shrink, the model is nudged from exploration (broad) to exploitation (precise).
- Self-refined superposition stabilizes learning by using the model's own frozen snapshot of beliefs as a teacher inside the window.
- Entropy-based guidance keeps the model from spreading too thin.
Building blocks:
- Dynamic Semantic Windows (moving from many valid futures to few).
- Self-Refined Superposition (soft targets from the model's own logits with stop-gradient).
- Entropy-Regularized Intervention (adaptive hard mix-in when uncertain).
- Decodable trajectories (interpretability by reading top tokens from hidden states).
Together, these create Laser's "forest-before-trees" training and inference behavior.
03 Methodology
High-level recipe: Input Image + Question → Latent Reasoning with Dynamic Windows (DWAL) → Explicit Answer Generation.
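Before walking through the steps, here is a minimal sketch of what that recipe could look like at inference time. It assumes a hypothetical model interface; `encode`, `latent_step`, and `generate_answer` are illustrative names, not the paper's actual API.

```python
import torch

@torch.no_grad()
def laser_inference(model, image, question, max_latent_steps=8):
    """Sketch: think silently in latent space, then speak the answer."""
    state = model.encode(image, question)            # fuse image + question
    for _ in range(max_latent_steps):
        # One latent reasoning hop; also peek at the top decodable token
        # so we can detect the end-of-reasoning marker.
        state, top_token = model.latent_step(state)
        if top_token == "<laser_end>":               # reasoning phase done
            break
    return model.generate_answer(state)              # explicit, precise output
```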
Step 0: Prepare the training signals with cognitive scanpaths. 🍞 Top Bread (Hook): You know how a teacher might give you a worked example showing the big idea first, then the steps to the answer? 🥬 Filling (The Actual Concept): Cognitive Scanpaths are short sequences of visual concepts that follow a global-to-local order, like "Skatepark → Crowd → Helmet → Boy-right-of-Helmet → Short-Sleeve, Jeans." How it works: The authors synthesize these sequences using a strong vision-language model (GPT-4o) with prompts that enforce global-first scanning and strict, concise tokens. Why it matters: The model learns valid reasoning flows without needing costly boxes or masks. 🍞 Bottom Bread (Anchor): For a chart question, a scanpath might be "Bar Chart → Legend → Blue Bars → Highest Bar → Category Name."
Step 1: Build the latent reasoning trajectory H.
- What happens: The model processes the image and question and (optionally) previous concept tokens to produce hidden states h1, h2, ..., hT. A fixed LM head can map each hidden state to a probability over vocabulary tokens (for decoding and supervision).
- Why this step exists: Hidden states are where rich, continuous thinking lives; we align them to stay broad at first and precise later.
- Example data: Image: skatepark; Q: "What is the boy to the right of the helmet wearing?" Hidden state h1 should reflect the broad scene; h2 should reflect the right subject; h3 should surface clothing tokens.
Step 2: Define Dynamic Semantic Windows Wt. 🍞 Top Bread (Hook): Imagine a flashlight beam that starts wide and narrows as you approach your goal. 🥬 Filling (The Actual Concept): A Dynamic Semantic Window at step t includes all future concept tokens from positions t to T in the scanpath. Early steps allow many future possibilities; later steps allow fewer. Why it matters: Without windows, the model must point to one token immediately, risking collapse. 🍞 Bottom Bread (Anchor): At t=1, W1 might include {Skatepark, Crowd, Helmet, Boy-right-of-Helmet, Short-Sleeve, Jeans}; at t=3, W3 might include {Boy-right-of-Helmet, Short-Sleeve, Jeans}.
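As a toy sketch (the scanpath is the paper's running example; the function name is ours), the window at step t is simply the suffix of the scanpath from t onward:

```python
def semantic_window(scanpath, t):
    """Dynamic Semantic Window W_t: all future concepts from step t to T.

    Early steps get wide windows (many valid futures); later steps get
    narrow ones, which is what drives the global-to-local narrowing.
    """
    return scanpath[t:]  # shrinks as t grows

scanpath = ["Skatepark", "Crowd", "Helmet",
            "Boy-right-of-Helmet", "Short-Sleeve", "Jeans"]
print(semantic_window(scanpath, 0))  # wide: the whole forest
print(semantic_window(scanpath, 3))  # narrow: zooming in on the trees
```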
Step 3: Learn via Self-Refined Superposition. 🍞 Top Bread (Hook): Think of tasting your soup, then adjusting the flavor gently without overreacting. 🥬 Filling (The Actual Concept): The model takes its own logits over tokens in Wt, freezes them (stop-gradient), and turns them into a soft distribution (temperature-scaled). This becomes the reference superposition target for training the current hidden state. Why it matters: It teaches the hidden state to represent a balanced mix of valid futures, not a single guess. 🍞 Bottom Bread (Anchor): If both "short-sleeve" and "jeans" are still plausible, the target softly supports them both.
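A minimal PyTorch sketch of this target, assuming `logits` come from the fixed LM head at the current step. The temperature value is a placeholder; the paper specifies a frozen (stop-gradient), temperature-scaled soft distribution over the window.

```python
import torch
import torch.nn.functional as F

def superposition_target(logits, window_ids, temperature=2.0):
    """Soft target over W_t built from the model's own frozen beliefs.

    logits: (vocab_size,) scores from the fixed LM head at step t.
    window_ids: tensor of token ids inside the dynamic window W_t.
    detach() is the stop-gradient: the target teaches but is not taught.
    """
    window_logits = logits[window_ids].detach()            # freeze beliefs
    return F.softmax(window_logits / temperature, dim=-1)  # soft mixture

# Hypothetical usage: a 32k vocab and a 3-token window
target = superposition_target(torch.randn(32000), torch.tensor([11, 42, 7]))
```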
Step 4: Add Entropy-Regularized Intervention when unsure. 🍞 Top Bread (Hook): Like training wheels that touch the ground only when you wobble. 🥬 Filling (The Actual Concept): If the model's soft target is too spread out (high entropy), we blend in some hard guidance toward the immediate next concept token. If not, we leave the soft target alone. Why it matters: This avoids drifting into vague states while keeping flexibility when the model is confident. 🍞 Bottom Bread (Anchor): When torn between "spectators" and "players," add a nudge to "fence" to fix the spatial frame, then continue softly.
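A hedged sketch of the check: the entropy threshold, the mixing weight alpha, and the exact blending rule below are illustrative choices; the paper only specifies that hard next-token guidance is mixed in when the soft target is too uncertain.

```python
import torch

def intervened_target(soft_target, next_pos, entropy_threshold=1.0, alpha=0.5):
    """Blend in hard guidance only when the soft target is too spread out.

    soft_target: (|W_t|,) superposition target over the current window.
    next_pos: index (within W_t) of the immediate next concept token.
    """
    entropy = -(soft_target * soft_target.clamp_min(1e-9).log()).sum()
    if entropy <= entropy_threshold:
        return soft_target                       # confident: stay soft
    hard = torch.zeros_like(soft_target)
    hard[next_pos] = 1.0                         # anchor on the next concept
    return (1 - alpha) * soft_target + alpha * hard
```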
Step 5: Optimize two objectives.
- DWAL loss aligns each hidden state with its hybrid target (mostly soft, sometimes mixed with hard) over the current window. This builds the forest-to-trees progression in the latent space.
- Cross-Entropy loss for the final answer ensures the model commits to a precise, correct output after the reasoning phase. The special <laser_end> token separates reasoning from answering.
- Why the combo: One teaches how to think (DWAL in latent space), the other ensures what to say (final answer).
- Example: After the last reasoning step predicts <laser_end>, the model outputs "Short-sleeve and jeans."
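Putting the two objectives together, here is a minimal sketch of the combined loss. The KL form of the alignment term and the weighting are our assumptions; the paper pairs a DWAL alignment loss with a standard cross-entropy on the final answer.

```python
import torch.nn.functional as F

def total_loss(step_log_preds, step_targets, answer_logits, answer_ids,
               dwal_weight=1.0):
    """How to think (DWAL) + what to say (answer cross-entropy).

    step_log_preds: list of (|W_t|,) log-probs the model assigns over W_t.
    step_targets: list of (|W_t|,) hybrid targets (soft, sometimes hard-mixed).
    answer_logits: (num_answer_tokens, vocab_size) logits after <laser_end>.
    answer_ids: (num_answer_tokens,) ground-truth answer token ids.
    """
    dwal = sum(F.kl_div(lp, tgt, reduction="sum")       # align each step
               for lp, tgt in zip(step_log_preds, step_targets))
    ce = F.cross_entropy(answer_logits, answer_ids)     # commit to the answer
    return dwal_weight * dwal + ce
```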
Step 6: Keep interpretability via decodable trajectories. 🍞 Top Bread (Hook): It's like seeing the detective's notebook instead of only hearing the final verdict. 🥬 Filling (The Actual Concept): Because the LM head is fixed, you can project each hidden state to top-k tokens and read the model's internal steps. Why it matters: Many latent methods are black boxes; Laser lets you see the path. 🍞 Bottom Bread (Anchor): A decoded path might be [Seats, Spectators, Fence] → [Outside] → [Option C].
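A sketch of that readout, assuming a (T, d) tensor of latent reasoning states and the model's frozen LM head (names here are illustrative):

```python
import torch

def decode_trajectory(hidden_states, lm_head, tokenizer, k=3):
    """Read the latent 'notebook': top-k decodable tokens per reasoning step."""
    with torch.no_grad():
        logits = lm_head(hidden_states)             # (T, vocab_size)
        top_ids = logits.topk(k, dim=-1).indices    # (T, k)
    return [[tokenizer.decode([i]) for i in row] for row in top_ids.tolist()]

# A decoded path might then read, step by step:
# [Seats, Spectators, Fence] -> [Outside, ...] -> [Option C, ...]
```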
What breaks without each step:
- Without scanpaths: the model lacks structured teaching signals; windows have nothing coherent to cover.
- Without dynamic windows: the model collapses early, losing global context.
- Without self-refined superposition: the model can't keep multiple futures alive; it overcommits.
- Without entropy intervention: the model may stay too fuzzy and never focus.
- Without explicit answer loss: the model may think well but not speak clearly.
The secret sauce:
- Laser turns "predict the next word" into "align with a valid future set," then gracefully tightens that set. This matches human perception (global-to-local), preserves detail early, and delivers precise answers late, all while staying token-efficient.
04 Experiments & Results
The test: The authors evaluate Laser on six diverse benchmarks to measure perception, reasoning, robustness to tricks and illusions, and high-resolution understanding: MMVP, BLINK, SEED-Bench-2-Plus, MMStar, HallusionBench, and HRBench. They also track efficiency: how many tokens models need during inference.
The competition: Laser is compared to strong zero-shot VLMs (like Qwen2.5-VL-7B, LLaVA-OneVision, InternVL3.5-8B, GPT-4o), explicit reasoning or RL/tool methods (Vision-R1, VL-Rethinker, DeepEyes, PAPO), and latent reasoning baselines (LVR, Monet).
The scoreboard (with context):
- Overall, Laser sets a new state of the art among latent reasoning methods, beating Monet by +5.03% on average. Think of it like raising a solid "B" to an "A-" across many subjects at once.
- On HallusionBench (which tests hallucinations and illusions), Laser improves by +11.36% over Monet, like jumping from a C+ to a solid B+/A-, showing it resists being tricked.
- On BLINK (fast, perception-heavy tasks), Laser gains +6.21%, showing stronger scene understanding.
- Efficiency is where Laser shines brightest: it reduces inference tokens by more than 97% compared to the base model, a far larger reduction than Monet's. On BLINK, Laser uses about 6 tokens vs. Monet's 118.3. That's like finishing your homework in 3 minutes instead of an hour, and still getting a better grade.
- Compared with some heavyweight methods that write long rationales or use extra tools, Laser often matches or exceeds results without the extra compute.
Surprising findings:
- Despite being a latent method (no long text chains), Laser still outperforms RL-enhanced and tool-using systems in several benchmarks. This suggests that smarter internal training (DWAL) can beat external add-ons.
- Laser maintains interpretability: you can decode hidden states into top tokens and see a step-by-step path, which is rare for latent methods.
- Fine-grained analysis across 14 task types shows Laser dominates in 11, especially high-level semantic and spatial reasoning (e.g., Visual Similarity, Spatial Relation). It slightly lags in tasks demanding exact pixel-level alignment like Object Localization or Jigsaw. This aligns with its forest-before-trees design.
- Generalization holds: On web and chart tasks (text-rich images), Laser improves a lot without erasing geometry or depth skills. That means it learned a transferable reasoning habit, not just overfitted tricks.
🍞 Top Bread (Hook): Ever wish you could peek into the model's head as it reasons? 🥬 Filling (The Actual Concept): Laser's decodable trajectories let you read top-k tokens from each hidden state, revealing a multi-hop path (e.g., seats → fence → outside → option C). Why it matters: You can debug and trust the system more when you see how it arrived at the answer. 🍞 Bottom Bread (Anchor): On a baseball scene, the decoded steps show a shift from crowd identification to the fence boundary and then to the correct "Outside the field."
Ablations (what matters most):
- Removing DWAL (reverting to next-token prediction) hurts fine-grained perception: premature collapse returns.
- Using fixed windows (instead of dynamic) mainly hurts complex reasoning: you lose the gradual global-to-local narrowing.
Together, these confirm that both superposition and window shrinking are essential.
Efficiency deep dive:
- Explicit rationales (like VL-Rethinker) often blow up token usage; Monet, though latent, still uses long latent sequences.
- Laser condenses reasoning into very few tokens because most of the thinking happens in the continuous hidden space, not as text. Thatās why it can be both faster and more accurate.
Bottom line: Laser is both smarter (better scores, lower hallucinations) and thriftier (fewer tokens, faster) because it trains the hidden states to keep possibilities alive early and commit later.
05 Discussion & Limitations
Limitations:
- Pixel-precise tasks: Laser slightly underperforms on exact localization and jigsaw-like puzzles. Its training prefers semantic flow over strict pixel anchoring, so it's semantically sharp but metrically approximate.
- Data dependence: The method relies on high-quality cognitive scanpaths. While they are weakly supervised and scalable, poor scanpaths could teach bad habits.
- Hyperparameter sensitivity: The entropy threshold and hard-mix strength (alpha) need tuning to balance exploration and grounding.
- Scope: While interpretability is good (decoding top tokens), it's still coarser than full visual bounding boxes.
Required resources:
- A capable VLM backbone (e.g., Qwen2.5-VL-7B) and compute for fine-tuning.
- Access to a "visual cognitive engine" (like GPT-4o) or an equivalent pipeline to synthesize scanpaths.
- Training setup with memory-optimizing tools (e.g., DeepSpeed) for efficiency.
When NOT to use:
- Applications demanding exact pixel-level coordinates or surgical segmentation (e.g., medical image boundary tracing) where precise ROI supervision is essential.
- Tasks where interpretability must include spatial evidence overlays (e.g., legal audits requiring boxes/masks).
- Extremely short, one-step questions where global-to-local reasoning brings no benefit; a direct-answer model may suffice.
Open questions:
- Can we integrate weak ROI signals or self-learned visual anchors to boost localization without losing efficiency?
- How does DWAL interact with video reasoning over long horizons, and can windows be time-aware for motion cues?
- Can we learn scanpaths purely self-supervised (no external teacher) while keeping quality high?
- Could we extend decodable trajectories to produce concise, human-friendly rationales on demand without reintroducing token bloat?
- How far can RL-based early-exit policies push efficiency while safeguarding rare-case accuracy?
Overall, Laser shifts the training target from "next word" to "valid future set," proving that better inner guidance makes models both faster and wiser. The next steps are about sharpening local precision, broadening to videos, and making scanpaths cheaper to obtain.
06 Conclusion & Future Work
Three-sentence summary: Laser teaches a vision-language model to think in superpositions inside its hidden space, aligning each step with a dynamic window of valid future concepts. This forest-before-trees process preserves global context early and adds precision later, preventing premature collapse and cutting token use by over 97%. The result is state-of-the-art latent reasoning performance, strong robustness, and interpretable latent trajectories.
Main achievement: Laser replaces rigid next-token prediction with Dynamic Windowed Alignment Learning, combining self-refined superposition and entropy-regularized intervention to stabilize flexible, global-to-local latent reasoning.
Future directions:
- Add light-touch spatial grounding (weak ROI cues or self-learned anchors) to improve localization without losing efficiency.
- Extend to video and long documents with time-aware or page-aware windows.
- Develop self-supervised scanpath generation and compact, on-demand rationales.
- Explore RL-driven early-exit strategies to push efficiency even further.
Why remember this: Laser shows that how we train hidden states matters as much as model size. Teaching them to hold multiple futures and then resolve them can make models both smarter and faster. It bridges the continuous world of vision with the discrete world of language, all while keeping the reasoning path decodable. That's a practical recipe for more trustworthy, efficient multimodal AI.
Practical Applications
- Document and chart analysis that reads legends, axes, and highlights accurately before extracting exact values.
- Retail image search that understands whole-scene context (store, aisle, brand groupings) before picking a product.
- Robust photo Q&A for phones that avoids hallucinations by scanning globally first.
- Assistive technology that describes complex scenes to users with visual impairments more reliably.
- Autonomous robotics that forms a global plan (layout, obstacles) before committing to precise maneuvers.
- Medical imaging triage that identifies region-of-interest candidates globally before suggesting local checks (with clinician oversight).
- Content moderation that captures scene-level context to reduce false flags and then confirms specific details.
- Forensics and deepfake detection that evaluates global artifacts and lighting before local pixel cues.
- Education tools that guide students through visual reasoning (maps, diagrams) with interpretable intermediate steps.
- ā¢Education tools that guide students through visual reasoning (maps, diagrams) with interpretable intermediate steps.