
Toward Cognitive Supersensing in Multimodal Large Language Model

Intermediate
Boyi Li, Yifan Shen, Yuanzhe Liu et al. Ā· 2/2/2026
arXiv Ā· PDF

Key Summary

  • This paper teaches multimodal AI models not just to read pictures but also to imagine and think with pictures inside their heads.
  • The key idea, called Cognitive Supersensing, adds a new LVIP head that predicts hidden visual states (like mental images) while the model reasons in text.
  • A special training recipe first gathers good step-by-step explanations, then teaches the model to align its inner pictures with the right answer, and finally refines its reasoning with a reward-guided process.
  • The authors build a new test called CogSense-Bench to check five kinds of visual thinking: fluid intelligence, crystallized intelligence, visuospatial cognition, mental simulation, and visual routines.
  • Their 8B-parameter model, CogSense-8B, scores 73.8% on CogSense-Bench, beating much larger, well-known models by wide margins.
  • The model keeps its regular skills on everyday vision-language tasks, showing that the new abilities do not erase old ones.
  • It also generalizes better to out-of-domain math and chemistry picture problems, suggesting its inner visual imagery helps beyond the training set.
  • An ablation study shows that both parts, LVIP and the specialized reinforcement learning, matter, and that together they give the biggest boost.
  • The work points to a path where AI reasons partly in pictures and partly in words, more like how people use a 'mind’s eye.'

Why This Research Matters

Many real tasks require imagining changes to pictures, not just naming what’s there—like following a science diagram, reading a map, or predicting how parts fit. This work shows that giving AI an internal ā€˜mind’s eye’ can make it much better at such tasks. It improves accuracy on a wide set of visual cognition skills and even generalizes to new problem types, hinting at deeper understanding rather than memorization. Because the model keeps its usual abilities too, it’s a practical upgrade rather than a trade-off. In the long run, this approach could help tutors, lab assistants, and household robots reason more safely and reliably. It points toward multimodal systems that think more like people: with both words and pictures inside.

Detailed Explanation


01Background & Problem Definition

šŸž Top Bread (Hook) You know how in math or puzzle class you sometimes close your eyes and picture shapes rotating or blocks stacking, so you can figure out the answer? That little movie in your mind helps you solve tough problems that words alone can’t explain.

🄬 Filling (The Actual Concept)

  • What it is: Before this paper, multimodal large language models (MLLMs) were great at naming things in pictures but not great at doing the tricky mental picture-work humans do, like imagining rotations, simulating changes, or spotting deep rules in patterns.
  • How it works (the world before):
    1. MLLMs learned to connect images to words and give descriptions or labels.
    2. To reason, people pushed these models to write long, step-by-step text (Chain-of-Thought, or CoT).
    3. But when the steps needed visual imagination—like mentally turning a shape—text became a clumsy tool, because it squishes rich, spatial details into sparse word tokens.
  • Why it matters: Without a way to think in visuals, models get stuck on puzzles that humans solve by using a "mind’s eye," causing brittle reasoning and wrong answers on many visual reasoning benchmarks.

šŸž Bottom Bread (Anchor) Imagine a puzzle asking which piece completes a 3Ɨ3 grid of shapes after each row rotates by 90°. If the model only writes words like ā€œrotate by 90°,ā€ it may miss exact spatial relations. But if it can also imagine the shape turning inside, it’s far more likely to choose the right piece.

šŸž Top Bread (Hook) Think of your brain as having two helpers: one that loves words, and another that loves pictures. If you only ask the word-lover to solve a jigsaw puzzle, you’ll have a tough time.

🄬 Filling (The Problem)

  • What it is: The big challenge is the ā€œcognition gapā€ between seeing and truly understanding through multi-step visual reasoning.
  • How it works (why past fixes failed):
    1. Text-only CoT: Models write more steps but still can’t hold fine-grained spatial details.
    2. External tools: Some methods bolt on extra solvers, but often don’t integrate deeply with the model’s internal state.
    3. Token bottleneck: Turning smooth geometry into discrete words loses important information.
  • Why it matters: Many real tasks—diagram math, science figure interpretation, mechanical reasoning—need visual memory and mental simulation. Without them, answers can be shallow or wrong.

šŸž Bottom Bread (Anchor) Try explaining a maze path only using words like ā€œleft, up, left, down,ā€ with no map. It’s easy to get lost. That’s how MLLMs feel when they must reason about visuals without a visual scratchpad.

šŸž Top Bread (Hook) When solving a Rubik’s Cube, you don’t write paragraphs—you picture the moves and then act. The picture-thinking comes first.

🄬 Filling (The Gap and the Paper’s Role)

  • What it is: The missing piece is an internal ā€œvisuospatial sketchpadā€ inside MLLMs that can carry rich, geometry-friendly states through reasoning.
  • How it works: The paper introduces Cognitive Supersensing—a way to add and train internal visual imagery, so the model reasons partly in a latent (hidden) visual space, not just in text.
  • Why it matters: This preserves spatial structure, supports multi-step transformations, and grounds textual steps in picture-like states, much like a human’s mind’s eye.

šŸž Bottom Bread (Anchor) It’s like giving the AI a blank notebook where it can sketch the puzzle as it thinks, instead of forcing it to describe every line of the sketch in words.

šŸž Top Bread (Hook) Teachers need good tests. If you only quiz spelling, you won’t know if a student understands geometry.

🄬 Filling (The Stakes and Benchmark)

  • What it is: The authors build CogSense-Bench, a test measuring five kinds of visual thinking: fluid intelligence, crystallized intelligence, visuospatial cognition, mental simulation, and visual routines.
  • How it works: Questions require multi-step visual transformations, pattern induction, and simulation, not just naming.
  • Why it matters: It measures real cognitive abilities that matter in everyday life—following diagrams, predicting outcomes in science images, or choosing the right plan from maps.

šŸž Bottom Bread (Anchor) Think of assembling furniture: you must read the diagram, imagine flips and turns, and predict where each piece goes. CogSense-Bench checks whether an AI can do that kind of thinking.

02Core Idea

šŸž Top Bread (Hook) Imagine explaining Tetris to a friend. You don’t just say words—you also picture where each shape will land. The magic is in the mental image.

🄬 Filling (The Aha!)

  • What it is (one sentence): Move part of the model’s reasoning from words into an internal picture-like space by adding a Latent Visual Imagery Prediction (LVIP) head, then train and reward the model so its inner images line up with correct answers.
  • How it works (like a recipe):
    1. While the model writes its reasoning steps, a parallel LVIP head predicts the hidden visual state of the right answer option.
    2. During training, we nudge these inner images to match the real visual embedding of the correct option.
    3. We further use reinforcement learning to prefer reasoning paths whose inner images and answers agree.
  • Why it matters: This reduces the text bottleneck, keeps spatial structure intact, and makes the model’s steps less brittle.

šŸž Bottom Bread (Anchor) It’s as if the AI keeps a ghostly sketch of the solution in its head while explaining, and we teach it to make that sketch match the true solution.

Multiple Analogies

  1. Puzzle Table Analogy šŸž Hook: You know how you sort puzzle pieces by shape and color on a table before snapping them together? 🄬 Concept: Cognitive Supersensing lets the model lay out visual pieces (in its latent space) while talking through the solution. Why it matters: Without the table (latent space), pieces fly everywhere—reasoning falls apart. šŸž Anchor: Finishing a jigsaw by mentally placing each piece before you press it in.

  2. GPS + Map Analogy šŸž Hook: A GPS gives turn-by-turn words, but you still prefer glancing at the map. 🄬 Concept: Text steps are the GPS directions; LVIP is the live map. You need both for safe navigation. Why it matters: Directions alone miss curves or landmarks; the map (latent visuals) keeps you grounded. šŸž Anchor: Picking the right street when two look similar because the map shows the curve.

  3. Chef + Taste Analogy šŸž Hook: Recipes are words, but a chef also tastes the soup. 🄬 Concept: Words describe steps; LVIP is the taste test that checks the internal ā€˜flavor’ of the visual state. Why it matters: Without tasting (visual grounding), the soup can be bland or wrong. šŸž Anchor: Adjusting salt after tasting ensures the final dish matches the goal.

Before vs After

  • Before: Models wrote longer text chains but still missed geometric details and failed on multi-step visual transformations.
  • After: The model keeps an internal visual chain that travels with its words, so rotations, counts, alignments, and transformations stay coherent.

Why It Works (Intuition)

  • Visual patterns are continuous and structured; words are discrete and linear. By keeping a hidden visual state in sync with the answer, the model preserves geometry and causal structure.
  • Reinforcement signals reward paths that not only say the right thing but also ā€˜look’ right internally, reducing spurious text-only shortcuts.

Building Blocks (with Sandwich Explanations)

  1. šŸž Hook: Imagine reading a comic where words and pictures work together. 🄬 MLLM (Multimodal Large Language Model)

    • What: A model that processes both images and text.
    • How: A vision encoder turns pictures into features; a language model reads those plus words to reason and answer.
    • Why: Text alone can’t see; images alone can’t explain. šŸž Anchor: Describing a photo and answering questions about it.
  2. šŸž Hook: Asking a friend, ā€œWhat’s happening in this picture?ā€ 🄬 VQA (Visual Question Answering)

    • What: Answering questions about an image.
    • How: Read the question, look at the image, connect details to produce an answer.
    • Why: Tests real understanding, not just naming. šŸž Anchor: ā€œWhich shape comes next in this grid?ā€
  3. šŸž Hook: Solving a riddle step by step. 🄬 CoT (Chain-of-Thought)

    • What: Writing reasoning steps before the answer.
    • How: Generate short, logical steps that lead to the solution.
    • Why: Without steps, the model may guess or get lost. šŸž Anchor: ā€œFirst count rows, then compare columns, then pick option C.ā€
  4. šŸž Hook: Your mind’s eye imagines turning a cube. 🄬 Visual Cognition

    • What: How we think with images—pattern, space, motion.
    • How: Keep a visual memory, transform it, and compare results.
    • Why: Words alone drop spatial details. šŸž Anchor: Predicting which gear turns next by imagining the motion.
  5. šŸž Hook: Give the AI a sketchpad. 🄬 Cognitive Supersensing

    • What: A training paradigm that gives MLLMs internal visual imagery linked to their reasoning.
    • How: Add LVIP to predict latent visuals; align them with the true answer; reinforce good visual-text chains.
    • Why: Bridges seeing and thinking. šŸž Anchor: Solving Raven’s matrices by mentally picturing the missing tile.
  6. šŸž Hook: Guess what the final puzzle piece looks like before seeing it. 🄬 LVIP (Latent Visual Imagery Prediction)

    • What: A small head that predicts the hidden visual embedding of the correct answer.
    • How: Pool vision-token states, pass through an MLP, match to the true answer’s embedding.
    • Why: Keeps the model’s inner image tethered to the right solution. šŸž Anchor: Predicting ā€œa small dark pentagonā€ and checking it matches option B’s embedding.
  7. šŸž Hook: Practice makes perfect when you reward good habits. 🄬 RL with Latent Rationales (GFlowNet-based)

    • What: A training step that samples many reasoning paths and rewards those whose text and inner visuals both support the right answer.
    • How: A frozen scorer rates answer evidence; LVIP grounding rates visual alignment; a flow-matching loss makes good paths more likely.
    • Why: Prevents brittle single-path thinking and encourages robust visual-text reasoning. šŸž Anchor: Keeping multiple solution sketches and favoring the ones that best fit the target image.
  8. šŸž Hook: A fair test checks the right skills. 🄬 CogSense-Bench

    • What: A benchmark of five visual cognition skills.
    • How: Multiple-choice VQA puzzles that demand pattern induction, spatial reasoning, simulation, and attention control.
    • Why: Measures true visual thinking, not just object naming. šŸž Anchor: Picking the next tile in a logical matrix or the odd one out in a visual routine.

03Methodology

At a high level: Input (images + question) → Visual encoder + text encoder → Parallel heads: (A) text decoder for steps + answer, (B) LVIP head for inner visual imagery → Reinforcement learning refines which reasoning paths the model prefers.
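As a rough picture of that dataflow, the following sketch wires up the two parallel heads with stand-in modules. The layer choices (a linear projection in place of the real vision encoder, a GRU in place of the LLM backbone) and all sizes are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class CognitiveSupersensingSketch(nn.Module):
    """Hypothetical wiring of the pipeline above, with stand-in modules."""

    def __init__(self, hidden_dim=512, vocab_size=32000, visual_dim=256):
        super().__init__()
        self.vision_proj = nn.Linear(visual_dim, hidden_dim)   # stand-in vision encoder/projector
        self.backbone = nn.GRU(hidden_dim, hidden_dim,
                               batch_first=True)               # stand-in LLM backbone
        self.text_head = nn.Linear(hidden_dim, vocab_size)     # (A) rationale + answer tokens
        self.lvip_head = nn.Sequential(                        # (B) latent visual imagery
            nn.Linear(hidden_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, visual_dim))

    def forward(self, image_feats, text_embeds):
        # image_feats: (B, num_vis_tokens, visual_dim); text_embeds: (B, seq_len, hidden_dim)
        vis_tokens = self.vision_proj(image_feats)
        states, _ = self.backbone(torch.cat([vis_tokens, text_embeds], dim=1))
        text_logits = self.text_head(states)                   # next-token predictions
        pooled = states[:, :vis_tokens.size(1)].mean(dim=1)    # pool visual-token hidden states
        imagery = self.lvip_head(pooled)                       # predicted latent imagery vector
        return text_logits, imagery

# Tiny smoke test with random inputs.
model = CognitiveSupersensingSketch()
logits, imagery = model(torch.randn(2, 16, 256), torch.randn(2, 32, 512))
print(logits.shape, imagery.shape)  # torch.Size([2, 48, 32000]) torch.Size([2, 256])
```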

Stage I: Reasoning Chain Generation (Teacher-Student) šŸž Hook: Like copying notes from a top student to learn good study habits. 🄬 Concept

  • What: Use a strong teacher model to write clean, step-by-step rationales (CoT) for each puzzle and filter out wrong ones.
  • How:
    1. Feed the image(s) and question to the teacher.
    2. Get a rationale + predicted answer.
    3. Keep it only if the answer matches the ground truth and contains no hallucinations.
  • Why: The student model needs good examples of how to think aloud. šŸž Anchor: Building a curated set of model-written ā€œworked examplesā€ for training.
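A minimal sketch of that Stage I filtering loop might look like the following; `teacher_generate` and `contains_hallucination` are assumed helper functions standing in for the teacher model and the hallucination check, not APIs from the paper.

```python
def build_rationale_dataset(examples, teacher_generate, contains_hallucination):
    """Keep a teacher-written rationale only if its answer matches the ground
    truth and no hallucination is flagged (hypothetical Stage I filter)."""
    kept = []
    for ex in examples:  # ex: {"images": ..., "question": ..., "answer": ...}
        rationale, predicted = teacher_generate(ex["images"], ex["question"])
        if predicted == ex["answer"] and not contains_hallucination(rationale, ex):
            kept.append({**ex, "rationale": rationale})
    return kept
```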

Stage II: Supervised Fine-Tuning (SFT) with LVIP šŸž Hook: Imagine tracing over a faint picture to learn how to draw it correctly. 🄬 Concept

  • What: Train the model to generate the rationale + answer while its LVIP head predicts the hidden visual embedding of the correct option.
  • How (Step-by-step):
    1. Visual encoder extracts features for the question image and all candidate options.
    2. Project those features into the language space; the LLM backbone processes combined visual and textual tokens.
    3. Text decoder learns to output rationale then final answer.
    4. LVIP head pools the option-image visual tokens from the backbone’s hidden states and predicts the embedding of the correct option.
    5. Loss = text loss (to write the right steps and answer) + LVIP MSE loss (to make inner image match the right option’s embedding).
  • Why: Without LVIP, the model can write steps that sound right but don’t align with the actual visual solution. šŸž Anchor: If the correct answer is a small dark pentagon, LVIP learns to predict an embedding close to that option’s embedding while the text explains why.
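Building on the hypothetical module sketched earlier, the Stage II objective could be written as a combined loss along these lines; the equal weighting of the two terms is an assumption, not the paper's reported setting.

```python
import torch
import torch.nn.functional as F

def stage2_loss(text_logits, target_tokens, predicted_imagery,
                correct_option_embedding, lvip_weight=1.0):
    """Hypothetical Stage II objective: next-token cross-entropy on the
    rationale + answer, plus an MSE term pulling the LVIP prediction toward
    the correct option's embedding."""
    # text_logits: (B, seq, vocab); target_tokens: (B, seq) of token ids
    ce = F.cross_entropy(text_logits.reshape(-1, text_logits.size(-1)),
                         target_tokens.reshape(-1))
    # predicted_imagery, correct_option_embedding: (B, visual_dim)
    mse = F.mse_loss(predicted_imagery, correct_option_embedding)
    return ce + lvip_weight * mse

# Example call with random tensors (shapes only; values are meaningless).
loss = stage2_loss(torch.randn(2, 48, 32000), torch.randint(0, 32000, (2, 48)),
                   torch.randn(2, 256), torch.randn(2, 256))
```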

Stage III: RL with Latent Rationales (GFlowNet-style) šŸž Hook: Practice many solution paths and favor the ones that both sound right and look right inside. 🄬 Concept

  • What: Use reinforcement learning to sample multiple reasoning chains and score them using two signals: (1) answer evidence from a frozen scorer, and (2) LVIP-based visual grounding.
  • How (Step-by-step):
    1. Sample several rationales Z.
    2. Score each with R_ans (log-prob that this rationale leads to the correct answer under a frozen scorer) and R_lvip (how close the LVIP-predicted embedding is to the true option embedding when conditioned on this rationale).
    3. Combine them into a reward R = α·R_ans + γ·R_lvip.
    4. Use a flow-matching loss (GFlowNet SubTB) with prefix densification so partial chains get shaped by reward signals.
    5. Keep only candidates better than a reference rationale threshold to cut noise.
  • Why: Without RL, the model may overfit to one brittle chain; with RL, it learns a distribution over robust, grounded chains. šŸž Anchor: If three different step-by-step explanations all imply the same correct option and LVIP agrees visually, those explanations become more likely next time.
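A sketch of the reward and filtering in steps 2-5, under the assumption that visual closeness is measured with cosine similarity (the paper only says "how close") and with placeholder values for α and γ:

```python
import torch
import torch.nn.functional as F

def rationale_reward(answer_logprob, predicted_imagery, correct_option_embedding,
                     alpha=1.0, gamma=1.0):
    """Reward for one sampled rationale Z:
    R_ans  = log-prob of the correct answer under the frozen scorer (precomputed),
    R_lvip = closeness of the LVIP prediction, conditioned on Z, to the correct
             option's embedding (cosine similarity is an assumption here),
    R      = alpha * R_ans + gamma * R_lvip."""
    r_lvip = F.cosine_similarity(predicted_imagery, correct_option_embedding, dim=-1)
    return alpha * answer_logprob + gamma * r_lvip

def keep_better_than_reference(rationales, rewards, reference_reward):
    """Step 5: keep only sampled chains whose reward beats the reference rationale."""
    return [z for z, r in zip(rationales, rewards) if r > reference_reward]
```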

Inference Time šŸž Hook: When in doubt, check a few plans and pick the strongest. 🄬 Concept

  • What: Sample multiple rationales, score their answer evidence with the frozen scorer (length-normalized), and pick the answer from the best-scored rationale (a simple MAP-style selection).
  • How: Sample N rationales → decode their answers → compute scores → choose the top one.
  • Why: Reduces sensitivity to any single mistaken chain. šŸž Anchor: Trying three ways to solve a maze and choosing the path that most clearly fits the map.
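A minimal sketch of that selection, assuming word-level length normalization and a `scorer_logprob` helper standing in for the frozen scorer:

```python
def select_answer(rationales, answers, scorer_logprob):
    """MAP-style pick: score each sampled rationale with the frozen scorer,
    length-normalize, and return the answer of the best-scoring rationale.
    Normalizing by word count is a simplification for illustration."""
    best_idx, best_score = 0, float("-inf")
    for i, (z, a) in enumerate(zip(rationales, answers)):
        score = scorer_logprob(z, a) / max(len(z.split()), 1)
        if score > best_score:
            best_idx, best_score = i, score
    return answers[best_idx]

# Usage sketch: answer = select_answer(sampled_rationales, decoded_answers, scorer_fn)
```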

Secret Sauce

  • The LVIP head ties the model’s inner visual story to the actual correct option’s embedding, keeping geometric and appearance details alive.
  • The RL stage doesn’t just reward the final word; it also rewards inner visual agreement, steering the model toward visually grounded thought chains.

Short Sandwiches for Auxiliary Ideas

  1. šŸž Hook: Hidden drawings under tracing paper. 🄬 Latent Space/Embeddings

    • What: Compact, mathy representations of images.
    • How: Encoders turn pictures into vectors; comparing vectors checks similarity.
    • Why: Lets the model ā€˜feel’ visual closeness even without pixels. šŸž Anchor: Two pentagons have closer embeddings than a pentagon and a triangle.
  2. šŸž Hook: Water flowing through pipes to many sinks. 🄬 GFlowNet (Generative Flow Network)

    • What: A way to learn to sample many good solutions in proportion to their rewards.
    • How: Match flows so that high-reward paths carry more ā€˜probability mass.’
    • Why: Encourages diversity instead of one brittle path. šŸž Anchor: Sampling several valid puzzle explanations instead of insisting on a single script.
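The toy snippet below illustrates only the goal of GFlowNet-style training, sampling good solutions in proportion to their rewards rather than always taking the single best path. The rewards are invented, and no flow-matching loss is shown; a trained GFlowNet would learn a sampler whose visit frequencies match these target probabilities.

```python
import numpy as np

rewards = np.array([4.0, 2.0, 1.0, 1.0])       # four candidate reasoning chains
target_probs = rewards / rewards.sum()          # sample proportional to reward, not argmax only
rng = np.random.default_rng(0)
samples = rng.choice(len(rewards), size=10_000, p=target_probs)
print(np.bincount(samples) / len(samples))      # roughly [0.5, 0.25, 0.125, 0.125]
```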

04Experiments & Results

šŸž Top Bread (Hook) If you want to know whether a soccer player is good, you don’t just time their sprints—you see them pass, dribble, and shoot. This paper’s new benchmark checks many ā€˜skills’ of visual thinking.

🄬 Filling (The Test)

  • What they measured: Accuracy on CogSense-Bench, a 1,000-question multi-task multiple-choice VQA benchmark across five cognitive dimensions: fluid intelligence, crystallized intelligence, visuospatial cognition, mental simulation, and visual routines.
  • Why: These capture different ways humans reason visually: finding abstract rules, using learned knowledge, imagining spatial changes, simulating processes, and searching/attending efficiently.

šŸž Bottom Bread (Anchor) Think of classroom tests that include pattern puzzles, category matching, map reading, physics doodles, and spot-the-difference.

šŸž Top Bread (Hook) You don’t know you’re winning unless you see the scoreboard.

🄬 Filling (The Competition and Scoreboard)

  • Baselines: Major open and closed MLLMs/VLMs (e.g., Gemini 2.5 Flash, GPT-5.2, Claude Sonnet 4, Qwen3-VL-30B). A human baseline from 20 participants on a stratified sample is also reported.
  • Results with context:
    • CogSense-8B (just 8B parameters) achieves 73.8% average on CogSense-Bench.
    • That's like scoring an A when many popular models land around a C/C+ or lower (often in the 30–56% range, depending on category and model).
    • It narrows the gap to human performance substantially across all five dimensions.

šŸž Bottom Bread (Anchor) If the class average is 60% and you get 74%, you’re comfortably above the pack—even if you’re not yet at the best human scores.

šŸž Top Bread (Hook) A great soccer player shouldn’t forget how to pass while learning to shoot.

🄬 Filling (General Ability)

  • What they checked: A suite of common vision-language tasks (e.g., HallusionBench, AI2D, GQA, ScienceQA, RealWorldQA, ChartQA, BLINK, MMStar).
  • Result: CogSense-8B stays roughly on par with its base model on these general benchmarks, so the new visual cognition skills didn’t erase its older abilities.

šŸž Bottom Bread (Anchor) Like learning new tricks without forgetting your basic dribbling.

šŸž Top Bread (Hook) Can you transfer your puzzle skills to a new kind of puzzle?

🄬 Filling (Out-of-Domain Generalization)

  • Test: EMMA benchmark subsets in Chemistry and Mathematics, where both questions and options are images.
  • Result: CogSense-8B improves by +6.2 points in Chemistry and +8.8 points in Math over the base model.
  • Why it matters: Suggests the model learned general visual thinking strategies (not just memorizing training patterns).

šŸž Bottom Bread (Anchor) Like a student who can solve new types of geometry problems after mastering transformations.

šŸž Top Bread (Hook) Which ingredient mattered most in the recipe?

🄬 Filling (Ablations)

  • Variants tested: base model; SFT without LVIP; SFT with LVIP; SFT with and without LVIP combined with GRPO; and the full method (SFT with LVIP plus the tailored RL/GFlowNet setup).
  • Findings with context:
    • SFT helps a lot (nearly doubles average performance from the base).
    • Adding LVIP further boosts accuracy (to ~68%), showing visual grounding helps.
    • General RL (GRPO) adds gains but less than the specialized RL with LVIP grounding.
    • Full method reaches 73.8%, the best result—so both LVIP and the tailored RL matter.

šŸž Bottom Bread (Anchor) It’s like baking: flour (SFT) makes the cake, cocoa (LVIP) adds flavor, and careful temperature control (special RL) makes it rise just right.

05Discussion & Limitations

šŸž Top Bread (Hook) Even superheroes have weaknesses—and knowing them helps you use their powers wisely.

🄬 Filling (Limitations and When Not to Use)

  • Limitations:
    1. Multiple-choice setting: LVIP learns to match the correct option’s embedding; fully open-ended generation or tasks without clear option images may need adaptation.
    2. Visual reasoning breadth: Though strong on five dimensions, it’s not proven on every kind of visual cognition (e.g., dense OCR-heavy tasks or chaotic real-world videos).
    3. Interpretability: Latent visual states are not directly human-viewable; we infer their quality through performance and embedding similarity.
    4. Data reformats: Some datasets were reformatted to multiple-choice, which may shift task dynamics.
  • Required Resources:
    • Training used 8Ɨ NVIDIA H200 GPUs; you need a capable GPU stack, the CogSense-Dataset, and the code to reproduce the pipeline (SFT + RL).
  • When NOT to Use:
    • Pure text problems with no visual structure (LVIP adds little).
    • Tasks that require precise text reading in images (OCR) rather than spatial reasoning.
    • Open-ended image generation or drawing tasks where no candidate options exist (LVIP supervision would need redesign).
  • Open Questions:
    1. Can we visualize or decode the latent imagery itself to improve interpretability?
    2. How well does the approach extend to video (temporal simulation) and robotics (embodied planning)?
    3. Can LVIP be trained without explicit option embeddings (e.g., by predicting future latent states or self-supervised targets)?
    4. What’s the best balance between text steps and visual latents for different task families?

šŸž Bottom Bread (Anchor) Think of LVIP as a reliable compass for multiple-choice map problems; for freehand drawing, you’ll want a different tool or a modified compass.

06Conclusion & Future Work

šŸž Top Bread (Hook) When tough visual puzzles appear, humans lean on a mind’s eye. This paper gives MLLMs a version of that.

🄬 Filling (Takeaway)

  • 3-Sentence Summary: The paper introduces Cognitive Supersensing, which equips MLLMs with internal visual imagery by adding an LVIP head and coupling it with text reasoning. Training mixes supervised fine-tuning (to align inner images with the correct answer) and a specialized reinforcement learning stage (to favor visually grounded reasoning paths). A new benchmark, CogSense-Bench, shows large gains across five visual cognition skills, with strong out-of-domain improvements.
  • Main Achievement: Demonstrating that grounding reasoning in latent visual imagery, not just text, closes a crucial gap between perception and cognitive-level visual understanding.
  • Future Directions: Open-vocabulary visual imagery targets (beyond options), video/time-based simulation, interpretable visualization of latents, and applying the method to robots and science tools.
  • Why Remember This: It marks a shift from ā€œmore wordsā€ to ā€œbetter inner pictures,ā€ pointing to multimodal reasoning that is both explainable in text and faithful to the geometry of the world.

šŸž Bottom Bread (Anchor) Like adding a mental whiteboard to the AI’s brain, this approach helps it picture moves before making them—and that picture makes all the difference.

Practical Applications

  • STEM tutoring that explains diagram-based problems by imagining and describing transformations step by step.
  • Assistive tools for reading charts, maps, and assembly instructions with grounded visual reasoning.
  • Scientific figure Q&A that simulates processes (e.g., chemistry mechanisms, physics diagrams) to choose correct outcomes.
  • Robotics planning that keeps a visual latent plan aligned with goals for safer navigation and manipulation.
  • Quality control in manufacturing by reasoning over visual patterns to detect odd ones out or mismatches.
  • Medical education support on anatomy diagrams and process flows, with careful visual reasoning (non-diagnostic).
  • Data visualization assistants that answer questions about trends and relations in charts more robustly.
  • AR/VR guidance that anticipates user actions by simulating visual changes and offering next-step hints.
  • Educational games that demand visual logic and pattern completion, powered by imagery-grounded reasoning.
  • Design and CAD helpers that mentally transform parts (rotate/fit) and validate assembly options.
Tags: multimodal large language model Ā· visual cognition Ā· latent visual imagery Ā· Cognitive Supersensing Ā· LVIP Ā· GFlowNet Ā· visual chain-of-thought Ā· VQA benchmark Ā· Raven’s matrices Ā· ARC-AGI Ā· visuospatial sketchpad Ā· reinforcement learning Ā· latent space alignment Ā· pattern reasoning Ā· mental simulation