See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning

Intermediate
Shuoshuo Zhang, Yizhen Zhang, Jingjing Fu et al. · 12/26/2025
arXiv · PDF

Key Summary

  • The paper teaches vision-language models (AIs that look and read) to pay attention to the right picture parts without needing extra tools during answering time.
  • It creates two special training views: one that keeps only the helpful evidence and one that hides the critical evidence.
  • The model is pulled to agree with the evidence-keeping view and pushed to disagree with the evidence-hidden view, using a KL divergence signal both ways.
  • This bi-directional shaping is trained inside an RL recipe called GRPO, so the model learns from grouped comparisons of its own answers.
  • A clever data pipeline edits chart-making code to precisely keep or remove just the needed lines, bars, and labels—no clumsy rectangles.
  • With only 13K chart samples, the method lifts Qwen2.5-VL-7B by about 7.3 points on average across eight tests; adding 39K math samples raises the gain to 8.2 points.
  • It especially helps with fine details like thin curves and intersections that older cropping methods miss.
  • There is no extra cost at test time: the model just answers, now with better visual grounding.
  • It generalizes beyond charts to tougher visual math and mixed-image benchmarks, beating larger, more specialized systems in many cases.

Why This Research Matters

When answers depend on tiny visual details—like which line dips lowest or where two curves intersect—guessing from text patterns won’t cut it. BiPS teaches models to truly look, so their answers match the pixels, not just their hunches. This makes homework helpers more reliable, business chart readers more trustworthy, and scientific tools more careful. Because there’s no extra step at test time, answers stay fast and simple to deploy. With data-efficient training, even modest chart sets can teach strong visual habits that transfer to broader tasks. As AI grows into everyday tools, teaching it to see less but see right is a smart way to build trust.

Detailed Explanation

01 Background & Problem Definition

🍞 Hook: You know how, when you do a scavenger hunt, you need to look at the right shelf or corner, not just anywhere in the room, or you’ll miss the tiny clue? If you glance everywhere equally, you’ll probably grab the wrong thing.

🥬 Filling (Concept: Vision-Language Models, or VLMs)

  • What it is: A VLM is an AI that looks at pictures and reads words to answer questions about them.
  • How it works: 1) See the image, 2) Read the question, 3) Connect the visual bits to the words, 4) Produce an answer.
  • Why it matters: Without solid seeing skills, the model might rely on guesses from text or miss small but important picture parts.

🍞 Anchor: If you show a chart and ask, “Which line drops the fastest?”, a good VLM must find and compare the right lines, not just pick a common answer it has seen before.

🍞 Hook: Imagine a teacher asks, “Which slice of the pizza is smallest?” If you only stare at the plate’s edge, you’ll guess wrong.

🥬 Filling (Concept: Multimodal Reasoning)

  • What it is: Multimodal reasoning means using more than one kind of input—like images plus text—to think and answer.
  • How it works: 1) Align what you see (colors, shapes, positions) with what you read (names, targets), 2) Track relationships (bigger/smaller, rising/falling), 3) Compute or compare to reach a conclusion.
  • Why it matters: If the model doesn’t truly match the words to the exact parts of the picture, it can sound smart but be wrong.

🍞 Anchor: For a bar chart question like “Which city has the tallest bar?”, multimodal reasoning means the model actually measures bars in the image tied to the city labels, not just guesses a famous city.

🍞 Hook: Picture trying to find a thin thread on a busy rug. If your flashlight has a big, blurry beam, you’ll miss it.

🥬 Filling (Concept: The Perception Bottleneck)

  • What it is: The perception bottleneck is when a model struggles to notice and read small, detailed, or oddly shaped visual clues.
  • How it works: 1) The model scans the image but over-focuses on easy, big regions, 2) It under-attends to fine features like thin lines or tiny numbers, 3) It leans on text guesses or shortcuts.
  • Why it matters: If the model can’t latch onto the exact visual evidence, any later reasoning can drift away from the truth.

🍞 Anchor: In a line chart with thin wiggly curves, a perception bottleneck means missing where two lines cross and then answering a crossing question incorrectly.

🍞 Hook: Think of having a helpful friend who points to a rectangle on the photo and says, “Look here!” That helps, but what if the important part isn’t a neat rectangle?

🥬 Filling (Concept: Inference-Time Visual Tools / Visual Chain-of-Thought)

  • What it is: These are extra steps at answering time—like drawing boxes, making crops, or creating helper images—to guide the model.
  • How it works: 1) The model or a tool marks regions, 2) The model reasons using those marks, 3) It answers.
  • Why it matters: Without them, some models miss key spots; but these tools can be shape-rigid (miss thin curves), task-specific, and slow.

🍞 Anchor: A crop box may help find a person in a photo, but it’s clumsy for tracing a skinny polyline across a chart.

🍞 Hook: Imagine teaching your eyes during practice so that on test day you naturally stare at what matters—no sticky notes needed.

🥬 Filling (The Problem This Paper Tackles)

  • What it is: We need models that, during training, learn where to look so they won’t need extra visual hints during answering.
  • How it works: 1) Provide training signals that highlight the truly relevant picture parts, 2) Also provide counter-signals when the key parts are missing, 3) Shape the model’s habit to rely on real visual evidence.
  • Why it matters: This removes extra steps at test time, boosts generalization across domains, and catches fine details like thin lines and intersections.

🍞 Anchor: If, during practice, you see both a “just the important parts” view and a “key bits removed” view, you’ll learn which pixels truly matter—so in the real exam you look right, fast, and well.

  • The world before: Many systems added inference-time helpers—cropping, masking, special visual tokens. They worked but often missed fine, irregular details and made test-time slower and more fragile.
  • Failed attempts: Random masks as negatives sometimes hid the wrong stuff, and rectangle crops couldn’t hug wavy lines.
  • The gap: A clean way to train the model’s looking skills with precise, question-based supervision, without bringing the crutches to the exam.
  • Real stakes: Better grounded answers for charts, diagrams, and visual math help in science homework, business reports, and medical visuals—anywhere tiny lines or symbols decide the truth.

02 Core Idea

🍞 Hook: You know how a coach shapes players’ habits by running drills where you practice both doing the right move and avoiding the wrong move?

🥬 Filling (Concept: Bi-directional Perceptual Shaping, or BiPS)

  • What it is: BiPS is a training method that teaches a model where to look by pulling it toward correct, evidence-based views and pushing it away from views missing the key evidence.
  • How it works: 1) Make two special training views of the same image based on the question, 2) Nudge the model to agree with the evidence-keeping view, 3) Nudge it to disagree with the evidence-missing view, 4) Repeat so the model forms a strong habit: use the real pixels.
  • Why it matters: Without both pushes and pulls, the model may still cheat with text shortcuts or overlook fine details.

🍞 Anchor: If the question is “Which curve drops fastest?”, the model practices on (a) a view that keeps just the needed curves and (b) a view that hides those curves. It learns to trust (a) and distrust (b).

Three analogies for the same idea:

  1. Map reading: One practice map highlights only the useful roads (agree with it); another hides those roads (don’t trust it).
  2. Science lab: One sample contains the active ingredient (confirm your conclusion there); the control sample lacks it (your conclusion should change).
  3. Sports: Scrimmage with teammates in the right positions (play your strategy), then scrimmage with key players missing (don’t act as if they’re there).

Before vs After:

  • Before: Models leaned on rectangles, quick guesses, or text patterns; fine curves fooled them; extra tools slowed answers.
  • After: The model’s eyes are trained. It naturally attends to the exact, sometimes skinny or irregular, evidence—no helper tools required at test time.

🍞 Hook: Imagine two special photos: one that carefully keeps the clue, and one that carefully erases it—but both keep the same frame and labels.

🥬 Filling (Concept: Evidence-Preserving View)

  • What it is: A version of the image that keeps only the parts needed to answer, while keeping layout and context.
  • How it works: 1) Identify question-relevant marks (like a certain line or legend entry), 2) Remove distractors while keeping axes and labels, 3) Render the cleaned view.
  • Why it matters: Without it, the model doesn’t get a clear, reliable target for where-to-look agreement.

🍞 Anchor: For “What’s the label of the steepest drop?”, the preserving view keeps just the important curves and legends, making the right evidence stand out.

🍞 Hook: Now imagine the opposite: a photo where just the crucial clue is erased, but everything else looks normal.

🥬 Filling (Concept: Evidence-Ablated View)

  • What it is: A version of the image where the decisive visual bits are blanked, so the question becomes unanswerable from pixels alone.
  • How it works: 1) Pinpoint the exact marks revealing the answer, 2) Remove or hide them, 3) Keep the layout so it still looks like the same image family.
  • Why it matters: Without it, the model might still answer from text hints or priors, not the actual evidence.

🍞 Anchor: If the answer depends on a blue line’s lowest point, the ablated view hides that line while keeping axes and legend shells, forcing the model not to pretend it still sees it.

🍞 Hook: Think of comparing two opinions—how far apart are they?

🥬 Filling (Concept: KL Divergence, used Bi-Directionally)

  • What it is: KL divergence measures how different two answer distributions are.
  • How it works: 1) Compute the model’s answer probabilities on the original image, 2) Compute them on the special view, 3) Use KL to pull them closer (preserving) or push them apart (ablated).
  • Why it matters: Without KL guidance, the model won’t get a precise “move closer” or “move away” signal.

🍞 Anchor: If the model gives 80% to “Blue” on the original and 85% on the preserving view, the KL pull says, “Get even closer.” If it gives 80% on the ablated view too, the KL push says, “No—these should differ.”
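
To make the two signals concrete, here is a tiny sketch with toy answer distributions over four options; the particular KL direction and numbers are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def kl(p, q):
    # KL(p || q) for two probability vectors; small epsilon for numerical safety
    eps = 1e-8
    return torch.sum(p * ((p + eps).log() - (q + eps).log()))

# Toy answer distributions over options [A, B, C, D]
p_orig = torch.tensor([0.10, 0.80, 0.05, 0.05])  # prediction on the original chart
p_pres = torch.tensor([0.05, 0.85, 0.05, 0.05])  # prediction on the evidence-preserving view
p_abl  = torch.tensor([0.10, 0.80, 0.05, 0.05])  # prediction on the evidence-ablated view

print(kl(p_pres, p_orig))  # small -> the "pull" signal asks to make it even smaller
print(kl(p_abl,  p_orig))  # near zero -> bad: the "push" signal wants this to grow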

🍞 Hook: Imagine a game where your team’s score is judged relative to how a group of plays went, so you don’t get tricked by easy wins or flukes.

🥬 Filling (Concept: GRPO—Group Relative Policy Optimization)

  • What it is: GRPO is a reinforcement learning setup that stabilizes training by scoring actions relative to a group of rollouts.
  • How it works: 1) Generate several answer attempts, 2) Score them with a rule-based checker, 3) Compare attempts within the group, 4) Update the model toward better attempts while controlling big jumps.
  • Why it matters: Without GRPO, training can wobble, and the shaping signals might be too noisy.

🍞 Anchor: It’s like grading multiple tries of the same question side by side, then nudging the model toward the best try without overreacting.
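
A minimal sketch of the group-relative scoring idea, assuming a simple 1-or-0 correctness reward from a rule-based checker; the normalization below is a common GRPO-style choice shown for illustration, not the paper's exact training code.

```python
import torch

def group_relative_advantages(rewards):
    # Score each rollout relative to the other rollouts for the same question.
    r = torch.tensor(rewards, dtype=torch.float32)
    return (r - r.mean()) / (r.std(unbiased=False) + 1e-6)

# Six answer attempts for one chart question: 1.0 if the verified
# multiple-choice answer is correct, 0.0 otherwise.
rewards = [1.0, 0.0, 1.0, 1.0, 0.0, 0.0]
print(group_relative_advantages(rewards))
# Correct attempts get positive advantages and wrong ones negative, so the
# update nudges the model toward the better tries without overreacting.
```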

Why it works (intuition, no equations): The preserving pull teaches coverage of all needed pixels; the ablated push breaks text-only shortcuts; GRPO keeps learning steady. Together, the model forms a durable habit: answers change when the true visual evidence changes—and stay steady when distractors change.

03 Methodology

At a high level: Input (image + question) → Build two special views (keep-evidence and hide-evidence) → Train with two forces (pull toward keep-view, push from hide-view) inside GRPO → Output: a model that naturally looks at the right pixels.

Step-by-step like a recipe:

  1. Collect training items
  • What happens: Use chart data where each figure is generated by code (so every line, bar, and legend has a known source). The paper samples 50K chart-code pairs from ECD.
  • Why it exists: Code-level control lets us edit precisely—keep or remove only the exact elements that matter—far better than rough rectangles.
  • What breaks without it: Pixel crops miss thin or wavy lines; random masks hide the wrong bits; supervision gets noisy.
  • Example: A line chart with three series is created in Python code. Because we have the code, we can surgically remove just the “blue line” or keep only “subplot (a)”.
  2. Make questions easy to check
  • What happens: An auxiliary LLM rewrites open-ended questions into multiple-choice with verified correct options, using chart code and metadata to self-check.
  • Why it exists: Reinforcement learning needs dependable rewards. Multiple choice with a known correct answer provides that.
  • What breaks without it: If the reward is wrong or fuzzy, the model learns bad habits.
  • Example: “Which line decreases most?” becomes options A–D with one true answer, confirmed from the chart’s data.
  3. Filter out trivial cases
  • What happens: If the base model already gets a question correct in all 8 tries, discard it. Focus training on challenging items.
  • Why it exists: Hard examples teach the model more.
  • What breaks without it: Time is wasted cementing what the model already knows; shaping signals have smaller impact.
  • Example: Simple “Which bar is tallest?” cases that the base model never misses are removed.
  4. Build the evidence-preserving view (I_pres)
  • What happens: Edit the chart code to keep just the necessary lines, legends, or subplots, while preserving layout (axes, positions, legend structure). Render this precise view.
  • Why it exists: This is the “grounded yes” target—where looking the right way should give the same answer as the full image.
  • What breaks without it: The model has no clean anchor showing “this is the right stuff to focus on.”
  • Example: For “In subplot (a), how spread is the blue curve?”, the preserving view keeps subplot (a), the blue curve, and needed axes; it blanks unrelated panels or series but leaves their frames and legends.
  5. Build the evidence-ablated view (I_abl)
  • What happens: Edit the chart code to hide the critical evidence while keeping the layout intact—so now the question becomes unanswerable from pixels.
  • Why it exists: This is the “grounded no” counter-example—when the key pixels vanish, the model should not pretend it still knows the answer.
  • What breaks without it: The model might still rely on text or priors and output the same answer, ungrounded in vision.
  • Example: If the answer depends on the blue line’s minimum, remove that line (but keep axes/legend shells). A human can’t answer now; neither should the model. (A minimal code sketch of building both views follows this list.)
  6. Consistency stage (pull toward I_pres)
  • What happens: Train the model so its answer distribution on the original image becomes more like its distribution on I_pres. This minimizes KL divergence (a “move closer” signal) but only on samples verified as correct.
  • Why it exists: It teaches the model that focusing on the preserved evidence leads to stable, correct answers—even when distractors are present in the full image.
  • What breaks without it: The model may keep wobbling toward distractors because it never gets a clear pull to the right pixels.
  • Example: If I_pres nudges probability toward “Blue” being the steepest drop, the model learns to match that even when extra curves exist in the full image.
  7. Separation stage (push from I_abl)
  • What happens: Train the model so its answer distribution on the original image becomes different from its distribution on I_abl. This maximizes KL divergence (a “move apart” signal) under a gentle cap for stability.
  • Why it exists: It breaks text-only shortcuts; if the vital pixels are gone, the model should not act as if it still sees them.
  • What breaks without it: The model might align with I_pres but still not truly depend on the decisive pixels, answering from priors.
  • Example: If the ablated view hides the key blue line yet the model still answers “Blue,” the push signal says, “No, these should diverge.”
  8. Coarse-to-fine training curriculum
  • What happens: First run the Consistency stage (build the habit to focus). Then run the Separation stage (make the habit robust to remove shortcuts). Both are done inside GRPO for stable RL learning.
  • Why it exists: Doing both at once can produce competing gradients; ordering them reduces conflict and speeds convergence.
  • What breaks without it: Joint training may wobble; reversing the order can regularize too early, before a stable focus is learned.
  • Example: The paper shows better scores with this order than with joint or reversed schedules.
  9. Secret sauce
  • Programmatic code editing: Because charts come from code, we can keep/hide exactly the right visual atoms—lines, bars, legends—while keeping layout. This makes the pull/push signals clean and reliable.
  • Bi-directional KL shaping: The model learns a rule-of-thumb—“answers should agree when evidence is present and change when evidence is absent.” That’s the core of visual grounding.
  • GRPO stability: Group-based normalization keeps the updates measured, so the two forces train the model’s eyes instead of shaking them.
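
The paper's view construction edits the ECD chart code itself, which is not reproduced here; the hypothetical matplotlib sketch below shows the same idea, with the keep_only_blue and drop_blue flags standing in for those programmatic edits (the series, file names, and flags are assumptions for illustration only).

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

def render_chart(path, keep_only_blue=False, drop_blue=False):
    x = np.linspace(0, 50, 200)
    series = {
        "blue":  (np.exp(-x / 10.0),           "tab:blue"),
        "red":   (1.0 - x / 60.0,              "tab:red"),
        "green": (0.5 + 0.3 * np.sin(x / 8.0), "tab:green"),
    }
    fig, ax = plt.subplots()
    for name, (y, color) in series.items():
        if keep_only_blue and name != "blue":
            continue  # evidence-preserving view: drop distractor series
        if drop_blue and name == "blue":
            continue  # evidence-ablated view: hide the decisive series
        ax.plot(x, y, color=color, label=name)
    ax.set_xlabel("x")          # layout (axes, labels) stays the same in every view
    ax.set_ylabel("value")
    ax.legend(loc="upper right")
    fig.savefig(path)
    plt.close(fig)

render_chart("original.png")                      # full chart
render_chart("i_pres.png", keep_only_blue=True)   # I_pres: keep the needed evidence
render_chart("i_abl.png",  drop_blue=True)        # I_abl: question unanswerable from pixels
```

Editing the generating code, rather than masking pixels, is what lets the two views hug thin or wavy marks exactly instead of relying on clumsy rectangles.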

Concrete walkthrough with data:

  • Input: A 4-panel chart; question: “Which curve in panel (a) decreases the most from 0 to 50?” Options: A) Red B) Blue C) Green D) Yellow.
  • I_pres: Keep only panel (a), keep all curves needed for comparison, keep axes and legend; blank other panels.
  • I_abl: Keep layout but remove the curves in panel (a) (or at least the decisive one), so a human cannot answer from pixels.
  • Training: Stage 1 pulls original predictions to match I_pres (if “Blue” is correct, increase its probability on the full image). Stage 2 pushes original predictions away from I_abl (do not give the same confident answer when the key line is gone).
  • Result: The model learns that thin, irregular curves decide the answer—and it should track them.

Putting it all together: Input → (Rewrite question to multiple choice, filter hard ones) → (Generate I_pres and I_abl by editing chart code) → (Stage 1: KL pull to I_pres) → (Stage 2: KL push from I_abl) → Output: A model that, during normal answering, looks in the right places—no extra tools needed.
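
As a rough sketch of how the two stages could attach to a model, the function below computes either the consistency (pull) or separation (push) KL term. The model callable, the kl_cap value, and the toy demo are assumptions for illustration, not the released training code; in the paper these terms are combined with the GRPO objective rather than used alone.

```python
import torch

def shaping_term(model, question, img_orig, img_view, stage, kl_cap=2.0):
    # model(image, question) is a hypothetical callable returning log-probabilities
    # over the answer options (shape: [num_options]).
    logp_orig = model(img_orig, question)
    with torch.no_grad():
        p_view = model(img_view, question).exp()  # I_pres or I_abl, no gradient
    kl = torch.sum(p_view * (p_view.clamp_min(1e-8).log() - logp_orig))
    if stage == "consistency":
        return kl                        # Stage 1: minimize KL to the preserving view
    return -torch.clamp(kl, max=kl_cap)  # Stage 2: maximize (capped) KL to the ablated view

# Toy stand-in "model" keyed by image id, just to show the two terms in action.
def toy_model(image, question):
    probs = {"orig": [0.10, 0.80, 0.05, 0.05],
             "pres": [0.05, 0.85, 0.05, 0.05],
             "abl":  [0.10, 0.80, 0.05, 0.05]}[image]
    return torch.tensor(probs).log()

q = "Which curve in panel (a) decreases the most from 0 to 50?"
print(shaping_term(toy_model, q, "orig", "pres", "consistency"))  # small: keep matching I_pres
print(shaping_term(toy_model, q, "orig", "abl", "separation"))    # KL near zero: push apart
```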

04 Experiments & Results

The test: The authors checked whether the model truly grounds its answers in the pixels across chart and general visual reasoning tasks. They measured accuracy (how often the model gets the multiple-choice answer right) because it cleanly reflects correct reasoning with verified options.
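
To illustrate why verified multiple-choice answers make accuracy a dependable metric (and reward), here is a minimal, assumed scoring routine; the paper's actual verifier may differ.

```python
import re

def extract_choice(model_output):
    # Return the first standalone option letter A-D found in the model's output.
    match = re.search(r"\b([A-D])\b", model_output.strip())
    return match.group(1) if match else None

def accuracy(predictions, gold_answers):
    correct = sum(extract_choice(p) == g for p, g in zip(predictions, gold_answers))
    return correct / max(len(gold_answers), 1)

print(accuracy(["The answer is B.", "C", "I think (A)"], ["B", "C", "D"]))  # -> 0.666...
```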

The competition: They compared against the base open model Qwen2.5-VL-7B, other open models like InternVL3-8B, specialized chart reasoners (e.g., ChartLlama, Chart-R1), broader multimodal RL methods (e.g., Vision-R1, R1-OneVision, DeepEyes), and even closed-source leaders like GPT-4o and Claude Sonnet.

The arenas (benchmarks):

  • Charts and figures: CharXiv, ChartQAPro, ChartMuseum, Evochart, ECD-Bench.
  • General visual math and mixed reasoning: MathVista, MathVision, MathVerse-VO, MMStar.

Scoreboard with context:

  • With only 13K chart samples (BiPS-Chart-7B), average accuracy jumped from 44.3 to 51.6 (about +7.3 points). That’s like going from a solid B to a strong A- across many classes.
  • After adding 39K math-focused samples with standard GRPO (BiPS-General-7B), the average rose further to 52.5 (+8.2 points over the base). That’s edging from A- toward A, while still using no extra tools at test time.
  • Big individual wins: Evochart soared from 52.0 to 68.2 (+16.2) with BiPS-Chart-7B, and CharXiv rose from 42.5 to 49.4 (+6.9). These datasets feature tricky, thin lines and multi-panel layouts—exactly where fine-grained seeing matters.
  • Out-of-domain gains: Despite training only on chart-derived views, BiPS-Chart-7B improved MathVista (+5.3) and MMStar (+2.8), showing the shaping of “where to look” transfers beyond charts.
  • Head-to-head against a strong RL baseline trained on the same mixed data, BiPS-General-7B still led—especially on CharXiv (+5.2). This shows that smarter perception signals, not just more RL, make the difference.

Surprising and insightful findings:

  • Precision beats randomness: Replacing the code-edited views with random masking hurt performance. Carefully keeping/hiding the exact evidence matters far more than blotting out pixels blindly.
  • Training order matters: The two-stage, coarse-to-fine schedule outperformed doing both losses together and outperformed reversing the order. First learn to focus; then make that focus robust.
  • Small data, big effect: Only 13K chart examples trained this habit, yet it beat or matched systems trained on hundreds of thousands or even millions of chart items. That’s data efficiency.
  • No test-time crutches: Despite removing helper tools at inference, results improved across the board. The model’s eyes were shaped at training, so answering stayed fast and clean.

A human-friendly reading of a key case: On CharXiv, the base model hallucinated extra line intersections because it leaned on textual or statistical priors. BiPS-Chart traced the actual curves, counted real crossings, and answered correctly. That’s the difference between sounding plausible and being grounded.

Bottom line: Across eight benchmarks, BiPS consistently lifted accuracy, especially on fine-grained chart tasks, while also transferring gains to visual math and mixed-image reasoning—all without adding any extra steps at answer time.

05 Discussion & Limitations

Limitations (honest view):

  • Needs paired views: BiPS relies on having an evidence-preserving and an evidence-ablated view. In charts, code editing makes this precise. In messy natural photos, building such pairs is harder without good segmentation tools.
  • Domain coverage: Training focused on charts. While benefits transfer, the biggest boosts appear where evidence is precise and structured. Broader domains may need their own view-generators.
  • OCR sensitivity: Reinforcement fine-tuning can sometimes tug against text-reading sharpness; careful prompts or mixed-data training help counter that.
  • Tooling dependencies: The current pipeline uses an auxiliary LLM to rewrite questions and edit code. Although this is still training-time only, careful validation and error-checking of these automated edits matter.

Required resources:

  • Data with controllable rendering (best: code-generated charts) or strong automatic segmentation for other domains.
  • Compute for RL fine-tuning (the paper used 8×H100 GPUs) and an evaluator to check multiple-choice answers.
  • A capable base VLM (e.g., Qwen2.5-VL-7B) and a rule-based verifier for rewards.

When not to use:

  • If you can’t obtain reliable, semantically precise preserving/ablated views, the shaping signals may become noisy.
  • If answers depend purely on text with almost no visual evidence, BiPS adds little.
  • If test-time latency isn’t a concern and you already have excellent inference tools for every domain, sticking with those might suffice.

Open questions:

  • Multi-domain shaping: How best to build preserving/ablated views for natural images, videos, and diagrams automatically and at scale?
  • Beyond charts: Can similar programmatic pipelines be built for scientific diagrams, medical images, or UI screenshots?
  • Hybrid curricula: What’s the ideal blend of consistency/separation strength over time, and can adaptive schedules improve stability further?
  • Explaining attention: Can we extract human-readable maps showing exactly which pixels drove the answer, closing the loop on interpretability?
  • Robust OCR synergy: How to preserve or even enhance text-reading while strengthening visual grounding via RL?

06 Conclusion & Future Work

Three-sentence summary: This paper introduces Bi-directional Perceptual Shaping (BiPS), which trains a model’s eyes by pulling answers toward a view that keeps the needed evidence and pushing them away from a view that hides it. Using clean, programmatically edited chart views and training inside GRPO, BiPS teaches the model to rely on the true pixels, not text shortcuts. It achieves strong, data-efficient gains across chart and general benchmarks with zero extra steps at test time.

Main achievement: Turning inference-time visual cues into pure training signals—via bi-directional KL shaping—so models internalize where-to-look habits and handle fine, irregular evidence like thin polylines.

Future directions: Extend view construction beyond charts using automatic segmentation or procedural domains, explore multi-domain curricula, balance OCR and grounding, and develop interpretable visual attributions.

Why remember this: BiPS shows that carefully crafted training views can rewire a model’s perception, making it both sharper and faster at test time. It’s a blueprint for teaching models to “see less but see right,” which matters wherever tiny visual details decide the truth—from school labs to business dashboards and beyond.

Practical Applications

  • Dashboard assistants that accurately read charts and summarize trends without misreading thin lines.
  • STEM tutoring that solves visual math problems by grounding steps in diagrams, not text guesses.
  • Scientific paper readers that extract correct values from multi-panel figures and legends.
  • Business report analyzers that compare KPIs across subplots and avoid being fooled by clutter.
  • QA systems for lab notebooks that verify claims using precise visual evidence from plots.
  • Educational apps that explain how an answer comes from the picture, reinforcing visual-critical thinking.
  • Data cleaning tools that flag inconsistent or missing visual evidence in charts before publication.
  • Medical training simulators that teach models to rely on contours and small markers instead of priors.
  • UI analytics that understand tiny icon changes or status indicators in complex screens.
  • Robotics vision modules that learn to trust decisive visual cues and ignore distractors in dashboards.
#BiPS · #perceptual shaping · #vision-language models · #multimodal reasoning · #evidence-preserving view · #evidence-ablated view · #KL divergence · #GRPO · #chart understanding · #visual grounding · #reinforcement learning · #out-of-domain generalization · #visual chain-of-thought · #programmatic code editing · #VQA