
CPPO: Contrastive Perception for Vision Language Policy Optimization

Intermediate
Ahmad Rezaei, Mohsen Gholami, Saeed Ranjbar Alvar et al. · 1/1/2026
arXiv · PDF

Key Summary

  ‱ CPPO is a new way to fine‑tune vision‑language models so they see pictures more accurately before they start to reason.
  ‱ It automatically finds the words (tokens) in an answer that truly depend on the image by checking which tokens become uncertain when the image is partially hidden.
  ‱ For those vision‑dependent tokens, it uses a contrastive loss that says: stay the same when the picture only changes slightly, but change when important parts are removed.
  ‱ CPPO adds this perception‑focused loss on top of a standard reinforcement learning objective (GRPO), only for successful rollouts, to avoid learning from bad guesses.
  ‱ Unlike earlier methods, CPPO does not need extra judge models, special tags, or ground‑truth chain‑of‑thought labels.
  ‱ On Qwen2.5‑VL‑3B, CPPO lifts average accuracy from 37.8% (GRPO) to 40.0%, and on 7B from 46.7% to 48.2%, beating other perception‑aware baselines like PAPO.
  ‱ It learns faster and generalizes better out of domain because it teaches the model what visual details matter and when.
  ‱ An ablation shows all pieces help: selecting top‑k perception tokens and gating by positive advantage together give the biggest boost.
  ‱ Training takes only about 39% longer than GRPO, but even doubling GRPO’s time can’t match CPPO’s gains.
  ‱ CPPO is simple, scalable, and focuses training exactly where it counts: the tokens that truly come from seeing.

Why This Research Matters

Many real-world questions involve pictures, charts, forms, maps, or diagrams, and small visual misreads cause big mistakes. CPPO teaches models to anchor specific words to real visual evidence, so they stop guessing when key details are missing and stop changing answers when the picture only changes slightly. This makes AI helpers far more trustworthy for schoolwork with diagrams, business dashboards, technical manuals, and data entry from forms. Organizations save time and avoid costly errors because the model learns what to look at and why. Because CPPO is unsupervised and avoids extra judge models or special tagging, it is simpler to scale in production. And since it builds on standard RL, teams can add it to existing training pipelines without a full redesign.

Detailed Explanation


01 Background & Problem Definition

🍞 Hook: Imagine you’re solving a picture puzzle in a workbook. First, you have to notice the tiny details in the drawing (like a hidden number), and then you can use your math or logic skills to solve the problem. If you miss the detail, even perfect math won’t save you.

🥬 The Concept (Reinforcement Learning, RL): RL is a way to train AI by trying, getting a score, and improving to get a better score next time. How it works:

  1. The model answers a question about a picture.
  2. It gets a reward if the final answer is right.
  3. It updates itself to get more right answers later.

Why it matters: Without RL, the model might never discover better strategies for hard problems that need many steps.

🍞 Anchor: A model figures out the number of triangles in a picture; if it’s right, it gets a point and learns to count better next time.

🍞 Hook: You know how some kids are great at reading but struggle with picture puzzles? Language‑only models are like that—great with words, weaker with images.

🥬 The Concept (Vision‑Language Models, VLMs): VLMs are models that look at images and read text together to answer questions. How it works:

  1. A vision encoder looks at the image.
  2. A language model reads the question and the vision features.
  3. It generates a step‑by‑step answer.

Why it matters: Many real questions are about pictures, charts, or maps—words alone aren’t enough.

🍞 Anchor: “What fraction of the circle is shaded?” needs both seeing the shaded part and then doing the math.

🍞 Hook: Think of writing your answer as two parts: seeing and thinking. If you misread the picture, the smartest reasoning won’t help.

🥬 The Concept (Perception vs. Reasoning Tokens): Perception tokens are the words that directly come from seeing the image; reasoning tokens are the logical steps that follow. How it works:

  1. The model writes words that describe visual facts (e.g., “angle A is 40°”).
  2. Then it reasons with those facts (e.g., “sum of angles in a triangle is 180°”).

Why it matters: If the perception tokens are wrong, the final answer is almost guaranteed to be wrong—even if the reasoning is perfect.

🍞 Anchor: If you misread a graph label as “201” instead of “2010,” your whole conclusion will be off.

🍞 Hook: Earlier, people tried to make models separate ‘what I see’ and ‘how I think’ by using special tags like <perception> and <think>, or by asking a second model to grade the perception.

🥬 The Concept (Limits of Prior Approaches): Prior methods either forced explicit separation, relied on extra judge models, used ground‑truth chains of thought, or punished every token equally. How it works:

  • Forced tags risk reward‑hacking and break natural flow.
  • Extra judges are slow and costly.
  • Ground‑truth CoT doesn’t scale.
  ‱ Penalizing all tokens over‑regularizes reasoning and can reinforce wrong perceptions.

Why it matters: These make training fragile, expensive, or misdirected.

🍞 Anchor: It’s like grading an essay by asking a whole panel every time, or yelling at every sentence equally—even the good ones.

🍞 Hook: So what was missing? A way to automatically find which words truly depend on the picture, and then train only those words to be visually correct.

🥬 The Concept (The Gap): VLMs needed a self‑contained method to detect perception tokens and train them without extra labels or judges. How it works:

  • Detect vision‑dependent tokens using the model’s own uncertainty shift when the image is partially hidden.
  ‱ Teach those tokens to stay stable for tiny, harmless image changes and to change when key information is removed.

Why it matters: This focuses learning exactly where it counts and avoids hurting reasoning tokens.

🍞 Anchor: Like highlighting the few puzzle pieces that actually determine the final picture, then practicing with those pieces.

🍞 Hook: Why should you care? Because better perception means fewer silly mistakes—from reading a chart wrong to misreading a digit in a diagram.

🥬 The Concept (Real Stakes): CPPO targets the exact place where many multimodal answers fail: perception. How it works:

  • It improves what the model “sees” so the later “thinking” has solid footing.
  ‱ It removes the need for extra judge models and fragile tags.

Why it matters: This makes assistants better at homework diagrams, data dashboards, forms, and instructions with pictures.

🍞 Anchor: A homework helper that finally reads the angle as 40°, not 60°, and then solves the triangle correctly.

02 Core Idea

🍞 Hook: Imagine comparing two photos of the same scene—one is just slightly filtered, the other has key objects covered with sticky notes. If your description changes only when the sticky notes hide something important, you’re doing perception right.

🥬 The Concept (Aha!): CPPO teaches models to keep perception tokens consistent under harmless image tweaks and to be sensitive when key visual info is removed—using contrast. How it works (at a glance):

  1. Generate an answer to a question about an image.
  2. Find which tokens depend on the image by measuring how uncertain they get when the image is partially obscured.
  3. For those tokens, compare three views: original image (anchor), small harmless change (positive), big info‑removing change (negative).
  4. Pull the anchor close to the positive and push it away from the negative (contrastive loss).
  5. Add this loss to RL, only when the rollout was good (advantage gating).

Why it matters: The model learns exactly which words must be visually grounded and how.

🍞 Anchor: If the token is “40°,” it should stay “40°” after a mild color change, but not when the relevant label is cropped out.

Multiple Analogies:

  • Camera focus: Keep focus sharp if the scene is the same, blur changes only if the subject is hidden.
  • Lie detector: If a detail flips whenever you delete the evidence, that detail truly depended on the evidence.
  • Balance scale: Match the weight (distribution) with the harmless twin, not with the version missing key pieces.

Before vs After:

  • Before: RL rewarded only the final answer, mixing perception and reasoning errors; models could fail by misreading images.
  • After: CPPO targets perception tokens specifically with contrastive pressure, reducing visual misreads while preserving reasoning flow.

Why It Works (intuition):

  • Vision‑dependent tokens become uncertain when evidence disappears; that uncertainty jump is a clean signal of “this came from seeing.”
  • Contrastive learning gives a directional push: stable for small, irrelevant changes; different for real information loss.
  • Gating by good rollouts ensures we lock in correct perceptions, not mistakes.

Building Blocks (each with a mini‑sandwich):

  ‱ 🍞 Hook: You know how a scoreboard shows if you did better than the group? 🥬 The Concept (GRPO): A relative RL method that updates policy using group‑based rewards and a safety clip. How: Sample several answers, score them, compute advantages vs group mean, update carefully. Why: Avoids huge, unstable jumps. 🍞 Anchor: If your answer is better than most, your style gets boosted.
  ‱ 🍞 Hook: Imagine checking which words wobble when you dim the lights in a picture. 🥬 The Concept (Entropy‑based Detection): Find tokens whose uncertainty rises most when the image is obscured; these are perception tokens. How: Compute entropy per token with original vs partially hidden image; pick top‑k with biggest increase. Why: Focus training only where the model truly looks. 🍞 Anchor: The token “red bar = 12” should become shaky if you hide the red bar.
  ‱ 🍞 Hook: Think of A/B/C comparisons. 🥬 The Concept (CPL via InfoNCE): Use original as anchor, harmless tweak as positive, info‑removing as negative; pull anchor toward positive, push from negative. How: Compute distribution similarities and apply a contrastive loss at the token level. Why: Trains sensitivity to the right differences. 🍞 Anchor: Keep “40°” close to “40° under color jitter,” far from “guess under 70% crop.”
  ‱ 🍞 Hook: Only reward the moves that win the game. 🥬 The Concept (Advantage Gating): Apply CPL only when the rollout beats the group baseline. How: If advantage > 0, include CPL; else, skip. Why: Prevents locking in wrong perceptions. 🍞 Anchor: Don’t memorize the play from a losing round.

03 Methodology

At a high level: Question + Image → Generate multiple answers (rollouts) with RL → Detect perception tokens by entropy shift → Build two image variants (harmless and info‑removing) → Apply contrastive loss only on perception tokens from good rollouts → Update policy.
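To keep the steps below easy to follow, here is a minimal outline of one CPPO update written as Python-style pseudocode. All names (cppo_update, the argument names, the comments) are illustrative placeholders rather than the authors' code; each stage is sketched more concretely after the corresponding step below.

```python
# Illustrative outline only; names are hypothetical, not the paper's implementation.
def cppo_update(policy, question, image, k_frac=0.5, lam=0.02):
    """One CPPO training step = GRPO + contrastive perception loss on selected tokens."""
    # 1) Sample a group of rollouts and score their final answers (GRPO).
    # 2) Compute group-relative advantages from the rewards.
    # 3) Build an info-removing view I- and an info-preserving view I+ of the image.
    # 4) Re-score each rollout's tokens under I-; keep the top-k tokens whose entropy rises most.
    # 5) On those tokens, apply InfoNCE: anchor (original image) ~ positive (I+) vs negative (I-).
    # 6) Gate the contrastive loss by advantage > 0, scale it by lambda, and add it to the GRPO loss.
    ...
```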

Step 1: Rollouts with GRPO

  • What happens: For each (question, image), sample several answers, score correctness, compute advantages vs group mean, apply clipped policy update with a KL safety penalty.
  • Why it exists: RL turns final‑answer feedback into better policies; clipping and KL keep training stable.
  • Example: For “What is angle CAD?”, five rollouts yield different answers; right ones get higher rewards and positive advantages.
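A minimal sketch of the group-relative advantage and clipped update described in this step, using toy numbers; the normalization, clip range, and the omitted KL term are assumptions rather than the paper's exact settings.

```python
import torch

# Toy rewards for a group of G = 5 rollouts of the same (question, image); 1.0 = correct answer.
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0, 0.0])

# Group-relative advantage: how much better each rollout is than the group
# (std-normalization here is an assumption, not necessarily the paper's exact formula).
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Clipped surrogate on per-token log-probs of one rollout (toy values, 4 tokens).
logp_new = torch.tensor([-1.2, -0.8, -2.0, -1.5], requires_grad=True)
logp_old = torch.tensor([-1.3, -0.9, -1.8, -1.6])
ratio = (logp_new - logp_old).exp()
adv = advantages[0]                      # advantage of the rollout these tokens belong to
eps = 0.2                                # clip range (illustrative)
surrogate = torch.minimum(ratio * adv, ratio.clamp(1 - eps, 1 + eps) * adv)
grpo_loss = -surrogate.mean()            # a KL safety penalty to a reference policy is added in practice
```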

Step 2: Build an Information‑Removing Image (I−)

  • What happens: Create a version that hides key content (e.g., 80% patch masking or 30% crop), likely removing query‑relevant bits.
  • Why it exists: This lets us test which tokens truly depended on visual evidence—those should become less confident.
  • Example: Crop out the corner where “40°” is printed; any token stating that value should get uncertain.
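A sketch of one way to build the information-removing view by zeroing out roughly 80% of image patches; the patch size, mask value, and tensor layout are illustrative assumptions (a heavy crop would serve the same purpose).

```python
import torch

def info_removing(image: torch.Tensor, patch: int = 14, mask_ratio: float = 0.8) -> torch.Tensor:
    """Zero out a random ~80% of non-overlapping patches (illustrative I- construction)."""
    c, h, w = image.shape
    out = image.clone()
    gh, gw = h // patch, w // patch
    keep = torch.rand(gh, gw) > mask_ratio          # True = patch survives
    for i in range(gh):
        for j in range(gw):
            if not keep[i, j]:
                out[:, i * patch:(i + 1) * patch, j * patch:(j + 1) * patch] = 0.0
    return out

image = torch.rand(3, 224, 224)                     # toy image tensor (C, H, W)
i_minus = info_removing(image)
```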

Step 3: Perception Token Detection via Entropy Increase

  • What happens: For each token in each rollout, measure entropy with original image vs I−; pick top‑k with biggest increase.
  • Why it exists: Only these tokens are vision‑dependent; regularizing all tokens would over‑constrain reasoning.
  • Example: In a chart question, tokens naming bar heights or labels rise to the top‑k; filler words like “thus” don’t.
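A sketch of the entropy-shift selection: score the same rollout with the original image and with I−, measure per-token entropy, and keep the top-k tokens whose entropy increases most. The shapes, random logits, and the 50% value of k are illustrative; how the rollout is re-scored (teacher forcing is assumed here) follows the description above.

```python
import torch
import torch.nn.functional as F

def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Shannon entropy of the next-token distribution at each position. logits: (T, V)."""
    logp = F.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(dim=-1)            # (T,)

T, V = 32, 1000                                        # toy rollout length and vocab size
logits_orig = torch.randn(T, V)                        # rollout tokens scored with the original image
logits_minus = torch.randn(T, V)                       # same tokens scored with I- (info removed)

entropy_gain = token_entropy(logits_minus) - token_entropy(logits_orig)
k = int(0.5 * T)                                       # top-k around 50% worked best in the ablation
perception_idx = entropy_gain.topk(k).indices          # indices of vision-dependent tokens
```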

Step 4: Build an Information‑Preserving Image (I+)

  • What happens: Make a harmlessly tweaked image (color jitter, small rotation, mild noise) that keeps the answerable info intact.
  • Why it exists: Perception tokens should remain stable across harmless changes; this defines the positive pair for contrast.
  • Example: Slightly brighten the image; “40°” should still be “40°.”
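A sketch of an information-preserving view built from standard torchvision augmentations; the particular jitter and rotation strengths are assumptions, chosen so the answerable content stays intact.

```python
import torch
from torchvision import transforms

# Mild, label-preserving augmentations: the queried content remains readable.
info_preserving = transforms.Compose([
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.RandomRotation(degrees=5),
])

image = torch.rand(3, 224, 224)      # tensor images are supported by these transforms
i_plus = info_preserving(image)
```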

Step 5: Contrastive Perception Loss (CPL) at the Token Level

  • What happens: For each selected perception token, compute token probability distributions under Original (anchor), I+ (positive), I− (negative). Use InfoNCE to pull anchor toward positive and push from negative; similarity is negative KL between distributions.
  • Why it exists: It teaches the model which visual differences should matter (info removed) and which shouldn’t (harmless tweaks), precisely at the tokens that read the image.
  • Example: The distribution for the next token “40” stays close under I+, but moves away under I−.
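A minimal token-level sketch of the contrastive idea, with similarity defined as negative KL divergence between next-token distributions; the KL direction, temperature, and toy logits are assumptions.

```python
import torch
import torch.nn.functional as F

def neg_kl(p_logits: torch.Tensor, q_logits: torch.Tensor) -> torch.Tensor:
    """Similarity = -KL(p || q) between next-token distributions; higher = more similar."""
    p_logp = F.log_softmax(p_logits, dim=-1)
    q_logp = F.log_softmax(q_logits, dim=-1)
    return -(p_logp.exp() * (p_logp - q_logp)).sum(dim=-1)

def token_infonce(anchor, positive, negative, tau: float = 1.0) -> torch.Tensor:
    """InfoNCE over one selected perception token: prefer anchor~positive over anchor~negative."""
    sims = torch.stack([neg_kl(anchor, positive), neg_kl(anchor, negative)]) / tau
    return F.cross_entropy(sims.unsqueeze(0), torch.tensor([0]))   # class 0 = the positive view

V = 1000
anchor   = torch.randn(V, requires_grad=True)   # next-token logits with the original image
positive = torch.randn(V)                       # logits with I+ (harmless tweak)
negative = torch.randn(V)                       # logits with I- (information removed)
cpl = token_infonce(anchor, positive, negative)
```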

Step 6: Advantage Gating

  • What happens: Apply CPL only if the rollout’s advantage > 0.
  • Why it exists: Ensures we reinforce correct perceptions, not mistakes.
  • Example: A wrong rollout that said “70°” won’t contribute CPL, preventing bad anchoring.
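A tiny sketch of the gating rule with made-up numbers: only rollouts whose advantage is positive contribute their contrastive perception loss.

```python
import torch

# Toy per-rollout advantages and contrastive perception losses (CPL).
advantages = torch.tensor([0.8, -0.5, 1.2, -0.1])
cpl_per_rollout = torch.tensor([0.40, 0.90, 0.35, 0.70])

# Gate: only rollouts that beat the group baseline (advantage > 0) contribute CPL.
gate = (advantages > 0).float()
gated_cpl = (cpl_per_rollout * gate).sum() / gate.sum().clamp(min=1.0)
```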

Step 7: Combine with GRPO and Update

  • What happens: Final objective = GRPO objective − λ × mean CPL over qualified rollouts; update policy; refresh rollout policy.
  • Why it exists: Merges perception‑aware learning with stable RL so both seeing and thinking improve together.
  • Example: After updates, the model more reliably extracts “40°,” and answers improve across tasks.
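A sketch of how the two terms might be combined in minimization form, with λ = 0.02 taken from the ablation reported below; the sign convention and placeholder values are assumptions.

```python
import torch

grpo_loss = torch.tensor(0.73)   # clipped surrogate (+ KL penalty), to be minimized
gated_cpl = torch.tensor(0.45)   # mean CPL over positive-advantage rollouts
lam = 0.02                       # loss weight; around 0.02 worked best in the paper's ablation

# Minimization form: subtracting lambda * CPL from the maximized objective is the same as
# adding lambda * CPL to the loss being minimized.
total_loss = grpo_loss + lam * gated_cpl
```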

The Secret Sauce:

  • Token‑selective: We only contrast tokens that truly come from seeing, avoiding damage to reasoning.
  • Consistency vs Sensitivity: Be consistent under harmless perturbations; be sensitive when evidence is removed.
  • Advantage gating: Learn perception from good examples, not bad ones.
  • Unsupervised: No extra judges, no CoT labels, no forced tags—scalable and efficient.

Mini Sandwiches for New Concepts:

  ‱ 🍞 Hook: Ever check how uncertain you are when a clue is missing? 🥬 Entropy: A number showing how unsure the model is about the next token. How: Higher entropy = more uncertainty; compare with/without hidden image parts. Why: Big jumps reveal vision‑dependent tokens. 🍞 Anchor: Hiding a graph axis makes the “value” token uncertain.
  ‱ 🍞 Hook: Spot the difference puzzles! 🥬 Contrastive Learning: Learn by pulling similar things together, pushing dissimilar apart. How: Anchor vs positive vs negative. Why: Directly teaches what differences are meaningful. 🍞 Anchor: Same scene vs cropped scene.
  ‱ 🍞 Hook: A friendly nudge scale. 🥬 InfoNCE: A contrastive loss that prefers anchor~positive over anchor~negative. How: Increases similarity to positive, decreases to negative. Why: Clear, strong learning signal. 🍞 Anchor: Keep “40°” with “40° under jitter,” not with “guess under crop.”
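A tiny numeric illustration of the entropy idea above, with made-up probabilities: a confident next-token distribution has low entropy, and hiding the evidence flattens it.

```python
import torch

def entropy(p: torch.Tensor) -> torch.Tensor:
    """Shannon entropy (in nats) of a probability vector."""
    return -(p * p.log()).sum()

confident = torch.tensor([0.90, 0.05, 0.03, 0.02])   # "40" is almost certain with the full image
uncertain = torch.tensor([0.30, 0.25, 0.25, 0.20])   # the label was hidden in I-
print(entropy(confident).item(), entropy(uncertain).item())  # ~0.43 vs ~1.38 nats
```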

04 Experiments & Results

The Test: The team evaluated CPPO on a suite of multimodal reasoning benchmarks (MathVista, DynaMath, WeMath, MathVision, MathVerse, MMMU‑Pro‑Vision, LogicVista), using accuracy@8 with temperature 1.0. This measures how often the right answer appears across multiple samples—a fair way to judge models that can sample several thoughtful responses.
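A sketch of how an accuracy@8-style score can be computed from sampled answers, assuming it means average correctness over 8 samples per question (the benchmarks' official scoring scripts may differ); the data here is made up.

```python
# Hypothetical accuracy@8: average correctness over 8 sampled answers per question.
def accuracy_at_k(sampled_answers, gold, k: int = 8) -> float:
    per_question = [
        sum(a == gold[q] for a in answers[:k]) / k
        for q, answers in enumerate(sampled_answers)
    ]
    return sum(per_question) / len(per_question)

# Toy example: 2 questions, 8 sampled answers each.
samples = [["40", "40", "70", "40", "40", "40", "40", "60"],
           ["12", "12", "12", "12", "10", "12", "12", "12"]]
print(accuracy_at_k(samples, gold=["40", "12"]))   # (6/8 + 7/8) / 2 = 0.8125
```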

The Competition: CPPO was compared against GRPO (the strong RL baseline), and perception‑aware methods like PAPO, Visionary‑R1, Vision‑SR1, Perception‑R1, plus rollout or off‑policy helpers like NoisyRollout, Vision‑Matters, Look‑Back, and OpenVLThinker. All used the same backbones (Qwen2.5‑VL‑3B/7B), data (ViRL39K), and similar training lengths where applicable, keeping things fair.

The Scoreboard (with context):

  • On Qwen2.5‑VL‑3B: CPPO hits 40.0% avg, beating GRPO’s 37.8% and PAPO’s 38.1%. Think of moving from a B‑ to a solid B+ while others hover in the B range.
  • On Qwen2.5‑VL‑7B: CPPO reaches 48.2% vs GRPO’s 46.7% and PAPO’s 46.8%—like nudging from a strong B+ to an A‑ when peers stay just below.
  • Across individual benchmarks, CPPO consistently wins or ties for best, with especially clear gains on the smaller 3B model, showing scalability as models grow.

Surprising/Notable Findings:

  • Targeted helps more than uniform: Applying contrastive loss to all tokens barely helped, but selecting top‑k perception tokens and gating by positive advantage stacked meaningful gains.
  • The top‑k sweet spot: Accuracy rose as k grew to about 50%, then dipped—too many tokens pull in low‑value ones and slow learning.
  • Efficiency tradeoff: CPPO adds ~39% training time for the extra image passes, but even doubling GRPO’s time couldn’t match CPPO’s results—proof the signal quality, not just more steps, drives the win.
  • Better generalization: CPPO showed faster reward growth in training and better out‑of‑domain accuracy, hinting that learning “what to look at” travels well beyond the training set.

Ablations that Make Numbers Meaningful:

  • From GRPO (34.7% on a Geometry3K setup) → +CPL on all tokens (35.0%) → +Top‑k perception only (36.6%) → +Advantage gating (38.6%). Each piece adds, but the full recipe shines.
  • λ (loss weight) tuning: Around 0.02 worked best; too low gives weak guidance, too high can over‑constrain.
  • Perturbation validation: Harmless tweaks barely dent baseline accuracy (<~1.5% drop), while info‑removing ones slash performance by >14%, confirming they’re well chosen for contrast.

In Short: CPPO not only raises scores; it does so by teaching the model to anchor the right words to the right visual evidence. That’s why the gains stick across tasks and sizes.

05 Discussion & Limitations

Limitations:

  • Scale tested up to 7B: Results are strong, but mega‑models (e.g., 72B) weren’t explored here.
  • Backbone variety: Experiments focus on Qwen2.5‑VL; trying InternVL or others would broaden validation.
  • Data size: Trained on ~39K samples; larger datasets could test scalability further.
  • Perturbation design: Although validated, the precise choice of I+ and I− might need tuning per domain (e.g., medical images vs charts).

Required Resources:

  • Compute to run triple forward passes for selected tokens (original, I+, I−) and groups of rollouts for RL.
  • A stable RL training stack (e.g., verl), with mixed‑precision and multi‑GPU for efficiency.
  • Curated perturbation pipelines that preserve vs remove information as intended.

When NOT to Use:

  • Text‑only tasks: There’s no visual perception to improve.
  • Domains where tiny pixel shifts change semantics (e.g., OCR on ultra‑low‑res scans), making “harmless” tweaks unsafe.
  • Settings with no reliable verifiable reward, since the RL backbone relies on correctness signals.

Open Questions:

  • Can adaptive top‑k (per example) beat fixed k? Can we learn k from data?
  • Can we auto‑learn the best perturbations instead of hand‑designing?
  • How does CPPO interact with more advanced rollout strategies (e.g., selective replay, forced rethinking) when combined?
  • Can we extend perception‑contrast to multi‑image, video, or 3D inputs?
  • Could a softer gating (weighting by advantage size) further stabilize learning?

06 Conclusion & Future Work

3‑Sentence Summary: CPPO is a reinforcement learning method that identifies which answer tokens truly come from seeing the image, then trains only those tokens with a contrastive loss. It keeps perception tokens stable under harmless image changes and sensitive when key information is removed, all while updating the model with standard RL on final answers. This targeted, unsupervised approach beats prior perception‑aware methods without extra judges, tags, or CoT labels.

Main Achievement: A simple, scalable recipe—entropy‑based perception token detection + token‑level contrastive loss + advantage gating—that cleanly disentangles and improves perception inside natural reasoning.

Future Directions:

  • Scale to larger backbones and broader architectures; combine with stronger rollout curricula.
  • Auto‑learn perturbations per domain; explore video/multi‑image extensions.
  • Dynamic top‑k and softer advantage weighting for finer control.

Why Remember This: CPPO shows that “seeing” can be trained precisely where it lives: in the tokens that depend on vision. By focusing contrast where it matters, models make fewer silly visual mistakes and reason better as a result—turning multimodal AI into a more trustworthy problem solver.

Practical Applications

  ‱ Homework helpers that correctly read diagrams, plots, and geometry labels before solving.
  ‱ Business analytics bots that reliably extract numbers from charts and dashboards without misreading axes.
  ‱ Document processing systems that accurately capture fields from scanned forms despite minor image variations.
  ‱ Technical support assistants that interpret equipment diagrams and wiring schematics with fewer perception slips.
  ‱ Medical triage tools that read measurements or annotations on scans and charts more reliably (with proper safeguards).
  ‱ Robotics and AR assistants that understand scene labels and signs consistently under lighting changes.
  ‱ Data labeling assistants that produce stable visual captions unless key objects are truly occluded.
  ‱ Math tutoring apps that ground answers in the visible steps of the student’s worksheet photos.
  ‱ Compliance tools that extract required fields from ID documents even with harmless image distortions.
  ‱ Scientific assistants that read plots (e.g., error bars, axes) consistently across small rendering differences.
#CPPO#Contrastive Perception Loss#Vision-Language Models#Reinforcement Learning#GRPO#Token-level Contrastive Learning#Entropy-based Token Detection#InfoNCE#Perception vs Reasoning Tokens#Information-preserving Perturbations#Information-removing Perturbations#Multimodal Reasoning#Visual Grounding#PAPO#Qwen2.5-VL
Version: 1