
CogFlow: Bridging Perception and Reasoning through Knowledge Internalization for Visual Mathematical Problem Solving

Intermediate
Shuhang Chen, Yunqiu Xu, Junjie Xie et al. Ā· 1/5/2026
arXiv Ā· PDF

Key Summary

  • This paper teaches AI to solve diagram-based math problems by copying how people think: first see (perception), then make sense of what you saw (internalization), and finally reason (solve the problem).
  • The key new idea is a middle step called knowledge internalization that turns low-level visual facts (like points, lines, circles) into ready-to-reason knowledge (like right angles or equal lengths).
  • To see pictures better, the model gets two kinds of visual rewards: one checks precise geometry (local details) and the other checks the whole layout (global look).
  • A special reward model checks that the reasoning really uses the visual facts, reducing ā€œreasoning drift,ā€ where the steps sound smart but ignore the picture.
  • A visual gate filters out weak perceptions before reasoning starts, so the model doesn’t build arguments on shaky visual ground.
  • All three parts—seeing, internalizing, and reasoning—are trained together with reinforcement learning to stay aligned.
  • A new dataset, MATHCOG, provides over 120K carefully aligned examples that separate watching (perception) from thinking (reasoning).
  • On tough benchmarks like FlowVerse, MathVerse, and MathVista, the method beats many open-source models and rivals larger closed models.
  • Ablation studies show each piece (visual rewards, internalization reward, and visual-gated optimization) matters and they work best together.
  • This approach makes AI’s math reasoning more accurate and more trustworthy because it stays grounded in what the picture truly shows.

Why This Research Matters

Many real tasks mix pictures and math: grading student work with diagrams, checking engineering sketches, or explaining physics graphs. If AI can’t stay faithful to what the picture shows, it may sound confident yet be wrong, which erodes trust. By adding a middle step that turns visual facts into reliable, ready-to-use knowledge—and checking that the reasoning truly uses those facts—this approach makes answers more accurate and explanations more honest. It also shows a path for other multimodal problems: see clearly, internalize faithfully, then reason. Over time, this can power better tutors, safer design tools, and clearer scientific helpers.

Detailed Explanation


01 Background & Problem Definition

You know how when you solve a geometry problem, you first look at the diagram, then you rewrite what you see in your own words (like ā€œAB is a diameter, so angle ACB is 90Ā°ā€), and only then you do the math steps? That simple routine—see, make sense, then reason—has helped students for ages.

šŸž Hook: Imagine trying to solve a maze blindfolded. You might remember some turns, but you’ll often bump into walls because you can’t truly see what’s there.

🄬 The Concept (Perception): In AI for visual math, perception means turning pixels into clear, structured facts (like finding points, lines, circles, and labels in a diagram).

  • How it works: (1) Detect visual pieces (points/lines/circles), (2) label them, (3) normalize them into a shared coordinate system, (4) output a structured list like a clean diagram recipe.
  • Why it matters: If perception is wrong or fuzzy, every step after that becomes shaky.

šŸž Anchor: In a circle diagram, perception extracts ā€œcenter O, radius r, points A, B, C on the circle, and the line AB,ā€ so later steps can use the right facts.

Before this paper, multimodal large language models (MLLMs) could read text and look at images, but they struggled with visual math diagrams. Two main styles existed:

  • One-step reasoning: Mix seeing and thinking at the same time. It’s fast, but messy—like trying to cook and write the recipe simultaneously. Errors in recognition and logic got tangled.
  • Decoupled pipeline: First perceive, then reason. Cleaner, but often the second part drifted away from the first—like writing an essay that forgets the notes you just took from the picture.

šŸž Hook: You know how a friend might tell a very confident story that doesn’t match what actually happened? It sounds good, but it’s not grounded in reality.

🄬 The Concept (Reasoning Drift): Reasoning drift is when the AI’s step-by-step explanation sounds reasonable but quietly ignores the visual evidence it extracted.

  • How it happens: (1) Perception extracts facts, (2) the reasoning stage guesses or uses unrelated theorems, (3) the final steps contradict the diagram.
  • Why it matters: You get answers that seem smart yet are wrong or unfaithful to the picture.

šŸž Anchor: The diagram says AB is a diameter (so angle ACB must be 90°), but the reasoning chain forgets that and treats ACB as 80°. That’s reasoning drift.

Researchers tried to fix this mostly by improving perception alone, such as using better visual encoders or extra perception tasks. That helped the ā€œseeing,ā€ but there was a missing middle step: a way to turn raw visual primitives into the exact conceptual facts that the reasoning needs—and to check the reasoning truly uses them.

šŸž Hook: Imagine your teacher asks you to highlight the key points before solving a problem. You don’t jump to the solution; you first convert what you see into the important ideas you will use.

🄬 The Concept (Internalization): Knowledge internalization transforms low-level visual facts (points, lines, circles) into structured, reasoning-ready knowledge (e.g., ā€œAB is a diameter, therefore angle ACB is 90Ā°ā€).

  • How it works: (1) Read the extracted primitives, (2) bind labels consistently, (3) infer canonical geometric relations that are justified by the diagram, (4) write them as a stable foundation for reasoning.
  • Why it matters: Without internalization, the reasoning may ignore or misuse what was seen.

šŸž Anchor: From ā€œA, B lie on a circle and AB is a diameter,ā€ internalization writes ā€œāˆ ADB = 90° (Thales’ theorem),ā€ which the solver then uses safely.

The gap this paper fills is exactly that missing middle: enforcing that the model not only sees well but also turns what it saw into the right conceptual building blocks—and then uses those blocks faithfully in its reasoning. The stakes are real: better tutors, fairer automated grading, safer engineering diagrams, and less ā€œconfident but wrongā€ math from AI.

02 Core Idea

The ā€œAha!ā€ moment in one sentence: Add a middle ā€œknowledge internalizationā€ stage—plus rewards that check it—so perception and reasoning stay locked together.

šŸž Hook: Think of a three-step assembly line: (1) parts arrive, (2) they’re assembled into a sturdy frame, (3) finishing touches are added. Skip step 2, and the whole bike wobbles.

🄬 The Concept (Cognitive-Inspired Framework): COGFLOW copies how humans solve visual math: perception ⇒ internalization ⇒ reasoning, training each stage and their connections.

  • How it works: (1) Train perception with synergistic visual rewards (local precision and global layout), (2) train internalization with a reward that detects unfaithful conversions, (3) guide reasoning with a visual gate so it only proceeds from good perceptions, plus outcome rewards for correct answers.
  • Why it matters: It stops the model from ā€œsounding smartā€ while ignoring the picture.

šŸž Anchor: For a circle-geometry question, the model first extracts clean primitives, converts them into theorems that apply, and only then computes the target angle.

Explain it three ways:

  1. Cooking analogy: Perception is grocery shopping (getting the right ingredients). Internalization is prep (washing, chopping, measuring). Reasoning is cooking (following the recipe). If prep is wrong, the dish fails.
  2. School notes: Perception is reading the textbook picture. Internalization is writing clean notes in your own words. Reasoning is solving problems using those notes.
  3. Sports playbook: Perception is spotting players and positions. Internalization is mapping positions to a play. Reasoning is running the play to score.

Before vs after:

  • Before: Models either mixed seeing and thinking (chaotic) or separated them without checks (drift). Results were inconsistent.
  • After: The middle internalization stage locks what you saw to what you conclude. The chain-of-thought becomes both accurate and faithful to the diagram.

Why it works (intuition, not equations):

  • Errors compound when early steps are weak. By rewarding precise perception (local details) and consistent layout (global structure), the base gets stronger.
  • A trained internalization reward serves as a referee that says, ā€œAre you really using what you saw?ā€
  • A visual gate filters out bad visual parses before reasoning, so the solver doesn’t build on sand.

Building blocks, each with a mini Sandwich:

šŸž Hook: You know how a teacher checks both neat handwriting (details) and overall organization (layout) in your notes.

🄬 The Concept (Synergistic Visual Rewards, SynVRs): Two rewards jointly grade perception: one for precise geometry (VPR) and one for global look-and-feel (VSR).

  • How it works: (1) VPR compares predicted points/lines/circles to ground truth in a parameter space, (2) VSR compares rendered images with a visual encoder to check global layout.
  • Why it matters: Together they prevent both tiny mistakes (like a mis-placed point) and big layout mismatches.

šŸž Anchor: If a circle’s center is slightly off, VPR catches it; if the whole diagram is misarranged, VSR catches that.

šŸž Hook: Imagine getting a gold star only when your explanation truly uses the facts from the picture.

🄬 The Concept (Knowledge Internalization Reward, IntlzR): A reward model trains the AI to convert visual primitives into faithful, reasoning-ready knowledge.

  • How it works: (1) Build positive examples that internalize correctly, (2) create five common error types as negatives (omit/misbind, invent facts, violate constraints, misuse theorems, inconsistent references), (3) train a preference model to score faithful internalization higher.
  • Why it matters: It reduces reasoning drift by punishing explanations that ignore what was seen.

šŸž Anchor: If the diagram never shows parallel lines, but the reasoning suddenly uses parallel-line theorems, IntlzR lowers the score.

šŸž Hook: Picture a bouncer at the door who only lets in solid perceptions before the party (reasoning) starts.

🄬 The Concept (Visual-Gated Policy Optimization, VGPO): A training-and-inference method that filters weak perceptions before generating the reasoning chain and optimizes with grouped rewards.

  • How it works: (1) Generate several perception candidates, (2) score them with SynVRs, (3) accept only high-quality ones (or regenerate), (4) then reason; optimize with a group-relative objective using perception, internalization, and inference rewards.
  • Why it matters: Reasoning becomes steadier because it starts from reliable visual ground.

šŸž Anchor: If the first attempt places point C poorly, the gate asks for another try before allowing the solver to proceed.

šŸž Hook: Think of a carefully labeled practice workbook that separates ā€œwhat I seeā€ from ā€œhow I think.ā€

🄬 The Concept (MATHCOG Dataset): A large, aligned dataset with over 120K items that separates watching (perception) from thinking (reasoning), plus curated positives and negatives for internalization training.

  • How it works: (1) Extract/normalize primitives, (2) write reasoning that explicitly uses those primitives, (3) build contrastive internalization examples.
  • Why it matters: Good training data is the ground truth the rewards learn from.

šŸž Anchor: A sample shows the diagram, a structured list of detected shapes, a cleaned internalization, then a step-by-step solution ending with the answer.

03 Methodology

At a high level: Input (diagram + text) → Perception (extract clean shapes) → Internalization (turn shapes into reasoning-ready facts) → Reasoning (solve while staying grounded) → Output (answer + faithful chain-of-thought).

Step 1: Perception with SynVRs

  • What happens: The model reads the diagram and outputs a structured list of primitives: points (with coordinates), lines (through points), and circles (center + radius), normalized to a shared coordinate range.
  • Why this step exists: Without precise, consistent seeing, the rest collapses. Diagrams can be busy; the model needs a crisp blueprint.
  • Example: For a circle geometry puzzle, the model lists: Point A, B, C on the circle; center O; radius r; chord/diameter lines.
  • How SynVRs help:
      • Visual Parameterized Reward (VPR): Measures local geometric precision by matching predicted primitives with ground truth using a cost-minimizing matching; rewards small parameter errors.
      • Visual Semantic Reward (VSR): Renders both predicted and ground-truth diagrams, encodes them, and rewards high global similarity (layout/style consistency).
  • What breaks without it: Missing or slightly shifted points cause theorems to misapply; global layout mismatches can lead to wrong angle/length logic.

šŸž Hook: Like grading both spelling and essay structure. 🄬 The Concept (VPR): Rewards precise geometry for points/lines/circles.

  • How it works: Compare parameters to ground truth after optimal matching; smaller gaps earn higher score.
  • Why it matters: Stops tiny local errors from snowballing into big reasoning mistakes. šŸž Anchor: If predicted center O is close to the true O, VPR gives a high score.

šŸž Hook: Like checking your poster looks right from across the room. 🄬 The Concept (VSR): Rewards global layout consistency by comparing rendered diagrams in an embedding space.

  • How it works: Render, embed, compare with cosine similarity.
  • Why it matters: Ensures the overall picture matches, not just isolated parts. šŸž Anchor: If the whole triangle sits in the correct place and size, VSR is high.
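
Putting the two together, here is a simplified sketch: VPR as an optimal (Hungarian) matching over predicted vs. ground-truth point coordinates, and VSR as cosine similarity between embeddings of the rendered diagrams. The cost function, the exponential scaling, and the encoder are stand-ins, not the paper's exact choices:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def vpr(pred_pts, gt_pts, scale=10.0):
    """Local precision: match points minimally, reward small parameter error."""
    cost = np.linalg.norm(pred_pts[:, None, :] - gt_pts[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)        # cost-minimizing matching
    return float(np.exp(-scale * cost[rows, cols].mean()))

def vsr(pred_embed, gt_embed):
    """Global layout: cosine similarity of rendered-image embeddings."""
    return float(pred_embed @ gt_embed /
                 (np.linalg.norm(pred_embed) * np.linalg.norm(gt_embed)))

pred = np.array([[0.51, 0.50], [0.11, 0.49], [0.90, 0.50]])
gt   = np.array([[0.50, 0.50], [0.10, 0.50], [0.90, 0.50]])
print(vpr(pred, gt))   # close to 1.0 when predictions are near ground truth
```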

Step 2: Knowledge Internalization with IntlzR

  • What happens: The model converts structured primitives into conceptual, reasoning-ready facts (e.g., ā€œAB is a diameter ⇒ right angle at C,ā€ or ā€œāˆ AED = 20° ⇒ arc AD = 40Ā°ā€).
  • Why this step exists: Theorems work on concepts, not raw pixels. Internalization locks in what is legitimate to use in reasoning.
  • Example: From points on a circle and a diameter, the internalization writes the right-angle fact and other supported relations.
  • IntlzR training:
      • Positive trajectories: Human-verified faithful internalizations.
      • Negative trajectories (five types): omit/misbind, invent facts, violate constraints, misuse theorems, inconsistent references.
      • Train a reward model (preference learning) to score positives higher than negatives.
  • What breaks without it: The model may use the wrong theorems or claim facts not in the diagram—classic reasoning drift.

šŸž Hook: Like a teacher checking your summary really reflects the text. 🄬 The Concept (IntlzR): A reward that favors explanations grounded in the perceived diagram.

  • How it works: Given a correct internalization and several flawed ones, the model learns to score the faithful one higher.
  • Why it matters: Keeps the later reasoning honest. šŸž Anchor: If a step says two lines are parallel but the diagram doesn’t support it, IntlzR penalizes it.
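
A minimal sketch of this preference objective, in the softmax-over-candidates form the ablations later report works best: one faithful internalization is contrasted against several flawed ones (the five error types above), and the reward model is trained so the faithful one scores highest. The scalar scores below are stand-ins for the reward model's outputs:

```python
import torch
import torch.nn.functional as F

def intlzr_loss(pos_score: torch.Tensor, neg_scores: torch.Tensor):
    """Cross-entropy over [positive, negatives]: maximize P(faithful)."""
    logits = torch.cat([pos_score.view(1), neg_scores])  # [1 + num_negatives]
    target = torch.zeros(1, dtype=torch.long)            # index 0 = faithful
    return F.cross_entropy(logits.unsqueeze(0), target)

pos  = torch.tensor(2.1, requires_grad=True)     # faithful internalization
negs = torch.tensor([1.7, 0.4, 1.2, 0.9, 1.5])   # five error-type negatives
print(intlzr_loss(pos, negs))                    # lower is better
```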

Step 3: Reasoning with Visual-Gated Policy Optimization (VGPO)

  • What happens: The model samples several perception candidates, filters them through a visual gate (using SynVRs), picks a good one, and then generates the reasoning chain; training optimizes groups of trajectories with combined rewards.
  • Why this step exists: Reasoning is only as good as its starting point. The visual gate ensures ā€œgood in, good out.ā€
  • Example: If the first try misplaces point D, the gate asks for another perception attempt until a threshold is met (or takes the best so far), then proceeds to reason.
  • Rewards combined:
      • SynVRs: Perceptual fidelity.
      • IntlzR: Faithful internalization.
      • Inference Reward: Correct final answer and well-formed output.
  • What breaks without it: Long reasoning chains become unstable; small perception noise derails the whole solution.

šŸž Hook: A bouncer only lets solid perceptions in before the debate begins. 🄬 The Concept (Visual Gate): Accepts perception candidates that exceed a quality threshold; otherwise asks for regeneration.

  • How it works: Score each perception with SynVRs; stop when one passes or pick the best after a few tries.
  • Why it matters: Prevents long chains from starting on bad footing. šŸž Anchor: The third perception attempt finally aligns with the diagram, so the solver proceeds from that one.
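
In code, the gate might look like the following sketch; the threshold, retry count, and function names are assumptions about a mechanism the paper describes at a higher level:

```python
def visual_gate(sample_perception, score_fn, threshold=0.8, max_tries=3):
    """Accept the first perception above threshold, else keep the best seen."""
    best, best_score = None, float("-inf")
    for _ in range(max_tries):
        candidate = sample_perception()    # one stochastic perception pass
        s = score_fn(candidate)            # e.g., weighted VPR + VSR
        if s >= threshold:
            return candidate               # good enough: reason from this
        if s > best_score:
            best, best_score = candidate, s
    return best                            # fall back to the best attempt

import random
print(visual_gate(lambda: random.random(), lambda c: c))  # toy demo with floats
```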

šŸž Hook: Imagine practicing in teams and comparing results to pick the best approach. 🄬 The Concept (VGPO): A group-based optimization that integrates perception, internalization, and inference rewards.

  • How it works: Sample several trajectories per problem; compute a combined reward; update the policy to favor better, more grounded chains.
  • Why it matters: Stabilizes training and encourages interpretable, faithful chain-of-thought. šŸž Anchor: Among multiple attempts, the one with accurate primitives, faithful internalization, and correct answer gets reinforced most.
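
A simplified sketch of the group-relative signal, assuming a weighted sum of the three rewards and mean/std normalization within each group; the actual objective and coefficients in the paper may differ:

```python
import numpy as np

def combined_reward(traj, w_percep=0.3, w_intlz=0.3, w_infer=0.4):
    # Weights are illustrative assumptions, not the paper's settings.
    return (w_percep * traj["syn_vr"]       # perception fidelity (VPR + VSR)
            + w_intlz * traj["intlzr"]      # faithful internalization
            + w_infer * traj["inference"])  # correct, well-formed answer

def group_advantages(trajectories):
    """Advantage of each trajectory relative to its sampled group."""
    r = np.array([combined_reward(t) for t in trajectories])
    return (r - r.mean()) / (r.std() + 1e-8)

group = [
    {"syn_vr": 0.9, "intlzr": 0.8, "inference": 1.0},  # accurate and grounded
    {"syn_vr": 0.6, "intlzr": 0.3, "inference": 1.0},  # right answer, drifted
    {"syn_vr": 0.5, "intlzr": 0.4, "inference": 0.0},  # wrong answer
]
print(group_advantages(group))  # the grounded trajectory gets the largest weight
```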

Secret sauce:

  • Two-angle perception grading (local VPR + global VSR) so both tiny details and big picture match.
  • A trained internalization referee (IntlzR) that penalizes clever-but-ungrounded steps.
  • A visual gate that keeps weak perceptions from contaminating the reasoning.
  • All stitched together with reinforcement learning so the three stages co-evolve and stay aligned.

04 Experiments & Results

The test: The team evaluated COGFLOW on several well-known benchmarks with many types of visual math problems, including geometry-heavy tasks where diagrams matter most. They measured two things: (1) final answer accuracy and (2) chain-of-thought quality (CoT-E), which checks if the reasoning is sound and consistent.

The competition: COGFLOW was compared with both large closed models (like GPT-4V/4o and Gemini variants) and strong open models specialized in math and vision (like MathFlow, SVE-Math, VLM-R1). This is like racing a tuned sports car against both luxury sedans and track-focused racers.

The scoreboard (with context):

  • FlowVerse: COGFLOW reaches 66.0% accuracy. Think of it as getting a solid A when many comparable open models are getting B’s or C’s.
  • MathVerse (testmini): 53.9% accuracy. This is a notable lift over open baselines, especially on diagram-heavy subsets.
  • MathVista: 76.8% accuracy overall and standout scores in geometry-centric categories. That’s like being near the top of the class on the hardest geometry sections.
  • More sets: COGFLOW also posts strong or competitive results on WeMath, LogicVista, and DynaMath, showing it’s not just a one-trick pony.

Why these numbers matter:

  • Visual subsets (Vision-Dense, Vision-Primary, Vision-Centric) show especially big gains. That means the model truly sees diagrams better and uses them properly.
  • CoT-E is high, meaning the steps are not only right but also make sense and stay faithful to what’s in the picture—reducing ā€œsmart-soundingā€ but wrong arguments.

Ablations (what happens if we remove parts):

  • Remove SynVRs (the two visual rewards) and perception gets weaker; accuracy drops. Add back VPR or VSR alone and it improves; use both and it’s best. This shows local and global checks are complementary.
  • Remove IntlzR (the internalization referee) and the model drifts more. With IntlzR, it catches five common error types (omit/misbind, invent facts, violate constraints, misuse theorems, inconsistent references) and the reasoning stays anchored.
  • Replace VGPO (the visual-gated optimization) with simpler RL and long-chain stability suffers. With VGPO, perception quality rises, chains are steadier, and final accuracy climbs the most of any single module.

Surprising findings:

  • Even strong general models like GPT-4o can reason well but still miss fine diagram details; COGFLOW’s perception-internalization linkage closes that gap.
  • The visual gate helps even at inference time: picking the best perception among a few samples boosts accuracy with only a small time cost, nearly as good as best-of-3 full runs but more efficient.
  • Softmax-style preference training for IntlzR (contrasting one correct internalization against several cleverly wrong ones) yields better robustness than a simpler pairwise setup.

Takeaway: The trio—better seeing (SynVRs), honest understanding (IntlzR), and safe starts for reasoning (VGPO)—translates into higher scores where it counts: tough, diagram-heavy problems. And the chains the model writes are more trustworthy because they trace back to what the picture actually shows.

05 Discussion & Limitations

Limitations:

  • Compute: Training with multiple rewards, a visual gate, and contrastive internalization examples needs significant GPU time. This could slow adoption in small labs or classrooms without cloud support.
  • Domain focus: The system is tuned for visual math (especially geometry). While the ideas are general, performance beyond math diagrams (like natural photos) needs more work.
  • Data crafting: Building and verifying aligned perception → internalization → reasoning examples (and realistic negatives) takes effort.
  • Residual errors: Although drift is reduced, it can still appear when diagrams are extremely cluttered or ambiguous.

Required resources:

  • A capable multimodal backbone (e.g., 7B-parameter class), curated data (MATHCOG-like), and training infrastructure for supervised and reinforcement learning stages.
  • A visual encoder (for VSR), tools for rendering and matching primitives (for VPR), and a small reward model for IntlzR.

When not to use:

  • If the task barely depends on the visual diagram (mostly text), simpler text-first solvers might be faster and good enough.
  • For noisy real-world photos without structured primitives (e.g., crowded street scenes), you may need a different perception schema first.
  • When compute is tightly limited and a quick approximate answer is acceptable.

Open questions:

  • Generalization: How well does the internalization idea transfer to everyday images (e.g., science labs, mechanical drawings) with object-level primitives instead of points/lines/circles?
  • Efficiency: Can we reduce sampling (k) and still keep perception strong enough? Can smaller reward models work just as well?
  • Verification: Can we add learned verifiers to double-check each reasoning step against the internalized state in real time?
  • Data: Can we automate more of the positive/negative internalization curation while preserving quality?

06 Conclusion & Future Work

Three-sentence summary: COGFLOW solves visual math by following the human-like flow of perception → internalization → reasoning. It trains perception with two visual rewards, enforces honest internalization with a learned reward model, and stabilizes reasoning with a visual-gated optimization strategy. The result is higher accuracy and more trustworthy, visually grounded chains of thought on multiple tough benchmarks.

Main achievement: Showing that a dedicated knowledge internalization stage—plus rewards that check it—bridges the long-missing link between ā€œwhat the AI seesā€ and ā€œhow it reasons,ā€ sharply reducing reasoning drift.

Future directions: Extend the primitive-based approach to natural scenes (e.g., objects and parts as primitives), learn lighter-weight rewards for broader adoption, and add verifiers that continuously cross-check reasoning steps against the internalized state. Scaling curated data (like MATHCOG) across domains will also help.

Why remember this: It’s a blueprint for trustworthy multimodal reasoning where answers don’t just sound right—they stay loyal to the picture. By teaching AI to first see clearly, then internalize faithfully, and only then reason, we move closer to assistants that are both smart and reliable in visually grounded tasks.

Practical Applications

  • AI geometry tutors that point to exact diagram elements and explain theorems step-by-step.
  • Automated grading that validates whether a student’s proof truly uses the given figure.
  • Diagram-to-proof assistants for math contests or classroom practice.
  • Quality checks for CAD-like drawings where angles, lengths, and constraints must be consistent.
  • Science lab helpers that read graphs/plots and reason about trends with faithful references to the visuals.
  • Accessible learning tools that convert complex diagrams into structured explanations for students.
  • Content creation that turns textbook figures into solvable, annotated exercises with verified solutions.
  • Pre-checkers for math problem authors to detect ambiguous or misleading diagrams before publishing.
  • Diagnostic tools that highlight where reasoning drift occurs in a student’s or model’s solution.
  • Foundation for extending to object-based internalization in everyday images (e.g., physics-in-the-wild).
#visual mathematical reasoning Ā· #multimodal large language models Ā· #perception-reasoning alignment Ā· #knowledge internalization Ā· #reasoning drift Ā· #visual parameterized reward Ā· #visual semantic reward Ā· #visual-gated policy optimization Ā· #reinforcement learning Ā· #geometry primitives Ā· #chain-of-thought Ā· #MATHCOG dataset Ā· #diagram understanding Ā· #preference modeling Ā· #grounded reasoning