MMFineReason: Closing the Multimodal Reasoning Gap via Open Data-Centric Methods
Key Summary
- MMFineReason is a huge, open dataset (1.8 million examples, 5.1 billion solution tokens) that teaches AIs to think step by step about pictures and text together.
- It was built with a three-step recipe: collect and clean many sources, have a top teacher model write detailed reasoning, then keep only the best and most challenging examples.
- The dataset covers math, science diagrams, puzzles, games, and charts, each with long, visually grounded chains-of-thought.
- Training small open models (2B/4B/8B) on this data beats or matches much larger models; the 4B model even surpasses Qwen3-VL-8B-Thinking, and the 8B model outperforms Qwen3-VL-30B-A3B-Thinking.
- A 'less is more' finding shows that only 7% of carefully chosen hard data can perform almost as well as training on the entire dataset.
- Reasoning-focused data also improves general skills like real-world image questions and chart understanding, not just puzzles and math.
- Simple training with supervised fine-tuning gives most of the gains; a short reinforcement learning stage helps generalization further.
- Very high image resolution gives little extra help for diagram reasoning, and adding extra captions is mostly redundant once the thinking steps are strong.
- All data and models are open, reproducible, and do not rely on closed APIs, enabling fair, community-driven progress in multimodal reasoning.
Why This Research Matters
This work makes powerful visual reasoning available to everyone by releasing both data and models openly. It shows that thoughtful data design (clear steps, correct answers, and the right difficulty) can let small models perform like big ones, saving energy and cost. That means better homework helpers, study aids, and tools for reading charts, forms, and diagrams. It also lets researchers fairly compare and improve systems without relying on private, closed datasets. The approach encourages responsible AI that explains its steps, which helps users trust and verify answers. In short, it helps build smarter, more transparent, and more accessible AI for real-world problem solving.
Detailed Explanation
01 Background & Problem Definition
You know how solving a tricky riddle often needs both the picture clues and the words that explain it? Before this paper, computers were getting better at looking and reading, but truly solving riddle-like visual problems, like math diagrams, tricky charts, or logic puzzles, was still hard for open models. Closed, proprietary systems did better, mainly because they trained on huge, private, carefully made datasets that the public couldn't use.
Hook: Imagine trying to learn math with only easy worksheets and no worked examples. You'd know the basics but stumble on tough problems. The Concept (Multimodal Reasoning): It's understanding images and text together to figure things out. How it works: (1) Look at the picture and read the words, (2) connect picture parts to the question, (3) reason step by step, (4) decide the answer. Why it matters: Without it, models can see or read but can't truly solve. Anchor: A model answering a geometry question must link labels on a triangle to the words in the prompt to compute an angle.
The problem researchers faced was data. Open-source vision-language models (VLMs) had lots of general Q&A about natural photos, but not enough hard, well-explained reasoning problems, especially STEM diagrams and logic puzzles. Even when such data existed, the explanations were short, inconsistent, or missing the step-by-step reasoning (Chain-of-Thought, or CoT) that helps models learn how to think, not just what to answer.
Hook: You know how a cookbook with clear step-by-step recipes helps you cook better than a list of dish names? The Concept (Vision Language Models, VLMs): These are models that understand pictures and words at the same time. How it works: (1) Turn images and text into internal signals, (2) connect them, (3) generate answers. Why it matters: Without VLMs, you can't teach a computer to reason about diagrams with text. Anchor: A VLM reads a chart label (text) and sees the bars (image) to answer "Which city has the highest temperature?"
People tried to fix the gap by scaling up data or auto-generating Q&A from strong models. But two things often failed: (1) the data mix was imbalanced, with plenty of simple visual questions but very few hard STEM/puzzle questions; (2) explanations were short or inconsistent, so models didn't learn reliable thought processes.
Hook: Think of building a LEGO set. Without numbered steps, pieces, and checkpoints, your castle might collapse. The Concept (Chain-of-Thought, CoT): These are step-by-step reasoning explanations. How it works: (1) extract facts from the image and text, (2) set up the problem, (3) compute/derive, (4) check the result. Why it matters: Without CoT, models memorize answers instead of learning how to solve. Anchor: A CoT shows how to compute triangle angles from given sides rather than just stating the final angle.
The missing link was an open, large, and consistently explained multimodal reasoning dataset. That's where MMFineReason comes in: a carefully built collection of 1.8M examples with 5.1B solution tokens, each with detailed, visually grounded reasoning traces distilled from a strong open teacher (Qwen3-VL-235B-A22B-Thinking). It uses a three-stage pipeline: (1) collect and standardize many datasets, (2) generate long CoT with a teacher, (3) select only high-quality, correctly reasoned, and appropriately difficult samples.
Hook: You know how a great coach picks the right drills and tosses out bad habits? The Concept (Data-Centric Methods): Improve the data itself: clean it, balance it, and add great explanations. How it works: (1) curate and standardize, (2) add strong CoT, (3) filter by quality and difficulty. Why it matters: Without good data, even big models learn the wrong lessons. Anchor: A smaller team with well-designed practice often beats a bigger team that practices the wrong plays.
The stakes are real. Better multimodal reasoning helps with homework helpers that actually show steps, digital tutors that explain diagrams, and tools that understand real-world photos plus text (like forms, maps, lab setups). MMFineReason shows that with the right data, smaller open models can match or beat much larger ones, closing the gap with private systems and making progress available to everyone.
02 Core Idea
The "aha!" is simple: if you give models a huge supply of high-quality, step-by-step, visually grounded reasoning examples, and you carefully keep only the challenging, correct ones, then even small open models learn to think much better than before.
Three analogies:
- Recipe Book Analogy: A thick cookbook with clear steps beats a thin one with vague hints. MMFineReason is the thick, clear recipe book for multimodal reasoning.
- Sports Practice Analogy: Quality drills + right difficulty levels beat just longer practice. Difficulty-aware filtering keeps the training at the right challenge.
- Schoolwork Analogy: Worked solutions with correct steps teach you how to solve new problems. Long, consistent CoT teaches the model how to reason, not just copy answers.
Before vs. After:
- Before: Open models saw many easy photos and short answers, with uneven or missing explanations. They struggled on STEM diagrams, logic puzzles, and multi-step visual math.
- After: With MMFineReason's clean, long CoT on hard STEM/puzzle data, smaller models outperform bigger ones, and gains even spill over to general tasks like chart reading and real-world Q&A.
Why it works (intuition without math):
- Step-by-step supervision shrinks the search space for the model. Instead of guessing the whole solution at once, the model learns a reliable path of micro-steps from facts to answer.
- Difficulty-aware selection prevents overfitting to easy questions and forces the model to practice what it doesnāt already know.
- Consistent formats (<think>…</think> then <answer>…</answer>) make learning stable: the model recognizes where the reasoning ends and the final answer begins.
- Diverse but reasoning-heavy coverage (math, science diagrams, puzzles) teaches general patterns of logic that transfer beyond the exact training domains.
Building Blocks of the idea:
- Collection and Standardization: unify many sources into one clean shape, including fields for original question/answer, image, standardized question/answer, captions, and teacher-produced reasoning.
- Reasoning Distillation: a strong open teacher (Qwen3-VL-235B-A22B-Thinking) writes long, visually grounded explanations following a strict four-phase structure and a fixed output template.
- Quality Verification: remove malformed, too-short, repetitive, or incorrect CoT; keep those consistent with ground truth.
- Difficulty-Aware Filtering: measure a smaller model's pass rate over multiple tries; keep examples it still fails; these are the best teachers.
- Training Recipe: do supervised fine-tuning (SFT) on this data to get most gains, then a short reinforcement learning (RL) stage to polish generalization.
Hook: You know how gardeners fix the soil before planting seeds? The Concept (Reasoning Quality Filtering): Keep only explanations that are correct, long enough, and not copy-pasted. How it works: (1) check format and length, (2) de-duplicate repeated text, (3) verify the final answer matches the ground truth. Why it matters: Without it, the model memorizes mistakes. Anchor: Throwing out broken instructions keeps the recipe book trustworthy.
Hook: Imagine your teacher gives you only the problems you still get wrong: tough but helpful! The Concept (Difficulty-Aware Filtering): Choose training examples based on how often a smaller model still fails them. How it works: (1) try four times, (2) if it never gets it right, keep it, (3) else, drop it. Why it matters: Without this, training wastes time on problems the model already knows. Anchor: Practicing only the missed questions makes test-day gains faster and bigger.
03 Methodology
At a high level: Input (many image+text problems) → Collect & Standardize → Distill Long CoT → Filter for Quality & Difficulty → Train (SFT → RL) → Output (strong multimodal reasoners at 2B/4B/8B).
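The flow above can be summarized in a few lines of Python. This is a minimal, self-contained sketch with toy placeholder stages; none of the function names or bodies come from the authors' code.

```python
# Bird's-eye sketch of the pipeline; every function body is a toy placeholder,
# not the authors' implementation.

def standardize(s):   # Step A: clean text, unify schema
    return {**s, "question": s["raw_question"].strip()}

def distill(s):       # Step B: teacher writes long CoT (stubbed here)
    return {**s, "reasoning": "<think>...</think><answer>?</answer>"}

def good_quality(s):  # Step C1: format/length/dedup/correctness checks
    return len(s["reasoning"]) > 0

def hard_enough(s):   # Step C2: keep samples a smaller model still fails
    return s.get("pass_rate", 1.0) == 0.0

raw = [{"raw_question": " What is angle C? ", "answer": "60", "pass_rate": 0.0}]
data = [distill(standardize(s)) for s in raw]
data = [s for s in data if good_quality(s) and hard_enough(s)]
print(len(data), "samples ready for SFT, followed by a short RL stage")  # Step D
```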
Step A: Data Collection and Standardization
- What happens: The team gathers many open datasets focused on reasoning (math, science diagrams, puzzles/games, some general/OCR). They clean text (translate to English, remove noise), fix instructions to encourage reasoning, drop tasks outside visual reasoning, and make all samples follow one schema (sketched after this list): metadata, raw fields, standardized question/answer/image, dense captions, distilled reasoning, and quality metrics.
- Why it exists: Without consistent formatting and clean text, later steps (reasoning generation and checking) would break or be unreliable.
- Example: A geometry problem in another language is translated, stripped of web links, given a reasoning-friendly instruction, and stored with image + standardized question/answer.
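To make the unified schema concrete, here is a hedged sketch of what one standardized record might look like after Step A. The field names are illustrative guesses based on the description above, not the released format.

```python
# Hypothetical standardized record after Step A (field names are assumptions).
sample = {
    "metadata": {"source": "some_geometry_set", "license": "open"},  # provenance
    "raw_question": "...original, possibly non-English, text...",
    "raw_answer": "...original answer...",
    "image": "images/geo_00042.png",                    # path to the figure
    "question": "In triangle ABC, AB = AC and angle A = 40 degrees. Find angle C.",
    "answer": "70",                                     # standardized ground truth
    "caption": "",      # dense caption, added in Step B
    "reasoning": "",    # teacher CoT, added in Step B
    "quality": {},      # filter metrics, added in Step C
}
print(sample["question"])
```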
Hook: You know how a master chef writes detailed step-by-step notes for each dish? The Concept (Reasoning Distillation): A strong teacher model writes long, clear, visual reasoning for each problem. How it works: (1) Extract all visual/text facts, (2) Set up the plan, (3) Execute the math/logic, (4) Validate the answer. It outputs <think>…</think> then <answer>…</answer>. Why it matters: Without rich, reliable steps, students (the smaller models) learn shaky habits. Anchor: For a triangle, it lists given sides/angles, picks theorems, computes carefully, and checks if the angle sum is 180° before giving the final angle.
Step B: Reasoning Distillation and Captions
- What happens: Qwen3-VL-235B-A22B-Thinking produces long CoT with a strict four-phase structure wrapped in <think> tags, then the final answer in <answer> (a format sketch follows this list). Qwen3-VL-235B-A22B-Instruct also produces dense captions. Every sample gets both reasoning and caption.
- Why it exists: The structure forces precise, visual grounding and makes automatic checking easy. Dense captions add context where needed.
- Example: A logic puzzle image gets a caption describing shapes/positions; the reasoning then hypothesizes rules, tests candidates, and validates the choice.
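Below is a small sketch of what a distillation target in this format could look like. The phase labels paraphrase the four-phase structure described above; the exact prompt wording and phase names are assumptions.

```python
# Illustrative distillation target: four reasoning phases inside <think>,
# then the final answer inside <answer>. Phase names are paraphrased guesses.
phases = [
    "Phase 1 - Extract facts: the image shows triangle ABC with AB = AC and angle A = 40.",
    "Phase 2 - Set up: since AB = AC, the base angles at B and C are equal.",
    "Phase 3 - Execute: angle C = (180 - 40) / 2 = 70 degrees.",
    "Phase 4 - Validate: 40 + 70 + 70 = 180, so the angle sum checks out.",
]
target = "<think>\n" + "\n".join(phases) + "\n</think>\n<answer>70</answer>"
print(target)
```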
Step C: Reasoning Quality Filtering
- What happens: Three filters ensure quality: (1) Template & Length: bad/malformed outputs or too-short traces (<100 words) are removed; (2) N-gram De-duplication: repeated boilerplate CoT is removed/regenerated; (3) Correctness Verification: the final <answer> is compared with the ground truth, and mismatches are removed. (A code sketch of these checks follows this list.)
- Why it exists: To avoid learning from mistakes or fluff.
- Example: If a solution says the area is 25 but the correct answer is 16, the sample is dropped.
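A minimal sketch of the three filters, assuming simple string answers. The 100-word minimum comes from the text above; the 8-gram window and repetition threshold are illustrative assumptions, not the paper's values.

```python
import re
from collections import Counter

def passes_quality_filters(output: str, ground_truth: str) -> bool:
    # 1) Template & length: require well-formed tags and >= 100 words of reasoning.
    think = re.search(r"<think>(.*?)</think>", output, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", output, re.DOTALL)
    if think is None or answer is None:
        return False
    words = think.group(1).split()
    if len(words) < 100:
        return False
    # 2) N-gram de-duplication: reject traces dominated by one repeated 8-gram
    #    (window size and threshold are assumptions for illustration).
    counts = Counter(tuple(words[i:i + 8]) for i in range(len(words) - 7))
    if counts.most_common(1)[0][1] > 5:
        return False
    # 3) Correctness: the final answer must match the ground truth.
    return answer.group(1).strip().lower() == ground_truth.strip().lower()
```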
Hook: You know how coaches check which drills you still fail to plan your next practice? The Concept (Pass Rate): It measures how often a smaller model answers a question correctly across multiple tries. How it works: (1) Run the 4B model four times, (2) count successes, (3) the pass rate is the fraction correct. Why it matters: Without it, we can't tell which questions still teach the model something new. Anchor: If it gets 0/4 right, that problem is gold for training; if 3/4 right, it's probably too easy.
Hook: Imagine your teacher trims your homework to only the most helpful 7%: less time, same learning. The Concept (Difficulty-Aware Filtering): Keep only challenging samples the 4B model never got right (pass rate = 0) to form a 123K "hard set"; keep broader "hard-ish" ones (pass rate < 1) for a 586K set. Why it matters: Without this, training wastes compute on easy or redundant data, slowing progress. Anchor: The 123K subset reaches almost the same performance as the full 1.8M set.
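A hedged sketch of how this selection could be computed, assuming each sample already carries four rollout answers from the smaller reference model (the `rollouts` field is hypothetical):

```python
# Difficulty-aware selection from pass rates (illustrative sketch).
def pass_rate(rollouts: list[str], ground_truth: str) -> float:
    hits = sum(a.strip() == ground_truth.strip() for a in rollouts)
    return hits / len(rollouts)

def select_subsets(samples: list[dict]) -> tuple[list[dict], list[dict]]:
    hard, broad = [], []
    for s in samples:
        rate = pass_rate(s["rollouts"], s["answer"])  # e.g. 4 tries per sample
        if rate == 0.0:
            hard.append(s)    # never solved: the 123K-style "hard set"
        if rate < 1.0:
            broad.append(s)   # failed at least once: the broader 586K-style set
    return hard, broad
```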
Step D: Training Recipe (SFT then RL)
- What happens: Supervised Fine-Tuning (SFT) on MMFineReason does most of the lifting. Then a short Reinforcement Learning (RL) stage (GSPO) adds polish, improving generalization to charts and real-world tasks.
- Why it exists: SFT teaches the model to follow the step-by-step patterns; RL sharpens decision-making under different scoring signals.
- Example: After SFT, the model solves most math diagrams well; RL nudges it to better handle chart questions and real-world VQA.
Hook: You know how you first learn from worked examples, then practice under game-like conditions? The Concept (Supervised Fine-Tuning, SFT): The model copies high-quality examples to learn the pattern. How it works: (1) feed image+prompt, (2) train to produce the long CoT and final answer, (3) repeat on many samples. Why it matters: Without SFT, the model lacks the step-by-step habit. Anchor: Like copying a long-division walkthrough until you can do it yourself.
Hook: Think of scrimmage games that reward good plays, not just memorized drills. The Concept (Reinforcement Learning, RL): The model explores answers and gets a reward when correct or well-structured. How it works: (1) generate multiple tries (rollouts), (2) score them, (3) adjust to prefer higher-reward patterns. Why it matters: Without RL, the model may be brittle outside the training style. Anchor: Practicing with a score counter helps you play better on new courts, not just your home gym.
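To make the "score the rollouts" idea concrete, here is a minimal rule-based reward of the kind commonly used in such RL stages. This is an assumption for illustration; the paper's actual GSPO reward design may differ.

```python
import re

def reward(output: str, ground_truth: str) -> float:
    """Toy reward: small bonus for well-formed <think>/<answer> structure,
    full credit only for a correct final answer (illustrative only)."""
    well_formed = bool(re.search(r"<think>.+?</think>\s*<answer>.+?</answer>",
                                 output, re.DOTALL))
    m = re.search(r"<answer>(.*?)</answer>", output, re.DOTALL)
    correct = m is not None and m.group(1).strip() == ground_truth.strip()
    return 0.1 * well_formed + 1.0 * correct

print(reward("<think>2 + 2 = 4</think><answer>4</answer>", "4"))  # -> 1.1
```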
Secret Sauce
- Long, consistent, visually grounded CoT ensures the model learns reasoning, not just final answers.
- Difficulty-aware filtering compresses the training set to its most valuable core, achieving a "less is more" effect where 7% of the data can rival 100%.
- The data mix (math-heavy but diverse in puzzles/science) builds general reasoning habits that transfer to unseen general tasks.
04 Experiments & Results
The Test: The authors evaluated across tough STEM and puzzle sets (MMMU, MathVista, MathVision, MathVerse, DynaMath, LogicVista, VisuLogic, ScienceQA), general vision QA (RealWorldQA, MMBench-EN, MMStar), and document/chart understanding (AI2D, CharXiv reasoning/description). Greedy decoding (temperature 0.0) was used to test reliable reasoning, and a robust LLM-based verifier checked correctness.
The Competition: They compared three MFR models (2B, 4B, 8B) against closed (Gemini-2.5-Flash, GPT5-mini-High), open-weight Qwen3 "Thinking" models (8B, 30B-A3B, 32B), and open-source datasets/models (HoneyBee, MMR1, OMR-7B). Where needed, baselines were fairly re-trained under the same setups.
The Scoreboard (with context):
- Overall Average: MFR-8B scores 75.7%, like getting an A when many peers are at a B range. It beats Qwen3-VL-30B-A3B-Thinking (74.5%) and even a commercial baseline, Gemini-2.5-Flash (75.0%), while approaching Qwen3-VL-32B-Thinking.
- Math and Logic Strength: On DynaMath, MFR-8B hits 83.4% (an A+), surpassing Qwen3-VL-32B-Thinking at 82.0% and 30B-A3B-Thinking at 76.7%. On MathVerse, MFR-8B reaches 81.5%, near the 32B's 82.6%. These are tough multi-step visual math tasks.
- Generalization: Despite training on reasoning-centric data, MFR-8B shines on general tasks too, e.g., RealWorldQA 75.6% and CharXiv-desc 89.9%, matching or surpassing open-source baselines trained with far more general data. That's like practicing math proofs and still improving at reading charts and real photos.
- Parameter Efficiency: MFR-4B scores 73.9%, beating Qwen3-VL-8B-Thinking (72.5%). That's like a junior team beating a senior team thanks to better coaching data.
Surprising Findings:
- Less is More: Using only the hardest 7% (123K) of the data can reach performance close to training on the full 1.8M set. That's a huge savings in compute and time.
- SFT Does Most of the Work: Supervised fine-tuning delivers the biggest jumps, especially on math/logic benchmarks. RL then adds polish for charts and real-world QA.
- Caption Augmentation Is Redundant Once CoT Is Strong: Prepending captions yields little or negative gains on STEM tasks; long CoT already encodes the needed visual details.
- Ultra-High Resolution Gives Diminishing Returns for Diagrams: Going up to 2048×2048 helps real-world photos but not most geometric/chart tasks, where 768×768 often suffices.
Breaking Down the Training Dynamics:
- SFT vs RL: For 8B, SFT pushes MathVision from 53.9% up to 67.6% (a big jump), while RL slightly varies math scores but helps AI2D, RWQA, and CharXiv. This suggests SFT teaches the step-by-step habit; RL helps transfer that habit to broader contexts.
- Dataset Composition Matters: Puzzle/game sets are very hard but don't transfer as strongly as math/science to the average score; broader science (like BMMR, spanning 300+ disciplines) encourages more general reasoning patterns. A tiny, high-density set (WeMath2-SFT, only 814 samples!) triggers notable gains, evidence that some "instructional keys" unlock hidden abilities.
Putting it in Plain Words: With a "great cookbook" of long, correct, and challenging thought-steps, even smaller open models cook up answers better than bigger ones trained on ordinary recipes. They not only ace math and logic but also do well on charts and real-world pictures.
05 Discussion & Limitations
Limitations:
- Domain Imbalance: The dataset is intentionally math-heavy (about 79%), with fewer puzzle/game and general/OCR samples. Some areas, like certain real-world fine-grained tasks or multi-image reasoning, are limited.
- Teacher Bias: Distilled CoT inherits the teacher model's style and potential blind spots; errors that slip through verification can reinforce specific reasoning patterns.
- Filtering Trade-offs: Strict correctness and difficulty filters may discard creative but valid solution paths, reducing diversity of reasoning strategies.
- RL Variance: Short RL training improves generalization but can cause small dips on some math sets, hinting at a need for more tailored RL data or rewards.
Required Resources:
- Compute: SFT with long sequences (up to 32K tokens) and large images is memory-intensive; RL with multiple rollouts increases cost.
- Data Handling: Managing 1.8M examples with long CoT and captions requires robust storage and fast I/O.
When NOT to Use:
- If your main task is simple image labeling or basic VQA, this heavy reasoning data may be overkill and slower than specialized lightweight datasets.
- If you need multi-image temporal/video reasoning, MMFineReason isn't focused on that scenario.
- If your model cannot handle long context windows or large images, you won't benefit from the long CoT supervision.
Open Questions:
- Mixing Ratios: What's the best balance between math, science, puzzles, and general data for different downstream goals?
- Smarter Difficulty: Beyond pass rate, can we design richer difficulty signals (e.g., step-level uncertainty) to pick even better training subsets?
- RL Rewards: Which reward functions and data selections best amplify reasoning without hurting math robustness?
- Multi-Image/Video: How to extend the same data-centric principles to temporal reasoning and multi-image contexts?
- Teacher Diversity: Would ensembles of teachers reduce bias and strengthen the variety of reasoning styles learned?
06 Conclusion & Future Work
In three sentences: MMFineReason proves that with the right data (long, correct, visually grounded chains-of-thought, plus smart filtering) small open models can learn to reason about images and text far better than before. Its 1.8M-example pipeline (collect → distill → filter) lifts 2B/4B/8B models to state-of-the-art results for their size, sometimes beating models many times larger. A "less is more" result shows that a carefully chosen 7% can rival the full set, making high-quality reasoning both powerful and efficient.
Main Achievement: Demonstrating a fully open, reproducible, data-centric pipeline that closes much of the multimodal reasoning gapāachieving superior parameter efficiency and broad generalizationāby focusing on reasoning quality and difficulty-aware selection rather than brute-force scaling.
Future Directions: Explore better mixes of domains, richer difficulty signals beyond pass rate, improved RL rewards for stable math gains, and extensions to multi-image/video reasoning. Investigate combining multiple teacher styles to diversify reasoning paths and reduce bias.
Why Remember This: It shows that the secret to smarter, more helpful multimodal AIs isn't just bigger models; it's better, clearer, and harder training examples with trustworthy steps. That insight empowers the open community to build strong reasoning systems without private data, making advanced AI learning more fair, efficient, and widely accessible.
Practical Applications
- Build step-by-step math tutors that explain geometry and algebra from textbook diagrams.
- Create science diagram assistants that extract facts and reason about processes in AI2D-like visuals.
- Develop chart-reading tools that can both describe and reason about plots and tables in reports.
- Design puzzle and logic game solvers that show intermediate reasoning for learning and debugging.
- Improve document understanding systems that connect diagrams, captions, and questions in one flow.
- Enhance classroom tools that grade and give feedback on visual math and science questions with worked steps.
- Construct domain-specific trainers (e.g., physics labs) that reason about experimental setups from images.
- Use the difficulty-aware subset (123K) to rapidly fine-tune smaller models for edge devices with limited compute.
- Adopt the template (<think>/<answer>) in your pipeline to simplify automatic checking and evaluation (see the parsing sketch after this list).
- Run ablations on resolution and captions to optimize inference speed for your exact application domain.
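A minimal sketch of parsing the <think>/<answer> template for automatic checking. This is a generic regex-based reader, not the authors' evaluation code (which, per the paper, also uses an LLM-based verifier):

```python
import re

def parse_response(text: str) -> tuple[str | None, str | None]:
    """Split a <think>/<answer>-formatted response into (reasoning, answer)."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return (think.group(1).strip() if think else None,
            answer.group(1).strip() if answer else None)

reasoning, final = parse_response("<think>2 + 2 = 4</think><answer>4</answer>")
print(final)  # -> 4
```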