DeepVision-103K: A Visually Diverse, Broad-Coverage, and Verifiable Mathematical Dataset for Multimodal Reasoning

Intermediate
Haoxiang Sun, Lizhen Xu, Bing Zhao et al. · 2/18/2026
arXiv

Key Summary

  • DeepVision-103K is a new 103,000-example picture-and-text math dataset designed to help AI think better using rewards that can be checked automatically.
  • It covers many K-12 math topics and a huge variety of visuals (like geometry drawings, charts, real photos, and function plots), so models learn to see and reason at the same time.
  • The team built a three-step cleaning pipeline: remove unsuitable questions, pick just-right difficulty using pass rates, and verify that the question, image, and answer all match.
  • Training with DeepVision-103K improves models on multiple visual math benchmarks, beating both official 'thinking' versions and strong closed-source systems in several cases.
  • Adding visual-logic puzzles (like mazes and games) boosts spatial and pattern skills that also help general multimodal tasks, not just math.
  • Careful correctness checks are necessary: skipping them made models clearly worse despite more data.
  • Models trained on DeepVision-103K show better visual perception, better self-correction by re-checking images, and stronger step-by-step math reasoning.
  • The dataset scales beyond hand-written or recombined sets by automatically curating high-diversity, verifiable questions from real educational sources.
  • Results suggest that diversity (many visual elements), breadth (many topics), and verifiability (unique answers) are a winning trio for multimodal reasoning.
  • This work offers a practical recipe and open resource that others can reuse to advance vision-and-language reasoning.

Why This Research Matters

Multimodal tools are moving into classrooms, workplaces, and homes, where they must correctly read diagrams, charts, and real-world images—and think about them. DeepVision-103K shows how to build and use a large, verifiable, and visually diverse practice set so models learn reliable, transferable reasoning skills. This leads to better tutoring systems that truly understand student drawings, analytics assistants that interpret dashboards accurately, and productivity tools that won’t misread a figure. The approach also provides a blueprint for other domains: start with diverse data, curate for just-right difficulty, and insist on verifiable answers. In short, it helps make AI not just fluent, but careful and trustworthy when pictures and words must be combined. That’s a big step toward safer, more useful AI in everyday life.

Detailed Explanation


01Background & Problem Definition

🍞 Top Bread (Hook): Imagine you’re doing homework with a worksheet full of pictures and questions. Sometimes the picture is tricky, sometimes the question is, and sometimes the answer can be checked right away. That last part—being able to check—is what helps you practice smarter.

🥬 Filling (The Actual Concept): What if we train computers the same way—on picture-and-text homework where the right answer can always be verified? Before this paper, AI could read or look at images very well, but doing careful picture-based math reasoning was still hard, especially when the training data wasn’t diverse or reliably checkable.

  • What it is: DeepVision-103K is a large, carefully cleaned set of picture-and-text math problems with one correct, checkable answer.
  • How it works (at a high level):
    1. Gather many real K-12 problems with images from open sources.
    2. Remove problems that can’t be automatically checked.
    3. Keep questions in the “just-right” difficulty band using pass rates.
    4. Verify each image-question-answer trio is correct and consistent.
    5. Train vision-language models with a reward that says “correct = +1, wrong = 0.”
  • Why it matters: Without diverse, verifiable data, models either memorize or guess, and they fail when pictures change style or when questions require careful step-by-step thinking.

🍞 Bottom Bread (Anchor): It’s like building a giant practice book with answer keys for every problem, using photos, charts, and diagrams—so the AI learns to see, think, and check its work.

🍞 Top Bread (Hook): You know how a coach gives you points when you do a move exactly right? The points help you practice the right way.

🥬 Filling (Reinforcement Learning with Verifiable Rewards—RLVR):

  • What it is: RLVR lets an AI learn by trying, then getting a clear reward based on whether its final answer is verifiably correct.
  • How it works:
    1. The AI reads the problem and looks at the image.
    2. It writes out steps and a final answer.
    3. A checker compares the final answer to the ground truth.
    4. The AI gets a reward (+1 or 0) and updates how it thinks.
  • Why it matters: Without a trustworthy check, the AI can be rewarded at the wrong times—like a dog getting treats at random—and learn the wrong habits.

🍞 Bottom Bread (Anchor): Just like a quiz with an answer key, RLVR gives the AI a reliable thumbs-up only when it’s truly right.
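
The reward loop above boils down to a strict answer check. Here is a minimal sketch in Python (an assumption for illustration; the paper's actual checker also handles math expressions, such as equivalent fractions, which this toy version does not):

```python
def verifiable_reward(model_answer: str, ground_truth: str) -> int:
    """Give +1 only when the model's final answer matches the answer key."""
    normalize = lambda s: s.strip().lower()  # treat " B " and "b" as the same choice
    return 1 if normalize(model_answer) == normalize(ground_truth) else 0
```

Because the reward is binary and comes from a verified answer key, the model cannot earn points through fluent-but-wrong reasoning.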

🍞 Top Bread (Hook): Picture a student who can read stories and also understand the pictures on the same page.

🥬 Filling (Large Multimodal Models—LMMs):

  • What it is: LMMs are AIs that understand and reason over both images and text together.
  • How it works:
    1. The image and text are turned into features the model can read.
    2. The model connects what it sees (shapes, numbers, labels) with what it reads (conditions, questions).
    3. It reasons step-by-step to reach an answer.
  • Why it matters: Without seeing and reading together, the AI might miss that a label in the image changes the meaning of the text.

🍞 Bottom Bread (Anchor): When a problem shows a triangle with lengths and asks a question in words, an LMM can link the picture’s lengths to the text’s instructions to solve it.

The world before: Researchers had three main ways to build training data, each with problems.

  • Fully synthetic: Great for neat diagrams (like perfect circles or grids) but weak on messy real-world visuals and context.
  • Human-annotated K-12 sets: High-quality but small and expensive to scale.
  • Recombined sets: Mix existing problems but don’t add new kinds of challenges; datasets overlap and don’t broaden the distribution much.

The problem: Models trained on these sets often struggled with generalization—doing well on one style of diagram but failing on new styles, or misreading charts, or not reflecting to fix mistakes. Worse, some questions lacked unique answers, making rewards noisy.

The gap: A large, visually diverse, broad-coverage, and strictly verifiable dataset was missing—especially one that blends math with visual-logic puzzles (like mazes and Tetris) to strengthen spatial and pattern skills.

🍞 Top Bread (Hook): Think of a playground with many different stations—climbing, balance, puzzles—so you build all-around skills.

🥬 Filling (Multimodal Reasoning):

  • What it is: Using images and text together to think through problems, including math and visual logic.
  • How it works:
    1. Perceive: Extract the right visual facts (lengths, angles, bars on a chart).
    2. Align: Match those facts with the question’s words.
    3. Reason: Use math rules or logic to reach the answer.
  • Why it matters: If any step fails—like misreading a bar chart—your conclusion can’t be trusted.

🍞 Bottom Bread (Anchor): Solving “Which slice of the pie chart is largest?” needs accurate seeing (sizes), linking (labels), and reasoning (largest = answer choice).

Real stakes: Better multimodal reasoning means safer data dashboards, smarter tutoring apps, clearer lab-diagram help, and more reliable AI across classrooms and everyday tools. DeepVision-103K aims to be the sturdy, varied practice ground AIs need to become careful, accurate problem-solvers in the real world.

02Core Idea

🍞 Top Bread (Hook): You know how having a big, well-organized workbook with lots of different pictures and clear answer keys makes studying easier and more effective?

🥬 Filling (The Aha!): The key insight is: If we train multimodal models with a large, visually diverse, broad-coverage dataset whose answers are verifiably correct—and we carefully control difficulty—then simple rewards (+1 for correct) reliably teach the model to perceive better, reflect on mistakes, and reason more clearly.

Multiple Analogies:

  1. Sports gym: Mixing cardio (visual logic), strength (geometry), and flexibility (charts/plots) builds an all-around athlete. The answer checker is your coach’s whistle.
  2. Cooking class: Fresh, varied ingredients (diverse visuals), a clear recipe (pipeline), and a taste test (verifiable rewards) make good dishes consistently.
  3. Puzzle pack: Crosswords (algebra), jigsaws (geometry), mazes (visual logic). A correct-solution booklet (verifier) means you learn patterns instead of guessing.

Before vs. After:

  • Before: Models practiced on narrow or noisy data; they often mis-saw diagrams, skipped re-checking, or fell apart on new visual styles.
  • After: With DeepVision-103K and RLVR, models improve their first-glance seeing (visual perception), actively re-check the image when something feels off (visual reflection), and calculate or reason more rigorously (mathematical reasoning), transferring gains to general multimodal tasks.

🍞 Top Bread (Hook): Imagine a factory line that turns raw questions into clean, fair practice problems.

🥬 Filling (Data Curation Pipeline):

  • What it is: A three-step process that turns messy real data into reliable, verifiable training material.
  • How it works:
    1. Validity Filtering: Remove proofs, multi-answer, or non-visual questions.
    2. Difficulty Filtering: Run a model multiple times; keep problems in a sweet-spot pass-rate band.
    3. Query Correctness Verification: Check that the image, question, and final answer all match and are correct.
  • Why it matters: Without this pipeline, the reward signal becomes noisy. The model can’t tell if it was right because it reasoned well or because the data was flawed.

🍞 Bottom Bread (Anchor): It’s like washing, sorting, and quality-checking fruit before making juice, so the final drink (the training signal) is pure and healthy.

🍞 Top Bread (Hook): You know how some worksheets are way too easy and others are discouragingly hard?

🥬 Filling (Difficulty Filtering):

  • What it is: A way to keep problems that are neither trivial nor impossible for the current model.
  • How it works:
    1. Try each question multiple times (rollouts).
    2. Compute a pass rate.
    3. Keep those in a target band; discard always-wrong (likely flawed or too hard) and always-right (too easy) ones.
  • Why it matters: Without this, the model either coasts (no learning) or flails (no signal to improve).

🍞 Bottom Bread (Anchor): Like picking math problems that stretch you just enough to learn something new each day.

🍞 Top Bread (Hook): Imagine checking that a map and its directions actually match the same city before you drive.

🥬 Filling (Query Correctness Verification):

  • What it is: A final check that the image, text, and provided answer are consistent and correct.
  • How it works:
    1. Detect garbled or missing text.
    2. Detect image–text mismatches.
    3. Confirm the final answer is right; otherwise reject the item.
  • Why it matters: If the map and directions don’t match, your GPS (the reward) leads you astray.

🍞 Bottom Bread (Anchor): It’s like tossing out a bad question where the bar chart’s labels don’t match the choices.

Why it works (intuition behind the math):

  • Verifiable rewards turn reasoning into a practice game with a trustworthy scoreboard.
  • The sweet-spot difficulty keeps learning signal strong.
  • Visual diversity prevents overfitting to any single drawing style.
  • Adding visual-logic puzzles strengthens universal primitives—spatial relations and pattern recognition—that transfer broadly.

Building blocks:

  • Rich visuals spanning geometry, charts, plots, real scenes.
  • Breadth across 200+ topics and ~400 knowledge points.
  • A strict three-step curation pipeline (validity → difficulty → correctness).
  • RL training with a simple, reliable reward (+1/0) and a thinking prompt that nudges step-by-step reasoning.

🍞 Bottom Bread (Anchor): Put together, it’s like a leveled workbook with many picture types, a careful editor, and an answer key—perfect for building sturdy multimodal reasoning.

03Methodology

At a high level: Raw multimodal math + visual-logic sources → Validity Filtering → Difficulty Filtering (via rollouts and pass rates) → Query Correctness Verification → DeepVision-103K (77K math + 26K visual logic) → RLVR training with rule-based rewards → Evaluate on math and general multimodal benchmarks.

🍞 Top Bread (Hook): You know how a librarian sorts books before they go on the shelves?

🥬 Filling (Validity Filtering):

  • What it is: The first sorting step that removes items unhelpful for verifiable training.
  • How it works:
    1. Rule-based filters remove proofs/explanations (no unique answer) and non-visual items.
    2. An LMM checks if the question needs the image and has a single answer.
    3. Keep only single-answer, image-required questions.
  • Why it matters: RLVR needs a unique, checkable answer. Proofs or multiple valid answers would break the reward.

🍞 Bottom Bread (Anchor): Tossing out “Explain why triangles are cool,” keeping “What is angle ABC?”
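
The librarian's sorting rule above can be sketched as a simple filter. This is a toy illustration, not the paper's implementation: the field names (`question`, `needs_image`, `answer_type`) are assumptions, and the real pipeline pairs rule-based filters with an LMM judge:

```python
def is_valid(item: dict) -> bool:
    """Keep only single-answer questions that genuinely require the image."""
    proof_markers = ("prove", "explain why", "show that")
    if any(m in item["question"].lower() for m in proof_markers):
        return False  # proofs/explanations have no unique, checkable answer
    if not item["needs_image"]:
        return False  # non-visual questions don't train multimodal skills
    return item["answer_type"] == "single"
```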

🍞 Top Bread (Hook): Think of a coach picking drills that are hard enough to build skill but not so hard you give up.

🥬 Filling (Difficulty Filtering):

  • What it is: Calibrating question difficulty to the model’s current ability.
  • How it works:
    1. For each question, run the model 8 times (rollouts).
    2. Score each attempt with an automatic math checker.
    3. Compute pass rate; keep only those within a target band (drop always-fail and always-pass).
  • Why it matters: Without this, training either lacks challenge (no growth) or provides noise (no stable learning).

🍞 Bottom Bread (Anchor): Keeping problems you get right sometimes but not always—the sweet spot where you improve.
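
The pass-rate rule above fits in a few lines. A sketch under stated assumptions—the band endpoints here are illustrative, not the paper's exact thresholds; the principle is simply to drop always-fail and always-pass items across the 8 rollouts:

```python
def keep_by_pass_rate(n_correct: int, n_rollouts: int = 8,
                      band: tuple = (1 / 8, 7 / 8)) -> bool:
    """Keep a question only if its pass rate sits in the sweet-spot band."""
    rate = n_correct / n_rollouts
    return band[0] <= rate <= band[1]  # drop pass rate 0 (flawed/too hard) and 1 (too easy)
```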

🍞 Top Bread (Hook): Before a test, you make sure the question, diagram, and answer key match.

🥬 Filling (Query Correctness Verification):

  • What it is: Final consistency check of the image, the question, and the official answer.
  • How it works:
    1. Check for missing/garbled text.
    2. Check for image–text mismatch.
    3. Confirm the provided answer is correct; discard otherwise.
  • Why it matters: If the answer key is wrong or the diagram doesn’t match, rewards mislead the model.

🍞 Bottom Bread (Anchor): If a bar chart’s tallest bar is clearly 8, but the key says 6, the item gets removed.
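
The three checks above can be sketched as a sequential gate. The `judge` object here is a hypothetical stand-in for the external verifier model (the paper delegates these checks to a strong LMM):

```python
def verify_item(item: dict, judge) -> bool:
    """Final consistency gate: question text, image, and answer key must agree."""
    if not item["question"].strip():
        return False  # missing or garbled question text
    if not judge.image_matches_text(item["image"], item["question"]):
        return False  # image and text describe different problems
    return judge.answer_is_correct(item)  # reject items whose answer key is wrong
```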

🍞 Top Bread (Hook): Imagine a training mix of drills: geometry diagrams, function plots, charts, and real-world photos.

🥬 Filling (Visual Diversity and Broad Coverage):

  • What it is: DeepVision-103K spans six visual categories and 200+ topics to teach robust perception and reasoning.
  • How it works:
    1. Include planar/solid geometry, analytic plots, data charts, schematic diagrams, and real-world scenes.
    2. Mix cross-category visuals (e.g., function plot + 3D shape) to force flexible reasoning.
    3. Add visual-logic puzzles (maze, chess, Tetris) to strengthen spatial/pattern skills.
  • Why it matters: Without visual variety, models overfit to tidy, synthetic diagrams and stumble on messy real cases.

🍞 Bottom Bread (Anchor): A problem might show a frustum next to a height–time plot, asking which cup shape fits the curve.

🍞 Top Bread (Hook): Picture a scoreboard that only counts fully correct answers.

🥬 Filling (RLVR Training with Rule-based Rewards):

  • What it is: Train with a simple reward: +1 if the final answer matches, 0 otherwise.
  • How it works:
    1. Use a thinking prompt that encourages step-by-step reasoning and re-checking images.
    2. Generate responses up to long lengths to allow full reasoning chains.
    3. Compute reward from the verified final answer; update the model (using GSPO) to favor better reasoning.
  • Why it matters: Fancy rewards aren’t needed if the data is clean and verifiable; simple, reliable signals scale well.

🍞 Bottom Bread (Anchor): Like multiple-choice practice where only exact matches earn a point.

Concrete example (data path):

  • Input: “In the figure, AB is a diameter, angle ACB = ?, choices A–D.” Image shows a circle, a right angle marker at C.
  • Validity: Single correct choice? Needs image? Yes/yes.
  • Difficulty: Model passes 4/8 → in band → keep.
  • Correctness: No garbled text, image matches text, ground-truth answer consistent → accept.
  • Training: The model learns to spot the right-angle-inscribed theorem and select the right option.

The secret sauce:

  • Sweet-spot pass-rate curation keeps learning signal sharp.
  • Cross-category visuals prevent brittle tricks and encourage genuine perception.
  • Visual-logic tasks build transferable primitives (spatial relations, pattern recognition).
  • Strict correctness verification keeps rewards trustworthy.
  • Knowledge-guided retrieval balances underrepresented topics, improving coverage without bloating size.

🍞 Bottom Bread (Anchor): The result is a compact, well-balanced workout plan for your brain-with-eyes AI—hard enough to grow, clean enough to trust, and varied enough to generalize.

04Experiments & Results

🍞 Top Bread (Hook): Imagine a school sports day where teams compete in sprints, long jumps, and relays—so you can tell who’s good overall, not just at one thing.

🥬 Filling (The Test):

  • What it is: The team measured how often models got the final answer right on well-known visual math and general multimodal benchmarks using Pass@1 accuracy (the score for the model’s first answer).
  • How it works:
    1. Evaluate on visual math sets (WeMath, MathVision, MathVerse, LogicVista).
    2. Evaluate on general multimodal sets (MMMU val, MMMU Pro, M3CoT).
    3. Use consistent long-context decoding; verify answers with an auto-checker and a small re-judge step to avoid grading mistakes.
  • Why it matters: This shows if the model truly learned to see, reflect, and reason—not just memorize one dataset’s style.

🍞 Bottom Bread (Anchor): Like counting how many first tries you nail on a quiz, not how many tries you take.

🍞 Top Bread (Hook): You know how you compare your time to the fastest kid and to last year’s you?

🥬 Filling (The Competition):

  • What it is: DeepVision-trained models were compared to:
    1. The same base models without DeepVision training.
    2. Official “thinking” versions released by the developers.
    3. Models trained on other open-source RLVR datasets.
    4. Strong closed-source systems.
  • Why it matters: Beating your own baseline is good; beating other strong teams shows your training plan is truly better.

🍞 Bottom Bread (Anchor): DeepVision models are like the athlete who improves their personal best and also wins medals against rivals.

The Scoreboard (with context):

  • Qwen3-VL-8B-DeepVision reached 85.11% on WeMath (like an A+ when others are at A or B+), and matched or exceeded strong closed-source baselines on multiple math sets.
  • MiMo-VL-7B-DeepVision hit 65.62% on LogicVista, a top score within its family, and showed gains of roughly 3–9 percentage points across tasks compared to its non-DeepVision versions.
  • On general multimodal tests, DeepVision-trained models consistently improved over both base and official thinking variants, indicating broad transfer beyond math.

Surprising/Notable findings:

  • Visual-logic data helps more than just visual-logic tasks: it boosts math and general multimodal performance too—suggesting spatial reasoning and pattern recognition are core skills.
  • Correctness verification matters: training on a larger but unverified set scored clearly worse than the verified DeepVision mix. More data didn’t beat better data.
  • Capability gains were threefold: better first-glance perception, genuine visual re-checking when uncertain, and more rigorous math reasoning. Annotators confirmed these patterns by reading the models’ own step-by-step text.

🍞 Top Bread (Hook): Think of Pass@1 accuracy as counting how often your first answer is right in a rapid-fire quiz.

🥬 Filling (Pass@1 Accuracy):

  • What it is: The percentage of problems the model solves on the first try.
  • How it works:
    1. The model gives one final answer per problem.
    2. An auto-checker grades it.
    3. Percentage correct is the Pass@1.
  • Why it matters: It’s a clean way to see how reliable the model is without do-overs.

🍞 Bottom Bread (Anchor): If you get 85 right out of 100 on the first try, Pass@1 = 85%.
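
The metric is simple enough to write out. A minimal sketch (string matching is an assumption; real evaluation uses an auto-checker plus a re-judge step for math equivalence):

```python
def pass_at_1(first_answers: list, answer_key: list) -> float:
    """Percentage of problems whose single first answer is correct."""
    correct = sum(a.strip().lower() == k.strip().lower()
                  for a, k in zip(first_answers, answer_key))
    return 100.0 * correct / len(answer_key)
```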

Takeaway: The careful mix—diverse visuals, broad topics, verifiable answers, and pass-rate-based selection—produced consistent, across-the-board improvements. It didn’t just make math models better at math; it helped them become better multimodal reasoners, period.

05Discussion & Limitations

🍞 Top Bread (Hook): Even the best study guide has weak spots—and knowing them helps you plan what to fix next.

🥬 Filling (Honest Assessment):

  • Limitations:
    1. Visual imbalance: Planar geometry is overrepresented; some rare elements (certain 3D views or niche diagrams) are still sparse.
    2. External verifier dependence: Using a strong model (like Gemini-Flash) to check correctness may add bias/cost and might filter out some legitimately tough items.
    3. Task scope: Focuses on K-12 problems with unique final answers; open-ended proofs or multi-solution tasks need richer feedback and aren’t covered.
  • Required resources:
    • A capable base LMM, long-context decoding, rollout infrastructure, auto-verification tools, and GPUs for RL training.
  • When NOT to use:
    • If your target tasks are proof-writing, multi-solution explorations, or require nuanced partial credit beyond a single final answer.
    • If your domain visuals are very different from K-12 (e.g., medical imaging), you may need domain-specific data and verifiers first.
  • Open questions:
    1. How to balance visual categories automatically so models don’t overfit common shapes?
    2. Can we replace external verifiers with cheaper or self-checking methods without losing quality?
    3. How to extend RLVR to open-ended math with graded reasoning quality, not just final answers?
    4. What’s the optimal ratio of visual logic to math tasks for maximum transfer?

🍞 Bottom Bread (Anchor): It’s like a strong all-around workout plan that still needs more back exercises, a cheaper coach’s whistle, and a way to grade creative routines—not just finish times.

06Conclusion & Future Work

Three-sentence summary: DeepVision-103K is a large, visually diverse, and verifiable dataset for training multimodal models with reliable rewards, using a pipeline that filters for validity, calibrates difficulty by pass rate, and verifies correctness. Models trained on it show stronger visual perception, genuine reflection, and rigorous mathematical reasoning, achieving top results on visual math and improved performance on general multimodal tasks. Careful data curation and visual-logic inclusion are key to broad, transferable gains.

Main achievement: Showing that a curated blend of diversity, breadth, and verifiability—plus a just-right difficulty band—lets simple rewards drive major improvements in multimodal reasoning.

Future directions:

  • Balance visual categories further (especially underrepresented 3D and niche diagrams).
  • Explore lower-cost, lower-bias verification or self-checking strategies.
  • Extend RLVR beyond single-answer tasks to proofs and partial-credit reasoning.
  • Tune the math-to-visual-logic ratio for even better transfer.

Why remember this: It demonstrates a practical, scalable recipe for teaching AIs to truly see and think together. With the right practice set—varied, broad, and checkable—even simple rewards can build robust visual perception, reflection, and reasoning that generalize well beyond math.

Practical Applications

  • Build smarter math tutors that can accurately read student-drawn diagrams and provide step-by-step feedback.
  • Create reliable dashboard assistants that interpret charts and tables without misreading axes or labels.
  • Design classroom tools that grade visual math questions consistently using verifiable answers.
  • Train helpdesk bots to understand photos (e.g., receipts, forms) and text instructions to solve tasks accurately.
  • Improve lab-report assistants that extract values from plots and diagrams and check calculations.
  • Enhance educational games (mazes, Tetris-like puzzles) that teach spatial and pattern skills with AI guidance.
  • Develop accessibility tools that describe images and explain charts clearly, then answer questions about them.
  • Pre-train models for technical fields (engineering sketches, circuits) using the same verify-and-curate recipe.
  • Build robust interview or exam proctors that can interpret graphical problems and validate unique answers.
  • Use the curation pipeline to clean other multimodal datasets, ensuring they support reliable RL training.
#DeepVision-103K #multimodal reasoning #RLVR #verifiable rewards #visual logic #pass-rate filtering #query correctness verification #visual diversity #WeMath #LogicVista #MMMU #data curation pipeline #Large Multimodal Models #GSPO #Pass@1 accuracy