BabyVision: Visual Reasoning Beyond Language
Key Summary
- BabyVision is a new test that checks if AI can handle the same basic picture puzzles that young children can do, without leaning on language tricks.
- Researchers found that top multimodal AIs scored around 49.7% on BabyVision, far below adults at 94.1% and even behind many 6-year-olds.
- The benchmark covers four early-vision skills: fine-grained discrimination, visual tracking, spatial perception, and visual pattern recognition.
- Many failures come from a verbalization bottleneck: models squeeze rich visuals into words before thinking, losing important shape and space details.
- BabyVision-Gen extends the test by letting models answer visually (draw paths, circle items) and grades those outputs automatically with high agreement to humans.
- Open-source models perform notably lower than the best proprietary ones, and simply making models bigger does not fix these early-vision gaps.
- Reinforcement Learning with Verifiable Rewards (RLVR) gives small gains overall but barely helps with hard visual tracking tasks.
- Generation models show hints of human-like visual thinking when drawing solutions, but they still make many mistakes and lack reliable consistency.
- This work spotlights a new frontier: AI must learn to think in visual space, not just describe it with text, to reach human-like perception.
- Progress on BabyVision can guide better robots, safer vision systems, and learning tools that truly understand what they see.
Why This Research Matters
Many real-world systems need true visual understanding, not just good descriptions: think robots, AR glasses, and safety cameras. BabyVision shows today's AIs still miss basic picture skills that kids have, so we know exactly where to improve. By letting models answer with drawings, we can train and test the kind of visual thinking people actually use. This will make future AIs better at tracing routes, recognizing shapes, and imagining 3D spaces, which can boost navigation, inspection, and education tools. The benchmark's clear human comparisons make progress easy to track and explain. As models get better on BabyVision, we can expect safer, smarter vision systems in daily life.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: You know how toddlers can spot the odd sticker on a page, follow a wiggly line with a finger, or imagine how a folded paper would look when opened, all before they can read big words? Those are super-early vision powers.
🥬 The Concept: Multimodal Large Language Models (MLLMs) are AIs that look at images and read text together, then answer questions. How it works:
1) They take in a picture and some words. 2) They turn the picture into internal features and often into word-like descriptions. 3) They reason mostly with their language brain. 4) They output text (and sometimes an image). Why it matters: If they lean too much on words, they can miss tiny visual details that kids easily spot. 🍞 Anchor: Imagine an AI that aces textbook quizzes about famous landmarks but can't find the one penguin shadow that matches the real penguin. That's a mismatch between book smarts and picture smarts!
🍞 Hook: Imagine doing mazes with a crayon: you don't say the route, you draw it. Kids solve many picture puzzles this way, no fancy words needed.
🥬 The Concept: Visual reasoning is solving problems by looking: matching shapes, tracking lines, imagining 3D, and spotting patterns. How it works:
1) See the picture. 2) Build a mental map of shapes and spaces. 3) Try small visual moves (rotate, fold, trace). 4) Pick or draw the result. Why it matters: Without solid visual reasoning, an AI can say a lot but still be wrong about what's in front of it. 🍞 Anchor: When you ask, "Which path leads from A to the exit?", a good visual reasoner traces the exact path, not just guesses with words.
🍞 Hook: Teachers don't grade reading with a math test. So why grade visual skills with language-heavy questions?
🥬 The Concept: Before this paper, most AI tests rewarded language knowledge more than raw seeing skills. How it works:
1) Benchmarks ask questions that can be answered by memorized facts. 2) Models learn to rely on text clues and priors. 3) Visual weaknesses stay hidden. Why it matters: We might think models see well when they're really just great talkers. 🍞 Anchor: If a model can score high on a geography quiz without even looking at the map, that test isn't checking vision.
🍞 Hook: Picture a kid's activity book full of mazes, spot-the-difference, cube counts, and reflection puzzles. Now give it to an AI.
🥬 The Concept: BabyVision is a new benchmark that tests early, language-light visual skills. How it works:
1) Curate 388 hand-checked puzzles in four skill families. 2) Remove tasks that need reading big chunks of text or cultural knowledge. 3) Ensure two expert reviewers agree that answers come from seeing, not from words. 4) Compare many AIs to children and adults. Why it matters: It cleanly measures the basic visual building blocks humans develop before school. 🍞 Anchor: On BabyVision, the best model scored 49.7%, but adults averaged 94.1%, and many young kids did better too.
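To picture what this curation produces, here is a minimal sketch of a single benchmark record; the field names are illustrative assumptions, not the paper's released schema.

```python
from dataclasses import dataclass

@dataclass
class BabyVisionItem:
    """Illustrative record for one curated puzzle (field names are assumptions)."""
    item_id: str                 # unique identifier
    image_path: str              # the puzzle image, e.g. a maze or shadow-matching sheet
    category: str                # one of the four skill families
    subtype: str                 # one of the 22 subtypes, e.g. "Count 3D Blocks"
    question: str                # short, language-light question
    answer: str                  # unambiguous ground-truth answer, e.g. "C"
    solution_note: str = ""      # brief visual solution process written by the annotator
    reviewer_approvals: int = 0  # double-blind reviews confirming the item is visual-first

def is_accepted(item: BabyVisionItem, required_approvals: int = 2) -> bool:
    """An item is kept only if both independent reviewers approve it."""
    return item.reviewer_approvals >= required_approvals
```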
š Hook: Sometimes the best answer is a drawing, not a sentenceālike circling the odd tiger or tracing the maze path.
š„¬ The Concept: BabyVision-Gen lets models answer visually by drawing on the image. How it works:
- Show the original picture. 2) Ask the model to add simple marks (lines, circles, arrows). 3) Compare its drawing to a human-made solution using an automatic judge. Why it matters: It checks what the model can do in visual space, not just what it can say in words. š Anchor: If the task is āDraw the path from A to the exit,ā a correct red line proves the model truly followed the maze.
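To make "answering by drawing" concrete, the sketch below overlays a red route on a copy of a puzzle image using Pillow; the file name and coordinates are invented for illustration, and in BabyVision-Gen the model itself produces the annotated image.

```python
from PIL import Image, ImageDraw

def draw_path_overlay(image_path: str, path_points: list[tuple[int, int]],
                      out_path: str) -> None:
    """Draw a red polyline (e.g., a maze solution) on top of the original image.

    Only a minimal overlay is added; the underlying picture is otherwise unchanged.
    """
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    draw.line(path_points, fill=(255, 0, 0), width=4)  # the traced route
    img.save(out_path)

# Hypothetical usage: trace entrance A to the exit of a maze image.
# draw_path_overlay("maze_017.png",
#                   [(32, 400), (32, 220), (180, 220), (180, 40)],
#                   "maze_017_answer.png")
```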
🍞 Hook: Think of four superpowers kids pick up early: noticing tiny differences, following lines, imagining 3D, and spotting patterns.
🥬 The Concept: BabyVision's four skill families are the foundation of visual reasoning.
- Fine-grained Discrimination: Telling apart very similar shapes or textures. How: compare edges, corners, and tiny features step by step. Why: Without it, the AI mixes up near-twins. Example: Finding the only tiger stripe pattern that's different.
- Visual Tracking: Following one curve or route through crossings. How: lock onto one line, follow it through turns and intersections, ignore distractors. Why: Otherwise, the AI "switches tracks." Example: Tracing a subway route or the right thread through a knot.
- Spatial Perception: Understanding 3D from 2D pictures. How: imagine rotations, folds, hidden blocks, and views. Why: Without this, the AI miscounts cubes or picks the wrong 3D view. Example: Counting blocks in a stacked structure with some hidden.
- Visual Pattern Recognition: Finding the rule behind transformations. How: test rotations, mirrors, overlays in order. Why: Without it, the AI guesses by color or style. Example: Choosing the next shape in a sequence after a 90-degree turn. 🍞 Anchor: These skills are like vision's ABCs: miss one, and bigger tasks wobble.
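The "test rotations, mirrors, overlays in order" strategy for pattern recognition can be sketched as a tiny checker over binary grids; this assumes shapes are given as small 0/1 arrays, which is an illustration rather than anything from the paper.

```python
import numpy as np

# Candidate transforms to try, in a fixed order (rotations first, then mirrors).
TRANSFORMS = {
    "rotate_90":  lambda g: np.rot90(g, k=1),
    "rotate_180": lambda g: np.rot90(g, k=2),
    "rotate_270": lambda g: np.rot90(g, k=3),
    "mirror_h":   lambda g: np.fliplr(g),
    "mirror_v":   lambda g: np.flipud(g),
}

def find_rule(step_a: np.ndarray, step_b: np.ndarray) -> str | None:
    """Return the name of the first transform that maps step_a onto step_b."""
    for name, fn in TRANSFORMS.items():
        if np.array_equal(fn(step_a), step_b):
            return name
    return None

# Toy example: an L-shaped tile that has been rotated 90 degrees.
a = np.array([[1, 0], [1, 1]])
b = np.rot90(a, k=1)
print(find_rule(a, b))  # -> "rotate_90"
```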
🍞 Hook: Have you ever tried to describe a squiggly line perfectly with words? It's hard!
🥬 The Concept: The verbalization bottleneck means squeezing rich pictures into words loses key details. How it works:
1) Convert the image to language-like tokens. 2) Reason mostly in text space. 3) Lose tiny curves, exact positions, and continuous shapes. 4) Make mistakes humans don't. Why it matters: Many BabyVision tasks depend on those "hard to say" details. 🍞 Anchor: A model that "explains" a path but draws the wrong line didn't truly track the curve.
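One way to feel this bottleneck: "verbalize" a path by naming only the coarse grid cell each point falls into, and notice that two quite different routes collapse to the same words. The grid size and wording below are a toy illustration, not the paper's analysis.

```python
# Describe a path only by which coarse grid cell each point falls into.
CELL_NAMES = [["top-left", "top-right"], ["bottom-left", "bottom-right"]]

def verbalize(path, width=100, height=100):
    words = []
    for x, y in path:
        col = 0 if x < width / 2 else 1
        row = 0 if y < height / 2 else 1
        name = CELL_NAMES[row][col]
        if not words or words[-1] != name:
            words.append(name)
    return " -> ".join(words)

# Two clearly different routes...
path_a = [(10, 10), (40, 20), (80, 30), (90, 80)]
path_b = [(20, 40), (60, 10), (70, 45), (60, 90)]
print(verbalize(path_a))  # top-left -> top-right -> bottom-right
print(verbalize(path_b))  # top-left -> top-right -> bottom-right
# ...collapse to identical descriptions: the exact geometry is gone.
```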
02 Core Idea
🍞 Hook: Imagine testing eyesight with drawing and matching games, not with reading tests. That's fair for kids and for AIs too.
🥬 The Concept: The key insight is that to judge an AI's real visual understanding, you must test vision directly, beyond language. How it works:
1) Build puzzles solvable by seeing, not by reading. 2) Ensure answers don't need background facts. 3) Compare models to children and adults. 4) Let models answer by drawing when that's the natural way. Why it matters: It reveals a hidden gap: models that talk well still stumble on basic visual skills. 🍞 Anchor: On BabyVision, an AI that writes long reasoning can still fail a simple line-tracing test that a 6-year-old aces.
Three analogies for the same idea:
- Glasses vs. Glossary: Glasses help you see; a glossary helps you talk. BabyVision checks if the AI has glasses-quality vision, not just a big glossary of words.
- Road Trip vs. GPS Story: It's better to drive the route than to tell a story about it. BabyVision-Gen asks the AI to drive (draw) the path, not just narrate it.
- Ruler vs. Riddle: Some tasks need measuring, not riddles. BabyVision favors visual measurements (edges, angles, paths) over word puzzles.
Before vs. After:
- Before: Success on knowledge-heavy, language-friendly benchmarks made it look like models understood images deeply.
- After: BabyVision shows they often don't. The best model (49.7%) trails adults (94.1%) and even many young kids. Visual tracking and 3D perception are especially weak.
Why it works (intuition, no math):
- When a task depends on exact shapes, continuous lines, or true 3D structure, turning visuals into words throws away precision. By minimizing language needs and inviting visual answers, BabyVision and BabyVision-Gen keep the important picture details intact. This way, models can't hide behind language priors; they must demonstrate real seeing.
Building blocks (each with the sandwich pattern):
🍞 Hook: Think of a fair playground test: same swings, no extra helpers. 🥬 The Concept: Visual-only curation ensures tasks rely on sight, not facts. How it works: 1) Choose early-vision puzzles. 2) Filter out text-heavy or cultural items. 3) Double-blind review for clarity. 4) Keep only unambiguous, visual-first items. Why it matters: It prevents language shortcuts. 🍞 Anchor: Spot-the-difference works the same in any language because it's about seeing, not reading.
🍞 Hook: When a puzzle is best solved by drawing, let the solver draw! 🥬 The Concept: Visual externalization means answering with marks on the image. How it works: 1) Show the image. 2) Ask for minimal overlays (circles, lines). 3) Compare with a ground-truth drawing. 4) Accept style differences, demand the same answer. Why it matters: It measures visual thinking directly. 🍞 Anchor: A correct maze path proves understanding even if no words are spoken.
🍞 Hook: Referees make games fair; so do careful graders for AI images. 🥬 The Concept: Automatic judging checks if the model's visual answer matches the ground truth. How it works: 1) Provide input, human solution, and model output. 2) Use subtype-specific rules. 3) Decide True or False. 4) Validate against human raters. Why it matters: It scales evaluation and stays reliable (about 96% agreement). 🍞 Anchor: If both drawings circle the same penguin shadow, the judge says True.
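Concretely, such a judge call might be assembled like the sketch below; `call_judge_model` is a placeholder for whatever multimodal judge is used, and the rule wording is illustrative, not the paper's actual prompt.

```python
def build_judge_prompt(subtype: str) -> str:
    """Compose a subtype-specific instruction for the visual judge (illustrative wording)."""
    rules = {
        "maze": "Mark True only if the drawn route is identical to the ground-truth route.",
        "spot_the_difference": "Mark True if the same item is circled as in the ground-truth image.",
    }
    rule = rules.get(subtype, "Mark True only if the model's annotation conveys the same answer.")
    return (
        "You will see the original puzzle, a human-annotated solution, and a model's annotated output.\n"
        f"{rule}\nIgnore drawing style (color, thickness); judge only the answer. Reply True or False."
    )

def judge_visual_answer(subtype, input_img, gt_img, model_img, call_judge_model) -> bool:
    """Return True iff the judge says the model's drawing matches the ground truth."""
    reply = call_judge_model(
        prompt=build_judge_prompt(subtype),
        images=[input_img, gt_img, model_img],  # input, human solution, model output
    )
    return reply.strip().lower().startswith("true")
```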
🍞 Hook: Practice with feedback makes you better at sports, and the same goes for models. 🥬 The Concept: RLVR (Reinforcement Learning with Verifiable Rewards) nudges models toward answers that pass a ground-truth check. How it works: 1) The model tries an answer. 2) A judge verifies correctness. 3) Correct answers earn reward. 4) The model learns policies that win more rewards. Why it matters: It improved overall BabyVision scores a bit, though it didn't fix visual tracking. 🍞 Anchor: The model learned to count blocks better, but still stumbled following tricky lines at intersections.
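In code, the "verifiable reward" part can be as simple as a 0/1 signal from a checker; this is a minimal sketch, with `policy_sample` and `verify` as hypothetical stand-ins for the model's sampling function and the answer verifier.

```python
def verifiable_reward(sampled_answer: str, ground_truth: str, verify) -> float:
    """Binary reward: 1.0 if the verifier accepts the answer, else 0.0."""
    return 1.0 if verify(sampled_answer, ground_truth) else 0.0

def rollout_rewards(policy_sample, verify, batch):
    """Score a batch of (question, ground_truth) pairs; the policy is then updated
    (e.g., with a PPO/GRPO-style algorithm) to make high-reward answers more likely."""
    rewards = []
    for question, ground_truth in batch:
        answer = policy_sample(question)  # model attempts the puzzle
        rewards.append(verifiable_reward(answer, ground_truth, verify))
    return rewards
```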
03 Methodology
At a high level: Input image and short question → Make sure the task is visual-first → Model produces an answer (text for BabyVision; annotated image for BabyVision-Gen) → An automatic judge checks correctness → Score per subtype and overall.
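That flow can be written as a short evaluation loop; the sketch below is illustrative, with `model_answer` and `judge` as hypothetical stand-ins for the model under test and the automatic judge (the paper's actual harness is not reproduced here).

```python
from collections import defaultdict

def evaluate(items, model_answer, judge):
    """Answer each benchmark item, judge it, and aggregate accuracy.

    `items` yields records with .subtype, .image_path, .question, .answer;
    `model_answer` and `judge` stand in for the model and the automatic judge.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        prediction = model_answer(item.image_path, item.question)
        is_correct = judge(prediction, item.answer, item.subtype)
        correct[item.subtype] += int(is_correct)
        total[item.subtype] += 1
    per_subtype = {s: correct[s] / total[s] for s in total}
    overall = sum(correct.values()) / max(sum(total.values()), 1)
    return overall, per_subtype
```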
Step-by-step details:
- Scope and Taxonomy (what tasks to include)
- What happens: Researchers defined four early-vision categories with 22 subtypes: Fine-grained Discrimination, Visual Tracking, Spatial Perception, and Visual Pattern Recognition.
- Why this step exists: Without a clear map, the benchmark might drift into language trivia or expert-only content.
- Example: Count 3D Blocks goes under Spatial Perception; Lines Observation under Visual Tracking.
- Data Collection and Filtering
- What happens: Start from ~100 hand-picked seed puzzles (kid textbooks, vision tests), then expand via reverse image search and keywords to ~4,000 candidates; keep only academic-use images; remove heavy text, culture-specific clues, or unsafe content.
- Why it matters: It keeps the focus on seeing, not reading or guessing based on culture.
- Example: A puzzle requiring you to read a paragraph to understand the scene is filtered out.
- Annotation and Double-Blind Quality Assurance
- What happens: Annotators write a short, clear question, the answer, and a visual solution process. Two independent experts then verify the answer is unambiguous and visually derivable.
- Why it matters: Ensures fairness and that each problem truly measures vision.
- Example: For "Which shadow matches the penguin?", both reviewers must agree the correct shadow is C for clear, visual reasons.
- BabyVision Inference Protocol (text answers)
- What happens: Every model gets the same brief prompt template that asks it to think and then answer in a specific format.
- Why it matters: Standardization makes scores comparable across models.
- Example: For "Find the entrance connected to the exit: A, B, or C," the model must return exactly something like "Entrance A." (An illustrative prompt template appears after this list.)
- BabyVision Scoring (LLM-as-judge)
- What happens: A strong judge model checks if the model's answer semantically matches the ground truth; this matched 100% with human evaluators for text equivalence in the paper's setup.
- Why it matters: It handles small formatting differences fairly and quickly.
- Example: "A" versus "Entrance A" both count as correct.
- BabyVision-Gen Inference (visual answers)
- What happens: The model is told to annotate the original image using only minimal overlays (lines, circles, arrows, short labels) without changing the image.
- Why it matters: It measures whether the model can show its reasoning visually.
- Example: "Please draw in red the path from A to the exit." The model must draw only that path. (An illustrative version of this instruction also appears after this list.)
- BabyVision-Gen Automatic Judge
- What happens: A judge model sees the input image, a human-annotated ground-truth solution image, and the model's generated image; subtype-specific rules decide True/False.
- Why it matters: It makes grading consistent and scalable; it agreed with human raters ~96% of the time.
- Example: In a maze, any different route is marked False, even if it still reaches the exit.
- Human Baselines
- What happens: Children (ages 3, 6, 10, 12) took a smaller BabyVision-Mini set in class time; adults completed the full set.
- Why it matters: It anchors model scores to real human development levels.
- Example: Adults average 94.1%; many 6-year-olds outperform the best model on BabyVision-Mini.
- RLVR Training Probe (open model study)
- What happens: Fine-tuned a smaller open model using verifiable rewards derived from correct answers on BabyVision-like data; trained for several days; measured pre/post.
- Why it matters: Tests whether reward-driven practice strengthens visual reasoning.
- Example: Overall gain was about +4.8 points, with some subtypes improving more than others; visual tracking stayed stubborn.
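The standardized prompts in the inference steps above (text answers and visual answers) might look roughly like the templates below; the exact wording is an assumption for illustration, not the paper's released prompts.

```python
# Illustrative prompt templates (not the paper's exact wording).
BABYVISION_TEXT_PROMPT = (
    "Look carefully at the image and answer the question.\n"
    "Question: {question}\n"
    "Think step by step, then give your final answer on the last line as:\n"
    "Answer: <your answer>"
)

BABYVISION_GEN_PROMPT = (
    "Solve the puzzle by annotating the original image.\n"
    "Task: {question}\n"
    "Add only minimal overlays (lines, circles, arrows, short labels) and do not "
    "otherwise change the image."
)

def format_text_prompt(question: str) -> str:
    return BABYVISION_TEXT_PROMPT.format(question=question)

# Example: format_text_prompt("Which entrance is connected to the exit: A, B, or C?")
```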
The secret sauce:
- Visual-first design: Problems are chosen and checked to require seeing, not reading.
- Visual externalization: Models can answer by drawing when drawing is the natural proof.
- Trustworthy grading: Automatic judges reflect human judgments closely, enabling large-scale, fair evaluation.
- Developmental comparison: Scoring against kids makes the gaps concrete and motivating.
Sandwich intros for the four families (brief recaps):
- Fine-grained Discrimination: 🍞 You know how two leaves look almost the same until you spot a tiny notch? 🥬 What it is: telling near-twins apart by tiny visual cues. How: scan edges, corners, spacing, then compare side by side. Why it matters: prevents mixing similar options. 🍞 Example: Find the unique tiger pattern in a 7x7 grid.
- Visual Tracking: 🍞 Imagine tracing one noodle in a bowl without switching noodles. 🥬 What it is: following one continuous line through crossings. How: lock on, follow curvature, ignore touching lines, finish at the true endpoint. Why: avoids "track switching." 🍞 Example: Trace the wire from battery to bulb.
- Spatial Perception: 🍞 Imagine turning a Lego model in your head. 🥬 What it is: inferring 3D structure from 2D views. How: build a mental 3D model, rotate, account for occlusion, project to the asked view. Why: needed for count, unfold, and view choice. 🍞 Example: Count all cubes, including hidden ones.
- Visual Pattern Recognition: 🍞 Think of a dance where each step rotates or mirrors. 🥬 What it is: discovering visual rules across steps. How: test transforms (rotate, mirror, overlay) systematically. Why: stops guessing by color alone. 🍞 Example: Choose the next symbol after a 90-degree rotation.
04 Experiments & Results
The test: Researchers measured accuracy on 388 BabyVision puzzles spanning four early-vision skills. They also ran BabyVision-Gen where the answer is a drawing. They compared many top proprietary and open-source models and gathered human baselines from children and adults.
The competition: Proprietary MLLMs (like Gemini3-Pro-Preview and GPT-5.2), large open-source MLLMs (Qwen3VL family and others), and visual generation systems (like NanoBanana-Pro) faced the same kinds of puzzles, with standardized prompts and judges.
The scoreboard (with context):
- Adults: about 94.1% overall, which is like getting straight As on the whole test.
- Best model (Gemini3-Pro-Preview): about 49.7%, roughly getting only half the answers right while humans are near-perfect.
- Other proprietary models: GPT-5.2 around 34.4%, Doubao-1.8 around 30.2%, with several others between 14% and 19%.
- Best open-source model (Qwen3VL-235B-Thinking): about 22.2%, with most open models at 12–19%.
- BabyVision-Gen (generation models): NanoBanana-Pro around 18.3% overall; GPT-Image-1.5 around 9.8%; Qwen-Image-Edit around 4.8%. These are not directly comparable to the text-answers benchmark but show similar difficulty patterns.
Category highlights (what stood out):
- Visual Tracking: Major weak spot. Many models perform near zero on tasks like Lines Observation unless they're the very strongest. Even then, they sometimes "switch tracks" at crossings.
- Spatial Perception: Counting 3D blocks is tough for all models (best around 20.5%), likely due to occlusions and the need for a stable 3D mental model.
- Fine-grained Discrimination: Spot-the-same and spot-the-different are harder than expected; tiny edge and contour cues vanish through language compression.
- Visual Pattern Recognition: Comparatively better (e.g., logic and rotation rules), probably because discrete rules can piggyback on language-style reasoning. Still far from humans.
Surprising findings:
- Bigger isn't always better: Scaling from 8B to 235B in open models helps, but not smoothly; a 4B variant even outscored an 8B in places, suggesting architecture/training matters more than just size.
- Talking more doesn't fix seeing: "Thinking" modes helped somewhat, but they couldn't rescue tasks that demand precise following of lines or true 3D imagination.
- Drawing helps but isn't magic: BabyVision-Gen showed that letting models draw can surface visual skills, yet maze and connect-the-lines still stumped generation models (often 0% on these subtypes), showing persistent gaps in maintaining spatial coherence.
Human comparison (the eye-opener): On a child-sized BabyVision-Mini, most frontier MLLMs performed below average 3-year-olds, while the best model hovered below typical 6-year-olds. That's a dramatic reversal from their PhD-level performance on language-heavy exams.
RLVR probe (can training nudge improvements?): On an 8B open model, RLVR fine-tuning raised overall BabyVision accuracy by roughly +4.8 points, improving several subtypes (notably some 3D and pattern tasks). However, visual tracking barely improved or even dipped, hinting that reinforcement on text-format rewards can't easily teach continuous line following.
Bottom line: Across text answers and drawn answers, a consistent pattern emerges: current MLLMs often do not possess the atomic, pre-language visual skills that kids develop early. The gap isn't about advanced math; it's about seeing and tracking with fidelity.
05 Discussion & Limitations
Limitations:
- Coverage vs. depth: BabyVision focuses on early-vision skills with 388 items. It's rich enough to reveal gaps, but it's still a slice of all possible visual abilities.
- Cultural and demographic scope: The child participants came from a single school population; broader sampling could refine the human baselines.
- Automatic judging bounds: Although agreement is high (about 96% for BabyVision-Gen's judge), it's not perfect, and some tricky edge cases may still need human review.
- Model access variability: Proprietary models used default or recommended settings; hidden training details or guardrails could influence results.
Required resources:
- For evaluation: A GPU/CPU setup capable of running the chosen models, access to the benchmark data and judge prompts, and, if using generation, image annotation capabilities.
- For training studies: Multiple high-end GPUs (the paper cites 8 H800s for several days for the RLVR probe), a reward pipeline (LLM judge), and curated training data aligned with BabyVision's spirit.
When not to use:
- If you want to test high-level domain knowledge (like medical diagnosis or historical facts), BabyVision is not the right benchmark; it's designed to avoid language crutches.
- If your model cannot process images, BabyVision and BabyVision-Gen will not be meaningful.
Open questions:
- Architecture: What visual backbones or multi-stage pipelines best preserve continuous shapes and 3D structure without collapsing them into text tokens?
- Training signals: Can we design verifiable, continuous-geometry rewards (e.g., differentiable path-matching) to teach line following and spatial coherence? (A toy sketch of one such reward follows this list.)
- Visual memory: How can models maintain manifold identity over long trajectories with crossings and occlusions?
- Unified models: Will natively multimodal, generation-capable systems that think "in pixels" close the gap faster than language-first architectures?
- Generalization: How robust are BabyVision skills to new styles, distortions, or adversarial clutter, and how do improvements transfer to real-world robotics and AR/VR tasks?
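As a toy version of the "continuous-geometry rewards" idea above, a soft path-matching reward could be built from a symmetric Chamfer distance between points sampled on the predicted and ground-truth paths; this is a sketch of one possible design, not something the paper implements.

```python
import numpy as np

def chamfer_distance(pred: np.ndarray, gt: np.ndarray) -> float:
    """Symmetric Chamfer distance between two point sets of shape (N, 2) and (M, 2)."""
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)  # (N, M) pairwise distances
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def path_matching_reward(pred_points, gt_points, scale: float = 10.0) -> float:
    """Map the distance to a (0, 1] reward: identical paths give 1, far-apart paths approach 0."""
    dist = chamfer_distance(np.asarray(pred_points, float), np.asarray(gt_points, float))
    return float(np.exp(-dist / scale))

# Toy check: a slightly jittered copy of the ground-truth path scores close to 1.
gt = np.array([[0, 0], [10, 0], [10, 10], [20, 10]], dtype=float)
pred = gt + np.random.default_rng(0).normal(0, 0.5, gt.shape)
print(round(path_matching_reward(pred, gt), 3))
```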
06 Conclusion & Future Work
Three-sentence summary: BabyVision is a carefully curated benchmark that measures early, beyond-language visual skills (fine-grained discrimination, visual tracking, spatial perception, and pattern recognition) using tasks that children solve without heavy reading. State-of-the-art models score far below humans (best at 49.7% vs. adults at 94.1%), revealing a deep, foundational gap rooted in a verbalization bottleneck that discards crucial visual detail. BabyVision-Gen further shows that allowing visual answers helps expose genuine visual reasoning, but today's models still struggle with spatial coherence and consistency.
Main achievement: The paper cleanly separates true visual competence from language cleverness and provides rigorous, scalable tools, both text and visual evaluation, to diagnose and track progress toward human-like visual reasoning.
Future directions:
- Build architectures that maintain visual fidelity end to end, enabling reasoning directly on pixels and geometry rather than compressing into text.
- Create verifiable visual rewards (e.g., path-matching, shape congruence) to train tracking and 3D imagination.
- Develop unified multimodal generators that can sketch intermediate steps, simulate 3D, and externalize reasoning visually.
- Expand human baselines and stress-test generalization across styles and real-world noise.
Why remember this: BabyVision reframes the challenge: AI won't reach human-level perception by getting better at talking about pictures; it must get better at truly seeing. This benchmark shines a bright, kid-simple light on what's missing and points the way to models that can think with their eyes, not just with their words.
Practical Applications
- Evaluate vision capabilities of new multimodal models before deploying them in visual assistants.
- Design training curricula that teach models to trace paths and count 3D structures using verifiable visual rewards.
- Benchmark household robots on BabyVision-like tasks to improve safe navigation and manipulation.
- Build kid-friendly learning apps that check visual skills (mazes, matching, rotations) and adapt difficulty.
- Improve quality checks in manufacturing by training models to spot tiny differences in parts and patterns.
- Enhance AR/VR tutoring that teaches spatial reasoning by having the AI demonstrate solutions visually.
- Develop driving and mapping aids that reliably follow continuous lanes and paths without "track switching."
- Create medical imaging pre-checks that test models' ability to trace structures and match shapes precisely.
- Use BabyVision-Gen-style annotation to make AI explain its vision decisions by drawing, aiding transparency.
- Adopt automatic visual judges to scale up dataset labeling and rapid model iteration on vision tasks.