
MMGR: Multi-Modal Generative Reasoning

Intermediate
Zefan Cai, Haoyi Qiu, Tianyi Ma et al. · 12/16/2025
arXiv · PDF

Key Summary

  ‱ MMGR is a new benchmark that checks whether AI image and video generators follow real-world rules, not just whether their outputs look pretty.
  ‱ It tests five kinds of reasoning—physical, logical, 2D spatial, 3D spatial, and temporal—across three domains: Abstract Reasoning, Embodied Navigation, and Physical Commonsense.
  ‱ State-of-the-art video models can make realistic videos but often break rules like not walking through walls or keeping puzzle examples unchanged.
  ‱ Image models do better on abstract logic tasks like Sudoku and ARC-AGI, while video models do better on physics-flavored scenes like sports.
  ‱ On ARC-AGI, most models score under 10%, except Nano-banana Pro (about 30%), showing a big gap in true abstract reasoning.
  ‱ On long-horizon navigation, success is very low (for example, around 3.64% holistic success), revealing trouble with planning and consistency over time.
  ‱ A VLM (another AI) is used as an automatic judge, but it sometimes overestimates success; human checks reveal more mistakes, especially quick rule violations in videos.
  ‱ The paper finds three root causes: training data favors looks over logic, architectures forget global state, and objectives reward appearance instead of correctness.
  ‱ MMGR offers a roadmap and tools to build future models that are physically grounded, logically consistent, and better at multi-step reasoning.
  ‱ This matters for robots, safety in generated media, education tools, and any system that needs videos that both look right and behave correctly.

Why This Research Matters

Many real-world uses need videos and images that not only look right but also follow rules—robots shouldn’t plan paths through walls, and educational videos shouldn’t teach impossible physics. MMGR helps us measure and improve these deeper abilities so models can be trusted for planning, tutoring, and simulation. By revealing exactly where current models fail—like forgetting clues in Sudoku or drifting examples in ARC-AGI—researchers can target fixes in data, architecture, and training. Better evaluation today leads to safer, more reliable AI systems tomorrow, from self-driving car simulators to assistive AR. It also curbs misinformation by favoring outputs that are causally correct, not just photorealistic. In short, MMGR moves AI from pretty pictures toward principled world models.

Detailed Explanation


01. Background & Problem Definition

🍞 Hook: Imagine a movie that looks amazing—sparkling oceans, speeding cars—but then a ball passes straight through a table. Your eyes say “cool,” but your brain says “that’s not possible.”

đŸ„Ź The Concept (setting the scene): Before this paper, most video and image generators were graded mainly on how real and pretty they looked. Metrics like FVD, IS, or CLIP similarity are like judges at a beauty contest: they score style and surface, not whether the story makes sense or the physics hold up. How it worked: 1) Train giant models on tons of videos/images; 2) Generate outputs from text prompts; 3) Score with appearance-focused metrics; 4) Call it good if it looks real. Why it matters: Without checking logic and physics, a model can make lovely nonsense—great for trailers, risky for real tasks like navigation or tutoring. 🍞 Anchor: A billiards clip might look cinematic, yet show balls ghosting through each other. Old metrics would still give it a gold star.

— New Concept — 🍞 Hook: You know how you don’t need to measure every marble to know it will fall off a tilted tray? Your brain carries built-in world rules. đŸ„Ź Physical Reasoning: It is understanding how objects behave (gravity, collisions, materials). How it works: 1) Track objects; 2) Predict their motions and contacts; 3) Enforce rules like “no passing through” and “things fall down”; 4) Keep it consistent over time. Why it matters: Without it, models create impossible scenes—like skiers floating or doors that don’t block. 🍞 Anchor: When a dancer spins, the skirt flares because of angular momentum; a reasoning-aware model should show continuous spin, not teleporting poses.
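
To make “no teleporting” concrete, here is a minimal Python sketch of a frame-to-frame continuity check on tracked object positions. The tracking step is assumed to have already happened, and the speed limit is an illustrative number, not a threshold from the paper.

```python
import numpy as np

def physically_plausible(tracks, max_speed=50.0, fps=24):
    """Illustrative check: no object may 'teleport' between frames.

    tracks: array of shape (num_frames, num_objects, 2) holding tracked
    object centers in pixels. max_speed is a hypothetical per-object
    limit in pixels/second, chosen for illustration only.
    """
    max_step = max_speed / fps                       # largest allowed per-frame move
    steps = np.linalg.norm(np.diff(tracks, axis=0), axis=-1)
    return bool((steps <= max_step).all())

# One object moves smoothly; the other jumps across the frame in a single step.
frames = np.zeros((3, 2, 2))
frames[:, 0, 0] = [0.0, 1.0, 2.0]                    # smooth motion
frames[:, 1, 0] = [0.0, 0.0, 300.0]                  # sudden teleport
print(physically_plausible(frames))                  # False: the jump breaks the rule
```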

— New Concept — 🍞 Hook: Imagine solving a riddle: “If a number is even, add 2; if odd, subtract 1.” You follow rules, not just patterns. đŸ„Ź Logical Reasoning: It is following step-by-step rules to reach correct conclusions. How it works: 1) Read the rules; 2) Apply constraints; 3) Check for conflicts; 4) Repeat until the solution fits all rules. Why it matters: Without it, puzzles like Sudoku get wrong digits even if the grid looks neat. 🍞 Anchor: Filling a Sudoku row with two “5”s looks tidy but breaks the rule—so it’s wrong.
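
Here is a minimal sketch of the kind of rule check that logical reasoning demands, using Sudoku: every row, column, and box must contain each digit exactly once, and the original clues must stay untouched. The grid encoding and function are illustrative, not the paper's evaluation code.

```python
def sudoku_valid(grid, clues):
    """Check a filled Sudoku grid against the rules and its original clues.

    grid:  list of lists with the model's completed digits (4x4 or 9x9).
    clues: dict {(row, col): digit} of the given digits that must not change.
    """
    n = len(grid)
    box = int(n ** 0.5)

    # Rule 1: original clues must be unchanged.
    if any(grid[r][c] != d for (r, c), d in clues.items()):
        return False

    def uses_each_digit_once(values):
        return sorted(values) == list(range(1, n + 1))

    # Rule 2: every row, column, and box contains each digit exactly once.
    rows = all(uses_each_digit_once(grid[r]) for r in range(n))
    cols = all(uses_each_digit_once([grid[r][c] for r in range(n)]) for c in range(n))
    boxes = all(
        uses_each_digit_once([grid[r + dr][c + dc]
                              for dr in range(box) for dc in range(box)])
        for r in range(0, n, box) for c in range(0, n, box)
    )
    return rows and cols and boxes

# A 4x4 attempt with two 3s in the top row: looks neat, but it is wrong.
grid = [[1, 2, 3, 3],
        [3, 4, 1, 2],
        [2, 1, 4, 3],
        [4, 3, 2, 1]]
print(sudoku_valid(grid, clues={(0, 0): 1}))  # False
```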

— New Concept — 🍞 Hook: Think of arranging furniture in a room so you can walk around without bumping your shins. đŸ„Ź 3D Spatial Reasoning: It is understanding depth, layout, and paths in 3D spaces. How it works: 1) Build a mental map; 2) Plan a path avoiding obstacles; 3) Respect walls, floors, and stairs; 4) Update as you move. Why it matters: Without it, a navigating agent may “turn left through a wall” or jump floors magically. 🍞 Anchor: A robot finding a kitchen should walk through hallways and doors, not teleport.

— New Concept — 🍞 Hook: Reading a map or maze on paper uses flat (2D) thinking: where is start, where is end, where are walls? đŸ„Ź 2D Spatial Reasoning: It is reasoning about positions and shapes on a flat grid or image. How it works: 1) Identify cells and edges; 2) Trace valid corridors; 3) Avoid walls; 4) Connect start to goal. Why it matters: Without it, a drawn maze path may cross black walls while still reaching the goal. 🍞 Anchor: A correct maze path stays on white corridors from green start to red end.
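
A toy version of that check in Python: the claimed path must start at the start cell, end at the goal, move one cell at a time, and never step onto a wall. The 0/1 grid encoding is an assumption made for illustration.

```python
def valid_maze_path(grid, path, start, goal):
    """Check that a proposed path solves a grid maze without cheating.

    grid: 2D list where 0 = open corridor and 1 = wall (illustrative encoding).
    path: list of (row, col) cells the agent visits, in order.
    """
    if not path or path[0] != start or path[-1] != goal:
        return False
    for (r, c), (r2, c2) in zip(path, path[1:]):
        if grid[r2][c2] == 1:                    # stepped onto a wall
            return False
        if abs(r - r2) + abs(c - c2) != 1:       # teleported or moved diagonally
            return False
    return grid[start[0]][start[1]] == 0 and grid[goal[0]][goal[1]] == 0

maze = [[0, 1, 0],
        [0, 1, 0],
        [0, 0, 0]]
ok_path = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2), (0, 2)]
cheat   = [(0, 0), (0, 1), (0, 2)]               # crosses the wall at (0, 1)
print(valid_maze_path(maze, ok_path, (0, 0), (0, 2)))  # True
print(valid_maze_path(maze, cheat, (0, 0), (0, 2)))    # False
```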

— New Concept — 🍞 Hook: You know breakfast comes before lunch; stories have beginnings, middles, and ends. đŸ„Ź Temporal Reasoning: It is keeping events in the right order and maintaining cause-and-effect over time. How it works: 1) Remember past states; 2) Predict next steps; 3) Enforce that causes happen before effects; 4) Keep details consistent frame to frame. Why it matters: Without it, a video may show a cup unbreaking itself or puzzles that rewrite their clues mid-solve. 🍞 Anchor: If a maze agent reaches the goal, it shouldn’t be back at the start next frame.

— New Concept — 🍞 Hook: Imagine juggling five balls at once—physics, logic, space (2D and 3D), and time. đŸ„Ź Multi-Modal Generative Reasoning: It is creating images/videos that obey physical laws, logical rules, spatial layouts, and time order—all at once. How it works: 1) Read the prompt and any provided context images; 2) Plan a solution that satisfies all rules; 3) Generate frames or images; 4) Check and keep global consistency throughout. Why it matters: Without this, models produce pretty but self-contradictory scenes. 🍞 Anchor: A navigation video that goes upstairs, turns right, and arrives at the blue door without ever crossing walls or changing the map.

The Problem: The community relied on appearance-first metrics (like FVD and IS) that ignore whether outputs follow rules. — New Concept — 🍞 Hook: You know how a photo filter can make two photos look similarly stylish even if they tell different stories? đŸ„Ź FVD (FrĂ©chet Video Distance): It is a score measuring how “visually similar” a generated video distribution is to real videos. How it works: 1) Extract features with a pretrained network; 2) Fit statistical summaries; 3) Compare distributions. Why it matters: FVD can still look good (a low distance) even when physics or logic are wrong, because it judges look, not law. 🍞 Anchor: A teleporting runner can still get a good FVD if textures and motion look realistic.
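
For intuition, here is a rough sketch of how a FrĂ©chet-style distance is computed once features have been extracted (FVD typically relies on a pretrained video backbone, e.g., an I3D-style network; that step is omitted here). This is a simplified illustration, not the official FVD implementation.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(real_feats, gen_feats):
    """FrĂ©chet distance between Gaussian fits of two feature sets.

    real_feats, gen_feats: arrays of shape (num_videos, feature_dim),
    assumed to be features from a pretrained video backbone.
    """
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):          # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2 * covmean))

# Two feature clouds that overlap heavily get a low (i.e., "good") score,
# whether or not the underlying videos obey physics.
rng = np.random.default_rng(0)
real = rng.normal(size=(256, 16))
gen = rng.normal(size=(256, 16)) + 0.1
print(frechet_distance(real, gen))
```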

Failed Attempts: People added simple temporal checks or text–video alignment metrics, but these still missed causal correctness and rule-following. The Gap: There wasn’t a single benchmark that tested all five abilities across both images and videos, with strict, whole-task correctness and human verification. Real Stakes: This matters for robots that mustn’t crash, for sports and science videos that should teach real physics, for navigation aids that mustn’t hallucinate paths, and for safety (no impossible medical procedures). MMGR was created to fill this testing gap by measuring whether generators not only look good but also think straight.

02. Core Idea

🍞 Hook: Imagine grading a science fair where entries must look neat, follow the laws of physics, explain their logic, fit on the table, and tell a clear story in order—only then do they pass.

đŸ„Ź The Concept (the aha): The key insight is to evaluate generative models on five reasoning abilities at once—Physical, Logical, 2D Spatial, 3D Spatial, and Temporal—across three domains, with strict “all-correct-or-fail” metrics and human checks. This turns evaluation from “Does it look right?” into “Does it behave right?” How it works: 1) Define five core abilities; 2) Build three domains (Abstract Reasoning, Embodied Navigation, Physical Commonsense) with controlled difficulty; 3) Ask models to generate videos/images; 4) Judge with a VLM using fine-grained rubrics plus a strict primary metric that requires holistic correctness; 5) Validate with human annotators to catch fleeting errors. Why it matters: Without this, we keep rewarding pretty mistakes; with it, we diagnose exactly where and why models fail to reason. 🍞 Anchor: A Sudoku is only a success if every rule is satisfied, no clues are changed, and all empty cells are correct—partial prettiness doesn’t count.

Multiple Analogies:

  1. Driver’s test: It’s not enough that your car looks polished (appearance); you must signal, follow routes, and stop at red lights (rules). MMGR is the road test.
  2. Recipe check: A cake should look nice, taste right, rise correctly, and be baked through. MMGR doesn’t just judge frosting; it slices the cake to see if it’s done inside.
  3. Escape room: To win, you must use clues, maps, timing, and physical props together. MMGR checks if the model can really escape, not just pose for photos.

Before vs After:

  ‱ Before: Success meant a good FVD (a low distance) or strong text alignment; models could cheat physics or logic and still pass.
  • After: Success means holistic correctness—no wall-crossing, no clue changes, right pattern, steady context, correct final answer. Now we see image models excel at symbolic puzzles, while video models excel at natural motion but break under strict logic.

Why It Works (intuitions, no equations):

  • Decomposition: Splitting reasoning into five abilities pinpoints which cog in the machine slips (e.g., temporal drift vs. 2D layout errors).
  • Holistic metric: Requiring every sub-rule to be correct prevents “partial credit illusions.”
  • Cross-domain triangulation: Abstract tasks stress logic; navigation stresses 3D planning; sports stress physics. Performance patterns across domains reveal training biases.
  • Human-in-the-loop: Humans catch fast, frame-level violations that automatic judges miss, calibrating true difficulty.

Building Blocks (with mini Sandwich intros as needed): — New Concept — 🍞 Hook: You know how a careful referee watches every move, not just the final score? đŸ„Ź VLM-based Evaluation: It is using a vision-language model to grade generated outputs with structured rubrics. How it works: 1) Provide prompt, references, and output; 2) Ask targeted questions (e.g., crossed wall? changed clues?); 3) Extract fine-grained scores; 4) Combine into a strict primary metric. Why it matters: It scales judging, but can miss fleeting errors—hence human checks. 🍞 Anchor: The VLM pauses a maze video and answers, “Did the green square ever touch a black wall?”
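
A sketch of what rubric-style AutoEval can look like for the maze task. The `ask_vlm` callable is a hypothetical stand-in for whatever VLM API serves as the judge (the paper uses Gemini 2.5-Pro); the exact question wording here is illustrative.

```python
# Hypothetical rubric questions, mirroring the maze metrics described above.
MAZE_RUBRIC = {
    "maze_changed": "Did the maze walls or layout ever change? Answer yes or no.",
    "cross_wall": "Did the agent ever overlap a black wall cell? Answer yes or no.",
    "target_achievement": "Did the agent end on the red goal cell? Answer yes or no.",
}

def judge_maze(frames, ask_vlm):
    """Ask one targeted question per rubric item and record boolean verdicts.

    ask_vlm: assumed callable (frames, question) -> free-form text answer.
    """
    verdicts = {}
    for name, question in MAZE_RUBRIC.items():
        answer = ask_vlm(frames, question).strip().lower()
        verdicts[name] = answer.startswith("yes")
    return verdicts

def maze_overall(verdicts):
    """Pass only if nothing changed, no wall was crossed, and the goal was reached."""
    return (not verdicts["maze_changed"]
            and not verdicts["cross_wall"]
            and verdicts["target_achievement"])
```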

— New Concept — 🍞 Hook: Think of a strict checklist where every box must be ticked to pass. đŸ„Ź Holistic Primary Metric: It is a pass/fail rule that demands all sub-conditions are satisfied simultaneously. How it works: 1) Compute fine-grained scores; 2) AND-them together; 3) If any fail, overall is 0; 4) Only perfect compliance gets 1. Why it matters: Stops partial successes from hiding critical errors. 🍞 Anchor: In Sudoku, if any original clue was altered, the entire attempt fails overall—no exceptions.
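
In code, the strict metric is just a logical AND over the fine-grained checks, which is exactly what keeps partial credit from hiding a broken rule. A minimal sketch:

```python
def holistic_overall(checks):
    """Strict primary metric: 1 only if every sub-check passes."""
    return int(all(checks.values()))

def partial_credit(checks):
    """The averaged score MMGR deliberately avoids as its primary metric."""
    return sum(checks.values()) / len(checks)

# A Sudoku attempt that altered one original clue: most checks still pass...
checks = {"clues_unchanged": False,
          "no_constraint_violations": True,
          "completion_accuracy": True}
print(partial_credit(checks))    # ~0.67: looks mostly successful
print(holistic_overall(checks))  # 0: fails outright, as intended
```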

— New Concept — 🍞 Hook: If your cookbook has 2,000 dessert photos but only 10 logic puzzles, you’ll bake cakes better than you solve riddles. đŸ„Ź Training Data Imbalance: It is when models see far more natural videos than symbolic puzzles, so they learn looks over logic. How it works: 1) Ingest mostly perceptual data; 2) Optimize to match appearances; 3) Weak exposure to rules leads to poor symbolic generalization. Why it matters: Explains why models ace sports motion but flop on ARC-AGI and Sudoku. 🍞 Anchor: Great at ballet spins, bad at grid transformations.

— New Concept — 🍞 Hook: A story falls apart if the author forgets what happened in Chapter 1 by Chapter 10. đŸ„Ź Global State Consistency: It is remembering and enforcing the same facts across the whole generation. How it works: 1) Track entities and constraints; 2) Update consistently; 3) Prevent contradictions; 4) Keep a stable world model. Why it matters: Without it, demos in ARC-AGI drift or mazes morph mid-video. 🍞 Anchor: The example puzzles (E1–E4) must never change while solving the test case.

— New Concept — 🍞 Hook: If you reward students only for neat handwriting, they won’t learn math. đŸ„Ź Objective Gap: It is optimizing for appearance (reconstruction/adversarial) instead of correctness (rule adherence/causality). How it works: 1) Losses focus on visuals; 2) Models learn to look right; 3) Rule-breaking isn’t penalized; 4) Errors persist. Why it matters: Drives the pretty-but-wrong pattern seen in results. 🍞 Anchor: A maze run that clips through walls can still score well if the frames look sharp under appearance losses.

Together, these pieces form MMGR: an evaluation lens that surfaces where today’s models reason well, where they guess, and where they break.

03. Methodology

At a high level: Prompt + (optional input image/solution) → Model generates video/image (5 samples per prompt) → VLM judge scores fine-grained criteria → Strict primary metric decides pass/fail → Human evaluators validate a subset to calibrate.
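
Under those assumptions, the whole loop is short. In the sketch below, `generate` stands in for the model under test and `judge` for the VLM judge (calibrated by humans in the paper); this shows the flow of the evaluation, not the released harness.

```python
def evaluate_benchmark(tasks, generate, judge, samples_per_prompt=5):
    """Sketch of an MMGR-style evaluation loop.

    tasks:    iterable of dicts with a text prompt and optional reference context.
    generate: callable (prompt, context) -> generated image or video.
    judge:    callable (output, task) -> dict of fine-grained boolean checks.
    Returns the fraction of samples passing the strict holistic metric.
    """
    passed, total = 0, 0
    for task in tasks:
        for _ in range(samples_per_prompt):
            output = generate(task["prompt"], task.get("context"))
            checks = judge(output, task)
            passed += int(all(checks.values()))   # all-correct-or-fail
            total += 1
    return passed / max(total, 1)
```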

Step-by-step recipe:

  1. Define the five abilities and three domains
  • What happens: The benchmark maps tasks to Physical, Logical, 2D Spatial, 3D Spatial, and Temporal reasoning across three domains: Abstract Reasoning, Embodied Navigation, and Physical Commonsense.
  • Why this step exists: It ensures broad but structured coverage, so we can diagnose which cognitive gear slips.
  • Example: Maze stresses 2D+Logical+Temporal, SLAG stresses 3D+2D+Temporal+Physical, Sports stresses Physical+Temporal.
  2. Build task suites with controlled difficulty
  • What happens: For Abstract Reasoning, generate 240 Mazes (3×3–13×13; DFS/Wilson; varied start/goal), 300 Sudokus (4×4, 9×9; Easy/Medium/Hard), 456 ARC-AGI tasks (v1+v2; Match/Mismatch; Easy/Medium/Hard), and 327 visual math problems (GSM8K, MATH500, AIME, Omni-MATH). For Embodied Navigation, assemble 4 tasks (3D Real-World, Last-Mile, Top-down, SLAG), each with 120 samples and 24 configurations (floors, fidelity, distance, goal spec). For Physical Commonsense, sample 25 Physical Concept and 25 Sports prompts covering solid/fluid dynamics.
  • Why this step exists: Difficulty control prevents cherry-picking and exposes scaling behavior (where performance drops as complexity rises).
  • Example: Maze 9×9 corner-to-corner tests longer 2D planning than 4×4; ARC-AGI Mismatch Hard forces spatial restructuring, not just color flips.
  3. Select diverse models and fix generation settings
  • What happens: Evaluate leading video models (Sora-2, Veo-3, Wan-2.2) and image models (Nano-banana, Nano-banana Pro, GPT-4o-image, Qwen-image). Generate 5 samples per prompt using default or recommended settings.
  • Why this step exists: Multiple samples reduce randomness; locked configs ensure fairness and true zero-shot capability.
  • Example: If a model solves a maze only 1/5 times, its success is fragile; reporting all 5 samples reveals reliability.
  4. Design fine-grained rubrics and a strict primary metric per task
  • What happens: Each task has detailed checks plus a holistic pass/fail. Examples:
    • Maze: Maze Changed (no), Cross Wall (no), Action Reflection (recorded), Target Achievement (yes), Overall = 1 only if all constraints hold.
    • Sudoku: Clues Changed (no), Constraint Violations (none), Completion Accuracy (full), Action Reflection (video only), Overall = 1 only if perfect.
    • ARC-AGI: Pattern Recognition (yes), Grid Integrity (yes), Color Accuracy (yes), Valid Solution (exact match required).
    • Navigation: Scene Consistency, Destination Integrity, Path Validity, Overall Success.
  • Why this step exists: Fine-grained metrics isolate failure modes; the primary metric prevents partial-credit illusions.
  • Example: A Sudoku that looks nearly filled but changes one clue fails Overall by design, reflecting human scoring.
  5. Use a VLM as the first-pass judge (AutoEval)
  • What happens: Feed the generated output, the references (e.g., solution image or demonstration pairs), and a task-specific prompt to a powerful VLM (Gemini 2.5-Pro). It returns structured answers for each metric.
  • Why this step exists: It scales evaluation across 1,853 samples quickly and consistently.
  • Example: For Maze, the VLM answers yes/no to “ever crossed a wall?” and “did the goal get reached?” enabling automatic scoring.
  6. Calibrate with human evaluation on curated subsets
  • What happens: Trained annotators use a web interface to scrub frames, compare with references, and score the same metrics with confidence ratings.
  • Why this step exists: Humans catch transient violations (e.g., a one-frame wall clip) and hold stricter standards where needed (e.g., Sudoku digit clarity).
  • Example: In mazes, humans found 3–5× more wall crossings than AutoEval, dropping some AutoEval successes to true failures.
  7. Aggregate results, analyze patterns, and diagnose causes
  • What happens: Compare across models, modalities, tasks, and difficulty to find trends (e.g., image > video on symbolic logic; video > image on physical motion). Highlight bottlenecks (e.g., Color Accuracy) and anomalies (e.g., Sora-2 better on Mismatch v1 Hard but collapses on v2).
  • Why this step exists: Numbers need stories; patterns suggest fixes in data, architecture, and objectives.
  • Example: Low Overall with decent Pattern Recognition pinpoints an execution gap, not a perception gap.

Domain-specific walkthroughs:

  • Abstract Reasoning (Maze, Sudoku, ARC-AGI, Math)

    • What happens: Provide a static puzzle or demos; ask the model to output a valid solution (image) or a solving process (video) that ends in a valid state.
    • Why needed: Tests symbolic rules, exact 2D layouts, and steady context.
    ‱ Example data: ARC-AGI v1 has 381 tasks; v2 adds 75 harder ones. The image model Nano-banana Pro reaches ≈30% overall; video models lag behind, scoring under roughly 20% and often under 10%.
  • Embodied Navigation (3D Real-World, Last-Mile, Top-down, SLAG)

    • What happens: Provide views (ego, top-down, or mixed). The model must generate a path/trajectory that obeys 3D structure, respects obstacles, and reaches the target.
    • Why needed: Tests 3D reasoning plus long-horizon temporal coherence.
    • Example data: SLAG Overall Success can be as low as ≈3.64%, showing planning brittleness.
  • Physical Commonsense (Physical Concept, Sports)

    • What happens: Prompt phenomena like fluid–fluid or solid–solid interactions, or sports actions with expected motion.
    • Why needed: Tests whether videos obey intuitive physics (continuity, forces, momentum).
    • Example data: Sports ≈60% plausibility shows relative strength on natural motions common in training data.

The Secret Sauce (what’s clever):

  • Five-ability lens: Separating Physical, Logical, 2D, 3D, and Temporal reasoning makes errors findable and fixable.
  • Holistic correctness: AND-ing conditions prevents inflated scores from partial successes.
  • Cross-modality comparison: Scoring both images and videos reveals a modality gap (images better at abstract rules; videos better at natural motion).
  • Difficulty control: Systematic sizes and settings expose how models break as tasks get harder.
  • Human+VLM triangulation: AutoEval scales; humans ground truth. The gap between them is itself a diagnostic signal.

Failure without each step:

  • Without fine-grained rubrics: You can’t tell if failure was due to walls, timing, or rule-breaking.
  • Without holistic metrics: A Sudoku with 1 wrong digit looks “okay” but would fool the scoreboard.
  • Without human checks: Quick wall clips slip by, inflating scores.
  • Without difficulty tiers: You won’t see the sudden cliff from Easy to Hard.
  • Without multiple models: You can’t tell if an issue is architectural or universal.

04. Experiments & Results

The Test: MMGR evaluates whether generation follows rules, not just appearance, across 1,853 samples in 10 tasks. It measures fine-grained dimensions (e.g., crossing walls, clue changes, grid integrity) and then applies strict overall correctness (all conditions must be satisfied).

The Competition: Video models (Sora-2, Veo-3, Wan-2.2) vs. image models (Nano-banana, Nano-banana Pro, GPT-4o-image, Qwen-image), all zero-shot with default/recommended settings, 5 samples per prompt.

Scoreboard with context:

  • ARC-AGI (Abstract Reasoning)

    • Big picture: Most models score under 10% on v1; Nano-banana Pro leads at ≈30% (v1 and v2). Sora-2 reaches ≈20% on v1 but collapses to ≈1.33% on v2.
    • Meaning: 30% here is like getting a strong B where most of the class is failing; <10% is a near-failure on rule discovery and execution.
    • Surprise: Video models can show decent Pattern Recognition and Grid Integrity but still fail Overall—Color Accuracy and exact execution are bottlenecks.
  • Sudoku

    • Image models clearly outperform video models. For 4×4 Easy, image models can reach ≈56%–66% Overall; video models linger around ≈0%–11% and drop to ≈0% on harder/larger grids under human review.
    • Meaning: Image models handle symbolic, all-at-once outputs better; videos introduce temporal drift and clue overwriting.
  • Maze

    • AutoEval suggests moderate success for some video outputs (e.g., Veo-3 Overall up to ~50% in easier settings). Human evaluation, however, finds 3–5× more wall-crossings; true success drops near zero in hard cases.
    • Meaning: The maze “looks solved” but actually cheats (briefly clips through walls). AutoEval misses these fleeting violations; humans catch them.
  • Embodied Navigation (3D Real, Last-Mile, Top-down, SLAG)

    ‱ Holistic success can be very low (e.g., SLAG ≈3.64%). Even when individual parts look right (e.g., 80%+ on a primary sub-metric), overall success drops to ~20% because global consistency and destination integrity fail.
    • Meaning: Long-horizon planning and keeping a stable world state remain hard.
  • Physical Commonsense (Physical Concept, Sports)

    • Results are stronger here: Sports plausibility around ≈60%.
    • Meaning: Models have learned common natural motions from abundant training data. This contrasts with poor abstract logic performance.

Surprising Findings:

  1. Performative vs direct reasoning styles in videos: Sora-2 often shows exploration (backtracking) yet changes the maze (invalid). Veo-3 often heads straight to the goal but clips through walls (invalid). Both “look like” they reason but break rules differently.
  2. VLM vs Human gap: AutoEval regularly overestimates success for fast errors (like a one-frame wall clip). Humans downgrade these to true failures, especially on Hard mazes and Sudoku.
  3. The perception–execution gap: Models can spot patterns (Pattern Recognition) and keep structure (Grid Integrity) but stumble on exact, pixel-level rule execution (Color Accuracy), causing Overall to crash.
  4. Modality asymmetry: Image models do better on ARC-AGI and Sudoku; video models do better on physical plausibility. Training data skew toward natural videos likely causes this split.
  5. Brittleness to distribution shift: Sora-2’s sharp drop from v1 to v2 ARC-AGI (≈20% → ≈1.33%) suggests reliance on familiar patterns over general rules.

Putting numbers in plain words:

  • ARC-AGI: Most models scoring <10% Overall is like getting a D–F on abstraction; Nano-banana Pro’s ≈30% is a clear lead but still shows room to grow.
  • Sudoku: Image models’ ~40–66% on smaller/easier puzzles is like steady Cs to Bs; video models’ ~0–11% (and 0% under human checks on many settings) is failing.
  • Mazes: AutoEval’s ~40–50% for some video runs shrinks toward 0–20% with humans; that’s the difference between “seems okay” and “actually wrong.”
  • Navigation: ~3.64% holistic success is like barely passing one question on a long test—the long-horizon plan collapses.
  • Sports/Physics: ~60% plausibility is like a solid C+/B– on everyday physics—better than abstract logic.

Overall message: Today’s generative models often know how to draw but not how to obey rules, especially over long sequences. MMGR reveals where they shine (natural motions) and where they stumble (symbolic logic and long-range planning).

05. Discussion & Limitations

Limitations (of models and evaluation):

  • Symbolic drought: Training corpora overflow with natural videos but starve on symbolic puzzles. Models overfit to looks and motion, underfit to rules and constraints.
  • Temporal brittleness: Video generators struggle to preserve static context (ARC-AGI demos drift) and to avoid fleeting violations (wall clips) that AutoEval often misses.
  • Architecture memory gaps: Without strong world-state representations and external memory, global consistency decays—leading to clue changes, maze morphing, and navigation drift.
  • Judge imperfections: VLM-based evaluation can undercatch transient errors and overflag subtle color differences; human checks are essential but costly.

Required resources to use MMGR:

  • Access to evaluated models’ inference APIs or open weights (Sora-2, Veo-3, Wan-2.2; Nano-banana/Pro, GPT-4o-image, Qwen-image).
  • Compute for generating 5 samples per prompt across 1,853 tasks (non-trivial time and storage).
  • A capable VLM (e.g., Gemini 2.5-Pro) for AutoEval, plus a small trained human annotation team for calibration.

When NOT to use MMGR (or how to scope it):

  • If you only need photorealistic style and don’t care about rule correctness (e.g., mood boards), MMGR’s strict pass/fail may be overkill.
  • If your model is trained purely for still images without any reasoning claims, running full navigation/temporal tasks may be unnecessary—use the Abstract Reasoning subset instead.
  • If your evaluator cannot access a strong VLM and no human review is possible, results on hard, fast-motion tasks may be misleading.

Open questions:

  • Data: What is the right blend and curriculum of symbolic vs. perceptual data to build balanced reasoning? How to create scalable, diverse ARC-like generators without leakage?
  • Objectives: Which training signals best reward rule adherence and causality (e.g., RL from rule-checkers, neuro-symbolic constraints, temporal consistency losses)?
  • Architecture: What designs best maintain global state (memory modules, object-centric slots, structured latent spaces, graph dynamics)?
  • Evaluation: How to make automated judges reliably catch one-frame violations and be robust to minor cosmetic differences, reducing reliance on humans?
  • Transfer: Can strengths in physical commonsense be leveraged to improve planning, or can abstract logic boost long-horizon control?

Honest take: MMGR does not solve reasoning; it measures it sharply. The benchmark exposes real deficits—especially abstract and long-horizon consistency—and points toward concrete fixes in data, models, and losses. Its strict metrics can feel harsh, but that’s the point: pass only when the output truly thinks and behaves right.

06. Conclusion & Future Work

Three-sentence summary: MMGR reframes how we judge image and video generators—from appearance to reasoning—by testing five abilities (physical, logical, 2D, 3D, temporal) across three domains with strict, holistic metrics and human checks. Results show a clear modality split: image models lead in abstract logic (e.g., ARC-AGI, Sudoku), while video models fare better on natural motion yet fail at exact rule-keeping and long-horizon consistency. The benchmark reveals root causes in data imbalance, weak global state, and appearance-first objectives, offering a roadmap to reasoning-aware world models.

Main achievement: A principled, comprehensive evaluation suite that unifies abstract puzzles, embodied navigation, and physical commonsense, with fine-grained diagnostics plus an “all-correct-or-fail” primary metric calibrated by human evaluation.

Future directions:

  • Data: Balance perceptual and symbolic corpora; build synthetic curricula that gradually add constraints.
  • Models: Add memory, object-centric state, and structured latent spaces; couple perception with program-like executors.
  • Objectives: Mix appearance with rule-checking, causal rewards, and temporal consistency; use RL from structured feedback.
  • Evaluation: Improve automated judges to catch fleeting violations and better align with human judgments.

Why remember this: MMGR shifts the goalposts from pretty pictures to principled behavior. It tells us not only whether a model can paint a scene but whether it can keep promises to physics, logic, space, and time—exactly what we need for trustworthy simulators, tutors, and robots.

Practical Applications

  ‱ Use MMGR to audit a new video generator before deploying it for robotics simulation, ensuring no rule-breaking paths.
  ‱ Design a balanced training curriculum by adding symbolic reasoning datasets (Sudoku, ARC-AGI) to complement natural videos.
  ‱ Incorporate rule-checker rewards (e.g., no wall crossings, no clue changes) into training via reinforcement learning.
  ‱ Adopt object-centric memory or world-state modules after MMGR flags temporal drift in ARC-AGI or navigation tasks.
  ‱ Build a QA gate for content creation that runs MMGR-like checks to prevent physics-breaking scenes from going public.
  ‱ Evaluate navigation planners with MMGR’s embodied tasks to detect long-horizon inconsistency before real-world tests.
  ‱ Use MMGR’s human-in-the-loop calibration to tune VLM evaluators for catching fleeting, frame-level violations.
  ‱ Benchmark architectural variants (with/without external memory) to see which best preserves global state across frames.
  ‱ Create dataset controls (maze sizes, ARC-AGI Match/Mismatch) to stress-test specific weaknesses found in your model.
  ‱ Track progress over time by re-running MMGR after each training change, verifying that holistic correctness improves—not just appearance.
Tags: multi-modal generative reasoning · video generation evaluation · physical commonsense · abstract reasoning · ARC-AGI · embodied navigation · temporal consistency · 2D spatial reasoning · 3D spatial reasoning · logical reasoning · VLM-based evaluation · holistic metrics · world modeling · benchmarking · causal correctness