RISE-Video: Can Video Generators Decode Implicit World Rules?
Key Summary
- RISE-Video is a new test that checks whether video-making AIs follow hidden world rules, not just make pretty pictures.
- It turns a picture and a short instruction into a video and then asks: did the AI do what real life would do?
- The benchmark covers eight kinds of thinking, like commonsense, time, space, logic, and even school subjects like physics and chemistry.
- Four scores judge each video: Reasoning Alignment, Temporal Consistency, Physical Rationality, and Visual Quality.
- A careful Large Multimodal Model (LMM) acts like a fair referee to score results in a way that matches human judges well.
- Across 11 famous models, even the best system got only 22.5% accuracy when all strict rules had to be perfect at once.
- Models are okay at noticing colors and sizes but weak at puzzle-like logic and following game rules.
- Special graders handle tough cases like mazes and symmetry so the grading is precise, not guesswork.
- This benchmark helps builders find where their video AIs break real-world rules and fix them.
- It shifts the goal from ‘looks real’ to ‘acts real,’ which is what we need for trustworthy video AIs.
Why This Research Matters
Video AIs are moving from fun demos to tools we learn from, design with, and trust in simulations. If they only look real but ignore hidden rules, they can mislead people, teach wrong steps, or encourage unsafe actions. RISE-Video pushes the field to value reasoning—physics, time order, social norms—so results act real, not just appear real. This shift helps creators, educators, and engineers diagnose where models fail and fix those gaps. It also gives researchers a common yardstick, speeding up progress toward world-simulating models. As models improve on RISE-Video, we get closer to dependable video assistants for science labs, sports training, and classroom demonstrations. In short, RISE-Video helps turn video AI from a toy camera into a trustworthy window on how the world works.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine your friend makes a flipbook from drawings. It looks smooth and colorful, but when a ball is dropped, it floats up instead of falling. It looks cool, but it doesn’t follow real-world rules.
🥬 The Concept (Text-Image-to-Video, or TI2V):
- What it is: TI2V is when an AI takes a starting picture plus a short instruction and turns it into a moving video.
- How it works: 1) Read the text instruction; 2) Look at the input image; 3) Predict how the scene should change frame by frame; 4) Render the result as a video.
- Why it matters: Without understanding the world’s rules, videos may be pretty but nonsense—like pouring water upward or opening a bottle without twisting the cap. 🍞 Anchor: Start with a photo of a kid holding a kite and say, “Show the kite lifting into the sky.” A good TI2V video makes wind tug the string and the kite rise smoothly, not teleport.
🍞 Hook: You know how you just “know” that hot air rises, you can’t walk through walls, and night comes after day?
🥬 The Concept (Commonsense/Implicit World Rules):
- What it is: These are the everyday rules people use without thinking, like gravity, cause-and-effect, and social norms.
- How it works: 1) See the situation; 2) Recall everyday patterns; 3) Predict the next likely step; 4) Keep everything realistic.
- Why it matters: If AIs ignore these rules, their videos become magic tricks instead of real life. 🍞 Anchor: A person picks up a bottled drink. Before sipping, the cap must come off. If the AI skips that, it fails commonsense.
🍞 Hook: Think of a school test that doesn’t just check handwriting, but checks if your answer makes sense.
🥬 The Concept (Benchmark):
- What it is: A benchmark is a fair test with clear rules to compare different AIs.
- How it works: 1) Build many challenge tasks; 2) Ask models to solve them; 3) Score with strict criteria; 4) Compare scores.
- Why it matters: Without a strong benchmark, we can’t tell if an AI truly understands or just looks fancy. 🍞 Anchor: RISE-Video is that test for TI2V—less about “prettiness,” more about “does it act like the real world?”
The World Before:
- Video generators got great at making sharp, colorful, and smooth clips. But tests mostly rewarded looks and short-term smoothness, not deep understanding.
- Many benchmarks focused on things like sharpness, flicker, and motion smoothness. These are important, but they don’t ask, “Did the AI do the sensible thing next?”
The Problem:
- We lacked a clear, fair way to measure whether TI2V models follow hidden rules: gravity, time order, object permanence, social customs, and multi-step procedures.
- Models often failed subtle steps: forgetting to unscrew a bottle cap, changing a person’s identity mid-video, or making objects move without contact.
Failed Attempts:
- Pure perceptual metrics: They praised pretty but illogical videos.
- General video benchmarks: Broader coverage but light on reasoning specifics.
- One-size-fits-all automatic judges: Struggled to grade puzzles like mazes or strict geometry.
The Gap:
- We needed a reasoning-first benchmark that spans different kinds of thinking (commonsense, time, space, logic, societal norms, and subject knowledge), plus a judging pipeline that asks targeted, reasoning-aware questions.
Real Stakes:
- If video AIs can’t follow world rules, they will mislead, confuse, or even be unsafe—for example, showing a wrong lab reaction, incorrect sports mechanics, or dangerous driving behaviors.
- For creators and educators, realism means trust. For robotics and simulation, reasoning means safety. For storytelling, it means believable cause and effect.
🍞 Hook: Imagine judging a talent show. You don’t just check the costume; you check the dance moves, rhythm, and timing.
🥬 The Concept (Large Multimodal Models, LMMs, as Judges):
- What it is: Big AIs that understand both pictures and words, used here to grade videos.
- How it works: 1) Show sampled frames and precise questions; 2) The LMM answers Yes/No or gives a score; 3) Compare with humans to verify fairness; 4) Use special routines for puzzles.
- Why it matters: Humans are slow and expensive to judge many videos; LMMs make judging scalable while staying close to human decisions. 🍞 Anchor: An LMM checks if the chameleon’s color gradually matches the branch (commonsense) and if nothing else randomly changes (consistency).
02 Core Idea
🍞 Hook: You know how a great magic trick can fool your eyes, but a scientist checks if it obeys real rules? Pretty is not the same as true.
🥬 The Concept (Key Insight):
- What it is: To truly evaluate video AIs, we must test if they follow hidden world rules, not just if they look real.
- How it works: 1) Build tasks that require reasoning across eight kinds of knowledge; 2) Score with four complementary metrics; 3) Use an LMM judge with targeted questions and smart frame sampling; 4) Add special evaluators for strict puzzles like mazes and symmetry.
- Why it matters: Without reasoning-focused checks, we mistake eye candy for understanding. 🍞 Anchor: A model that shows a rose doing capillary action (water climbing the stem) slowly and logically scores higher than one that just flashes pretty petals.
Three Analogies for the Same Idea:
- School Report Card: Don’t just grade neat handwriting (visual quality); also grade science facts (physical rationality), story order (temporal consistency), and correct answers (reasoning alignment).
- Cooking Show: The dish must look tasty (visual quality), be cooked in the right order (temporal consistency), follow kitchen physics (physical rationality), and match the recipe idea (reasoning alignment).
- Sports Replay: The video should be crisp (visual quality), players must stay themselves and in position except when they move as instructed (temporal consistency), physics must make sense (no teleporting balls), and the play must match the coach’s plan (reasoning alignment).
Before vs After:
- Before: Models were rewarded for smoothness and sharpness, even if they broke physics or skipped crucial steps.
- After: Models are rewarded for acting like the real world—obeying physics, keeping identities stable, following instructions exactly, and reasoning through multi-step tasks.
Why It Works (Intuition):
- Each metric catches a different failure type: Reasoning Alignment checks if the outcome fits the rule; Temporal Consistency catches drift and identity swaps; Physical Rationality catches physics cheats; Visual Quality keeps the baseline of clarity.
- Targeted questions reduce guesswork: asking, “Was the cap twisted off before drinking?” is sharper than “Does the video look good?”
- Smart sampling (like 2 fps or focusing on the final frame) feeds the judge just the right evidence.
- Special puzzle graders avoid language ambiguity and directly compare structure, paths, or final layouts.
Building Blocks (with sandwiches):
- 🍞 Hook: You know how detectives match clues to what really happened? 🥬 Reasoning Alignment:
- What it is: Does the video’s outcome match the rule or knowledge the task requires?
- How it works: Ask binary Yes/No questions tied to the knowledge type; sample key frames; tally correctness.
- Why it matters: Without it, a video might look fine but solve the wrong problem. 🍞 Anchor: In an “unscrew the bottle” task, the judge asks, “Was the cap removed?”
- 🍞 Hook: Stories need smooth scenes so nothing pops in or out without reason. 🥬 Temporal Consistency:
- What it is: Everything stays stable except what the instruction says should change.
- How it works: Show frames to the judge; ignore instructed motion; score remaining stability 1–5.
- Why it matters: Without it, shirts change color mid-clip or faces morph randomly. 🍞 Anchor: If only the camera angle should change, the person’s face and clothes shouldn’t.
- 🍞 Hook: You can’t drop a rock and have it fly upward. 🥬 Physical Rationality:
- What it is: Motions and interactions must obey physics and everyday logic.
- How it works: The judge looks for gravity, contact, collisions, fluids, and continuity; scores 1–5.
- Why it matters: It prevents melting hands, ghost objects, and impossible paths. 🍞 Anchor: A hook must touch the gold before dragging it—no spooky motion.
- 🍞 Hook: A clear window lets you see the scene; a foggy one hides details. 🥬 Visual Quality:
- What it is: Sharpness, clean textures, and steady lighting.
- How it works: Sample six frames; apply super-resolution first so blur isn’t misjudged; score 1–3.
- Why it matters: If you can’t see clearly, you can’t judge reasoning fairly. 🍞 Anchor: A crisp basketball shot shows fingers, seams, and motion cleanly.
- 🍞 Hook: Like grading in different school subjects. 🥬 Dimensional Evaluation Protocol:
- What it is: A structured way to test eight reasoning types with four metrics.
- How it works: Curate 467 human-annotated samples across experiential, commonsense, temporal, societal, perceptual, spatial, subject, and logical tasks; then score with the four metrics and aggregate.
- Why it matters: Covers many kinds of thinking so one trick can’t game the test. 🍞 Anchor: A model might ace colors (perceptual) but fail mazes (logical)—the protocol shows both.
03 Methodology
High-Level Recipe: Input Image + Instruction → Generate Video → Sample Frames → Ask Targeted Questions/Checks → Score on Four Metrics → Aggregate Scores
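To make the recipe concrete, here is a minimal Python sketch of how the pieces could fit together. Everything passed in (generate_video, sample_frames, judge) is a hypothetical placeholder rather than the authors' implementation, and scaling each rating by its maximum is a simplification for this sketch; the paper reports raw 1–5 and 1–3 ratings.

```python
# Minimal sketch of the evaluation recipe. The callables passed in
# (generate_video, sample_frames, judge) are hypothetical placeholders
# for whatever generator and LMM-judge APIs are actually used.
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Task:
    image_path: str        # input image
    instruction: str       # reasoning-heavy instruction
    questions: list[str]   # knowledge-aware Yes/No checks


@dataclass
class Scores:
    reasoning_alignment: float    # fraction of Yes/No checks passed, 0-1
    temporal_consistency: float   # 1-5 judge rating, scaled by 1/5 here
    physical_rationality: float   # 1-5 judge rating, scaled by 1/5 here
    visual_quality: float         # 1-3 judge rating, scaled by 1/3 here


def evaluate_task(
    task: Task,
    generate_video: Callable[[str, str], str],
    sample_frames: Callable[[str], Sequence[bytes]],
    judge: Callable[[Sequence[bytes], str], float],
) -> Scores:
    """Generate one video and score it on the four metrics."""
    video_path = generate_video(task.image_path, task.instruction)
    frames = sample_frames(video_path)
    # For Yes/No questions the judge is assumed to return 1.0 or 0.0.
    ra = sum(judge(frames, q) for q in task.questions) / len(task.questions)
    tc = judge(frames, "Rate 1-5: stability of everything not instructed to change.") / 5
    pr = judge(frames, "Rate 1-5: physical plausibility (gravity, contact, continuity).") / 5
    vq = judge(frames, "Rate 1-3: sharpness and texture quality.") / 3
    return Scores(ra, tc, pr, vq)
```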
Step 1: Build Diverse Reasoning Tasks (the data)
- What happens: The authors curate 467 TI2V tasks, each with an input image and a reasoning-heavy instruction, spanning eight categories: experiential, commonsense, temporal, societal, perceptual, spatial, subject-specific, and logical.
- Why this exists: Variety ensures we test many kinds of thinking, from color counting to multi-step procedures and strict puzzle rules.
- Example: “Show the process of the man drinking from the bottle.” Hidden rule: the cap must be removed first.
Sandwich mini-primers for the eight categories (first mention):
- 🍞 Hook: You learn a lot from doing things, like tying shoes. 🥬 Experiential Knowledge: It checks if the AI follows human-like steps, identities, intentions, and context in the right order. Without it, the AI might eat before peeling the orange. 🍞 Anchor: “Take a letter out of an envelope” should show fingers grasping, pulling, then revealing the letter.
- 🍞 Hook: You know a vase shatters when hit. 🥬 Commonsense Knowledge: Everyday physics and life facts. Without it, snow won’t keep footprints. 🍞 Anchor: A knocked cup should spill downward, not hover.
- 🍞 Hook: Timelines matter—breakfast before bedtime. 🥬 Temporal Knowledge: Ordering over short, medium, long, or reversed time. Without it, seasons might shuffle randomly. 🍞 Anchor: “Reverse five seconds” should play actions backward smoothly.
- 🍞 Hook: Maps and rooms have layouts. 🥬 Spatial Knowledge: Viewpoint moves, object arrangement, and structure completion. Without it, cameras teleport or shapes don’t fit. 🍞 Anchor: “Shift to overhead view” should lift the camera smoothly.
- 🍞 Hook: Counting apples is basic but important. 🥬 Perceptual Knowledge: Low-level attributes—size, color, count, position, occlusion. Without it, you mix up numbers or colors. 🍞 Anchor: “Make the blue cube move right” must not turn it red.
- 🍞 Hook: We follow rules—stop at red lights. 🥬 Societal Knowledge: Emotions, social rules, cultural customs. Without it, behaviors look odd or rude. 🍞 Anchor: “Place the most iconic Chinese New Year main food” should show dumplings or similar staples, not random snacks.
- 🍞 Hook: Classroom facts count. 🥬 Subject Knowledge: Physics, chemistry, geography, sports. Without it, lab reactions or ball shots look wrong. 🍞 Anchor: Marble in acid should bubble; a jump shot should bend knees then release.
- 🍞 Hook: Puzzles have exact rules. 🥬 Logical Capability: Games, mazes, symmetry, board grids. Without it, paths cross walls or pieces jump illegally. 🍞 Anchor: A maze path must not cut through walls.
Step 2: Human Annotation and Privacy Stylization
- What happens: Experts write and verify instructions/goals and, when necessary, stylize faces for privacy while keeping task meaning.
- Why this exists: Ensures ground truth is reliable and respectful of privacy.
- Example: A real person’s image might be stylized cartoon-like, but the action steps remain the same.
Step 3: Four-Metric Evaluation (the judges)
3A) Reasoning Alignment
- What happens: For each sample, manually designed, knowledge-aware Yes/No questions guide an LMM judge. Frame sampling adapts to the task: e.g., 2 fps for full progress, fewer frames for final-state checks.
- Why this exists: Targeted questions reduce vagueness and test the exact rule the task depends on.
- Example: “Has the person twisted the bottle cap off before drinking?” Yes/No.
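As a rough illustration of 3A, the sketch below picks frame indices (about 2 fps for process checks; only the last frame for a final-state check, which is an assumption about how “fewer frames” is chosen) and turns a batch of Yes/No answers into a score. The LMM call itself is omitted; only the bookkeeping is shown.

```python
# Sketch of the Reasoning Alignment bookkeeping. The LMM judge call is
# omitted; this only shows frame selection and score tallying.
def sample_indices(n_frames: int, native_fps: float, mode: str = "full_progress") -> list[int]:
    """~2 fps over the clip for process checks; final-state checks use
    fewer frames (here, just the last frame, as a simplifying assumption)."""
    if mode == "final_state":
        return [n_frames - 1]
    step = max(1, round(native_fps / 2))   # roughly 2 frames per second
    return list(range(0, n_frames, step))


def reasoning_alignment(yes_no_answers: list[bool]) -> float:
    """Fraction of knowledge-aware Yes/No checks the video passes."""
    return sum(yes_no_answers) / len(yes_no_answers) if yes_no_answers else 0.0


# Example: a 5-second, 24 fps clip judged on two questions, one passed.
print(sample_indices(n_frames=120, native_fps=24.0))   # every 12th frame
print(reasoning_alignment([True, False]))               # 0.5
```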
3B) Temporal Consistency (1–5)
- What happens: Provide instruction and sampled frames; the LMM must ignore intended changes and rate stability of everything else.
- Why this exists: Prevents identity drift or random background swaps from slipping by.
- Example: If only the camera viewpoint should change, the shirt color must remain constant.
3C) Physical Rationality (1–5)
- What happens: The LMM checks gravity, contact, collisions, fluid motion, and continuity. Abstract puzzles are excluded.
- Why this exists: Enforces real-world plausibility so motions and interactions feel natural.
- Example: The Gold Miner hook must contact the stone before moving it; no sliding stone without touch.
3D) Visual Quality (1–3)
- What happens: Six uniformly sampled frames (excluding first/last) are judged after super-resolution so low native resolution doesn’t look like blur.
- Why this exists: Crispness and clean textures are needed to fairly judge reasoning.
- Example: A basketball clip with preserved seams, no melting fingers, and steady lighting earns a 3.
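A small sketch of the frame pick for Visual Quality: six frames spread evenly through the clip, skipping the first and last frame. The exact placement rule is an assumption, and the super-resolution pass and the 1–3 judgment are assumed to happen downstream, so they are not shown.

```python
# Sketch of the visual-quality frame selection: k frames placed evenly
# between (but not including) the first and last frame of the clip.
def visual_quality_indices(n_frames: int, k: int = 6) -> list[int]:
    if n_frames <= k + 1:                  # very short clip: take what exists
        return list(range(1, n_frames - 1))
    step = (n_frames - 2) / (k + 1)        # even spacing inside the interior
    return [round((i + 1) * step) for i in range(k)]


print(visual_quality_indices(121))   # [17, 34, 51, 68, 85, 102]
```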
Step 4: Special Scoring for Schematic Puzzles
- Maze Navigation: Track the agent’s colored path across frames to check two constraints—no wall-crossing and reaching the target. The number of constraints satisfied (0, 1, or 2) maps to a score of 0, 0.5, or 1.
- Symmetry Generation: Compare grid cells between the last frame and the ground-truth reference; compute accuracy = 1 − (FP + FN)/N; discretize to {0, 0.5, 1} using a human-aligned 0.85 threshold (see the sketch after this list).
- Board Games: Give the LMM both the last frame and a labeled reference so it can visually compare structures without guessing rules in words.
- Why this exists: Language-only judgments struggle with rigid geometry; direct structural checks are clearer and fairer.
- Example: In symmetry, colors can differ, but the positions must match the target pattern.
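The sketch referenced above shows the two structural graders in plain Python. The 0.85 cutoff comes from the text; the lower cutoff separating the 0.5 and 0 bands is an assumption, since only the top threshold is stated.

```python
# Sketch of the schematic-puzzle graders. Grids are flat lists of booleans
# (cell filled / empty); both functions return a score in {0, 0.5, 1}.
def symmetry_score(pred: list[bool], target: list[bool],
                   hi: float = 0.85, lo: float = 0.5) -> float:
    """accuracy = 1 - (FP + FN) / N, then discretized.
    `hi` is the stated human-aligned threshold; `lo` (the floor of the
    0.5 band) is an assumed value, not given in the text."""
    fp = sum(p and not t for p, t in zip(pred, target))
    fn = sum(t and not p for p, t in zip(pred, target))
    acc = 1 - (fp + fn) / len(target)
    return 1.0 if acc >= hi else (0.5 if acc >= lo else 0.0)


def maze_score(no_wall_crossing: bool, reached_target: bool) -> float:
    """0, 1, or 2 satisfied constraints map to scores 0, 0.5, 1."""
    return {0: 0.0, 1: 0.5, 2: 1.0}[int(no_wall_crossing) + int(reached_target)]


# Example: a 3x3 symmetry grid with one wrong cell -> accuracy 8/9 ≈ 0.89 -> 1.0
target = [True, False, True, False, True, False, True, False, True]
pred   = [True, False, True, False, True, False, True, True,  True]
print(symmetry_score(pred, target), maze_score(True, False))   # 1.0 0.5
```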
Step 5: Aggregating Scores
- Weighted Score = 0.4×Reasoning Alignment + 0.25×Temporal Consistency + 0.25×Physical Rationality + 0.1×Visual Quality.
- Accuracy: Count a case as fully correct only if all four dimensions are perfect; normalize to 100.
- Why this exists: Weighted Score balances contributions; Accuracy shows how often everything is flawless at once (a tough bar).
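In code, the aggregation reads roughly as follows. The weights and the all-perfect accuracy rule come directly from the text; assuming all four metrics are already normalized to the 0–1 range is a simplification for this sketch.

```python
# Sketch of score aggregation: weighted score per clip and the strict
# "everything perfect at once" accuracy over a set of clips.
WEIGHTS = {"ra": 0.40, "tc": 0.25, "pr": 0.25, "vq": 0.10}


def weighted_score(ra: float, tc: float, pr: float, vq: float) -> float:
    """0.4*RA + 0.25*TC + 0.25*PR + 0.1*VQ, all inputs normalized to 0-1."""
    return WEIGHTS["ra"] * ra + WEIGHTS["tc"] * tc + WEIGHTS["pr"] * pr + WEIGHTS["vq"] * vq


def strict_accuracy(cases: list[tuple[float, float, float, float]]) -> float:
    """Percent of cases where all four dimensions are perfect (equal to 1.0)."""
    perfect = sum(all(score == 1.0 for score in case) for case in cases)
    return 100.0 * perfect / len(cases) if cases else 0.0


print(weighted_score(1.0, 0.8, 1.0, 1.0))                 # ≈ 0.95
print(strict_accuracy([(1, 1, 1, 1), (1, 0.8, 1, 1)]))    # 50.0
```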
Step 6: Choosing the Judge
- The authors compare several LMM judges against human ratings using Mean Absolute Error and variability. GPT-5 best matches humans across most metrics; gpt-5-mini works well for visual quality, balancing cost.
- Why this exists: Trustworthy automatic judging must agree with careful human evaluation.
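A toy version of that comparison, using made-up numbers: compute the Mean Absolute Error between each candidate judge's ratings and the averaged human ratings, and keep the candidate with the lowest error (variability can be checked the same way).

```python
# Toy judge-selection check: lower mean absolute error (MAE) against
# averaged human ratings means closer agreement. All numbers are made up.
from statistics import mean


def mae(judge: list[float], human: list[float]) -> float:
    return mean(abs(j - h) for j, h in zip(judge, human))


human_ratings = [4.0, 3.5, 5.0, 2.0]
candidates = {
    "judge_a": [4.0, 3.0, 5.0, 2.5],   # MAE 0.25
    "judge_b": [3.0, 3.0, 4.0, 3.5],   # MAE 1.00
}
best = min(candidates, key=lambda name: mae(candidates[name], human_ratings))
print(best)   # judge_a
```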
Secret Sauce:
- Reasoning-aware prompts and adaptable frame sampling give the judge just the right evidence.
- Puzzle-specific evaluators eliminate ambiguity and reward exact structure.
- Super-resolution before visual grading prevents punishing a model for resolution rather than artifacts.
- Multi-metric scoring catches complementary failure modes so models can’t game the system by optimizing one thing only.
04 Experiments & Results
The Test: What and Why
- The authors run 11 well-known TI2V models (closed- and open-source) on 467 tasks covering eight reasoning categories. They score each output using four metrics: Reasoning Alignment (RA), Temporal Consistency (TC), Physical Rationality (PR), and Visual Quality (VQ).
- Why: To see if models not only look good but also think and act like the real world under implicit constraints.
The Competition: Who’s In
- Closed-source contenders include Hailuo 2.3, Veo 3.1, Sora 2, Wan 2.6, Kling 2.6, and Seedance 1.5-pro—generally strong on visuals.
- Open-source contenders include Wan 2.2 (I2V and TI2V variants), HunyuanVideo-1.5 variants, and CogVideoX1.5-5B—often more accessible but currently weaker.
The Scoreboard (with context):
- Overall Accuracy (the “everything perfect at once” bar) is low for all models. The best, Hailuo 2.3, reaches only 22.5%. Think of that as getting 22.5 out of 100 on a super-strict, all-or-nothing test where every part must be perfect.
- Weighted Scores also separate models: closed-source models lead, open-source lag. Visual quality is high among leaders (often above 90%), but reasoning drags results down.
- By metric:
  - Reasoning Alignment: Hailuo 2.3 leads at about 76.6% RA, outscoring Wan 2.6 by 6.6%. Many models fail hidden steps (like unscrewing a cap) or make rule-breaking moves (like moving objects without contact).
  - Temporal Consistency: Sora 2 shines (around 92.2% TC), showing strong stability in non-instructed details, though it still shows discontinuities in tricky cases.
  - Physical Rationality: Several closed-source models average in the mid-to-high 70s. Failures include melting textures, detached flames, and abrupt motion jumps.
  - Visual Quality: Top systems score in the low-to-mid 90s after super-resolution, showing the field already makes sharp, attractive frames.
Category-Level Insights:
- Perceptual Knowledge (colors, sizes, counts) is the easiest—most models do fairly well.
- Logical Capability (games, mazes, symmetry) is the hardest—scores are consistently low across the board. In a Gold Miner-style setup, no model correctly grabs the stone from the shown hook position.
- Experiential Knowledge shows clear differences: only a few models (Hailuo 2.3, Veo 3.1) reliably include necessary steps like unscrewing bottle caps before drinking.
- Dynamics and Responsiveness: Some models (e.g., Kling 2.6) produce minimal motion, ignoring requested transformations. Others (Veo 3.1, Sora 2) sometimes follow instructions partially but not fully, or show abrupt temporal jumps.
Surprising Findings:
- Even with high visual quality, reasoning can be weak: looking real isn’t the same as acting real.
- LMM-as-judge can match humans closely when guided by reasoning-aware prompts and special puzzle graders. GPT-5 aligns best overall; gpt-5-mini is solid for visual quality at lower cost.
- Smart pre-processing (like super-resolution) meaningfully changes fairness: it avoids mislabeling low-res frames as blurry failures.
Bottom Line:
- Today’s TI2V models are “A-” in looks but “C-” in logic. The strict everything-perfect Accuracy score (best at 22.5%) shows how far we are from seamless world reasoning.
- RISE-Video pinpoints exactly where models stumble, giving builders a roadmap for improvement.
05 Discussion & Limitations
Limitations (be specific):
- Scope: RISE-Video targets reasoning in TI2V, not every aspect of video generation (e.g., music sync, long-form storytelling beyond the curated set, or interactive user edits).
- Dataset Size: 467 samples is strong for careful human annotation, but small compared to internet-scale data; rare edge cases may still be underrepresented.
- Judge Dependence: The primary judge is an LMM; although it aligns well with humans, it can inherit biases or blind spots, especially in unusual visuals.
- Physics Coverage: Physical Rationality is only scored for physically grounded scenes; abstract puzzles are excluded from it by design.
- Closed-Source Variance: Production systems may change over time or apply hidden post-processing, making exact reproducibility tricky.
Required Resources:
- To run the benchmark, you need: TI2V models (often GPU-heavy), the RISE-Video dataset, and API or local access to a capable LMM judge (e.g., GPT-5 for RA/TC/PR and gpt-5-mini for VQ), plus storage for generated clips.
- For those extending the benchmark: expert annotators for new tasks, careful prompt engineering, and validation against human raters.
When NOT to Use:
- Pure Aesthetics Tasks: If your goal is only cinematic look and style, a reasoning-heavy benchmark may be overkill.
- Non-Video or Non-Visual Reasoning: Questions that don’t manifest visually (e.g., internal thoughts) aren’t a fit.
- Extremely Long or Interactive Scenarios: The current sampling strategies target short-to-medium clips, not hour-long narratives or interactive simulations.
Open Questions:
- Training for Rules: What model training schemes best teach implicit rules—instruction tuning, rule-augmented data, physics engines, or hybrid symbolic-neural training?
- Better Judges: Can we build even more trustworthy, interpretable, and bias-resistant evaluators, perhaps combining LMMs with programmatic checks by default?
- Data Curriculum: What mix of commonsense, subject, and logical tasks creates the biggest gains per unit of training?
- Robustness: How do models behave under adversarial or ambiguous instructions that still have a “most reasonable” outcome?
- Generalization: Can a model learn rules from one domain (e.g., sports) and transfer them to another (e.g., lab chemistry) without explicit examples?
06 Conclusion & Future Work
Three-Sentence Summary:
- RISE-Video is a reasoning-first benchmark that tests whether Text-Image-to-Video models obey the world’s hidden rules, not just whether they look good.
- It covers eight reasoning types and scores each video with four complementary metrics, guided by an LMM-based judging pipeline and special graders for puzzles.
- Results across 11 state-of-the-art models show strong visuals but weak reasoning, with the top full-accuracy rate only 22.5%, revealing a major gap between appearance and understanding.
Main Achievement:
- The paper reframes video evaluation around rule-following intelligence and delivers a practical, scalable pipeline that aligns closely with human judgment, making reasoning failures visible and fixable.
Future Directions:
- Train video models with rule-aware data and objectives; blend neural generation with physics engines or symbolic constraints; expand puzzle coverage and long-horizon tasks; and continue improving judges for transparency and bias control.
Why Remember This:
- It marks a turning point: from eye-candy grading to world-rule grading. If we want video AIs that people can trust—in education, design, simulation, and beyond—then acting real is as important as looking real. RISE-Video gives the field a compass to get there.
Practical Applications
- Model debugging: Pinpoint whether failures come from physics, timing, logic, or perception, then retrain with targeted data.
- Safety checks: Verify that generated lab or sports demonstrations follow real-world rules before sharing with learners.
- Curriculum design: Create training curricula focused on weak categories (e.g., logical puzzles or experiential steps).
- Benchmarking and reporting: Provide standardized scores across four metrics to compare new TI2V models fairly.
- Automated QA in production: Use the LMM-as-judge pipeline to continuously test updates and catch regressions.
- Data curation: Mine or synthesize new examples in underperforming dimensions to balance training sets.
- Instruction design: Write sharper prompts using the benchmark’s question style to elicit better reasoning.
- Hybrid evaluation: Combine LMM judging with puzzle-specific structural checks for rigorous, low-cost assessment.
- Research ablations: Swap judges, sampling rates, or metrics to study what most improves human alignment.
- Educational demos: Generate videos for classrooms and verify they obey physics and correct procedures.