ProPhy: Progressive Physical Alignment for Dynamic World Simulation
Key Summary
- ProPhy is a new two-step method that helps video AIs follow real-world physics, not just make pretty pictures.
- It first guesses the big, overall physics from the text (like “combustion” or “liquid flow”), then fine-tunes where and when each rule applies in the video frames.
- A smart router picks which “physics experts” to use: some think globally about the scene and others fix tiny, local details.
- ProPhy borrows precise “where the action happens” hints from vision-language models (VLMs) to teach the generator where physical effects should show up.
- This makes videos respond anisotropically—different parts of the scene follow different physics—so dust only appears when an object hits the ground, for example.
- On the VideoPhy2 benchmark, ProPhy significantly boosts physical commonsense and overall pass rates compared to strong baselines like CogVideoX and Wan.
- It preserves or improves visual quality (VBench), especially in fast, dynamic motions, while keeping inference fully end-to-end (no external models at runtime).
- The system is modular and general: it layers onto popular video diffusion backbones and shows consistent gains across them.
- Limitations include depending on VLM-made training signals and learning physics from patterns rather than strict equations, so it’s plausible but not perfect.
- ProPhy points toward better world simulators where actions and reactions follow the rules you’d expect in real life.
Why This Research Matters
Better physics in generated videos makes digital worlds more trustworthy. Educators can demonstrate cause-and-effect (like collisions or fluid flow) safely and convincingly. Robotics and simulation can gain more reliable expectations about motion and interaction when testing ideas virtually. Creators get scenes that not only look real but also feel real, improving immersion. Scientists and engineers can prototype visually plausible setups before running costly precise simulations. Over time, such methods bring AI closer to world models that respect the rules of reality.
Detailed Explanation
01 Background & Problem Definition
You know how movie stunts can look amazing but still feel “off” when gravity or motion doesn’t behave right? Early AI video generators were a bit like that: they made scenes that looked sharp and colorful, but objects sometimes floated, splashed wrong, or passed through each other. These models learned how things look, not how things behave. Before this work, text-to-video systems based on diffusion and transformers grew bigger and better at realism, but they often broke physical rules in tricky situations: dust flying before a disc hits the ground, liquids moving against gravity, or collisions that ignore momentum.

People tried three main paths to fix this. First, physics simulators: write explicit equations (like rigid-body physics or fluid dynamics), simulate, then render. This gives accuracy, but it’s hard to generalize to open-world scenes and every possible prompt. Second, learning-based cues: transfer “motion relations” from vision encoders into the generator. This helps, but guidance is still coarse and often misses localized events. Third, add structured physical hints: guess a category like “combustion” from the text and route to a special module. This is clearer, but still too global—if multiple phenomena happen in different spots at once, the model struggles.

What was missing? Fine-grained, place-and-time-aware physics. Real scenes are patchy: smoke lives here, sparks fly there, liquid only rises where it’s poured. Prior systems treated physics almost isotropically, like turning the same dial for all pixels. Without local alignment, models blur physical cues across the whole frame, mixing dust, splash, and fire in the wrong places.

Researchers also noticed something useful: vision-language models (VLMs) are surprisingly good at pointing to where events happen in a frame, especially when asked focused questions. While they aren’t video generators, their attention maps can highlight the exact regions where “combustion” or “liquid motion” occurs. If a generator could learn from those maps during training, it might learn to place physics precisely, not just globally. So the gap was clear: we need explicit physics-aware conditioning that is both (1) discriminative across different laws (combustion vs refraction), and (2) locally aligned so different spatial regions obey the right rules at the right time.

The stakes are real. For education, kids learning physics won’t trust a tool that shows coffee flowing up. For robotics, planners need believable cause-and-effect. For creative studios, realism sells the story—fake-looking collisions break immersion.

Enter ProPhy. It adds a physical “coach” next to your usual video generator. First, it reads the prompt and forms a global, learnable “physics prior” (like turning on the right set of physics spotlights). Then, as frames are denoised, it refines those priors token-by-token (tiny space-time chunks), deciding exactly which rule applies where. To teach that second step, ProPhy uses VLM attention as a guide during training—like a teacher pointing to the correct spots on the page—so the model learns to focus physics where it belongs. The result is videos with sharper, more believable dynamics: dust on impact, not before; coffee that pours down and raises the liquid level reasonably; and collisions that look like momentum matters.
02 Core Idea
Top Bread (Hook): Imagine building a LEGO city. First, you sort pieces by type (roads, windows, doors). Then, when you assemble, you place each piece exactly where it belongs. That’s how ProPhy treats physics in videos—first pick the right kinds of physics, then put each one in the right place and time.
The Concept in One Sentence: ProPhy progressively aligns a video generator with physics by first choosing the right global physics (semantic stage) and then refining them locally per token (refinement stage), using VLMs to teach where each physical phenomenon happens.
Multiple Analogies:
- Orchestra: The conductor (semantic stage) decides which sections should play (strings for swelling motion, brass for bursts), while the section leaders (refinement stage) fine-tune which exact musicians come in on each bar.
- Cooking: Pick the right spices for the dish (semantic physics), then season different bites to taste (local refinement) so each mouthful feels just right.
- Map + GPS: The map tells you the overall route (global physics prior), while GPS updates turn-by-turn (local alignment) to keep you on the best path.
Before vs After:
- Before: Models often applied one-size-fits-all physics to the whole frame, confusing where effects should appear. They might know “combustion” is involved but light up the wrong regions.
- After: ProPhy makes different parts of the video obey different rules simultaneously (anisotropic response), so only the region with fire glows, only the point of impact makes dust, and only the poured area changes liquid level.
Why It Works (Intuition): Physics in videos is both categorical (what kind of phenomenon?) and localized (where exactly?). Split the problem. First, learn clean, discriminative categories as “physical priors.” Then, teach the model to assign these priors to the right tokens—using a strong teacher (VLM) that’s already good at localization. This two-stage setup helps experts specialize (less confusion) and puts physical effects where they belong (more accuracy).
Building Blocks (each explained with the Sandwich pattern):
- Top Bread (Hook): You know how a hospital triage sends each patient to the right specialist?
- Filling:
- What it is: A Mixture-of-Physics-Experts (MoPE) is a group of small expert modules, each focusing on different physical laws.
- How it works: A router looks at inputs and assigns weights to experts; their outputs combine to guide generation (a minimal routing sketch appears at the end of this section).
- Why it matters: Without MoPE, one generalist tries to do everything and often muddles different laws.
- Bottom Bread (Anchor): If the prompt mentions “refraction” and “splash,” MoPE can send glass-light behavior to one expert and liquid motion to another.
- Top Bread (Hook): Imagine a librarian who sorts books by topic before you start reading.
- Filling:
- What it is: Semantic Experts form global, video-level physical priors from the text.
- How it works: A semantic router scores learnable “basis maps” (combustion, liquid, collision) and blends them into the latent features.
- Why it matters: Without this big-picture step, the model may pick the wrong physics family entirely.
- Bottom Bread (Anchor): A prompt about a “campfire by snow” boosts combustion and thermodynamics priors globally.
- Top Bread (Hook): Think of a makeup artist touching up tiny spots before a photoshoot.
- Filling:
- What it is: Refinement Experts make token-level (tiny space-time) adjustments to apply the right law in the right place.
- How it works: For each token, a router picks top-k experts whose outputs refine that token’s features.
- Why it matters: Without local refinement, dust or fire might appear in the wrong regions.
- Bottom Bread (Anchor): Only the area around the torch tip gets “combustion” refinement; snow stays snow.
- Top Bread (Hook): Picture a friend who can watch a clip and explain what’s happening where.
- Filling:
- What it is: Vision-Language Models (VLMs) link images/videos to text and can localize phenomena via attention.
- How it works: Ask focused questions; use their attention maps to find where “combustion” or “liquid motion” occurs.
- Why it matters: Without a good teacher, the generator guesses and misplaces effects.
- Bottom Bread (Anchor): Ask a VLM “Where is the fire?” and use its heatmap to guide training.
- Top Bread (Hook): You know how rain only wets spots under a cloud, not the whole city?
- Filling:
- What it is: Fine-grained Physical Alignment teaches the model to respond differently across space-time.
- How it works: Compare the refinement router’s per-token predictions to VLM attention targets with a loss mask; learn to highlight only the right regions.
- Why it matters: Without it, physics spreads everywhere and looks fake.
- Bottom Bread (Anchor): Dust appears at ground impact, not midair.
- Top Bread (Hook): Think of marinating first, then searing for a perfect steak.
- Filling:
- What it is: Progressive Physical Alignment Framework = global semantic stage → local refinement stage.
- How it works: Inject global priors early; refine token-by-token later; train with category similarity and fine-grained alignment + balance.
- Why it matters: Skipping progression either loses global sense or fails locally.
- Bottom Bread (Anchor): A “pouring coffee by a campfire” clip gets heat + liquid globally, then exact flames and flow locally.
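To make the MoPE building block concrete, here is a minimal PyTorch sketch of a routed expert mixture. It is an illustration under my own assumptions (class names such as `PhysicsExpert`, the expert count, and the feature size are hypothetical), not the authors' implementation, and it uses a dense softmax blend for clarity; ProPhy's refinement stage additionally uses sparse top-k routing per token, sketched in the Methodology section below.

```python
import torch
import torch.nn as nn

class PhysicsExpert(nn.Module):
    """One small expert module; conceptually, each specializes in a physics family."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        return self.proj(x)

class MixtureOfPhysicsExperts(nn.Module):
    """A router scores the experts from the input and blends their outputs."""
    def __init__(self, dim: int, num_experts: int = 8):
        super().__init__()
        self.experts = nn.ModuleList([PhysicsExpert(dim) for _ in range(num_experts)])
        self.router = nn.Linear(dim, num_experts)

    def forward(self, x):                                    # x: (batch, tokens, dim)
        weights = self.router(x).softmax(dim=-1)             # (B, T, E) expert weights
        outputs = torch.stack([e(x) for e in self.experts], dim=-1)  # (B, T, D, E)
        return torch.einsum("bte,btde->btd", weights, outputs)

# Usage: blend expert outputs for a toy latent of 16 space-time tokens.
moe = MixtureOfPhysicsExperts(dim=64)
latent = torch.randn(2, 16, 64)
guided = moe(latent)  # same shape as the latent, weighted by the router
```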
03 Methodology
High-Level Recipe: Text prompt → Semantic Expert Block (global physics priors) → Physical Blocks (carry priors through the generator) → Refinement Expert Block (token-level physics) → Denoised video.
Step 1: Input and Backbone
- What happens: The system takes a text description and starts denoising random noise into a video using a diffusion-transformer backbone (like Wan or CogVideoX).
- Why this step exists: The backbone is the skilled “painter” that creates visuals; ProPhy guides its physics.
- Example: Prompt: “A child kicks a soccer ball, then catches it.” The backbone paints the ball, the field, and the child.
Step 2: Semantic Expert Block (SEB) — Global Prior Injection
- What happens: The text embedding goes to a semantic router that scores a bank of learnable physical basis maps (experts). The weighted sum is added to the video latent, forming a physics-enhanced latent.
- Why this step exists: It selects the right physics family (e.g., collision, liquid flow, combustion) at the video level; without it, the model may start from the wrong assumptions.
- Example with data: If WISA categories suggest “collision + dust,” the router raises those maps’ weights; the latent gets nudged toward impact-aware features.
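A minimal sketch of the SEB step just described, under assumptions of my own: the class name `SemanticExpertBlock`, the pooled prompt embedding, and representing each physical basis as a single latent-channel vector are simplifications (the paper's basis maps may carry spatial structure), so this shows the routing-and-addition mechanism rather than the exact module.

```python
import torch
import torch.nn as nn

class SemanticExpertBlock(nn.Module):
    """Global prior injection: the semantic router scores learnable physics bases
    from the text embedding, and their weighted sum is added to the video latent."""
    def __init__(self, text_dim: int, latent_dim: int, num_basis: int = 16):
        super().__init__()
        self.basis = nn.Parameter(torch.randn(num_basis, latent_dim) * 0.02)
        self.router = nn.Linear(text_dim, num_basis)

    def forward(self, latent, text_emb):
        # latent: (B, T, D) space-time tokens; text_emb: (B, D_text) pooled prompt embedding
        weights = self.router(text_emb).softmax(dim=-1)   # (B, num_basis) global routing
        prior = weights @ self.basis                      # (B, D) blended physics prior
        return latent + prior.unsqueeze(1), weights       # broadcast the prior to all tokens

# Usage: a "campfire" prompt embedding nudges every token toward combustion-like priors.
seb = SemanticExpertBlock(text_dim=128, latent_dim=64)
latent = torch.randn(2, 16, 64)
text_emb = torch.randn(2, 128)
physics_latent, seb_routing = seb(latent, text_emb)  # seb_routing later feeds the semantic loss
```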
Step 3: Physical Blocks (PB) — Progressive Carriers
- What happens: Multiple PBs (mirroring backbone transformer layers and initialized from them) progressively carry and mix the physics-enhanced latent through the denoising steps.
- Why this step exists: It preserves the backbone’s strong rendering while gradually blending in physics cues, avoiding sudden, destabilizing changes.
- Example: As frames sharpen, hints of where impact might occur become clearer in the intermediate features.
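The "initialized from them" detail can be pictured with a toy sketch. The layer type, depth, and the one-Physical-Block-per-four-backbone-layers spacing are assumptions for illustration only; the actual backbones use DiT-style video blocks.

```python
import copy
import torch.nn as nn

# Toy stand-in for backbone transformer blocks.
def make_block(dim: int = 64) -> nn.Module:
    return nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)

backbone_blocks = nn.ModuleList([make_block() for _ in range(12)])

# Physical Blocks start as copies of evenly spaced backbone layers, so they inherit
# the backbone's rendering behavior before being fine-tuned to carry physics cues.
stride = 4  # assumption: one Physical Block mirrors every fourth backbone layer
physical_blocks = nn.ModuleList(
    [copy.deepcopy(backbone_blocks[i]) for i in range(0, len(backbone_blocks), stride)]
)
```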
Step 4: Refinement Expert Block (REB) — Token-Level Routing
- What happens: For each space-time token, a refinement router picks the top-k refinement experts (linear layers) to apply; their combined output adjusts that token’s features.
- Why this step exists: Different regions need different physics at the same time; without token-level routing, physics gets smeared across the frame.
- Example: Only tokens near the disc-ground contact get “dust emission” refinement; sky tokens remain unaffected.
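A sketch of per-token top-k routing for the REB, again with hypothetical names and sizes; all experts are computed densely here for readability, whereas an efficient implementation would dispatch tokens to experts sparsely.

```python
import torch
import torch.nn as nn

class RefinementExpertBlock(nn.Module):
    """Each space-time token is refined by its top-k experts (simple linear layers)."""
    def __init__(self, dim: int, num_experts: int = 8, k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])
        self.router = nn.Linear(dim, num_experts)
        self.k = k

    def forward(self, x):                                     # x: (B, T, D)
        probs = self.router(x).softmax(dim=-1)                # (B, T, E) per-token routing
        topk_w, topk_idx = probs.topk(self.k, dim=-1)         # (B, T, k)
        all_out = torch.stack([e(x) for e in self.experts], dim=-2)  # (B, T, E, D)
        idx = topk_idx.unsqueeze(-1).expand(*topk_idx.shape, x.size(-1))
        chosen = torch.gather(all_out, dim=-2, index=idx)     # (B, T, k, D)
        refine = (topk_w.unsqueeze(-1) * chosen).sum(dim=-2)  # weighted top-k combination
        return x + refine, probs                              # residual update + routing map

# Usage: refine a toy token grid; the routing map is what gets aligned to VLM attention.
reb = RefinementExpertBlock(dim=64)
tokens = torch.randn(2, 16, 64)
refined, routing_probs = reb(tokens)
```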
Step 5: Physical Alignment Objectives — Teaching the Experts
- Semantic alignment (SEB): Within a batch, samples from the same physical category (from WISA-style labels) are encouraged to have similar routing distributions; different categories diverge. This is implemented via a relative, cosine-similarity loss over pairwise routing similarities (sketched in code after this list).
- Why: It makes semantic experts discriminative—“combustion” and “explosion” related, “refraction” different—so the global prior is meaningful.
- Example: Two “combustion” prompts end up with similar SEB weight patterns; an unrelated “refraction” prompt differs.
- Fine-grained alignment (REB): Use VLM attention to mark where a phenomenon occurs. Ask two questions per video: (1) specific phenomenon (e.g., “Describe the combustion”), (2) generic background. Subtract attentions to get a sharp map of the phenomenon. After masking and filtering, supervise only high-confidence tokens. An MLP projects router outputs to match the attention dimension before computing the loss; a load-balance loss prevents a few experts from hogging all tokens.
- Why: This directly teaches the router to activate the right expert only where needed and to keep specialists fairly used across data.
- Example: The “liquid motion” attention highlights the pouring stream and puddle region; the router learns to light up matching tokens.
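The three training signals in Step 5 can be written down compactly. The sketch below is a plausible reading, not the paper's exact formulation: the masked regression for the fine-grained loss, the quantile-based confidence filter, the uniform-usage penalty for load balance, and all function names and weights are assumptions, and the VLM attention maps are taken as given tensors (the specific-minus-generic subtraction mirrors the two-question trick described above).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def semantic_alignment_loss(seb_routing, labels):
    """seb_routing: (B, E) SEB router weights; labels: (B,) physics-category ids.
    Same-category pairs are pulled toward high cosine similarity, different-category
    pairs toward low similarity."""
    r = F.normalize(seb_routing, dim=-1)
    sim = r @ r.t()                                           # (B, B) pairwise cosine similarity
    same = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    off_diag = 1.0 - torch.eye(len(labels), device=seb_routing.device)
    return (((sim - same) ** 2) * off_diag).sum() / off_diag.sum().clamp(min=1.0)

def fine_grained_targets(attn_specific, attn_generic, keep_ratio=0.2):
    """Build per-token targets from VLM attention: subtract the generic-question map,
    normalize, and keep only the most confident tokens as supervision."""
    diff = (attn_specific - attn_generic).clamp(min=0)        # (B, T)
    diff = diff / diff.amax(dim=-1, keepdim=True).clamp(min=1e-6)
    thresh = torch.quantile(diff, 1.0 - keep_ratio, dim=-1, keepdim=True)
    mask = (diff >= thresh).float()                           # high-confidence tokens only
    return diff, mask

def fine_grained_alignment_loss(router_probs, proj, target, mask):
    """router_probs: (B, T, E) REB routing; proj maps E -> 1 to match the attention map."""
    pred = proj(router_probs).squeeze(-1)                     # (B, T)
    return (((pred - target) ** 2) * mask).sum() / mask.sum().clamp(min=1.0)

def load_balance_loss(router_probs):
    """Penalize deviation of mean per-expert usage from uniform (1/E), so a few
    experts cannot hog all tokens."""
    usage = router_probs.mean(dim=(0, 1))                     # (E,)
    return ((usage - 1.0 / usage.numel()) ** 2).sum()

# Usage with toy shapes; in training these come from the SEB/REB routers and the VLM.
B, T, E = 4, 16, 8
seb_routing = torch.randn(B, E).softmax(dim=-1)
labels = torch.tensor([0, 0, 1, 2])
reb_probs = torch.randn(B, T, E).softmax(dim=-1)
attn_specific, attn_generic = torch.rand(B, T), torch.rand(B, T)
target, mask = fine_grained_targets(attn_specific, attn_generic)
proj = nn.Linear(E, 1)
total = (semantic_alignment_loss(seb_routing, labels)
         + fine_grained_alignment_loss(reb_probs, proj, target, mask)
         + 0.01 * load_balance_loss(reb_probs))
```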
Step 6: Denoising to Output
- What happens: With global and local priors guiding it, the diffusion backbone iteratively denoises the latent into a final video.
- Why this step exists: It transforms guidance into pixels and motion.
- Example: The final sequence shows dust only on impact, ball trajectories that look plausible, and liquid levels that respond sensibly.
The Secret Sauce
- Progressive two-stage design: First get the category right, then get the location right. This reduces confusion and overfitting.
- VLM-as-teacher: Borrow VLM’s strong localization at training time, but keep inference end-to-end (no external models needed later).
- Token routing with balance: Top-k per-token experts let different regions specialize; load balancing averts mode collapse.
- Reuse backbone layers: Initializing PBs from the backbone preserves visual quality while adding physics control.
What Breaks Without Each Part
- No SEB: The model may pick the wrong physics family and drift.
- No REB: Effects appear in the wrong places or times.
- No VLM alignment: The router lacks a precise teacher and learns fuzzy boundaries.
- No load-balance: A few experts dominate, others never learn; performance stalls.
04 Experiments & Results
The Test: The team used VideoPhy2, which grades two things that matter: Physical Commonsense (PC: does physics look right?) and Semantic Adherence (SA: did you actually do what the prompt asked?). A strict Joint score requires both PC and SA to be strong for the same video. They also checked VBench quality metrics to ensure visuals stayed good or improved.
The Competition: ProPhy was added to strong base models (Wan2.1-1.3B and CogVideoX-5B) and compared to them alone and to physics-aware upgrades like WISA and VideoREPA.
The Scoreboard (with context):
- On VideoPhy2 (ALL prompts), adding ProPhy to Wan2.1-1.3B boosted the key Joint score by about +19.7% absolute. That’s like moving from a solid B to an A on the hardest part of the test: being right and looking right at the same time.
- On the harder subset (HARD), ProPhy still led or tied top performance across metrics, showing it handles fast, complex motions better than baselines.
- On CogVideoX-5B, ProPhy achieved best or second-best results across PC, SA, and Joint, matching or beating other physics-aware rivals.
- On VBench, ProPhy maintained strong visuals and especially improved Dynamic Degree, meaning it handled high-motion scenes more convincingly. The overall Quality Score went up too, showing physics gains did not come at the cost of beauty.
Why These Numbers Matter: Joint is the real-world test—viewers need both correct actions and faithful physics. Big gains there mean fewer “that looks wrong” moments. The Dynamic Degree lift shows better handling of splashes, collisions, and speedy motions—the exact cases where physics mistakes are most obvious.
Surprising Findings:
- VLM attention really helps: even with some noise, using it as a teacher made local alignment much sharper. Human checks estimated around 77% of these fine-grained labels were accurate enough, which was sufficient to teach the generator useful localization.
- Balance beats brute force: The load-balance loss for the refinement router prevented expert overuse and noticeably stabilized learning.
- Relative (cosine) semantic loss worked better than a plain BCE in aligning categories: it kept PC and Joint high without sacrificing SA.
Takeaway Clips (qualitative):
- Discus throw: Dust appears on impact, not traveling with the disc midair.
- Iron-ball collision: The smaller ball moves realistically after being hit (momentum transfer), instead of fusing or passing through.
- Pouring coffee near fire: Flames stay on the torch; liquid level rises reasonably; no random “ignition” of the coffee itself.
05 Discussion & Limitations
Limitations:
- Training supervision depends on VLM attention maps that can be noisy, especially for subtle dynamics. While masked and filtered, some imprecision remains.
- The model learns physics from patterns, not explicit equations. It aims for plausible behavior, not guaranteed exact solutions to the laws of motion or fluid dynamics.
- Multi-phenomenon scenes with very tiny effects can still confuse the router, leading to faint misplacements.
Required Resources:
- A capable video diffusion backbone (e.g., Wan2.1-1.3B or CogVideoX-5B).
- GPU memory to host additional experts (roughly +20–30% parameters) and training compute to run alignment losses.
- Access to a VLM (only during training) to produce attention maps.
When NOT to Use:
- If you need physics-grade accuracy bound by differential equations (e.g., engineering sims), ProPhy’s pattern-based realism isn’t enough.
- If your content rarely contains physical events (mostly static scenes), the added complexity may not pay off.
- If you cannot afford the extra training cost or don’t have a suitable VLM for alignment, benefits drop.
Open Questions:
- Can we combine this with lightweight, equation-inspired constraints (e.g., conservation hints) to further raise accuracy?
- How can we auto-clean VLM attention to reduce noise, especially for small or fast-moving phenomena?
- Can experts become interpretable knobs (e.g., a “friction” slider) for controllable physics editing at inference time?
- What’s the best curriculum: start global then local, or interleave multiple local teachers (e.g., motion, thermodynamics, optics) over time?
06 Conclusion & Future Work
Three-Sentence Summary: ProPhy teaches video generators to obey physics by first selecting the right global physics and then aligning them locally, token by token. It borrows precise “where it happens” signals from VLMs during training so that dust, fire, liquids, and collisions appear in the right places at the right times. This brings big gains in physical commonsense and overall pass rates without sacrificing visual quality.
Main Achievement: A progressive, two-stage Mixture-of-Physics-Experts with fine-grained alignment that reliably turns textual physics cues into anisotropic, region-specific behaviors in generated videos.
Future Directions: Add light-touch equation-based constraints for core laws (e.g., momentum, continuity) to sharpen correctness; improve VLM signal cleaning; expose experts as user controls for physics editing; expand to longer, interactive scenes.
Why Remember This: ProPhy shows that realism isn’t just how things look—it’s how they act. By splitting “what physics?” from “where and when?” and letting a good teacher guide localization, it sets a practical path toward world simulators that feel physically believable.
Practical Applications
- Interactive science demos that show students realistic collisions, splashes, and heat effects from plain-language prompts.
- Previsualization for films and games where directors check whether stunts, dust, and debris behave believably before full production.
- Robotics sandboxing to imagine and evaluate motion plans with more physically plausible reactions.
- Design brainstorming for products involving fluids or materials (e.g., testing how syrup spreads on a surface) at a visual prototype level.
- Training data generation for downstream perception models that benefit from physically consistent motion cues.
- Augmented reality previews where added effects (smoke, sparks, rain) stick to the right places and obey gravity.
- Safety education clips (e.g., what happens when a hot object touches fabric) that look realistic without risking harm.
- Sports analysis visuals that better mimic ball spins, bounces, and dust on impact for coaching content.
- Museum and classroom exhibits that explain optics (reflection, refraction) using convincing synthetic videos.
- Content moderation and QA pipelines to flag likely physics violations (as a heuristic) in auto-generated media.