
PhyGDPO: Physics-Aware Groupwise Direct Preference Optimization for Physically Consistent Text-to-Video Generation

Intermediate
Yuanhao Cai, Kunpeng Li, Menglin Jia et al. · 12/31/2025
arXiv · PDF

Key Summary

  • This paper teaches text-to-video models to follow real-world physics, so people, balls, water, glass, and fire act the way they should.
  • It builds a physics-rich training set (PhyVidGen-135K) using a vision-language model that reasons step-by-step to find videos with strong physical interactions.
  • Instead of judging videos in pairs, it ranks whole groups at once (groupwise Plackett–Luce), which better captures big-picture qualities like motion smoothness and plausibility.
  • A physics-guided rewarding scheme uses a physics-aware VLM to give more training weight to hard, physics-breaking cases so the model learns faster where it struggles.
  • A memory-saving LoRA-Switch Reference trains the model without duplicating the full network, making preference training more stable and efficient.
  • On VideoPhy2 and PhyGenBench, the method beats leading open-source approaches and even outperforms strong commercial systems on some tough action categories.
  • Human testers consistently preferred the physics of PhyGDPO's videos over other methods, showing that the improvements are noticeable to people, not just machines.
  • This approach reduces body deformations, improves object interactions (like balls and rackets or mallets), and models phenomena like refraction and fire spread more realistically.
  • The key idea is aligning video generation with physics using groupwise preferences and physics-aware rewards, not just better prompts or bigger models.

Why This Research Matters

Videos that follow physics are more trustworthy and useful for real-world tasks. Robots, self-driving systems, and training simulators need realistic motion to learn safe behavior. Filmmakers and game designers benefit from believable actions, collisions, and materials without hand-crafting physics every time. Students and educators can rely on physics-faithful demonstrations to teach concepts like momentum, refraction, and combustion. This approach reduces the gap between pretty visuals and real cause-and-effect, making AI a better world simulator. It also does so efficiently, saving memory and stabilizing training so more teams can adopt it.

Detailed Explanation


01 Background & Problem Definition

🍞 Hook: Imagine watching a superhero movie where a basketball floats up to the hoop, pauses in mid-air, and then shoots sideways without anyone touching it. Even if the video looks crisp and colorful, your brain shouts, "That's not how the world works!"

🥬 The Concept (Text-to-Video Generation): Text-to-video (T2V) models turn sentences into short video clips. How it works, in simple steps: (1) You type a prompt like "A soccer player kicks a ball." (2) The model imagines frames that match the words. (3) It fills in details (people, motion, lighting) and stitches frames into a video. Why it matters: Without understanding physics, the model may produce pretty but impossible motion, like legs bending wrong or balls teleporting.

🍞 Anchor: Ask a T2V model for "A gymnast does a forward roll on a beam." A good video shows smooth, balanced motion; a bad one has a twisting torso that warps like rubber.

The World Before: Early T2V systems got better at sharp images, color, and style, thanks to huge datasets and powerful transformers. But there was a catch: they often messed up physics. People's arms bent strangely, objects clipped through each other, and forces like gravity or friction seemed optional. Two popular fixes didn't fully solve the problem:

  • Graphics-based simulators could make perfect bounces and rigid-body motion, but only in simple worlds with carefully set parameters. Real scenes are messy: clothes flap, glass shatters, water ripples, people move with muscles and balance.
  • Prompt-extension with large language models (LLMs) tried to stuff physics into the text: "mention gravity," "add friction," "describe energy transfer." But if the video model can't truly reason about physics, adding more words doesn't guarantee better motion. Worse, LLMs can be wrong about physics, which misleads training and generation.

The Problem: Models trained only by matching videos to captions learn to imitate appearance, not cause-and-effect. They also lack negative signals: examples that say "This looks realistic" versus "This violates physics." That makes it hard to learn the difference between smooth, lawful motion and physically broken motion.

🍞 Hook: You know how picking a movie with just two trailers is easier than judging the whole year's films? But to pick a true "best," you need to compare many at once.

🥬 The Concept (Direct Preference Optimization): DPO teaches models by comparing outputs and preferring better ones. How it works: (1) Generate alternatives. (2) Decide which is preferred. (3) Push the model to make preferred outputs more likely. Why it matters: Classic DPO usually compares just pairs, missing the big-picture signal you get by seeing a whole lineup at once.

🍞 Anchor: If you judge only two soccer videos, you might miss that a third one shows the ball curving naturally with spin. A group comparison spots that.
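To make the DPO recipe concrete, here is a minimal PyTorch sketch of the classic pairwise DPO loss. It assumes you already have per-sample log-likelihoods from the trainable model and from a frozen reference; the function name and the toy numbers are illustrative, and this is the standard pairwise form, not the paper's groupwise, video-specific variant described later.

```python
import torch
import torch.nn.functional as F

def pairwise_dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    # Implicit reward: how much more the trainable model likes a sample than the frozen reference does.
    margin_win = logp_win - ref_logp_win
    margin_lose = logp_lose - ref_logp_lose
    # Push the winner's margin above the loser's; beta controls how far the model may drift from the reference.
    return -F.logsigmoid(beta * (margin_win - margin_lose)).mean()

# Toy numbers: the trainable model already likes the winner slightly more than the reference does.
policy_win, policy_lose = torch.tensor([-10.0]), torch.tensor([-12.0])
ref_win, ref_lose = torch.tensor([-11.0]), torch.tensor([-11.5])
print(pairwise_dpo_loss(policy_win, policy_lose, ref_win, ref_lose))
```

Because this loss only ever sees two candidates at a time, it cannot tell whether the winner would also beat a third, smoother clip, which is exactly the limitation the groupwise approach targets.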

Failed Attempts: Pairwise DPO and aesthetics-focused alignment improved beauty and crispness more than physics. Memory-heavy DPO, which duplicates a full reference model, slowed training and sometimes destabilized it. And without physics-aware scoring, the "winner" could still be physically wrong but prettier.

The Gap: We needed three ingredients working together: (1) a large set of videos rich in physical interactions, (2) a way to judge groups of videos so big-picture realism (like momentum, collisions, and smooth joints) shows up in the signal, and (3) a physics tutor that points training toward the hardest, most error-prone cases.

🍞 Hook: Think of a science fair judge who not only ranks all projects together but also gives extra attention to the trickiest experiments, like building a working bridge out of spaghetti.

🥬 The Concept (Physical Consistency): Physical consistency means videos follow real-world rules: gravity pulls down, glass shatters into shards, a kicked ball accelerates and then slows. How it works: (1) Spot objects and materials. (2) Track forces and interactions. (3) Check outcomes match common-sense physics. Why it matters: Without it, videos feel uncanny or useless for planning, training robots, games, or education.

🍞 Anchor: When a baseball hits a bottle, you expect cracks, shards flying outward, and gravity pulling them down—not the bottle bouncing like a beach ball.

Real Stakes: Why should we care? Because physics-aware videos aren't just cooler; they're safer and more useful. Robots trained with fake physics might drop objects. Driving simulators with wrong motion could teach bad habits. Filmmakers want believable stunts. Game developers need consistent rules. Even students learn science better from demonstrations that follow real laws.

This paper's pitch: Build a physics-savvy training set, judge whole groups of videos (not just pairs), use a physics-aware critic to reward or penalize, and make training efficient with a memory-saving reference. That combination (data + groupwise judging + physics rewards + efficient training) closes the physics gap in text-to-video.

02 Core Idea

🍞 Hook: Imagine coaching a soccer team by playing one-on-one drills only. Helpful? Sure. But when you watch the whole team scrimmage, you suddenly see timing, spacing, and teamwork: the big picture you missed.

🥬 The Concept (Aha!): Align text-to-video models with real physics by ranking whole groups of candidate videos using physics-aware rewards, and train efficiently using a lightweight switchable reference.

How it works, at a glance:

  1. Build a physics-rich dataset so the model actually sees real interactions.
  2. Generate groups of candidate videos per prompt.
  3. Use a groupwise ranking model (Plackett–Luce) so the winner must beat the entire group, not just one opponent.
  4. Ask a physics-aware vision-language model for guidance, giving extra weight to the hardest, most physics-violating cases.
  5. Train without copying the full model, using a LoRA-Switch Reference to save memory and keep training stable.

Why it matters: Pairwise comparisons miss holistic motion; physics-ignorant rewards chase pretty but impossible videos; memory-heavy training is slow and unstable. This idea fixes all three.

🍞 Anchor: For "A player dunks a basketball," the method generates multiple clips, spots the one where the ball arcs naturally and passes through the hoop without ghosting or warping, and learns from that choice, over and over.

Multiple analogies:

  • Science fair: Instead of judging two volcanoes, the judge ranks all volcano projects together, focusing extra on those with tricky chemistry.
  • Classroom essay grading: The teacher compares the whole stack, not just one pair, and gives extra feedback on the hardest prompts to lift class performance.
  • Cooking contest: Judges pick the best dish across all entries at a table, and chefs get extra points for mastering the toughest techniques (like perfect soufflés).

Before vs After:

  • Before: T2V models looked good but bent physics; training aligned to aesthetics or pairwise preferences; memory costs were high.
  • After: T2V models keep bodies stable, collisions believable, and phenomena like refraction and fire propagation realistic; training uses holistic preferences and physics-aware weights; memory is slim and stable via LoRA-Switch.

Why it works (intuition):

  • Groupwise ranking forces the winner to be globally best, capturing smoothness, timing, and cause-and-effect that pairwise misses.
  • Physics-guided rewards act like a tutor, steering the model's attention to tough, error-prone motions, so it learns where it struggles most.
  • Memory-efficient reference (LoRA-SR) keeps the student and the reference close, reducing drift and training wobble, while saving compute.

Building blocks (with Sandwich explanations on first use):

🍞 Hook: You know how a librarian picks books for a science shelf by skimming for experiments, materials, and results? 🥬 The Concept (PhyAugPipe): PhyAugPipe is a pipeline that builds a physics-rich video dataset by filtering and organizing videos with strong physical interactions. How it works: (1) A vision-language model parses objects, materials, and actions; (2) it reasons step-by-step (chain-of-thought) about forces and outcomes; (3) it scores physics richness and extends prompts with clear causal details; (4) it clusters by action type; (5) it samples more from hard categories. Why it matters: If the model never sees real physics, it can't learn it. 🍞 Anchor: From a million videos, it picks things like "soccer kicks," "glass shattering," and "paper burning," not just "cute dog sitting."

🍞 Hook: Imagine choosing the best from a whole lineup, not just choosing between two. 🥬 The Concept (Groupwise Plackett–Luce): Groupwise Plackett–Luce ranks a set of videos so the winner must be better than all others together. How it works: (1) Make a candidate set; (2) score each; (3) compute the chance each is the top choice; (4) train the model to make the true winner likelier. Why it matters: Pairwise misses global context; groupwise captures whole-scene realism. 🍞 Anchor: For five gymnastics clips, the true best shows smooth balance and joint limits; it beats all four others at once.
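In rough mathematical terms (using illustrative notation rather than the paper's exact symbols), the Plackett–Luce model turns per-video scores into the probability that a given candidate is ranked first in its group:

```latex
% Probability that candidate i tops a group of K videos with scores s_1, ..., s_K.
% Here s_i stands for an implicit reward assigned to video i (notation is illustrative).
P(\text{video } i \text{ ranked first}) = \frac{\exp(s_i)}{\sum_{j=1}^{K} \exp(s_j)}
```

Training then raises this probability for the true winner, which is what "the winner must beat the entire group" means in practice.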

🍞 Hook: Think of a coach who gives extra practice to the toughest moves. 🥬 The Concept (Physics-Guided Rewarding, PGR): PGR uses a physics-aware VLM to give more weight to clips that break physics or are more complex, so training focuses on the hardest parts. How it works: (1) Get physics and semantic scores; (2) turn them into weights; (3) emphasize correcting low-scoring (hard) cases. Why it matters: The model learns fastest by fixing its biggest mistakes. 🍞 Anchor: If a squash video botches ball–wall bounces, PGR dials up that case so the model corrects it.

🍞 Hook: You know how adding a small plugin to a big app is lighter than copying the whole app? 🥬 The Concept (LoRA-Switch Reference, LoRA-SR): LoRA-SR freezes the big backbone as the reference and swaps tiny LoRA adapters for training vs reference, avoiding copying the whole model. How it works: (1) Keep one backbone; (2) attach trainable LoRA modules; (3) flip a switch to evaluate trained or reference mode. Why it matters: Saves memory, speeds training, and keeps the student close to the reference, improving stability. 🍞 Anchor: It's like using one TV and swapping HDMI inputs instead of buying two TVs.

🍞 Hook: Solving math problems step-by-step is clearer than jumping to the answer. 🥬 The Concept (Chain-of-Thought Reasoning with a VLM): A vision-language model explains physics in steps (objects, forces, interactions, outcomes) before scoring. How it works: parse → check vision → reason about forces → rate richness → extend prompt. Why it matters: Step-by-step logic catches errors and clarifies causality. 🍞 Anchor: For "pouring charcoal and lighting it," it notes charcoal, oxygen, flame, heat increase, and smoke rise.
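The paper's exact instructions to the VLM are not reproduced here, so the template below is a hypothetical reconstruction of this chain-of-thought scoring step; the constant name and the wording are illustrative only.

```python
# Hypothetical chain-of-thought scoring prompt; not the paper's actual wording.
PHYSICS_COT_PROMPT = """You will see a caption and sampled frames from a video.
Reason step by step:
1. List the visible objects and their materials (rigid, brittle, fluid, ...).
2. Identify the forces and interactions (impact, gravity, friction, combustion, ...).
3. Check that the frames actually show these interactions (do not hallucinate).
4. Explain how the forces should cause the outcome seen in the frames.
5. Output a physics-richness score between 0 and 1, then a one-sentence,
   physics-aware extension of the original caption.
Caption: {caption}
"""

def build_scoring_request(caption: str) -> str:
    # Fill the template for one video; the actual VLM call is omitted in this sketch.
    return PHYSICS_COT_PROMPT.format(caption=caption)

print(build_scoring_request("A baseball bat smashes a glass bottle."))
```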

03 Methodology

At a high level: Text prompt → PhyAugPipe builds physics-rich training data and sampling plan → Generate groups of candidate videos → Groupwise preference optimization with physics-guided rewarding → Memory-efficient training with LoRA-SR → Physically consistent video output.

Step 1: Build physics-rich training data (PhyAugPipe)

  • What happens: Start from a large text–video pool (~1M). A vision-language model (Qwen2.5) follows a chain-of-thought script: it parses objects (ball, glass), materials (metal, water), actions (kick, shatter), and forces (push, gravity). It checks frames to avoid hallucinations, reasons how forces cause outcomes, assigns a 0–1 physics-richness score, and writes a short physics-aware prompt extension. Then it clusters videos into action categories (like gymnastics, squash, glass smashing) via sentence embeddings and counts the category distribution. Finally, it samples more from categories where models perform poorly, guided by a physics-aware VLM (VideoCon-Physics) that scores semantic adherence and physics common sense. A compact sketch of this flow appears after this list.
  • Why it exists: Without targeted data, the model sees too many easy scenes and too few physically rich ones. Clustering and sampling ensure balance and focus on hard motions.
  • Example: For "A baseball bat smashes a glass bottle," the parser identifies bat (rigid), glass (brittle), force (impact), and outcome (fracture, shards flying). It scores high and keeps it. Meanwhile, a simple "A house with a red roof" gets a low score and is filtered out.
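Here is a compact, hypothetical sketch of that flow. The small helper functions are trivial stand-ins for the VLM components (Qwen2.5 chain-of-thought scoring, embedding-based clustering, VideoCon-Physics difficulty scoring); none of them are real APIs, they just let the sketch run end to end.

```python
import random
from collections import defaultdict

def score_physics_richness(video):
    # Stand-in for the chain-of-thought VLM richness score (Qwen2.5 in the paper).
    return video["richness"]

def assign_cluster(video):
    # Stand-in for sentence-embedding plus clustering of the extended prompt.
    return video["action"]

def base_model_physics_score(action):
    # Stand-in for how well the base T2V model already handles this action category
    # (a physics-aware VLM such as VideoCon-Physics would provide this signal).
    return {"gymnastics": 0.3, "glass smashing": 0.4, "scenery": 0.9}.get(action, 0.5)

def build_physics_dataset(pool, threshold=0.6, budget=10, seed=0):
    rng = random.Random(seed)
    # 1) Filter: keep only videos whose physics-richness score clears the bar.
    kept = [v for v in pool if score_physics_richness(v) >= threshold]
    # 2) Cluster by action category.
    clusters = defaultdict(list)
    for v in kept:
        clusters[assign_cluster(v)].append(v)
    # 3) Oversample hard categories, where the base model's physics score is low.
    hardness = {c: 1.0 - base_model_physics_score(c) for c in clusters}
    total = sum(hardness.values()) or 1.0
    sampled = []
    for c, vids in clusters.items():
        sampled += rng.choices(vids, k=max(1, round(budget * hardness[c] / total)))
    return sampled

pool = [
    {"prompt": "A baseball bat smashes a glass bottle", "action": "glass smashing", "richness": 0.9},
    {"prompt": "A house with a red roof", "action": "scenery", "richness": 0.1},
    {"prompt": "A gymnast performs a forward roll", "action": "gymnastics", "richness": 0.8},
]
print(len(build_physics_dataset(pool)))
```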

Step 2: Form groups of candidates per prompt

  • What happens: For each selected prompt, the T2V model (e.g., Wan2.1) generates several different videos via different random seeds. Together with the real video (when available), these form a group; a small sampling sketch follows this list.
  • Why it exists: A group exposes variety—smooth vs jerky motion, correct vs impossible collisions—so the training sees the whole landscape at once.
  • Example: For "A gymnast performs a forward roll on a balance beam," five generated clips differ subtly: one with stable torso and hands, one with ankle slip, one with arm warp, etc. The real clip is the gold standard.
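A minimal sketch of that grouping step, assuming a hypothetical generate_video(prompt, seed) wrapper around the T2V backbone:

```python
def generate_video(prompt, seed):
    # Stand-in for sampling the T2V backbone (e.g., Wan2.1) with a fixed random seed.
    return {"prompt": prompt, "seed": seed}

def build_candidate_group(prompt, real_video=None, group_size=5):
    # Seed-varied candidates expose different motions for the same prompt.
    group = [generate_video(prompt, seed=s) for s in range(group_size)]
    if real_video is not None:
        group.append(real_video)  # the real clip acts as the gold-standard member
    return group

group = build_candidate_group("A gymnast performs a forward roll on a balance beam",
                              real_video={"prompt": "real clip"})
print(len(group))  # 6 candidates: 5 generated plus 1 real
```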

Step 3: Groupwise preference optimization (Plackett–Luce)

  • What happens: Instead of pairwise win/lose, the model learns from the probability that a given video is the top choice in the whole group. It increases the likelihood of the true winner (usually the real video) while decreasing the likelihood of the others. Training operates efficiently by using a theoretically justified upper bound that lets it update from just a subset each iteration, keeping compute manageable. A sketch of the groupwise objective follows this list.
  • Why it exists: Physics realism is holistic. Comparing only two clips can miss context like multi-frame smoothness, coordinated joints, and consistent cause-and-effect.
  • Example: Among five squash videos, the winner shows the ball compressing briefly on impact and bouncing at a believable angle and speed. The model is nudged to favor features leading to that result.
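Here is an illustrative PyTorch sketch of a groupwise Plackett–Luce preference loss. It assumes each candidate already has a denoising (flow-matching) error under both the trainable model and the frozen reference; the exact parameterization, the beta value, and the subset/upper-bound trick from the paper are not reproduced.

```python
import torch
import torch.nn.functional as F

def groupwise_pl_loss(policy_err, ref_err, winner_idx=0, beta=0.1):
    # Implicit reward per candidate: how much the trainable model reduces the
    # denoising/flow error relative to the frozen reference (lower error = preferred).
    implicit_reward = beta * (ref_err - policy_err)
    # Plackett-Luce: probability that each candidate is ranked first in the group.
    log_probs = F.log_softmax(implicit_reward, dim=0)
    # Maximize the chance that the true winner (e.g., the real video) tops the whole group.
    return -log_probs[winner_idx]

# Toy group of 5 candidates; candidate 0 plays the role of the real (winner) video.
policy_err = torch.tensor([0.8, 1.2, 1.5, 1.1, 1.3])
ref_err = torch.full((5,), 1.0)
print(groupwise_pl_loss(policy_err, ref_err))
```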

Step 4: Physics-Guided Rewarding (PGR)

  • What happens: A physics-aware VLM (VideoCon-Physics) scores each generated video on semantic match and physics common sense. These scores are converted into training weights; lower physics scores mean higher weights, so the model focuses on fixing its biggest mistakes. The weighting is smoothly adjusted to keep optimization stable (no wild swings) while sharpening preference when differences are clear. A small weighting sketch follows this list.
  • Why it exists: Not all mistakes are equal. Fixing a severe physics error (like a ball accelerating upward without force) matters more than polishing a small visual glitch.
  • Example: If a dunk video shows the ball passing through the rim but then hanging midair, it gets a low physics score and high weight, drawing strong corrective updates.
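The paper's exact weighting function is not given here, so the sketch below only shows the general idea described above: lower physics scores receive larger weights, smoothed by a temperature so optimization stays stable. The function name and the way the two scores are mixed are illustrative assumptions.

```python
import torch

def physics_guided_weights(physics_scores, semantic_scores, temperature=0.5):
    # Scores in [0, 1] from a physics-aware VLM; a low physics score marks a harder,
    # more physics-breaking clip that should receive a larger share of the training signal.
    difficulty = 1.0 - 0.5 * (physics_scores + semantic_scores)
    # Softmax with a temperature keeps the emphasis smooth rather than all-or-nothing.
    weights = torch.softmax(difficulty / temperature, dim=0)
    # Rescale so the average weight stays near 1 and the overall loss scale is unchanged.
    return weights * len(physics_scores)

phys = torch.tensor([0.9, 0.4, 0.7])  # the second clip badly violates physics
sem = torch.tensor([0.8, 0.8, 0.6])
print(physics_guided_weights(phys, sem))
```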

Step 5: LoRA-Switch Reference (LoRA-SR) for efficient, stable training

  • What happens: Traditional DPO copies a full model to serve as a static reference, doubling memory and risking instability as the trainable model drifts far away. LoRA-SR freezes the backbone once and attaches tiny trainable LoRA modules. By toggling a switch, the same backbone produces either the reference output (no LoRA) or the trainable output (with LoRA). This cuts memory and helps the trainable model stay close to the reference, stabilizing learning. A minimal sketch of the switch follows this list.
  • Why it exists: Memory savings enable bigger batches or higher resolution, and smaller, steadier steps reduce training wobble.
  • Example: On 8 H100 GPUs, LoRA-SR allows long training runs at 480×832 resolution without storing a second full model, while improving hard-action scores.
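Below is a minimal, single-layer sketch of the switching idea (not the paper's implementation): one frozen linear layer plus a small LoRA adapter that can be toggled off to recover the reference behavior without a second copy of the weights.

```python
import torch
import torch.nn as nn

class LoRASwitchLinear(nn.Module):
    """One frozen linear layer plus a switchable LoRA adapter (illustrative sketch)."""

    def __init__(self, dim, rank=4):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        self.base.weight.requires_grad_(False)  # the backbone stays frozen
        self.base.bias.requires_grad_(False)
        self.lora_a = nn.Linear(dim, rank, bias=False)  # trainable low-rank factors
        self.lora_b = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # start as an exact copy of the reference
        self.lora_enabled = True

    def forward(self, x):
        out = self.base(x)
        if self.lora_enabled:
            out = out + self.lora_b(self.lora_a(x))
        return out

layer = LoRASwitchLinear(8)
x = torch.randn(2, 8)
layer.lora_enabled = True
policy_out = layer(x)       # trainable-policy forward pass
layer.lora_enabled = False
reference_out = layer(x)    # frozen-reference pass from the same weights in memory
# At initialization the two outputs coincide; they diverge only as the LoRA factors train.
```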

Step 6: Flow-matching-friendly training signal (intuitive view)

  • What happens: The T2V backbone predicts how to move from noisy intermediate frames toward clean frames across time. Training compares the trainable model's step with the reference model's step on the winner video versus loser videos and prefers steps that better match the winner's motion field. You can think of it as asking "which direction of improvement would get us closer to the physically correct motion?" A toy sketch of this comparison follows this list.
  • Why it exists: This aligns directly with how modern video generators refine frames over time, making preference training efficient and well-targeted.
  • Example: For water refraction, the correct motion field bends edges of the pencil at the water line consistently across frames; the model learns to reproduce that subtle pattern.
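As a toy illustration of that comparison, the sketch below mixes each video's clean latent with noise along a straight-line path (one common flow-matching convention, which may differ from the backbone's actual parameterization), measures how well the trainable model and the frozen reference predict the path's velocity, and turns the difference into a per-video score that could feed the groupwise loss sketched in Step 3. The shapes and model stand-ins are toy assumptions.

```python
import torch
import torch.nn as nn

def flow_pref_scores(policy, reference, clean, noise, t, beta=0.1):
    # Mix each clean latent with noise at time t along a straight-line path.
    x_t = (1 - t) * clean + t * noise
    target_v = noise - clean  # velocity of that straight-line interpolation path
    # Per-video errors: how far each model's predicted velocity is from the path's velocity.
    err_policy = ((policy(x_t) - target_v) ** 2).flatten(1).mean(dim=1)
    err_ref = ((reference(x_t) - target_v) ** 2).flatten(1).mean(dim=1)
    # Positive score = the trainable model reproduces this video's motion better than the reference.
    return beta * (err_ref - err_policy)

# Toy stand-ins: each "video" is a flat 16-dim latent, and the models are tiny linear nets.
policy, reference = nn.Linear(16, 16), nn.Linear(16, 16)
clean, noise = torch.randn(5, 16), torch.randn(5, 16)
print(flow_pref_scores(policy, reference, clean, noise, t=0.3))
```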

The secret sauce:

  • Groupwise ranking captures the whole-scene physics signal that pairwise misses.
  • Physics-guided weights turn a generic preference loss into a physics tutor.
  • LoRA-SR makes the whole thing practical—faster, lighter, and more stable.

Sandwich recap of key concepts introduced:

  • PhyAugPipe: 🍞 Imagine a treasure hunt for videos with real physics. 🥬 It parses, reasons, scores, clusters, and samples to build a physics-rich set. 🍞 Example: keeps "glass shatters with falling shards," drops "static cartoon house."
  • Groupwise Plackett–Luce: 🍞 Like ranking the whole talent show. 🥬 It learns which video is best in the full lineup, not just head-to-head. 🍞 Example: picks the most balanced gymnast clip among many.
  • Physics-Guided Rewarding: 🍞 A coach has you repeat the trickiest stunt more. 🥬 Lower-physics-score clips get more training weight. 🍞 Example: fixes impossible squash rebounds first.
  • LoRA-SR: 🍞 Swap a small adapter instead of buying a second TV. 🥬 One frozen backbone, switch-in LoRA for training vs reference. 🍞 Example: trains faster with less memory while staying stable.

04 Experiments & Results

The test: Can the method improve physics realism across tough actions and physical phenomena, as judged by both automated tools and humans?

Datasets and metrics:

  • Training: PhyVidGen-135K filtered by PhyAugPipe; a focused 17K subset sampled with physics rewarding for preference optimization.
  • Evaluation: VideoPhy2 (hard actions, activities/sports, interactions) and PhyGenBench (mechanics, optics, thermal, material). Automated VLM-based scorers check semantic match and physics common sense. A user study with 104 participants directly measures human preference for physics realism.

Competition:

  • Strong baselines: VideoDPO (pairwise preference-trained T2V), PhyT2V (LLM-augmented physics prompts), leading open-source models, and even recent strong commercial systems (OpenAI Sora2 and Google Veo3.1) for certain action categories.

Scoreboard with context:

  • VideoPhy2: The proposed method achieves higher overall and hard-action scores than all open methods and even surpasses strong commercial systems on some challenging categories. Think of it like moving from a B to a solid A on the hardest physics questions, where others still hover around B or B−. The hard-action score sees especially large improvements over the base model and over VideoDPO/PhyT2V, indicating better handling of complex motions (gymnastics, polo, squash, handsprings).
  • PhyGenBench: Gains appear across mechanics (forces, motion), optics (refraction, reflection), thermal (combustion, heat effects), and material (deformation, fracture). This is like performing well not only in physics class overall but also in each unit test: kinematics, optics, heat, and materials.
  • Human preference: In head-to-head comparisons, people picked the new method's videos far more often. That's crucial: humans are sensitive to weird motion and timing; they rewarded the model that respected gravity, momentum, and material behavior.

Surprising findings:

  • Beating strong commercial systems on some categories: On action-heavy prompts like dunks or glass smashing, the approach produced cleaner, more plausible sequences (ball trajectory, net interaction, shard fallout) than expected.
  • Buoyancy and refraction details improved notably: Subtle cues like a tennis ball riding low in water or a pencil bending at the water line came out convincingly, suggesting the groupwise + physics weighting picks up fine-grained physical cues.
  • Stability and efficiency from LoRA-SR: Cutting memory while improving hard-action scores seems counterintuitive, but sharing one frozen backbone with switchable adapters reduced drift and helped training converge on physically consistent motion.

Concrete examples from qualitative results:

  • Gymnastics: Bodies keep shape; joint limits look natural; balance on the beam is steady rather than jerky.
  • Polo and squash: Tool–ball contact (mallet, racket) transfers force correctly; bounces and angles feel right.
  • Basketball: The ball arcs and passes through the hoop cleanly, without ghosting or rubbery net behavior.
  • Glass smashing: Fractures and shards disperse outward and fall under gravity.
  • Refraction & combustion: Pencil-in-water shows magnification and bending lines; burning paper spreads flame realistically.

Ablations that make the numbers meaningful:

  • PhyAugPipe components (filtering with chain-of-thought, clustering, physics-rewarded sampling) each raise scores, with the biggest jumps on hard actions. That means better data construction really matters.
  • Groupwise modeling + PGR: Replacing pairwise with groupwise and adding physics-aware weights steadily boosts performance, again most on hard actions. The model learns complex, coordinated motion.
  • LoRA-SR: Slashes memory and storage needs while improving scores; practical and accurate, not a trade-off.

Bottom line: Across benchmarks, user studies, and ablations, the method consistently lifts physical realism (especially where it's hardest) while training more efficiently.

05 Discussion & Limitations

Limitations:

  • Physics breadth: Even 135K physics-rich pairs can't cover every combination of materials, camera motions, lighting, and edge cases (e.g., soft-body tearing plus splashing plus smoke). Extremely rare events may still fool the model.
  • VLM dependence: Physics-guided rewarding relies on a physics-aware VLM. If that scorer has blind spots (e.g., misreads fast motion blur), sampling or weighting might under- or over-correct certain cases.
  • Real video as winner: Using real clips as gold standards is powerful, but some training prompts may lack perfectly matched real references, limiting supervision quality.
  • Long-horizon physics: Very long videos or complex chains of causes (domino runs, multi-step tools) remain challenging; small drifts can accumulate.

Required resources:

  • Compute: Multi-GPU training (e.g., 8×H100 over several days) with mixed precision helps. Storage for datasets and generated candidates is needed, though LoRA-SR reduces model duplication.
  • Models: Access to a capable T2V backbone (e.g., Wan2.1) and a reliable physics-aware VLM (e.g., VideoCon-Physics) for scoring.
  • Data: A large initial pool to filter from; the more diverse, the better.

When not to use:

  • Stylized fantasy where physics is intentionally broken (cartoon gravity, magical motion) and realism is not the goal.
  • Ultra-tight latency constraints that can't afford groupwise candidate generation (though candidate count can be tuned).
  • Domains demanding exact numerical physics (engineering-grade simulation). This improves plausibility, not precise measurement.

Open questions:

  • Can we further reduce reliance on external VLM scorers by learning a self-calibrating physics critic inside the T2V model?
  • How to extend to longer videos with consistent object states across shots and scenes (e.g., a ball getting scuffed over time)?
  • Can we disentangle appearance from dynamics better, so physics learning transfers across styles and domains?
  • What's the best curriculum? Which physics should be learned first: rigid collisions, frictional contact, deformable materials, or fluids?
  • How can we add lightweight human feedback (small preference batches) to correct VLM biases without heavy annotation?

06 Conclusion & Future Work

Three-sentence summary: This work makes text-to-video models follow real-world physics by building a physics-rich dataset, ranking whole groups of candidate videos, and guiding training with physics-aware rewards. A memory-saving LoRA-Switch Reference keeps training efficient and stable, avoiding full model duplication. The result is visibly more plausible motion and interactions, outperforming strong baselines and earning higher human preference.

Main achievement: Unifying groupwise preference optimization with physics-guided rewards and efficient LoRA-based referencing to directly align video generation with physical laws, leading to substantial gains on hard action and physics phenomenon benchmarks.

Future directions: Train a built-in physics critic to reduce external scoring dependence; scale to longer, multi-scene videos with consistent object states; expand coverage to richer materials and fluid–structure interactions; and explore lightweight human-in-the-loop feedback to refine tricky edge cases. Also, adapt the approach to controllable generation (camera paths, forces) and embodied simulations.

Why remember this: It shows that better physics doesn't just come from bigger models or longer prompts; what matters is teaching the model to prefer physically correct videos across whole groups, with a smart coach (physics-guided rewards) and practical training tools (LoRA-SR). This pathway brings AI video closer to trustworthy world simulation.

Practical Applications

  • Safer robotics training videos that teach grasping, carrying, and placing with proper physics.
  • Autonomous driving scenario generation with realistic vehicle dynamics and collisions.
  • Game prototyping where characters, balls, and breakable objects behave believably out of the box.
  • Previsualization for film and ads with physically plausible stunts and effects.
  • STEM education clips that correctly demonstrate refraction, combustion, and momentum.
  • Sports analytics and drills visualization with accurate ball trajectories and impacts.
  • Industrial safety simulations showing realistic equipment motion and material failures.
  • AR/VR content generation where interactions with virtual objects feel physically right.
  • Content moderation and quality control tools to flag physics-breaking synthetic videos.
  • Research testbeds for studying emergent physical reasoning in generative models.
#text-to-video generation#physical consistency#direct preference optimization#groupwise Plackett–Luce#physics-guided rewarding#LoRA-Switch Reference#vision-language model#chain-of-thought reasoning#flow matching#physics-aware video#preference alignment#video generation benchmark#PhyAugPipe#PhyVidGen-135K#VideoPhy2#PhyGenBench
Version: 1