Goal Force: Teaching Video Models To Accomplish Physics-Conditioned Goals

Beginner
Nate Gillman, Yinghua Zhou, Zitian Tang et al. · 1/9/2026
arXiv · PDF

Key Summary

  • Video models can now be told what physical result you want (like “make this ball move left with a strong push”) using Goal Force, instead of just vague text or a final picture.
  • Goal Force teaches a video model to plan causes (like a cue ball hit) that will create the desired effect (the target ball moving left).
  • It uses a simple, 3-channel control signal: one for direct pushes, one for goal pushes, and one for mass, all drawn as moving or static fuzzy dots.
  • Trained only on simple synthetic physics (balls, dominos, a swaying flower), the model surprisingly handles new, real-world scenes like tools, people, and pets interacting.
  • Compared to text-only prompting, people preferred Goal Force for actually achieving the desired motion, with little loss in realism or picture quality.
  • The model picks physically valid plans (like choosing the unblocked object to start a chain) and shows diversity by finding multiple different valid solutions.
  • It can even adjust plans using mass info (heavier things need stronger or faster hits), acting like a rough neural physics simulator without any physics engine at test time.
  • The system is built by fine-tuning a ControlNet on top of an open video diffusion model (Wan2.2) with a masking curriculum that forces planning.
  • Limitations include reliance on relative (not absolute) forces, occasional visual glitches from the base generator, and difficulty with very long or hidden chains.
  • This matters for robotics and visual planning: clearer, physics-aware goals make smarter, safer, and more controllable video plans and action suggestions.

Why This Research Matters

Clear, physics-aware goals make video models far more useful for real tasks. Instead of vague text like “make it move,” you can specify the exact push a target should feel and let the model plan the causes. This could power visual planners that help robots choose safer, more reliable actions in messy homes or factories. It also offers controllable animation tools for artists and educators who want realism without hand-authoring every motion. Because the model learns physics-like behavior from simple demos and needs no simulator at test time, it’s easy to deploy. Over time, this approach could bridge from videos to real-world robotics, improving safety and precision by grounding plans in physical effects.

Detailed Explanation

01Background & Problem Definition

🍞 Hook: Imagine you want to knock over the last domino in a long line. You don’t just say “knock over the last one.” You choose which earlier domino to tap so the chain reaction reaches the end.

🥬 The Concept: Before this paper, video models mostly listened to text like “a ball scores a goal” or to a final image, and tried to fill in the rest. They were great at making pretty videos, but not great at following precise, physics-heavy instructions.

  • What it is: A video “world model” tries to imagine future frames from a starting scene and a prompt.
  • How it worked: You gave it text (and sometimes an image), and it sampled a future that looked plausible.
  • Why it struggled: Text like “hit the ball hard to the left” is vague about where, when, and how much force. A target image is also hard to specify for moving things and exact timings.

🍞 Anchor: It’s like telling a friend “win the game” versus “tap the 3rd domino from the left so the last one falls.” One is fuzzy; the other is a clear physical plan.

🍞 Hook: You know how in sports, a coach doesn’t just say “score,” but “kick low and to the corner with medium power”? They’re specifying the force and direction, not just the outcome.

🥬 The Concept: The problem researchers faced was making video models obey detailed, physics-aware goals.

  • What it is: A precise way to tell the model “make this object feel a push in this direction, with this strength.”
  • How it works: Instead of high-level text, you specify a goal force vector for a target object.
  • Why it matters: Without force details, the model might make a nice-looking clip that misses the actual physical goal.

🍞 Anchor: If you want the orange ball to move left, you need something to hit it from the right at the right speed and moment. Saying “score a goal” won’t guarantee that exact hit happens.

🍞 Hook: Think of building with toy blocks: you can show the final tower, or you can say exactly how to push each block to get there safely.

🥬 The Concept: Past attempts gave models control over cameras, paths, or direct forces, but not over planning the cause that creates a desired effect on a different object.

  • What it is: Camera controls move your “eyes,” trajectories draw exact paths, and direct force says “poke here now.”
  • How they work: They directly set motion or view, often needing dense paths.
  • Why they fall short: Dense paths are hard to know ahead of time; direct force doesn’t plan multi-step cause-and-effect.

🍞 Anchor: If you want the red block to fall, a direct force says “push the red block.” But a planned goal might swing a pendulum first so it knocks the red block over later.

🍞 Hook: Imagine knowing the end of a Rube Goldberg machine (ring the bell) but needing the model to figure out which marble to release to make it happen.

🥬 The Concept: The missing piece was a goal language that’s physical (forces), local (which object), and time-aware (when), so the model can plan causes to reach effects.

  • What it is: A goal force: “make object B feel this push.”
  • How it works: The model backtracks and generates the needed earlier events.
  • Why it matters: It turns guessing into planning.

🍞 Anchor: To knock the last domino, the model learns to pick which earlier domino to hit, not just pretend the last one magically falls.

🍞 Hook: Think of teaching by simple science demos: rolling balls and dominos can explain lots of physics.

🥬 The Concept: The authors trained on simple synthetic “causal primitives” (balls, dominos, a swaying flower) to bootstrap complex reasoning.

  • What it is: A curriculum of basic interactions.
  • How it works: Pair videos with goal forces and direct forces, and sometimes hide one so the model must infer it.
  • Why it matters: From simple building blocks, the model generalizes to tools, people, and pets—like learning ABCs to write stories.

🍞 Anchor: Even though it studied dominos and balls, it later figured out that a golf club can deliver the right hit to a ball, zero-shot, in new scenes.

02Core Idea

🍞 Hook: You know how a GPS lets you enter your destination, and it figures out the turns to get there? You don’t list every street; you just say the goal.

🥬 The Concept: The “aha!” is to tell the video model the physical goal (a goal force on a target object) so it plans the earlier actions that cause it.

  • What it is: Goal Force—“make object X feel a push of this size and direction.”
  • How it works: The model back-reasons a causal chain (who hits what, in what order) and generates that video.
  • Why it matters: Without this, you either over-specify paths (hard!) or get vague results that don’t meet the physical goal.

🍞 Anchor: If you want the orange ball to slide left strongly, the model decides to hit it with the white ball from the right, then shows that action.

🍞 Hook: Think of bowling. You don’t move the pins; you roll the ball so the pins fall. The effect (pins fall) comes from a cause (ball hits pins).

🥬 The Concept (Analogy 1): Goal Force is like saying “I want the pins to fall this way,” and the system figures out how to roll the ball.

  • What it is: An effect-first command.
  • How it works: The system imagines backward from the effect to the needed cause.
  • Why it matters: It automates planning, not just motion.

🍞 Anchor: To topple the rightmost domino, it may choose to push the third domino so the wave reaches the end.

🍞 Hook: Picture a coach telling a soccer player, “bump the ball with medium power to the left.” That’s a force vector, not a poem.

🥬 The Concept (Analogy 2): A force vector is like a tiny arrow showing where and how hard to push.

  • What it is: Direction + magnitude tied to a specific target.
  • How it works: The model treats this as the goal to achieve.
  • Why it matters: Arrows are unambiguous; words can be fuzzy.

🍞 Anchor: “Push the red block north with a gentle nudge” is much clearer than “make the red block move.”

🍞 Hook: Imagine assembling a chain reaction with different tools—a stick, a ball, a ramp—any of which could start the sequence.

🥬 The Concept (Analogy 3): Goal Force samples different valid plans, not just one.

  • What it is: A distribution of solutions that all achieve the goal.
  • How it works: The model learns multiple causal paths and picks diverse ones.
  • Why it matters: Flexibility helps in cluttered, blocked, or risky scenes.

🍞 Anchor: To move a duck toy left, it might choose the center duck to bump it if the side duck is blocked by a wall.

🍞 Hook: Think of a smart lab partner who can do “backward science”—starting from the desired effect to design the experiment.

🥬 The Concept: Why it works: grounding in simple physics plus a clear control signal teaches the model to propagate forces across time and space.

  • What it is: A 3-channel control: direct force (cause), goal force (effect), and mass.
  • How it works: By sometimes hiding cause or effect during training, the model learns to infer the missing piece.
  • Why it matters: It becomes an implicit neural physics simulator—no external engine needed at test time.

🍞 Anchor: Even without a physics engine at inference, it can figure out that a heavier target needs a stronger or faster hit to reach the same motion.

🍞 Hook: Think of switching from “move this now” to “make that end-result happen.”

🥬 The Concept: Before vs After.

  • What it is: Before: direct pokes, fixed paths, or vague text. After: effect-first planning with physics-aware choices.
  • How it works: The model plans who should act, when, and how hard.
  • Why it matters: You get controllable, realistic videos that meet precise physical goals.

🍞 Anchor: Instead of instantly sliding the orange ball (cheating), the model shows the cue ball striking it so the orange ball ends up moving left, as requested.

03Methodology

🍞 Hook: Imagine giving a treasure map with one big X showing where you want a knock to happen, and your partner figures out the steps to make it happen.

🥬 The Concept: High-level recipe: Input image + text + goal force → multi-channel control signal → video diffusion model with ControlNet → video that shows the causes leading to the goal.

  • What it is: A pipeline that conditions a video generator on physics signals.
  • How it works: A 3-channel tensor encodes direct force, goal force, and mass; a ControlNet steers a pretrained video diffusion model.
  • Why it matters: Without this structure, the model can’t consistently translate desired effects into causal actions.

🍞 Anchor: You give a frame of a pool table and an arrow on the orange ball (goal force). The system outputs a clip where the white ball strikes the orange one so it moves as requested.
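To make the recipe above concrete, here is a minimal, heavily hedged Python sketch of the call flow. The names (`ControlNetVideoModel`, `generate`) and the dummy sampler are placeholders I'm assuming for illustration; they are not the authors' released code.

```python
import numpy as np

class ControlNetVideoModel:
    """Placeholder for the ControlNet-conditioned video diffusion sampler."""
    def sample(self, first_frame, prompt, control):
        T = control.shape[0]
        # Dummy output: repeat the first frame T times instead of real sampling.
        return np.repeat(first_frame[None], T, axis=0)

def generate(first_frame, prompt, control_tensor, model):
    """image + text + 3-channel physics control -> video clip."""
    assert control_tensor.shape[1] == 3  # channels: direct force, goal force, mass
    return model.sample(first_frame, prompt, control_tensor)

frame = np.zeros((256, 256, 3), dtype=np.float32)      # conditioning image
ctrl = np.zeros((81, 3, 256, 256), dtype=np.float32)   # 81 frames, as in the paper
video = generate(frame, "pool table, cue ball, orange ball", ctrl, ControlNetVideoModel())
print(video.shape)  # (81, 256, 256, 3)
```

The only structural commitment here is the shape of the control tensor; the sections below sketch how each channel could be filled in.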

🍞 Hook: You know those fuzzy stickers on a map that show “go this way”? We’ll use fuzzy dots in videos to show pushes.

🥬 The Concept: Multi-channel physics control tensor.

  • What it is: A video-shaped tensor with 3 channels over time and space.
  • How it works (step-by-step):
    1. Channel 0 (Direct Force): a moving Gaussian blob marks where/when a direct poke happens.
    2. Channel 1 (Goal Force): a moving Gaussian blob marks the desired effect on a target object.
    3. Channel 2 (Mass): a static Gaussian blob marks relative mass per object.
  • Why it matters: Without clear, separate channels, the model would confuse causes with effects or ignore mass.

🍞 Anchor: For a pendulum-and-block setup, the goal blob sits on the block (effect), and the model figures out the pendulum swing (cause).
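Here is a minimal sketch of how such a 3-channel control tensor could be rasterized, assuming Gaussian blobs whose peak value scales with strength. The specific sigmas and the (x, y, strength) track format are my assumptions for illustration, not the paper's exact encoding.

```python
import numpy as np

def gaussian_blob(h, w, cx, cy, sigma):
    """2D Gaussian bump centered at (cx, cy), peak value 1."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

def build_control_tensor(T, H, W, direct_track, goal_track, masses):
    """Build a (T, 3, H, W) control tensor.
    direct_track / goal_track: per-frame (x, y, strength) tuples, or None when inactive.
    masses: list of (x, y, relative_mass), static over time."""
    ctrl = np.zeros((T, 3, H, W), dtype=np.float32)
    for t in range(T):
        if direct_track[t] is not None:        # channel 0: cause (direct force)
            x, y, s = direct_track[t]
            ctrl[t, 0] = s * gaussian_blob(H, W, x, y, sigma=4 + 4 * s)
        if goal_track[t] is not None:          # channel 1: effect (goal force)
            x, y, s = goal_track[t]
            ctrl[t, 1] = s * gaussian_blob(H, W, x, y, sigma=4 + 4 * s)
        for x, y, m in masses:                 # channel 2: static relative mass
            ctrl[t, 2] += m * gaussian_blob(H, W, x, y, sigma=6)
    return ctrl
```

Moving the blob's center frame by frame is what makes the force "time-aware": the model can read off where and when the push should land.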

🍞 Hook: Think of a teacher covering parts of a diagram so you have to reason out what’s missing.

🥬 The Concept: Masked training for implicit planning.

  • What it is: During training, the model sees either the cause (direct force) or the effect (goal force), not both, at random.
  • How it works:
    1. If only the goal is shown, the model must generate the earlier cause.
    2. If only the direct force is shown, it must simulate the resulting effect and secondary impacts.
    3. The mass channel is sometimes hidden so the model learns to infer mass from appearance when needed.
  • Why it matters: Without masking, the model could “cheat,” copying signals instead of learning cause-effect reasoning.

🍞 Anchor: If the video shows a target balloon must drift right (goal), the model learns to make a fan or a bump appear earlier to make that happen.
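A sketch of the masking idea, assuming the control tensor is a NumPy array shaped (T, 3, H, W) as above and that masking just zeroes channels per training clip. The probabilities below are illustrative guesses, not the paper's actual schedule.

```python
import random

def mask_control_for_training(ctrl, p_goal_only=0.4, p_direct_only=0.4, p_hide_mass=0.3):
    """Randomly hide the cause, the effect, or the mass channel so the model
    must infer the missing piece (backward planning vs. forward simulation)."""
    ctrl = ctrl.copy()
    r = random.random()
    if r < p_goal_only:
        ctrl[:, 0] = 0.0    # hide direct force -> model must plan the cause
    elif r < p_goal_only + p_direct_only:
        ctrl[:, 1] = 0.0    # hide goal force -> model must simulate the effect
    # otherwise both cause and effect are kept
    if random.random() < p_hide_mass:
        ctrl[:, 2] = 0.0    # hide mass -> model infers it from appearance
    return ctrl
```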

🍞 Hook: Think of learning physics with three classic demos: dominos, rolling balls, and a flower swaying when nudged.

🥬 The Concept: Synthetic training curriculum.

  • What it is: Three datasets: dominos (3k), rolling balls (6k), and a swaying carnation (3k).
  • How it works:
    1. Dominos: Start a chain so a far domino receives the goal force.
    2. Balls: Collide or miss; learn angles, timing, and mass effects.
    3. Flower: Learn non-rigid oscillations after a poke.
  • Why it matters: These simple “physics primers” cover rigid hits, chains, and soft dynamics, teaching transferable skills.

🍞 Anchor: From ball collisions, the model learns that off-center hits change directions; from dominos, it learns multi-step propagation.
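As a toy illustration, the three datasets could be mixed in proportion to their sizes when sampling training clips. The clip counts come from the paper; the proportional weighting itself is just an assumption for this sketch.

```python
import random

CURRICULUM = {"dominos": 3000, "balls": 6000, "flower": 3000}  # clips per dataset

def sample_training_source():
    names, weights = zip(*CURRICULUM.items())
    return random.choices(names, weights=weights, k=1)[0]
    # in practice: load a (video, control tensor) pair from the chosen set

print([sample_training_source() for _ in range(5)])
```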

🍞 Hook: Imagine installing a steering wheel on a powerful engine so you can guide it precisely.

🥬 The Concept: ControlNet on a video diffusion backbone.

  • What it is: A ControlNet module plugged into Wan2.2 (a Mixture-of-Experts video diffusion model).
  • How it works:
    1. Clone early transformer layers (DiT) and fine-tune them only for the high-noise expert (global structure, low-frequency motion).
    2. Encode the control tensor and feed its features via zero-convolutions back into the frozen base model.
    3. Train ~3,000 steps with batch size 4 on 4×80GB A100s; 81 frames at 16 FPS.
  • Why it matters: Steering the global dynamics expert matches the need to plan big, causal motions across time.

🍞 Anchor: Like turning the big wheel that shapes the entire scene’s motion, not just tiny details.
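Below is a stripped-down sketch of the ControlNet pattern described above, assuming DiT-style blocks that map token sequences to token sequences. The real Wan2.2 blocks take extra conditioning inputs, and the zero-initialized linear layers here stand in for ControlNet's zero convolutions; treat this as a shape-level illustration, not the paper's implementation.

```python
import copy
import torch
import torch.nn as nn

class ControlBranch(nn.Module):
    """Cloned early backbone blocks whose outputs re-enter the frozen base
    model through zero-initialized projections (so training starts as a no-op)."""
    def __init__(self, base_blocks, hidden_dim, n_cloned=4):
        super().__init__()
        self.blocks = nn.ModuleList(copy.deepcopy(b) for b in base_blocks[:n_cloned])
        self.zero_proj = nn.ModuleList(nn.Linear(hidden_dim, hidden_dim) for _ in self.blocks)
        for proj in self.zero_proj:
            nn.init.zeros_(proj.weight)
            nn.init.zeros_(proj.bias)

    def forward(self, control_tokens):
        residuals, h = [], control_tokens
        for block, proj in zip(self.blocks, self.zero_proj):
            h = block(h)
            residuals.append(proj(h))  # added to the matching frozen-backbone features
        return residuals

# Toy usage with plain MLP blocks standing in for DiT layers.
dim = 64
toy_backbone = [nn.Sequential(nn.Linear(dim, dim), nn.GELU()) for _ in range(8)]
branch = ControlBranch(toy_backbone, hidden_dim=dim, n_cloned=4)
tokens = torch.randn(2, 16, dim)  # (batch, tokens from the encoded control tensor, dim)
print([r.shape for r in branch(tokens)])  # four (2, 16, 64) residuals, all zeros at init
```

Because the projections start at zero, the conditioned model initially behaves exactly like the frozen base generator and only gradually learns to steer it.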

🍞 Hook: Think of using relative terms like “small nudge” versus “big shove” instead of exact Newtons.

🥬 The Concept: Relative normalization of force and mass.

  • What it is: Values are scaled within each synthetic domain rather than absolute units.
  • How it works: Gaussian sizes and trajectories are proportional to the dataset’s value ranges.
  • Why it matters: This lets the model generalize concepts like “stronger hit” across varied scenes without needing a perfect universal scale.

🍞 Anchor: The model knows that a “big poke” in dominos isn’t the same number as a “big poke” in balls, but both mean “stronger than usual” within each world.
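A tiny sketch of per-domain (relative) normalization; the numeric ranges below are invented for illustration and are not the datasets' actual values.

```python
import numpy as np

def normalize_per_domain(values, domain_min, domain_max, eps=1e-8):
    """Map raw force/mass values to [0, 1] within a domain's own range,
    so 'strong' means 'strong for this dataset', not an absolute unit."""
    values = np.asarray(values, dtype=np.float32)
    return (values - domain_min) / max(domain_max - domain_min, eps)

# The same raw value can be 'large' in one domain and 'small' in another.
domino_force = normalize_per_domain([2.0], domain_min=0.5, domain_max=2.5)[0]  # 0.75
ball_force   = normalize_per_domain([2.0], domain_min=1.0, domain_max=8.0)[0]  # ~0.14
```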

🍞 Hook: Imagine a secret ingredient that turns a good recipe into a great one.

🥬 The Concept: The secret sauce.

  • What it is: The pairing of goal-vs-direct force channels plus masking, taught on simple causal primitives.
  • How it works: This forces the model to learn to backchain from effects to causes and forward-simulate outcomes.
  • Why it matters: It creates a planner-like behavior inside the video generator, no simulator needed at test time.

🍞 Anchor: That’s why, given a blocked path, the model picks another initiator that isn’t blocked and still achieves the goal force.

04Experiments & Results

🍞 Hook: Think of a science fair test: not just “does it look cool?” but “did it do the thing it promised?”

🥬 The Concept: The Test.

  • What it is: Measure goal-force adherence (did the target object get the requested push?), motion realism, and visual quality.
  • How it works: Human studies (2AFC) compare Goal Force videos vs. text-only prompts on challenging, real-world clips.
  • Why it matters: If it doesn’t achieve the goal, pretty motion isn’t enough.

🍞 Anchor: Judges picked which video actually made the specified ball move the right way, while still looking real and nice.

🍞 Hook: Picture a race between two strategies: “say it with words” vs. “say it with forces.”

🥬 The Concept: The Competition.

  • What it is: Baselines are text-only prompting: (1) zero-shot Wan2.2 and (2) a fine-tuned variant without physics signals.
  • How it works: All models get the same scene text; only Goal Force gets the explicit goal force channel.
  • Why it matters: Shows whether force conditioning truly adds control beyond better wording.

🍞 Anchor: Both contenders describe the scene; only one gets a precise arrow for the target object.

🥬 The Scoreboard (with context):

  • What it is: On four categories—two-object collisions, multi-object chains, human-object, and tool-object—Goal Force is preferred for achieving the goal.
  • How it works: People consistently chose Goal Force more often for goal adherence, with only small trade-offs in realism or quality.
  • What it means: Getting the physics right beats fancy phrasing when you need a specific outcome. Think “A-grade” goal-following when others hover around B-level.

🍞 Anchor: For “golf club hits ball left strongly,” Goal Force actually shows the club giving the right hit; text-only often looks nice but misses the exact motion.

🍞 Hook: Imagine asking older methods for “make the orange ball end up moving left” and they just… directly slide it left without any hit.

🥬 The Concept: Comparison to prior force/motion control.

  • What it is: Prior methods (PhysGen, PhysDreamer, Force Prompting) apply a direct cause and don’t plan antecedents; ToRA follows trajectories but can break causality.
  • How it works: When given a goal force, they often interpret it as a direct poke on the target.
  • Why it matters: They can’t solve the “how-to” planning task; Goal Force can still do direct force too, when asked.

🍞 Anchor: Instead of magically moving the target, Goal Force shows a cue ball hit first, then the target moves.

🍞 Hook: Think of a puzzle room with fake doors. Only some doors actually open; the planner must pick the valid one.

🥬 The Concept: Visual planning accuracy with blockers.

  • What it is: Scenes where some initiators are physically blocked; success = choosing an unblocked, valid initiator.
  • How it works: Across many trials, the model picks the correct initiator far above chance (e.g., ~98% in one pool setup after filtering visual glitches).
  • Why it matters: The model respects constraints instead of cheating with spontaneous motion.

🍞 Anchor: If the orange ball’s path is blocked by a stick, the model chooses the white ball to start the hit.

🍞 Hook: Imagine having more than one good way to solve a maze. You don’t want a robot that always chooses the same path.

🥬 The Concept: Diversity of valid plans.

  • What it is: In a 6-domino line, any of the first five can start a chain to topple the 6th; diversity measures spread across these choices.
  • How it works: A custom diversity metric shows Goal Force samples multiple initiators (score ~0.66 vs. ~0.39 for a deterministic baseline).
  • Why it matters: Multiple options help when one path is blocked or risky.

🍞 Anchor: Sometimes it starts from domino #2, other times #4, still ending with the last domino falling as required.
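The paper's exact diversity metric isn't spelled out here, but one plausible stand-in is the normalized entropy of which initiator each sampled plan chooses. This sketch is an assumption, not the authors' definition.

```python
import math
from collections import Counter

def initiator_diversity(choices, n_options):
    """Normalized entropy of initiator choices across sampled plans:
    0 = always the same plan, 1 = uniform over all valid plans."""
    counts = Counter(choices)
    n = len(choices)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(n_options)

# e.g. 10 sampled plans for the 6-domino line (any of dominos 1-5 may start the chain)
print(initiator_diversity([1, 2, 2, 3, 4, 2, 5, 1, 3, 2], n_options=5))  # closer to 1 = more varied
```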

🍞 Hook: Think of pushing a heavy box versus a light one—you adjust how hard or fast you push.

🥬 The Concept: Using privileged mass information.

  • What it is: The mass channel guides planning; heavier targets need stronger or faster hits.
  • How it works: In both familiar and new scenes, measured speeds match the expected relationships most or all of the time.
  • Why it matters: Shows physics awareness, not just memorized patterns.

🍞 Anchor: With a heavier target ball, the model speeds up the projectile to achieve the same goal force effect.
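As a back-of-envelope check of that relationship (not the model's internal computation): for a head-on elastic collision, the resting target ends up at speed 2·m_proj·v_proj/(m_proj + m_target), so hitting a heavier target to the same resulting speed requires a faster projectile.

```python
def required_projectile_speed(target_speed, m_proj, m_target):
    """Projectile speed needed so a head-on elastic hit gives the resting target
    the requested speed: v_target = 2 * m_proj * v_proj / (m_proj + m_target)."""
    return target_speed * (m_proj + m_target) / (2.0 * m_proj)

# Heavier targets need faster incoming hits for the same resulting motion.
print(required_projectile_speed(1.0, m_proj=1.0, m_target=1.0))  # 1.0
print(required_projectile_speed(1.0, m_proj=1.0, m_target=3.0))  # 2.0
```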

🥬 Surprises and takeaways:

  • Zero-shot generalization: trained on simple scenes, yet plans tool use or human-object interactions plausibly.
  • Minimal trade-off: improved goal adherence with only small drops (if any) in realism/quality.
  • Emergent planning: behaves like an implicit simulator without needing one at test time.

🍞 Anchor: Like a student who learned from dominos and balls, then can handle a golf club and jello in the real world.

05Discussion & Limitations

🍞 Hook: Even great gadgets have instruction labels—when they shine and when to be careful.

🥬 The Concept: Limitations.

  • What it is: Known weak spots.
  • How it works: Relative, not absolute, forces; occasional visual glitches from the base video model; struggles with very long chains, heavy occlusion, or tiny precise contacts.
  • Why it matters: Know when to expect imperfections or plan for post-filtering.

🍞 Anchor: In a crowded workshop where tools hide each other, the model might pick a reasonable plan but the visuals could muddle tiny contacts.

🍞 Hook: Imagine needing a good kitchen and decent time to cook a fancy meal.

🥬 The Concept: Required resources.

  • What it is: A strong video diffusion backbone (Wan2.2), a ControlNet, GPUs (e.g., 4×A100 80GB for ~2 days) for fine-tuning, and the provided synthetic data.
  • How it works: The base model provides visual priors; the ControlNet injects physics conditioning.
  • Why it matters: Without enough compute or a capable base, results may degrade.

🍞 Anchor: On smaller GPUs, you might need fewer frames or lower resolution and expect weaker planning.

🍞 Hook: Some tools aren’t the right screwdriver for every screw.

🥬 The Concept: When not to use.

  • What it is: Cases needing exact engineering-grade numbers; ultra-long, branching plans; guaranteed safety-critical outcomes.
  • How it works: The method uses relative scales and sampling; it’s not a certified physics engine.
  • Why it matters: Don’t use it as the sole controller of real robots in risky settings without extra checks.

🍞 Anchor: For a factory robot needing millimeter-precise torque planning, you still want a proper simulator.

🍞 Hook: Curiosity drives science—what’s next to explore?

🥬 The Concept: Open questions.

  • What it is: Absolute calibration of forces; richer object properties (friction, elasticity); longer-horizon, branched plans; tighter robot-action coupling; safety filters.
  • How it works: Add more channels (e.g., friction maps), better datasets, and closed-loop planning with feedback.
  • Why it matters: Each piece moves us toward robust, safe, physically grounded world models for real tasks.

🍞 Anchor: Imagine adding a “slippery floor” channel so the model plans gentler hits on ice.

06Conclusion & Future Work

🍞 Hook: Think of telling a helper not just “what you want,” but the physical nudge the target must feel—and they figure out the steps to make it happen.

🥬 The Concept: Three-sentence summary.

  • What it is: Goal Force lets users specify a desired physical effect (a goal force on a target object) for video generation.
  • How it works: A 3-channel control signal and a masking curriculum teach a video diffusion model to plan the causes that achieve the effect.
  • Why it matters: This creates physics-aware, controllable videos that generalize beyond training, without a simulator at test time.

🍞 Anchor: Ask for “move the orange ball left strongly,” and the model shows the cue ball strike that makes it so.

Main achievement: Turning a generative video model into an implicit neural physics planner that backchains from goals to causes.

Future directions: Add more physical properties (friction, elasticity), scale to longer and branched plans, calibrate to absolute units, and integrate with robot controllers and safety filters.

Why remember this: It reframes control from “do this motion now” to “achieve this physical effect,” giving us a simple, powerful language—force arrows—to plan realistic, goal-hitting videos and, eventually, real actions.

Practical Applications

  • Robot task planning: specify a desired force on an object (e.g., “push the mug gently right”) and visualize feasible action sequences.
  • Sports training visualizations: show how to strike a ball with a given power and direction to reach a target trajectory.
  • Tool-use planning: preview how to use a stick, club, or ruler to deliver the right hit when a direct push is blocked.
  • Safety checks in cluttered scenes: test multiple visual plans to avoid collisions with obstacles before acting.
  • Education demos: create physics lessons showing cause-and-effect chains (dominos, collisions, oscillations) from goal forces.
  • Film and game previsualization: rapidly generate realistic motion beats that meet physical constraints without hand-animating paths.
  • Warehouse picking and packing: visualize how to nudge heavier or lighter items to desired spots while respecting mass differences.
  • Assistive manipulation: plan gentle, compliant motions (e.g., moving a fragile item) using goal forces to control impact.
  • Rube Goldberg design: explore diverse valid chains that achieve a final effect, comparing creative alternatives.
  • Autonomous vehicle edge cases (simulation): specify goal interactions (e.g., a barrier moves) and visualize physically plausible antecedents.
Tags: goal force, force vector control, visual planning, video diffusion, ControlNet, implicit physics, causal chain, zero-shot generalization, mixture-of-experts, Gaussian blobs, relative mass conditioning, world models, controllable video generation, physics-grounded prompting, backward reasoning