Wan-Move: Motion-controllable Video Generation via Latent Trajectory Guidance
Key Summary
- Wan-Move is a new way to control how things move in AI-generated videos by guiding motion directly inside the model’s hidden features.
- It uses tiny paths called point trajectories to say where objects should be in each frame and copies the first frame’s rich features along those paths.
- Instead of adding extra motion encoders, Wan-Move edits the existing image condition feature, so it scales easily and stays simple.
- This approach creates crisp, 5-second 480p videos with motion control that rivals commercial tools in user studies.
- A new benchmark, MoveBench, was built with 1,018 videos, precise motion labels (points and masks), and long 5-second clips for fair testing.
- On MoveBench, Wan-Move gets better visual quality and more accurate motion following than prior academic methods.
- Human evaluations show Wan-Move is preferred for motion accuracy and quality, even competing closely with Kling 1.5 Pro.
- The key trick—latent trajectory guidance—avoids error-prone extra modules and preserves texture and context during motion.
- It works for single-object and multi-object motion, camera moves, motion transfer, and even 3D-like rotations using depth estimates.
- Limitations include loss of control during long occlusions and challenges in very crowded scenes or physically implausible motions.
Why This Research Matters
Precise motion control turns video generation from a guessing game into a directed performance—great for filmmakers, animators, educators, and game designers. By avoiding extra encoders, Wan-Move is easier to train at scale, making high-quality tools more accessible and efficient. In classrooms, teachers can show exact physics motions (like parabolas or rotations) rather than approximate animations. For creators, matching choreography, camera moves, or transferring motion from one clip to another becomes faster and more reliable. Studios can storyboard with precise paths, saving time and reducing revision cycles. As video AI grows, simple but powerful control like this bridges research and real-world production.
Detailed Explanation
01 Background & Problem Definition
You know how a flipbook turns still drawings into a moving scene? Before tools like Wan-Move, AI could already make flipbook-like videos from text or a single picture, but steering the exact paths of objects—like telling a bird to swoop just so—was hard. People could roughly say “move left” with boxes or masks, but getting fine, wiggly, or multi-part motions right was tricky.
🍞 Hook: Imagine you’re a director telling actors how to move on stage—you need clear directions and actors who listen. 🥬 The Concept (Motion Guidance): Motion guidance is the idea of giving AI specific instructions for how things should move over time.
- What it is: A set of signals that tell the video generator where and how objects should move across frames.
- How it works:
- Define motion signals (like paths or regions) over time.
- Feed them to the video model alongside text or the first frame.
- The model follows these signals while generating frames.
- Why it matters: Without motion guidance, the model guesses motion and may drift, wobble, or ignore the exact path you want. 🍞 Anchor: Telling a dog to run in a circle vs. just saying “move”—the first gives a precise loop; the second might wander.
The problem: Existing methods used two types of signals. Sparse signals (boxes, masks) can say “the cat goes right,” but not how the tail flicks or the paw arcs. Dense signals (like optical flow or point trajectories) promise fine detail, but often add heavy extra models or lose context, making training slow and results less stable at scale.
🍞 Hook: Think of tracing a car’s exact route on a map with a thin pen. 🥬 The Concept (Point Trajectories): A point trajectory is the path a single pixel-sized point takes across frames.
- What it is: A sequence of positions (x, y) for a point over time.
- How it works:
- Pick key points on the object (e.g., the tip of a wing).
- For each frame, record where that point should be.
- Use these paths to guide the generator.
- Why it matters: Point paths let you describe delicate, local motions (like fingers, fur, or ripples) precisely. 🍞 Anchor: Drawing a dotted line the bee follows as it buzzes—dot by dot, frame by frame.
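To make this concrete, here is a minimal sketch of one way point trajectories could be stored; the array layout, frame count, and values are illustrative assumptions, not the paper’s exact format:

```python
import numpy as np

# Hypothetical storage for point trajectories: pixel positions for N tracked
# points over T frames, plus a per-frame visibility flag for each point.
T, N = 81, 3                                          # e.g., 81 frames, 3 points (illustrative)
trajectories = np.zeros((T, N, 2), dtype=np.float32)  # (frame, point, [x, y])
visibility = np.ones((T, N), dtype=bool)              # False while a point is hidden

# Example: point 0 glides left-to-right while points 1 and 2 stay put.
trajectories[:, 0, 0] = np.linspace(100, 700, T)      # x sweeps from 100 to 700
trajectories[:, 0, 1] = 240.0                         # y stays fixed
trajectories[:, 1:, :] = [[400.0, 300.0], [500.0, 350.0]]
```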
Failed attempts: Optical flow adds a separate estimator at inference, stacking errors and slowing things down. Other methods feed trajectories into extra encoders (like ControlNets). These pipelines can blur or weaken the motion signal and make large-scale fine-tuning harder.
🍞 Hook: When you shrink a high-res photo, you keep what matters most; later you can rebuild it. 🥬 The Concept (VAE – Variational Autoencoder): A VAE compresses video frames into a smaller, meaningful code (latent) and can reconstruct them later.
- What it is: A pair of networks—an encoder that compresses data and a decoder that rebuilds it.
- How it works:
- Encoder zips images/video into a compact latent grid.
- Generator works in this smaller space (faster and easier).
- Decoder unzips to make the final video.
- Why it matters: Operating in latents makes video generation efficient and consistent. 🍞 Anchor: Like making a travel-size version of a board game that still has all the key pieces.
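For shape intuition only, here is a toy sketch of the compression a video VAE performs; the 4× temporal and 8× spatial strides and the 16 latent channels are assumptions about a Wan-style video VAE, and the tensors are random stand-ins rather than real encoder outputs:

```python
import torch

B, C, T, H, W = 1, 3, 81, 480, 832              # pixel-space video: 81 frames at 480x832
t_stride, s_stride, latent_ch = 4, 8, 16        # assumed temporal/spatial strides and channels

t_lat = (T - 1) // t_stride + 1                 # causal-style grouping keeps frame 1: 21 steps
h_lat, w_lat = H // s_stride, W // s_stride     # 60 x 104 latent grid

pixels = torch.randn(B, C, T, H, W)             # what the encoder would compress
latents = torch.randn(B, latent_ch, t_lat, h_lat, w_lat)  # stand-in for the compact latent code
print(latents.shape)                            # torch.Size([1, 16, 21, 60, 104])
```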
🍞 Hook: Making a video is like painting from blur to sharp. 🥬 The Concept (Video Diffusion Models): These models start from noisy latents and learn to denoise them into coherent videos over steps.
- What it is: A step-by-step clean-up process from noise to video.
- How it works:
- Start with noise in latent space.
- At each step, predict and remove some noise using learned patterns.
- After enough steps, a crisp video emerges.
- Why it matters: This stable process makes high-quality video generation possible. 🍞 Anchor: Like developing a photo in a darkroom, watching the image slowly appear.
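The toy loop below illustrates the idea only; it is not Wan-Move’s actual sampler or noise schedule, and `model` stands in for any learned noise predictor:

```python
import torch

def toy_denoise(model, cond, steps=50, shape=(1, 16, 21, 60, 104)):
    """Schematic denoising loop: start from noise and peel a little off each step."""
    x = torch.randn(shape)                      # pure noise in latent space
    for t in torch.linspace(1.0, 1.0 / steps, steps):
        noise_pred = model(x, t, cond)          # predict the noise still present
        x = x - noise_pred / steps              # remove a small fraction (schematic update)
    return x                                    # "clean" latents, handed to the VAE decoder
```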
🍞 Hook: Turn a single photo into a short movie where only the parts you choose move exactly how you wish. 🥬 The Concept (Image-to-Video Generation): It animates the first frame into a multi-frame video following given guidance.
- What it is: A model that takes one image (plus text or signals) and generates a short, moving clip.
- How it works:
- Encode the first frame into latent features (its appearance).
- Add noisy latents that will be denoised into the future frames.
- Use guidance (text and motion) to steer the denoising toward desired motion.
- Why it matters: It lets you control motion while preserving the first frame’s look. 🍞 Anchor: Turning a still photo of a dancer into a 5-second twirl while keeping the outfit and lighting the same.
The gap: We needed a way to keep motion fine-grained and strong without stacking extra modules. What if we could inject motion directly into the model’s own condition features, keeping all the rich texture and context the model already understands?
Real stakes: This matters for creators (directing specific movements), education (showing exact physics demos), filmmaking (storyboarding complex shots), robotics simulations (precise movement plans), and accessibility (adapting motion for different needs). Precise motion control saves time, reduces trial-and-error, and unlocks new types of content.
02 Core Idea
Aha! Moment in one sentence: Copy the rich features of the first frame along user-specified paths inside the model’s latent space, so the model naturally animates objects exactly where they should go—no extra motion encoders needed.
🍞 Hook: Think of using a rolling stamp to print the same pattern along a winding road. 🥬 The Concept (Latent Feature Replication): We copy the first frame’s latent features and place them along each motion path over time.
- What it is: A simple operation that duplicates the first-frame latent vector at the path’s new positions in later frames.
- How it works:
- Encode the first frame into latent features (a grid of rich descriptors).
- Convert each point trajectory from pixels to the latent grid.
- For each time step, paste the source feature at the trajectory’s new latent coordinate.
- Why it matters: The copied features carry texture and context (not a single-pixel thread), so local motion looks natural and consistent. 🍞 Anchor: If the first frame has a striped sleeve, copying its sleeve features along the arm’s path keeps stripes aligned as the arm swings.
🍞 Hook: You know how maps have a big version and a mini version? You need to point on the right map. 🥬 The Concept (Latent Trajectory Guidance): Transform pixel paths into latent-space paths and use them to guide generation directly.
- What it is: A mapping that turns (x, y) pixel locations over time into (h, w) latent coordinates, then uses them to steer motion.
- How it works:
- Downscale spatial coordinates to match the latent grid.
- For time, align to the latent’s temporal stride (averaging when needed).
- Replicate first-frame features along these latent coordinates.
- Why it matters: Working in the model’s native space keeps guidance strong, simple, and scalable. 🍞 Anchor: Marking your route on the same zoom level as the map you’re using—no confusion, just clear directions.
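A minimal sketch of that mapping, assuming 8× spatial and 4× temporal strides (illustrative values; the real configuration belongs to the model’s VAE):

```python
import numpy as np

def map_trajectories_to_latent(traj_px, s_stride=8, t_stride=4):
    """Map pixel-space paths (T, N, 2) as (x, y) into latent-grid paths (T_lat, N, 2).

    Spatial coordinates are simply divided by the latent stride; temporal alignment
    averages the pixel frames that fall inside each latent time step.
    """
    T, N, _ = traj_px.shape
    T_lat = (T - 1) // t_stride + 1                      # assumed causal-style grouping
    traj_lat = np.zeros((T_lat, N, 2), dtype=np.float32)
    traj_lat[0] = traj_px[0] / s_stride                  # frame 1 maps to latent step 0
    for k in range(1, T_lat):
        chunk = traj_px[1 + (k - 1) * t_stride : 1 + k * t_stride]  # frames in this stride
        traj_lat[k] = chunk.mean(axis=0) / s_stride      # average in time, downscale in space
    return np.round(traj_lat).astype(np.int64)           # integer latent cells, still (x, y) order
```

For example, with an 8× stride a pixel point at (x=160, y=120) lands in latent cell (w=20, h=15).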
Three analogies for the main idea:
- Sticker trail: Place a sticker from the first picture (with its full pattern) at each spot along a winding trail across later pictures.
- Mold and clay: Use the first frame as a mold; press it gently along the trajectory so shapes keep their texture as they move.
- Highlight and slide: Highlight the key feature in frame one, then slide that highlight along a drawn path so the model knows what to move and where.
Before vs After:
- Before: Extra motion encoders (like ControlNets) tried to translate motion signals into something the model understands, risking signal loss and slowing training/inference.
- After: No extra modules. Motion is baked directly into the condition feature, keeping the guidance crisp and making scaling easier.
Why it works (intuition, no math):
- The VAE encoder is roughly translation-equivariant for local features—meaning a feature for “this patch of object” stays meaningful if moved to a nearby location. By copying the first-frame latent vector to the new location at later times, the model keeps both identity (texture, semantics) and movement aligned. The diffusion denoiser fills in the rest, harmonizing neighbors in space and time.
Building blocks:
- Motion Guidance (paths you draw).
- Point Trajectories (fine-grained, per-point motion).
- VAE Latents (compact grid where features live).
- Latent Trajectory Mapping (pixel → latent coordinates).
- Latent Feature Replication (copy/paste the rich feature through time).
- Diffusion Denoiser (turns guided latents into clean video).
03 Methodology
At a high level: Input (first frame + text + point trajectories) → Encode to latents → Map trajectories to latent space → Replicate first-frame features along latent paths → Concatenate with noisy latents → Diffusion denoising → Output video.
Step-by-step like a recipe:
- Inputs
- What happens: You provide (a) a first frame image to preserve appearance, (b) an optional text prompt for context, and (c) one or more point trajectories describing where chosen points should be across frames.
- Why this step exists: The first frame anchors the look; trajectories anchor the motion; text clarifies overall scene and style.
- Example: First frame: a person planing wood. Trajectory: the planer’s handle moves left-to-right-to-left. Text: “Close-up of a person using an electric planer; wood shavings fly.”
- Encode first frame and text
- What happens: The VAE encoder turns the first frame (plus zero-padded future frames) into a latent condition grid. The text encoder (umT5) and the global image encoder (CLIP) provide contextual embeddings.
- Why it exists: Latents are the model’s working language; text and image embeddings give global guidance.
- Example: The sleeve texture and the wood grain become compact latent features; the text anchors the scene as a workshop with motion expectations.
- Map trajectories from pixels to latents
- What happens: Pixel coordinates are downsampled to the latent grid’s (height, width) and time-aligned to the latent’s frame stride (averaging across strides when needed).
- Why it exists: The model operates in latent space; guidance must be aligned to that space.
- Example: With an 8× spatial downscale, a pixel path at (x=160, y=120) maps to latent cell (h=15, w=20). Across frames, positions are aligned to every latent time step.
- Latent feature replication (the secret sauce)
- What happens: For each trajectory, copy the first frame’s latent vector at the start point and paste it at the mapped latent location in each later time step (only when that point is visible). If multiple trajectories overlap, randomly pick one to avoid feature averaging.
- Why it exists: Copying a full latent vector preserves local texture and semantics (like fabric pattern), producing natural local motion. Without it, single-pixel threads lack context, leading to stiff or broken motion.
- Example: The planer’s metal sheen and shape features are carried along the handle path, so the tool looks consistent as it moves.
- Fuse guidance and generate
- What happens: Concatenate the updated condition features with the noisy latents. Use cross-attention to inject text and global image embeddings. Apply classifier-free guidance to balance unconditional and conditional predictions.
- Why it exists: Concatenation is simple, avoids extra modules, and preserves performance while staying fast. Without it, we’d add heavy encoders and slow everything down. (A schematic sketch of this fusion appears after this recipe.)
- Example: The model denoises toward frames where the planer follows the path while keeping the workshop look and prompt intent.
- Denoise through time and decode
- What happens: The diffusion backbone (DiT) removes noise step by step, guided by the latent condition we edited. Finally, the VAE decoder reconstructs the video frames.
- Why it exists: It incrementally builds a sharp, coherent video consistent with motion and appearance.
- Example: Over ~50 steps, the clip becomes a smooth 5-second, 480p video of planing wood with flying shavings.
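To make the fusion step concrete, here is a schematic sketch of channel-wise concatenation plus standard classifier-free guidance; the shapes, the extra mask channel, and the guidance scale are illustrative assumptions, not Wan-Move’s exact code:

```python
import torch

# Schematic fusion of guidance with the noisy latents (shapes are illustrative).
noisy = torch.randn(1, 16, 21, 60, 104)        # noisy video latents being denoised
cond  = torch.randn(1, 16, 21, 60, 104)        # condition latents with replicated trajectory features
mask  = torch.ones(1, 4, 21, 60, 104)          # hypothetical validity/frame mask channels
dit_input = torch.cat([noisy, cond, mask], dim=1)   # simple channel-wise concatenation

def classifier_free_guidance(pred_uncond, pred_cond, scale=5.0):
    """Standard CFG blend: push the conditional prediction away from the unconditional one."""
    return pred_uncond + scale * (pred_cond - pred_uncond)
```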
What breaks without each step:
- No trajectory mapping: Guidance points don’t line up with latent cells; motion becomes wobbly or ignored.
- No feature replication: Local motion loses texture continuity; objects may smear or drift.
- Using heavy fusion (e.g., ControlNet): Adds latency and complexity, risking signal dilution.
- No classifier-free guidance: Weaker alignment to the prompt and image.
Concrete data example:
- Suppose you tracked 64 points on a dancer’s skirt hem over 5 seconds. After mapping to latents, each point gets a per-time-step (h, w). You copy the first frame’s hem features to those (h, w) at each time, preserving pleat textures as the skirt swirls. The denoiser harmonizes neighboring areas so the whole skirt flows.
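Here is a hedged sketch of that copy-and-paste operation; the tensor shapes, visibility mask, and random tie-breaking follow the description above, not Wan-Move’s released code:

```python
import numpy as np
import torch

def replicate_first_frame_features(cond, traj_lat, visible, rng=None):
    """Copy each trajectory's first-frame latent vector to its later latent positions.

    cond:     (C, T_lat, H_lat, W_lat) condition latents (first frame plus padding).
    traj_lat: (T_lat, N, 2) integer latent coordinates in (x, y) order.
    visible:  (T_lat, N) bool, False while a point is occluded (then nothing is pasted).
    """
    if rng is None:
        rng = np.random.default_rng()
    C, T_lat, H_lat, W_lat = cond.shape
    N = traj_lat.shape[1]
    # Source feature for each trajectory: the first-frame latent at its starting cell.
    src = [cond[:, 0, int(traj_lat[0, n, 1]), int(traj_lat[0, n, 0])].clone() for n in range(N)]
    for t in range(1, T_lat):
        # Visit trajectories in random order so that, if several land on the same cell,
        # a random one is kept instead of averaging their features.
        for n in rng.permutation(N):
            if not visible[t, n]:
                continue                          # skip occluded steps
            x = int(np.clip(traj_lat[t, n, 0], 0, W_lat - 1))
            y = int(np.clip(traj_lat[t, n, 1], 0, H_lat - 1))
            cond[:, t, y, x] = src[n]             # paste the rich first-frame feature
    return cond
```

The edited `cond` is then concatenated with the noisy latents in the fusion step above.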
Secret Sauce (why this method is clever):
- Directly editing the model’s condition feature keeps motion instructions potent and context-rich. It avoids extra encoders, simplifies training, speeds inference (only ~+3s overhead vs. +225s with ControlNet in tests), and scales to large backbones and datasets while improving motion fidelity.
04 Experiments & Results
The Test: The team evaluated two things: visual quality (does the video look clean and consistent?) and motion accuracy (does it follow the paths?). They used standard visual metrics (FID, FVD, PSNR, SSIM) and a motion metric (EPE, end-point error) that measures how far generated tracks are from ground-truth trajectories—the smaller, the better.
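As a concrete reference, end-point error is just the average distance between where tracked points end up in the generated video and where the target trajectories said they should be; a minimal sketch (the masking convention is an assumption):

```python
import numpy as np

def end_point_error(pred_tracks, gt_tracks, valid=None):
    """Mean Euclidean distance between predicted and ground-truth point positions.

    pred_tracks, gt_tracks: (T, N, 2) arrays of (x, y) pixel positions.
    valid: optional (T, N) bool mask selecting which points to score (e.g., visible ones).
    """
    dist = np.linalg.norm(pred_tracks - gt_tracks, axis=-1)   # (T, N) per-point distances
    if valid is not None:
        dist = dist[valid]
    return float(dist.mean())                                  # lower is better
```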
The Competition: Wan-Move was compared to academic baselines like ImageConductor, Tora, MagicMotion, and LeviTor, and even to a strong commercial tool, Kling 1.5 Pro’s Motion Brush, via user studies.
The Scoreboard (with context): On MoveBench (1,018 videos, 5s each, 480×832), Wan-Move achieved FID 12.2, FVD 83.5, PSNR 17.8, SSIM 0.642, and EPE 2.6. Think of EPE like golf: lower is better—Wan-Move’s 2.6 is like a clean shot near the hole, while others around 3.2–3.4 are still decent but farther away. Visual scores (higher PSNR/SSIM) were also stronger, meaning the videos looked clearer and more consistent. On DAVIS (a public dataset), similar trends held, showing robustness.
Multi-object motion: On the multi-object subset of MoveBench, Wan-Move had much lower FVD and EPE than competitors, meaning it kept complex, multiple motions aligned and stable better than others.
Human studies (2AFC): 20 participants compared 50 samples per method. Wan-Move’s win rates vs academic methods were above 89% across motion accuracy, motion quality, and visual quality—like getting A’s while others mostly got B’s. Against Kling 1.5 Pro, Wan-Move had comparable preferences (around half the time winning), notably edging on motion quality—remarkable for a research model.
Surprising findings:
- Simple concatenation vs. ControlNet: Wan-Move’s direct concat strategy matched or beat ControlNet-style fusion in both quality and speed, avoiding massive latency (+225s) while adding only ~+3s.
- More guidance points help at inference: Increasing the number of trajectories (up to dense tracking) kept shrinking EPE, hitting as low as ~1.1 with dense tracks—even though training was capped at 200 points. That shows strong generalization.
- Ablations confirmed the secret sauce: Pixel-level copy (before VAE) and random embeddings performed worse. Latent feature replication had the best combo of high PSNR/SSIM and lowest EPE.
Challenging scenarios:
- Large-motion and out-of-distribution (OOD) subsets: Wan-Move maintained a lead over baselines when motions were huge or unusual (like foreground–background entanglements or rare camera moves), with only mild drops from full-set performance.
MoveBench itself:
- Scale and precision: 1,018 5-second videos (longer than typical), curated into 54 classes with both point and mask labels. An interactive labeling flow mixed human clicks with SAM segmentation to ensure accurate regions, plus CoTracker for reliable trajectories. Detailed, generation-friendly captions help fair text+motion evaluations.
05 Discussion & Limitations
Limitations:
- Long occlusions: If a guided point stays hidden too long, control can weaken or be lost until it reappears.
- Crowded scenes: Many interacting objects can introduce artifacts.
- Physics edge cases: If asked to follow impossible motions, the model may produce unrealistic results.
- Tracking noise: Bad input trajectories (e.g., tracking errors) can misguide the generation.
Required resources:
- A capable I2V backbone (e.g., Wan-I2V-14B or similar), VAE, and DiT-based diffusion.
- Point trajectories (sparse or dense) from a tracker like CoTracker, plus optional SAM-assisted masks.
- GPU resources for training/fine-tuning (the paper used multi-GPU setups) and modest overhead at inference (~+3s).
When NOT to use:
- If you can’t obtain reasonable trajectories (e.g., extreme occlusions, severe motion blur) and precise control is essential.
- If latency is extremely critical and you cannot afford any extra seconds.
- If motions must strictly follow real-world physics and constraints beyond visual plausibility.
Open questions:
- How to maintain strong control across very long occlusions or out-of-frame segments?
- Can we auto-correct or validate noisy trajectories during generation?
- How to blend multiple overlapping trajectories more intelligently than random selection?
- Can this approach extend to 3D/4D controls (full depth-aware paths, camera rigs) in a unified interface?
- How far can we push resolution and duration without losing motion fidelity, while keeping speed?
06 Conclusion & Future Work
Three-sentence summary: Wan-Move introduces latent trajectory guidance: convert user-specified point paths into latent coordinates and copy the first frame’s rich features along those paths, guiding motion directly inside the model. This simple edit avoids extra encoders, preserves texture and context, and scales to large backbones, delivering precise, high-quality motion control. A new benchmark, MoveBench, shows that Wan-Move outperforms academic baselines and competes with commercial tools in user studies.
Main achievement: Demonstrating that directly editing the model’s condition feature via latent feature replication yields fine-grained, scalable motion control without architectural changes or heavy fusion modules.
Future directions: Improve robustness under long occlusions, integrate smarter overlap handling, expand unified 3D/camera controls, and scale to longer, higher-resolution videos with efficient inference. Exploring automatic trajectory correction and physics-aware constraints could further increase reliability.
Why remember this: Wan-Move flips the script on motion control—keep it in the model’s native language (latents), and motion becomes both precise and practical. This idea unlocks creator-friendly tools that are fast, accurate, and easy to scale, narrowing the gap between research and production-grade video generation.
Practical Applications
- Direct a character’s limb paths for choreography or action scenes by drawing point trajectories.
- Synchronize product demos (e.g., a phone spinning, a watch moving) to exact timing for ads.
- Create educational physics clips (projectiles, pendulums, rotations) with precise motion paths.
- Transfer motion from a reference video to a new scene (e.g., copy a dancer’s footwork).
- Perform camera moves (pan, dolly, zoom) by generating trajectory-guided background motion.
- Animate multi-object interactions (e.g., two balls colliding) with separate trajectories per object.
- Do style-preserving edits: change first-frame appearance (color/style) while keeping original motion.
- Prototype robotics simulations by sketching end-effector path trajectories for visual previews.
- Design UI motion studies by animating icons or cursors along exact curves for usability tests.
- Previsualize sports tactics by animating players along planned routes from a single overhead image.