3D-Aware Implicit Motion Control for View-Adaptive Human Video Generation
Key Summary
- This paper introduces 3DiMo, a new way to control how people move in generated videos while keeping camera movement flexible through text.
- Instead of using 2D poses or fixed 3D body models that can be wrong, 3DiMo learns an implicit, view-agnostic motion signal directly from regular videos.
- A small motion encoder turns driving video frames into compact motion tokens and feeds them into a powerful video generator using cross-attention.
- The system is trained with view-rich supervision (single-view, multi-view, and moving-camera clips) so the learned motion truly works in 3D from any angle.
- Early in training, a gentle helper (auxiliary geometric supervision from SMPL/MANO) jump-starts 3D understanding and is then faded out to zero.
- 3DiMo reproduces the same 3D motion as the driver clip, while camera paths are guided by natural language prompts like "camera arcs left."
- On standard metrics (LPIPS, FID, FVD) and in user studies, 3DiMo beats recent 2D- and 3D-controlled baselines in realism, motion accuracy, and 3D plausibility.
- Ablations show each part matters: view-rich data, cross-attention conditioning, the body+hand dual encoders, and the annealed geometric helper.
- The method also enables single-image novel view synthesis, video stabilization, and automatic motion-to-appearance alignment.
- Main limits are 480p resolution and weaker handling of object interactions, but the approach points toward more natural, 3D-consistent video generation.
Why This Research Matters
3DiMo lets anyone take a single photo, borrow motion from a video, and still move the camera freely with simple text, while keeping the motion physically believable. That means better TikTok edits, smoother home videos, and more cinematic shots without expert tools. Coaches, teachers, and creators can study and reuse motions from different angles, making learning and remixing easier. Games and virtual worlds can get lifelike character animations from everyday clips, not complex motion-capture rigs. It also helps with stability tasks like de-jittering footage while keeping the action intact. By aligning with a generator's natural 3D sense instead of forcing it to follow possibly-wrong skeletons, 3DiMo makes video AI both simpler to use and more reliable in 3D.
Detailed Explanation
01 Background & Problem Definition
Hook: You know how movies feel realistic when the camera can move around a person, and their body still looks correct from every angle? Real life is 3D, so our eyes expect motion to make sense in three dimensions.
The Concept (Video Generation): What it is: Video generation is when a computer creates a whole video from inputs like text, images, or other videos. How it works:
- Compress videos into simpler signals (so training is efficient).
- Learn how pixels change over time and space.
- Use a trained model to predict clean video frames from noisy ones.
- Guide the model with things like a reference image or a caption. Why it matters: Without strong video generation, motion control is like trying to build a house without bricks: there's nothing to animate. Anchor: When you type "a dancer twirls on a stage" and get a short clip, that's video generation at work.
The World Before: Early human animation systems moved a still photo using 2D poses. These were like stick figures drawn on top of the image. They worked if you stayed in the same view, but the moment you tried to move the camera or see a different angle, they broke. Arms could look like they pass through the body, and depth got confusing. To fix that, people used explicit 3D body models, such as SMPL, which is like a template human that you can pose and render back to 2D. This gave structure but came with its own problems: the 3D it guessed from a single view was often wrong (depth flips, incorrect contacts), and when used as a hard constraint, it sometimes forced the video generator to follow bad geometry.
Hook: Imagine playing with a paper doll. You can pose it, but it's still flat. If you tilt it, it doesn't suddenly become truly 3D.
The Concept (SMPL Model): What it is: SMPL is a parametric 3D template of the human body that you can shape and pose with a small set of numbers. How it works:
- Start with a standard 3D human mesh.
- Adjust shape knobs to match body type.
- Adjust pose knobs to bend arms, legs, and torso.
- Render this 3D body back into a 2D image. Why it matters: SMPL provides a neat, consistent structure, but when the inputs are uncertain, it can still be wrong, and forcing a generator to follow wrong 3D makes videos look off. Anchor: If a video shows a hand on a hip, a noisy SMPL guess might place the hand slightly in front of the stomach from another view, breaking the illusion.
The Problem: Motion is truly 3D, but most controls were either glued to the original 2D viewpoint or depended on possibly-wrong 3D reconstructions. This meant you couldn't freely move the camera at generation time without the motion falling apart.
Failed Attempts: 2D-stick-figure methods copied the driver's projection too literally. 3D-parametric methods injected strong but sometimes biased geometry that overruled the video model's native 3D knowledge, leading to stiff or incorrect motions.
Hook: Imagine learning to ride a bike by watching only one camera angle. You might think you've mastered it, but turn the bike and suddenly your balance fails.
The Concept (3D-Aware Motion Representation): What it is: A way to describe how a body moves that stays correct from different camera angles. How it works:
- Look at 2D frames and infer the underlying 3D motion ideas.
- Store only the motion essence, not the exact pixels or view.
- Use that essence to drive a generator from any angle. Why it matters: Without it, motion looks right only from one view, and breaks when the camera moves. Anchor: A person waving looks like a real wave from the front, side, or back, not a flat cardboard flap.
The Gap: We needed a motion signal that was view-agnostic (not tied to one angle), expressive (captures nuance like hands and timing), and that worked with, not against, the generator's own 3D smarts.
Real Stakes: This affects everything from making better movie shots and game avatars to reliable sports replays, virtual fitting rooms, safer motion analysis, and less shaky home videos. If AI understands motion in 3D, anyone can create camera-worthy shots from a single photo and a motion clip, with fewer glitches and more realism.
02 Core Idea
Hook: Imagine teaching a friend to dance by describing the moves ("step, turn, raise your arm"), not by forcing them to copy your exact position from your camera angle.
The Concept (Implicit Motion Encoder): What it is: A small model that looks at a driving video and turns it into compact motion tokens that capture the "idea of the motion" without locking to any viewpoint. How it works:
- Take each frame and patchify it into tokens.
- Mix them with a few learnable latent tokens.
- Use attention to distill only the motion essence into those latent tokens.
- Throw away the rest so we can't overfit to appearance or specific view. Why it matters: Without this distillation, the generator might copy 2D shadows instead of real 3D motion, breaking when the camera moves. Anchor: From a cartwheel clip, the encoder saves "cartwheel-ness" (timing, flips, limbs) but not the exact left-camera look.
The "Aha!" Moment in one sentence: Let the video generator keep its own 3D wisdom, and feed it a tiny, view-agnostic motion code through cross-attention so it reenacts motion correctly from any camera angle you ask for with text.
Three Analogies:
- Music sheet vs. recording: The motion tokens are like a music sheet (abstract), not a noisy phone recording (view-specific). The generator is the orchestra that plays it in any concert hall you choose (camera path).
- Recipe vs. photo: The tokens are the recipe of a move, not a picture from one side. You can cook it in any kitchen (viewpoint).
- GPS directions vs. street view: The tokens give turn-by-turn motion directions, not a single street photo.
Hook: You know how a great coach gives cues ("hips forward, shoulders relaxed") instead of micromanaging every muscle?
The Concept (DiT-based Video Generator): What it is: A powerful diffusion-with-transformers video model that already understands 3D structure and motion patterns from large-scale training. How it works:
- Compress video into a latent space with a VAE.
- Add noise to latents and train the model to denoise step by step.
- Use self-attention to mix information from text, a reference image, and the evolving video.
- Generate clean video latents, then decode back to frames. Why it matters: This model is the skilled "performer." If you over-constrain it with wrong geometry, you lose its natural 3D sense. Anchor: Ask for "front-right view, camera arcs left" and it keeps things 3D-consistent while following your words.
Hook: Think of two people working together: one describes the move, the other performs it, constantly listening.
The Concept (Cross-Attention): What it is: A way for the generator's video tokens to focus on the motion tokens and borrow just the useful motion details. How it works:
- After self-attention, insert cross-attention.
- Let video tokens query motion tokens for relevant parts.
- Keep text tokens independent to preserve camera/control words.
- Merge this info to continue generation. Why it matters: Without cross-attention, motion either overwhelms or gets ignored. This keeps the interaction smart and flexible. Anchor: During a jump, video tokens pull "start crouch → spring → airtime → land" from motion tokens at just the right moments.
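To make this concrete, here is a minimal PyTorch sketch of the idea, with illustrative dimensions and module names (it is a conceptual sketch, not the paper's actual layer design): video tokens self-attend as usual, then cross-attend to the motion tokens so they pull in only the motion details they need.

```python
import torch
import torch.nn as nn

class MotionConditionedDiTBlock(nn.Module):
    """Illustrative DiT block: self-attention over video tokens,
    then cross-attention that lets video tokens query motion tokens."""
    def __init__(self, dim: int = 1024, heads: int = 16):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, video_tokens, motion_tokens):
        # 1) Self-attention: video tokens exchange information as usual.
        x = video_tokens
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        # 2) Cross-attention: video tokens are the queries, motion tokens are the
        #    keys/values, so only the currently relevant motion details get pulled in.
        h = self.norm2(x)
        x = x + self.cross_attn(h, motion_tokens, motion_tokens, need_weights=False)[0]
        # 3) Standard feed-forward update.
        return x + self.mlp(self.norm3(x))

# Example shapes: 4,096 video latent tokens, 64 motion tokens per clip (both made up).
block = MotionConditionedDiTBlock()
video = torch.randn(2, 4096, 1024)
motion = torch.randn(2, 64, 1024)
out = block(video, motion)  # (2, 4096, 1024)
```

Text tokens are not shown here; the point is that the camera-describing text stays on the backbone's own conditioning path while motion arrives only through this extra cross-attention.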
Hook: Imagine learning a dance by watching it from the front, side, and while the camera circles around; your brain discovers the true 3D move.
The Concept (View-Rich Supervision): What it is: Training with single-view, multi-view, and moving-camera clips of the same motions so the model must learn motion that stays correct from different angles. How it works:
- Self-reconstruct motions from single-view videos.
- Reproduce the same motion from other viewpoints in synchronized multi-view videos.
- Reproduce the same motion under moving cameras, guided by text camera prompts.
- Balance these to force 3D-consistent learning. Why it matters: Without this, the model might memorize flat patterns that only look right from one view. Anchor: If someone high-fives in one camera, the model must keep that contact true when the camera swings around.
Before vs. After:
- Before: Motion control stuck to one camera angle or to sometimes-wrong 3D skeletons.
- After: A small, abstract motion code guides a 3D-savvy generator, so motion holds together while you freely change the camera with text.
Why It Works (intuition): By throwing away view-specific layouts and using cross-attention, the system prevents "flat copying." View-rich training then rewards only motion that survives angle changes. A gentle, early geometric helper points in the right direction and then bows out.
Building Blocks:
- Implicit motion tokens (view-agnostic).
- Cross-attention injection into the DiT video generator.
- Dual encoders for body and hands.
- View-rich training: single-view, multi-view, moving-camera.
- Auxiliary geometric supervision (SMPL/MANO) that's annealed away.
- Text-guided camera control through the native generator.
03 Methodology
High-Level Recipe: Input (reference image + driving video + text camera prompt) → Motion Encoder makes view-agnostic motion tokens → Cross-attention injects tokens into DiT video generator → Output video reenacts the 3D motion with your chosen camera path.
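Read as code, the recipe is a short pipeline. The sketch below is a hypothetical wiring in Python; names like `motion_encoder` and `video_generator.sample` are placeholders for illustration, not the released API.

```python
def generate(reference_image, driving_video, camera_prompt,
             motion_encoder, video_generator):
    """Hypothetical wiring of the 3DiMo inference recipe."""
    # 1) Distill the driving clip into a small set of view-agnostic motion tokens.
    motion_tokens = motion_encoder(driving_video)   # (num_tokens, dim)

    # 2) The pretrained DiT generator denoises video latents conditioned on the
    #    text prompt (camera path), the reference image (identity/appearance),
    #    and the motion tokens (injected via cross-attention inside each block).
    video = video_generator.sample(
        text=camera_prompt,           # e.g. "camera arcs left around the subject"
        reference=reference_image,    # who to animate
        motion_tokens=motion_tokens,  # what motion to perform
    )
    return video
```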
Hook: Imagine summarizing a whole dance into a tiny set of cue cards you can hand to any performer.
The Concept (Motion Tokens): What it is: A tiny sequence of learned vectors that capture only the essence of the motion. How it works:
- Patchify each driving frame into visual tokens.
- Add a few learnable latent tokens.
- Run transformer layers so latent tokens "soak up" motion.
- Keep only the latent tokens; drop the rest. Why it matters: If we keep view-specific pixels, we'll overfit to one angle and break camera control. Anchor: Five tokens might encode "step forward → swing arm → turn → hold → reset."
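A minimal sketch of such a token bottleneck, assuming a perceiver-style design in PyTorch (the real encoder's size, patching, and token count are not specified here): frame patches and a few learnable latent tokens are mixed by attention, and only the latent tokens are kept as the motion code.

```python
import torch
import torch.nn as nn

class ImplicitMotionEncoder(nn.Module):
    """Illustrative motion encoder: frame patches + learnable latents -> motion tokens."""
    def __init__(self, patch_dim=768, dim=1024, num_latents=64, depth=4, heads=16):
        super().__init__()
        self.patch_proj = nn.Linear(patch_dim, dim)
        # A small, fixed set of learnable latent tokens that will "soak up" the motion.
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, frame_patches):
        # frame_patches: (batch, num_patches_over_all_frames, patch_dim)
        b = frame_patches.shape[0]
        x = self.patch_proj(frame_patches)
        latents = self.latents.unsqueeze(0).expand(b, -1, -1)
        # Attention mixes patch tokens and latent tokens together...
        mixed = self.blocks(torch.cat([latents, x], dim=1))
        # ...but only the latent tokens are kept; the view/appearance-heavy
        # patch tokens are discarded.
        return mixed[:, : latents.shape[1]]   # (batch, num_latents, dim)

encoder = ImplicitMotionEncoder()
patches = torch.randn(2, 16 * 256, 768)   # e.g. 16 frames x 256 patches each
motion_tokens = encoder(patches)          # (2, 64, 1024)
```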
Step-by-Step:
- Inputs:
- Reference image IR: who to animate.
- Driving video VD: the motion to follow.
- Text prompt T: camera path like "camera arcs left."
- Randomized Augmentations for View-Agnostic Learning: Hook: You know how practicing on different stages makes you better anywhere? The Concept (Random Perspective Augmentation): What it is: Slight changes to the driver frames' viewpoint during training so the encoder can't rely on a single angle. How it works: (1) Apply perspective warps; (2) add mild color/blur jitter; (3) also prevent identity leakage; (4) feed into encoder. Why it matters: Without this, the encoder might latch onto flat patterns tied to one camera. Anchor: Tilt a running clip a bit left or right; the distilled motion remains "running," not "left-tilted running." (A small implementation sketch follows.)
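One way to implement this kind of augmentation, sketched with torchvision transforms; the warp range and jitter strengths below are assumptions, not the paper's settings. The same random perspective warp is applied to the whole clip so it stays temporally coherent, and mild photometric jitter also weakens appearance/identity leakage.

```python
import torch
from torchvision import transforms

# Illustrative augmentation for driver clips; strengths are assumptions.
perspective = transforms.RandomPerspective(distortion_scale=0.3, p=0.9)
photometric = transforms.Compose([
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 1.5)),
])

def augment_driving_clip(clip: torch.Tensor) -> torch.Tensor:
    """clip: (T, C, H, W) float tensor in [0, 1]. One random perspective warp is
    shared by all T frames (the transform samples its parameters once per call),
    so the clip keeps a consistent, slightly shifted 'camera'."""
    clip = perspective(clip)
    clip = photometric(clip)
    return clip

dummy = torch.rand(16, 3, 256, 256)
augmented = augment_driving_clip(dummy)   # same shape, slightly warped viewpoint
```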
- Dual-Scale Motion Encoding: Hook: When you watch a performance, you notice both the big moves and the tiny finger flicks. The Concept (Dual Encoders: Body + Hand): What it is: One encoder for whole-body motion and another for fine hand gestures; their tokens are concatenated. How it works: (1) Crop/orient inputs for body vs. hands; (2) each encoder makes tokens; (3) join them; (4) send to generator. Why it matters: Without a hand encoder, finger details vanish; with only hands, you lose global rhythm. Anchor: A drum solo needs both arm swings (body) and stick taps (hands). (See the small sketch below.)
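As a tiny illustration (function and encoder names are hypothetical), the two encoders can simply be run on the body crop and the hand crops, with their token sequences concatenated before conditioning the generator:

```python
import torch

def encode_motion(body_clip, hand_clip, body_encoder, hand_encoder):
    """Hypothetical dual-scale encoding: global body rhythm + fine hand detail."""
    body_tokens = body_encoder(body_clip)   # (B, N_body, D): whole-body motion
    hand_tokens = hand_encoder(hand_clip)   # (B, N_hand, D): finger/hand gestures
    # Concatenate along the token axis; the generator's cross-attention can then
    # attend to whichever scale is relevant at each moment.
    return torch.cat([body_tokens, hand_tokens], dim=1)
```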
- Injecting Motion via Cross-Attention:
- After each self-attention block in the DiT generator, a cross-attention block lets video tokens query motion tokens so only relevant motion details are pulled in.
- Text tokens stay separate so the camera description remains a clear guide.
- Text-Guided Camera Control: Hook: Think of shouting stage directions: "Move the camera up!" and the crew responds. The Concept (Text-Guided Camera Control): What it is: Using plain language to steer camera trajectories during generation. How it works: (1) Include camera words in T; (2) the generator's native skill interprets them; (3) motion tokens supply motion; (4) the result is motion + camera together. Why it matters: Without text guidance, you can't flexibly craft shots: no pans, arcs, or zooms on demand. Anchor: "Camera pulls away while circling right" yields a smooth orbiting zoom-out around the subject.
- Training with View-Rich Supervision: Hook: Learning to dribble a ball from all sides makes you a better player everywhere on the court. The Concept (Same-View Reconstruction and Cross-View Reproduction): What it is: Two goals: rebuild the same clip, or rebuild the same motion from a different view. How it works:
- Same-view: input VD, predict VD; build motion expressiveness.
- Cross-view: input VD, predict V'D (same motion, different camera); build 3D awareness.
- Use the first frame of the target as the reference image to keep identity and facing consistent. Why it matters: Only reconstructing same views teaches 2D shortcuts. Cross-view forces true 3D motion learning. Anchor: A kick that looks right from the front must also look right from the side. (The pair-sampling logic is sketched below.)
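A simplified sketch of how a training pair could be assembled under these two objectives; the dataset structure, function name, and 50/50 mixing ratio are assumptions, not the released code.

```python
import random

def sample_training_pair(clip_group):
    """clip_group: list of synchronized clips of the same motion, one per camera.
    Returns (driving_clip, target_clip, reference_frame)."""
    driving = random.choice(clip_group)
    if len(clip_group) == 1 or random.random() < 0.5:
        # Same-view reconstruction: rebuild the driving clip itself.
        target = driving
    else:
        # Cross-view reproduction: same motion, a different synchronized camera.
        target = random.choice([c for c in clip_group if c is not driving])
    # The first frame of the *target* view serves as the reference image,
    # keeping identity and facing direction consistent with the output.
    reference_frame = target[0]
    return driving, target, reference_frame
```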
- Progressive Training Stages: Hook: First crawl, then walk, then run. The Concept (Three-Stage Curriculum): What it is: A schedule that gradually increases difficulty so learning stays stable and targeted. How it works:
- Stage 1: single-view reconstruction only; learn lots of motions.
- Stage 2: mix reconstruction + cross-view; start disentangling motion from viewpoint.
- Stage 3: only cross-view/moving-camera; polish 3D awareness and camera control. Why it matters: Jumping straight to hard tasks can destabilize training; the curriculum builds reliable skills. Anchor: Like practicing scales, then duets, then a full orchestra concert. (A toy schedule follows.)
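The curriculum can be written down as a simple data-mixture schedule. The mixing ratios below are illustrative placeholders, not values reported by the paper.

```python
# Data-mixture schedule per training stage (ratios are illustrative).
CURRICULUM = [
    # Stage 1: learn expressive motion from plentiful single-view clips.
    {"name": "stage1", "same_view": 1.0, "cross_view": 0.0, "moving_camera": 0.0},
    # Stage 2: start disentangling motion from viewpoint.
    {"name": "stage2", "same_view": 0.5, "cross_view": 0.4, "moving_camera": 0.1},
    # Stage 3: polish 3D awareness and text-guided camera control.
    {"name": "stage3", "same_view": 0.0, "cross_view": 0.6, "moving_camera": 0.4},
]

def task_probabilities(stage_index: int) -> dict:
    """Return the sampling probabilities for the current stage."""
    stage = CURRICULUM[stage_index]
    return {k: v for k, v in stage.items() if k != "name"}
```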
- Early Geometric Helper: Hook: Training wheels help at the start but must come off later. The Concept (Auxiliary Geometric Supervision with Annealing): What it is: A small MLP decodes motion tokens into SMPL/MANO poses during early training, then its loss weight fades to zero. How it works:
- Estimate pseudo 3D poses from off-the-shelf tools.
- Decode predicted poses from tokens.
- Supervise pose (excluding root orientation) to seed 3D sense.
- Linearly reduce this loss to zero. Why it matters: Without it, the powerful generator might ignore motion tokens early, slowing or collapsing learning. Anchor: Like a coach guiding posture early on, then letting the athlete perform independently. (A minimal sketch of the annealing follows.)
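A minimal sketch of the annealed helper, assuming an MSE pose loss and a linear decay; the pose dimensionality, probe architecture, decay horizon, and base weight are all assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PoseProbe(nn.Module):
    """Small MLP that decodes motion tokens into pseudo SMPL/MANO pose parameters."""
    def __init__(self, token_dim=1024, pose_dim=156):  # pose_dim is an assumption
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(token_dim, 512), nn.GELU(),
                                 nn.Linear(512, pose_dim))

    def forward(self, motion_tokens):                # (B, N, token_dim)
        return self.mlp(motion_tokens.mean(dim=1))   # (B, pose_dim)

def annealed_pose_loss(pose_probe, motion_tokens, pseudo_pose,
                       step, anneal_steps=20_000, w0=0.1):
    """Auxiliary loss whose weight linearly fades to zero over `anneal_steps`.
    `pseudo_pose` comes from off-the-shelf estimators and excludes root orientation."""
    weight = w0 * max(0.0, 1.0 - step / anneal_steps)
    if weight == 0.0:
        return motion_tokens.new_zeros(())           # helper fully removed
    return weight * F.mse_loss(pose_probe(motion_tokens), pseudo_pose)
```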
- Backbone Details:
- Causal 3D VAE compresses/decodes video latents.
- DiT blocks learn to denoise latents conditioned on text, reference, and motion tokens (via cross-attention).
- Rectified flow/flow-matching style training improves stability and speed.
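For readers who want a concrete picture of this kind of objective, here is a generic rectified-flow (flow-matching) training step with the conditioning paths named illustratively; `dit`, `vae`, and `motion_encoder` are placeholder modules, and this is not the authors' code.

```python
import torch
import torch.nn.functional as F

def rectified_flow_step(dit, vae, motion_encoder, video, text_emb, reference, driver):
    """One illustrative training step: predict the velocity from noise to data."""
    with torch.no_grad():
        latents = vae.encode(video)                   # clean video latents (x1)
    noise = torch.randn_like(latents)                 # x0
    t = torch.rand(latents.shape[0], device=latents.device)   # time in [0, 1]
    t_ = t.view(-1, *([1] * (latents.dim() - 1)))
    x_t = (1 - t_) * noise + t_ * latents             # linear interpolation path
    target_velocity = latents - noise                 # constant velocity along the path

    motion_tokens = motion_encoder(driver)            # view-agnostic motion code
    pred_velocity = dit(x_t, t, text_emb, reference, motion_tokens)
    return F.mse_loss(pred_velocity, target_velocity)
```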
Secret Sauce:
- A semantic bottleneck (tokens) that strips away view-specific layout.
- Cross-attention that lets the generator selectively use motion.
- View-rich data that rewards only motion consistent across angles.
- An annealed geometric hint that kickstarts 3D without handcuffing the model later.
04 Experiments & Results
The Test: The authors evaluated whether 3DiMo can (1) keep motion accurate and natural, (2) look visually good frame by frame, and (3) stay consistent across time and viewpoints, while allowing text-guided camera moves. They used PSNR/SSIM (pixel closeness), LPIPS/FID (perceptual quality), and FVD (video realism over time). They also ran a user study rating motion accuracy, naturalness, 3D plausibility, and overall quality.
The Competition: Baselines included two strong 2D-pose systems (AnimateAnyone, MimicMotion) and two 3D-SMPL-style systems (Uni3C, MTVCrafter). Most baselines do not support free camera text control, so the core comparison used static-camera prompts for fairness.
The Scoreboard with Context:
- 3DiMo achieved SSIM ≈ 0.739 and PSNR ≈ 17.96. While not the top on these pixel metrics, small viewpoint differences can hurt these scores even when the video looks better.
- 3DiMo led on LPIPS ≈ 0.220 (lower is better) and had strong FID ≈ 36.92 and FVD ≈ 297.4. Think of this as scoring an A in how real and consistent the video feels, while others get more like Bs or Cs.
- AnimateAnyone, for example, had much worse FID and FVD, meaning frames and motion felt less natural and coherent.
- User study: People rated 3DiMo highest for accuracy, naturalness, 3D plausibility, and overall look, like winning gold across all audience-choice awards.
Surprising Findings:
- Pixel-perfect metrics (SSIM/PSNR) can be a bit misleading: if 3DiMo chooses a slightly steadier camera than the ground truth's tiny drift, it may look better to humans but score lower by raw pixels. Still, the perceptual and temporal metrics (LPIPS/FID/FVD) favor 3DiMo.
- A variant that fed SMPL pose parameters directly as control caused classic depth mistakes (e.g., missed hand-to-hip contact from new views). The implicit tokens fixed this.
Ablations (What matters most):
- Remove cross-attention → motion control weakens; simple concatenation can't manage rich interactions.
- Skip view-rich training → camera control and 3D consistency suffer.
- No early geometric helper → unstable training; the model may ignore motion tokens at first.
- Drop the hand encoder → you lose fine finger/hand details.
Takeaway: The complete recipe (implicit tokens + cross-attention + view-rich curriculum + annealed geometric helper + dual encoders) delivers the strongest, most 3D-plausible motion reenactment and the best user ratings.
05 Discussion & Limitations
Limitations:
- Resolution is 480p, which can miss tiny details, especially in faces or fingers when the subject is small in frame.
- The system focuses on human motion; it doesn't explicitly model objects or tools (e.g., a ball or a bike), so interactions can be imperfect.
- It relies on a big pretrained video generator and multi-stage training, which need compute and data variety.
Required Resources:
- A strong DiT-style video generator pretrained on text/image-to-video.
- View-rich training clips: single-view in quantity, plus some multi-view and moving-camera data.
- GPUs and storage to train the encoder with the generator end-to-end.
When NOT to Use:
- If you need 4K ultra-sharp close-ups or tiny texture fidelity (e.g., film-grade VFX) without an upscale stage.
- If scenes have heavy human-object interactions (e.g., juggling multiple items) that demand explicit object motion modeling.
- If you require exact known camera parameters rather than text-guided camera control.
Open Questions:
- How to scale to higher resolutions efficiently without losing temporal stability?
- How to extend from "human-only" to "human + objects + scene" interactions in an implicit way?
- Can we replace pseudo 3D pose supervision entirely with more clever self-supervision while keeping stability?
- What is the best way to represent and edit the motion tokens (e.g., interpolate, compose, or stylize them)?
- How can we measure 3D plausibility more directly than 2D pixel metrics?
06 Conclusion & Future Work
3-Sentence Summary: 3DiMo learns a tiny, view-agnostic motion code from ordinary videos and feeds it into a 3D-savvy video generator via cross-attention, so motion stays right even when you move the camera with text. Trained with single-view, multi-view, and moving-camera clips, and helped early by a gentle, fading geometric hint, it captures the essence of 3D motion without being handcuffed to noisy 3D reconstructions. The result is higher perceptual quality, better 3D plausibility, and flexible, text-driven camera control compared to 2D- and SMPL-based baselines.
Main Achievement: Showing that implicit, end-to-end motion tokens aligned with a pretrained video model's own spatial priors can outperform explicit pose control, especially for 3D consistency and camera freedom.
Future Directions:
- Scale to higher resolutions and add cascaded super-resolution.
- Extend implicit motion to handle human-object interactions.
- Build tools to edit, mix, or stylize motion tokens.
- Explore self-supervised or geometry-free warm starts to remove reliance on auxiliary pose labels.
Why Remember This: 3DiMo flips the script: rather than forcing the generator to obey an external 3D skeleton, it learns a compact motion language the generator naturally understands. That shift unlocks cleaner 3D motion, better-looking videos, and camera moves on command from simple text.
Practical Applications
- Single-image novel view synthesis: orbit around a person using only one photo by prompting "camera circles right, subject stays still."
- Video stabilization: take a shaky human clip and prompt "camera remains static" to keep motion but remove jitter.
- Cinematic camera moves: generate arcs, dolly-ins, and zooms with text while preserving the driver clip's motion.
- Avatar animation for creators: animate a character image with a favorite dance clip, then pick the best camera path.
- Sports motion study: replay the same motion from multiple angles to analyze technique and timing.
- Virtual try-on or performance previews: show how a person might move in a scene and view it from different angles.
- Gesture-intensive content: preserve finger and hand details for sign language snippets or instrument tutorials.
- Game and XR prototyping: quickly get view-consistent motion for NPCs from everyday video references.
- Automatic motion-image alignment: match motion to the reference subject's facing direction without manual calibration.
- Motion style remix: swap the same motion tokens across different characters, then direct the camera with text.