
Joint 3D Geometry Reconstruction and Motion Generation for 4D Synthesis from a Single Image

Intermediate
Yanran Zhang, Ziyi Wang, Wenzhao Zheng et al. Ā· 12/4/2025
arXiv Ā· PDF

Key Summary

  • This paper teaches a computer to turn a single picture into a moving 3D scene that stays consistent from every camera angle.
  • Past methods either made cool motion but broke 3D shape, or kept 3D shape but made boring, limited motion; this work joins both at once.
  • The authors build a new dataset, TrajScene-60K, with 60,000 real videos plus dense point tracks and depth to teach motion and shape together.
  • Their main engine (4D-STraG) is a diffusion transformer that predicts how every point in the scene moves in 3D over time, starting from one image.
  • A depth-guided motion normalization trick teaches the model that close objects should appear to move more than far ones, keeping motion stable.
  • A Motion Perception Module (MPM) points out where motion is likely to happen so the model moves the right parts more and the background less.
  • A separate module (4D-ViSM) renders the moving 3D points into videos from any camera path and smartly fills in missing areas.
  • Across multiple tests, the method shows stronger dynamics and better 3D consistency than strong baselines, while running efficiently.
  • Ablation studies show each piece (depth latents, normalization, MPM) clearly boosts stability, realism, and motion quality.
  • This joint approach is a step toward easy, interactive AR/VR content, animated game assets, and creative tools from just a single photo.

Why This Research Matters

Turning a single picture into a believable 3D scene that moves unlocks instant content creation for AR filters, educational demos, and game assets. Because motion and shape are learned together, the results stay consistent when you move the camera, avoiding the usual ā€œmeltingā€ or ā€œdriftingā€ look. This joint approach means artists and developers don’t need full 3D scans or multi-view videos to get dynamic scenes. The method also offers a practical balance of quality and speed, making it closer to real-world use. Beyond art, better 4D understanding supports robotics and simulation, where correct motion–shape coupling matters. The new dataset gives the community a stronger base to train and evaluate such systems. Overall, it nudges AI toward tools that can imagine and animate our world from minimal input, safely and coherently.

Detailed Explanation


01Background & Problem Definition

šŸž Top Bread (Hook): You know how a flipbook turns drawings into motion? Now imagine doing that, but in 3D, and starting from just one photograph.

🄬 Filling (The Actual Concept)

  • What it is: 4D scene generation means building a 3D scene that changes over time (the fourth dimension), so you can move the camera and still see everything move correctly.
  • How it works (story of the field):
    1. For a long time, AI video tools made great-looking motion but stayed in flat 2D pixels, so shapes bent or drifted when you changed the camera.
    2. Other tools first built a clean 3D model and then tried to animate it, but motions were simple and limited (like only swinging or bobbing).
    3. Because motion and shape were solved separately, they often disagreed—like a puppet dancing while its strings get tangled.
  • Why it matters: If motion isn’t tied to the real 3D shape, you get weird artifacts—faces stretch, arms detach, textures slide—so the magic breaks when you move the camera.

šŸž Bottom Bread (Anchor): Think of turning a single photo of a violinist into a short scene: you want the bow to move, the arm to bend, and the camera to orbit around smoothly, all without the violin melting or the background wobbling.

The World Before:

  • Generate-then-reconstruct: First, a video model makes multi-view clips; then a 3D model tries to rebuild the scene. Cool motion, but 3D falls apart because the video wasn’t truly 3D-aware.
  • Reconstruct-then-generate: First, recover a static 3D shape; then animate it. Stable 3D, but motions are small or externally driven (like a fan pushing leaves), not self-initiated (like a fox starting to run).

The Problem:

  • Split thinking (shape here, motion there) leads to spatiotemporal inconsistency: motion that disagrees with geometry.
  • From a single picture, there’s massive uncertainty: which parts should move, how far, and how should that look from any camera angle?

Failed Attempts:

  • Using video-only diffusion: crisp pixels but shape drifts across frames and views.
  • Animating a fixed mesh: preserves shape but misses rich, emergent motions.
  • Simple normalization (like min–max): motion scales explode when depth varies, causing jitter and instability.

The Gap:

  • We need a single brain that learns shape and motion together, so that what moves and how it moves always respects the real 3D structure.

Real Stakes:

  • Everyday creativity: make AR stickers that stay glued to objects, animate product shots, bring yearbook photos to life.
  • Training robots and self-driving systems: realistic 3D-consistent motion helps simulate edge cases safely.
  • Education and science: visualize processes (like heartbeats or weather) from sparse images without breaking 3D reality.

New Direction in This Paper:

  • A joint framework (MoRe4D) where geometry and motion co-evolve inside one diffusion process.
  • A large dataset (TrajScene-60K) with dense point tracks to teach what realistic, physically plausible motion looks like in 3D.
  • Smart tricks to stabilize learning from a single image: depth-guided motion normalization and a motion perception module that nudges the model toward likely movers.

šŸž Bottom Bread (Anchor): Imagine a single photo of a surfer. With MoRe4D, the surfer leans, the board cuts the wave, the camera swings left, and the shoreline stays solid—no rubbery bodies, no swimming textures, just a believable 4D scene born from one image.

02Core Idea

šŸž Top Bread (Hook): Imagine building a LEGO city where, as you add streets (motion), the buildings (geometry) automatically adjust so nothing collides or floats.

🄬 Filling (The Actual Concept)

  • What it is: The key insight is to jointly generate 3D shape and motion as one coupled process, so every move respects the underlying 3D structure.
  • How it works (in spirit):
    1. Start from one image and estimate initial 3D hints (depth, points).
    2. Use a diffusion transformer to predict how each 3D point should move over time (point trajectories), while keeping the evolving 3D structure consistent.
    3. Normalize motion by depth so nearby points don’t jitter wildly and faraway points don’t freeze.
    4. Use a motion perception module to highlight where motion is likely (arms, legs, wheels) and keep backgrounds calm.
    5. Render any viewpoint and fill missing parts to make a complete, pretty video.
  • Why it matters: Without joint learning, you either get lively but broken 3D, or solid 3D that barely moves. The joint approach gives both.

šŸž Bottom Bread (Anchor): From a single photo of a street drummer, you get the drumsticks tapping, wrists bending, and the camera circling—while the drum set and alleyway stay rigid and correct.

Multiple Analogies:

  1. Orchestra Conductor: Geometry is sheet music; motion is how the musicians play. A good conductor keeps both in sync—no one rushes or drifts off-key.
  2. Puppet and Strings: The puppet’s shape is geometry; the strings’ pulls are motion. If you pull without respecting the joints, the puppet tangles. Joint modeling makes every pull anatomically safe.
  3. GPS with Road Rules: Motion is your route; geometry is the map. You can’t drive through buildings; your path must follow streets. Joint modeling obeys the map while planning smooth travel.

Before vs After:

  • Before: Motion made in 2D often breaks when rotated; static geometry leads to timid, external-only motion.
  • After: Motion and shape are generated together; motion looks bold yet physically grounded, and the scene stays solid under camera moves.

Why It Works (intuition not math):

  • Diffusion learns to transform noisy guesses into clean trajectories; by feeding depth and motion cues, the model discovers stable, scale-aware patterns.
  • Normalizing motion by depth equalizes learning across near and far points, preventing explosions or collapse.
  • Motion perception focuses learning on likely movers, so the model doesn’t waste effort animating walls.

Building Blocks (each with a mini sandwich):

šŸž Hook: You know how a stop-motion movie moves tiny parts frame by frame? 🄬 Concept: 4D Scene Generation is making a 3D world that changes over time.

  • How: Build 3D structure, plan motion for each point over frames, render from any camera.
  • Why: Without 3D, motion breaks when you turn the camera. šŸž Anchor: A fox photo becomes a short clip where paws step and the camera pans, all staying 3D-correct.

šŸž Hook: Imagine dancers (motion) practicing on a stage (geometry) that grows with them. 🄬 Concept: Joint Geometry–Motion Generation couples shape and move planning together.

  • How: Predict trajectories while considering current 3D structure.
  • Why: Prevents limbs stretching or background drifting. šŸž Anchor: A runner’s knees bend while the track stays flat as the camera circles.

šŸž Hook: Think of tracing where confetti pieces fly. 🄬 Concept: Point Trajectories track where each 3D point goes over time.

  • How: Start from the first frame’s points and predict their paths.
  • Why: Fine-grained motion beats coarse, blocky animation. šŸž Anchor: Each pixel on a dog’s ear becomes a 3D point that flaps realistically as it runs.

šŸž Hook: Objects close to you seem to whiz by faster from a car window. 🄬 Concept: Depth-Guided Motion Normalization scales motion by distance.

  • How: Normalize displacements using depth so near/far points train fairly.
  • Why: Stops jitter for near points and freeze for far points. šŸž Anchor: A hand near the camera moves a lot on-screen; a mountain far away barely shifts—training respects this.

šŸž Hook: A coach spots which players will sprint. 🄬 Concept: Motion Perception Module (MPM) predicts likely movers in the image.

  • How: Extract motion-aware features and inject them token-by-token into the diffusion blocks.
  • Why: Focuses motion where it belongs; keeps backgrounds steady. šŸž Anchor: The model animates a dancer’s skirt more than the wall behind her.

šŸž Hook: A cleanup crew polishes a messy sketch into a clean art piece. 🄬 Concept: Diffusion and Denoising iteratively remove noise to reveal coherent trajectories.

  • How: Start noisy, step-by-step refine using learned patterns (flow matching here).
  • Why: Stable, realistic motion emerges from uncertainty. šŸž Anchor: Rough guesses of motion become smooth, believable point paths.

šŸž Hook: A camera drone films from any path. 🄬 Concept: 4D View Synthesis Module renders the moving 3D points from any camera and fills holes.

  • How: Project points, detect gaps, inpaint missing regions with a video model.
  • Why: Guarantees complete, pretty videos even for views we didn’t see. šŸž Anchor: You fly upward over a surfer and still see a coherent ocean without holes.

03Methodology

At a high level: Single Image + Caption → (Initial Depth + Points) → 4D Scene Trajectory Generator (predict per-point motion over time) → 4D View Synthesis Module (render any camera path, fill holes) → Final Video.
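
To make that data flow concrete, here is a toy, runnable Python walk-through of the pipeline. Every stage is a placeholder that only produces arrays of plausible shapes; none of the real components (the UniDepthv2 depth estimator, the 4D-STraG diffusion transformer, or the 4D-ViSM renderer) are called, and the function names, resolutions, and shapes are illustrative assumptions rather than the authors' actual code.

```python
import numpy as np

# Toy stand-ins for each stage of the MoRe4D pipeline described above.
# The resolution is downscaled so the example runs instantly.
H, W, T = 92, 128, 49

def estimate_depth(image):                      # stand-in for a monocular depth model
    return np.full(image.shape[:2], 2.0, dtype=np.float32)

def backproject(depth, fx=125.0, fy=125.0, cx=64.0, cy=46.0):
    v, u = np.mgrid[0:depth.shape[0], 0:depth.shape[1]]
    Z = depth
    return np.stack([(u - cx) / fx * Z, (v - cy) / fy * Z, Z], -1).reshape(-1, 3)

def generate_trajectories(image, caption, depth, num_frames):   # stand-in for 4D-STraG
    return np.zeros((num_frames, depth.size, 3), dtype=np.float32)

def render_views(trajectories, camera_path):    # stand-in for 4D-ViSM
    return np.zeros((len(trajectories), H, W, 3), dtype=np.uint8)

image = np.zeros((H, W, 3), dtype=np.uint8)
caption = "a surfer riding a wave"
depth = estimate_depth(image)                   # Step 1: initial 3D hints
points0 = backproject(depth)                    # first-frame point cloud, (H*W, 3)
disp = generate_trajectories(image, caption, depth, T)   # Steps 2-7: per-point motion
trajectories = points0[None] + disp             # (T, H*W, 3) 4D point trajectories
video = render_views(trajectories, camera_path=None)     # Step 8: render + inpaint
print(points0.shape, trajectories.shape, video.shape)
```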

Step 0: Data and Priors šŸž Hook: Learning to dance by watching lots of dancers. 🄬 Concept: TrajScene-60K is a 60k-video dataset with dense point tracks, depth, and captions.

  • How: Curate quality videos, filter with VLMs for countable, self-initiated motion; extract dense 4D point tracks; clean with depth checks; render references via Gaussian Splatting.
  • Why: Joint learning needs real, detailed examples to understand how scenes truly move in 3D. šŸž Anchor: Many foxes sniff and walk in different terrains—this variety teaches the model what plausible motion looks like.
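
As a rough illustration of what one training example might contain and how a depth check could prune bad tracks, the sketch below builds a dummy sample and discards clips whose tracked points disagree with the depth maps. Field names, shapes, intrinsics, and thresholds are all made-up assumptions; they are not the real TrajScene-60K schema or the authors' filtering code.

```python
import numpy as np

# Hypothetical TrajScene-60K-style sample plus a toy depth-consistency filter.
# Everything here (field names, shapes, intrinsics, thresholds) is illustrative.

def make_dummy_sample(T=49, H=368, W=512, N=4096):
    return {
        "caption": "a fox walking through snow",
        "frames": np.zeros((T, H, W, 3), dtype=np.uint8),       # RGB video
        "depth":  np.full((T, H, W), 2.0, dtype=np.float32),    # per-frame depth maps
        "tracks": np.random.uniform(0.5, 4.0, (T, N, 3)).astype(np.float32),  # 3D point tracks
    }

def depth_consistent(sample, K, rel_tol=0.1, max_bad_fraction=0.2):
    """Keep a clip only if most tracked points agree with the depth maps."""
    fx, fy, cx, cy = K
    T, N, _ = sample["tracks"].shape
    H, W = sample["depth"].shape[1:]
    bad = 0.0
    for t in range(T):
        X, Y, Z = sample["tracks"][t].T
        u = np.clip((fx * X / Z + cx).astype(int), 0, W - 1)    # project onto the image
        v = np.clip((fy * Y / Z + cy).astype(int), 0, H - 1)
        d = sample["depth"][t, v, u]                            # depth map at those pixels
        bad += np.mean(np.abs(d - Z) / np.maximum(d, 1e-6) > rel_tol)
    return bad / T < max_bad_fraction

sample = make_dummy_sample()
print(depth_consistent(sample, K=(500.0, 500.0, 256.0, 184.0)))
```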

Step 1: Initial 3D Hints šŸž Hook: Before building a fort, you sketch a floor plan. 🄬 Concept: Depth Estimation from the single image gives a first guess of distances.

  • How: Use a monocular depth model (UniDepthv2) to get per-pixel depth; back-project to a point cloud in 3D.
  • Why: Without depth, the model can’t tell near from far, causing wrong motion scales. šŸž Anchor: The surfer is closer than the cliff; the model treats their motion differently.
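
For readers who want to see the geometry, here is a minimal numpy sketch of back-projecting a depth map into a 3D point cloud with a pinhole camera model. The intrinsics and the constant depth map are made up for illustration; in the paper the depth comes from a monocular model (UniDepthv2), which is not called here.

```python
import numpy as np

# Back-project a depth map into a camera-frame 3D point cloud with a pinhole
# model. Intrinsics and the constant depth are made-up values for illustration.

def backproject_depth(depth, fx, fy, cx, cy):
    """depth: (H, W) -> (H*W, 3) point cloud in the camera frame."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))   # pixel coordinates
    Z = depth
    X = (u - cx) / fx * Z                            # unproject along camera rays
    Y = (v - cy) / fy * Z
    return np.stack([X, Y, Z], axis=-1).reshape(-1, 3)

depth = np.full((368, 512), 2.0, dtype=np.float32)   # stand-in for UniDepthv2 output
points0 = backproject_depth(depth, fx=500.0, fy=500.0, cx=256.0, cy=184.0)
print(points0.shape)   # (188416, 3)
```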

Step 2: Motion as Relative Trajectories šŸž Hook: It’s easier to say, ā€œmove 2 steps from here,ā€ than ā€œgo to X=37.4.ā€ 🄬 Concept: Predict relative motion for each point over time instead of absolute coordinates.

  • How: For each point, learn displacements across frames.
  • Why: Relative changes are easier to learn and more stable. šŸž Anchor: The violin bow tip moves a little each frame, forming a smooth path.
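
A tiny sketch of the idea, assuming displacements are taken relative to the first frame (whether the paper references the first frame or the previous frame is an implementation detail not reproduced here):

```python
import numpy as np

# Work with relative displacements instead of absolute (X, Y, Z) positions.

def to_displacements(positions):
    """positions: (T, N, 3) absolute point positions -> (T, N, 3) offsets from frame 0."""
    return positions - positions[0:1]

def from_displacements(points0, displacements):
    """Invert: first-frame points (N, 3) + offsets (T, N, 3) -> absolute trajectories."""
    return points0[None] + displacements

T, N = 49, 1000
traj = np.cumsum(np.random.randn(T, N, 3) * 0.01, axis=0)   # smooth toy motion
disp = to_displacements(traj)
recon = from_displacements(traj[0], disp)
print(np.allclose(recon, traj))   # True: the representation loses nothing
```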

Step 3: Depth-Guided Motion Normalization šŸž Hook: Objects near your nose seem to zip; far mountains crawl. 🄬 Concept: Scale motion by depth so training treats all distances fairly.

  • What happens: The model divides per-point motion by a depth-dependent scale (like the frustum size) to make learning scale-invariant.
  • Why this step: Without it, scenes with big depth ranges cause either jitter (near) or lifelessness (far).
  • Example: A dancer close to camera pivots a lot in pixels; normalization keeps the learning balanced.
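
A minimal sketch of the idea: divide each point's displacement by a scale that grows with its depth, so near and far points contribute comparably during training. The exact scale in the paper is described as frustum-size-like; the depth Ā· tan(fov/2) factor below is an illustrative assumption, not the authors' formula.

```python
import numpy as np

# Illustrative depth-guided motion normalization: scale each point's motion by
# a depth-dependent factor (here depth * tan(fov/2), a stand-in for the
# frustum-size-like scale described in the paper).

def normalize_motion(displacements, depths, fov_deg=60.0, eps=1e-6):
    """displacements: (T, N, 3); depths: (N,) first-frame depth per point."""
    scale = depths * np.tan(np.radians(fov_deg) / 2.0) + eps    # (N,)
    return displacements / scale[None, :, None], scale

def denormalize_motion(normalized, scale):
    return normalized * scale[None, :, None]                    # back to scene units

T, N = 49, 1000
disp = np.random.randn(T, N, 3) * 0.05
depths = np.random.uniform(0.5, 20.0, N)                        # mix of near and far points
normed, scale = normalize_motion(disp, depths)
roundtrip = denormalize_motion(normed, scale)
print(np.allclose(roundtrip, disp))                             # True: normalization is invertible
```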

Step 4: Motion-Sensitive VAE and Trajectory Encoding šŸž Hook: Turn a maze into a simple map before solving it. 🄬 Concept: A motion-sensitive VAE encodes trajectory maps so the diffusion model can learn them well.

  • What happens: Convert per-point displacements into RGB-like motion maps; a VAE encoder-decoder plus tiny trajectory encoder/decoder preserves fine motion.
  • Why this step: Standard VAEs may blur subtle motions; this keeps details crisp.
  • Example: Tiny skirt ripples still reconstruct after VAE decoding.
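
As a rough sketch of the encoding idea (not the paper's motion-sensitive VAE itself), the snippet below packs per-pixel 3D displacements into three-channel "motion maps" in [0, 1] that an image or video VAE could then compress; the fixed magnitude range and the (dx, dy, dz) → RGB mapping are assumptions.

```python
import numpy as np

# Pack per-pixel 3D displacements into RGB-like "motion maps" a VAE can encode.
# The channel mapping and fixed range are illustrative, not the paper's exact encoding.

def displacements_to_motion_maps(disp, max_mag=1.0):
    """disp: (T, H, W, 3) per-pixel displacements -> (T, H, W, 3) maps in [0, 1]."""
    return np.clip(disp / (2.0 * max_mag) + 0.5, 0.0, 1.0)

def motion_maps_to_displacements(maps, max_mag=1.0):
    return (maps - 0.5) * 2.0 * max_mag

T, H, W = 49, 46, 64
disp = np.random.randn(T, H, W, 3).astype(np.float32) * 0.1
maps = displacements_to_motion_maps(disp)
recon = motion_maps_to_displacements(maps)
print(np.abs(recon - disp).max())   # tiny, since |disp| stays well below max_mag
```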

Step 5: Diffusion Transformer with Flow Matching and Depth Latents šŸž Hook: Sculpting from a rough block into a statue, guided by a blueprint. 🄬 Concept: A Diffusion Transformer (DiT) learns to denoise motion latents into clean trajectories, guided by image, noise, and depth latents.

  • What happens: Concatenate image, noise, and depth latents; train with flow matching so the model predicts the velocity from noisy to clean states.
  • Why this step: Flow matching gives stable, deterministic training for pixel-level motion; depth latents inject strong 3D priors.
  • Example: The model learns to slide noisy guesses toward believable motion paths.
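
Below is a minimal flow-matching training step, with a tiny MLP standing in for the diffusion transformer and random vectors standing in for the concatenated image, noise, and depth latents. The straight-line interpolation path and the velocity target (clean minus noise) follow the standard rectified-flow recipe, which is an assumption about the paper's exact formulation.

```python
import torch
import torch.nn as nn

# Minimal flow-matching training step; a small MLP stands in for the DiT.
torch.manual_seed(0)
dim = 64
model = nn.Sequential(nn.Linear(dim + 1, 256), nn.SiLU(), nn.Linear(256, dim))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def flow_matching_step(x1_clean, model, opt):
    """x1_clean: (B, dim) clean motion latents (toy stand-in)."""
    B = x1_clean.shape[0]
    x0 = torch.randn_like(x1_clean)                 # pure noise
    t = torch.rand(B, 1)                            # random time in [0, 1]
    xt = (1 - t) * x0 + t * x1_clean                # point on the straight path
    target_v = x1_clean - x0                        # velocity from noise to clean
    pred_v = model(torch.cat([xt, t], dim=-1))      # model predicts that velocity
    loss = ((pred_v - target_v) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

for step in range(3):
    print(flow_matching_step(torch.randn(16, dim), model, opt))
```

At inference, the learned velocity field would be integrated from noise toward clean latents, which is how the "noisy guesses slide toward believable motion paths" in practice.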

Step 6: Motion Perception Module (MPM) with MAdaNorm šŸž Hook: A spotlight tells the audience where to look. 🄬 Concept: MPM finds likely movers and injects that hint into DiT layer-by-layer.

  • What happens: Extract motion-aware patch features (OmniMAE), align them to tokens, and modulate attention/MLP via Motion-aware Adaptive Normalization (MAdaNorm) per token.
  • Why this step: Prevents animating walls or skies; amplifies limbs, wheels, and tools.
  • Example: The model boosts motion on a dog’s legs more than on the sofa.
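
The sketch below shows the general adaptive-normalization pattern applied per token: motion-aware features aligned to the DiT tokens produce a per-token scale and shift that modulate the normalized activations. The actual MAdaNorm parameterization (including how it gates the attention and MLP branches) is not reproduced; this is an illustrative assumption.

```python
import torch
import torch.nn as nn

# Token-wise motion-aware adaptive normalization, in the spirit of MAdaNorm:
# each token gets its own scale/shift predicted from motion features.

class MotionAdaNorm(nn.Module):
    def __init__(self, dim, motion_dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(motion_dim, 2 * dim)

    def forward(self, tokens, motion_feats):
        """tokens: (B, N, dim); motion_feats: (B, N, motion_dim), one per token."""
        scale, shift = self.to_scale_shift(motion_feats).chunk(2, dim=-1)
        return self.norm(tokens) * (1 + scale) + shift   # likely movers get boosted

B, N, dim, mdim = 2, 128, 64, 32
layer = MotionAdaNorm(dim, mdim)
out = layer(torch.randn(B, N, dim), torch.randn(B, N, mdim))
print(out.shape)   # torch.Size([2, 128, 64])
```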

Step 7: De-normalize and Build the 4D Representation šŸž Hook: After solving in ā€œscaled units,ā€ convert back to real meters. 🄬 Concept: Undo depth-based scaling to recover real-world-like motion and fuse with the initial 3D points.

  • What happens: Multiply back by depth-dependent factors and add to the first-frame point cloud to get full 4D point trajectories.
  • Why this step: Produces a coherent, physically sensible 4D scene.
  • Example: The surfer’s board arcs just the right amount in 3D space.
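
Continuing the earlier normalization sketch, this snippet multiplies predicted (normalized) displacements back by the same illustrative depth-dependent factor and adds them to the first-frame point cloud; the scale formula is again an assumption, not the paper's exact one.

```python
import numpy as np

# De-normalize predicted motion and assemble the 4D point trajectories.

def build_4d_trajectories(points0, normalized_disp, fov_deg=60.0):
    """points0: (N, 3) first-frame points; normalized_disp: (T, N, 3)."""
    scale = points0[:, 2] * np.tan(np.radians(fov_deg) / 2.0)   # depth-dependent factor
    disp = normalized_disp * scale[None, :, None]               # back to scene units
    return points0[None] + disp                                 # (T, N, 3) 4D trajectories

N, T = 1000, 49
points0 = np.random.uniform([-1, -1, 1], [1, 1, 10], size=(N, 3))
normalized_disp = np.random.randn(T, N, 3) * 0.02
traj4d = build_4d_trajectories(points0, normalized_disp)
print(traj4d.shape)   # (49, 1000, 3)
```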

Step 8: 4D View Synthesis Module (4D-ViSM) šŸž Hook: When assembling a jigsaw, you fill gaps to complete the picture. 🄬 Concept: Render points from any camera, detect holes where no points project, and inpaint missing regions with a finetuned video diffusion model.

  • What happens: Project point cloud per frame; create occlusion masks; the video model (Wan2.1-style) uses the mask and partial render to fill in consistent content.
  • Why this step: Arbitrary camera paths reveal unseen areas; we need coherent hallucination.
  • Example: Tilting up over a surfer shows sky patches the points can’t cover; 4D-ViSM paints a stable sky.
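
Here is a toy sketch of the rendering side only: splat each frame's point cloud into the target camera with a z-buffer and mark pixels that no point reaches as the occlusion mask handed to the inpainting model. The actual hole filling uses a finetuned video diffusion model (Wan2.1-style) and is not included; the intrinsics and the per-point loop are simplifications.

```python
import numpy as np

# Project a per-frame point cloud into a target camera and build the hole mask.

def render_frame(points, colors, K, H=368, W=512):
    """points: (N, 3) in the target camera frame; colors: (N, 3) in [0, 1]."""
    fx, fy, cx, cy = K
    image = np.zeros((H, W, 3), dtype=np.float32)
    zbuf = np.full((H, W), np.inf)
    hole_mask = np.ones((H, W), dtype=bool)          # True = needs inpainting
    X, Y, Z = points.T
    valid = Z > 1e-3                                 # keep points in front of the camera
    u = np.round(fx * X[valid] / Z[valid] + cx).astype(int)
    v = np.round(fy * Y[valid] / Z[valid] + cy).astype(int)
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    for ui, vi, zi, ci in zip(u[inside], v[inside], Z[valid][inside], colors[valid][inside]):
        if zi < zbuf[vi, ui]:                        # keep the closest point per pixel
            zbuf[vi, ui] = zi
            image[vi, ui] = ci
            hole_mask[vi, ui] = False
    return image, hole_mask

points = np.random.uniform([-1, -1, 2], [1, 1, 6], size=(20000, 3))
colors = np.random.rand(20000, 3)
img, mask = render_frame(points, colors, K=(500.0, 500.0, 256.0, 184.0))
print(mask.mean())   # fraction of pixels the inpainting model must fill
```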

Secret Sauce (why this recipe is clever):

  • Joint training of geometry and motion in a single diffusion backbone prevents error accumulation common in two-stage pipelines.
  • Depth-guided normalization stabilizes learning across distances, taming jitter and drift.
  • Token-wise motion conditioning (MAdaNorm) focuses motion where it belongs, boosting dynamics without breaking structure.
  • A dedicated renderer (4D-ViSM) turns raw trajectories into polished, multi-view videos by smartly completing what the points can’t show.

04Experiments & Results

šŸž Top Bread (Hook): Think of a school talent show. You don’t just ask, ā€œWas it good?ā€ You ask, ā€œWas it smooth, clear, exciting, and did it stay on beat?ā€

🄬 Filling (The Actual Concept)

  • What they measured and why: They used VBench, a widely used video quality test, to check six skills—Subject Consistency (looks the same over time), Background Consistency (no weird wobbles), Motion Smoothness (fluid moves), Dynamic Degree (how much action), Aesthetic Quality (looks nice), and Imaging Quality (sharp and clean). These matter because a 4D method must be pretty, stable, and truly moving.

The Competition:

  • Generate-then-reconstruct: 4Real, DimensionX, Free4D. Strong video priors but often weak 3D consistency.
  • Reconstruct-then-generate: Gen3C. Stable geometry but motion can be limited.
  • Also compared to GenXD and S-Director (DimensionX’s 3D) under certain camera paths.

Scoreboard with Context:

  • Group I (simple camera paths; vs 4Real): MoRe4D showed higher dynamics, aesthetic, and imaging quality—like getting best ā€œstage presenceā€ and ā€œvideo clarity,ā€ even if 4Real scored slightly higher on strict subject consistency.
  • Group II (moderate rotations; vs GenXD, DimensionX): MoRe4D improved on consistency and visual quality—like keeping the actor on-mark during a spin.
  • Group III (complex camera moves; vs Gen3C, Free4D): MoRe4D led in aesthetics and imaging quality and kept motion realistic—like nailing a tough dance while still looking great.

Concrete numbers (samples):

  • Against 4Real, MoRe4D boosted Aesthetic Quality from about 0.51 to about 0.56 and Imaging Quality from about 0.51 to about 0.62, signaling crisper, nicer-looking videos.
  • Against Gen3C and Free4D on complex trajectories, MoRe4D's Aesthetic (about 0.48) and Imaging (about 0.59) scores stood clearly above the baselines' roughly 0.36–0.48 on both metrics, like getting an A when others got B–C.

Surprising Findings:

  • Motion Perception Module (MPM) strongly affects how much action appears: removing it dropped the ā€œDynamic Degreeā€ (0.90 → 0.85). It’s like the coach didn’t show up and the team played it safe.
  • Depth features are not optional: without depth latents, both consistency and dynamics fell—proof that 3D cues anchor believable motion.
  • Depth-guided normalization improved aesthetics too, not just stability; balanced motion looks better to humans.

Ablations (what happens if we remove parts):

  • Without Depth-Guided Normalization: More jitter, exaggerated moves in near regions; scores dipped in consistency and aesthetics.
  • Without MPM: Motions got weaker; scenes felt less alive.
  • Without Depth Latents: Structure–motion coupling loosened; parts didn’t move together as well.
  • MAdaNorm detail: Using only a global token (no patch-level features) slightly helped consistency but hurt dynamics and aesthetics; fine-grained motion cues matter for lively, pretty results.

Efficiency Matters:

  • MoRe4D runs about 6 minutes total (roughly 3 minutes for trajectories and 3 minutes for novel-view video) at 512Ɨ368 for 49 frames—balanced quality and speed. Some baselines are faster but at lower resolution and fewer frames; others are much slower.

Extra Check: 4D Consistency via a VLM Rater

  • A large vision-language model rated 3D geometric consistency, temporal texture stability, identity preservation, motion–geometry coupling, and background stability from 1–5.
  • MoRe4D scored higher across groups, especially in motion–geometry coupling—like being praised for dancing perfectly on the beat of the music (the 3D structure).

šŸž Bottom Bread (Anchor): Picture a ballerina photo turned into a scene with spins and a circling camera. MoRe4D keeps the ballerina’s body solid, the tutu fluttering just right, and the stage steady—earning high marks from judges for both artistry and technique.

05Discussion & Limitations

Limitations (honest talk):

  • Data Bias: TrajScene-60K comes from web videos and LLM/VLM filters; some motions, body types, or regions may be underrepresented, nudging the model toward popular online patterns.
  • Single-Image Ambiguity: From one photo, many futures are possible. Even with joint modeling, the chosen motion is one of many plausible ones; it can’t read your mind.
  • Metric Gaps: VBench is great for appearance and smoothness but doesn’t fully measure deep 3D faithfulness (like tiny texture sliding). The paper uses a VLM rater to fill the gap, but standardized 4D metrics remain an open need.
  • Coverage Holes: No point cloud covers everything under wild camera moves. 4D-ViSM inpaints missing parts well, but these regions are ā€œbest guesses,ā€ not strict reconstructions.
  • Compute and Memory: Joint diffusion over long sequences and dense points is heavy compared to tiny models, though the team kept runtime practical.

Required Resources:

  • A capable GPU (the paper used 4Ɨ H20 GPUs for training; a single high-end GPU suffices for inference, with runtimes of a few minutes per sample).
  • Pretrained backbones (Wan2.1-style), depth estimation (UniDepthv2), and motion features (OmniMAE).
  • The TrajScene-60K dataset or similarly curated tracks + depths for training.

When NOT to Use:

  • Ultra-precise, metric-true reconstruction needs (e.g., engineering measurement) where any inpainted view or guessed motion is unacceptable.
  • Scenes where the single photo misses critical parts (e.g., occluded limbs you must accurately recover) and hallucination would be misleading.
  • Tasks demanding guaranteed physical simulation (exact forces/torques), not just visually plausible motion.

Open Questions:

  • Unified Metrics: How do we automatically score 4D consistency (geometry + texture + motion coupling) across arbitrary camera paths?
  • Stronger Priors: Can we inject articulated body models, object part priors, or physics to guide motion beyond ā€œplausibleā€ toward ā€œpredictiveā€?
  • Lighter Representations: Can we compress point trajectories or switch to efficient primitives (e.g., dynamic Gaussians) for mobile devices?
  • Editing and Control: How can users sketch motion, set constraints, or say, ā€œmove the left hand only,ā€ with fine control but no 3D modeling?
  • Safety and Fairness: How do we prevent biased motion patterns or misleading inpainted content in sensitive contexts?

06Conclusion & Future Work

Three-Sentence Summary:

  • This paper proposes MoRe4D, a method that jointly learns 3D geometry and motion from a single image, producing consistent 4D point trajectories.
  • It stabilizes learning with depth-guided motion normalization and focuses motion with a Motion Perception Module, then renders any camera path using a dedicated 4D view synthesis module.
  • Trained on the new TrajScene-60K dataset, MoRe4D outperforms strong baselines in dynamics and multi-view consistency while staying efficient.

Main Achievement:

  • The #1 contribution is tightly coupling motion generation with geometric reconstruction inside one diffusion framework, turning the usual ā€œshape vs motionā€ trade-off into a win–win.

Future Directions:

  • A single, more unified model that co-generates appearance, geometry, and motion with even tighter coupling.
  • Robust, standardized 4D consistency metrics that go beyond 2D video quality.
  • Lighter, deployable 4D representations and richer user controls (text, sketches, keyframes).

Why Remember This:

  • It shows that the path to believable 4D from one image is not to make motion first or shape first, but to grow them together. That joint idea—and the practical tricks (depth-guided normalization, token-wise motion conditioning, and view completion)—pushes the whole field closer to everyday AR/VR creation, game asset animation, and creative storytelling from a single photo.

Practical Applications

  • Instant AR effects: Animate a product photo so users can walk around it and see it move in space.
  • Creative tools: Turn character concept art into short dynamic 3D shots for storyboards or teasers.
  • Game prototyping: Quickly animate environment props or NPCs from reference images for early playtests.
  • Education: Bring textbook pictures (e.g., animal locomotion) to life with 3D-consistent motion.
  • Advertising: Make engaging rotating product demos with subtle, realistic motions from a single hero shot.
  • Virtual showrooms: Generate dynamic 3D previews for furniture, wearables, or gadgets without full scans.
  • Cinematic previz: Explore camera paths and motions from one frame of a scene to plan shots quickly.
  • Social media: Produce stylish, multi-angle motion loops from a selfie or pet photo.
  • Robotics simulation: Create plausible moving 3D environments for training perception modules.
  • Cultural heritage: Animate historical photos with careful, gentle motions for museum exhibits.
#4D scene generation Ā· #single-image to 4D Ā· #joint geometry and motion Ā· #diffusion transformer Ā· #flow matching Ā· #point trajectories Ā· #depth-guided normalization Ā· #motion perception module Ā· #novel view synthesis Ā· #video inpainting Ā· #Gaussian splatting Ā· #monocular depth estimation Ā· #VBench evaluation Ā· #TrajScene-60K dataset Ā· #MAdaNorm