Motion 3-to-4: 3D Motion Reconstruction for 4D Synthesis
Key Summary
- Motion 3-to-4 turns a single regular video into a moving 3D object over time (a 4D asset) by first getting the object's shape and then figuring out how every part moves.
- It uses a simple but powerful trick: keep one clean reference 3D mesh as the starting point and predict how its points travel through the video.
- A special transformer reads each frame along with the mesh, so it can handle short or long videos and different mesh sizes.
- Instead of slow per-video optimization, it works feed-forward, so it's much faster than many older methods.
- On a new Motion-80 benchmark with real ground-truth geometry, it beats strong baselines in both shape accuracy and visual consistency.
- It can even animate an artist's static 3D model using motion seen in a different video (motion transfer).
- The method learns a compact "motion code" from video features (DINOv2) plus mesh features, then decodes per-frame 3D trajectories for mesh points.
- By aligning surface points to video evidence, it generalizes well even with limited 4D training data.
- It runs at about 6.5 FPS over long sequences (512 frames), while some older pipelines run at around 0.1 FPS.
- Limitations include potential vertex sticking when parts touch and difficulty with big topology changes later in the video.
Why This Research Matters
This method lets anyone turn a normal video into a moving 3D model that stays consistent from all angles. That helps filmmakers, game designers, and VR creators make high-quality animations quickly, without special camera rigs or long manual work. Robots and AR apps can better understand how objects move in the real world from simple phone videos. Teachers and students can bring science lessons to life by recording and instantly animating models for lab-style exploration. Artists can retarget motion from one video to a different 3D character, unlocking new creative workflows. Because it's feed-forward and robust, it's practical for long clips and real-world scenes. In short, it makes 4D content creation faster, cheaper, and more accessible.
Detailed Explanation
01 Background & Problem Definition
You know how a flipbook shows a drawing moving when you flip the pages? That's like 4D: a 3D thing changing over time. For years, computers have become great at making pretty 2D pictures and even solid 3D models, but getting both shape and motion together (4D) from just one normal video is hard. Why? Because the video only shows one angle at a time, and parts of the object are often hidden.
Hook: Imagine trying to build a full LEGO statue of a dancer by watching a single home video. You can see the front, but sometimes the back is hidden, and she spins fast. The Concept (Monocular Video Input): What it is: A regular video from just one camera, like a phone video taken with one eye open. How it works:
- Record frames over time from one viewpoint.
- Each frame shows the object's appearance and hints of motion.
- But hidden sides and true 3D positions aren't directly visible. Why it matters: Without knowing we only have one viewpoint, we might expect perfect 3D everywhere. In reality, the system must guess what's behind or occluded. Anchor: Filming a toy car with your phone from the front only gives you the front view; you still need to imagine its back and sides.
Before this paper, people tried two main paths. One path first generated many fake camera views with 2D or video models, then tried to reconstruct the 3D+time object. That often caused view mismatches (like the object's stripes not lining up across views) and needed slow, per-video tuning. Another path used big 3D generators to make a mesh per frame and then tried to align all frames into a single 4D mesh, which was again slow and fragile since each frame could drift differently.
Hook: Think of baking 32 cupcakes separately and then trying to stack them into a perfect cake. Each cupcake might be a little different. The Concept (3D Shape Generation): What it is: Making a clean 3D model (mesh) of the object. How it works:
- Use a strong 3D generator to create a full shape from an image.
- Get vertices (points), faces (triangles), and textures (colors).
- Normalize the size so it fits a box and is ready to animate. Why it matters: A solid starting shape is like a sturdy skeleton for motion; without it, the movement has nothing reliable to hold onto. Anchor: From the first video frame of a dragon, the model makes a full dragon mesh you can spin around.
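The methodology later mentions scaling the generated mesh into a [-0.5, 0.5] box. Here is a minimal sketch of that kind of size normalization, assuming NumPy vertex arrays; the function name and exact box convention are illustrative, not taken from the paper's code:

```python
import numpy as np

def normalize_to_box(vertices: np.ndarray, half_extent: float = 0.5) -> np.ndarray:
    """Center a mesh at the origin and scale it uniformly so it fits
    inside a [-half_extent, half_extent]^3 box (aspect ratio preserved)."""
    lo, hi = vertices.min(axis=0), vertices.max(axis=0)
    center = (lo + hi) / 2.0
    scale = (hi - lo).max()                      # longest side of the bounding box
    return (vertices - center) / scale * (2.0 * half_extent)

# Toy usage: a stretched point cloud ends up centered inside the box.
verts = np.random.rand(1000, 3) * np.array([2.0, 1.0, 0.5])
norm = normalize_to_box(verts)
print(norm.min(axis=0), norm.max(axis=0))        # roughly -0.5 .. 0.5 on the longest axis
```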
Hook: You know how detectives re-create a chase by tracking footprints over time? The Concept (Motion Reconstruction): What it is: Figuring out how each tiny part of the object moves over time based on the video. How it works:
- Start with the reference 3D mesh at time 0.
- For each next frame, predict where each mesh point moves in 3D.
- Keep this consistent across all frames for a smooth, believable animation. Why it matters: If we don't correctly reconstruct motion, the object will look jumpy, stretchy, or broken. Anchor: A chicken's wing tips start at one place on frame 1 and follow curving 3D paths as it flaps in later frames.
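One simple way to picture the output this block describes is a container holding the static reference mesh plus a per-frame array of 3D positions for the same tracked points. A minimal sketch of such a data structure, assuming NumPy arrays; the class and field names are illustrative, not the paper's:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class FourDAsset:
    """A static reference mesh plus per-frame 3D positions of its tracked points."""
    vertices: np.ndarray      # (N, 3) reference positions at frame 0
    faces: np.ndarray         # (F, 3) triangle indices, shared by every frame
    trajectories: np.ndarray  # (T, N, 3) predicted positions for each frame

    def frame_mesh(self, t: int) -> np.ndarray:
        """Vertex positions at frame t; the connectivity (faces) never changes."""
        return self.trajectories[t]

# Toy usage: 4 tracked points over 10 frames.
asset = FourDAsset(
    vertices=np.zeros((4, 3)),
    faces=np.array([[0, 1, 2], [0, 2, 3]]),
    trajectories=np.zeros((10, 4, 3)),
)
print(asset.frame_mesh(3).shape)  # (4, 3)
```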
Hook: If you draw a bouncing ball, it looks weird if its size and position randomly jump. The Concept (Temporal Consistency): What it is: Keeping the shape and motion smooth and stable over time. How it works:
- Tie each mesh point to a continuous 3D path.
- Use video features that remember what's where across frames.
- Apply the same reference geometry to avoid drift. Why it matters: Without temporal consistency, the motion flickers or the geometry swims around, breaking the illusion. Anchor: Watching from different angles, the same dragon head stays the same shape as it nods, without wobbling textures.
The problem researchers faced: 4D data is scarce, and learning a general motion model directly is tough. Previous VAE-style motion models needed huge diverse datasets and still generalized poorly. And reconstruction-only methods can't invent hidden geometry. The missing gap was a way to combine the strengths of 3D generators (great static shapes) with robust motion recovery (accurate over time) in a single fast, feed-forward system.
Real stakes: Better 4D from a single video helps VR, movies, robotics, education, and games. Imagine: filming your pet once and getting a high-quality animated 3D pet you can drop into a game, or digitizing a vintage puppet show into a faithful, re-animatable 3D asset. That's why Motion 3-to-4 matters: it turns everyday videos into living 3D over time, quickly and reliably.
02 Core Idea
Aha! Moment in one sentence: Keep one clean 3D reference mesh and learn to predict how all its points move in 3D for each video frame, turning 3D (shape) into 4D (shape + time) by reconstructing motion.
Three analogies:
- Map-and-journey: First draw a clear map (the reference mesh), then trace the journey of every landmark over time (motion paths), instead of redrawing the map every day.
- Mannequin-and-outfit: Dress one mannequin (mesh) and then smoothly bend its arms and legs (motion) instead of sewing a new doll for each pose.
- Anchor-and-string: Tie every surface point to an anchor at frame 1 and pull it along a string path through time, keeping everything connected.
Before vs After:
- Before: Systems tried to make a new 3D shape per frame or rely on many fake multi-views, causing drift, mismatch, and slow optimization.
- After: Motion 3-to-4 reuses one static mesh and only learns the motion flow, so geometry remains consistent and motion stays smooth, and it runs feed-forward.
Why it works (intuition): The hardest part, learning a full 4D distribution from limited data, is replaced with two easier steps: (1) get a reliable static shape using strong 3D generators trained on huge datasets; (2) learn motion as local correspondences between surface points and video pixels. Because motion is predicted relative to the same reference, the system preserves correspondences and avoids per-frame shape drift. Video features (from DINOv2) provide powerful, general visual cues so the model knows which parts match across frames.
Building blocks, explained simply with the Sandwich pattern:
Hook: Like using a default starting pose for a character before animating it. The Concept (Canonical Reference Mesh): What it is: A clean, standard 3D mesh of the object at the first frame that serves as the starting pose. How it works:
- Either load the artist's mesh or generate one from the first video frame.
- Normalize scale and keep topology consistent.
- Treat it as the anchor for all future motion predictions. Why it matters: Without a stable reference, each frame could drift, stretch, or change shape. Anchor: Build one dragon mesh at time 0; all future poses bend this same dragon.
Hook: Imagine taking careful notes about a statue: position, tilt, and color. The Concept (Geometric Features Encoding): What it is: Turning mesh points (position, normal, color) into compact tokens the network can understand. How it works:
- Sample thousands of surface points from the mesh.
- Embed their coordinates, normals, and colors.
- Use attention to compress them into a small set of shape tokens. Why it matters: Without good shape tokens, the model can't link video evidence to the right parts of the geometry. Anchor: 4096 sampled points become 64 rich tokens that summarize the mesh.
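A minimal sketch of this kind of encoder, assuming PyTorch: each sampled point contributes xyz, a normal, and RGB (9 values), and a small set of learnable query tokens gathers them with cross-attention before a little self-attention refinement. Dimensions and module names are illustrative, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class ShapeTokenEncoder(nn.Module):
    """Compress N sampled surface points (xyz + normal + rgb) into K shape tokens."""
    def __init__(self, d_model: int = 256, num_tokens: int = 64, num_heads: int = 4):
        super().__init__()
        self.point_embed = nn.Linear(9, d_model)                  # 3 pos + 3 normal + 3 color
        self.queries = nn.Parameter(torch.randn(num_tokens, d_model) * 0.02)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.refine = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, num_heads, batch_first=True), num_layers=2
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (B, N, 9) -> shape tokens: (B, K, d_model)
        feats = self.point_embed(points)
        queries = self.queries.unsqueeze(0).expand(points.shape[0], -1, -1)
        tokens, _ = self.cross_attn(query=queries, key=feats, value=feats)
        return self.refine(tokens)

enc = ShapeTokenEncoder()
print(enc(torch.randn(2, 4096, 9)).shape)  # torch.Size([2, 64, 256])
```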
Hook: Think of a librarian who first scans all books globally, then organizes each shelf separately. The Concept (Frame-wise Transformer Architecture): What it is: A transformer that alternates between looking across all frames globally and then focusing on each individual frame. How it works:
- Extract video frame features with a strong vision model.
- Alternate global attention (share across time) and frame attention (detail per frame).
- Handle arbitrary video lengths by keeping per-frame tokens. Why it matters: Without this, the model either forgets long sequences or can't adapt to different durations. Anchor: Whether the video has 12 or 300 frames, the same process works.
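A minimal sketch of the alternating pattern, assuming PyTorch: one block attends over the tokens of all frames at once (global), the next attends within each frame separately. The tensor layout, depth, and sizes are illustrative, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

def make_block(d_model: int = 256, heads: int = 4) -> nn.TransformerEncoderLayer:
    return nn.TransformerEncoderLayer(d_model, heads, batch_first=True)

class AlternatingAttention(nn.Module):
    """Alternate attention over all frames at once (global) and within each frame."""
    def __init__(self, depth: int = 4, d_model: int = 256, heads: int = 4):
        super().__init__()
        self.global_blocks = nn.ModuleList(make_block(d_model, heads) for _ in range(depth))
        self.frame_blocks = nn.ModuleList(make_block(d_model, heads) for _ in range(depth))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, T, S, D) with S tokens per frame (shape tokens + video patch tokens)
        B, T, S, D = tokens.shape
        for g_blk, f_blk in zip(self.global_blocks, self.frame_blocks):
            x = g_blk(tokens.reshape(B, T * S, D))   # global: tokens from all frames interact
            x = f_blk(x.reshape(B * T, S, D))        # frame-wise: each frame refines itself
            tokens = x.reshape(B, T, S, D)
        return tokens

model = AlternatingAttention()
out = model(torch.randn(1, 8, 64 + 128, 256))  # 64 shape tokens + 128 video tokens per frame
print(out.shape)  # torch.Size([1, 8, 192, 256])
```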
Hook: Like learning a dance by watching and remembering repeated moves. The Concept (Motion Latent Learning): What it is: Learning a compact "motion code" that mixes shape tokens with per-frame video tokens. How it works:
- Combine the shape tokens with DINOv2 frame tokens.
- Add time awareness so the order of frames is known.
- Produce per-frame motion tokens that describe how the mesh should move. Why it matters: Without a good motion code, the decoder can't predict accurate 3D trajectories. Anchor: For frame 10, a motion token says, "the tail flicks slightly left and up."
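A minimal sketch of how such a per-frame token sequence might be assembled, assuming PyTorch: the K shape tokens are prepended to each frame's video patch tokens, and a learned frame-index embedding is added so the model knows the order. After the alternating-attention blocks, the first K tokens of each frame would be read off as that frame's motion tokens. Names and sizes are illustrative:

```python
import torch
import torch.nn as nn

class MotionTokenAssembler(nn.Module):
    """Prepend shape tokens to each frame's video tokens and add a frame-index embedding."""
    def __init__(self, d_model: int = 256, max_frames: int = 512):
        super().__init__()
        self.time_embed = nn.Embedding(max_frames, d_model)

    def forward(self, shape_tokens: torch.Tensor, video_tokens: torch.Tensor) -> torch.Tensor:
        # shape_tokens: (B, K, D); video_tokens: (B, T, P, D) from a frozen video backbone
        B, T, P, D = video_tokens.shape
        K = shape_tokens.shape[1]
        shape_rep = shape_tokens.unsqueeze(1).expand(B, T, K, D)      # same shape tokens every frame
        tokens = torch.cat([shape_rep, video_tokens], dim=2)          # (B, T, K + P, D)
        frame_ids = torch.arange(T, device=tokens.device)
        return tokens + self.time_embed(frame_ids)[None, :, None, :]  # broadcast over tokens

asm = MotionTokenAssembler()
out = asm(torch.randn(1, 64, 256), torch.randn(1, 12, 128, 256))
print(out.shape)  # torch.Size([1, 12, 192, 256])
```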
Hook: Like calling students' names and having the teacher focus on each one's needs. The Concept (Cross-Attention Layer): What it is: A mechanism where point queries from the mesh ask the motion tokens for guidance. How it works:
- Turn each sampled point into a query vector.
- Let it attend to motion tokens from the right frame.
- Decode the final 3D position for that point. Why it matters: Without cross-attention, the model can't match "this exact point on the mesh" to "what the video says it's doing now." Anchor: The tip of a dragon's horn asks the motion tokens, "Where am I at frame 23?" and gets a precise 3D coordinate.
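A minimal sketch of this decoding step, assuming PyTorch: each sampled reference point becomes a query, attends to one frame's motion tokens, and a small MLP head regresses its 3D position for that frame. This is an illustration under those assumptions, not the paper's exact layer configuration; in practice the points would also be processed in chunks for memory, as the methodology's inference step notes later:

```python
import torch
import torch.nn as nn

class MotionDecoder(nn.Module):
    """Point queries attend to per-frame motion tokens and regress per-frame 3D positions."""
    def __init__(self, d_model: int = 256, num_heads: int = 4):
        super().__init__()
        self.point_embed = nn.Linear(3, d_model)   # reference xyz -> query vector
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.head = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(), nn.Linear(d_model, 3))

    def forward(self, ref_points: torch.Tensor, motion_tokens: torch.Tensor) -> torch.Tensor:
        # ref_points: (B, M, 3); motion_tokens: (B, T, K, D) -> positions: (B, T, M, 3)
        B, T, K, D = motion_tokens.shape
        queries = self.point_embed(ref_points)                               # (B, M, D)
        queries = queries.unsqueeze(1).expand(B, T, -1, -1).reshape(B * T, -1, D)
        tokens = motion_tokens.reshape(B * T, K, D)
        attended, _ = self.cross_attn(query=queries, key=tokens, value=tokens)
        return self.head(attended).reshape(B, T, -1, 3)

dec = MotionDecoder()
pos = dec(torch.randn(1, 4096, 3), torch.randn(1, 12, 64, 256))
print(pos.shape)  # torch.Size([1, 12, 4096, 3])
```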
Put together, these pieces let the system predict a full 3D trajectory for every surface point across time: precise, smooth, and fast.
03 Methodology
At a high level: Input (one video + optional 3D mesh) → Encode shape points into shape tokens → Encode video frames into per-frame tokens → Alternate global-and-frame attention to form motion tokens → Cross-attention decode per-frame 3D positions for many mesh points → Reassemble into a moving 3D mesh (4D asset).
Step-by-step, with what, why, and an example:
- Inputs and Preparation
- What happens: We take a single monocular video. If the user has a clean mesh for frame 1, great; if not, we generate it from the first frame using a strong 3D model. We normalize size and keep textures.
- Why this exists: A stable reference mesh avoids shape drift. Having texture helps link video pixels to surface regions.
- Example: A 256×256 video of a dancing dragon; we generate a mesh from frame 1 and scale it into a [-0.5, 0.5] box.
- Sample and Embed Mesh Points (Geometric Features Encoding)
- What happens: Uniformly sample about N = 4096 surface points, each with position (xyz), normal (which way the surface faces), and color (RGB). Embed these into high-dimensional vectors.
- Why this exists: The network needs a compact, learnable summary of the geometry.
- Example: 4096 points cover the dragon's body, wings, horns, and tail with enough detail.
- Compress into Shape Tokens
- What happens: Use cross-attention with a small learnable token set (K = 64) so each token gathers info from relevant points. Then refine with a few self-attention layers.
- Why this exists: 4096 raw points are too many; 64 tokens are easier to mix with video features and still rich enough.
- Example: One token may represent "left wing," another "snout ridge."
- Extract Video Features per Frame
- What happens: Use a frozen DINOv2 ViT to turn each frame into patch-level tokens (semantic features). Add temporal embeddings so tokens know their frame index.
- Why this exists: Good video features make it easier to find consistent correspondences across frames, even with lighting or pose changes.
- Example: Frame 7 features highlight bright scales on the dragon's right side, similar to frames 6 and 8, so the model can track them.
- Build Per-Frame Motion Representations (Frame-wise Transformer Architecture)
- What happens: For each frame, append the global shape tokens to the frameās video tokens. Run alternating attention: global attention shares information across frames, then frame attention sharpens details for that specific frame.
- Why this exists: This lets the model handle any sequence length while keeping both long-range consistency and per-frame accuracy.
- Example: Over 16 alternating blocks, the network learns that "the tail motion gradually swings left over frames 5-12."
- Motion Latent Learning
- What happens: After the alternating blocks, we take the first K tokens per frame as motion tokens. These are compact codes telling how the object should move at that time.
- Why this exists: The decoder needs a small, consistent set of tokens per frame to query for motion.
- Example: At frame 12, a token encodes "head nod down slightly; wing tips lift."
- Cross-Attention Motion Decoding
- What happens: Resample M = 4096 reference points as queries. For each frame t, each query attends to the motion tokens Z_t to get a predicted 3D position. A small MLP outputs final coordinates.
- Why this exists: Querying motion per point keeps surface-to-pixel correspondences and avoids re-generating new meshes per frame.
- Example: The exact vertex on the dragon's nostril moves along a smooth arc from frames 1 to 12.
- Training with Direct Supervision
- What happens: During training, we have ground-truth meshes across frames, so we can compute the mean squared error (MSE) between predicted and true 3D positions for many points and many frames. We also keep point sampling consistent across frames (same barycentric coordinates), so each tracked point has a true path.
- Why this exists: Simple, strong supervision teaches the model precise 3D motion without fancy losses.
- Example: If the real point went from (0.10, 0.05, 0.02) to (0.12, 0.06, 0.01), the model is nudged toward that exact path.
- Inference at Scale
- What happens: For long videos (e.g., >256 frames), we use sliding windows that always include the first (reference) frame for stability. We process points in chunks for memory efficiency. The system runs feed-forward without per-instance optimization.
- Why this exists: This makes the system practical and fast for long sequences.
- Example: A 512-frame animation runs around 6.5 FPS. (A minimal sketch of the training loss and this sliding-window scheme follows this list.)
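Two pieces of this recipe lend themselves to short sketches: the dense MSE objective over tracked points, and a sliding-window loop for long videos that always keeps the reference frame in the window. A minimal sketch under those assumptions; the window size and helper names are illustrative, not the authors' code:

```python
import torch

def motion_mse_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Mean squared error between predicted and ground-truth positions.
    pred, target: (T, N, 3) trajectories of the same tracked points."""
    return ((pred - target) ** 2).mean()

def sliding_window_indices(num_frames: int, window: int = 12):
    """Yield frame-index windows that always include frame 0 (the reference frame),
    so every chunk is predicted relative to the same canonical mesh."""
    step = window - 1                      # reserve one slot for the reference frame
    for start in range(1, num_frames, step):
        chunk = list(range(start, min(start + step, num_frames)))
        yield [0] + chunk

# Toy usage: a 30-frame clip processed with 12-frame windows.
for idx in sliding_window_indices(30, window=12):
    print(idx[0], idx[1], "...", idx[-1])
```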
What breaks without each step?
- No reference mesh: shapes drift per frame; no stable correspondences.
- No shape tokens: video features can't attach to specific surface parts.
- No temporal embeddings or global attention: the model loses track over long sequences.
- No cross-attention: points can't ask for "their" motion and get mixed up.
- No dense MSE supervision: positions get sloppy and motion flickers.
Concrete toy example with tiny numbers:
- Suppose we sample 4 points on a toy fish: nose, fin tip, tail top, tail bottom.
- For frame t, each point queries motion tokens and gets a new 3D position.
- Over 10 frames, the fin tip follows a smooth curve upward and back, matching the video's flap.
The secret sauce:
- Disentangle shape and motion. Reuse a powerful 3D generator for the static mesh, then just learn motion. Combine strong video semantics (DINOv2) with a frame-wise transformer so motion stays consistent. Predict per-point trajectories via cross-attention, locking surface-to-pixel correspondences without slow post-processing.
04 Experiments & Results
The tests: The team measured two big things: geometry (Is the 3D shape right across time?) and appearance (Do the renderings look good and stay stable from new views?). They also cared about speed and robustness on short and long videos.
Datasets:
- Motion-80 (new): 80 test subjects from Objaverse, including 64 short and 16 long sequences with accurate ground-truth geometry and realistic renderings from four views.
- Consistent4D benchmark: 7 videos, each 32 frames, widely used to compare 4D methods, but without ground-truth meshes (so evaluation is by render-based metrics).
Competitors:
- L4GM: A large 4D Gaussian reconstruction model (fast feed-forward but geometry can float off surfaces).
- GVFD: A VAE-based motion model animating Gaussians (needs lots of diverse 4D data; struggled with generalization and long sequences).
- V2M4: Generates a mesh each frame then aligns across time (plausible geometry but slow and can flicker over time).
Metrics (made meaningful):
- Chamfer Distance (CD): Lower is better; imagine measuring how far your recreated dots are from the true dots: smaller means more accurate shape. (A minimal sketch of CD and F-Score appears after this list.)
- F-Score: Higher is better; like a matching test that rewards getting many points correct within a small distance.
- LPIPS: Lower is better; compares how similar two images look to people.
- CLIP score: Higher is better; checks overall content similarity.
- FVD (Fréchet Video Distance): Lower is better; like a smoothness-and-quality score for videos.
- DreamSim: Lower is better; another perceptual similarity measure for appearance.
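A minimal sketch of the two geometry metrics on point clouds, assuming brute-force nearest neighbors (fine for small clouds; real evaluations usually use KD-trees) and an illustrative distance threshold:

```python
import numpy as np

def _nearest_dists(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """For each point in a (N, 3), distance to its nearest neighbor in b (M, 3)."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)   # (N, M) squared distances
    return np.sqrt(d2.min(axis=1))

def chamfer_distance(pred: np.ndarray, gt: np.ndarray) -> float:
    """Symmetric average nearest-neighbor distance; lower is better."""
    return float(_nearest_dists(pred, gt).mean() + _nearest_dists(gt, pred).mean())

def f_score(pred: np.ndarray, gt: np.ndarray, tau: float = 0.05) -> float:
    """Harmonic mean of precision and recall at distance threshold tau; higher is better."""
    precision = (_nearest_dists(pred, gt) < tau).mean()
    recall = (_nearest_dists(gt, pred) < tau).mean()
    return float(2 * precision * recall / (precision + recall + 1e-8))

pred = np.random.rand(500, 3)
gt = pred + 0.01 * np.random.randn(500, 3)   # nearly identical clouds score well
print(chamfer_distance(pred, gt), f_score(pred, gt))
```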
Scoreboard highlights:
- Motion-80 (short sequences): Motion 3-to-4 achieved CD ≈ 0.1113 and F-Score ≈ 0.3171, better than all baselines. That's like getting an A when others get Bs or Cs on shape accuracy.
- Motion-80 (long sequences): It stayed strong, showing consistent geometry and motion over 128+ frames. GVFD couldn't run on long sequences (out of memory), while our method stayed steady.
- With a given ground-truth static mesh ("Ours w/m"): Performance jumped a lot (e.g., F-Score up to ~0.6774 on short sequences), showing that our motion reconstruction is extremely faithful when the starting shape is perfect.
- Consistent4D benchmark: Our LPIPS ≈ 0.1455, CLIP ≈ 0.8609, and FVD ≈ 1260.0 all improved over the baselines (lower is better for LPIPS and FVD, higher for CLIP), indicating better perceived quality and temporal stability.
Speed:
- Motion 3-to-4 runs about 6.5 FPS over 512 frames, while many generate-then-align or multi-view pipelines run around 0.1 FPS (much slower). It's like going from a walking pace to biking.
Surprising findings:
- Motion transfer: Even though trained on paired meshes and videos, the method can take motion from one video (e.g., a dragon) and animate a different mesh (e.g., a chicken) reasonably well, showing strong generalization of motion patterns.
- In-the-wild videos: Despite training on synthetic data, it generalized to real-world clips once backgrounds were removed, thanks to robust video features and the shape-first design.
- Novel-view rendering: Methods that look good from the input view sometimes fall apart from other angles (ghosting or flicker). By tying motion to a single reference mesh, Motion 3-to-4 remains coherent from new viewpoints.
Takeaway: Across geometry, appearance, and speed, Motion 3-to-4 consistently outperformed feed-forward Gaussian methods and alignment-heavy mesh methods, especially shining when long sequences or novel views were tested.
05 Discussion & Limitations
Limitations (honest look):
- Vertex sticking: If parts of the reference mesh are too close or not clearly separated (like fingers touching), nearby vertices may stick together during motion.
- Topology changes: The method assumes the first-frame mesh topology stays valid. If the object opens a mouth that wasn't open in frame 1 (a big topology change), motion reconstruction can fail.
- Single-view ambiguity: From one camera, some parts are always hidden. The model does a good job guessing, but perfect recovery is still impossible in principle.
- Dependence on initial mesh quality: If the generated mesh from frame 1 is poor (holes, wrong proportions), motion will inherit those issues.
Required resources:
- Training used 8×H100 GPUs for about 1.5 days on 60k steps with 12-frame clips.
- A frozen DINOv2 backbone provides strong video features.
- At inference, it's feed-forward and memory-conscious (chunked points and sliding time window), enabling practical speeds.
When not to use:
- Drastic topology changes mid-video (e.g., a closed book suddenly opens very wide when the initial mesh had pages fused).
- Extremely thin or transparent structures that the first-frame mesh can't capture (e.g., wispy cloth strands, complex hair), or heavy motion blur hiding key cues.
- Scenes where background and object separation is unclear and foreground segmentation is missing.
Open questions:
- How to better handle topology changes? Could a dynamic mesh that adapts its connectivity help?
- Can we incorporate uncertainty estimates for hidden or occluded regions, so the system reports confidence?
- Could joint refinement of the reference mesh over time (without losing consistency) further improve results?
- How far can motion transfer go across very different body plans (e.g., bird-to-car)?
- Can lightweight video features (instead of large backbones) keep quality high while lowering compute for mobile/edge use?
06 Conclusion & Future Work
Three-sentence summary: Motion 3-to-4 turns a single video into a 4D asset by separating the problem into two easy pieces: get a good static 3D mesh and then reconstruct how every surface point moves over time. It learns a compact motion code from video-plus-mesh features and decodes per-frame 3D trajectories via cross-attention, keeping geometry consistent and motion smooth. The result is fast, generalizable, and strong on both shape accuracy and visual quality, even for long sequences and new viewpoints.
Main achievement: Reformulating 4D generation as feed-forward motion reconstruction relative to a canonical reference mesh, combining the best of 3D generation and 4D reconstruction to achieve state-of-the-art fidelity and consistency from a single-view video.
Future directions:
- Make topology adaptive so the mesh can open/close parts naturally.
- Add uncertainty-aware predictions for occluded regions.
- Explore smaller backbones and distillation for on-device performance.
- Expand motion transfer to very different shapes and categories with fine control.
Why remember this: The simple idea of keeping one mesh and just predicting motion turns a messy, data-hungry 4D problem into an efficient, robust pipeline. It shows how anchoring time to a single, clean geometry unlocks smooth, believable animations from ordinary videos, bringing high-quality 4D content creation closer to everyday use.
Practical Applications
- Animate a static 3D character (e.g., a game asset) using motion captured from a phone video.
- Create 4D assets for movies and VR from quick on-set recordings without multi-camera rigs.
- Generate training data for robotics by capturing object motions in the lab with a single camera.
- Build educational AR experiences where recorded real objects become interactive 3D animations.
- Speed up previsualization for cinematography by converting rehearsal videos into 4D stand-ins.
- Prototype game mechanics by filming toys/models and importing their motion into a game engine.
- Perform motion retargeting: apply a dancer's moves to a different 3D avatar's body.
- Produce long, stable animations from home videos for indie filmmakers and creators.
- Augment product demos (e.g., rotating, opening lids) from simple marketing clips.
- Support scientific studies of movement (e.g., animal gait) from limited camera footage.