
ActionMesh: Animated 3D Mesh Generation with Temporal 3D Diffusion

Intermediate
Remy Sabathier, David Novotny, Niloy J. Mitra et al. · 1/22/2026
arXiv · PDF

Key Summary

  • ActionMesh is a fast, feed-forward AI that turns videos, images + text, text alone, or a given 3D model into an animated 3D mesh.
  • Its key idea is to add a time axis to a strong 3D diffusion model (Stage I) and then convert the changing shapes into a single, consistently connected mesh by deforming a chosen reference shape (Stage II).
  • The animations are rig-free (no skeletons needed) and topology consistent (same mesh connections across frames), which makes texturing and editing easy.
  • A special 'inflated attention' lets the model keep frames synchronized, while 'masked generation' lets it plug in known meshes as anchors.
  • A temporal 3D autoencoder predicts vertex deformations over time so the whole sequence shares one mesh topology.
  • On an Objaverse benchmark, ActionMesh beats prior methods in 3D accuracy, 4D consistency, and motion fidelity while running about 10× faster (about 3 minutes for 16 frames).
  • It supports motion transfer (retargeting) and can extend animations autoregressively for longer videos.
  • Limitations include trouble with true topological changes (like splitting/merging parts) and strongly occluded regions.
  • Because outputs are production-ready meshes, it plays nicely with game engines, film pipelines, AR/VR, and texture workflows.

Why This Research Matters

ActionMesh turns everyday inputs into production-ready animated meshes in minutes, cutting iteration time for artists, studios, and indie creators. Because the mesh topology stays consistent, textures and materials flow naturally across frames, reducing manual cleanup and rework. Its rig-free nature lowers the barrier to animating complex or unusual shapes that don’t have obvious skeletons. Faster, higher-quality 4D outputs make it easier to prototype game characters, film shots, ads, and AR experiences. Motion transfer enables creators to reuse performances across different assets, boosting creativity and productivity. As models like this learn from large video corpora, they could democratize 3D animation much like smartphone cameras democratized filmmaking.

Detailed Explanation


01. Background & Problem Definition

🍞 Hook: Imagine you’re making a stop-motion movie with clay figures. Every frame you slightly change the clay figure, then take a photo. Now imagine doing this in 3D on a computer, but you want the figure to look the same (same parts, same connections) while it moves. That’s hard!

🥬 The Concept: 3D mesh
How it works:

  1. A 3D mesh is a digital shape made of points (vertices) connected into triangles (faces).
  2. Those triangles form a surface like a net wrapped around an object.
  3. The net can have materials and textures painted on it.

Why it matters: Without meshes, 3D objects would be fuzzy clouds and hard to animate or texture.

🍞 Anchor: A video game character’s body is a mesh made of many tiny triangles.
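
To make this concrete, here is a minimal sketch (illustrative only, not the paper's data format) of a tiny mesh stored as a vertex array and a face array:

```python
import numpy as np

# A triangle mesh as two arrays: 3D vertex positions, and faces that
# index into the vertex array. This tetrahedron is purely illustrative.
vertices = np.array([
    [0.0, 0.0, 0.0],   # vertex 0
    [1.0, 0.0, 0.0],   # vertex 1
    [0.0, 1.0, 0.0],   # vertex 2
    [0.0, 0.0, 1.0],   # vertex 3
], dtype=np.float32)    # shape (V, 3)

faces = np.array([
    [0, 2, 1],
    [0, 1, 3],
    [0, 3, 2],
    [1, 2, 3],
], dtype=np.int64)      # shape (F, 3): four triangles forming a closed surface
```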

🍞 Hook: You know how a flipbook shows a drawing moving when you flip pages quickly?

🥬 The Concept: Animated 3D mesh
How it works:

  1. Start with one mesh (its shape and connections).
  2. Move its vertices a little for each frame to show motion.
  3. Keep the same set of vertices and faces so textures and parts line up over time.

Why it matters: If the mesh changed its connections every frame, textures would slide, and editing would break.

🍞 Anchor: A single 3D dragon model flaps its wings across frames by nudging the same vertices frame after frame.
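
A hedged sketch of that idea in code (not the paper's implementation): the face array is reused for every frame, and only the vertex positions change.

```python
import numpy as np

def animate_mesh(vertices, faces, per_frame_offsets):
    """vertices: (V, 3); faces: (F, 3); per_frame_offsets: (T, V, 3).

    Returns one (deformed_vertices, faces) pair per frame. Because the same
    `faces` array is reused, topology never changes and textures stay
    attached to the same triangles.
    """
    return [(vertices + offsets, faces) for offsets in per_frame_offsets]
```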

🍞 Hook: Think of LEGO pieces snapped together. You can bend the model, but the pieces and how they connect don’t change.

🥬 The Concept: Mesh topology (and topology consistency)
How it works:

  1. Topology is about which vertices are connected and which triangles touch.
  2. Topology consistency means those connections stay the same across frames.
  3. You can move vertices, but you don’t add/remove connections.

Why it matters: Without consistent topology, textures, UVs, and downstream tools don’t work reliably.

🍞 Anchor: A textured astronaut keeps the same UV map while waving because its mesh topology never changes.

🍞 Hook: Puppets often need strings or a skeleton to move. What if your puppet could dance without strings?

🥬 The Concept: Rig-free animation
How it works:

  1. No skeleton or skinning is required.
  2. The model directly predicts where each vertex should move every frame.
  3. The object deforms as one continuous shape.

Why it matters: Rigging is hard or impossible for some shapes (like an octopus with maracas), and skipping it speeds up production.

🍞 Anchor: An octopus jiggles each tentacle without ever building a skeleton.

The world before: Most systems that turned video into moving 3D shapes were slow and picky. They often required one specific input (like a certain kind of video), or needed long optimization (30–45 minutes per scene), and outputs weren’t always ready for real productions because topology changed or quality flickered.

The problem: We want a simple, fast way to go from everyday inputs (video, text, image, or an existing 3D model) to a clean animated 3D mesh that stays consistent and keeps textures.

Failed attempts:

  • Per-frame image-to-3D: Running a 3D reconstructor on each video frame separately caused shapes to spin randomly or flicker, because each frame was treated alone.
  • Optimization-heavy pipelines: They improved quality but were slow, fragile, and not feed-forward.
  • Non-mesh outputs: Some fast methods produced Gaussians or neural fields rather than directly usable, textured meshes.

The gap: We needed a feed-forward method that is both temporally consistent and produces a single, editable mesh over time.

Real stakes: Faster, cleaner animated meshes help game studios, film/TV, ads, AR/VR, and creators iterate in minutes, not hours, while keeping textures and assets production-ready.

02. Core Idea

🍞 Hook: Picture a marching band. Each player (a frame) must play in sync, and then the choreographer turns all those separate steps into one smooth dance for the whole group.

🥬 The Concept: 3D diffusion model (and latents)
How it works:

  1. A diffusion model learns to turn noisy signals into clean ones—it’s like un-scrambling TV static back into a picture.
  2. In 3D, it predicts a compact code (a latent) that represents the whole shape.
  3. A decoder turns this latent into a full 3D mesh.

Why it matters: Without a strong 3D diffusion backbone, shapes would be blurry or wrong, and generation would be unstable.

🍞 Anchor: The model starts with a noisy 3D code and iteratively cleans it until it becomes a sharp mesh of a horse.
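
As a rough illustration of that un-scrambling loop, here is a toy sampler that assumes a velocity-prediction (flow-matching-style) denoiser; the paper's actual backbone and scheduler details may differ.

```python
import torch

def sample_latent(velocity_model, latent_shape, steps=50):
    """Toy sampler: integrate from noise toward a clean 3D shape latent.

    `velocity_model(z, t)` is a stand-in for a trained denoiser that predicts
    how to move the noisy latent z at time t toward clean data.
    """
    z = torch.randn(latent_shape)                      # start from pure noise
    for i in range(steps):
        t = torch.full((latent_shape[0],), i / steps)  # current time in [0, 1)
        v = velocity_model(z, t)                       # predicted update direction
        z = z + v / steps                              # one small Euler step
    return z                                           # decoder(z) -> 3D mesh
```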

Aha moment (one sentence): Add a time axis to a strong 3D diffusion model to produce synchronized per-frame shapes, then convert those shapes into a single, consistently connected animated mesh by predicting vertex deformations of a chosen reference.

Analogy 1 (choir): Stage I is teaching each singer (frame) to stay on the same key and tempo; Stage II blends them into one harmony (one mesh) that moves smoothly.

Analogy 2 (flipbook): Stage I draws each page clearly and aligned; Stage II keeps the same tracing paper so colors and outlines match across pages.

Analogy 3 (baking): Stage I bakes matching cookies (shapes) from the same cutter; Stage II gently bends the first cookie into the poses of all others, so it’s always the same cookie.

🍞 Hook: You know how friends in a group chat sometimes talk over each other, but if they can see the whole thread, they stay on topic?

🥬 The Concept: Temporal 3D diffusion
How it works:

  1. Generate a sequence of 3D latents—one per video frame—but make them talk to each other.
  2. Use inflated attention so tokens from any frame can attend to tokens from other frames.
  3. Add timing cues (rotary embeddings) so the model knows which frames are earlier/later.

Why it matters: Without temporal diffusion, per-frame shapes drift and flicker.

🍞 Anchor: A running cheetah stays the same cheetah as it strides across frames, rather than changing body orientation randomly.

🍞 Hook: Imagine building a puzzle with some pieces already placed. It’s easier to finish the rest.

🥬 The Concept: Masked generation (with known 3D inputs)
How it works:

  1. Keep some latents clean (from a known mesh) and mark them as sources.
  2. Only denoise the remaining masked latents.
  3. Let all tokens attend to the clean anchors.

Why it matters: Without anchors, generation can wander and can’t reuse a given 3D asset.

🍞 Anchor: Provide a good 3D frame from the video as the anchor; the model completes the rest consistently.

🍞 Hook: Think of a tailor using one mannequin for all outfits—he pins the fabric differently each time, but it’s the same body underneath.

🥬 The Concept: Temporal 3D autoencoder (deformation fields of a reference mesh)
How it works:

  1. Take the sequence of per-frame shapes (independent meshes).
  2. Choose a reference mesh (the mannequin).
  3. Predict a deformation field per frame that moves the reference’s vertices to match each shape.

Why it matters: Without this, you’d get different meshes each frame—no consistent topology, no stable textures.

🍞 Anchor: Start with a neutral cat mesh; for each frame, predict how to nudge its vertices so it walks, all while staying the same mesh.

Before vs. after:

  • Before: Slow pipelines, per-frame drift, changing topologies.
  • After: A 3-minute, feed-forward system that keeps one mesh, preserves textures, and stays in sync across frames.

Why it works (intuition):

  • Strong 3D priors (pretrained image-to-3D and VecSet latents) give high-fidelity shapes.
  • Inflated attention and timing cues synchronize frames.
  • Masked anchors stabilize and unlock multi-input tasks.
  • A deformation-based autoencoder guarantees a single, consistent mesh.

🍞 Hook: When you bend a wire sculpture, the joints don’t change; the wire just moves.

🥬 The Concept: Deformation field
How it works:

  1. For each vertex of the reference mesh, predict a small 3D offset for the target frame.
  2. Apply these offsets frame by frame.
  3. The mesh deforms smoothly over time.

Why it matters: Without deformation fields, you’d need to rebuild the mesh each frame.

🍞 Anchor: Each vertex in a bunny mesh shifts slightly to show hopping motion.
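
A minimal sketch of a deformation field as a tiny network that maps a reference vertex plus a frame time to an xyz offset; the paper's actual Stage II decoder is conditioned on learned latents and is far more capable than this toy MLP.

```python
import torch
import torch.nn as nn

class DeformationField(nn.Module):
    """Toy per-vertex deformation field: (position, frame time) -> 3D offset."""

    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, verts, t):
        """verts: (V, 3); t: scalar in [0, 1] for the target frame."""
        t_col = torch.full((verts.shape[0], 1), float(t))
        return self.net(torch.cat([verts, t_col], dim=-1))  # (V, 3) offsets

# Usage sketch: deformed_verts = ref_verts + field(ref_verts, t), reusing the
# same face array for every frame.
```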

03. Methodology

High-level recipe: Input (video, or image+text, or 3D+text, or text) → Stage I: Temporal 3D diffusion (synchronized per-frame shapes) → Stage II: Temporal 3D autoencoder (deform a single reference mesh) → Output: One animated, topology-consistent 3D mesh.
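
The recipe, written as a hedged skeleton: every callable passed in is a stand-in for one of the paper's components, not a real API.

```python
def actionmesh_pipeline(condition, stage1_diffusion, decode_latent,
                        stage2_autoencoder, num_frames=16):
    """Illustrative skeleton of the two-stage recipe (names are made up).

    stage1_diffusion:   condition -> list of per-frame 3D latents
    decode_latent:      latent -> an independent per-frame mesh
    stage2_autoencoder: meshes -> (reference mesh, per-frame vertex offsets)
    """
    latents = stage1_diffusion(condition, num_frames)     # Stage I
    frame_meshes = [decode_latent(z) for z in latents]    # topology may differ
    ref_mesh, offsets = stage2_autoencoder(frame_meshes)  # Stage II
    return ref_mesh, offsets                              # one consistent animated mesh
```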

Stage I: Temporal 3D diffusion

  • What happens: The model produces a sequence of 3D latents (one per frame) that stay synchronized.
  • Why it exists: Per-frame reconstructions drift and flicker; we want them to agree over time.
  • Example: For a 16-frame video of a hopping rabbit, we generate 16 latents that all describe the same rabbit across time.

🍞 Hook: Imagine you flatten a stack of comic pages into one long strip so the characters can ‘see’ each other across panels.

🥬 The Concept: Inflated attention
How it works:

  1. Take tokens from all frames and temporarily treat them as one long sequence.
  2. Run self-attention across this big sequence so information flows between frames.
  3. Reshape back to per-frame form for the next layers.

Why it matters: Without inflated attention, frames can’t share cues and drift off.

🍞 Anchor: The left leg position in frame 3 can nudge the model to keep a compatible pose in frame 4.
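
A hedged sketch of the flatten-attend-reshape pattern (layer sizes are illustrative, not the paper's):

```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

def inflated_attention(x):
    """x: (B, T, N, D) = batch, frames, latent tokens per frame, channels."""
    B, T, N, D = x.shape
    flat = x.reshape(B, T * N, D)      # all frames become one long sequence
    out, _ = attn(flat, flat, flat)    # every token can attend to every frame
    return out.reshape(B, T, N, D)     # restore the per-frame layout
```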

🍞 Hook: Think of rhythm marks on sheet music—they tell players when to come in.

🥬 The Concept: Temporal cues with rotary positional embeddings
How it works:

  1. Add relative time information into attention computation.
  2. This encodes frame order and spacing.
  3. Helps smooth motion, reducing jitter.

Why it matters: Without time encoding, the model might mix up before/after, causing stuttery motion.

🍞 Anchor: A waving hand accelerates and decelerates smoothly rather than snapping between poses.
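
For intuition, here is a standalone sketch of a rotary-style rotation over the frame index; the paper applies its temporal rotary embeddings inside attention, and the exact formulation may differ.

```python
import torch

def rotary_time_embedding(x, frame_idx, base=10000.0):
    """x: (T, N, D) tokens with D even; frame_idx: (T,) integer frame numbers.

    Channel pairs are rotated by an angle that grows with the frame index,
    which makes dot-product attention sensitive to relative frame order.
    """
    T, N, D = x.shape
    half = D // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)  # (half,)
    angles = frame_idx.float()[:, None] * freqs[None, :]               # (T, half)
    cos, sin = angles.cos()[:, None, :], angles.sin()[:, None, :]      # (T, 1, half)
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```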

🍞 Hook: If you already have the first puzzle piece, finish the rest around it.

🥬 The Concept: Masked generation
How it works:

  1. Mark some frame latents as clean sources (e.g., from a strong image-to-3D model on a good frame).
  2. Only denoise masked targets; keep sources fixed.
  3. Let targets attend to sources each denoising step.

Why it matters: Without masked anchors, consistency drops and you can’t plug in a known mesh.

🍞 Anchor: Use a clean front-view mesh of a lion from a chosen frame; the model completes side and mid-motion frames to match.
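
A hedged sketch of one masked-generation update; `denoise_step` is a stand-in for a single diffusion update over all tokens, during which targets attend to the clean anchors.

```python
import torch

def masked_denoise_step(latents, source_mask, denoise_step, t):
    """latents: (T, N, D) per-frame latent tokens.
    source_mask: (T,) bool, True where the frame is a clean, known anchor.
    """
    updated = denoise_step(latents, t)            # proposes updates everywhere
    keep = source_mask[:, None, None]             # broadcast over tokens, channels
    return torch.where(keep, latents, updated)    # anchors stay exactly as given
```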

Concrete Stage I flow (video-to-4D):

  1. Pick a clear video frame and run a strong image-to-3D (e.g., TripoSG) to get a high-quality reference latent and mesh.
  2. Feed the whole video into the temporal diffusion with inflated attention and time cues.
  3. Keep the chosen frame’s latent clean; denoise the rest so all latents synchronize with the anchor.
  4. Decode latents to meshes per frame—this is a 4D mesh set (topology may differ per frame).

Stage II: Temporal 3D autoencoder

  • What happens: Convert the 4D mesh set into deformations of one reference mesh (constant topology).
  • Why it exists: Downstream tasks (texturing, UVs, editing) need one consistent mesh through time.
  • Example: Turn 16 independent rabbit meshes into one rabbit mesh that bends over 16 frames.

🍞 Hook: One mannequin, many outfits pinned into shape.

🥬 The Concept: Reference mesh
How it works:

  1. Choose or compute a base mesh to keep through time.
  2. For each frame, predict a vertex offset field relative to this mesh.
  3. Apply offsets to animate.

Why it matters: Without a stable reference, textures slide and editing breaks.

🍞 Anchor: A single astronaut mesh is reused while its arms lift and legs step.

🍞 Hook: If two dots are near but on different sides of a fold, their surface directions differ.

🥬 The Concept: Using positions and normals for queries
How it works:

  1. For each query point/vertex, supply both its 3D position and surface normal.
  2. Normals disambiguate nearby-but-topologically-distant points.
  3. The decoder predicts where each query should move at a target time.

Why it matters: Without normals, the model can confuse close points across folds, hurting motion accuracy.

🍞 Anchor: A bent elbow’s surface points are close in space but normals help the model keep skin sides distinct.
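
A minimal sketch of what such a query could look like (the paper's exact query format is not spelled out here, so treat this as an assumption):

```python
import torch

def build_queries(positions, normals):
    """positions, normals: (V, 3) each -> queries: (V, 6).

    Concatenating where a reference vertex is with which way its surface
    faces keeps nearby points on opposite sides of a fold distinguishable.
    """
    return torch.cat([positions, normals], dim=-1)

# Usage sketch (`decoder` is a stand-in for the temporal autoencoder's decoder):
# offsets_t = decoder(build_queries(ref_positions, ref_normals), t)  # (V, 3)
```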

The secret sauce:

  • Minimal changes to a strong 3D backbone: Inflate attention and add time cues to synchronize frames; add masked generation to inject anchors.
  • A compatible temporal 3D autoencoder: It reads the same kind of latents, so Stage I outputs plug into Stage II smoothly.

End-to-end example with numbers:

  • Input: 16-frame video of a bear walking.
  • Stage I: Use one clear frame to create a clean latent anchor; denoise the other 15 with inflated attention + time cues; decode 16 meshes (may differ in topology).
  • Stage II: Pick the reference mesh (e.g., from the anchor). For each frame, predict vertex deformations using the temporal autoencoder with position+normal queries. Result: one animated, topology-consistent bear mesh.

Why each step matters:

  • Without Stage I synchronization, you get wobble and shape drift.
  • Without masked anchors, you can’t reuse a known good mesh or do {3D+text} animation.
  • Without Stage II deformations, you can’t keep textures and UVs consistent.

04. Experiments & Results

🍞 Hook: If four runners race, it’s not enough to know times—you want to know by how much and why.

🥬 The Concept: Chamfer distance (CD-3D, CD-4D, CD-M)
How it works:

  1. Sample many points on each surface.
  2. For each point, measure the nearest distance to the other surface, and average both directions.
  3. CD-3D checks each frame after per-frame alignment; CD-4D uses one alignment for the whole sequence; CD-M tracks how corresponding points move over time.

Why it matters: Without these metrics, we can’t tell if shapes match, stay consistent over time, or move correctly.

🍞 Anchor: If CD-4D is small, the animated mesh stays close to ground truth across the whole sequence, not just one frame.
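
The building block behind all three metrics is a symmetric Chamfer distance between sampled point sets; this sketch omits the alignment and temporal-correspondence details the paper adds for CD-3D, CD-4D, and CD-M.

```python
import torch

def chamfer_distance(a, b):
    """a: (Na, 3), b: (Nb, 3) points sampled on two surfaces."""
    d = torch.cdist(a, b)                    # (Na, Nb) pairwise distances
    a_to_b = d.min(dim=1).values.mean()      # each point in a -> nearest in b
    b_to_a = d.min(dim=0).values.mean()      # each point in b -> nearest in a
    return a_to_b + b_to_a                   # symmetric, lower is better
```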

The test: On an Objaverse benchmark of 32 animated scenes, the team measured three things:

  • CD-3D: per-frame geometry accuracy
  • CD-4D: sequence-wide 4D accuracy (temporal consistency)
  • CD-M: motion fidelity (do points move like they should?)

The competition: LIM, DreamMesh4D, V2M4, plus qualitative comparisons to ShapeGen4D.

The scoreboard (contextualized):

  • ActionMesh: CD-3D = 0.050, CD-4D = 0.069, CD-M = 0.137; about 3 minutes per 16 frames.
  • Prior bests were noticeably worse: compared to the best per-metric competitor, ActionMesh improves CD-3D by ~21% (like going from a solid A to A+), CD-4D by ~46% (cutting almost half the error in sequence consistency), and CD-M by ~45% (much more faithful motion) while running roughly 10× faster (3 minutes vs 15–45 minutes).

Qualitative findings (Consistent4D set):

  • LIM and DreamMesh4D: softer shapes, visible artifacts.
  • V2M4 and ShapeGen4D: sharper but with artifacts and some temporal drift.
  • ActionMesh: highest geometric fidelity with strong temporal coherence and smoother motion.

Surprising/Notable results:

  • Real-video generalization: Trained on synthetic data, ActionMesh handled real DAVIS videos with complex motions and occlusions better than expected.
  • Motion transfer: Even without special training, it could retarget motion from one object (e.g., a bird) to another (e.g., a dragon) when semantics roughly match.
  • Ablations:
    • Removing Stage II: same per-frame accuracy but loses the ability to output a single animated mesh (no consistent topology).
    • Removing both stages (per-frame image-to-3D): big 4D consistency drop—shows Stage I is crucial.
    • Changing backbone (Craftsman instead of TripoSG): still competitive but not as strong, indicating benefits from the chosen backbone.
    • Removing rotary time cues or masked generation hurts both quality and capabilities.

Bottom line: Across geometry, temporal stability, and motion accuracy, ActionMesh wins by a meaningful margin and does so in a fraction of the time.

05. Discussion & Limitations

Limitations (be specific):

  • No true topology changes: If an object splits/merges (like hands clasping then separating into new parts), the method can’t change mesh connectivity mid-sequence.
  • Strong occlusions: Hidden parts that never appear, or vanish during motion, can be poorly reconstructed.
  • Quality tied to reference: If the chosen reference frame is blurry or distorted, that flaw can echo through the animation.
  • Category-agnostic but prior-dependent: Benefits from strong image-to-3D priors; weaker priors reduce fidelity.

Required resources:

  • A modern GPU for fast inference (minutes for 16 frames); CPU-only would be slow.
  • An image-to-3D backbone (e.g., TripoSG) for reference mesh extraction.
  • Video pre-processing (optional) to segment foreground for real-world clips.

When not to use:

  • Scenes needing mesh surgery (e.g., tearing cloth, opening/closing mouths that change topology drastically, liquids, explosions).
  • Extreme occlusions or tiny, fast-moving parts with no clear views.
  • Cases where non-mesh fields (e.g., volumetric effects) are the goal.

Open questions:

  • Topology evolution: Can we design latent updates that add/remove local parts safely while keeping production-readiness?
  • Stronger time reasoning: Could future temporal modules capture even longer, more complex actions without autoregression?
  • Better occlusion handling: Can we fuse multiple cues (e.g., learned priors + multi-view hallucination) for hidden regions?
  • Material dynamics: How to couple geometry motion with dynamic materials (wrinkles, secondary motion) in a feed-forward way?
  • Unified retargeting: How to guarantee motion-transfer quality across very dissimilar shapes (e.g., bird to car) with minimal user input?

06. Conclusion & Future Work

Three-sentence summary: ActionMesh adds a time axis to powerful 3D diffusion so it generates synchronized per-frame shapes, then converts those shapes into deformations of a single reference mesh to deliver topology-consistent animation. This feed-forward design produces production-ready animated meshes that are rig-free, high quality, and about 10× faster than prior optimization-heavy methods. It works from videos, text, images+text, or 3D+text, and supports motion transfer and longer sequences via autoregression.

Main achievement: A practical, general, and fast pipeline for turning everyday inputs into animated, topology-consistent 3D meshes with state-of-the-art accuracy and temporal consistency.

Future directions:

  • Topology-aware generation to handle splits/merges without manual surgery.
  • Stronger temporal modules for even longer, more complex motions.
  • Improved occlusion reasoning and learning from large in-the-wild video corpora.
  • Closer integration with material/texture dynamics.

Why remember this: ActionMesh shows that minimal, clever temporal extensions to strong 3D backbones—plus a deformation-based assembly step—can unlock high-quality, production-ready 4D meshes in minutes, opening doors for rapid content creation across games, film, AR/VR, and beyond.

Practical Applications

  • Rapid prototyping of animated game assets without manual rigging.
  • Previsualization for film and commercials from quick phone videos.
  • Animating product models for marketing with consistent textures across motion.
  • AR/VR character and prop animation directly from user videos or text prompts.
  • Motion transfer to quickly apply a performance to multiple characters or styles.
  • Educational content creation: animate scientific or historical 3D models in minutes.
  • Social media stickers and 3D emojis that move based on simple text prompts.
  • Virtual try-on and fashion previews by animating garments on 3D mannequins.
  • Simulation pre-steps: generate plausible deforming meshes for robotics or physics tests.
  • 3D asset libraries that include both static and animated versions generated consistently.
#ActionMesh · #temporal 3D diffusion · #animated 3D mesh · #topology consistency · #rig-free animation · #inflated attention · #masked generation · #temporal 3D autoencoder · #deformation fields · #video-to-4D · #Chamfer distance · #VecSet latents · #TripoSG · #motion transfer · #autoregressive animation