MoCapAnything: Unified 3D Motion Capture for Arbitrary Skeletons from Monocular Videos
Key Summary
- MoCapAnything is a system that turns a single regular video into a 3D animation that can drive any rigged character, not just humans or one animal type.
- It solves the problem by first predicting where each joint moves in 3D over time and then converting those paths into the exact rotations your character needs.
- A special Reference Prompt Encoder studies the target character's skeleton, mesh, and pictures so the system knows how this specific character is built.
- A Video Feature Extractor watches the video and also builds a rough 4D mesh (shape over time) to bridge the gap between pixels and joints.
- A Unified Motion Decoder blends the reference and video cues to make smooth, temporally consistent 3D joint trajectories.
- An IK Fitting step turns these joint paths into clean, rig-respecting joint rotations that follow hierarchy, bone lengths, and joint limits.
- On the Truebones Zoo benchmark (1,038 clips), MoCapAnything beats a top competitor (GenZoo) by a large margin on structural accuracy (CD-Skeleton 0.2549 vs 0.4580).
- It works in the wild and can retarget motions across species and even to non-biological rigs (like robots or toys).
- The approach is modular and scalable, pointing to a future where you can "prompt" motion capture for any 3D asset.
- Limitations include reliance on a good 4D reconstruction and known rig structure, plus limited physics and contact modeling.
Why This Research Matters
This work lets creators animate any character from ordinary videos, shrinking the distance between creative ideas and finished motion. Studios can reuse a single motion across large, diverse asset libraries, saving time and cost. Indie teams and educators can bring mixed casts (animals, robots, toys) to life without building special tools for each species. VTubers and virtual production can quickly bind new avatars to motions captured from live or recorded footage. By factorizing motion into universal trajectories and rig-specific rotations, the method is robust to changing rigs and helps standardize pipelines. It also hints at a future of prompt-based animation where text or images guide motion capture for truly arbitrary assets.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're filming your pet parrot doing a funny dance, and you want a dinosaur, a robot, and a cartoon jellybean to copy that exact dance in your game. Today, that usually means different tools, lots of manual work, and often disappointment.
The Concept (3D motion capture): 3D motion capture is the process of turning real-world movement into 3D joint movements that a computer character can perform. How it works:
- Watch the moving subject (with cameras or sensors).
- Track key joint positions over time (like shoulders, elbows, knees).
- Use those to drive a 3D skeleton so the character moves the same way. Why it matters: Without it, animators would have to hand-animate every frame, which is slow and hard, especially for long or complex motions. Anchor: That's how movies get realistic sword fights or dancers: actors move, and digital characters copy them.
Hook: You know how you can connect the dots in a flipbook to see a character running? Each dot's path tells the story of the motion.
The Concept (Joint trajectories): Joint trajectories are the 3D paths that each joint follows over time, like the curve your wrist draws while waving. How it works:
- For every frame, find the 3D position of each joint.
- Line them up across frames to form smooth paths.
- Use these paths to understand and recreate the motion. Why it matters: Without clean joint trajectories, motions look jittery or wrong, and converting them to joint rotations becomes very unreliable. Anchor: Picture drawing a dotted line tracing a cat's paw as it walks; that dotted line is the paw's trajectory.
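To make this concrete, a set of joint trajectories is just an array of per-frame, per-joint 3D positions. A minimal sketch (illustrative shapes and joint indices, not from the paper):

```python
import numpy as np

# Hypothetical clip: T frames of a skeleton with J joints, each an (x, y, z) point.
T, J = 120, 24
trajectories = np.zeros((T, J, 3))            # filled in by a motion-capture system

wrist_path = trajectories[:, 7, :]            # one joint's 3D path over time, shape (T, 3)
velocities = np.diff(trajectories, axis=0)    # frame-to-frame motion, shape (T-1, J, 3)
speeds = np.linalg.norm(velocities, axis=-1)  # how fast each joint moves per frame
```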
Hook: Think of sockets for different plugs: a phone charger, a laptop brick, or a toy car battery. Wouldn't it be great if one smart adapter fit them all?
The Concept (Category-Agnostic Motion Capture, CAMoCap): CAMoCap is motion capture that works for any kind of skeleton, not just humans or one animal type. How it works:
- Take a plain video of something moving.
- Take any rigged 3D asset (could be a dragon, robot, or lamp with joints).
- Rebuild the motion so that this exact rig can perform it. Why it matters: Without CAMoCap, studios must build special tools per species or template and still struggle to retarget motions cleanly to different characters. Anchor: A bird-flapping video can animate a pterosaur, a cat rig, or a paper airplane with a hinge: same motion, new body.
The world before: For years, motion capture from regular videos worked mainly for humans using fixed templates (like SMPL) or for a few animal families (like quadrupeds with SMAL). These systems fit well in their own categories but stumble when you try to apply them to different skeletons or whimsical assets (like mechs or toys). Category-agnostic 2D keypoint methods could find landmarks on many creatures, but they stopped at 2D, didn't model time well, and didn't output animation-ready 3D.
The problem: Content creators need to drive huge, mixed libraries of characters (games, films, VTubers) where rigs vary wildly. They want to:
- Retarget human or animal motion to non-biological rigs.
- Animate crowds of many kinds of characters.
- Quickly use new IP-specific creatures without building a whole new template. Existing pipelines tied to one species or one skeleton don't scale.
Failed attempts: Directly predicting joint rotations from video sounds simple but quickly breaks: angles depend on each rig's local frames, depth is ambiguous in monocular video, and per-frame angle guessing causes jitter. Using only 2D landmarks or only video pixels misses important 3D structure needed for stable motion.
The gap: We need a system that reads the specific target rig (its skeleton, mesh, and appearance), reads the video, and then finds motion in a rig-agnostic way first (trajectories) before translating it to the rigās exact rotations.
Real stakes: Picture a YouTuber who swaps avatars weekly, a game studio filling a city with diverse NPCs, a teacher making a zoo of learning characters, or a small indie team reusing one dance across robots, dinosaurs, and stick figures. A unified, prompt-style motion capture pipeline turns days of rig-specific clean-up into minutes and opens playful creativity to everyone.
02 Core Idea
Hook: You know how you first plan a road trip by tracing the route on a map, and only later worry about how to steer the car wheel at every turn?
The Concept (Aha!): First predict universal joint trajectories from the video, then translate them into the asset's specific joint rotations with smart IK, guided by a reference prompt that tells the system exactly what the target rig looks like. How it works (big picture):
- Read the target asset's skeleton, mesh, and pictures to make per-joint "reference questions."
- Watch the video and also build a rough 4D mesh to get geometry over time.
- Fuse everything to predict smooth 3D joint trajectories.
- Convert trajectories into per-joint rotations that fit the rig's rules (bone lengths, limits) via IK. Why it matters: Without splitting the problem, rotations are messy and rig-dependent; by separating "where joints go" from "how this rig rotates," the method becomes stable across wildly different characters. Anchor: It's like plotting a dancer's footprints (trajectories) before figuring out each ankle's twist (rotations) for a kangaroo costume vs. a robot body.
Three analogies:
- Translator analogy: First capture the story (trajectories), then translate into each language's grammar (rotations per rig). You keep meaning first, then adjust words.
- Recipe analogy: First gather universal ingredients (joint paths) from the video, then season to each kitchen's taste (IK per rig), following house rules (joint limits).
- GPS analogy: First get the path from A to B (trajectories), then adapt to your vehicle's steering geometry (rotations), respecting the car's constraints.
Before vs. after:
- Before: Systems locked to humans or a few animals; retargeting was fragile; monocular video made angles noisy.
- After: One pipeline works for arbitrary rigs, because it predicts rig-agnostic 3D paths first and only then computes rig-specific rotations.
Why it works (intuition):
- Trajectories are a common "language" across skeletons; a paw arc and a hand arc are both paths in space.
- Rotations depend on each rig's local frames, so learn them later with constraints and IK.
- A rough 4D mesh gives geometry over time, bridging raw pixels and sparse joints.
- A reference prompt (skeleton+mesh+images) anchors the system to the exact rig you will animate, disambiguating symmetrical parts and proportions.
Building blocks (the recipe pieces):
Hook: Imagine bringing a new board game to friends. You first read the rules so you all know how to play.
The Concept (Reference Prompt Encoder): It turns the target asset's skeleton, mesh, and example images into per-joint "questions" the decoder can answer. How it works:
- Read the skeleton graph (who's the parent, who's the child).
- Read mesh geometry (sample surface points and normals).
- Read appearance images (texture cues to tell left from right, etc.).
- Mix them so each joint gets a smart, structure-aware query. Why it matters: Without these joint-specific questions, the model won't know your rig's proportions or which parts are which, causing mismatches. Anchor: Like giving every actor a character card before rehearsal so they know who they are and how to move.
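As a rough illustration of how such per-joint queries could be fused, here is a minimal PyTorch sketch of one encoder layer. All shapes, dimensions, and module names are assumptions for illustration, not the authors' implementation:

```python
import torch
import torch.nn as nn

class ReferencePromptLayer(nn.Module):
    """Illustrative fusion layer: per-joint queries attend to skeleton
    neighbours, sampled mesh points, and appearance image tokens."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.graph_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mesh_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, joint_q, adj_mask, mesh_tokens, img_tokens):
        # joint_q:     (B, J, dim) one query per joint of the target rig
        # adj_mask:    (J, J) bool; False on self/parent/child pairs, True elsewhere
        #              (True = attention blocked, so messages stay on the skeleton graph)
        # mesh_tokens: (B, M, dim) encoded surface points and normals
        # img_tokens:  (B, I, dim) encoded appearance renders
        q = joint_q
        q = q + self.graph_attn(q, q, q, attn_mask=adj_mask)[0]  # skeleton topology
        q = q + self.mesh_attn(q, mesh_tokens, mesh_tokens)[0]   # geometry context
        q = q + self.img_attn(q, img_tokens, img_tokens)[0]      # appearance cues
        return q + self.ffn(q)                                   # structure-aware joint queries
```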
Hook: Watching a sports clip, your eyes follow players and also the field lines to understand depth and shape.
The Concept (Video Feature Extractor): It studies the video frames and also builds a rough 4D mesh over time to provide geometry-aware tokens. How it works:
- Extract dense visual features from each frame.
- Estimate a coarse deforming surface (4D mesh) across time.
- Sample points with positions and normals as geometry hints. Why it matters: Without geometry, pixels alone struggle to reveal 3D depth and occlusions, making joint paths wobbly. Anchor: It's like sketching the moving outline of a cheetah so you can place the joints in 3D more confidently.
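A small sketch of the geometry-hint idea: sample points and normals from the coarse per-frame surface and tag them with a time encoding so the decoder knows when each sample occurs. Shapes and the encoding scheme are assumptions, not the paper's exact recipe:

```python
import torch

def geometry_tokens(points, normals, dim=256, n_freqs=8):
    """Turn per-frame surface samples from a coarse 4D mesh into tokens.
    points, normals: (T, P, 3). Returns (T, P, dim) tokens made of
    [xyz | normal | sinusoidal time encoding], zero-padded to `dim`."""
    T, P, _ = points.shape
    t = torch.arange(T, dtype=torch.float32).view(T, 1, 1).expand(T, P, 1)
    freqs = 2.0 ** torch.arange(n_freqs, dtype=torch.float32)          # 1, 2, 4, ...
    angles = t / T * freqs                                             # (T, P, n_freqs)
    time_enc = torch.cat([torch.sin(angles), torch.cos(angles)], -1)   # (T, P, 2*n_freqs)
    feats = torch.cat([points, normals, time_enc], dim=-1)             # xyz + normal + time
    pad = torch.zeros(T, P, dim - feats.shape[-1])
    return torch.cat([feats, pad], dim=-1)                             # (T, P, dim)
```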
Hook: When you cook, you taste sugar, salt, and spice together to decide the final flavor, not each alone.
The Concept (Unified Motion Decoder): It fuses the reference queries, video features, and 4D geometry to predict smooth 3D joint trajectories. How it works:
- Joints talk to their neighbors along the skeleton (graph attention).
- Joints look at nearby video frames for short-term clues.
- Joints consult the 4D mesh points to resolve depth/shape.
- Each joint smooths itself over time to avoid jitter. Why it matters: Without this fusion, you'd either misplace joints (no geometry), confuse symmetric parts (no appearance), or break the skeleton's logic (no topology). Anchor: Think of a conductor blending strings, brass, and percussion so the whole orchestra stays in sync.
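One fusion layer of such a decoder might look like the PyTorch sketch below. For brevity it attends only to video and geometry tokens of the matching frame, whereas the description above mentions a short temporal window; all names and shapes are illustrative, not the authors' architecture:

```python
import torch
import torch.nn as nn

class MotionDecoderLayer(nn.Module):
    """Illustrative fusion layer: skeleton-graph attention, cross-attention to
    video and 4D-geometry tokens, then per-joint temporal self-attention."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.graph_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.video_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.geom_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.time_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, adj_mask, video_tokens, geom_tokens):
        # x:            (B, T, J, dim) one token per joint per frame
        # adj_mask:     (J, J) bool; True = attention blocked (non-neighbour joints)
        # video_tokens: (B, T, V, dim) visual features (same-frame only, for brevity)
        # geom_tokens:  (B, T, G, dim) sampled 4D-mesh points (same-frame only)
        B, T, J, D = x.shape
        xf = x.reshape(B * T, J, D)
        xf = xf + self.graph_attn(xf, xf, xf, attn_mask=adj_mask)[0]  # intra-frame skeleton
        v = video_tokens.reshape(B * T, -1, D)
        xf = xf + self.video_attn(xf, v, v)[0]                        # look at the frames
        g = geom_tokens.reshape(B * T, -1, D)
        xf = xf + self.geom_attn(xf, g, g)[0]                         # consult 4D geometry
        xt = xf.reshape(B, T, J, D).permute(0, 2, 1, 3).reshape(B * J, T, D)
        xt = xt + self.time_attn(xt, xt, xt)[0]                       # per-joint smoothing over time
        return xt.reshape(B, J, T, D).permute(0, 2, 1, 3)             # back to (B, T, J, dim)
```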
Hook: If you know where your hand should be, you can figure out how to bend your elbow and shoulder to reach it.
The Concept (Inverse Kinematics, IK Fitting): IK computes the joint rotations that make the bones reach the predicted joint positions while respecting the rig's rules. How it works:
- Start with a geometric guess that aligns bones to the predicted positions.
- Optimize rotations a little to reduce errors.
- Add rules for bone lengths, joint limits, and smoothness over time. Why it matters: Without IK, rotations might twist weirdly, break constraints, or jitter between frames. Anchor: It's like posing an action figure to match a photo: you first roughly align limbs, then fine-tune the angles so nothing pops out of place.
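For intuition, here is a minimal, purely gradient-based sketch of the fitting idea: optimize per-frame rotations so forward kinematics reproduces the target joint positions, with a smoothness penalty. The paper additionally describes a geometric initialization and rig-specific constraints such as joint limits; the helper names and simple loss below are assumptions, not the paper's solver:

```python
import torch

def axis_angle_to_matrix(aa):
    """Rodrigues' formula: (J, 3) axis-angle vectors -> (J, 3, 3) rotation matrices."""
    theta = (aa * aa).sum(-1, keepdim=True).add(1e-8).sqrt()        # (J, 1)
    k = aa / theta                                                   # unit rotation axes
    zero = torch.zeros_like(k[:, 0])
    K = torch.stack([                                                # skew-symmetric cross-product matrices
        torch.stack([zero, -k[:, 2], k[:, 1]], -1),
        torch.stack([k[:, 2], zero, -k[:, 0]], -1),
        torch.stack([-k[:, 1], k[:, 0], zero], -1)], -2)
    theta = theta.unsqueeze(-1)                                      # (J, 1, 1)
    return torch.eye(3) + torch.sin(theta) * K + (1 - torch.cos(theta)) * (K @ K)

def forward_kinematics(rotmats, offsets, parents):
    """Single-frame FK. rotmats: (J, 3, 3) local rotations; offsets: (J, 3) rest-pose
    offset of each joint from its parent (the root's offset is its rest position);
    parents: list of parent indices (-1 = root), ordered parents-before-children."""
    J = len(parents)
    pos, glob = [None] * J, [None] * J
    for j in range(J):
        if parents[j] < 0:
            glob[j], pos[j] = rotmats[j], offsets[j]
        else:
            p = parents[j]
            glob[j] = glob[p] @ rotmats[j]
            pos[j] = pos[p] + glob[p] @ offsets[j]
    return torch.stack(pos)

def fit_rotations(targets, offsets, parents, steps=200, lr=0.05, smooth=0.1):
    """Fit per-frame axis-angle rotations so FK reproduces the predicted joint
    positions (targets: (T, J, 3)), with a temporal-smoothness penalty."""
    T, J, _ = targets.shape
    aa = torch.zeros(T, J, 3, requires_grad=True)                    # axis-angle per joint per frame
    opt = torch.optim.Adam([aa], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        pred = torch.stack([forward_kinematics(axis_angle_to_matrix(aa[t]), offsets, parents)
                            for t in range(T)])                      # (T, J, 3)
        pos_loss = ((pred - targets) ** 2).sum()                     # match predicted trajectories
        smooth_loss = ((aa[1:] - aa[:-1]) ** 2).sum()                # discourage frame-to-frame jitter
        (pos_loss + smooth * smooth_loss).backward()
        opt.step()
    return aa.detach()                                               # per-joint, per-frame rotations
```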
03 Methodology
High-level flow: Input video + reference asset → Reference Prompt Encoder → Video Feature Extractor (visual + 4D geometry) → Unified Motion Decoder → 3D joint trajectories → IK Fitting → asset-specific joint rotations.
Step-by-step, like a recipe:
- Inputs
- What happens: The system takes a monocular video (regular camera) of something moving and a rigged 3D asset (with skeleton, mesh, and optional images of the asset).
- Why it exists: We need both the motion source (video) and the exact target body we want to animate (rig with geometry/appearance) so we can be precise.
- Example: A video of a seagull flapping, plus a rigged pterosaur model with skeleton, mesh, and a few renders.
- Reference Prompt Encoder Hook: Before building a LEGO model, you study the instruction booklet and the shapes of the pieces you actually have.
The Concept: The encoder turns the target asset's skeleton, mesh, and images into per-joint queries that "explain" how each joint should behave. How it works:
- Skeleton: Treat joints as a graph; use attention that respects parent-child links so information travels along limbs.
- Mesh: Sample points and normals from the surface; joints attend to these to learn where they live relative to geometry (like implicit skinning).
- Images: Encode appearance tokens to break left/right look-alike confusions and provide texture cues.
- Fuse them across layers so each joint query becomes structure- and geometry-aware. Why it matters: Without these queries, the decoder wouldn't know, for example, how long the forearm is, which side is left, or how the rig's joints connect. Anchor: Like giving each dancer a personalized cue card: who they hold hands with (topology), their costume shape (mesh), and photos of themselves (appearance).
- Video Feature Extractor Hook: If you want to catch a ball, you watch both the ball and the air path it's carving.
The Concept: It extracts two synchronized streams: dense visual features per frame and a coarse 4D deforming mesh over time. How it works:
- Visual stream: A frozen image encoder turns each frame into rich feature tokens.
- Geometry stream: A pretrained reconstructor estimates a rough surface for every frame; sample points with positions and normals; add time encoding.
- Pack both as the video evidence the decoder will attend to. Why it matters: Pixels are dense but flat; joints are sparse but 3D. The 4D mesh bridges this gap by offering shape-over-time hints. Anchor: It's like drawing a moving wireframe around the animal so placing joints in space becomes easier.
Hook: Picture a flipbook where each page shows not only the drawing, but also a faint grid that helps you keep proportions consistent.
The Concept (4D mesh sequence): A 4D mesh is a rough, time-varying surface of the subject: a 3D shape that changes every frame. How it works:
- Estimate a coarse mesh for each frame from the video.
- Sample points and normals as geometry tokens.
- Use them to inform where and how joints can plausibly lie. Why it matters: Without 4D geometry, depth, self-occlusions, and non-rigid deformations become hard, making the trajectories shaky. Anchor: Think of tracing paper over each video frame to sketch the subject's silhouette in 3D as it moves.
- Unified Motion Decoder Hook: A coach gives play-by-play directions (video), reminds you of team structure (skeleton), and points at the field lines (geometry) so the team runs crisp routes.
The Concept: The decoder fuses reference queries, visual tokens, and 4D geometry to predict temporally coherent 3D joint trajectories. How it works (per layer):
- Intra-frame graph attention: Joints exchange info with neighbors along the skeleton to respect limb couplings.
- Cross-attend to a temporal window of video features: short-range cues fight occlusions and blur.
- Cross-attend to a temporal window of 4D geometry points: resolves depth and shape, especially during overlaps and bends.
- Per-joint temporal self-attention: smooths over time and captures dynamics.
- Head MLP outputs each joint's 3D position per frame. Why it matters: Remove any branch and performance drops: no skeleton logic, no geometry, or no temporal smoothing means noisier, less believable motion. Anchor: Like blending close-up footage, the stadium layout, and team formation to predict where each player should be at every second.
- Training objective
- What happens: The model learns by minimizing the distance between predicted and ground-truth 3D joint positions, with masking for different joint counts across assets.
- Why it exists: Focusing on trajectories keeps the task rig-agnostic and improves temporal stability.
- Example: If the model puts a knee 4 cm off from ground truth, it gets a penalty to nudge it closer next time (see the loss sketch after this recipe).
- IK Fitting Hook: When posing a marionette, you place the hand in space first, then adjust shoulder, elbow, and wrist so strings aren't tangled and the pose looks natural.
The Concept: IK turns predicted 3D joint positions into joint rotations that obey the rig's hierarchy, bone lengths, and limits, and stay smooth over time. How it works:
- Geometric initialization: align bone directions to target joint positions for a good first guess.
- Small optimization: refine rotations so forward kinematics matches the predicted positions.
- Regularize twists and warm-start from the previous frame to avoid jitter. Why it matters: Without IK, the same positions could map to weird, twisty angles or violate joint rules, causing artifacts. Anchor: It's like tightening the screws of a folding ladder so each step lines up cleanly without wobble.
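Returning to the training objective above, here is a minimal sketch of a masked position loss that handles assets with different joint counts. Tensor shapes and the exact distance (squared here) are assumptions rather than the paper's definition:

```python
import torch

def trajectory_loss(pred, target, joint_mask):
    """Masked position loss over padded joint sets. pred, target: (B, T, J_max, 3)
    joint positions, padded to the largest joint count in the batch; joint_mask:
    (B, J_max) with 1 for real joints and 0 for padding."""
    sq_dist = ((pred - target) ** 2).sum(dim=-1)            # (B, T, J_max) squared distances
    mask = joint_mask[:, None, :].expand_as(sq_dist)        # broadcast the mask over frames
    return (sq_dist * mask).sum() / mask.sum().clamp(min=1) # average over real joints only
```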
The secret sauce:
- Factorization: Predict rig-agnostic trajectories first, then solve rig-specific rotations.
- Multimodal bridge: Use a 4D mesh to connect dense video pixels and sparse joints.
- Reference prompting: Per-joint queries packed with skeleton, mesh, and appearance cues tailor motion to this exact asset.
- Constraint-aware IK: Enforces anatomy and smoothness, turning good paths into great animations.
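Putting the pieces together, the factorized flow can be summarized in a few lines. Every name below is illustrative; the callables merely stand in for the stages described in this section, not any released code:

```python
def mocap_anything(video_frames, asset, stages):
    """Factorized pipeline sketch. `asset` bundles the rig (skeleton, mesh, renders);
    `stages` is a dict of callables standing in for the described components."""
    # Rig-agnostic first: per-joint queries + video evidence -> 3D trajectories.
    queries = stages["reference_prompt"](asset["skeleton"], asset["mesh"], asset["renders"])
    visual = stages["video_features"](video_frames)
    geometry = stages["geometry_4d"](video_frames)                       # coarse 4D mesh -> point tokens
    trajectories = stages["motion_decoder"](queries, visual, geometry)   # (T, J, 3)
    # Rig-specific second: constraint-aware IK turns paths into rotations.
    return stages["ik_fit"](trajectories, asset["skeleton"])
```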
04 Experiments & Results
Hook: Imagine a school sports day with three races: speed, smoothness, and accuracy around cones. You don't just want to win one; you want to do well in all.
The Concept (Truebones Zoo dataset): A large benchmark of 1,038 motion clips where each example includes a skeleton, a mesh, and a rendered video of the asset doing the motion. How it works:
- Train on most clips; keep 60 diverse ones for testing (seen, rare, unseen categories).
- Measure how well methods predict 3D joints and overall skeletal structure.
- Compare to a strong baseline (GenZoo). Why it matters: Without a consistent dataset bundle (skeleton, mesh, video), it's hard to fairly judge if a system truly generalizes to many rigs. Anchor: It's like testing a universal charger on phones, tablets, and cameras to prove it really works for all gadgets.
Metrics (turning numbers into meaning):
Hook: Grading a drawing, you check how close lines are (accuracy), how even the strokes look (smoothness), and how well the shapes match (structure).
The Concept (MPJPE): Mean Per Joint Position Error is the average 3D distance between each predicted joint and the true joint position. How it works:
- For every joint, measure the 3D gap in millimeters.
- Average across all joints and frames.
- Lower is better. Why it matters: Without MPJPE, we wouldn't know basic positional accuracy. Anchor: If your elbow is 2 cm off on average, MPJPE says so.
Hook: Think of watching a car's speedometer over time; jumpy changes feel wrong even if the car ends up at the right place.
The Concept (MPJVE): Mean Per Joint Velocity Error measures how smooth and realistic the motion is by comparing joint speeds over time. How it works:
- Compute each jointās velocity frame-to-frame.
- Compare predicted vs. true velocities.
- Average the differences; lower is better. Why it matters: Without MPJVE, a method could be accurate but jittery. Anchor: A dancer who snaps strangely between poses would score badly on MPJVE.
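A minimal sketch of both per-joint metrics (MPJPE and MPJVE), assuming predictions and ground truth share joint ordering and units; the benchmark's exact protocol, such as any alignment step, may differ:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Per Joint Position Error: average 3D distance between predicted
    and ground-truth joints. pred, gt: (T, J, 3) arrays in the same units."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def mpjve(pred, gt):
    """Mean Per Joint Velocity Error: the same distance measured on
    frame-to-frame velocities, so jittery motion is penalised even when
    individual positions are close."""
    pred_vel = np.diff(pred, axis=0)
    gt_vel = np.diff(gt, axis=0)
    return np.linalg.norm(pred_vel - gt_vel, axis=-1).mean()
```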
Hook: If two skeletons strike a pose, you can compare how closely their bone layout matches, not just single points.
The Concept (CD-Skeleton): A Chamfer-like distance between predicted and true skeletons that considers the whole articulated structure. How it works:
- For each joint in one skeleton, find the closest spot on the other skeleton's bone segments.
- Average these distances both ways.
- Lower is better and reflects structural alignment. Why it matters: Without a structure-aware metric, you might pass point checks but fail the overall pose shape. Anchor: It's like measuring how snugly two puzzle outlines fit, not just comparing a few dots.
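Following the description above, here is a minimal sketch of such a structure-aware distance: each joint is matched to the closest point on the other skeleton's bone segments, and the two directions are averaged. Treat it as one plausible reading, not the benchmark's exact definition:

```python
import numpy as np

def point_to_segment(p, a, b):
    """Distance from point p to the segment [a, b]; all inputs are (3,) arrays."""
    ab, ap = b - a, p - a
    t = np.clip(ap @ ab / max(ab @ ab, 1e-12), 0.0, 1.0)   # clamp the projection onto the bone
    return np.linalg.norm(p - (a + t * ab))

def cd_skeleton(joints_a, bones_a, joints_b, bones_b):
    """Symmetric, Chamfer-style skeleton distance. joints_*: (J, 3) arrays;
    bones_*: lists of (parent, child) joint-index pairs defining the segments."""
    def one_way(joints, other_joints, other_bones):
        d = [min(point_to_segment(j, other_joints[p], other_joints[c])
                 for p, c in other_bones) for j in joints]
        return float(np.mean(d))
    return 0.5 * (one_way(joints_a, joints_b, bones_b) +
                  one_way(joints_b, joints_a, bones_a))
```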
The test and competition: The team compared MoCapAnything to GenZoo, a notable animal mocap system that focuses mainly on quadrupeds. Tests covered quadrupeds and non-quadrupeds across seen, rare, and unseen categories.
Scoreboard with context:
- CD-Skeleton (all test cases): MoCapAnything 0.2549 vs. GenZoo 0.4580. Think of this like getting an A- where the other gets a C: a clear, practical difference in structural faithfulness.
- On non-quadrupeds, MoCapAnything's advantage grows, showing it handles birds, bipeds, reptiles, and even non-biological rigs better.
- Ablations (removing parts):
- Without mesh features, errors rise notably, especially on rare/unseen species (geometry is crucial).
- Without graph attention on the skeleton, temporal and structural stability drop (topology matters).
- Without appearance images, confusion increases for symmetric parts (left/right mix-ups).
- Architecture sweeps suggest the chosen layer setup gives a strong balance of accuracy and smoothness, especially lowering velocity error on unseen data.
Surprising findings:
- Even though it wasn't trained as a "retargeting engine," the model naturally transfers motions across very different skeletons thanks to the factorized design and rich prompts.
- The 4D mesh bridge is disproportionately helpful in wild cases with occlusions and fast motion: it acts like a safety net for depth and deformation.
- A lightweight IK stage, when warm-started and twist-regularized, is enough to produce stable, high-quality rotations from good trajectories.
Qualitative highlights:
- Bird videos animating quadrupeds (and vice versa) look surprisingly plausible: flapping becomes bounding-like motions.
- Fish-like swimming retargeted to reptiles creates creative but physically consistent tail and spine flows.
- Human-to-animal and animal-to-human transfers demonstrate the generality of the approach beyond species lines.
05 Discussion & Limitations
Limitations:
- Dependence on the 4D reconstruction quality: If the coarse deforming mesh is poor (motion blur, cluttered backgrounds), trajectories can degrade.
- Requires a known, rigged skeleton: The system expects a clear hierarchy with bone lengths and joint limits; unrigged meshes need preprocessing.
- Camera-space focus and limited physics: No explicit contact or world-grounding means sliding feet or implausible forces can slip through in tough scenes.
- Symmetry and rare morphologies: Extremely unusual rigs or perfect bilateral symmetry can still confuse left/right without strong appearance cues.
Required resources:
- A rigged asset (skeleton + mesh; optional renders) and a monocular video clip.
- Pretrained feature backbones and a 4D reconstruction model.
- Modest compute for inference: the decoder plus a small IK optimization per frame.
When NOT to use:
- If you need physically accurate contacts (foot friction, ground reaction forces) or world-scale trajectories for robotics control.
- If the asset has no rig at all or highly non-standard joints without limits.
- If the video is too dark, the subject is too small in frame, or motion blur is extreme enough to break both the visual features and the 4D reconstruction.
Open questions:
- Can we reduce reliance on 4D reconstruction by learning stronger video-only geometry priors?
- How to add contact, balance, and simple physics so feet stick and tails whip with momentum-aware constraints?
- Can text-only or multimodal prompts replace rendered images to describe new rigs and semantics?
- How to recover world-grounded motion from monocular video without extra sensors?
- Can the pipeline scale to multi-character interactions and hand-object contacts with the same factorized design?
06 Conclusion & Future Work
Three-sentence summary: MoCapAnything turns a single video plus any rigged 3D asset into clean, rig-specific joint rotations by first predicting universal 3D joint trajectories and then applying constraint-aware IK. A reference prompt (skeleton, mesh, images) and a 4D mesh bridge let the model fuse structure, appearance, and geometry to produce smooth, accurate motion across many species and even non-biological rigs. On the Truebones Zoo benchmark and in the wild, it delivers strong accuracy and generalization, enabling practical, prompt-based motion capture for arbitrary assets.
Main achievement: The key contribution is the factorized, reference-guided framework that separates trajectory prediction (rig-agnostic) from rotation recovery (rig-specific), bridged by a 4D geometry stream and finished with a lightweight, stable IK stage.
Future directions: Add physics- and contact-aware IK, recover world-grounded trajectories, reduce or replace 4D recon with learned video geometry priors, explore text-only prompts, and extend to multi-character and hand-object scenarios.
Why remember this: It reframes motion capture as a promptable task that works for any skeleton, much like how promptable image models unlocked creativity across styles, opening the door to animating truly "anything" from ordinary videos.
Practical Applications
- Animate a game's entire zoo of creatures from a few reference videos, reusing motions across rigs.
- Retarget a dancer's performance to robots, fantasy creatures, or mascots for live shows or ads.
- Rapidly prototype VTuber avatars that switch bodies while keeping consistent performances.
- Generate crowd scenes with varied characters sharing core motions but tailored to their rigs.
- Previsualize film shots by driving placeholder assets from on-set monocular footage.
- Create educational content where animals and objects act out lessons using real-world clips.
- Power quick motion tests for newly rigged assets to validate skinning and joint limits.
- Enable creative mashups (e.g., bird-flap motion on a pterosaur or a paper-plane rig).
- Assist accessibility tools that map human motions onto assistive-device avatars.
- Speed up animation clean-up by providing stable, constraint-aware initial rotations.