Choreographing a World of Dynamic Objects
Key Summary
- CHORD is a new way to animate 3D scenes over time (4D) where many objects move and interact, guided only by a text prompt.
- Instead of needing special rules for each object type or huge 4D datasets, CHORD learns motion by listening to powerful video generators and turning their feedback into 3D object movements.
- A key idea is a two-layer "puppet control" for space (coarse then fine control points) so big moves come first and tiny details are added later.
- Another key idea is a time structure called a Fenwick tree that keeps movements smooth and consistent across many frames.
- The team designed a new way to use guidance (SDS) with modern Rectified Flow video models so video models can "choreograph" 3D motion effectively.
- Extra "smoothness rules" across space and time prevent flicker and weird bends, making motion look natural.
- In tests, people strongly preferred CHORD's motion for following prompts and looking realistic compared to four strong baselines.
- CHORD can animate scanned real objects and even guide real robots to manipulate rigid, articulated, and deformable items with zero-shot policies.
- It still can't create brand-new objects that appear from nowhere and depends on the quality of the underlying video model, but it opens a practical path to scalable 4D motion generation.
Why This Research Matters
CHORD makes it practical to animate many different 3D objects interacting in believable ways, guided only by a text prompt, without expert rigging or rare datasets. This lowers the barrier for creators in film, games, AR/VR, and education to build lively scenes quickly. For robotics, CHORD provides dense, physically grounded motion that can guide real manipulators in zero-shot fashion, improving household help, warehouse automation, and assistive devices. Because the method is category-agnostic and draws on broad video knowledge, it works across everyday objects, not just humans or a few trained classes. The stable space-time hierarchy and RF-compatible guidance make modern video models truly useful for 3D motion, pushing the field toward scalable 4D world-building. As video models improve, CHORD's capabilities should expand, enabling richer, longer, and more complex interactions.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you have a toy city made of different figures: people, pets, balls, doors, and a robot arm. You can take a picture of the city, but what if you want to see it come alive: the cat stepping on a cushion, the robot picking up a block, or a kid jumping on a seesaw that launches a brick?
The Concept: 4D scenes mean 3D shapes that change over time, like a mini movie you can view from any angle. How it works (old world):
- Artists used hand-crafted rules for each type of object (like special rigs for humans or animals).
- Data-driven methods tried to learn from 4D datasets, but most datasets focus on single objects (often humans) and rarely on multi-object interactions.
- These approaches either need lots of expert labor or lots of narrow data, which doesn't cover everyday objects interacting together.
Why it matters: Without a better way, we can't easily make believable scenes where many different things move together naturally.
Anchor: Think of trying to animate "a dog nudges a ball, and the ball bounces away" when your tools only know how to bend a human arm. The result won't match the story.
Hook: You know how streaming videos show you what's happening frame by frame on a flat screen? That's great for watching, but not for building a 3D world you can walk around in.
The Concept: Most video is in 2D (pixels over time), while building useful 3D worlds needs 3D shapes plus time (4D) with consistent geometry from any viewpoint. How it works (the mismatch):
- Videos describe appearances on a screen (Eulerian view).
- 3D worlds describe how objects themselves move (Lagrangian view).
- This gap makes it hard to turn video motion into 3D object motion directly.
Why it matters: If we can't bridge this gap, we either get pretty videos that break in 3D, or 3D scenes with stiff, unrealistic motion.
Anchor: A basketball in a video is just a circle of pixels moving. In 4D, it's a 3D ball with mass and position that bounces and spins: information you can use from any camera angle.
Hook: Imagine directing a play with many actors, props, and stunts, but you only have still photos of the stage. You need a choreographer who can imagine and guide the motion.
The Concept: CHORD uses a video generative model as a high-level choreographer and distills its sense of motion into 3D object deformations over time. How it works:
- Start with static 3D objects and a text prompt (e.g., "a man closes a laptop").
- Propose small deformations over time (how each object moves).
- Render short videos from random camera views.
- Ask a strong video model if the motion looks plausible for the prompt and get a "fix-it" signal.
- Adjust the 3D motions and repeat.
Why it matters: This lets us borrow the broad knowledge in video models without needing category-specific rigs or giant 4D datasets.
Anchor: The video model says, "Laptops don't float; make the lid rotate down and the base stay put." The 3D motion updates accordingly.
Hook: You know how a good story needs both big plot beats and tiny character moments?
The Concept: 4D motion must handle both coarse moves (like picking up a plate) and fine details (like fingers curling), and must stay smooth over time. How it works (what failed before):
- High-dimensional deformations were hard to optimize; they jittered or got stuck.
- Each frame was treated too independently, causing flicker or drift later in sequences.
- Guidance from modern video models (Rectified Flow) didn't match older algorithms.
Why it matters: Without a stable motion representation and compatible guidance, scenes look off: stiff, floaty, or inconsistent.
Anchor: If you try to animate a cat stepping on a cushion without the right structure, the paw may teleport, or the cushion dents randomly.
Hook: Think of teaching a robot to move things safely at home: closing a microwave door, folding a towel, or moving a banana onto a plate.
The Concept: Real stakes mean reliable, physically reasonable motion matters for robotics, AR/VR, movies, and games. How it works:
- We need rich, dense motion (where every point moves) to guide hands, grippers, or tools.
- Motions must be general (category-agnostic), realistic, and stable.
- A universal pipeline reduces manual work and scales across object types.
Why it matters: Better 4D motion can power safer robots, richer virtual worlds, and faster creative workflows.
Anchor: With good motion, a robot can actually lower a lamp head or fold fabric the way a person would expect.
02 Core Idea
Hook: You know how a dance choreographer watches the whole stage and guides every dancer so they move beautifully together?
The Concept: The key insight is to use a powerful video model as the "choreographer" and a carefully designed 4D motion representation as the "dancers," then iteratively refine the motion so the video model approves what it sees from any angle. How it works:
- Represent each object's motion with a space-time hierarchy (coarse-to-fine in space, and a Fenwick tree in time).
- Render short videos from the 3D motion.
- Ask a modern Rectified Flow video model for guidance (a fix-it signal) using a new SDS-like rule that works with its architecture.
- Apply spatial and temporal smoothness regularizers.
- Repeat until the motion looks right.
Why it matters: This bridges 2D video wisdom and 4D object motion, without needing category-specific rigs or rare 4D datasets.
Anchor: The laptop closing looks right because the model guides the lid to rotate around its hinge while the base stays on the table.
Multiple analogies:
- Orchestra analogy: The video model is the conductor; each object is a section (strings, brass). The hierarchical motion representation lets sections play big themes (coarse) and subtle notes (fine), while the Fenwick tree keeps everyone on tempo over time.
- Puppet analogy: Coarse control points are big strings for body parts; fine control points are tiny strings for fingers and fabric wrinkles. The video model tells you if the puppet show looks believable.
- Recipe analogy: Start with broad flavors (coarse moves), then season to taste (fine tweaks). The Fenwick tree is your timeline that ensures each step builds smoothly on the last.
Before vs After:
- Before: Rely on object-specific rigs or narrow data; get stiff or unrealistic interactions, or stay limited to single objects.
- After: Use a universal choreographer (video model) plus a stable 4D representation to animate many objects interacting naturally from just a prompt.
Why it works (intuition without equations):
- The video model has seen countless patterns of motion. If your rendered video looks wrong, it nudges you toward a better direction.
- Spatial hierarchy prevents the system from making messy, jittery micro-moves before the big motion is right.
- The Fenwick tree shares motion information between nearby frames, avoiding sudden jumps and helping long sequences.
- A tailored SDS for Rectified Flow turns the model's training objective into a usable guidance signal.
- Smoothness regularizers act like gentle rules: no teleporting, no rubbery distortions.
Building blocks (with sandwiches):
- Hook: Picture sprinkling noise on a photo and then cleaning it up step by step until it looks clear again. The Concept: Diffusion models learn how to turn noise into images or videos by reversing the noising process. How it works:
- Add noise to data during training.
- Learn to remove it in small steps.
- At test time, start from noise and denoise.
Why it matters: These models are great at understanding what realistic frames look like.
Anchor: They can generate a video of a basketball bouncing that looks real.
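To make the noising-then-denoising idea concrete, here is a minimal sketch of the forward (noise-adding) step that diffusion training builds on. The `alphas_cumprod` schedule and the idea of a separate learned denoiser are standard diffusion ingredients, not code from the paper.

```python
import numpy as np

def add_noise(x0, t, alphas_cumprod):
    """Forward diffusion: blend clean data x0 with Gaussian noise at step t."""
    noise = np.random.randn(*x0.shape)
    a_bar = alphas_cumprod[t]
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * noise
    return x_t, noise

# Training teaches a network to predict `noise` from `x_t`; at test time the
# learned denoiser is applied step by step, starting from pure noise.
```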
- Hook: Imagine a river current that pushes leaves downstream along smooth paths. The Concept: Rectified Flow models learn a velocity field that moves noisy data toward clean data in one go. How it works:
- Learn a direction (velocity) at different noise levels.
- Follow the flow to get realistic samples.
Why it matters: Many state-of-the-art video generators use this design, so guidance must match it.
Anchor: The model says which way to push each frame so the motion becomes plausible.
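A toy illustration of how sampling follows that learned velocity field; `velocity_model` is a hypothetical stand-in for the trained network, and the straight-line interpolation with Euler integration shown here is the standard Rectified Flow recipe rather than the paper's code.

```python
import numpy as np

def rf_sample(velocity_model, shape, num_steps=20):
    """Integrate the learned velocity field from pure noise (t=1) back to data (t=0)."""
    x = np.random.randn(*shape)      # start from noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt             # current noise level in [0, 1]
        v = velocity_model(x, t)     # predicted velocity (roughly: noise minus clean sample)
        x = x - v * dt               # Euler step along the flow toward the data
    return x
```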
- Hook: If a teacher slightly nudges your drawing again and again, you'll improve the picture without rewriting everything. The Concept: Score Distillation Sampling (SDS) turns a generative model's feedback into gradients that improve a 3D (or 4D) scene. How it works:
- Render your current guess as images/videos.
- Add noise and ask the model to denoise.
- Use the difference as a "fix-it" signal.
Why it matters: It lets us learn from a powerful model without retraining it.
Anchor: If your ball's bounce looks wrong, SDS tells you how to tweak the motion.
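A minimal sketch of the classic SDS update for a diffusion-based critic, assuming a hypothetical `denoiser(x_t, t, prompt_emb)` that predicts the added noise; the gap between predicted and true noise is pushed back through the differentiable renderer as the "fix-it" gradient.

```python
import torch

def sds_backward(rendered, denoiser, t, alphas_cumprod, prompt_emb, weight=1.0):
    """Classic SDS: noise the render, ask the critic to denoise, backprop the gap."""
    noise = torch.randn_like(rendered)
    a_bar = alphas_cumprod[t]
    x_t = a_bar ** 0.5 * rendered + (1 - a_bar) ** 0.5 * noise  # noised render
    with torch.no_grad():
        noise_pred = denoiser(x_t, t, prompt_emb)               # critic's guess
    grad = weight * (noise_pred - noise)                        # fix-it direction
    rendered.backward(gradient=grad)                            # flows into scene/motion params
```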
- Hook: To move a puppet, you first place big joints (shoulder, elbow) then fine-tune fingers. The Concept: A hierarchical control-point representation uses coarse control points for big moves and fine control points for details. How it works:
- Coarse stage: align the big chunks of motion.
- Fine stage: add local deformations.
Why it matters: Prevents early overfitting to noise and yields crisp details later.
Anchor: First, the laptop lid swings; then the edges align neatly.
- Hook: Think of a tidy bookshelf where each shelf covers a time range, and together they describe the whole story. The Concept: A Fenwick tree is a time structure that stores cumulative motion over overlapping frame ranges. How it works:
- Each node summarizes a time span.
- To get motion at a frame, combine a few nodes.
Why it matters: Nearby frames share info, so motion stays smooth and stable over long sequences.
Anchor: Steps 6 and 7 reuse many of the same time chunks, so the move continues naturally.
- Hook: When you learn to draw, you start with big shapes and slowly refine, not all at once. The Concept: Coarse-to-fine plus an annealed noise schedule makes big motions form early and details come later. How it works:
- Start at higher noise to explore large changes.
- Decrease noise over time to refine details safely.
Why it matters: Avoids floaty or chaotic motion and lands on realistic animation.
Anchor: The cushion first dents broadly, then shows a neat paw imprint.
03 Methodology
High-level recipe: Input (static 3D scene + text prompt) → 4D motion proposal → Render short videos from random views → Get guidance from a Rectified Flow video model (SDS for RF) → Update motion with spatial/temporal regularizers → Repeat → Output (a smooth 4D animation of all objects).
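In pseudocode, that recipe is one optimization loop. Every name below (`apply_motion`, `render_video`, `rf_sds_guidance`, and so on) is a placeholder for the components described in the following steps, not the authors' actual API.

```python
import torch

def chord_optimize(motion_params, gaussians, prompt_emb, num_iters=2000):
    """Sketch of the outer loop: propose motion, render, get video-model feedback, update."""
    opt = torch.optim.Adam(motion_params.parameters(), lr=1e-3)
    for it in range(num_iters):
        cam = sample_random_camera()                     # random viewpoint for 360-degree consistency
        frames = apply_motion(gaussians, motion_params)  # space-time hierarchy -> per-frame Gaussians
        video = render_video(frames, cam)                # short clip from this viewpoint
        loss = rf_sds_guidance(video, prompt_emb, it)    # SDS-style feedback from the RF video model
        loss = loss + temporal_reg(frames) + spatial_reg(frames)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return motion_params
```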
Step 0: Prepare the geometry with 3D Gaussian Splatting (3D-GS). Hook: Imagine painting each object with many tiny blurry dots that know where they live in 3D. The Concept: 3D-GS represents shapes as lots of 3D Gaussians that render fast and give smooth gradients for learning. How it works:
- Convert meshes to Gaussians (positions, sizes, colors).
- This makes rendering and optimization stable and efficient.
Why it matters: We can smoothly adjust motion because the representation is differentiable and fast.
Anchor: A laptop is a cloud of tiny ellipsoids that render to a solid-looking shape and can be gently moved.
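A minimal sketch of what such a Gaussian cloud might look like as a data container, with one small isotropic Gaussian seeded per surface sample. The attribute layout is a common 3D-GS convention, and the field names here are illustrative rather than taken from the paper.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class GaussianCloud:
    positions: np.ndarray   # (N, 3) centers of the 3D Gaussians
    scales: np.ndarray      # (N, 3) per-axis extents
    rotations: np.ndarray   # (N, 4) orientation quaternions
    colors: np.ndarray      # (N, 3) RGB
    opacities: np.ndarray   # (N,)  alpha values

def gaussians_from_surface_points(points, colors, scale=0.01):
    """Seed one small isotropic Gaussian per point sampled from a mesh surface."""
    n = points.shape[0]
    return GaussianCloud(
        positions=points,
        scales=np.full((n, 3), scale),
        rotations=np.tile([1.0, 0.0, 0.0, 0.0], (n, 1)),  # identity quaternion
        colors=colors,
        opacities=np.ones(n),
    )
```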
Step 1: Build a hierarchical spatial motion representation with control points. Hook: You know how a marionette has a few main strings for big moves, plus smaller strings for delicate actions? The Concept: Control points are spatial anchors that influence nearby Gaussians; we use two layers: coarse (big moves) and fine (details). How it works:
- Coarse stage: optimize a sparse set of big controllers to get the main motion right.
- Fine stage: add more controllers that only tweak local details (as residuals).
- Blend neighboring controllers smoothly (like linear blend skinning) so nothing snaps.
Why it matters: Tackles high-dimensional motion safely: avoid early jitter, then capture precise deformations.
Anchor: First the lid swings down; later the edges and corners align neatly without wobble.
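A simplified sketch of the blending idea: each Gaussian is moved by a distance-weighted mix of its nearest control points, first with the coarse layer and then with fine residuals on top. Real linear blend skinning would use full rotations and learned weights; this translation-only version, with hypothetical names, only shows the mechanism.

```python
import numpy as np

def blend_control_points(points, ctrl_pos, ctrl_offsets, k=4):
    """Move each point by a normalized inverse-distance mix of its k nearest controllers."""
    d = np.linalg.norm(points[:, None, :] - ctrl_pos[None, :, :], axis=-1)  # (N, C) distances
    idx = np.argsort(d, axis=1)[:, :k]                                      # k nearest controllers
    w = 1.0 / (np.take_along_axis(d, idx, axis=1) + 1e-6)
    w /= w.sum(axis=1, keepdims=True)                                       # blend weights
    return points + (w[..., None] * ctrl_offsets[idx]).sum(axis=1)

# Coarse-to-fine: apply the sparse coarse controllers first, then fine residuals on the result.
# deformed = blend_control_points(points, coarse_pos, coarse_offsets)
# deformed = blend_control_points(deformed, fine_pos, fine_offsets)
```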
Step 2: Add a temporal hierarchy with a Fenwick tree. Hook: Think of describing a long dance by chunks: steps 1-2, 3-4, 1-4, and so on. You can rebuild any step from a few chunks. The Concept: A Fenwick tree stores cumulative motion over overlapping time ranges so nearby frames share parameters. How it works:
- Each control point keeps a set of time-chunk motions.
- To get frame t's motion, combine a small set of chunks.
- This enforces smoothness and helps later frames learn.
Why it matters: Prevents drift and flicker, especially in long sequences where independent frames would fall apart.
Anchor: The paw presses a cushion over several frames; each frame reuses overlapping chunks, so the dent deepens smoothly.
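For intuition, here is a classic Fenwick (binary indexed) tree specialized to per-frame motion: each node stores a chunk covering a power-of-two range of frames, and the cumulative motion at frame t combines O(log T) chunks. In CHORD the chunk values would be learned parameters per control point; this standalone sketch only demonstrates the data structure.

```python
import numpy as np

class FenwickMotion:
    """Cumulative motion stored over power-of-two frame ranges (binary indexed tree)."""

    def __init__(self, num_frames, dim=3):
        self.tree = np.zeros((num_frames + 1, dim))  # node i covers a range ending at frame i

    def add_increment(self, frame, delta):
        """Add a motion increment at `frame`; it contributes to that frame and all later ones."""
        i = frame
        while i < len(self.tree):
            self.tree[i] += delta
            i += i & (-i)        # jump to the next node covering this frame

    def motion_at(self, frame):
        """Combine O(log T) overlapping chunks to get the cumulative motion at a frame."""
        total, i = np.zeros(self.tree.shape[1]), frame
        while i > 0:
            total += self.tree[i]
            i -= i & (-i)        # drop the lowest set bit to reach the parent chunk
        return total
```

For example, `motion_at(7)` combines the chunks at nodes 7 (frame 7), 6 (frames 5-6), and 4 (frames 1-4), while frame 6 reuses nodes 6 and 4, so neighboring frames share most of their chunks and the motion stays continuous.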
Step 3: Get guidance from a Rectified Flow video model using an SDS-style update compatible with RF. Hook: It's like showing your rehearsal video to a master coach who marks what to fix. The Concept: SDS for RF turns the video model's velocity predictions into gradients that improve your 4D motion. How it works:
- Render a video from random views and encode it.
- Mix in noise at a chosen level.
- The RF model predicts a velocity (how to move toward realism/prompt alignment).
- Convert that into a gradient that updates the 4D motion.
Why it matters: Modern video models use RF; aligning SDS with RF makes their feedback usable and strong.
Anchor: If your laptop looks like it's floating, the guidance will push you to keep its base stable while rotating the lid.
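A hedged sketch of what an RF-compatible SDS step could look like: the rendered video latent is mixed with noise along the RF interpolation path, the model predicts a velocity, and the gap to the path's target velocity is backpropagated through the encoder and renderer into the motion parameters. `rf_model` and `prompt_emb` are placeholders, and the paper's exact weighting and timestep handling may differ.

```python
import torch

def rf_sds_backward(latent, rf_model, prompt_emb, t, weight=1.0):
    """SDS adapted to Rectified Flow: penalize the gap between predicted and target velocity."""
    noise = torch.randn_like(latent)
    x_t = (1.0 - t) * latent + t * noise           # RF interpolation between data and noise
    target_v = noise - latent                      # target velocity along this straight path
    with torch.no_grad():
        pred_v = rf_model(x_t, t, prompt_emb)      # video model's velocity prediction
    grad = weight * (pred_v - target_v).detach()   # fix-it direction for the latent
    latent.backward(gradient=grad)                 # flows back through VAE/renderer into motion
```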
Step 4: Use a smart noise schedule (annealing) and coarse-to-fine timing. Hook: Start with big brush strokes, then refine with a thin brush. The Concept: Sample higher noise early to allow big changes; gradually lower noise to refine details. How it works:
- Early: optimize only coarse control points at higher noise.
- Later: introduce fine control points as noise decreases.
Why it matters: Big motion forms safely first; details don't get scrambled by early noise.
Anchor: A sealion's body arc forms early; the gentle nudge of a ball appears later.
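One simple way to realize this, shown as an illustrative assumption rather than the paper's exact schedule: anneal the sampled noise level linearly over iterations and switch the fine control points on only after a fraction of the run.

```python
def annealed_noise_level(it, num_iters, t_max=0.98, t_min=0.02):
    """Linearly anneal the sampled noise level from high (coarse) to low (fine)."""
    frac = it / max(num_iters - 1, 1)
    return t_max + frac * (t_min - t_max)

def fine_controls_enabled(it, num_iters, switch_frac=0.5):
    """Turn on the fine control points only after the coarse motion has settled."""
    return it >= switch_frac * num_iters
```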
Step 5: Enforce smoothness with temporal and spatial regularizers (a minimal sketch of both terms follows this step).
- Temporal regularization. Hook: In a good flipbook, each page looks like the last page moved a tiny bit, not a big jump. The Concept: Temporal regularization uses a 3D flow map to penalize sudden, jittery changes over time. How it works:
- Compute how each point moves from frame t to t+1.
- Encourage small, consistent changes unless the scene truly needs big motion.
Why it matters: Prevents flicker and jumpy deformations.
Anchor: A cat's tail sways smoothly instead of teleporting.
- Spatial regularization. Hook: Bend a piece of cardboard gently: nearby parts move together; only so much stretch is allowed. The Concept: Spatial regularization (ARAP-style) keeps local neighborhoods moving almost rigidly to avoid weird warps. How it works:
- Sample points near the surface.
- Encourage local motion to be as-rigid-as-possible while still allowing needed bends.
Why it matters: Avoids rubbery artifacts and keeps shapes believable.
Anchor: The laptop lid stays stiff while rotating at its hinge, not melting like taffy.
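A minimal sketch of both penalties under simplifying assumptions: the temporal term penalizes frame-to-frame acceleration of tracked points, and the spatial term is a distance-preserving stand-in for full ARAP (which would also fit local rotations). Shapes and names are illustrative.

```python
import torch

def temporal_smoothness(traj):
    """Penalize frame-to-frame acceleration; traj has shape (T, N, 3)."""
    vel = traj[1:] - traj[:-1]        # per-frame 3D flow of each tracked point
    acc = vel[1:] - vel[:-1]          # changes in that flow (jerky motion)
    return (acc ** 2).mean()

def arap_like_rigidity(rest, deformed, neighbors):
    """Distances to sampled neighbors should be preserved; rest/deformed are (N, 3), neighbors (N, K)."""
    d_rest = (rest[:, None, :] - rest[neighbors]).norm(dim=-1)
    d_def = (deformed[:, None, :] - deformed[neighbors]).norm(dim=-1)
    return ((d_def - d_rest) ** 2).mean()
```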
Step 6: Iterate until convergence.
- Randomize camera viewpoints to enforce 360° consistency.
- Keep updating motion with RF-SDS guidance and regularization.
- Optionally chain segments to make long-horizon motion.
Concrete mini-example (cat on cushion):
- Input: cat mesh, cushion mesh, prompt: "A cat steps on a cushion."
- Convert to 3D-GS and place control points.
- Early iterations (high noise): coarse points make the cat shift weight; cushion starts to dent.
- Mid iterations: Fenwick tree keeps the paw-down sequence smooth; the video model adjusts timing.
- Later iterations (low noise): fine points add paw and fabric detail; temporal/spatial regularizers prevent flicker/distortion.
- Output: from any camera, the paw press looks natural and consistent.
The secret sauce:
- Space-time hierarchy (coarse-to-fine control points, Fenwick tree) matches how real motion is structured.
- RF-compatible SDS makes modern video models truly useful as choreographers.
- Smoothness regularizers keep everything stable, even with noisy guidance.
04 Experiments & Results
The test: Can CHORD animate multi-object scenes so they follow text prompts and look physically sensible, across different categories? The authors used diverse scenes, like "a man petting a dog," "a cat stepping on a cushion," "a sealion nudging a ball," "a block falling on a trampoline," "two men shaking hands," and "a robot picking up a block."
The competition (baselines):
- Animate3D: makes multi-view videos first, then reconstructs 4D.
- AnimateAnyMesh: directly predicts deformations with a pretrained Rectified Flow network.
- MotionDreamer: generates a video then fits mesh motion to match diffusion features.
- TrajectoryCrafter + 4D reconstruction: creates multiple camera paths and reconstructs 4D from them.
What was measured and why:
- Prompt Alignment (user study): Does the motion do what the text asked?
- Motion Realism (user study): Does it look natural and believable?
- Semantic Adherence (VideoPhy-2): Is the action consistent with the prompt content?
- Physical Commonsense (VideoPhy-2): Does it follow everyday physics?
The scoreboard (with context):
- User studies (99 people) strongly preferred CHORD: about 87.7% chose CHORD for prompt alignment and about 87.4% for realism, like getting an A+ when most others got around a B- or lower.
- VideoPhy-2 scores: CHORD was best on Semantic Adherence and second-best on Physical Commonsense. One baseline sometimes scored high on physics by barely moving (staying still can look physically plausible), but it then failed to follow the prompt.
Qualitative takeaways:
- CHORD followed prompts closely (e.g., lids closing, paws pressing, hands shaking) and produced smooth, consistent motion from multiple views.
- Baselines often had artifacts: floating objects, mismatched timing, or inconsistent motion from different camera paths.
Surprising findings:
- Noise-level sampling matters: Uniform noise sampling led to odd artifacts (like a laptop floating); the tailored annealed schedule produced much more realistic motion.
- Representation ablations:
- Without the Fenwick tree, later frames broke down (hard to learn long sequences).
- Without fine control points, details like grasping were missing.
- Without coarse control first, early noise caused distortions.
- Regularization ablations:
- Removing temporal regularization led to flicker.
- Removing spatial regularization led to rubbery bends or distortions.
Extensions and demos:
- Long-horizon motion by chaining segments.
- Real-world object animation from scans (robust to real/synthetic gap).
- Robot manipulation: CHORD's dense object flow guided a real robot to pick and place, close lids, lower lamps, and fold fabric (rigid, articulated, and deformable cases), using a grasp planner and a motion optimizer that matched end-effector moves to the generated flows.
Bottom line: Compared to four strong approaches, CHORD produced motions that people overwhelmingly judged as better aligned with prompts and more realistic, while also enabling real robot behaviors.
05 Discussion & Limitations
Limitations:
- No new objects appear mid-scene: CHORD deforms what exists at the start; it can't conjure new geometry (e.g., liquid pouring that wasn't modeled initially).
- Dependent on the video model: If the video model can't imagine the action well, its guidance misleads the motion (garbage in, garbage out).
- Training time: Backprop through the video VAE is costly; runs can take many hours on a high-end GPU.
Required resources:
- A capable Rectified Flow video generator (e.g., Wan 2.2) and its VAE.
- GPU memory and time for rendering and optimization (hours per scene).
- 3D meshes or scans to initialize scenes; optional grasp/motion planning tools for robotics demos.
When not to use:
- When the prompt needs new objects to appear (smoke, liquid, a new tool entering) that don't exist in the initial scene.
- When precise physics constraints or simulations are mandatory (e.g., engineering-grade contact forces) beyond plausibility.
- When extremely fast turnaround is needed and long optimization is impractical.
Open questions:
- Can we generate or insert new objects during motion (e.g., emitters for liquids/cloth splits) while keeping the pipeline stable?
- Can we avoid backprop through the VAE to speed training dramatically?
- How to integrate lightweight physics priors without losing universality or scalability?
- Can we provide uncertainty estimates for motion to help robots plan safer interactions?
- How to co-train or adapt the video choreographer to better understand rare actions or corner cases?
06 Conclusion & Future Work
3-sentence summary: CHORD turns powerful video models into choreographers that guide a stable, hierarchical 4D motion representation, producing natural multi-object animations from a simple text prompt. A spatial coarse-to-fine control-point system and a temporal Fenwick tree keep motion smooth and learnable over long sequences, while a new RF-compatible SDS and smoothness regularizers make modern video guidance effective. The result outperforms strong baselines in user studies, works on real scans, and even drives zero-shot robot manipulation.
Main achievement: Unifying a universal choreographer (video model) with a carefully engineered 4D representation (space-time hierarchy) and a matching guidance rule (SDS for Rectified Flow) to reliably generate category-agnostic multi-object 4D motion.
Future directions:
- Add the ability to create new objects mid-scene (e.g., liquids, tools, particles).
- Greatly speed optimization by avoiding VAE backprop or using learned surrogates.
- Light physics priors for contact and deformation could improve realism without heavy simulation.
- Better uncertainty handling and safety for robotic execution.
Why remember this: CHORD shows a practical, scalable path to animate the 3D world by listening to video models (no special rigs, no rare datasets), unlocking richer AR/VR, faster creative pipelines, and more capable robots in everyday environments.
Practical Applications
- Text-to-animation for multi-object 3D scenes (rapid prototyping for film, games, and AR/VR).
- Animating scanned real-world objects for digital twins and scene previews.
- Zero-shot robot manipulation guidance for pick-and-place, folding fabric, or adjusting articulated parts.
- Previsualization of product interactions (e.g., lids, hinges, buttons) without manual rigging.
- Educational simulations that show cause-and-effect (e.g., seesaw dynamics, bounces) from simple prompts.
- Rapid motion ideation for designers: generate multiple plausible action variants of the same scene.
- Augmented reality experiences where virtual and scanned objects move and interact believably.
- Generative storyboard creation: preview interactions from many camera angles consistently.
- Game modding tools: bring static assets to life with prompt-driven behaviors.
- Assistive content creation for accessibility, letting users describe desired actions in plain language.