
Dream2Flow: Bridging Video Generation and Open-World Manipulation with 3D Object Flow

Intermediate
Karthik Dharmarajan, Wenlong Huang, Jiajun Wu et al. · 12/31/2025
arXiv · PDF

Key Summary

  • Dream2Flow lets a robot watch a short, AI-generated video of a task and then do that task in real life by following object motion in 3D.
  • The key idea is to focus on what the object should do (its path through space) instead of copying how a human’s hand moves.
  • It builds a 3D “object flow” by combining video frames, depth estimation, object masks, and point tracking.
  • Robots then turn this 3D object flow into actions using planning or reinforcement learning—no task-specific demos needed.
  • This approach works on many object types: rigid (blocks), articulated (doors, ovens), deformable (scarves), and granular (pasta).
  • In real-world tests, Dream2Flow beat methods that only track rigid transforms or optical flow from video.
  • It learned policies with a new reward made directly from 3D object flow, performing as well as handcrafted rewards across robot bodies.
  • Results depend on the video generator: some models make better trajectories in simulation, others in the real world.
  • Main bottlenecks are occasional video artifacts, tracking dropouts from occlusion, grasp selection mismatches, and runtime (3–11 minutes).
  • Overall, 3D object flow is a simple, general bridge that connects video imagination to real robot manipulation.

Why This Research Matters

This work shows a practical way to turn short, AI-made videos into real robot actions without special training for each new task. It boosts robots’ ability to understand open-ended language and act in new homes, kitchens, and offices. By focusing on the object’s 3D motion, it works across rigid, articulated, deformable, and even granular items. It can train policies using a reward made directly from object flow, reducing hand-engineered supervision. As video models and tracking get better, this bridge will only get stronger, making helpful, versatile home and workplace robots more realistic.

Detailed Explanation


01Background & Problem Definition

🍞 Imagine you want to teach a friend how to open a toaster oven, but instead of explaining every finger and wrist move, you just show a short video where the oven door goes from closed to open. Your friend focuses on the door’s motion, not your exact hand movements. That’s the spirit of this paper.

🥬 The Concept (Generative Models and Video Generation Models):

  • What it is: A generative video model is an AI that can create a realistic video from a single image and a text instruction, like “Open the oven.”
  • How it works (recipe):
    1. Start with a picture of the scene.
    2. Read a short instruction.
    3. Predict a sequence of future images (frames) that look like the task is being done.
    4. Stitch those frames into a short video.
  • Why it matters: Without this, a robot has no “movie in its head” of what the task should look like; it lacks a visual plan of the world changing. 🍞 Anchor: When asked “Put bread in the bowl,” the model can show bread moving from the table into the bowl in that exact scene.

🍞 You know how robots are great at repeating the same factory move but can be confused by a new kitchen or a different bowl? That’s because real homes are messy and surprising.

🥬 The Concept (Robotic Manipulation in the Open World):

  • What it is: Robotic manipulation means moving and changing objects—pushing, grasping, opening, folding—in lots of different settings.
  • How it works (recipe):
    1. See the scene (camera).
    2. Decide what to change (goal).
    3. Plan and send motor commands.
    4. Adjust based on feedback.
  • Why it matters: Without solid manipulation in new places, robots struggle with everyday tasks like opening drawers or covering bowls. 🍞 Anchor: A home robot that can open a new brand of toaster oven without prior training is closer to being a helpful assistant.

🍞 Think of a soccer coach describing a move to two players: one with long legs, one with short legs. If the coach gives step-by-step leg angles, it won’t fit both bodies.

🥬 The Concept (Embodiment Gap):

  • What it is: The embodiment gap is the mismatch between how humans do actions (hands, arms) and how robots can do them (grippers, joints).
  • How it works (recipe):
    1. Video shows a human doing the task.
    2. Robot can’t copy that directly (different body).
    3. We need a shared description that both understand.
  • Why it matters: If we try to copy human motions exactly, robots fail because their bodies and action spaces are different. 🍞 Anchor: Instead of telling the robot “curl your finger,” we tell it “make the oven door rotate open by 60°,” which any robot can try to achieve.

🍞 Picture earlier attempts where we tried to teach robots by giving them either pure words (“Open the oven!”) or just final pictures of success. It’s like giving a destination with no map.

🥬 The Concept (Past Approaches and Their Limits):

  • What it is: Previous methods used language-to-action policies, optical flow or rigid pose tracking from videos, or goal images.
  • How it works (recipe):
    1. Map words straight to robot actions, or
    2. Track motion in 2D (optical flow) or rigid 6D poses, or
    3. Aim for an end picture (goal image).
  • Why it matters: Without a robust, object-centric 3D motion plan, robots often get confused by deformable objects, occlusions, or articulation. 🍞 Anchor: Relying only on rigid transforms struggles with cloth covering a bowl; the cloth bends, so a rigid-only description breaks.

🍞 Imagine needing a single “language” that both a video-maker AI and a robot can understand. Like using a map that works on phones, cars, and bikes.

🥬 The Concept (The Missing Piece—An Interface):

  • What it is: We need a mid-level representation that turns video imagination into robot-ready instructions.
  • How it works (recipe):
    1. Watch the generated video.
    2. Extract how the important object moves in 3D over time.
    3. Hand that motion to a planner or policy to execute.
  • Why it matters: Without this bridge, videos stay pretty pictures; with it, videos become action plans. 🍞 Anchor: A short clip showing “bread sliding into bowl” becomes a 3D path the robot can actually follow.

🍞 If a babysitter app could teach your home robot new chores by showing tiny clips, your mornings might be easier.

🥬 The Concept (Real Stakes):

  • What it is: Making robots follow object motion from videos lets them handle new instructions in new places with less training.
  • How it works (recipe):
    1. Use everyday language.
    2. Let a video model dream up a plausible outcome.
    3. Convert that dream into a 3D path.
    4. Execute safely and smoothly.
  • Why it matters: This reduces hand-crafted demos, speeds deployment, and boosts generalization to new objects and scenes. 🍞 Anchor: From “Open the drawer” to a robot smoothly pulling a random kitchen drawer—no task-specific demos required.

02Core Idea

🍞 You know how a GPS doesn’t tell you how to move your legs or hands; it just shows the route the car should take? That’s the breakthrough here.

🥬 The Concept (3D Object Flow):

  • What it is: 3D object flow is the path that key points on the task-relevant object take through 3D space over time.
  • How it works (recipe):
    1. Generate a task video from the initial scene and instruction.
    2. Mask the important object and track points on it across frames.
    3. Estimate depth to lift those tracked points into 3D.
    4. Assemble a time-ordered 3D trajectory (the flow).
  • Why it matters: Without 3D object flow, robots either try to imitate a human body they don’t have or rely on rigid-only assumptions that fail on cloth, doors, or messy scenes. 🍞 Anchor: For “Open oven,” the 3D flow shows the oven door’s edge moving out and down along a curved arc—the robot just follows that path.
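To make “3D object flow” concrete, here is a tiny Python sketch of how such a flow could be stored: one 3D position per tracked point per frame, plus a visibility flag. The class and names are illustrative assumptions, not the paper’s code.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class ObjectFlow3D:
    """Illustrative container for a 3D object flow: N tracked points over T frames."""
    points: np.ndarray   # (T, N, 3) positions in the robot/world frame, in meters
    visible: np.ndarray  # (T, N) bool; False when a point was occluded in the video

    def subgoal(self, t: int) -> np.ndarray:
        """Target positions at step t, keeping only points that are visible there."""
        return self.points[t][self.visible[t]]

# Toy example: 4 points drifting 10 cm along +x over 5 frames
T, N = 5, 4
pts = np.zeros((T, N, 3))
pts[..., 0] = np.linspace(0.0, 0.10, T)[:, None]
flow = ObjectFlow3D(points=pts, visible=np.ones((T, N), dtype=bool))
print(flow.subgoal(2).shape)  # (4, 3)
```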

Aha! Moment in one sentence: Separate what should happen to the object from how the robot’s body makes it happen.

Three analogies:

  1. Puppet strings: The flow is like the puppet’s path; any puppeteer (any robot) can move in its own way to match the puppet’s motion.
  2. Treasure map: The flow is the dotted line; whether you walk or crawl, you’re following the same route.
  3. Music sheet: The flow is the notes; different instruments (robots) can play them in their own style but reach the same melody (goal).

Before vs After:

  • Before: Robots tried to copy human hand motions from video or used brittle, rigid-only object poses.
  • After: Robots track an object’s 3D flow and choose their own body motions to realize it—works across rigid, articulated, deformable, and granular objects.

🍞 You’ve probably folded a scarf many ways to cover a bowl; the final scarf path is what matters, not your exact finger dance.

🥬 The Concept (Why It Works—Intuition):

  • What it is: Video models are surprisingly good at predicting physically sensible object motion, even if hands or textures look imperfect.
  • How it works (recipe):
    1. Use video generation as a world simulator to propose likely object motion for the task.
    2. Distill only the object’s 3D motion (ignore the exact human moves).
    3. Give that motion to planners or RL policies tailored to the robot’s body.
  • Why it matters: This “decoupling” avoids the embodiment gap and turns noisy videos into clean, actionable goals. 🍞 Anchor: Even if a generated hand looks a bit blurry, the oven door still moves along a reasonable hinge arc—that arc is all the robot needs.

🍞 Think of building a LEGO set: you follow sub-steps (bag 1, bag 2). Here are the building blocks of Dream2Flow.

🥬 The Concept (Building Blocks):

  • What it is: The pipeline components are: video generation, object masking, point tracking, depth estimation with scale alignment, 3D lifting, then planning/control.
  • How it works (recipe):
    1. Generate video from the scene image and instruction.
    2. Detect and mask the task object (Grounding DINO + SAM 2).
    3. Track object pixels across frames (CoTracker3) and estimate per-frame depth (SpatialTrackerV2).
    4. Lift 2D tracks to 3D using camera intrinsics/extrinsics and calibrate depth scale.
    5. Get a 3D object flow trajectory P1:T.
    6. Feed P1:T to a planner or RL policy to produce robot actions.
  • Why it matters: Each piece removes ambiguity; together, they turn a dreamed video into a precise 3D motion plan. 🍞 Anchor: For “Put bread in bowl,” mask the bread, track its pixels, lift to 3D using depth, then have the robot move the bread along that 3D path into the bowl.
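To see how these building blocks connect, here is a structural sketch in Python. Every step is passed in as a callable because the real components (the video generator, Grounding DINO, SAM 2, CoTracker3, SpatialTrackerV2) are used off the shelf; the dictionary keys, function names, and signatures below are placeholders for this sketch, not real APIs.

```python
def dream2flow_pipeline(rgb, real_depth, instruction, K, T_robot_cam, models):
    """Structural sketch of the pipeline with hypothetical wrappers.

    `models` maps step names to callables wrapping the off-the-shelf components;
    none of the keys or signatures below are real library APIs.
    """
    frames = models["generate_video"](rgb, instruction)           # T frames imagining the task
    mask = models["segment_object"](rgb, instruction)             # Grounding DINO box -> SAM 2 mask
    tracks, visible = models["track_points"](frames, mask)        # (T, N, 2) pixel tracks + visibility
    video_depth = models["estimate_depth"](frames)                # (T, H, W), unknown metric scale
    s, b = models["align_scale"](video_depth[0], real_depth)      # calibrate against the real RGB-D frame
    flow_3d = models["lift_to_3d"](tracks, visible,
                                   s * video_depth + b, K, T_robot_cam)  # (T, N, 3) object flow
    return models["plan"](flow_3d)                                # trajectory optimization or RL policy
```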

03Methodology

At a high level: Instruction + RGB-D image → Generate video → Extract 3D object flow → Plan/learn actions → Low-level robot commands.

🍞 Imagine flipping through a flipbook that shows an object moving, then using a ruler to measure where it is in 3D each page. That’s the start of the recipe.

🥬 The Concept (Video Generation Models):

  • What it is: Off-the-shelf image-to-video models create a short clip of the task in the same scene.
  • How it works (recipe):
    1. Input the initial scene image and a text instruction.
    2. Produce T video frames that visualize the task.
    3. Keep camera still to help later depth estimation.
  • Why it matters: Without a plausible future clip, we have no guess for the object’s motion. 🍞 Anchor: “Open oven by one hand. The camera holds a still pose.” The produced video shows the door opening with minimal camera shake.

🍞 You know how sunglasses help you see depth on a 3D movie? Here we add depth to every video frame to understand distance.

🥬 The Concept (Video Depth Estimation and Scale Alignment):

  • What it is: Estimate depth for each generated frame and align its scale to the real camera’s first depth.
  • How it works (recipe):
    1. Use SpatialTrackerV2 to get per-frame depth (monocular depth has unknown scale).
    2. Align the first video depth to the real RGB-D depth via a global scale and shift (s*, b*).
    3. Apply this calibration to all frames for consistent 3D geometry.
  • Why it matters: Without correct scale, 3D positions would be too big/small, making robot targets wrong. 🍞 Anchor: If the bread looks twice as far as it truly is, the robot might reach into thin air; scale alignment prevents that.
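Here is a minimal sketch of the scale-alignment step, assuming a simple linear least-squares fit of a global scale s and shift b between the first generated frame’s monocular depth and the real RGB-D depth (the paper may use a more robust estimator; variable names are illustrative).

```python
import numpy as np

def align_depth_scale(video_depth0, real_depth, valid_mask):
    """Fit s, b so that s * video_depth0 + b ~= real_depth on valid pixels (least squares).

    video_depth0: (H, W) monocular depth of the first generated frame (unknown scale)
    real_depth:   (H, W) metric depth from the real RGB-D camera
    valid_mask:   (H, W) bool, pixels where the real depth is reliable
    """
    d_vid = video_depth0[valid_mask].ravel()
    d_real = real_depth[valid_mask].ravel()
    A = np.stack([d_vid, np.ones_like(d_vid)], axis=1)   # columns: [depth, 1]
    (s, b), *_ = np.linalg.lstsq(A, d_real, rcond=None)
    return s, b

# Usage: apply the same (s, b) to every frame of the generated video's depth
# metric_depth_t = s * video_depth_t + b
```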

🍞 Think of putting a sticker on the object you care about so you don’t lose it in a crowd.

🥬 The Concept (Object Masking and Point Tracking):

  • What it is: Find the task object, then track its pixels across the video.
  • How it works (recipe):
    1. Use Grounding DINO to locate the object from language and the initial image.
    2. Prompt SAM 2 to get a clean object mask.
    3. Sample n pixels on the mask and track them through frames with CoTracker3, yielding 2D point paths and visibilities.
  • Why it matters: Without robust object isolation and tracking, the 3D trajectory would mix background and object points. 🍞 Anchor: For “Put bread in bowl,” the mask stays on the bread slice, and tracking follows its corner and crust points.
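The detection, segmentation, and tracking models are used off the shelf, so the only part worth sketching here is the small glue step: sampling n pixels from the object mask to seed the point tracker. The function below is an illustrative assumption about how that seeding could be done, not the paper’s code.

```python
import numpy as np

def sample_query_points(mask, n_points=256, seed=0):
    """Sample up to n pixel coordinates uniformly from a binary object mask.

    mask: (H, W) bool array from the segmentation model (e.g., SAM 2).
    Returns an (n, 2) array of (x, y) pixel coordinates to seed the point tracker.
    """
    rng = np.random.default_rng(seed)
    ys, xs = np.nonzero(mask)                                   # pixels inside the mask
    idx = rng.choice(len(xs), size=min(n_points, len(xs)), replace=False)
    return np.stack([xs[idx], ys[idx]], axis=1)
```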

🍞 If you know the camera’s “glasses prescription” (intrinsics) and where it sits (extrinsics), you can map any pixel into real-world coordinates.

🥬 The Concept (3D Lifting via Camera Projection):

  • What it is: Convert 2D tracked points plus depth into 3D points in the robot’s frame.
  • How it works (recipe):
    1. For each visible tracked pixel, read its calibrated depth.
    2. Use camera intrinsics/extrinsics to back-project to 3D.
    3. Collect sequences of 3D points over time as the 3D object flow P1:T.
  • Why it matters: Robots operate in 3D; 2D motion alone isn’t enough to move hardware precisely. 🍞 Anchor: A pixel on the oven handle becomes a 3D dot that travels along a hinge arc the robot can follow.
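Here is a minimal pinhole-camera sketch of the lifting step: read each tracked pixel’s calibrated depth, back-project it with the intrinsics, and transform it into the robot frame with the extrinsics. The nearest-pixel depth lookup and variable names are simplifications of this sketch.

```python
import numpy as np

def backproject(tracks_2d, depth_maps, K, T_robot_cam):
    """Lift 2D tracks (T, N, 2) with per-frame metric depth (T, H, W) into 3D (T, N, 3).

    K: (3, 3) camera intrinsics; T_robot_cam: (4, 4) camera-to-robot transform.
    Minimal pinhole-model sketch: nearest-pixel depth lookup, no interpolation or bounds checks.
    """
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    T_frames, N, _ = tracks_2d.shape
    flow = np.zeros((T_frames, N, 3))
    for t in range(T_frames):
        u, v = tracks_2d[t, :, 0], tracks_2d[t, :, 1]
        z = depth_maps[t, v.astype(int), u.astype(int)]          # calibrated depth at each pixel
        x = (u - cx) * z / fx                                    # back-project to camera frame
        y = (v - cy) * z / fy
        p_cam = np.stack([x, y, z, np.ones_like(z)], axis=0)     # (4, N) homogeneous points
        flow[t] = (T_robot_cam @ p_cam)[:3].T                    # express in the robot frame
    return flow
```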

🍞 Suppose you have the object’s route; now the robot needs to drive its own body to make that route happen.

🥬 The Concept (Trajectory Optimization):

  • What it is: Choose robot actions that make the object’s current 3D points match the next flow targets, while keeping motion smooth and reachable.
  • How it works (recipe):
    1. Define a cost for deviation from the target 3D points at each time and a control cost for smoothness/reachability.
    2. Predict next states using a dynamics model.
    3. Minimize total cost over a short horizon to get actions.
  • Why it matters: Without optimization, the robot could jerk, exceed limits, or miss the object path. 🍞 Anchor: To open an oven, a planned sequence of end-effector poses keeps the grasped handle moving along the flow arc without collisions.
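The sketch below illustrates the idea with generic random shooting: sample candidate action sequences, roll them out with a given dynamics model, and keep the sequence whose predicted object points best follow the flow while paying a small control cost. The paper’s optimizer (e.g., PyRoki-based pose optimization) is more sophisticated; this is only a conceptual stand-in, and `dynamics` is an assumed callable.

```python
import numpy as np

def plan_actions(state, flow_targets, dynamics, horizon=5, n_samples=256, lam=0.1, seed=0):
    """Random-shooting sketch: keep the action sequence whose predicted object points
    stay closest to the upcoming flow targets, plus a small penalty on action size.

    dynamics(state, action) -> (next_state, object_points) is an assumed callable
    (a simulator or learned model). flow_targets: (horizon, N, 3) upcoming flow subgoals.
    """
    rng = np.random.default_rng(seed)
    best_cost, best_plan = np.inf, None
    for _ in range(n_samples):
        actions = rng.uniform(-0.02, 0.02, size=(horizon, 3))   # small end-effector deltas (meters)
        s, cost = state, 0.0
        for t in range(horizon):
            s, obj_pts = dynamics(s, actions[t])
            cost += np.mean(np.linalg.norm(obj_pts - flow_targets[t], axis=-1))  # flow deviation
            cost += lam * float(np.linalg.norm(actions[t]))                      # control cost
        if cost < best_cost:
            best_cost, best_plan = cost, actions
    return best_plan
```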

🍞 If practice makes perfect, then learning a policy from rewards is like coaching the robot with points for doing the right thing.

🥬 The Concept (Reinforcement Learning with 3D Object Flow Rewards):

  • What it is: Train a policy (e.g., SAC) to maximize a reward that measures how well the object matches the 3D flow.
  • How it works (recipe):
    1. Use the simulator as a dynamics model.
    2. Reward gets higher as current object points match the reference flow and the end-effector stays near the object.
    3. Learn a policy that generalizes across embodiments (e.g., arm, quadruped, humanoid arm).
  • Why it matters: Without a good reward, learned behaviors can be brittle or miss the goal. 🍞 Anchor: In door opening, the policy learns to rotate and pull so the door’s particles march through the flow, even using different strategies per robot body.
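A hedged sketch of what a flow-based reward could look like: one term for how closely the object’s current points match the reference flow at this step, and one term for keeping the end-effector near the object. The paper’s exact shaping may differ; the constants here are illustrative.

```python
import numpy as np

def flow_reward(obj_points, ee_pos, flow_targets, t, alpha=5.0, beta=0.5):
    """Illustrative flow-matching reward for RL.

    obj_points:   (N, 3) current 3D points on the object (e.g., simulator particles)
    ee_pos:       (3,) current end-effector position
    flow_targets: (T, N, 3) reference 3D object flow
    Term 1 rewards matching the flow at step t; term 2 rewards staying near the object.
    """
    match_dist = np.mean(np.linalg.norm(obj_points - flow_targets[t], axis=-1))
    reach_dist = float(np.min(np.linalg.norm(obj_points - ee_pos, axis=-1)))
    return np.exp(-alpha * match_dist) + beta * np.exp(-reach_dist)
```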

Secret Sauce details across domains:

  • Non-prehensile push (Push-T in simulation): Use a push skill primitive (start, direction, distance), then plan by random shooting with a learned particle dynamics model that predicts per-point motion. Choose the push that best advances toward a time-aligned subgoal in the flow.
  • Real-world grasp-and-move: Use AnyGrasp to propose grasps and HaMer to detect the video hand’s thumb; pick the grasp closest to the thumb (often on the correct part, e.g., handle). Assume rigid grasp for the grasped subset and optimize end-effector poses (PyRoki) with flow-following and smoothness costs.
  • Articulated opening (RL in simulation): Train SAC with a flow-based reward that measures particle-match progress and end-effector proximity; this yields effective but embodiment-specific strategies.
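For the non-prehensile push setting, the same random-shooting idea can be applied over discrete push primitives. The sketch below scores candidate pushes (start point, direction, distance) with an assumed learned particle-dynamics callable against a time-aligned subgoal of the flow; the subgoal indexing and primitive ranges are illustrative assumptions, not the paper’s exact scheme.

```python
import numpy as np

def choose_push(obj_points, flow_points, step, particle_dynamics,
                steps_per_subgoal=5, n_candidates=128, seed=0):
    """Pick a push primitive (start point, direction, distance) by random shooting.

    particle_dynamics(points, start, direction, distance) -> predicted per-point
    positions after the push (an assumed learned model, passed in as a callable).
    flow_points: (T, N, 3) reference flow; the subgoal is time-aligned to `step`.
    """
    rng = np.random.default_rng(seed)
    subgoal_idx = min(step // steps_per_subgoal + 1, len(flow_points) - 1)
    subgoal = flow_points[subgoal_idx]
    best_cost, best_push = np.inf, None
    for _ in range(n_candidates):
        start = obj_points[rng.integers(len(obj_points))]          # push from a point on the object
        theta = rng.uniform(0.0, 2.0 * np.pi)
        direction = np.array([np.cos(theta), np.sin(theta), 0.0])  # planar push direction
        distance = rng.uniform(0.01, 0.05)                         # 1 to 5 cm push
        pred = particle_dynamics(obj_points, start, direction, distance)
        cost = np.mean(np.linalg.norm(pred - subgoal, axis=-1))    # per-point distance to subgoal
        if cost < best_cost:
            best_cost, best_push = cost, (start, direction, distance)
    return best_push
```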

Why these steps exist and what breaks without them:

  • No scale alignment: robot reaches to wrong depth.
  • No object mask/tracks: background motion pollutes the plan.
  • No dynamics/optimization: jerky or unreachable motions.
  • No grasp selection: grabbing the wrong part ruins articulation.
  • No time-aligned subgoals: large rotations stall or time out.

Concrete data example:

  • Task: “Open oven.”
    1. Video shows oven door opening ~60°.
    2. Tracks on the door edge lift to 3D curve.
    3. AnyGrasp proposes grasps; thumb cue picks handle grasp.
    4. Optimizer plans poses so the handle follows the 3D arc, with smoothness and reachability penalties.
    5. Controller executes safe, smooth motions that open the door.

Secret Sauce in one line: Express the task as object-only 3D motion (what) and let planning or learning figure out the robot-specific body moves (how).

04Experiments & Results

🍞 Imagine testing a new recipe across different kitchens and tools to see if it still tastes great—that’s what these experiments do for Dream2Flow.

🥬 The Concept (The Test):

  • What it is: Evaluate if 3D object flow is a good bridge from video to robot control across object types and embodiments.
  • How it works (recipe):
    1. Measure success rates on several tasks (simulation and real world).
    2. Compare to alternative interfaces.
    3. Probe robustness to object instances, backgrounds, viewpoints.
    4. Test different video generators and dynamics models.
  • Why it matters: Without thorough testing, we can’t trust this as a general interface. 🍞 Anchor: Tasks include pushing a T-block, putting bread in a bowl, opening an oven, covering a bowl with a scarf, and opening a door.

Baselines compared: 🍞 Think of classmates who also took the test so you can see if your score is really impressive.

🥬 The Concept (AVDC and RIGVID Baselines):

  • What it is: Methods that turn videos into rigid object trajectories using optical flow (AVDC) or rigid pose transforms (RIGVID).
  • How it works (recipe):
    1. Track dense 2D correspondences and lift with depth (AVDC), or estimate 6D rigid transforms (RIGVID/adapted).
    2. Replay the rigid trajectory after grasp.
  • Why it matters: If the object is deformable or visibility is low, rigid-only estimates get noisy or wrong. 🍞 Anchor: For covering a bowl with a scarf, assuming rigid motion makes the plan crumble; cloth bends and folds.

Scoreboard with context:

  • Real Robot (10 trials each):
    • Put Bread in Bowl: Dream2Flow 8/10 vs AVDC 7/10 vs RIGVID 6/10. That’s like scoring an 80 when others get 60–70.
    • Open Oven: Dream2Flow 8/10 vs AVDC 0/10 vs RIGVID 6/10. Big jump—flow handles articulation better than rigid-only.
    • Cover Bowl (deformable): Dream2Flow 3/10 vs AVDC 2/10 vs RIGVID 1/10. Still tough, but flow is least brittle.
  • Simulation Push-T (100 trials): Using Wan2.1, 52/100 successes with particle dynamics, while other dynamics variants lag far behind (pose 12/100, heuristic 17/100). Per-point prediction really matters for rotation.
  • RL with Flow Rewards (Door Opening, 100 evals each): Flow-reward policies match handcrafted-state rewards across embodiments (Franka 100/100 vs 99/100, Spot 100/100 vs 99/100, GR1 94/100 vs 96/100). Flow-as-reward works!

Surprising findings:

  • Video model choice matters: In sim Push-T, Wan2.1 outperforms Kling 2.1 (52/100 vs 31/100). In real oven opening, Veo 3 excels (8/10), while others suffer from camera motion or wrong articulation axes.
  • Multiple tasks in the same scene: With different instructions, Dream2Flow adapts and targets different objects (e.g., bread vs mug vs donut), thanks to language-conditioned video.
  • Robustness: Performance holds under varied instances, backgrounds, and angles—with expected dips for large or out-of-distribution objects.

Failure analysis (real world, 60 trials total):

  • Video generation artifacts (morphing or hallucinated objects) cause about half the video failures; e.g., bread turns into crackers, or a new bowl appears.
  • Flow extraction dropouts occur with severe rotations or when objects go out of view.
  • Execution hiccups show up when grasp selection misses the correct part (e.g., scarf corner or handle alignment).

Takeaway: Compared to rigid-only video-to-trajectory methods, Dream2Flow’s 3D object flow travels better across rigid, articulated, deformable, and granular tasks, and even doubles as a strong RL reward.

05Discussion & Limitations

🍞 Think of this like a new bike: it rides great, but you should know the limits before taking it down a rocky mountain.

Limitations:

  • Rigid-grasp assumption in real robot planning: Works well for many tasks but limits fine deformable manipulation where the grasped part isn’t rigid.
  • Processing time: 3–11 minutes to generate video and extract 3D flow slows rapid iteration; video generation is the main bottleneck.
  • Occlusions: A single generated viewpoint struggles when hands or objects block each other heavily; long occlusions can break tracks.
  • Video artifacts: Morphing and hallucinated objects in generated clips can mislead tracking and planning.

Required resources:

  • RGB-D sensing for initial alignment, calibrated camera intrinsics/extrinsics.
  • Access to a strong image-to-video generator and vision foundation models (Grounding DINO, SAM 2, CoTracker3, SpatialTrackerV2).
  • A planner/optimizer (e.g., PyRoki) and/or an RL framework (e.g., SAC), plus dynamics models when needed.
  • Compute: a GPU for video generation, depth estimation, segmentation, and point tracking, plus policy training; a real robot with impedance control/IK for execution.

When not to use:

  • Safety-critical tasks where any video artifact could be dangerous (e.g., surgery robots) without strict safeguards.
  • Tiny, fast, or highly occluded objects where single-view tracking is unreliable.
  • Ultra time-sensitive settings (3–11 minute prep is too slow) unless pipelined/asynchronous.

Open questions:

  • How to make video generation more physically accurate and multi-view consistent to reduce artifacts?
  • Can we learn deformable/soft-body dynamics for real-world grasp-and-drape without the rigid-grasp assumption?
  • How to do robust multi-view or 4D tracking directly to survive occlusions?
  • Can the system actively pick grasps that best realize the flow (closed-loop with affordances)?
  • How to fuse language constraints (e.g., “don’t spill”) into flow-following costs for safer behavior?

Overall judgment: 3D object flow is a clean, scalable bridge from visual imagination to control. It’s not magic yet—occlusions and video artifacts remain—but it’s a big step toward robots that can learn new chores by watching short, task-specific videos.

06Conclusion & Future Work

Three-sentence summary: Dream2Flow turns a short, text-conditioned video into a 3D object flow—the path an object should take—then uses planning or reinforcement learning to make a robot follow that path. By separating what must happen to the object from how the robot body moves, it closes the embodiment gap without task-specific demonstrations. Experiments across rigid, articulated, deformable, and granular settings show consistent gains over rigid-only baselines and validate 3D object flow as a robust reward for policy learning.

Main achievement: Establishing 3D object flow as a simple, general interface that reliably translates video imagination into executable robot actions.

Future directions:

  • Speed up the pipeline (faster video generation, parallelized depth/flow) and add multi-view or 4D tracking to beat occlusions.
  • Go beyond rigid-grasp assumptions with real-world deformable dynamics and contact-rich skills.
  • Tighten integration with language to respect extra constraints and preferences (safe, tidy, energy-efficient).
  • Improve video model physics and articulation accuracy to reduce morphing/hallucination errors.

Why remember this: It reframes the problem—don’t mimic the human, track the object. That simple shift makes video models immediately useful for open-world manipulation and points the way to robots that learn new tasks from short, descriptive clips.

Practical Applications

  • Home assistance: Place food items into bowls or containers after a simple voice instruction.
  • Kitchen chores: Open ovens, drawers, and cabinets using object-flow-guided motions.
  • Laundry and linens: Drape or cover items (e.g., cover a bowl with a cloth) with better flow-aware strategies.
  • Tidying tasks: Sweep small debris (e.g., dried pasta) into bins by following flow over tools like brushes.
  • Recycling workflows: Pick and place cans into the correct bin after watching a generated guidance clip.
  • Warehouse picking: Move items to target totes by following flow from instruction-conditioned videos.
  • Hospitality setup: Arrange tableware or supplies by translating video-suggested object paths into actions.
  • Factory changeover: Adjust fixtures or levers based on flow extracted from instruction videos.
  • Education/training: Quickly prototype new robot skills from short, descriptive prompts and scene images.
  • Assistive robotics: Help users with mobility challenges by carrying out simple manipulation tasks guided by video.
#3D object flow #video generation for robotics #open-world manipulation #trajectory optimization #reinforcement learning #object-centric control #point tracking #monocular depth estimation #segment anything #embodiment gap #rigid grasp dynamics #particle dynamics model #language-conditioned planning #foundation vision models #robot policy learning