Future Optical Flow Prediction Improves Robot Control & Video Generation
Key Summary
- FOFPred is a new AI model that reads one or two images plus a short instruction like “move the bottle left to right,” and then predicts how every pixel will move in the next moments.
- It uses a single backbone that combines a Vision-Language Model (to understand the instruction and scene) with a Diffusion Transformer (to generate precise, pixel-level motion).
- To learn from messy web videos, the authors remove camera shake from the optical flow so the model learns true object motion, not just the phone moving.
- They turn optical flow into colorful, image-like pictures (via HSV mapping) so a pretrained image VAE and diffusion model can handle motion naturally.
- On robot tasks (CALVIN and RoboTwin), FOFPred beats strong baselines, finishing longer instruction chains and handling two-arm coordination better.
- For text-to-video, FOFPred first predicts motion, then a video model draws frames that follow that motion, giving more accurate, instruction-following movement than text-only generators.
- Ablations show three big wins: web pretraining on human videos helps; the unified VLM+diffusion backbone matters; and separating camera motion is critical.
- FOFPred can often produce good motion in as few as one diffusion step, hinting that motion fields are simpler than full-color images.
- Limits include sensitivity to wording (“move left” vs. “from right to left”), a large model size (~7B parameters), and compute needs for training and preprocessing.
- The takeaway: teach AI to predict future motion explicitly, in a language-aware way, and you get smarter robots and more controllable videos.
Why This Research Matters
Robots that understand “how things should move next” can follow everyday instructions more safely and precisely, from tidying a table to assisting in the kitchen. Video tools gain true motion control, letting creators describe movement (“swoop, then loop”) and see it realized faithfully. Training on web videos means we can scale learning without expensive robot data collection, and camera-motion compensation keeps the learning signal honest. The motion-first approach is also inspectable: you can check the flow map before acting or rendering, improving trust. Over time, this could lead to more dependable home assistants, better training simulators, and creative tools that do exactly what we describe.
Detailed Explanation
01 Background & Problem Definition
You know how when you watch a sports replay, you can tell where the ball will go next because you’ve seen how it was moving? That sense of motion helps you predict the future. Computers need that too.
Before this work, many robot controllers and video generators mostly used two kinds of signals: pictures (RGB frames) and words (text prompts). Vision-Language-Action models tried to understand, “What do I see?” and “What does the user want?” and then act or generate a video. But they often missed one key ingredient: a clean, explicit picture of how things should move in the future. Some tried to predict entire future frames, which mix up looks (colors, textures) with motion, making learning harder. Others predicted only a few point tracks (sparse motion), which can miss important details like how a whole hand or arm is moving.
The problem was clear: to control robots or create believable videos that follow instructions, we need reliable, dense future motion—how every pixel will shift—especially when the training videos are messy and filmed with moving cameras. Predicting that future motion is tricky: you must handle camera shake, understand the instruction, and generalize across many scenes and objects.
People tried a few approaches. Some forecast RGB frames, but the model wastes effort modeling static appearance (like background textures) instead of focusing on “where things will move.” Others used sparse trajectories (a few points), which are easier to learn but throw away much of the motion detail. Some recent VLM-based methods could connect language to motion, yet they usually predicted only a handful of tracks or needed heavy, hand-crafted corrections.
The missing piece was a way to teach a model to predict dense, future optical flow—from language and images—while learning from huge, noisy web videos. That requires two tricks: (1) clean motion targets that separate object motion from camera motion, and (2) a model that’s great at language reasoning and pixel-level generation. Also, if we could represent optical flow like images, we could reuse powerful, pretrained VAEs and diffusion backbones instead of training everything from scratch.
Why this matters in daily life: imagine telling a home robot, “slide the bowl to the right, then push it forward,” and it smoothly does both steps because it pictures the future motion first. Or asking a video tool, “make the kite swoop down then loop,” and getting a clip that follows your motion exactly, not just vaguely. Better motion prediction means safer, more precise robots, clearer teaching from humans to machines, and video tools that do what you describe, not what they guess. It can speed up training (motion is a simpler target than whole images), work better with less robot data (by learning from human web videos), and make results more trustworthy (because you can inspect the predicted motion field before acting or generating frames).
02 Core Idea
🍞 Top Bread (Hook): Imagine giving directions to a friend on a bike: “Go left, then forward.” It helps if your friend can picture the path before moving.
🥬 The Concept: FOFPred teaches AI to picture the path of every pixel—the future optical flow—based on an image (or two) and a short instruction.
- How it works (big picture):
- Understand the instruction and scene with a Vision-Language Model.
- Use a Diffusion Transformer to generate colorful “motion pictures” that show how every pixel will move next.
- Clean the training signal by removing camera shake so the model learns true object motion.
- Why it matters: If you can foresee motion, you can control robots precisely and make videos that actually follow the motion you asked for. 🍞 Bottom Bread (Anchor): Tell the system “move the bottle from left to right,” and it draws a future flow map with arrows for every pixel, which a robot can then turn into correct hand motions—or a video model can turn into frames.
Now, let’s explain the key building blocks using the Sandwich pattern, in the best learning order.
- Optical Flow 🍞 Hook: You know how wind maps show arrows for which way the air moves? Optical flow is like a wind map for pixels. 🥬 Concept: Optical flow is a picture that shows how every pixel moves between two frames.
- How it works:
- Take two frames (now and a bit later).
- Match where each pixel went.
- Draw a tiny arrow per pixel for direction and speed.
- Why it matters: It separates motion from appearance, focusing on “where things moved,” not “what color they are.” 🍞 Anchor: If a ball rolls right, optical flow arrows around the ball point right.
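A minimal sketch of how such a per-pixel “wind map” can be computed between two frames, using OpenCV's classical Farneback estimator as a stand-in for the learned estimators (e.g., RAFT) mentioned later; the frame file names are placeholders.

```python
import cv2

# Load two consecutive frames (placeholder paths) and convert to grayscale.
prev = cv2.cvtColor(cv2.imread("frame_t0.png"), cv2.COLOR_BGR2GRAY)
curr = cv2.cvtColor(cv2.imread("frame_t1.png"), cv2.COLOR_BGR2GRAY)

# Dense optical flow: one (dx, dy) vector per pixel, shape (H, W, 2).
# Positional args: pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags.
flow = cv2.calcOpticalFlowFarneback(prev, curr, None, 0.5, 3, 15, 3, 5, 1.2, 0)

# Direction (angle) and speed (magnitude) of each per-pixel "arrow".
mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
print("mean speed (pixels/frame):", float(mag.mean()))
```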
- Future Optical Flow 🍞 Hook: Imagine predicting where a thrown paper plane will go next just by seeing it now. 🥬 Concept: Future optical flow predicts how pixels will move in the next moments, using the current view (and sometimes the previous one) plus a hint in words.
- How it works:
- Look at the scene now (and maybe just before).
- Read the instruction (e.g., “push forward”).
- Forecast the arrows for the next steps.
- Why it matters: Robots and videos need what happens next, not what already happened. 🍞 Anchor: “Move cup toward me”—the model predicts forward-pointing arrows on the cup region.
- Vision-Language Model (VLM) 🍞 Hook: Imagine a guide who can look at a map and read your note to figure out your route. 🥬 Concept: A VLM understands images and text together to ground words in the scene.
- How it works:
- Read the instruction.
- Look at the image(s).
- Create features that connect words like “left” to actual places and objects.
- Why it matters: Without grounding, “left” is just a word; with grounding, it applies to the bottle you’re pointing at. 🍞 Anchor: The VLM links “move the red bowl” to the actual red bowl pixels.
- Diffusion Model 🍞 Hook: Think of starting with TV static and slowly sculpting a clear picture. 🥬 Concept: A diffusion model starts from noise and denoises step by step to generate an image—here, a motion image.
- How it works:
- Add noise to the target image during training.
- Learn to remove the noise.
- At test time, start from noise and remove it to draw the motion field.
- Why it matters: Diffusion is great at high-fidelity, pixel-precise generation. 🍞 Anchor: The model denoises into a smooth optical-flow picture that shows motion clearly.
- Unified VLM–Diffusion Backbone 🍞 Hook: Imagine a duet where one singer understands the lyrics (VLM) and the other hits perfect notes (diffusion). Together, they perform the whole song. 🥬 Concept: FOFPred pairs a frozen VLM (for understanding) and a frozen VAE (for compact image latents) with a trainable Diffusion Transformer that generates future optical flow.
- How it works:
- VLM encodes text+images, VAE encodes images into latents.
- Project these features to the diffusion transformer.
- The transformer performs spatio-temporal attention and denoising to produce future flow latents, then the VAE decodes them to colorful flow images.
- Why it matters: Language grounding plus pixel-accurate generation yields motion that follows instructions and fits the scene. 🍞 Anchor: For “move bottle left,” the backbone outputs left-pointing flow across the bottle region, not the table.
- Data Preprocessing (Camera Motion Compensation) 🍞 Hook: If your camera shakes while filming, everything seems to move—even when objects don’t. 🥬 Concept: FOFPred computes relative optical flow that subtracts camera motion, leaving only object motion.
- How it works:
- Estimate raw flow between frames.
- Fit a homography to explain camera movement.
- Subtract that camera flow from raw flow.
- Why it matters: Clean motion targets prevent the model from learning the wrong thing (like “phone panning” instead of “cup sliding”). 🍞 Anchor: In a hand-held video, the table’s fake motion vanishes after compensation; only the moving spoon remains.
- Multimodal Reasoning 🍞 Hook: A good coach watches the player and listens to the play call. 🥬 Concept: The model reasons over vision and language together to choose motion that matches both.
- How it works:
- Align words (e.g., “left,” “toward camera”) with image regions and viewpoint.
- Use joint attention to guide motion generation.
- Why it matters: Without this, the motion might be smooth but not what you asked for. 🍞 Anchor: “Move the blue block toward the camera” affects the blue block, not the red one.
Multiple analogies for the core idea:
- Weather map: predict tomorrow’s wind arrows (future flow) from today’s sky and a weather note.
- Choreography: turn “step left, turn, step forward” into exact dancer foot arrows.
- GPS: given the map (image) and instruction, draw the car’s path (flow) before driving.
Before vs. After: Before, models guessed motion indirectly (frames or sparse points). After, FOFPred forecasts dense, language-grounded motion first, then acts or draws frames, leading to better control and generation.
Why it works intuitively: Motion is simpler than full images; diffusion draws it cleanly; the VLM makes sure it’s the right motion; and camera compensation gives honest labels.
03 Methodology
At a high level: (Image(s) + Instruction) -> VLM and VAE encoders -> Project to a common space -> Diffusion Transformer denoises into future optical flow latents -> VAE decoder turns latents into colorful flow frames -> Use them for robot actions or video generation.
Step 1: Inputs and Feature Extraction
- What happens: The model takes one or two frames (x_{t-1}, x_t) and a caption like “move the bottle left,” feeds them to a frozen Vision-Language Model (Qwen2.5-VL) to get text+vision features, and passes images through a frozen VAE encoder (Flux.1) to get visual latents.
- Why this step: The VLM grounds words in the scene; the VAE puts images into a compact form the diffusion model knows well.
- Example: Frame shows a bottle on a table; caption says “move bottle left.” The VLM highlights the bottle region and the meaning of “left.”
Step 2: Project to a Common Space
- What happens: Two small MLPs map VLM features and VAE latents to the same channel size and append them to the diffusion transformer’s token sequence.
- Why this step: Diffusion wants a single, consistent feature format to condition on.
- Example: Think of putting both the instruction and image context on the same “power strip” so the transformer can plug into both.
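A minimal PyTorch sketch of this “same power strip” idea, assuming the VLM hidden states and VAE latents have already been produced by the frozen encoders; the dimensions and names (proj_vlm, proj_vae, DIT_DIM) are illustrative, not the paper's actual sizes.

```python
import torch
import torch.nn as nn

DIT_DIM = 3072               # hypothetical DiT token width
VLM_DIM, VAE_CH = 3584, 16   # hypothetical VLM hidden size / VAE latent channels

# Two small MLP projectors map each modality into the DiT token space.
proj_vlm = nn.Sequential(nn.Linear(VLM_DIM, DIT_DIM), nn.GELU(), nn.Linear(DIT_DIM, DIT_DIM))
proj_vae = nn.Sequential(nn.Linear(VAE_CH, DIT_DIM), nn.GELU(), nn.Linear(DIT_DIM, DIT_DIM))

# Stand-ins for outputs of the frozen encoders (batch of 1 shown).
vlm_tokens = torch.randn(1, 128, VLM_DIM)      # text+vision features from the VLM
vae_latents = torch.randn(1, VAE_CH, 32, 32)   # image latents from the VAE encoder

# Flatten spatial latents into tokens, project both, and concatenate into one
# conditioning sequence that gets appended to the DiT's noisy-flow tokens.
vae_tokens = vae_latents.flatten(2).transpose(1, 2)                   # (1, 1024, VAE_CH)
cond = torch.cat([proj_vlm(vlm_tokens), proj_vae(vae_tokens)], dim=1)
print(cond.shape)                                                     # (1, 1152, 3072)
```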
Step 3: Diffusion Transformer with Spatio-Temporal Attention
- What happens: A DiT (OmniGen-style) operates over the noisy optical-flow latents, using full spatio-temporal attention and a time-aware RoPE so it knows which tokens belong to which frame.
- Why this step: Motion is about how things change across time. The transformer must connect pixels across frames and future steps.
- Example: To predict “left then forward,” the attention links the object’s pixels now to where they should be in the next flow frames.
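A simplified sketch of “time-aware” positions: a 1D rotary embedding over frame indices applied to queries and keys before full attention across all frame tokens. The real model uses a richer spatio-temporal RoPE; this only illustrates how attention can tell which tokens belong to which frame.

```python
import torch
import torch.nn.functional as F

def rope_1d(x, pos, base=10000.0):
    """Rotate channel pairs by an angle proportional to each token's frame index."""
    half = x.shape[-1] // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    ang = pos.float()[:, None] * freqs[None, :]     # (seq, half)
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

B, frames, tokens_per_frame, dim = 1, 4, 64, 128        # illustrative sizes
x = torch.randn(B, frames * tokens_per_frame, dim)      # all frames' tokens, flattened

# Frame index of every token, e.g. [0, 0, ..., 0, 1, 1, ..., 3].
frame_ids = torch.arange(frames).repeat_interleave(tokens_per_frame)

q = rope_1d(x, frame_ids)   # queries and keys carry frame position; values do not need it
k = rope_1d(x, frame_ids)
out = F.scaled_dot_product_attention(q.unsqueeze(1), k.unsqueeze(1), x.unsqueeze(1))
print(out.shape)            # torch.Size([1, 1, 256, 128])
```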
Step 4: Colorful Flow Representation (HSV-as-RGB)
- What happens: True flow vectors (angle + magnitude) are mapped to HSV colors and then converted to RGB-like images. The VAE/DiT operate in this image-like space.
- Why this step: Reusing strong pretrained image VAEs/diffusion is easier and more stable than training a new flow-specific autoencoder.
- Example: Rightward motion becomes a certain hue; faster motion becomes more saturated, making flows look like animated images.
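A sketch of the HSV mapping: flow direction becomes hue, speed becomes saturation, and the result is converted to an RGB image that a pretrained image VAE can ingest. The exact scaling constants here are illustrative.

```python
import cv2
import numpy as np

def flow_to_rgb(flow, speed_scale=8.0):
    """Map a (H, W, 2) flow field to RGB: hue = direction, saturation = speed."""
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1], angleInDegrees=True)
    hsv = np.zeros((*flow.shape[:2], 3), dtype=np.uint8)
    hsv[..., 0] = (ang / 2).astype(np.uint8)                            # OpenCV hue range is [0, 180)
    hsv[..., 1] = np.clip(mag * speed_scale, 0, 255).astype(np.uint8)   # faster = more saturated
    hsv[..., 2] = 255                                                   # full brightness
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2RGB)

# Example: a synthetic field where everything drifts 5 pixels to the right
# produces a single flat hue, i.e. "everything moving right".
flow = np.zeros((64, 64, 2), dtype=np.float32)
flow[..., 0] = 5.0
rgb = flow_to_rgb(flow)
```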
Step 5: Training Targets—Relative Flow (Camera Compensation)
- What happens: For each training pair, compute raw flow (e.g., RAFT), fit a homography via RANSAC to estimate camera motion, synthesize camera flow, and subtract it to get object-only (relative) flow. Filter frames using a fast Lucas–Kanade proxy so we only train on segments with enough motion.
- Why this step: Phone shake or panning can swamp the real object motion; removing it gives clean supervision. Motion-guided sampling speeds training and avoids static clips.
- Example: If the phone moves right but the cup doesn’t, raw flow shows everything moving; after compensation, the table’s flow is near zero, and only the true movers remain.
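A minimal sketch of the compensation step, assuming a dense raw flow (from RAFT, or the Farneback call shown earlier) is already available: fit one homography with RANSAC to the dominant motion, synthesize the camera-induced flow it implies, and subtract it. Real pipelines typically subsample points or use feature matches rather than every pixel.

```python
import cv2
import numpy as np

def relative_flow(raw_flow):
    """Subtract camera-induced motion (modeled by a homography) from a dense flow field."""
    h, w = raw_flow.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
    pts0 = np.stack([xs, ys], axis=-1).reshape(-1, 2)      # pixel coordinates in frame t
    pts1 = pts0 + raw_flow.reshape(-1, 2)                  # where the raw flow says they moved

    # RANSAC treats the dominant (background/camera) motion as inliers of one homography.
    H, _ = cv2.findHomography(pts0, pts1, cv2.RANSAC, 3.0)

    # Camera flow = where the homography alone would move each pixel, minus its start.
    cam_pts1 = cv2.perspectiveTransform(pts0.reshape(-1, 1, 2), H).reshape(-1, 2)
    cam_flow = (cam_pts1 - pts0).reshape(h, w, 2)

    return raw_flow - cam_flow   # object-only ("relative") flow

# Usage: rel = relative_flow(flow)  # background arrows shrink toward zero
```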
Step 6: Loss and Guidance
- What happens: Train with a flow-matching diffusion loss on the flow latents. Use classifier-free guidance by occasionally dropping text or visual conditions so the model learns to rely on both.
- Why this step: Flow matching stabilizes diffusion training; guidance sharpens control at inference.
- Example: At test time, you can nudge the generator to follow the instruction more strictly by increasing guidance scale.
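A compact sketch of the training objective under common flow-matching conventions (a linear path between clean latents and noise, with the network predicting the velocity), plus random condition dropout for classifier-free guidance. The dit model and cond tensor are placeholders, and zeroing the condition is a simplification (a learned null embedding is also common).

```python
import torch
import torch.nn.functional as F

def flow_matching_step(dit, x0, cond, p_drop=0.1):
    """One training step on clean flow latents x0 of shape (B, N, C)."""
    b = x0.shape[0]
    t = torch.rand(b, device=x0.device)          # random time in [0, 1] per sample
    noise = torch.randn_like(x0)

    # Linear path between data and noise; the target velocity is (noise - x0).
    xt = (1 - t.view(b, 1, 1)) * x0 + t.view(b, 1, 1) * noise
    target = noise - x0

    # Classifier-free guidance: sometimes drop the conditioning so the model
    # also learns an unconditional velocity field.
    keep = (torch.rand(b, device=x0.device) > p_drop).float().view(b, 1, 1)
    pred = dit(xt, t, cond * keep)

    return F.mse_loss(pred, target)
```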
Step 7: Inference (Sampling Steps)
- What happens: Start from noise latents; run k denoising steps (often small k works well for flow) to get future flow latents; decode to flow images.
- Why this step: Diffusion sampling draws the future motion. Fewer steps can be enough because flow is simpler than full RGB.
- Example: With k=4, you already get a smooth, correct “left then forward” flow stack.
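A matching inference sketch, assuming the same dit interface as the training snippet above: start from noise and take k Euler steps along the predicted velocity field, with a guidance scale that pushes the result toward the conditioned prediction.

```python
import torch

@torch.no_grad()
def sample_flow(dit, cond, shape, k=4, guidance=2.0, device="cpu"):
    """Denoise from pure noise to future-flow latents in k Euler steps with CFG."""
    x = torch.randn(shape, device=device)                  # start at t = 1 (pure noise)
    ts = torch.linspace(1.0, 0.0, k + 1, device=device)    # integrate from noise to data
    for i in range(k):
        t = ts[i].expand(shape[0])
        v_cond = dit(x, t, cond)
        v_uncond = dit(x, t, torch.zeros_like(cond))
        v = v_uncond + guidance * (v_cond - v_uncond)      # classifier-free guidance
        x = x + (ts[i + 1] - ts[i]) * v                    # Euler step toward t = 0
    return x   # decode with the frozen VAE afterwards to get flow images
```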
Downstream Use Case I: Robot Control
- Pipeline: Future flow -> Diffusion Policy Head (DPN) with state and text -> action vectors for the arm(s).
- Why this works: The action head maps “how pixels will move” into “how joints should move,” making control explicit and precise.
- Special touches: Cross-view conditioning (fixed and wrist cameras) and fine-tuning on robot videos improve embodiment understanding.
- Example: “Push the blue block toward the edge.” The predicted forward flow near the block guides the gripper push direction.
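A toy sketch of the interface only: a plain MLP stand-in (not the paper's diffusion policy head) showing how predicted-flow features, robot state, and a text embedding could be fused into a short chunk of future actions. All sizes are made up.

```python
import torch
import torch.nn as nn

class FlowConditionedPolicy(nn.Module):
    """Map (future-flow features, robot state, text embedding) -> action chunk."""
    def __init__(self, flow_dim=512, state_dim=14, text_dim=512, act_dim=7, horizon=8):
        super().__init__()
        self.act_dim, self.horizon = act_dim, horizon
        self.net = nn.Sequential(
            nn.Linear(flow_dim + state_dim + text_dim, 1024), nn.GELU(),
            nn.Linear(1024, act_dim * horizon),
        )

    def forward(self, flow_feat, state, text_feat):
        x = torch.cat([flow_feat, state, text_feat], dim=-1)
        return self.net(x).view(-1, self.horizon, self.act_dim)   # (B, 8 steps, 7-DoF)

policy = FlowConditionedPolicy()
actions = policy(torch.randn(1, 512), torch.randn(1, 14), torch.randn(1, 512))
print(actions.shape)   # torch.Size([1, 8, 7])
```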
Downstream Use Case II: Text-to-Video (Two-Stage)
- Pipeline: Stage 1: FOFPred predicts future flow from the first frame + text. Interpolate flows to a dense motion field. Stage 2: Feed motion + first frame into a video generator (GWTF/CogVideoX-based) to synthesize frames that follow the motion.
- Why this works: Separating “decide motion” from “paint frames” boosts adherence to the instruction and makes the process interpretable.
- Example: “The kite swoops down then loops.” Stage 1 draws the swoop+loop flow; Stage 2 paints the video that follows it.
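A sketch of the Stage 1 to Stage 2 handoff: temporally upsampling a short stack of predicted flow frames so the video generator receives one motion field per output frame. Trilinear interpolation is an illustrative choice, not necessarily what the paper uses.

```python
import torch
import torch.nn.functional as F

def densify_flow(flow_frames, num_video_frames):
    """flow_frames: (T, H, W, 2) predicted flows -> (num_video_frames, H, W, 2)."""
    x = flow_frames.permute(3, 0, 1, 2).unsqueeze(0)                 # (1, 2, T, H, W)
    x = F.interpolate(x, size=(num_video_frames, *flow_frames.shape[1:3]),
                      mode="trilinear", align_corners=True)
    return x.squeeze(0).permute(1, 2, 3, 0)                          # (T', H, W, 2)

dense = densify_flow(torch.randn(4, 32, 32, 2), num_video_frames=16)
print(dense.shape)   # torch.Size([16, 32, 32, 2])
```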
The Secret Sauce
- Unified backbone: Language grounding (VLM) plus pixel-accurate diffusion (DiT) in one place.
- Clean supervision: Camera-compensated relative flow and motion-guided sampling from web videos.
- Reuse of strong priors: HSV-as-RGB makes flow look like images so pretrained VAE/diffusion skills transfer.
- Minimal training: Freeze VLM and VAE; train only the DiT, keeping things stable and efficient to fine-tune.
04 Experiments & Results
The Tests and Why They Matter
- CALVIN ABC→D (long-horizon, language-conditioned robotics): Can an agent follow chains of natural-language tasks it never saw in that exact environment? Success here reflects real-world generalization and planning.
- RoboTwin 2.0 (bimanual robot tasks): Do two arms coordinate under language guidance? This tests richer motion and embodiment.
- SSv2-based Text-to-Video motion control: Do generated videos follow complex motion instructions? Motion fidelity beats vague text-only generation.
Competitors
- Robotics: RT-1, Diffusion Policy, Robo-Flamingo, Uni-Pi, MDT, Susie, GR-1, Vidman, RoboUniview, LTM, DreamVLA, VPP.
- Video: CogVideoX and prior motion-control baselines.
Scoreboard with Context
- CALVIN (full data, zero-shot to env D): FOFPred tops all five sequential tasks, ending with Task 5 success 78.7% and the highest Average Length 4.48. Think of this as getting an A when others are around A- or B+. It slightly edges DreamVLA on overall chain completion, showing that explicit motion planning helps long sequences.
- CALVIN (10% data): FOFPred still leads with Avg Length 3.43, showing data efficiency—like learning the game plan faster than others.
- RoboTwin 2.0: Average success 68.6% vs 61.8% for a strong baseline (VPP), with consistent gains across all chosen dual-arm tasks. That’s like a basketball team improving both offense and defense—not just one stat.
- Text-to-Video (SSv2 protocol): FOFPred’s two-stage pipeline beats CogVideoX across common quality and motion metrics (e.g., better SSIM/PSNR, lower FVD/KVD, higher motion fidelity). This is like following choreography exactly instead of freestyle guessing.
Surprising and Noteworthy Findings
- One-step or few-step sampling: FOFPred often makes good motion with very few diffusion steps, hinting that motion fields are simpler than full-color images (less texture, lower dimensionality).
- Language-only motion control: Even without extra masks or hand-drawn paths, the model’s motion forecasts make video generation follow instructions more faithfully.
- Clean targets matter a lot: Ablations show that removing camera-motion compensation or using raw flow hurts performance notably—like trying to learn to dance by watching a shaky video.
Ablations (What changed what)
- Pretraining on web human videos (SSv2) vs robot demos (DROID) for the same steps: SSv2 pretraining improved downstream results (Avg Length +0.35), suggesting that motion variety from humans scales well to robots.
- Backbone: Diffusion-only < VLM+Diffusion < VLM+Diffusion with image-editing pretraining. The full setup had the best results, meaning language grounding + pretrained pixel skills are both key.
- Motion targets: Disentangled (camera-compensated) flow beats static-frame targets and raw flow by a large margin. The model needs honest motion to learn honest motion.
- Dense vs sparse motion: Dense flow clearly outperforms sparse tracks (ATM or naive subsampling) on CALVIN Avg Length, showing that pixel-level detail is crucial for precise manipulation.
- Motion as input: Removing motion or replacing it with static visual embeddings collapses performance. Future flow provides unique, critical dynamic cues that nothing else substitutes.
05 Discussion & Limitations
Limitations
- Prompt sensitivity: Small wording changes (“left” vs “from right to left”) can flip predictions. The language-to-motion mapping can be brittle.
- Model size and compute: Around 7B parameters; training and even inference need strong GPUs (≥24 GB for inference). Preprocessing for relative flow is also compute-heavy (though done offline).
- Residual camera artifacts: Despite compensation, some cases still leak camera-like motion into predictions.
- Two-stage T2V cost: Splitting “predict motion” then “draw frames” is more compute than single-stage text-to-video, though it’s more controllable.
Required Resources
- Hardware: Multi-GPU servers for training (e.g., 8×H200 in the paper) and at least a 24 GB GPU for inference.
- Data: Web-scale videos with captions (e.g., SSv2, EgoDex) plus optional robot-domain videos for fine-tuning.
- Preprocessing: Offline optical flow estimation and homography compensation for large datasets.
When Not to Use
- Real-time, ultra-low-latency control on tiny devices; model size and sampling steps may be too heavy today.
- Scenarios requiring numerically exact physical flow (e.g., scientific fluid measurements); FOFPred predicts plausible visual motion fields, not precise physics-grade vectors.
- Tasks with extremely ambiguous or contradictory language prompts; clarity matters.
Open Questions
- How to make language-to-motion robust to phrasing (“move left” vs “from right to left”)?
- Can we distill FOFPred into a smaller, real-time model without losing control quality?
- How far can web pretraining go: more datasets, more viewpoints, more activities?
- Can we quantify and expand motion diversity and explore why few diffusion steps suffice for optical flow?
- What’s the best way to fuse multi-view robot cameras and long-horizon plans end-to-end?
06 Conclusion & Future Work
Three-Sentence Summary
FOFPred predicts dense, future optical flow from images and language, unifying a VLM for understanding with a diffusion transformer for pixel-precise motion generation. By cleaning camera motion from training targets and representing flow as colorful images, it learns from noisy web videos and generalizes to robots and video generation. This motion-first approach boosts long-horizon manipulation and makes text-to-video follow motion instructions more faithfully.
Main Achievement
A single, unified VLM–diffusion backbone that turns language and images into accurate, inspectable future motion fields—powerful enough to improve both robot control and controllable video generation.
Future Directions
Smaller, faster distilled models for real-time use; better language robustness through prompt paraphrasing and augmentation; richer multi-view and long-horizon reasoning; and deeper analysis of why motion can be generated in so few diffusion steps.
Why Remember This
When AI predicts how things will move—explicitly and under language control—it becomes more reliable at doing and showing what we ask. FOFPred’s idea of “motion-first, then act or draw” is a simple, strong recipe that can ripple across robotics, creative tools, and future world models.
Practical Applications
- Voice-guided home robots that can slide, push, and place objects exactly as instructed.
- Warehouse picking and sorting robots that plan precise pushes and pulls from language tasks.
- Surgical or lab-assist robots that predict tool motion paths before acting, improving safety.
- Video editing tools where users type motion directions and the clip follows them faithfully.
- Animation pre-visualization: block out character and camera motion using text prompts.
- AR/VR experiences with realistic, controllable object motions driven by narration or scripts.
- Sports training tools that forecast player or ball motion from a still plus a coaching cue.
- Traffic or crowd simulators that test “what if” motion scenarios from natural-language inputs.
- Assistive devices that interpret caregiver instructions to plan safe object movements.
- Education demos that visualize motion fields (flows) to teach physics and robotics concepts.