Papers12

#optical flow

RealWonder: Real-Time Physical Action-Conditioned Video Generation

RealWonder is a system that turns a single picture and 3D physical actions (like pushes, wind, and robot gripper moves) into a realistic video in real time.

#action-conditioned video generation#physics simulation#optical flow

Not triaged yet

Track4World: Feedforward World-centric Dense 3D Tracking of All Pixels

Intermediate

Jiahao Lu, Jiayi Xu et al.Mar 3arXiv

Track4World is a fast, feedforward AI that can follow the 3D path of every pixel in a video using just one camera.

#dense 3D tracking#scene flow#2D-to-3D correlation

Not triaged yet

DreamWorld: Unified World Modeling in Video Generation

Intermediate

Boming Tan, Xiangdong Zhang et al.Feb 28arXiv

DreamWorld is a new way to make videos that not only look real but also follow common-sense rules about motion, space, and meaning.

#video diffusion transformer#world model#optical flow

Not triaged yet

Rethinking Video Generation Model for the Embodied World

Beginner

Yufan Deng, Zilin Pan et al.Jan 21arXiv

Robots need videos that not only look pretty but also follow real-world physics and finish the task asked of them.

#robot video generation#embodied AI#benchmark

Not triaged yet

Future Optical Flow Prediction Improves Robot Control & Video Generation

Intermediate

Kanchana Ranasinghe, Honglu Zhou et al.Jan 15arXiv

FOFPred is a new AI that reads one or two images plus a short instruction like “move the bottle left to right,” and then predicts how every pixel will move in the next moments.

#optical flow#future optical flow prediction#vision-language model

Not triaged yet

Motion Attribution for Video Generation

Intermediate

Xindi Wu, Despoina Paschalidou et al.Jan 13arXiv

Motive is a new way to figure out which training videos teach an AI how to move things realistically, not just how they look.

#motion attribution#video diffusion#optical flow

Not triaged yet

RadarGen: Automotive Radar Point Cloud Generation from Cameras

Intermediate

Tomer Borreda, Fangqiang Ding et al.Dec 19arXiv

RadarGen is a tool that learns to generate realistic car radar point clouds just from multiple camera views.

#automotive radar#radar point cloud generation#latent diffusion

Not triaged yet

InsertAnywhere: Bridging 4D Scene Geometry and Diffusion Models for Realistic Video Object Insertion

Intermediate

Hoiyeong Jin, Hyojin Jang et al.Dec 19arXiv

InsertAnywhere is a two-stage system that lets you add a new object into any video so it looks like it was always there.

#video object insertion#4D scene geometry#diffusion video generation

Not triaged yet

4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation

Intermediate

Chiao-An Yang, Ryo Hachiuma et al.Dec 18arXiv

This paper teaches a video-understanding AI to think in 3D plus time (4D) so it can answer questions about specific objects moving in videos.

#4D perception#multimodal large language models#perceptual distillation

Not triaged yet

CRISP: Contact-Guided Real2Sim from Monocular Video with Planar Scene Primitives

Intermediate

Zihan Wang, Jiashun Wang et al.Dec 16arXiv

CRISP turns a normal phone video of a person into a clean 3D world and a virtual human that can move in it without breaking physics.

#real-to-sim#human-scene interaction#planar primitives

Not triaged yet

MotionEdit: Benchmarking and Learning Motion-Centric Image Editing

Beginner

Yixin Wan, Lei Ke et al.Dec 11arXiv

This paper creates MotionEdit, a high-quality dataset that teaches AI to change how people and objects move in a picture without breaking their looks or the scene.

#motion-centric image editing#optical flow#MotionEdit dataset

Not triaged yet

UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation

Beginner

Jiehui Huang, Yuechen Zhang et al.Dec 8arXiv

UnityVideo is a single, unified model that learns from many kinds of video information at once—like colors (RGB), depth, motion (optical flow), body pose, skeletons, and segmentation—to make smarter, more realistic videos.

#multimodal video generation#multi-task learning#dynamic noise scheduling

Not triaged yet