How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (9)

Filter: #video generation

Self-Refining Video Sampling

Intermediate
Sangwon Jang, Taekyung Ki et al. · Jan 26 · arXiv

This paper shows how a video generator can improve its own videos during sampling, without extra training or external verifier models.

#video generation #flow matching #denoising autoencoder

SkyReels-V3 Technical Report

Intermediate
Debang Li, Zhengcong Fei et al. · Jan 24 · arXiv

SkyReels-V3 is a single AI model that can make videos in three ways: from reference images, by extending an existing video, and by creating talking avatars from audio.

#video generation #diffusion transformer #multimodal in-context learning

A Mechanistic View on Video Generation as World Models: State and Dynamics

Intermediate
Luozhou Wang, Zhifei Chen et al. · Jan 22 · arXiv

This paper argues that modern video generators are starting to act like small "world simulators," not just painters of pretty videos.

#world models #video generation #state representation

Inference-time Physics Alignment of Video Generative Models with Latent World Models

Intermediate
Jianhao Yuan, Xiaofeng Zhang et al. · Jan 15 · arXiv

This paper teaches video-making AIs to follow real-world physics better without retraining them.

#video generation #physics plausibility #latent world model

DreamID-V: Bridging the Image-to-Video Gap for High-Fidelity Face Swapping via Diffusion Transformer

Intermediate
Xu Guo, Fulong Ye et al. · Jan 4 · arXiv

DreamID-V is a new AI method that swaps faces in videos while keeping the body movements, expressions, lighting, and background steady and natural.

#video face swapping #image face swapping #diffusion transformer

DreaMontage: Arbitrary Frame-Guided One-Shot Video Generation

Intermediate
Jiawei Liu, Junqiao Li et al. · Dec 24 · arXiv

DreaMontage is a new AI method that makes long, single-shot videos that feel smooth and connected, even when you give it scattered images or short clips in the middle.

#arbitrary frame conditioning #one-shot video generation #Diffusion Transformer

Kling-Omni Technical Report

Intermediate
Kling Team, Jialu Chen et al. · Dec 18 · arXiv

Kling-Omni is a single, unified model that can understand text, images, and videos together and then make or edit high-quality videos from those mixed instructions.

#multimodal visual language #MVL #prompt enhancer

Spatia: Video Generation with Updatable Spatial Memory

Intermediate
Jinjing Zhao, Fangyun Wei et al. · Dec 17 · arXiv

Spatia is a video generator that keeps a live 3D map of the scene (a point cloud) as its memory while making videos.

#video generation #spatial memory #3D point cloud

UniUGP: Unifying Understanding, Generation, and Planning for End-to-end Autonomous Driving

Intermediate
Hao Lu, Ziyang Liu et al. · Dec 10 · arXiv

UniUGP is a single system that learns to understand road scenes, explain its thinking, plan safe paths, and even imagine future video frames.

#UniUGP #vision-language-action #world model