How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (20)


PixelGen: Pixel Diffusion Beats Latent Diffusion with Perceptual Loss

Intermediate
Zehong Ma, Ruihan Xu et al. · Feb 2 · arXiv

PixelGen is a new image generator that works directly with pixels and uses a perceptual loss, a "does this look right to people?" score, to improve quality. A toy perceptual-loss sketch follows the tags below.

#pixel diffusion #perceptual loss #LPIPS
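To make "perceptual loss" concrete, here is a minimal LPIPS-style loss sketched with torchvision's pretrained VGG16: two images are compared by the distance between their network features rather than their raw pixels. The layer choices, equal weighting, and skipped ImageNet normalization are illustrative assumptions, not PixelGen's actual training recipe.

```python
# Minimal LPIPS-style perceptual loss (illustrative, not PixelGen's recipe):
# compare two images by the distance between their pretrained-VGG features.
import torch
from torchvision.models import vgg16, VGG16_Weights

vgg = vgg16(weights=VGG16_Weights.DEFAULT).features.eval()
FEATURE_LAYERS = {3, 8, 15, 22}  # relu1_2, relu2_2, relu3_3, relu4_3

def perceptual_loss(x, y):
    """x, y: (B, 3, H, W) images in [0, 1] (ImageNet normalization omitted)."""
    loss, fx, fy = 0.0, x, y
    for i, layer in enumerate(vgg):
        fx, fy = layer(fx), layer(fy)
        if i in FEATURE_LAYERS:
            nx = fx / (fx.norm(dim=1, keepdim=True) + 1e-8)  # unit-normalize channels
            ny = fy / (fy.norm(dim=1, keepdim=True) + 1e-8)
            loss = loss + (nx - ny).pow(2).mean()
            if i == max(FEATURE_LAYERS):
                break
    return loss

with torch.no_grad():
    a, b = torch.rand(1, 3, 224, 224), torch.rand(1, 3, 224, 224)
    print(perceptual_loss(a, b))  # identical images would give 0
```

The real LPIPS metric additionally uses learned per-channel weights and proper input normalization; the widely used `lpips` Python package implements that full version.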

One-step Latent-free Image Generation with Pixel Mean Flows

Beginner
Yiyang Lu, Susie Lu et al. · Jan 29 · arXiv

This paper shows how to make a whole picture in one go, directly in pixels, without using a hidden “latent” space or many tiny steps. A toy one-step sampler follows the tags below.

#pixel MeanFlow #one-step generation #x-prediction
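For contrast with the usual pipeline (VAE encode, dozens of denoising steps, VAE decode), here is what "one-step, latent-free" sampling looks like in code. The tiny network and the plain x-prediction convention are placeholders standing in for the paper's trained pixel MeanFlow model.

```python
# Toy one-step sampler: a single forward pass maps pure noise to an image
# directly in pixel space (no VAE/latents, no iterative denoising loop).
# The network below is a placeholder, not the paper's model.
import torch
import torch.nn as nn

class TinyXPredictor(nn.Module):
    """Stand-in network that predicts the clean image (x-prediction) from noise."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, 3, 3, padding=1),
        )

    def forward(self, z):
        return self.net(z)

model = TinyXPredictor()

@torch.no_grad()
def sample_one_step(batch, size=64):
    z = torch.randn(batch, 3, size, size)  # start from pure noise
    return model(z)                        # one model call, image out

imgs = sample_one_step(4)
print(imgs.shape)  # torch.Size([4, 3, 64, 64])
```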

Self-Refining Video Sampling

Intermediate
Sangwon Jang, Taekyung Ki et al. · Jan 26 · arXiv

This paper shows how a video generator can improve its own videos during sampling, without extra training or outside checkers. A generic refine-at-sampling-time sketch follows the tags below.

#video generation #flow matching #denoising autoencoder
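The blurb does not spell out the paper's exact procedure, so the sketch below only shows the general flavor of refining during sampling, using a generic re-noise-and-re-denoise loop (in the spirit of SDEdit): the same model cleans up its own output a few more times. Every component here is a toy placeholder, not the paper's method.

```python
# Generic refine-during-sampling loop (NOT the paper's specific method):
# partially re-noise the current sample and let the same model denoise it
# again, for a few rounds. All components below are toy placeholders.
import torch

def denoise(model, x, steps=8):
    """Few-step denoiser; model(x, t) is assumed to predict the clean sample."""
    for i in range(steps):
        t = 1.0 - i / steps
        x0_pred = model(x, t)
        x = x + (x0_pred - x) / (steps - i)  # move part of the way toward the prediction
    return x

@torch.no_grad()
def self_refine(model, x, rounds=3, noise_level=0.3):
    for _ in range(rounds):
        x = (1 - noise_level) * x + noise_level * torch.randn_like(x)  # re-noise
        x = denoise(model, x)                                          # re-denoise, same model
    return x

dummy_model = lambda x, t: torch.tanh(x)  # stand-in for a trained video model
video = torch.randn(1, 8, 3, 32, 32)      # (batch, frames, channels, H, W)
print(self_refine(dummy_model, video).shape)
```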

Alterbute: Editing Intrinsic Attributes of Objects in Images

Intermediate
Tal Reiss, Daniel Winter et al. · Jan 15 · arXiv

Alterbute is a diffusion-based method that changes an object's intrinsic attributes (color, texture, material, shape) in a photo while keeping the object's identity and the scene intact.

#intrinsic attribute editing #visual named entities #identity preservation

Future Optical Flow Prediction Improves Robot Control & Video Generation

Intermediate
Kanchana Ranasinghe, Honglu Zhou et al. · Jan 15 · arXiv

FOFPred is a new AI that reads one or two images plus a short instruction like “move the bottle left to right,” and then predicts how every pixel will move in the next moments. A small sketch of what a dense flow field is follows the tags below.

#optical flow #future optical flow prediction #vision-language model
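To make "how every pixel will move" concrete: a dense optical-flow field stores one (dx, dy) displacement per pixel, and warping the current frame by such a field previews the future frame. The flow below is a made-up constant shift, not FOFPred's prediction.

```python
# A dense flow field is a (2, H, W) tensor: a (dx, dy) vector for every pixel.
# Backward warping samples each output pixel from where the flow points,
# which previews how the frame would look after that motion.
import torch
import torch.nn.functional as F

def warp(frame, flow):
    """frame: (B, C, H, W); flow: (B, 2, H, W) in pixel units."""
    B, _, H, W = frame.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0) + flow
    grid_x = 2 * grid[:, 0] / (W - 1) - 1   # normalize to [-1, 1] for grid_sample
    grid_y = 2 * grid[:, 1] / (H - 1) - 1
    grid = torch.stack((grid_x, grid_y), dim=-1)  # (B, H, W, 2)
    return F.grid_sample(frame, grid, align_corners=True)

frame = torch.rand(1, 3, 64, 64)
flow = torch.zeros(1, 2, 64, 64)
flow[:, 0] = 5.0  # every pixel moves 5 px horizontally
print(warp(frame, flow).shape)
```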

Transition Matching Distillation for Fast Video Generation

Intermediate
Weili Nie, Julius Berner et al. · Jan 14 · arXiv

Big video generators (diffusion models) create great videos but are too slow because they use hundreds of tiny clean-up steps; transition matching distillation teaches a faster student model to reach similar quality in only a few steps. A generic distillation sketch follows the tags below.

#video diffusion #distillation #transition matching
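Here is the generic shape of this kind of speed-up: a few-step (here one-step) student is trained to reproduce what the slow, many-step teacher produces from the same noise. The MSE objective and all components below are placeholders; the paper's actual transition-matching objective is more involved.

```python
# Generic diffusion-distillation training step (placeholder objective,
# not the paper's transition-matching loss): the fast student learns to
# match the slow teacher's output for the same starting noise.
import torch
import torch.nn.functional as F

def distill_step(student, teacher_sampler, optimizer, batch=2, shape=(3, 32, 32)):
    noise = torch.randn(batch, *shape)
    with torch.no_grad():
        target = teacher_sampler(noise, steps=100)  # slow: many denoising steps
    pred = student(noise)                           # fast: a single forward pass
    loss = F.mse_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

student = torch.nn.Conv2d(3, 3, 3, padding=1)       # toy student network
teacher_sampler = lambda z, steps: torch.tanh(z)    # stand-in for a real teacher sampler
opt = torch.optim.Adam(student.parameters(), lr=1e-4)
print(distill_step(student, teacher_sampler, opt))
```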

VideoAR: Autoregressive Video Generation via Next-Frame & Scale Prediction

Intermediate
Longbin Ji, Xiaoxiong Liu et al. · Jan 9 · arXiv

VideoAR is a new way to make videos with AI that writes each frame like a story, one step at a time, while painting details from coarse to fine. A toy autoregressive loop follows the tags below.

#autoregressive video generation #visual autoregression #next-frame prediction
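"Writes each frame like a story" translates to an autoregressive loop: each new frame is predicted from the frames generated so far. The toy model below stands in for the real network, and VideoAR's coarse-to-fine (next-scale) refinement of each frame is omitted.

```python
# Autoregressive (next-frame) generation: every new frame is conditioned
# on all previously generated frames. The predictor here is a toy stand-in.
import torch
import torch.nn as nn

class NextFramePredictor(nn.Module):
    """Toy model: predicts the next frame from a summary of past frames."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(3, 3, 3, padding=1)

    def forward(self, past):            # past: (B, T, 3, H, W)
        return self.net(past.mean(dim=1))

@torch.no_grad()
def generate(model, first_frame, num_frames=8):
    frames = [first_frame]              # first_frame: (B, 3, H, W)
    for _ in range(num_frames - 1):
        past = torch.stack(frames, dim=1)
        frames.append(model(past))      # condition on everything generated so far
    return torch.stack(frames, dim=1)

video = generate(NextFramePredictor(), torch.rand(1, 3, 32, 32))
print(video.shape)  # torch.Size([1, 8, 3, 32, 32])
```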

Boosting Latent Diffusion Models via Disentangled Representation Alignment

Intermediate
John Page, Xuesong Niu et al. · Jan 9 · arXiv

This paper shows that the best VAEs for image generation are the ones whose latents neatly separate object attributes, a property called semantic disentanglement.

#Send-VAE #semantic disentanglement #latent diffusion

LTX-2: Efficient Joint Audio-Visual Foundation Model

Intermediate
Yoav HaCohen, Benny Brazowski et al. · Jan 6 · arXiv

LTX-2 is an open-source model that makes video and sound together from a text prompt, so the picture and audio match in time and meaning.

#text-to-video #text-to-audio #audiovisual generation

VINO: A Unified Visual Generator with Interleaved OmniModal Context

Beginner
Junyi Chen, Tong He et al. · Jan 5 · arXiv

VINO is a single AI model that can make and edit both images and videos by listening to text and looking at reference pictures and clips at the same time.

#VINO #unified visual generator #multimodal diffusion transformer

LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation

Beginner
Ethan Chern, Zhulin Hu et al. · Dec 29 · arXiv

LiveTalk turns slow, many-step video diffusion into a fast, 4-step, real-time system for talking avatars that listen, think, and respond with synchronized video. A toy few-step sampler follows the tags below.

#real-time video diffusion #on-policy distillation #multimodal conditioning
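Why "4-step" is the headline: latency scales with the number of denoising passes per chunk of video, so cutting roughly 50 steps down to 4 is what makes real-time interaction feasible. The Euler-style flow sampler below shows what a 4-step sampler looks like; the velocity model is a dummy stand-in, not LiveTalk's distilled network.

```python
# Few-step sampling: total latency is roughly (number of steps) x (one model
# call), so 4 steps instead of ~50 is the difference between offline and
# real-time. The velocity model here is a dummy, not LiveTalk's network.
import torch

@torch.no_grad()
def sample_few_steps(velocity_model, shape, steps=4):
    x = torch.randn(shape)            # start from pure noise
    for i in range(steps):
        t = i / steps
        v = velocity_model(x, t)      # predicted direction toward clean data
        x = x + v / steps             # one Euler step along that direction
    return x

velocity_model = lambda x, t: -x      # toy stand-in
chunk = sample_few_steps(velocity_model, (1, 16, 3, 64, 64))  # one 16-frame chunk
print(chunk.shape)
```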

Over++: Generative Video Compositing for Layer Interaction Effects

Intermediate
Luchao Qi, Jiaye Wu et al. · Dec 22 · arXiv

Over++ is a video AI that adds realistic effects like shadows, splashes, dust, and smoke between a foreground and a background without changing the original footage. A reminder of the classic “over” operator follows the tags below.

#augmented compositing #video diffusion #video inpainting
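For context on the name: classic compositing stacks a foreground onto a background with the "over" operator, which blends the layers by an alpha matte but creates no interaction between them; those missing interactions (shadows, splashes, smoke) are what the paper adds on top. A minimal over operator, as a point of reference rather than anything from the paper:

```python
# The classic "over" compositing operator: per-pixel alpha blend of a
# foreground layer onto a background. Note it produces no shadows or other
# interaction effects between the layers, which is the gap Over++ targets.
import torch

def over(fg, alpha, bg):
    """fg, bg: (B, 3, H, W) images; alpha: (B, 1, H, W) matte in [0, 1]."""
    return alpha * fg + (1 - alpha) * bg

fg = torch.rand(1, 3, 64, 64)
bg = torch.rand(1, 3, 64, 64)
alpha = torch.rand(1, 1, 64, 64)
print(over(fg, alpha, bg).shape)
```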