Papers (8)

#DiT

Accelerating Diffusion via Hybrid Data-Pipeline Parallelism Based on Conditional Guidance Scheduling

Intermediate

Euisoo Jung, Byunghyun Kim et al. · Feb 25 · arXiv

Diffusion models make great images but are slow because they remove noise step by step over many iterations.

#diffusion inference #multi-GPU acceleration #data parallelism

PixelGen: Pixel Diffusion Beats Latent Diffusion with Perceptual Loss

Intermediate

Zehong Ma, Ruihan Xu et al. · Feb 2 · arXiv

PixelGen is a new image generator that works directly with pixels and uses a perceptual loss, a training signal based on what looks good to people, to improve quality.

#pixel diffusion #perceptual loss #LPIPS

Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders

Intermediate

Shengbang Tong, Boyang Zheng et al. · Jan 22 · arXiv

Before this work, most text-to-image models used VAEs (small, squished image codes) and struggled with slow training and overfitting on high-quality fine-tuning sets.

#Representation Autoencoder #RAE #Variational Autoencoder

Future Optical Flow Prediction Improves Robot Control & Video Generation

Intermediate

Kanchana Ranasinghe, Honglu Zhou et al. · Jan 15 · arXiv

FOFPred is a new AI model that reads one or two images plus a short instruction like “move the bottle left to right,” and then predicts how every pixel will move in the next moments.

#optical flow #future optical flow prediction #vision-language model

FlashPortrait: 6x Faster Infinite Portrait Animation with Adaptive Latent Prediction

Intermediate

Shuyuan Tu, Yueming Pan et al. · Dec 18 · arXiv

FlashPortrait makes talking-portrait videos that keep a person’s identity steady for as long as you want—minutes or even hours.

#FlashPortrait #portrait animation #identity consistency

Towards Scalable Pre-training of Visual Tokenizers for Generation

Intermediate

Jingfeng Yao, Yuda Song et al. · Dec 15 · arXiv

The paper tackles a paradox: visual tokenizers that get great pixel reconstructions often make worse images when used for generation.

#visual tokenizer #latent space #Vision Transformer

Structure From Tracking: Distilling Structure-Preserving Motion for Video Generation

Intermediate

Yang Fei, George Stoica et al. · Dec 12 · arXiv

The paper teaches a video generator to move things realistically by borrowing motion knowledge from a strong video tracker.

#video diffusion #structure-preserving motion #SAM2

UniUGP: Unifying Understanding, Generation, and Planing For End-to-end Autonomous Driving

Intermediate

Hao Lu, Ziyang Liu et al. · Dec 10 · arXiv

UniUGP is a single system that learns to understand road scenes, explain its thinking, plan safe paths, and even imagine future video frames.

#UniUGP #vision-language-action #world model