Papers (20)

#FID

DREAM: Where Visual Understanding Meets Text-to-Image Generation

Beginner

Chao Li, Tianhong Li et al. · Mar 3 · arXiv

DREAM is one model that both understands images (like CLIP) and makes images from text (like top text-to-image models).

#DREAM #contrastive learning #masked autoregressive modeling

Unified Latents (UL): How to train your latents

Intermediate

Jonathan Heek, Emiel Hoogeboom et al. · Feb 19 · arXiv

Unified Latents (UL) is a way to learn the hidden code (latents) for images and videos by training three parts together: an encoder, a diffusion prior, and a diffusion decoder.

#Unified Latents #diffusion prior #diffusion decoder

DDiT: Dynamic Patch Scheduling for Efficient Diffusion Transformers

Intermediate

Dahye Kim, Deepti Ghadiyaram et al. · Feb 19 · arXiv

This paper speeds up image and video generators called diffusion transformers by changing how big their puzzle pieces (patches) are at each step.

#Diffusion Transformer #Dynamic Tokenization #Patch Scheduling

Image Generation with a Sphere Encoder

Beginner

Kaiyu Yue, Menglin Jia et al. · Feb 16 · arXiv

The Sphere Encoder is a new way to make images fast by teaching an autoencoder to place all images evenly on a big imaginary sphere and then decode random spots on that sphere back into pictures.

#Sphere Encoder #Spherical Latent Space #RMS Normalization
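
The "random spots on that sphere" idea can be illustrated with a minimal sketch. It assumes, based only on the RMS Normalization tag, that latents are pushed onto the sphere by RMS normalization; the helper names are hypothetical and the decoder itself is not modeled here.

```python
import math
import random


def rms_normalize(v, eps=1e-8):
    """Rescale v so its root-mean-square is 1 (norm equals sqrt(len(v)))."""
    rms = math.sqrt(sum(x * x for x in v) / len(v))
    return [x / (rms + eps) for x in v]


def random_point_on_sphere(dim, rng):
    """Hypothetical helper: normalizing a Gaussian vector gives a
    uniformly distributed direction on the sphere."""
    return rms_normalize([rng.gauss(0.0, 1.0) for _ in range(dim)])


rng = random.Random(0)
z = random_point_on_sphere(16, rng)  # a latent one would hand to a decoder
radius = math.sqrt(sum(x * x for x in z))
```

Every sampled `z` lies at the same radius, so a decoder trained on such latents only ever sees points on that one sphere.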

Efficient Text-Guided Convolutional Adapter for the Diffusion Model

Intermediate

Aryan Das, Koushik Biswas et al. · Feb 16 · arXiv

This paper introduces Nexus Adapters, tiny helper networks that let a diffusion model follow both a text prompt and a structure map (like edges or depth) at the same time.

#Nexus Adapter #text-guided adapter #cross-attention

iFSQ: Improving FSQ for Image Generation with 1 Line of Code

Intermediate

Bin Lin, Zongjian Li et al. · Jan 23 · arXiv

This paper fixes a hidden flaw in a popular image tokenizer (FSQ) with a simple one-line change to its activation function.

#image generation #finite scalar quantization #iFSQ
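
To show where an activation function sits in finite scalar quantization, here is a minimal sketch of the baseline scheme for a single scalar: bound the value with `tanh`, then round to the nearest integer code. The function name and the `activation` parameter are hypothetical, only odd level counts are handled, and the specific replacement activation from the paper is not reproduced.

```python
import math


def fsq_quantize(z, levels, activation=math.tanh):
    """Quantize one scalar to one of `levels` integer codes (odd counts only)."""
    if levels % 2 == 0:
        raise ValueError("this sketch handles odd level counts only")
    half = (levels - 1) / 2
    bounded = half * activation(z)  # squash into [-half, half]
    return round(bounded)           # snap to the nearest integer code


codes = [fsq_quantize(z, levels=5) for z in (-3.0, -0.4, 0.0, 0.4, 3.0)]
# codes == [-2, -1, 0, 1, 2]
```

Swapping the `activation` argument is the kind of one-line change the summary refers to, since that function decides how input values spread across the available codes.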

OpenVision 3: A Family of Unified Visual Encoder for Both Understanding and Generation

Intermediate

Letian Zhang, Sucheng Ren et al. · Jan 21 · arXiv

OpenVision 3 is a single vision encoder that learns one set of image tokens that work well for both understanding images (like answering questions) and generating images (like making new pictures).

#Unified Visual Encoder #VAE #Vision Transformer

FrankenMotion: Part-level Human Motion Generation and Composition

Beginner

Chuqiao Li, Xianghui Xie et al. · Jan 15 · arXiv

FrankenMotion is a new AI that makes human motion by controlling each body part over time, like a careful puppeteer.

#Human motion generation #Part-level control #Hierarchical conditioning

Efficient Camera-Controlled Video Generation of Static Scenes via Sparse Diffusion and 3D Rendering

Intermediate

Jieying Chen, Jeffrey Hu et al. · Jan 14 · arXiv

This paper shows how to make long, camera-controlled videos much faster by generating only a few smart keyframes with diffusion, then filling in the rest using a 3D scene and rendering.

#camera-controlled video generation #sparse keyframes #3D reconstruction

Boosting Latent Diffusion Models via Disentangled Representation Alignment

Intermediate

John Page, Xuesong Niu et al. · Jan 9 · arXiv

This paper shows that the best VAEs for image generation are the ones whose latents neatly separate object attributes, a property called semantic disentanglement.

#Send-VAE #semantic disentanglement #latent diffusion

Guiding a Diffusion Transformer with the Internal Dynamics of Itself

Beginner

Xingyu Zhou, Qifan Li et al. · Dec 30 · arXiv

This paper shows a simple way to make image-generating AIs (diffusion Transformers) produce clearer, more accurate pictures by letting the model guide itself from the inside.

#Internal Guidance #Diffusion Transformer #Intermediate Supervision

Is There a Better Source Distribution than Gaussian? Exploring Source Distributions for Image Flow Matching

Intermediate

Junho Lee, Kwanseok Kim et al. · Dec 20 · arXiv

Flow Matching is like teaching arrows to push points from a simple cloud (the source) to real pictures (the target); most people start from a Gaussian cloud because it spreads equally in all directions.

#flow matching #conditional flow matching #source distribution
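
The "arrows" picture can be made concrete with a minimal sketch of the commonly used straight-line form of conditional flow matching: a source point `x0` and a target point `x1` are blended at time `t`, and the arrow the model is trained to predict is `x1 - x0`. The helper names and the toy three-number "picture" are hypothetical, and this shows only the Gaussian baseline, not the alternative sources the paper explores.

```python
import random


def sample_source(dim, rng):
    """Draw one point from a standard Gaussian source cloud."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def interpolate(x0, x1, t):
    """Point on the straight path from x0 (t=0) to x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """The arrow a model would be trained to predict along that path."""
    return [b - a for a, b in zip(x0, x1)]


rng = random.Random(0)
x1 = [0.5, -0.25, 1.0]            # toy stand-in for a real picture
x0 = sample_source(len(x1), rng)  # starting point in the source cloud
t = 0.3
xt = interpolate(x0, x1, t)
v = target_velocity(x0, x1)
```

Following the arrow `v` from `xt` for the remaining time `1 - t` lands exactly on `x1`, which is what makes the straight-line path a convenient training target.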
