Papers12

#FID

iFSQ: Improving FSQ for Image Generation with 1 Line of Code

This paper fixes a hidden flaw in a popular image tokenizer (FSQ) with a simple one-line change to its activation function.

#image generation#finite scalar quantization#iFSQ

OpenVision 3: A Family of Unified Visual Encoder for Both Understanding and Generation

Intermediate

Letian Zhang, Sucheng Ren et al.Jan 21arXiv

OpenVision 3 is a single vision encoder that learns one set of image tokens that work well for both understanding images (like answering questions) and generating images (like making new pictures).

#Unified Visual Encoder#VAE#Vision Transformer

Efficient Camera-Controlled Video Generation of Static Scenes via Sparse Diffusion and 3D Rendering

Intermediate

Jieying Chen, Jeffrey Hu et al.Jan 14arXiv

This paper shows how to make long, camera-controlled videos much faster by generating only a few smart keyframes with diffusion, then filling in the rest using a 3D scene and rendering.

#camera-controlled video generation#sparse keyframes#3D reconstruction

Boosting Latent Diffusion Models via Disentangled Representation Alignment

Intermediate

John Page, Xuesong Niu et al.Jan 9arXiv

This paper shows that the best VAEs for image generation are the ones whose latents neatly separate object attributes, a property called semantic disentanglement.

#Send-VAE#semantic disentanglement#latent diffusion

Is There a Better Source Distribution than Gaussian? Exploring Source Distributions for Image Flow Matching

Intermediate

Junho Lee, Kwanseok Kim et al.Dec 20arXiv

Flow Matching is like teaching arrows to push points from a simple cloud (source) to real pictures (target); most people start from a Gaussian cloud because it points equally in all directions.

#flow matching#conditional flow matching#source distribution

Alchemist: Unlocking Efficiency in Text-to-Image Model Training via Meta-Gradient Data Selection

Intermediate

Kaixin Ding, Yang Zhou et al.Dec 18arXiv

Alchemist is a smart data picker for training text-to-image models that learns which pictures and captions actually help the model improve.

#meta-gradient#data selection#text-to-image

REGLUE Your Latents with Global and Local Semantics for Entangled Diffusion

Intermediate

Giorgos Petsangourakis, Christos Sgouropoulos et al.Dec 18arXiv

Latent diffusion models are great at making images but learn the meaning of scenes slowly because their training goal mostly teaches them to clean up noise, not to understand objects and layouts.

#latent diffusion#REGLUE#representation entanglement

Towards Scalable Pre-training of Visual Tokenizers for Generation

Intermediate

Jingfeng Yao, Yuda Song et al.Dec 15arXiv

The paper tackles a paradox: visual tokenizers that get great pixel reconstructions often make worse images when used for generation.

#visual tokenizer#latent space#Vision Transformer

Bidirectional Normalizing Flow: From Data to Noise and Back

Intermediate

Yiyang Lu, Qiao Sun et al.Dec 11arXiv

Normalizing Flows are models that learn how to turn real images into simple noise and then back again.

#Normalizing Flow#Bidirectional Normalizing Flow#Hidden Alignment

What matters for Representation Alignment: Global Information or Spatial Structure?

Intermediate

Jaskirat Singh, Xingjian Leng et al.Dec 11arXiv

This paper asks whether generation training benefits more from an encoder’s big-picture meaning (global semantics) or from how features are arranged across space (spatial structure).

#representation alignment#REPA#iREPA

One Layer Is Enough: Adapting Pretrained Visual Encoders for Image Generation

Intermediate

Yuan Gao, Chen Chen et al.Dec 8arXiv

This paper shows that we can turn big, smart vision features into a small, easy-to-use code for image generation with just one attention layer.

#Feature Auto-Encoder#FAE#Self-Supervised Learning

SpaceControl: Introducing Test-Time Spatial Control to 3D Generative Modeling

Intermediate

Elisabetta Fedele, Francis Engelmann et al.Dec 5arXiv

SpaceControl lets you steer a powerful 3D generator with simple shapes you draw, without retraining the model.

#3D generative modeling#test-time guidance#latent space intervention