How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (15)

#FID

Unified Latents (UL): How to train your latents

Intermediate
Jonathan Heek, Emiel Hoogeboom et al. · Feb 19 · arXiv

Unified Latents (UL) is a way to learn the hidden code (latents) for images and videos by training three parts together: an encoder, a diffusion prior, and a diffusion decoder.

#Unified Latents #diffusion prior #diffusion decoder

DDiT: Dynamic Patch Scheduling for Efficient Diffusion Transformers

Intermediate
Dahye Kim, Deepti Ghadiyaram et al. · Feb 19 · arXiv

This paper speeds up image and video generators called diffusion transformers by changing how big their puzzle pieces (patches) are at each step.

#Diffusion Transformer #Dynamic Tokenization #Patch Scheduling

Efficient Text-Guided Convolutional Adapter for the Diffusion Model

Intermediate
Aryan Das, Koushik Biswas et al. · Feb 16 · arXiv

This paper introduces Nexus Adapters, tiny helper networks that let a diffusion model follow both a text prompt and a structure map (like edges or depth) at the same time.

#Nexus Adapter #text-guided adapter #cross-attention

iFSQ: Improving FSQ for Image Generation with 1 Line of Code

Intermediate
Bin Lin, Zongjian Li et al. · Jan 23 · arXiv

This paper fixes a hidden flaw in a popular image tokenizer (FSQ) with a simple one-line change to its activation function.

#image generation #finite scalar quantization #iFSQ
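
For context on the summary above, here is a minimal sketch of the baseline FSQ forward pass, where the "activation function" is the tanh that bounds each latent channel before rounding. It assumes an odd number of levels per channel and omits the straight-through gradient used in training. The function name `fsq_quantize` is illustrative, and the paper's own one-line replacement is not reproduced here.

```python
import numpy as np

def fsq_quantize(z, levels=5):
    """Baseline FSQ forward pass (sketch) for an odd number of levels."""
    half = (levels - 1) / 2.0
    # Bound each channel to the open interval (-half, half) with tanh.
    bounded = half * np.tanh(z)
    # Snap to the nearest integer, giving one of `levels` discrete codes.
    return np.round(bounded)

z = np.array([-10.0, -0.3, 0.0, 0.3, 10.0])
codes = fsq_quantize(z, levels=5)
```

With `levels=5`, every output falls in the set {-2, -1, 0, 1, 2}, so a latent with d channels indexes an implicit codebook of 5^d entries.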

OpenVision 3: A Family of Unified Visual Encoder for Both Understanding and Generation

Intermediate
Letian Zhang, Sucheng Ren et al. · Jan 21 · arXiv

OpenVision 3 is a single vision encoder that learns one set of image tokens that work well for both understanding images (like answering questions) and generating images (like making new pictures).

#Unified Visual Encoder #VAE #Vision Transformer

Efficient Camera-Controlled Video Generation of Static Scenes via Sparse Diffusion and 3D Rendering

Intermediate
Jieying Chen, Jeffrey Hu et al. · Jan 14 · arXiv

This paper shows how to make long, camera-controlled videos much faster by generating only a few smart keyframes with diffusion, then filling in the rest using a 3D scene and rendering.

#camera-controlled video generation #sparse keyframes #3D reconstruction

Boosting Latent Diffusion Models via Disentangled Representation Alignment

Intermediate
John Page, Xuesong Niu et al. · Jan 9 · arXiv

This paper shows that the best VAEs for image generation are the ones whose latents neatly separate object attributes, a property called semantic disentanglement.

#Send-VAE #semantic disentanglement #latent diffusion

Is There a Better Source Distribution than Gaussian? Exploring Source Distributions for Image Flow Matching

Intermediate
Junho Lee, Kwanseok Kim et al. · Dec 20 · arXiv

Flow Matching is like teaching arrows to push points from a simple cloud (source) to real pictures (target); most people start from a Gaussian cloud because it spreads equally in all directions.

#flow matching #conditional flow matching #source distribution
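
The "arrows" in the Flow Matching summary above can be sketched as a regression target. This is a minimal sketch assuming the common straight-line path between a source sample and a target sample. It shows the generic conditional flow matching setup, not this paper's specific method, and the name `flow_matching_pair` and the toy 2-D data are illustrative assumptions.

```python
import numpy as np

def flow_matching_pair(x0, x1, t):
    """Point on the straight path from x0 to x1 at time t, plus its velocity."""
    t = np.asarray(t, dtype=float).reshape(-1, 1)
    x_t = (1.0 - t) * x0 + t * x1   # interpolate source -> target
    v_target = x1 - x0              # the "arrow" a network would regress onto
    return x_t, v_target

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 2))         # source: Gaussian cloud
x1 = rng.standard_normal((4, 2)) + 3.0   # stand-in for real data
t = rng.uniform(size=4)                  # one random time per sample
x_t, v = flow_matching_pair(x0, x1, t)
```

A model trained this way takes `(x_t, t)` as input and is penalized by the squared error against `v`. Swapping the line that draws `x0` is where a non-Gaussian source distribution would enter.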

Alchemist: Unlocking Efficiency in Text-to-Image Model Training via Meta-Gradient Data Selection

Intermediate
Kaixin Ding, Yang Zhou et al. · Dec 18 · arXiv

Alchemist is a smart data picker for training text-to-image models that learns which pictures and captions actually help the model improve.

#meta-gradient #data selection #text-to-image

REGLUE Your Latents with Global and Local Semantics for Entangled Diffusion

Intermediate
Giorgos Petsangourakis, Christos Sgouropoulos et al. · Dec 18 · arXiv

Latent diffusion models are great at making images but learn the meaning of scenes slowly because their training goal mostly teaches them to clean up noise, not to understand objects and layouts.

#latent diffusion #REGLUE #representation entanglement

Towards Scalable Pre-training of Visual Tokenizers for Generation

Intermediate
Jingfeng Yao, Yuda Song et al. · Dec 15 · arXiv

The paper tackles a paradox: visual tokenizers that get great pixel reconstructions often make worse images when used for generation.

#visual tokenizer #latent space #Vision Transformer

Bidirectional Normalizing Flow: From Data to Noise and Back

Intermediate
Yiyang Lu, Qiao Sun et al. · Dec 11 · arXiv

Normalizing Flows are models that learn how to turn real images into simple noise and then back again.

#Normalizing Flow #Bidirectional Normalizing Flow #Hidden Alignment
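
The "data to noise and back" idea in the Normalizing Flow summary above can be illustrated with the simplest invertible map, an element-wise affine transform. This is a generic sketch of an exactly invertible flow layer, not the bidirectional method from the paper. The class name `AffineFlow` and its parameter values are illustrative assumptions.

```python
import numpy as np

class AffineFlow:
    """Element-wise affine flow: z = (x - shift) * exp(-log_scale)."""

    def __init__(self, shift, log_scale):
        self.shift = np.asarray(shift, dtype=float)
        self.log_scale = np.asarray(log_scale, dtype=float)

    def forward(self, x):
        """Data -> noise. Also returns log|det dz/dx| for each sample."""
        z = (x - self.shift) * np.exp(-self.log_scale)
        log_det = np.full(x.shape[0], -np.sum(self.log_scale))
        return z, log_det

    def inverse(self, z):
        """Noise -> data, the exact inverse of forward."""
        return z * np.exp(self.log_scale) + self.shift

flow = AffineFlow(shift=[1.0, -2.0], log_scale=[0.5, -0.2])
x = np.array([[0.0, 0.0], [1.0, 2.0]])
z, log_det = flow.forward(x)
x_back = flow.inverse(z)   # recovers x up to floating-point error
```

The log-determinant is what lets a flow assign an exact likelihood to data: log p(x) = log p(z) + log|det dz/dx|.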