Papers3

#Latent Space

Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders

Shengbang Tong, Boyang Zheng et al.Jan 22arXiv

Before this work, most text-to-image models used VAEs (small, squished image codes) and struggled with slow training and overfitting on high-quality fine-tuning sets.

#Representation Autoencoder#RAE#Variational Autoencoder

Not triaged yet

FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-Language Navigation

Intermediate

Jing Zuo, Lingzhou Mu et al.Jan 20arXiv

FantasyVLN teaches a robot to follow language instructions while looking around, using a smart, step-by-step thinking style during training but not at test time.

#Vision-and-Language Navigation#Chain-of-Thought#Multimodal CoT

Not triaged yet

One Layer Is Enough: Adapting Pretrained Visual Encoders for Image Generation

Intermediate

Yuan Gao, Chen Chen et al.Dec 8arXiv

This paper shows that we can turn big, smart vision features into a small, easy-to-use code for image generation with just one attention layer.

#Feature Auto-Encoder#FAE#Self-Supervised Learning

Not triaged yet