Papers (3)

#RAE

Beyond Language Modeling: An Exploration of Multimodal Pretraining

Shengbang Tong, David Fan et al. · Mar 3 · arXiv

The paper trains one model from scratch to both read text and see images/videos, instead of starting from a language-only model.

#multimodal pretraining #representation autoencoder #RAE

Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders

Intermediate

Shengbang Tong, Boyang Zheng et al. · Jan 22 · arXiv

Before this work, most text-to-image models relied on VAEs, which compress images into small latent codes, and they suffered from slow training and overfitting on high-quality fine-tuning sets.

#Representation Autoencoder #RAE #Variational Autoencoder

OpenVision 3: A Family of Unified Visual Encoder for Both Understanding and Generation

Intermediate

Letian Zhang, Sucheng Ren et al. · Jan 21 · arXiv

OpenVision 3 is a single vision encoder that learns one set of image tokens that work well for both understanding images (like answering questions) and generating images (like making new pictures).

#Unified Visual Encoder #VAE #Vision Transformer
