Beyond Language Modeling: An Exploration of Multimodal Pretraining
Intermediate · Shengbang Tong, David Fan et al. · Mar 3 · arXiv
The paper trains a single model from scratch to both read text and see images/videos, rather than adapting a pretrained language-only model.
#multimodal pretraining · #representation autoencoder · #RAE