The paper trains a single model from scratch to both read text and see images and videos, instead of starting from a language-only model.
The paper shows that the popular PPO method for training language models is unfair to rare words and too gentle with very common words, which makes learning slow and unstable.
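One way to make this claim concrete, assuming it refers to PPO's clipped probability ratio (the summary does not say so explicitly): standard PPO keeps the ratio of new to old probability within 1 ± ε, so the allowed change in absolute probability scales with how likely the word already was. The sketch below is illustrative only; the helper names and ε = 0.2 are assumptions, not taken from the paper.

```python
# Illustrative sketch of standard PPO ratio clipping, not the paper's method.
# Helper names and eps=0.2 are assumptions chosen for the example.

def clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Standard PPO clipped objective for a single token."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

def max_prob_within_clip(p_old: float, eps: float = 0.2) -> float:
    """Largest new probability whose ratio to p_old stays within 1 + eps."""
    return min(1.0, p_old * (1.0 + eps))

for p_old in (0.01, 0.5, 0.9):
    p_max = max_prob_within_clip(p_old)
    # The absolute room to grow is tiny for rare words and large for common ones.
    print(f"p_old={p_old:.2f}  max p_new={p_max:.3f}  room={p_max - p_old:.3f}")
```

Under this reading, a word with old probability 0.01 can gain only about 0.002 before its update is clipped, while a word at 0.9 can reach probability 1.0 without ever hitting the upper clip.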