The paper shows that teaching a language model with a special “reward-shaped” next-token objective can make later reinforcement learning (RL) work much better.
Autoregressive (AR) image models make pictures by choosing tokens one-by-one, but they were judged only on picking likely tokens, not on how good the final picture looks in pixels.
EMMA is a single AI model that can understand images, write about them, create new images from text, and edit images—all in one unified system.