Text-to-image models trained with GRPO used to assign the same final reward to every generation step, which is like giving the whole team the same grade no matter who did what; this paper argues each step should be credited for its own contribution.
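The contrast can be sketched in a few lines. This is a toy illustration of the credit-assignment idea, not the paper's actual method; both function names and values are made up:

```python
# Toy contrast between terminal-only reward (every step gets the same
# grade) and per-step credit (each step graded on its contribution).
# All names and numbers here are illustrative, not from the paper.

def terminal_credit(num_steps, final_reward):
    # Classic setup: broadcast the final reward to all steps.
    return [final_reward] * num_steps

def per_step_credit(step_rewards):
    # Per-step setup: each step keeps its own reward signal.
    return list(step_rewards)

print(terminal_credit(4, 1.0))               # [1.0, 1.0, 1.0, 1.0]
print(per_step_credit([0.9, 0.5, 0.1, 0.2]))  # [0.9, 0.5, 0.1, 0.2]
```

With terminal-only credit, a step that hurt the image is rewarded just as much as one that helped, which is exactly the signal the per-step scheme tries to recover.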
This paper fixes a common problem in multimodal AI: models can understand pictures and words well but stumble when asked to create matching images.
StageVAR makes image-generating AI much faster by recognizing that early steps set the meaning and structure, while later steps just polish details.
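The speedup intuition can be shown with a toy compute budget: spend full compute on the early, semantics-setting steps and cheap compute on the later polishing steps. This is a hypothetical sketch of the stage-aware idea, not StageVAR's implementation; the split ratio and costs are invented:

```python
# Toy stage-aware compute schedule (illustrative only): early steps
# run the full model, later steps run a lighter one.

def cost_per_step(step, total_steps, split=0.3,
                  full_cost=10, light_cost=2):
    # First `split` fraction of steps: full model; the rest: light model.
    return full_cost if step < split * total_steps else light_cost

total = 10
baseline = total * 10  # full model at every step
staged = sum(cost_per_step(s, total) for s in range(total))
print(baseline, staged)  # 100 44
```

Because only 3 of the 10 steps pay full price here, total cost drops from 100 to 44 while the structure-setting steps keep their full budget.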
Sparse-LaViDa makes diffusion-style AI models much faster by skipping unhelpful masked tokens during generation while keeping quality the same.
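The token-skipping idea can be sketched as a positions filter: keep every already-decoded token but feed only a few masked slots through the model each step, instead of the full sequence. This is a hypothetical illustration of the general technique, not Sparse-LaViDa's code; `MASK`, the function names, and `keep_masked` are all made up:

```python
# Toy sketch of sparse decoding for a masked-diffusion model:
# process decoded tokens plus only a few masked positions per step,
# rather than the whole sequence. Names are illustrative.

MASK = -1

def dense_positions(tokens):
    # Baseline: every position goes through the model, masked or not.
    return list(range(len(tokens)))

def sparse_positions(tokens, keep_masked=2):
    # Keep all decoded tokens, plus only `keep_masked` masked slots.
    decoded = [i for i, t in enumerate(tokens) if t != MASK]
    masked = [i for i, t in enumerate(tokens) if t == MASK]
    return decoded + masked[:keep_masked]

seq = [7, MASK, MASK, 3, MASK, MASK]
print(len(dense_positions(seq)))   # 6 positions processed
print(len(sparse_positions(seq)))  # 4 positions processed
```

Since masked tokens dominate the sequence early in generation, dropping most of them from each forward pass is where the speedup comes from.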