This paper fixes a common problem in multimodal AI: models can understand pictures and words well but stumble when asked to create matching images.
Sparse-LaViDa makes diffusion-style AI models much faster by skipping unhelpful masked tokens during generation while keeping quality the same.