Recursive transformers save memory by reusing the same layer over and over, but that makes them less expressive and hurts accuracy.
MetaCanvas lets a multimodal language model (MLLM) sketch a plan inside the generator’s hidden canvas so diffusion models can follow it patch by patch.
Omni-Attribute is a new image encoder that learns just the parts of a picture you ask for (like hairstyle or lighting) and ignores the rest.
InfiniteVL is a vision-language model that mixes two ideas: local focus with Sliding Window Attention and long-term memory with a linear module called Gated DeltaNet.
The paper shows that many AI image generators are trained to prefer one popular idea of beauty, even when a user clearly asks for something messy, dark, blurry, or emotionally heavy.