The paper shows that using information from many layers of a language model (not just one) helps text-to-image diffusion transformers follow prompts much better.
Before this work, most text-to-image models relied on VAEs, which compress images into small latent codes, and struggled with slow training and overfitting on high-quality fine-tuning sets.
This paper shows how to make powerful image-generating Transformers run fast on phones without needing the cloud.
Transformers are powerful but slow because regular self-attention compares every token with every other token, so its cost grows quadratically with sequence length and becomes too expensive for long sequences.
TAG-MoE is a new way to steer Mixture-of-Experts (MoE) models using clear task hints, so the right “mini-experts” handle the right parts of an image job.
Image-to-video models often keep the picture looking right but ignore parts of the text instructions.
This paper makes video editing easier by teaching an AI to spread changes from the first frame across the whole video smoothly and accurately.
Robots often get confused on long, multi-step tasks when they only see the final goal image and try to guess the next move directly.
SpotEdit is a training-free way to edit only the parts of an image that actually change, instead of re-generating the whole picture.
DreaMontage is a new AI method that generates long, single-shot videos that feel smooth and connected, even when it is guided by scattered images or short clips placed in the middle of the video.
This paper protects your photos from being misused by new AI image editors that can copy your face or style from just one picture.
MetaCanvas lets a multimodal large language model (MLLM) sketch a plan inside the generator’s hidden canvas so diffusion models can follow it patch by patch.