Vision-Language-Action (VLA) models are powerful, but they are too big and slow to run on many real-world devices.
This paper speeds up image and video generators called diffusion transformers by changing how big their puzzle pieces (patches) are at each denoising step.
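To make the patch idea concrete, here is a minimal sketch, not the paper's actual code: the patch sizes and the step schedule below are illustrative assumptions. Bigger patches mean fewer tokens, so attention runs faster during the noisy early steps; smaller patches near the end recover detail.

```python
# Illustrative sketch: coarse patches early in denoising, fine patches at the end.
import torch

def patchify(x: torch.Tensor, patch: int) -> torch.Tensor:
    """Split a (B, C, H, W) image into a (B, N, C*patch*patch) token sequence."""
    b, c, h, w = x.shape
    x = x.unfold(2, patch, patch).unfold(3, patch, patch)   # (B, C, H/p, W/p, p, p)
    x = x.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * patch * patch)
    return x

def patch_size_for_step(t: float) -> int:
    """Hypothetical schedule: large patches while noise dominates, small at the end."""
    return 8 if t > 0.5 else 4 if t > 0.2 else 2

latent = torch.randn(1, 4, 32, 32)           # e.g. a VAE latent
for t in (0.9, 0.4, 0.1):                    # denoising progress, noisy -> clean
    p = patch_size_for_step(t)
    tokens = patchify(latent, p)
    print(f"t={t}: patch={p}, tokens={tokens.shape[1]}")  # fewer tokens => faster attention
```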
FastVMT is a faster way to copy motion from one video to another without training a new model for each video.
The paper shows that using information from many layers of a language model (not just the final one) helps text-to-image diffusion transformers follow prompts much better.
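A minimal sketch of the general idea, with the class name, the mixing scheme, and the shapes as my assumptions rather than the paper's exact method: instead of conditioning the diffusion transformer on only the text encoder's last hidden state, mix the hidden states from several layers using learned weights.

```python
# Illustrative sketch: blend per-layer text-encoder features with learned weights.
import torch
import torch.nn as nn

class MultiLayerTextConditioner(nn.Module):
    def __init__(self, num_layers: int, dim: int):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))  # learned mixing weights
        self.proj = nn.Linear(dim, dim)

    def forward(self, hidden_states: list[torch.Tensor]) -> torch.Tensor:
        # hidden_states: per-layer (B, T, D) outputs from the language model
        w = torch.softmax(self.layer_weights, dim=0)
        mixed = sum(wi * h for wi, h in zip(w, hidden_states))
        return self.proj(mixed)  # conditioning tokens fed to cross-attention

layers = [torch.randn(2, 77, 768) for _ in range(12)]  # e.g. 12 encoder layers
cond = MultiLayerTextConditioner(12, 768)(layers)
print(cond.shape)  # torch.Size([2, 77, 768])
```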
Before this work, most text-to-image models used VAEs (which squish images into small codes) and struggled with slow training and overfitting on high-quality fine-tuning sets.
This paper shows how to make powerful image-generating Transformers run fast on phones without needing the cloud.
Transformers are powerful but slow because regular self-attention compares every token with every other token, a cost that grows quadratically as sequences get longer.
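A small illustration of that quadratic cost (the shapes below are illustrative): the attention score matrix holds one entry per token pair, so doubling the sequence length quadruples the work.

```python
# Illustrative sketch: plain self-attention builds an n x n score matrix.
import torch
import torch.nn.functional as F

def self_attention(x: torch.Tensor) -> torch.Tensor:
    """Single-head attention over (B, n, d) tokens; Q, K, V are x itself for brevity."""
    d = x.shape[-1]
    scores = x @ x.transpose(-2, -1) / d**0.5      # (B, n, n): one score per token pair
    return F.softmax(scores, dim=-1) @ x

for n in (1_000, 2_000, 4_000):
    out = self_attention(torch.randn(1, n, 64))
    print(f"n={n}: output {tuple(out.shape)}, score matrix held {n*n:,} entries")
```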
TAG-MoE is a new way to steer Mixture-of-Experts (MoE) models with explicit task hints, so the right “mini-experts” handle the right parts of an image-generation job.
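A minimal sketch of task-guided routing, as my illustrative take on the idea rather than TAG-MoE's actual architecture: the gate sees an explicit task embedding alongside each token, so the task hint steers which experts fire.

```python
# Illustrative sketch: an MoE gate conditioned on a task embedding.
import torch
import torch.nn as nn

class TaskGuidedMoE(nn.Module):
    def __init__(self, dim: int, num_experts: int, num_tasks: int, top_k: int = 2):
        super().__init__()
        self.task_emb = nn.Embedding(num_tasks, dim)
        self.gate = nn.Linear(2 * dim, num_experts)          # token + task hint -> expert scores
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))
        self.top_k = top_k

    def forward(self, x: torch.Tensor, task_id: torch.Tensor) -> torch.Tensor:
        # x: (B, T, D) tokens; task_id: (B,) integer task labels
        hint = self.task_emb(task_id)[:, None, :].expand_as(x)
        logits = self.gate(torch.cat([x, hint], dim=-1))     # (B, T, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = torch.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):                          # route each token to its top-k experts
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e).unsqueeze(-1)
                out = out + mask * weights[..., k : k + 1] * expert(x)
        return out

moe = TaskGuidedMoE(dim=64, num_experts=4, num_tasks=3)
y = moe(torch.randn(2, 16, 64), torch.tensor([0, 2]))
print(y.shape)  # torch.Size([2, 16, 64])
```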
Image-to-Video models often stay faithful to the starting picture but ignore parts of the text instructions.
This paper makes video editing easier by teaching an AI to spread changes from the first frame across the whole video smoothly and accurately.
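A toy illustration of the propagation idea, under a strong simplifying assumption (a static scene, so pixels line up across frames): compute the edit as a delta on frame 0 and broadcast it to every frame. Real methods track motion so edits follow moving content; this only shows the "spread from the first frame" notion.

```python
# Illustrative sketch: propagate a first-frame edit to all frames as a delta.
import numpy as np

def propagate_first_frame_edit(video: np.ndarray, edited_frame0: np.ndarray) -> np.ndarray:
    """video: (T, H, W, 3) floats in [0, 1]; edited_frame0: (H, W, 3) after the user's edit."""
    delta = edited_frame0 - video[0]
    return np.clip(video + delta, 0.0, 1.0)   # broadcast the edit delta to all T frames

video = np.random.rand(8, 32, 32, 3)
edited0 = video[0].copy()
edited0[8:16, 8:16] = 1.0                     # hypothetical local edit on frame 0
out = propagate_first_frame_edit(video, edited0)
print(out.shape)  # (8, 32, 32, 3)
```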
Robots often get confused on long, multi-step tasks when they only see the final goal image and try to guess the next move directly.
SpotEdit is a training-free way to edit only the parts of an image that actually change, instead of re-generating the whole picture.
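A minimal sketch of the selective-editing idea (illustrative, not SpotEdit's actual pipeline): find where the edited image should differ, then blend so unmasked pixels are copied verbatim from the original and only the masked region comes from the generator.

```python
# Illustrative sketch: mask-blend a generated edit into the original image.
import numpy as np

def spot_blend(original: np.ndarray, edited: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """original/edited: (H, W, 3) float images; mask: (H, W) in [0, 1], 1 = region to edit."""
    m = mask[..., None]
    return m * edited + (1.0 - m) * original  # untouched pixels stay pixel-identical

h, w = 64, 64
original = np.random.rand(h, w, 3)
edited = np.random.rand(h, w, 3)              # stand-in for a full generator output
mask = np.zeros((h, w))
mask[16:48, 16:48] = 1.0                      # hypothetical detected edit region
out = spot_blend(original, edited, mask)
assert np.allclose(out[:16], original[:16])   # outside the mask, nothing changed
```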