Vision-Language-Action (VLA) models are powerful but too big and slow to run on many real-world robots and devices.
This paper speeds up image and video generators called diffusion transformers by changing how big their puzzle pieces (patches) are at each step.
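To make the patch idea concrete, here is a minimal NumPy sketch (not the paper's actual code) of splitting an image into patches of different sizes: bigger patches mean fewer tokens and cheaper steps, smaller patches mean more tokens and finer detail. The function name and shapes are illustrative assumptions.

```python
import numpy as np

def patchify(image, patch_size):
    """Split a square (H, W, C) image into flat patches of a given size."""
    h, w, c = image.shape
    p = patch_size
    assert h % p == 0 and w % p == 0
    patches = image.reshape(h // p, p, w // p, p, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * c)
    return patches

img = np.zeros((64, 64, 3))
# Coarse patches (early, noisy steps): fewer tokens, cheaper compute.
print(patchify(img, 16).shape)  # -> (16, 768)
# Fine patches (late, detailed steps): more tokens, more detail.
print(patchify(img, 8).shape)   # -> (64, 192)
```

Varying `patch_size` across denoising steps is the knob the paper tunes: coarse where noise dominates, fine where detail matters.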
DeepGen 1.0 is a small 5B-parameter model that can both make new images and smartly edit existing ones from text instructions.
FastVMT is a faster way to copy motion from one video to another without training a new model for each video.
The paper shows that using information from many layers of a language model (not just one) helps text-to-image diffusion transformers follow prompts much better.
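As a rough illustration of "many layers instead of one," the NumPy sketch below fuses hidden states from every layer of a hypothetical text encoder with a weighted sum; the fixed uniform weights stand in for whatever learned fusion the paper uses, and all shapes here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical text-encoder hidden states: (num_layers, seq_len, dim).
num_layers, seq_len, dim = 12, 8, 64
hidden_states = rng.standard_normal((num_layers, seq_len, dim))

# Single-layer conditioning: use only the final layer's features.
single = hidden_states[-1]                            # (8, 64)

# Multi-layer conditioning: a weighted sum over all layers
# (uniform weights here; a real model would learn them).
weights = np.full(num_layers, 1.0 / num_layers)       # (12,)
fused = np.tensordot(weights, hidden_states, axes=1)  # (8, 64)

print(single.shape, fused.shape)
```

The fused features keep the same shape as the single-layer ones, so they can drop into the same cross-attention conditioning slot.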
Before this work, most text-to-image models generated images through VAE latents (small, compressed image codes) and struggled with slow training and with overfitting on small, high-quality fine-tuning sets.
This paper shows how to make powerful image-generating Transformers run fast on phones without needing the cloud.
Transformers are powerful but slow because regular self-attention compares every token with every other token, which grows too fast for long sequences.
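The "compares every token with every other token" cost can be seen in a naive self-attention sketch in NumPy (illustrative only, with identity projections instead of learned ones): the score matrix has n × n entries, so doubling the sequence length quadruples the work.

```python
import numpy as np

def self_attention(x):
    """Naive self-attention: every token attends to every other token."""
    n, d = x.shape
    q, k, v = x, x, x                      # identity projections for brevity
    scores = q @ k.T / np.sqrt(d)          # (n, n): the quadratic part
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

for n in (128, 256, 512):
    x = np.zeros((n, 8))
    out = self_attention(x)
    # Score-matrix size n*n grows quadratically with sequence length.
    print(n, out.shape, n * n)
```

This O(n²) blowup is what fast-attention work tries to avoid for long sequences.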
TAG-MoE is a new way to steer Mixture-of-Experts (MoE) models using clear task hints, so the right “mini-experts” handle the right parts of an image-generation task.
Image-to-Video models often keep the picture looking right but ignore parts of the text instructions.
This paper makes video editing easier by teaching an AI to spread changes from the first frame across the whole video smoothly and accurately.
This paper shows a simple way to make image-generating AIs (diffusion Transformers) produce clearer, more accurate pictures by letting the model guide itself from the inside.