This paper studies Vision–Language–Action (VLA) robots under one fair setup to find which design choices truly matter.
This paper teaches image models to copy a change shown in one image pair and apply it to a new image, like saying 'hat added here, add a similar hat there.'
DreamZero is a robot brain that learns actions by predicting short videos of the future and the matching moves at the same time.
Video generators are slow because attention compares every token with every other token, so the cost grows quadratically with video length.
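As a generic illustration (not this paper's method), here is a minimal NumPy sketch of full attention: the score matrix has one entry per query-key pair, so its size, and the work to fill it, grows as N² in the number of tokens N.

```python
import numpy as np

def full_attention(q, k, v):
    # Every query attends to every key: the score matrix is N x N,
    # so compute and memory grow quadratically with sequence length N.
    scores = q @ k.T / np.sqrt(q.shape[-1])              # shape (N, N)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over keys
    return weights @ v

n, d = 1024, 64                      # toy sizes, chosen for illustration
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
out = full_attention(q, k, v)
print(out.shape)                     # (1024, 64)
print(n * n)                         # score-matrix entries: over a million at just 1k tokens
```

For video, N covers every patch in every frame, so doubling the clip length roughly quadruples the attention cost, which is why speeding up or sparsifying attention is a common target.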
ArcFlow is a new way to make text-to-image models draw great pictures in only 2 steps instead of 50, giving about a 40× speed boost.
MOVA is an open-source AI that generates video and audio at the same time, so mouths, actions, and sounds stay in sync.
The paper tackles a big problem in fast image generators: they became quick, but they lost variety and kept producing similar-looking pictures.
PixelGen is a new image generator that works directly with pixels and uses what-looks-good-to-people guidance (perceptual loss) to improve quality.
FSVideo is a new image-to-video generator that runs about 42× faster than popular open-source models while keeping similar visual quality.
This paper teaches talking avatars not just to speak, but to look around their scene and handle nearby objects exactly as a text instruction says.
PromptRL teaches a language model to rewrite prompts while a flow-based image model learns to draw, and both are trained together using the same rewards.
This paper shows how to make a whole picture in one go, directly in pixels, without using a hidden “latent” space or many tiny steps.