This paper studies Vision–Language–Action (VLA) robots under one fair setup to find which design choices truly matter.
This paper teaches image models to copy a change shown in one image pair and apply it to a new image, like saying 'hat added here, add a similar hat there.'
DreamZero is a robot brain that learns actions by predicting short videos of the future and the matching moves at the same time.
Video generators are slow because attention compares every part of the video with every other part, so the work grows very fast as videos get longer.
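A minimal illustrative sketch (not from the paper) of why "looking at everything" is expensive: full self-attention does roughly n × n comparisons for n tokens, so a modestly longer video costs much more.

```python
# Illustrative only: full self-attention makes every token attend to
# every other token, so cost grows quadratically with token count.

def attention_pairs(num_tokens: int) -> int:
    # n tokens each attend to all n tokens: n * n comparisons.
    return num_tokens * num_tokens

short_clip = attention_pairs(1_000)   # hypothetical short clip
long_clip = attention_pairs(4_000)    # 4x more tokens...
print(long_clip // short_clip)        # ...16x more attention work
```

This quadratic blow-up is what sparse or windowed attention schemes try to avoid.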
ArcFlow is a new way to make text-to-image models draw great pictures in only 2 steps instead of 50, giving about a 40× speed boost.
MOVA is an open-source AI that makes videos and sounds at the same time so mouths, actions, and noises match perfectly.
This paper tackles a big problem in fast image generators: as they sped up, they lost variety and kept producing similar-looking pictures.
PixelGen is a new image generator that works directly with pixels and uses what-looks-good-to-people guidance (perceptual loss) to improve quality.
FSVideo is a new image-to-video generator that runs about 42× faster than popular open-source models while keeping similar visual quality.
PromptRL teaches a language model to rewrite prompts while a flow-based image model learns to draw, and both are trained together using the same rewards.
This paper shows a simple, one-model way to dub videos that makes the new voice and the lips move together naturally.
DenseGRPO teaches image models using lots of small, timely rewards instead of one final score at the end.
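A tiny hypothetical sketch (function names are illustrative, not from the paper) of the difference between one final score and many timely ones: with a single end-of-trajectory reward, earlier generation steps get no direct signal, while a dense scheme attaches a small reward to every step.

```python
# Illustrative only: sparse vs dense reward signals over a
# multi-step image-generation trajectory.

def sparse_rewards(num_steps: int, final_score: float) -> list[float]:
    # One reward at the very end; all earlier steps get zero signal.
    return [0.0] * (num_steps - 1) + [final_score]

def dense_rewards(step_scores: list[float]) -> list[float]:
    # A small, timely reward after every step.
    return list(step_scores)

print(sparse_rewards(4, 1.0))               # [0.0, 0.0, 0.0, 1.0]
print(dense_rewards([0.2, 0.5, 0.8, 1.0]))  # [0.2, 0.5, 0.8, 1.0]
```

The dense variant tells the model which intermediate steps helped, instead of crediting the whole trajectory with one final number.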