The paper tackles a big problem in fast image generators: they got quick, but lost variety, producing pictures that all look alike.
PixelGen is a new image generator that works directly with pixels and uses what-looks-good-to-people guidance (perceptual loss) to improve quality.
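To make "what-looks-good-to-people guidance" concrete, a perceptual loss compares two images in the feature space of a pretrained network rather than pixel by pixel. The sketch below is my own minimal illustration, not PixelGen's actual setup: the "feature extractor" is a fixed random projection standing in for a real pretrained network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a pretrained feature extractor (a real system
# would use something like VGG features): a fixed linear map of the pixels.
W = rng.standard_normal((64, 32 * 32 * 3)) / np.sqrt(32 * 32 * 3)

def features(img):
    """Map a 32x32 RGB image (values in [0, 1]) to a feature vector."""
    return W @ img.reshape(-1)

def pixel_loss(a, b):
    """Plain mean squared error in pixel space."""
    return float(np.mean((a - b) ** 2))

def perceptual_loss(a, b):
    """Mean squared error in feature space instead of pixel space."""
    return float(np.mean((features(a) - features(b)) ** 2))

img = rng.random((32, 32, 3))
other = rng.random((32, 32, 3))
```

The design point is only where the comparison happens: identical images score zero under both losses, but a perceptual loss judges differences the way a trained network "sees" them, which tends to match human judgments better than raw pixel error.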
FSVideo is a new image-to-video generator that runs about 42× faster than popular open-source models while keeping similar visual quality.
This paper teaches talking avatars not just to speak, but to look around their scene and handle nearby objects exactly as a text instruction says.
PromptRL teaches a language model to rewrite prompts while a flow-based image model learns to draw, and both are trained together using the same rewards.
This paper shows how to make a whole picture in one go, directly in pixels, without using a hidden “latent” space or many tiny steps.
This paper shows a simple, single-model way to dub videos so that the new voice and the lip movements stay naturally in sync.
DenseGRPO teaches image models using lots of small, timely rewards instead of one final score at the end.
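The contrast between "lots of small, timely rewards" and "one final score" can be sketched in a few lines. This toy example is my own illustration, not DenseGRPO's actual algorithm: a 10-step sampler either gets a single score at the end, or a small score after every step, and we look at the return each step sees.

```python
NUM_STEPS = 10  # hypothetical number of sampling steps

def sparse_rewards(final_score):
    """One score at the very end; earlier steps get nothing."""
    return [0.0] * (NUM_STEPS - 1) + [final_score]

def dense_rewards(step_scores):
    """A small, timely score after every step."""
    assert len(step_scores) == NUM_STEPS
    return list(step_scores)

def return_to_go(rewards, gamma=0.99):
    """Discounted return seen from each step onward."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]
```

With the sparse version, every step's return is just a discounted echo of one distant final score, so early steps get a weak, noisy signal; with the dense version, each step's return directly reflects its own contribution, which is the intuition behind rewarding steps as they happen.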
This paper shows how a video generator can improve its own videos during sampling, without extra training or outside checkers.
Robots often learn a bad habit called the vision shortcut: they guess the task just by looking, and ignore the words you tell them.
TwinBrainVLA is a robot brain with two halves: a frozen generalist that keeps world knowledge safe and a trainable specialist that learns to move precisely.
ShapeR builds clean, correctly sized 3D objects from messy, casual phone or glasses videos by using images, camera poses, sparse SLAM points, and short text captions together.