This paper shows a simple, one-model way to dub videos that keeps the new voice and the speaker's lip movements naturally in sync.
DenseGRPO teaches image models using lots of small, timely rewards instead of one final score at the end.
This paper shows how a video generator can improve its own videos during sampling, without extra training or outside checkers.
Robots often learn a bad habit called the vision shortcut: they guess the task just by looking, and ignore the words you tell them.
TwinBrainVLA is a robot brain with two halves: a frozen generalist that keeps world knowledge safe and a trainable specialist that learns to move precisely.
ShapeR builds clean, correctly sized 3D objects from messy, casual phone or glasses videos by using images, camera poses, sparse SLAM points, and short text captions together.
This paper teaches video-making AIs to follow real-world physics, so rolling balls roll right and collisions look believable.
HeartMuLa is a family of open-source music AI models that can understand and generate full songs with clear lyrics and strong musical structure.
This paper turns a video model into a step-by-step visual thinker that makes one final, high-quality picture from a text prompt.
APOLLO is a single, unified model that can make video and audio together or separately, and it keeps them tightly in sync.
ThinkRL-Edit teaches an image editor to think first and draw second, which makes tricky, reasoning-heavy edits much more accurate.
DreamStyle is a single video-stylizing model that can follow text, copy a style image, or continue from a stylized first frame—without switching tools.