This paper turns a video model into a step-by-step visual thinker that makes one final, high-quality picture from a text prompt.
APOLLO is a single, unified model that can make video and audio together or separately, and it keeps them tightly in sync.
DreamStyle is a single video-stylizing model that can follow text, copy a style image, or continue from a stylized first frame—without switching tools.
NitroGen is a vision-to-action AI that learns to play many video games by watching 40,000 hours of gameplay videos from over 1,000 titles with on-screen controller overlays.
DreamID-V is a new AI method that swaps faces in videos while keeping the body movements, expressions, lighting, and background steady and natural.
Computers can click quickly and precisely, like a woodpecker pecking, but they struggle to drag smoothly like a human hand; this paper fixes that.
FlowBlending is a simple way to speed up video diffusion models by smartly choosing when to use a big model and when a small one is enough.
This paper teaches text-to-video models to follow real-world physics, so people, balls, water, glass, and fire act the way they should.
GR-Dexter is a full package—new robot hands, a smart AI brain, and lots of carefully mixed data—that lets a two-handed robot follow language instructions to do long, tricky tasks.
Transparent and shiny objects confuse normal depth cameras, but video diffusion models have already learned how light bends through and reflects off them.
Robots often get confused on long, multi-step tasks when they only see the final goal image and try to guess the next move directly.
SurgWorld teaches surgical robots from videos plus text, then infers the missing robot moves so good policies can be trained without collecting tons of real robot-action data.