FlowBlending is a simple way to speed up video diffusion models by smartly choosing when to use a big model and when a small one is enough.
This paper teaches text-to-video models to follow real-world physics, so people, balls, water, glass, and fire act the way they should.
GR-Dexter is a full package—new robot hands, a smart AI brain, and lots of carefully mixed data—that lets a two-handed robot follow language instructions to do long, tricky tasks.
Transparent and shiny objects confuse normal depth cameras, but video diffusion models have already learned how light bends and reflects through them.
Robots often get confused on long, multi-step tasks when they only see the final goal image and try to guess the next move directly.
SurgWorld teaches surgical robots using videos plus text, then guesses the missing robot moves so we can train good policies without collecting tons of real robot-action data.
Dream-VL and Dream-VLA use a diffusion language model backbone to understand images, talk about them, and plan actions better than many regular (autoregressive) models.
HiStream makes 1080p video generation much faster by removing repeated work across space, time, and steps.
Flow Matching is like teaching arrows to push points from a simple cloud (source) to real pictures (target); most people start from a Gaussian cloud because it spreads points evenly, with no preferred direction.
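To make the arrow picture concrete, here is a minimal NumPy sketch of the standard conditional flow matching setup (a toy illustration of the general technique, not any specific paper's code): pairs are drawn from a Gaussian source and a toy target, the straight-line path gives a velocity target of `x1 - x0`, and for simplicity the "model" is just a single constant velocity vector fit to those targets.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-D "pictures": target points clustered around (5, 5).
target = rng.normal(loc=5.0, scale=0.5, size=(1024, 2))
# Gaussian source cloud centered at the origin (no preferred direction).
source = rng.normal(loc=0.0, scale=1.0, size=(1024, 2))

# Straight-line path: x_t = (1 - t) * x0 + t * x1,
# so the velocity the model should predict is x1 - x0.
t = rng.uniform(size=(1024, 1))
x_t = (1 - t) * source + t * target
v_target = target - source

# A real model regresses v_target from (x_t, t); here the simplest
# possible stand-in is a single constant velocity (least-squares fit).
v_model = v_target.mean(axis=0)

# Sampling: push source points along the learned field with Euler steps.
steps = 10
x = source.copy()
for _ in range(steps):
    x = x + v_model / steps

print(np.round(x.mean(axis=0), 1))  # the cloud ends up near the target mean
```

With a neural network in place of the constant vector, the same recipe (sample a pair, pick a random `t`, regress the velocity) is the whole training loop.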
SAM Audio is a new AI that can pull out exactly the sound you want from a noisy mix using text, clicks on a video, and time ranges—together or separately.
ReCo is a new way to edit videos just by telling the computer what to change with words, no extra masks needed.
Robots learn best from their own first-person (egocentric) view, since that is what they actually see, but most AI models are trained on third-person videos and get confused.