Transparent and shiny objects confuse standard depth cameras, but video diffusion models have already learned how light bends through and reflects off them.
Robots often struggle with long, multi-step tasks when they only see the final goal image and have to guess the next move directly.
SurgWorld teaches surgical robots using videos plus text, then infers the missing robot actions so we can train good policies without collecting tons of real robot-action data.
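One common way to fill in missing actions is a small inverse-dynamics model: train it on the few videos that do have action labels, then use it to pseudo-label the rest. This is a minimal sketch of that general idea, not SurgWorld's actual pipeline; the 128-dim frame embeddings and 7-DoF actions are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Inverse-dynamics model: embeddings of two consecutive frames -> the action
# taken between them. 128-dim embeddings and 7-DoF actions are assumed.
idm = nn.Sequential(nn.Linear(2 * 128, 256), nn.ReLU(), nn.Linear(256, 7))
opt = torch.optim.Adam(idm.parameters(), lr=1e-3)

def train_step(frame_embs, actions):
    """One step on the small action-labeled subset.
    frame_embs: (T, 128) per-frame embeddings; actions: (T-1, 7)."""
    pairs = torch.cat([frame_embs[:-1], frame_embs[1:]], dim=1)
    loss = ((idm(pairs) - actions) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

@torch.no_grad()
def pseudo_label(frame_embs):
    """Infer actions for an unlabeled video so a policy can train on it."""
    pairs = torch.cat([frame_embs[:-1], frame_embs[1:]], dim=1)
    return idm(pairs)  # (T-1, 7) inferred actions
```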
Dream-VL and Dream-VLA use a diffusion language model backbone to understand images, talk about them, and plan actions better than many regular (autoregressive) models.
HiStream makes 1080p video generation much faster by removing redundant work across spatial regions, video frames, and denoising steps.
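"Removing repeated work" across steps often means caching: if an expensive block's input barely changed since the last denoising step, reuse its cached output instead of recomputing. This is a generic illustration of step-level feature caching, not HiStream's actual mechanism; the wrapped block, the threshold `tau`, and the reuse rule are all assumptions.

```python
import torch

class CachedBlock(torch.nn.Module):
    """Wraps an expensive denoiser block and reuses its output across
    denoising steps when the input has barely changed (a generic sketch)."""
    def __init__(self, block, tau=0.05):
        super().__init__()
        self.block = block      # expensive inner block of the denoiser
        self.tau = tau          # relative-change threshold for reuse
        self.prev_in = None
        self.prev_out = None

    def forward(self, x):
        if self.prev_in is not None:
            change = (x - self.prev_in).norm() / (self.prev_in.norm() + 1e-8)
            if change < self.tau:    # input barely moved since last step:
                return self.prev_out # reuse cached output, skip the compute
        out = self.block(x)
        self.prev_in, self.prev_out = x.detach(), out.detach()
        return out
```

In practice the cache would be reset between videos, and the threshold tuned so quality is preserved while skips stay frequent.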
Flow Matching is like teaching arrows (a velocity field) to push points from a simple cloud (the source) onto real pictures (the target); most people start from a Gaussian cloud because it is easy to sample and spreads the same in every direction (it is isotropic).
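Concretely, the "arrows" are a learned velocity field: pick a data point, a Gaussian point, and a random time, stand on the straight line between them, and train the network to predict that line's direction. A minimal 2-D sketch in PyTorch; the toy ring dataset, network size, and step counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Tiny velocity-field network: input (x_t, t) -> predicted velocity.
vf = nn.Sequential(nn.Linear(3, 64), nn.SiLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(vf.parameters(), lr=1e-3)

def sample_data(n):  # toy "real pictures": points on a unit ring
    theta = torch.rand(n, 1) * 2 * torch.pi
    return torch.cat([theta.cos(), theta.sin()], dim=1)

for step in range(2000):
    x1 = sample_data(256)              # target samples
    x0 = torch.randn_like(x1)          # isotropic Gaussian source
    t = torch.rand(x1.size(0), 1)      # random time in [0, 1]
    xt = (1 - t) * x0 + t * x1         # point on the straight path
    v_target = x1 - x0                 # velocity of that path
    v_pred = vf(torch.cat([xt, t], dim=1))
    loss = ((v_pred - v_target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Sampling: integrate dx/dt = vf(x, t) from t=0 to t=1 with Euler steps.
with torch.no_grad():
    x = torch.randn(512, 2)            # start from the Gaussian cloud
    for i in range(100):
        t = torch.full((x.size(0), 1), i / 100)
        x = x + vf(torch.cat([x, t], dim=1)) / 100
```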
SAM Audio is a new AI that can pull out exactly the sound you want from a noisy mix using text, clicks on a video, and time ranges, either together or separately.
ReCo is a new way to edit videos just by telling the computer what to change in words; no extra masks are needed.
Robots learn best from what they would actually see, which is a first-person (egocentric) view, but most AI models are trained on third-person videos and get confused.
Spatia is a video generator that keeps a live 3D map of the scene (a point cloud) as its memory while making videos.
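A point-cloud memory can be as simple as back-projecting each generated frame's pixels into world space and appending them to a growing map that conditions later frames. The sketch below assumes known per-frame depth, intrinsics `K`, and camera poses; it illustrates the data structure, not Spatia's actual pipeline.

```python
import numpy as np

class PointCloudMemory:
    """Accumulates world-space points from frames with known depth and pose."""
    def __init__(self):
        self.points = np.empty((0, 3))  # accumulated world-space points

    def add_frame(self, depth, K, cam_to_world):
        """depth: (H, W) metric depth; K: (3, 3) intrinsics;
        cam_to_world: (4, 4) camera pose."""
        h, w = depth.shape
        u, v = np.meshgrid(np.arange(w), np.arange(h))
        pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
        rays = pix @ np.linalg.inv(K).T         # back-project pixels to rays
        cam_pts = rays * depth.reshape(-1, 1)   # scale each ray by its depth
        R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
        world = cam_pts @ R.T + t               # camera -> world coordinates
        self.points = np.concatenate([self.points, world], axis=0)
```

The generator would then render or encode `self.points` from the next camera pose as conditioning, so new frames stay consistent with everything already generated.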
IC-Effect is a new way to add special effects to existing videos by following a text instruction while keeping everything else unchanged.
The paper turns one flat picture into a neat stack of see‑through layers, so you can edit one thing without messing up the rest.