Dream-VL and Dream-VLA use a diffusion language model backbone to understand images, talk about them, and plan actions better than many regular (autoregressive) models.
HiStream makes 1080p video generation much faster by removing repeated work across space, time, and denoising steps.
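HiStream's actual mechanism isn't detailed here, but one common way to remove repeated work across denoising steps is to cache a block's output and reuse it when the input has barely changed since the last step. A minimal PyTorch sketch of that generic idea (the wrapper, its threshold, and the skip rule are illustrative assumptions, not HiStream's design):

```python
import torch

class StepCachedBlock(torch.nn.Module):
    """Illustrative step-caching wrapper (not HiStream's actual method):
    reuse the previous step's output when the input has barely changed."""

    def __init__(self, block, tol=1e-2):
        super().__init__()
        self.block = block
        self.tol = tol
        self._last_in = None
        self._last_out = None

    def forward(self, x):
        if self._last_in is not None:
            # Relative change of the input since the last denoising step.
            delta = (x - self._last_in).norm() / (self._last_in.norm() + 1e-8)
            if delta < self.tol:
                return self._last_out      # skip the expensive computation
        out = self.block(x)
        self._last_in, self._last_out = x.detach(), out.detach()
        return out
```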
Flow Matching is like teaching arrows to push points from a simple cloud (source) to real pictures (target); most people start from a Gaussian cloud because it is easy to sample and spreads evenly in all directions.
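Concretely, the standard conditional flow matching objective regresses a network's predicted arrows onto straight-line arrows from source to target. A minimal PyTorch sketch, assuming a model `v_theta(x_t, t)` that predicts a velocity field (the name and signature are illustrative):

```python
import torch

def flow_matching_loss(v_theta, x1):
    """Conditional flow matching with a Gaussian source (illustrative sketch)."""
    x0 = torch.randn_like(x1)              # sample the simple Gaussian cloud (source)
    t = torch.rand(x1.shape[0], *([1] * (x1.dim() - 1)))  # random time in [0, 1]
    x_t = (1 - t) * x0 + t * x1            # point partway along the straight-line path
    target = x1 - x0                       # the "arrow" pointing source -> target
    return ((v_theta(x_t, t) - target) ** 2).mean()
```

Training repeats this on batches of real data `x1`; sampling then follows the learned arrows from noise to an image.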
SAM Audio is a new AI that can pull out exactly the sound you want from a noisy mix using text, clicks on a video, and time ranges—together or separately.
ReCo is a new way to edit videos just by telling the computer, in words, what to change; no extra masks are needed.
Robots learn best from what they would actually see, a first-person (egocentric) view, but most AI models are trained on third-person videos and get confused.
Spatia is a video generator that keeps a live 3D map of the scene (a point cloud) as its memory while making videos.
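The paper's full pipeline isn't reproduced here, but the core of a point-cloud memory can be sketched: unproject each frame's depth map into 3D points with the pinhole camera model and accumulate them as the video grows. In this hedged sketch, the depth input and the intrinsics `fx, fy, cx, cy` are assumptions for illustration:

```python
import torch

def unproject_depth(depth, fx, fy, cx, cy):
    """Turn an H x W depth map into an (H*W, 3) point cloud
    via the pinhole camera model (illustrative sketch)."""
    H, W = depth.shape
    v, u = torch.meshgrid(
        torch.arange(H, dtype=depth.dtype),
        torch.arange(W, dtype=depth.dtype),
        indexing="ij",
    )
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return torch.stack([x, y, depth], dim=-1).reshape(-1, 3)

# A "live 3D map": append each new frame's points as frames are generated.
scene_memory = []   # e.g. scene_memory.append(unproject_depth(d, fx, fy, cx, cy))
```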
IC-Effect is a new way to add special effects to existing videos by following a text instruction while keeping everything else unchanged.
The paper turns one flat picture into a neat stack of see-through layers, so you can edit one thing without messing up the rest.
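To make the "stack of see-through layers" concrete, the standard back-to-front "over" compositing operation reassembles transparent layers into one flat image; the RGBA layer format here is an assumption for illustration, not necessarily the paper's exact representation:

```python
import torch

def composite(layers):
    """Composite a back-to-front list of RGBA layers, each (H, W, 4) in [0, 1],
    into one flat RGB image with the standard 'over' operator."""
    out = torch.zeros_like(layers[0][..., :3])
    for layer in layers:                       # back to front
        rgb, alpha = layer[..., :3], layer[..., 3:]
        out = alpha * rgb + (1 - alpha) * out  # edit one layer; the others stay intact
    return out
```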
Steer3D lets you change a 3D object just by typing what you want, like “add a roof rack,” and it does it in one quick pass.
Robots often see the world as flat 2D pictures but must move in three dimensions, which makes accurate actions hard.
Big text-to-image models make amazing pictures but are slow because they take hundreds of tiny steps to turn noise into an image.
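The cost is easy to see in a generic iterative sampler: one full network forward pass per step, so total time grows linearly with the step count. A hedged sketch (conventions vary across models; here `t` runs from 0 at pure noise to 1 at the finished image, and `model` is an assumed velocity-predicting network):

```python
import torch

@torch.no_grad()
def sample(model, shape, num_steps=500):
    """Generic iterative sampler sketch: one network call per step,
    so generation cost scales linearly with num_steps."""
    x = torch.randn(shape)                  # start from pure noise
    ts = torch.linspace(0.0, 1.0, num_steps + 1)
    for i in range(num_steps):              # hundreds of tiny steps
        dt = ts[i + 1] - ts[i]
        v = model(x, ts[i])                 # the expensive part: a full forward pass
        x = x + dt * v                      # nudge the noise slightly toward an image
    return x
```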