SAM Audio is a new AI that can pull out exactly the sound you want from a noisy mix using text, clicks on a video, and time ranges—together or separately.
ReCo is a new way to edit videos just by telling the computer what to change with words, no extra masks needed.
Robots learn best from a first-person (egocentric) view, the view they would actually see themselves, but most AI models are trained on third-person videos and get confused.
Spatia is a video generator that keeps a live 3D map of the scene (a point cloud) as its memory while making videos.
IC-Effect is a new way to add special effects to existing videos by following a text instruction while keeping everything else unchanged.
The paper turns one flat picture into a neat stack of see‑through layers, so you can edit one thing without messing up the rest.
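The "stack of see-through layers" idea rests on standard alpha compositing: each layer carries its own colors plus a transparency value, and blending them back to front recovers the flat picture. Here is a minimal sketch of that standard "over" blend (generic graphics math, not this paper's actual method; the `composite` helper is hypothetical):

```python
import numpy as np

def composite(layers):
    """Blend a back-to-front list of (rgb, alpha) layers into one flat image.

    This is the standard 'over' operator: each layer covers what is behind
    it in proportion to its alpha (transparency) value.
    """
    out = np.zeros(3)
    for rgb, alpha in layers:  # back layer first, front layer last
        out = alpha * np.asarray(rgb) + (1.0 - alpha) * out
    return out

# An opaque red background with a half-transparent blue layer on top.
flat = composite([([1.0, 0.0, 0.0], 1.0),   # background layer
                  ([0.0, 0.0, 1.0], 0.5)])  # foreground layer
print(flat)  # [0.5 0.  0.5]
```

Because each layer is stored separately, you can recolor or delete the blue layer and re-composite without ever touching the red one, which is exactly the editing benefit the summary describes.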
Steer3D lets you change a 3D object just by typing what you want, like “add a roof rack,” and it does it in one quick pass.
Robots often see the world as flat pictures but must move in a 3D world, which makes it hard to act accurately.
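The gap between flat pictures and 3D action comes down to geometry: a pixel alone does not tell you how far away a point is. A minimal sketch of the standard pinhole-camera "unprojection" (generic textbook math, not any specific paper's pipeline; the intrinsics values here are made up for illustration):

```python
import numpy as np

# Hypothetical camera intrinsics, chosen for illustration only:
# fx, fy are focal lengths in pixels; (cx, cy) is the principal point.
fx, fy, cx, cy = 500.0, 500.0, 320.0, 240.0

def unproject(u, v, depth):
    """Lift a 2D pixel (u, v) with a known depth into a 3D camera-frame point."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth])

# A pixel at the image center, 2 m away, sits straight ahead of the camera.
print(unproject(320.0, 240.0, 2.0))  # [0. 0. 2.]
```

Without the `depth` input, the same pixel could correspond to any point along that ray, which is why purely 2D perception makes precise 3D manipulation hard.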
Big text-to-image models make amazing pictures but are slow because they take hundreds of tiny steps to turn noise into an image.
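The "hundreds of tiny steps" refers to iterative sampling: the model starts from pure noise and repeatedly removes a small amount of it. A toy sketch of that loop (a simplified illustration of the iterative idea, not a real diffusion sampler; the `target` stands in for the clean image a trained model would steer toward):

```python
import numpy as np

rng = np.random.default_rng(0)

target = np.ones(4)       # stand-in for the "clean image" the model predicts
x = rng.normal(size=4)    # start from pure random noise
num_steps = 200           # hundreds of small denoising steps

for _ in range(num_steps):
    # Each step removes only a small fraction of the remaining noise,
    # which is why sampling takes so many iterations end to end.
    x = x + 0.05 * (target - x)

print(np.round(x, 3))  # after 200 steps, x is essentially the target
```

Each step is cheap, but running a large network hundreds of times in sequence is what makes generation slow, and it is the bottleneck that fast-sampling and distillation work tries to remove.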
This paper shows you can train a big text-to-image diffusion model directly on the features of a vision foundation model (like DINOv3) without using a VAE.
Omni-Attribute is a new image encoder that learns just the parts of a picture you ask for (like hairstyle or lighting) and ignores the rest.
UniUGP is a single system that learns to understand road scenes, explain its thinking, plan safe paths, and even imagine future video frames.