This paper shows that generating short videos can help AI plan and reason in pictures more effectively than writing its steps out in text.
VideoMaMa is a model that turns simple black-and-white object masks into soft, precise cutouts (alpha mattes) for every frame of a video.
Cities are full of places defined by how people use them, like schools and parks, which are hard to recognize from space without extra clues.
With Goal Force, video models can now be told the physical result you want (like “make this ball move left with a strong push”) instead of just vague text or a final picture.
Transparent and shiny objects confuse normal depth cameras, but video diffusion models have already learned how light bends through and reflects off them.
Robots often get confused on long, multi-step tasks when they only see the final goal image and try to guess the next move directly.
This paper builds a foundation model called DAP that estimates real-world (metric) depth from any 360° panorama, indoors or outdoors.
SHARP turns a single photo into a 3D scene you can look around in, all in under one second on a single GPU.
UnityVideo is a single, unified model that learns from many kinds of video information at once—like colors (RGB), depth, motion (optical flow), body pose, skeletons, and segmentation—to make smarter, more realistic videos.