This paper shows that generating short videos can help an AI plan and reason in pictures more effectively than writing out its steps in text.
VideoMaMa is a model that turns simple black-and-white object masks into soft, precise cutouts (alpha mattes) for every frame of a video.
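For readers unfamiliar with the distinction, the gap between a black-and-white mask and an alpha matte comes down to the standard compositing rule I = alpha * F + (1 - alpha) * B, where alpha can take any value between 0 and 1 instead of just 0 or 1. The sketch below is a minimal, generic NumPy illustration of that idea only; it is not VideoMaMa's code, and the array names and toy frame sizes are made up for the example.

```python
import numpy as np

# A binary mask labels each pixel as fully object (1) or fully background (0).
binary_mask = np.array([[0, 0, 1],
                        [0, 1, 1],
                        [1, 1, 1]], dtype=np.float32)

# An alpha matte allows fractional coverage, so soft edges (hair, motion blur,
# semi-transparent regions) blend smoothly instead of showing hard stair-steps.
alpha_matte = np.array([[0.0, 0.2, 0.9],
                        [0.1, 0.7, 1.0],
                        [0.8, 1.0, 1.0]], dtype=np.float32)

def composite(alpha, foreground, background):
    """Standard alpha compositing: I = alpha * F + (1 - alpha) * B."""
    alpha = alpha[..., None]  # broadcast alpha over the RGB channels
    return alpha * foreground + (1.0 - alpha) * background

# Toy 3x3 RGB frames (values in [0, 1]) just to show the shapes involved.
foreground = np.ones((3, 3, 3), dtype=np.float32) * np.array([1.0, 0.5, 0.0])
background = np.zeros((3, 3, 3), dtype=np.float32)

hard_cutout = composite(binary_mask, foreground, background)  # jagged edges
soft_cutout = composite(alpha_matte, foreground, background)  # smooth edges
```

Doing this per frame of a video is what turns a rough segmentation into the "soft, precise cutouts" described above.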
Cities are full of places that are defined by how people use them, like schools and parks, and these are hard to spot from space without extra clues.
Transparent and shiny objects confuse normal depth cameras, but video diffusion models have already learned how light bends through and reflects off them.
Robots often get confused on long, multi-step tasks when they only see the final goal image and try to guess the next move directly.
This paper builds a foundation model called DAP that estimates real-world (metric) depth from any 360° panorama, indoors or outdoors.
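Because the predicted depth is metric, each panorama pixel can be lifted straight into a 3D point in meters using standard equirectangular (spherical) geometry. The snippet below is a generic illustration of that lifting step under one common axis convention; the function name, shapes, and fake depth map are invented for the example and are not taken from the DAP paper.

```python
import numpy as np

def panorama_to_points(depth):
    """Lift a metric depth map of an equirectangular panorama to 3D points.

    Generic spherical-projection geometry (y up, z forward); an illustration
    of the idea, not code from the DAP paper.
    """
    H, W = depth.shape
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    lon = (u + 0.5) / W * 2.0 * np.pi - np.pi      # longitude in [-pi, pi)
    lat = np.pi / 2.0 - (v + 0.5) / H * np.pi      # latitude in [-pi/2, pi/2]

    # Unit ray direction for every pixel of the panorama.
    x = np.cos(lat) * np.sin(lon)
    y = np.sin(lat)
    z = np.cos(lat) * np.cos(lon)
    rays = np.stack([x, y, z], axis=-1)

    # Metric depth means scaling the unit rays gives points in meters.
    return depth[..., None] * rays

# Example: a fake 256x512 depth map with everything 3 meters away.
depth = np.full((256, 512), 3.0, dtype=np.float32)
points = panorama_to_points(depth)   # shape (256, 512, 3), in meters
```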