Transparent and shiny objects confuse normal depth cameras, but video diffusion models have already learned how light bends through and reflects off them.
WorldCanvas lets you make videos where things happen exactly as you describe by combining three inputs: text (what happens), drawn paths called trajectories (when and where it happens), and reference images (who it is).
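To make the three controls concrete, here is a minimal sketch of how such conditioning signals could be bundled before being handed to a generator. The class names, fields, and usage are hypothetical illustrations of the idea, not WorldCanvas's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    """A drawn path: (time, x, y) points saying when and where the subject should be."""
    points: list[tuple[float, float, float]]

@dataclass
class GenerationRequest:
    """The three controls: text (what happens), trajectories (when/where), references (who)."""
    prompt: str                                                     # what happens
    trajectories: list[Trajectory] = field(default_factory=list)   # when and where
    reference_images: list[str] = field(default_factory=list)      # who it is (image paths)

# Hypothetical usage: describe an event, pin its path over time, and anchor identity.
request = GenerationRequest(
    prompt="a red kite swoops past the lighthouse",
    trajectories=[Trajectory(points=[(0.0, 0.1, 0.8), (2.0, 0.5, 0.4), (4.0, 0.9, 0.7)])],
    reference_images=["kite.png", "lighthouse.png"],
)
```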
Robots learn best from what they would actually see, which is a first-person (egocentric) view, but most AI models are trained on third-person videos and get confused.
This paper fixes a common problem in video-making AIs where tiny mistakes snowball over time and ruin long videos.
IC-Effect is a new way to add special effects to existing videos by following a text instruction while keeping everything else unchanged.
MMGR is a new benchmark that checks whether AI image and video generators follow real-world rules, not just whether their outputs look pretty.
LongVie 2 is a video world model that can generate controllable videos for 3–5 minutes while keeping the look and motion steady over time.
VideoSSM is a new way to make long, stable, and lively videos by giving the model two kinds of memory: a short-term window and a long-term state-space memory.
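As a rough illustration of the two-memory idea, the sketch below pairs a fixed-size window of recent frame features with a recurrently updated long-term state vector. The class, the exponential-decay update, and the dimensions are assumptions chosen for clarity, not VideoSSM's actual state-space formulation.

```python
from collections import deque
import numpy as np

class DualMemory:
    """Toy two-part memory: a short-term window of recent frame features plus a
    long-term state vector updated recurrently, in the spirit of a state-space model."""

    def __init__(self, window_size: int = 16, state_dim: int = 64, decay: float = 0.95):
        self.window = deque(maxlen=window_size)   # short-term: last N frame features
        self.state = np.zeros(state_dim)          # long-term: compressed history
        self.decay = decay                        # how quickly old history fades

    def update(self, frame_feature: np.ndarray) -> None:
        self.window.append(frame_feature)
        # Long-term memory: exponentially decayed running summary of all past frames.
        self.state = self.decay * self.state + (1.0 - self.decay) * frame_feature

    def context(self) -> tuple[np.ndarray, np.ndarray]:
        """What a generator would condition on: recent frames plus a global summary."""
        recent = (np.stack(list(self.window)) if self.window
                  else np.zeros((0, self.state.shape[0])))
        return recent, self.state

# Hypothetical usage: feed per-frame features as a long video is generated.
memory = DualMemory()
for _ in range(100):
    memory.update(np.random.randn(64))
recent_frames, long_term_state = memory.context()
```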