NOVA is a new video editor that lets you change a few key frames (sparse control) while it carefully keeps the original motion and background details (dense synthesis).
The paper asks a simple question: when an AI sees a picture and some text but the instructions say 'only trust the picture,' how does it decide which one to follow?