JavisDiT++ is a new AI model that generates short videos with matching audio from a text prompt, keeping sight and sound in sync.
SurgWorld teaches surgical robots using videos plus text, then infers the missing robot actions so we can train good policies without collecting large amounts of real robot-action data.
The paper teaches a video generator to produce realistic motion by borrowing motion knowledge from a strong video tracker.