JavisGPT is a single model that can both understand sounding videos (audio and video together) and generate new ones whose audio and video stay in sync.
Latent diffusion models are great at making images, but they learn the meaning of scenes slowly because their training goal mostly teaches them to clean up noise rather than to understand objects and layouts.
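To make the "clean up noise" training goal concrete, here is a minimal sketch of a denoising loss. It assumes a standard DDPM-style noise-prediction objective with a linear beta schedule, which the summary above does not specify. The names `linear_alpha_bar`, `denoising_loss`, and `predict_noise` are hypothetical and chosen only for illustration. Note that nothing in the loss refers to objects or layouts; it only measures how well the added noise is recovered.

```python
import math
import random

def linear_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule."""
    alpha_bar, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar

def denoising_loss(predict_noise, z0, t, alpha_bar, rng):
    """Mean squared error between the true noise and the model's noise estimate."""
    a = alpha_bar[t]                              # fraction of signal kept at step t
    eps = [rng.gauss(0.0, 1.0) for _ in z0]       # noise the model must recover
    # Noised latent: scaled clean latent plus scaled Gaussian noise.
    z_t = [math.sqrt(a) * z + math.sqrt(1.0 - a) * e for z, e in zip(z0, eps)]
    eps_hat = predict_noise(z_t, t)               # hypothetical model call
    return sum((p - e) ** 2 for p, e in zip(eps_hat, eps)) / len(z0)

rng = random.Random(0)
alpha_bar = linear_alpha_bar()
z0 = [rng.gauss(0.0, 1.0) for _ in range(64)]     # stand-in for a flattened image latent

# Placeholder "model" that always predicts zero noise.
zero_model = lambda z_t, t: [0.0] * len(z_t)
loss = denoising_loss(zero_model, z0, 500, alpha_bar, rng)
print(f"loss of the zero-noise baseline: {loss:.3f}")
```

A real model would replace `zero_model` with a neural network; the loss would still score only noise recovery, which is why semantic structure has to emerge as a side effect.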