DREAM is a single model that both understands images (as CLIP does) and makes images from text (as dedicated text-to-image models do).
SkyReels-V4 is a single, unified model that makes videos and matching sounds together, while also letting you fix or change parts of a video.
Mobile-O is a compact model that can both understand pictures and make new images, and it runs directly on a phone.
JavisDiT++ is a model that makes short videos with matching sound from a text prompt, keeping sight and sound in sync.
CoDance is a new way to animate many characters in one picture using just one pose video, even if the picture and the video do not line up perfectly.
VINO is a single AI model that can make and edit both images and videos by following text instructions while also using reference pictures and clips.
SemanticGen is a new way to make videos that starts by planning in a small, high-level 'idea space' (semantic space) and then adds the tiny visual details later.
Latent diffusion models are great at making images but learn the meaning of scenes slowly because their training goal mostly teaches them to clean up noise, not to understand objects and layouts.
EgoX turns a regular third-person video into a first-person video that looks like it was filmed from the actor’s eyes.