Papers (2)

#alignment loss

JavisGPT: A Unified Multi-modal LLM for Sounding-Video Comprehension and Generation

JavisGPT is a single model that can both understand sounding videos (video together with its audio) and generate new ones in which the audio and video stay in sync.

#multimodal large language model #audio-video synchronization #SyncFusion


REGLUE Your Latents with Global and Local Semantics for Entangled Diffusion

Intermediate

Giorgos Petsangourakis, Christos Sgouropoulos et al. · Dec 18 · arXiv

Latent diffusion models are good at generating images but are slow to learn scene semantics, because their training objective mostly teaches them to remove noise rather than to understand objects and layouts.

#latent diffusion #REGLUE #representation entanglement
