APOLLO is a single, unified model that can make video and audio together or separately, and it keeps them tightly in sync.
The paper teaches language models using extra 'language homework' made from the same raw text so they learn grammar and meaning, not just next-word guessing.
The paper tackles a paradox: visual tokenizers that get great pixel reconstructions often make worse images when used for generation.
UnityVideo is a single, unified model that learns from many kinds of video information at once, such as color (RGB), depth, motion (optical flow), body pose, skeletons, and segmentation, to make smarter, more realistic videos.