APOLLO is a unified model that can generate video and audio jointly or separately, and it keeps the two tightly in sync.
The paper tackles a paradox: visual tokenizers that reconstruct pixels very accurately often produce worse images when used for generation.