This paper introduces XDLM, a single model that blends the two popular discrete-diffusion styles (masked and uniform) so it can both understand and generate text and images well.
Big video-making diffusion models create great videos but are slow because they need hundreds of tiny clean-up (denoising) steps.
Big text-to-image diffusion models make amazing pictures but are likewise slow, taking hundreds of tiny steps to turn pure noise into an image.
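The "hundreds of tiny clean-up steps" above can be sketched as a loop. This is a toy illustration, not any paper's actual method: `toy_denoise_step` and `sample` are made-up names, and the real models run a large neural network at every step, which is exactly why repeating the step hundreds of times is slow.

```python
import random

def toy_denoise_step(x, step, num_steps):
    # One hypothetical "clean-up" step: nudge the noisy sample toward a
    # clean target (here simply zero). A real diffusion model would run a
    # big neural network here to predict the noise to remove.
    alpha = 1.0 / (num_steps - step)  # pull harder as we near the end
    return [xi + alpha * (0.0 - xi) for xi in x]

def sample(num_steps=500, size=4, seed=0):
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(size)]  # start from pure noise
    for step in range(num_steps):  # hundreds of sequential steps
        x = toy_denoise_step(x, step, num_steps)
    return x

result = sample()
```

The steps must run one after another, each depending on the previous output, so the cost grows linearly with the step count; this sequential bottleneck is what the fast-sampling work summarized here tries to remove.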