This paper introduces XDLM, a single model that blends two popular discrete diffusion styles (masked and uniform) so that one model can both understand and generate text and images well.
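To make "blending masked and uniform diffusion" concrete, here is a rough sketch (not XDLM's actual code) of a forward corruption step that, at each corrupted position, either replaces the token with a mask token or with a uniformly random token. The function and parameter names (`corrupt`, `blend`, `mask_id`) are invented for this illustration.

```python
import torch

def corrupt(tokens: torch.Tensor, t: float, vocab_size: int,
            mask_id: int, blend: float = 0.5) -> torch.Tensor:
    """Corrupt a fraction t of token positions, mixing two noise types.

    blend = 1.0 -> pure masked diffusion (all corrupted tokens become mask_id);
    blend = 0.0 -> pure uniform diffusion (all corrupted tokens become random ids).
    """
    noisy = tokens.clone()
    # Choose which positions to corrupt (roughly a fraction t of them).
    corrupt_pos = torch.rand_like(tokens, dtype=torch.float) < t
    # Per position, choose masked-style or uniform-style noise.
    use_mask = torch.rand_like(tokens, dtype=torch.float) < blend
    random_ids = torch.randint(0, vocab_size, tokens.shape)
    noisy[corrupt_pos & use_mask] = mask_id
    noisy[corrupt_pos & ~use_mask] = random_ids[corrupt_pos & ~use_mask]
    return noisy
```

A single denoiser trained against a mixture like this sees both kinds of corruption, which is the intuition behind combining the two styles in one model.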
Autoregressive (AR) image models make pictures by choosing tokens one at a time, but they have mostly been judged on how likely their token choices are, not on how good the final picture actually looks in pixels.
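For reference, here is a minimal sketch of the one-token-at-a-time sampling loop the summary refers to; the `model` interface and function name are hypothetical, not taken from the paper. The point being made is that a model can score well on the likelihoods used inside this loop while the decoded pixels still look bad.

```python
import torch

@torch.no_grad()
def generate_image_tokens(model, prompt: torch.Tensor, num_tokens: int,
                          temperature: float = 1.0) -> torch.Tensor:
    """Sample image tokens autoregressively from a next-token model.

    Assumes `model(seq)` returns logits of shape (batch, seq_len, vocab).
    """
    seq = prompt
    for _ in range(num_tokens):
        logits = model(seq)[:, -1] / temperature   # logits for the next token only
        probs = logits.softmax(dim=-1)
        next_tok = torch.multinomial(probs, num_samples=1)
        seq = torch.cat([seq, next_tok], dim=1)    # append and continue
    return seq  # a separate detokenizer would map this grid back to pixels
```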
This paper studies how a newer kind of language model, the discrete diffusion language model (DLM), improves as it is scaled up with more data, more parameters, and more compute.
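Scaling studies of this kind typically fit a power-law form for loss as a function of model size and data. The sketch below shows a Chinchilla-style functional form; the constants are placeholders chosen for illustration, not the paper's actual fit.

```python
def scaling_law(N: float, D: float,
                E: float = 1.7, A: float = 400.0, B: float = 400.0,
                alpha: float = 0.34, beta: float = 0.28) -> float:
    """Predicted loss for a model with N parameters trained on D tokens.

    E is an irreducible loss floor; the two power-law terms shrink as the
    model gets bigger (N) and sees more data (D). Constants are illustrative.
    """
    return E + A / N**alpha + B / D**beta
```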