This paper introduces DiRL, a fast and careful post-training method for diffusion language models that improves their reasoning.
Long texts make standard attention in large language models very slow because it compares every token with every other token, so the cost grows with the square of the text length.
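
To make the all-pairs cost concrete, here is a minimal sketch of standard (dense) self-attention in NumPy. It is an illustration only, not code from any paper summarized here: the function name `self_attention` is hypothetical, and for brevity the token vectors are used directly as queries, keys, and values, whereas real models apply learned projections first.

```python
import numpy as np

def self_attention(x):
    """Dense self-attention over n token vectors of size d.

    Simplifying assumption: x serves as queries, keys, and values
    (no learned projection matrices).
    """
    n, d = x.shape
    # One score per (token, token) pair: an n x n matrix.
    scores = (x @ x.T) / np.sqrt(d)
    # Row-wise softmax, shifted by the row maximum for numerical stability.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    # Each output vector is a weighted average of all n token vectors.
    return weights @ x, weights

rng = np.random.default_rng(0)
for n in (256, 512, 1024):
    _, weights = self_attention(rng.standard_normal((n, 16)))
    # Doubling n quadruples the number of pairwise weights.
    print(n, weights.size)
```

The weight matrix has one entry per pair of tokens, so a text twice as long needs four times as many comparisons and four times the memory for that matrix.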