Diffusion Language Models (DLMs) generate text by drafting the whole output at once and then refining it over several parallel passes, instead of writing it one token at a time.
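A minimal toy sketch of that contrast, assuming a masked-diffusion-style decoder: `toy_model`, `autoregressive_decode`, and `diffusion_style_decode` are hypothetical stand-ins (the predictor just returns random guesses), so only the control flow, predicting every position in parallel and committing the most confident ones each pass, reflects the idea, not any real DLM's API.

```python
# Toy sketch: left-to-right decoding vs. parallel-refinement decoding.
# The "model" is a stub that returns random (token, confidence) guesses;
# the point is the shape of the two generation loops, not the quality.
import random

VOCAB = ["the", "cat", "sat", "on", "a", "mat", "."]
SEQ_LEN = 7
MASK = "<mask>"

def toy_model(tokens):
    """Stub predictor: a guess and a confidence for every position.
    A real DLM would condition on the whole partially filled sequence."""
    return [(random.choice(VOCAB), random.random()) for _ in tokens]

def autoregressive_decode():
    """Classic generation: one model call per token, strictly left to right."""
    seq = []
    for _ in range(SEQ_LEN):
        guess, _ = toy_model(seq + [MASK])[len(seq)]  # predict only the next slot
        seq.append(guess)
    return seq

def diffusion_style_decode(num_passes=3):
    """Start fully masked; each pass predicts all positions in parallel
    and commits only the most confident guesses (the "polishing" passes)."""
    seq = [MASK] * SEQ_LEN
    per_pass = SEQ_LEN // num_passes + 1
    for _ in range(num_passes):
        guesses = toy_model(seq)  # every position scored at once
        masked = [i for i, t in enumerate(seq) if t == MASK]
        masked.sort(key=lambda i: guesses[i][1], reverse=True)
        for i in masked[:per_pass]:  # keep the highest-confidence predictions
            seq[i] = guesses[i][0]
    return seq

if __name__ == "__main__":
    print("autoregressive: ", autoregressive_decode())
    print("diffusion-style:", diffusion_style_decode())
```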
The paper studies why two opposite-sounding tricks in RL for reasoning—adding random (spurious) rewards and reducing randomness (entropy)—can both seem to help large language models think better.
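As a rough illustration of what those two tricks look like mechanically, here is a toy sketch, assuming a 3-armed bandit and plain REINFORCE updates rather than the paper's actual training setup: `reinforce_step`, the coin-flip spurious reward, and the negative `entropy_coef` are hypothetical choices meant only to show where a random reward and an entropy-reduction term enter the policy update.

```python
# Toy sketch of the two tricks: (1) a spurious reward that is pure noise,
# and (2) an explicit entropy penalty that pushes the policy toward
# lower randomness. A 3-armed bandit stands in for "reasoning".
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def reinforce_step(logits, lr=0.5, spurious=False, entropy_coef=0.0):
    probs = softmax(logits)
    a = rng.choice(len(probs), p=probs)
    # Spurious reward: a coin flip, unrelated to the action actually taken.
    reward = rng.choice([0.0, 1.0]) if spurious else float(a == 2)
    # REINFORCE gradient of log pi(a) w.r.t. the logits: one-hot(a) - probs.
    grad_logp = -probs
    grad_logp[a] += 1.0
    # Gradient of entropy H = -sum p log p w.r.t. the logits;
    # a *negative* coefficient below penalizes entropy (reduces randomness).
    grad_entropy = -probs * (np.log(probs) + 1.0)
    grad_entropy -= probs * grad_entropy.sum()
    return logits + lr * (reward * grad_logp + entropy_coef * grad_entropy)

logits_a = np.zeros(3)  # trained with random (spurious) rewards
logits_b = np.zeros(3)  # trained with a real reward plus an entropy penalty
for _ in range(200):
    logits_a = reinforce_step(logits_a, spurious=True)
    logits_b = reinforce_step(logits_b, entropy_coef=-0.1)

print("spurious-reward policy:", softmax(logits_a).round(2))
print("entropy-reduced policy:", softmax(logits_b).round(2))
```

In expectation the spurious reward adds no useful signal, yet its noise still moves the policy; the entropy term sharpens the policy directly. The sketch only shows where each term sits in the update, not the paper's explanation for why either appears to help.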