Language is lumpy: easy stretches and tricky jumps are mixed together, yet standard language models spend the same compute on every token.
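To make the "lumpiness" concrete, here is a toy sketch (ours, not the paper's): per-token surprisal under an off-the-shelf model like GPT-2 varies widely within a single sentence, and that variation is exactly what a uniform-compute model ignores.

```python
# Toy illustration (not from the paper): per-token surprisal varies a lot
# across a sentence, which is the "lumpiness" a uniform-compute model ignores.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

text = "The capital of France is Paris, which surprised absolutely nobody."
ids = tokenizer(text, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits  # (1, seq_len, vocab)

# Surprisal of each token given its prefix: -log p(token | prefix).
log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
surprisal = -log_probs[torch.arange(ids.shape[1] - 1), ids[0, 1:]]

for tok, s in zip(tokenizer.convert_ids_to_tokens(ids[0, 1:].tolist()), surprisal):
    print(f"{tok!r:>15}  {s:.2f} nats")  # easy tokens score low, tricky jumps high
```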
This paper asks whether large language models (LLMs) can act as "world models" that predict what happens next in text-based environments, not just the next word in a sentence.
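A hypothetical sketch of what that evaluation looks like (the function names and prompt are our illustration, not the paper's protocol): the model sees a state and an action, must describe the resulting state, and is scored against the true transition.

```python
# Hypothetical sketch (names and prompt are ours, not the paper's): testing an
# LLM as a "world model" means asking it to predict the next environment state.
def predict_next_state(llm, state: str, action: str) -> str:
    """Prompt the LLM to simulate one environment step."""
    prompt = (
        "You are simulating a text-based environment.\n"
        f"Current state: {state}\n"
        f"Action taken: {action}\n"
        "Describe the resulting state:"
    )
    return llm(prompt)

def world_model_accuracy(llm, transitions) -> float:
    """Fraction of (state, action, next_state) triples predicted correctly.

    Real evaluations would use a softer match than exact string equality.
    """
    hits = sum(
        predict_next_state(llm, s, a).strip() == s_next.strip()
        for s, a, s_next in transitions
    )
    return hits / len(transitions)
```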
The paper tackles a paradox: visual tokenizers that reconstruct pixels almost perfectly often produce worse images when plugged into a generative model.
Different programming languages scale differently when training code LLMs, so treating them all the same wastes compute and hurts performance.
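A small numeric illustration of why this matters (made-up power laws, not the paper's measurements): if two languages improve at different rates with compute, the best split of a fixed training budget is far from 50/50.

```python
# Illustrative numbers (ours, not the paper's): two languages following power
# laws with different exponents make an even compute split suboptimal.
import numpy as np

def loss(compute, a=10.0, alpha=0.3):
    return a * compute ** (-alpha)

total = 1e6  # total training compute, arbitrary units
# Language A improves quickly with compute; language B improves slowly.
alpha_a, alpha_b = 0.35, 0.15

even = loss(total / 2, alpha=alpha_a) + loss(total / 2, alpha=alpha_b)

# Sweep the split and find the allocation minimizing the summed loss.
splits = np.linspace(0.05, 0.95, 181)
totals = loss(splits * total, alpha=alpha_a) + loss((1 - splits) * total, alpha=alpha_b)
best = splits[np.argmin(totals)]

print(f"even split loss: {even:.4f}")
print(f"best split: {best:.0%} to language A, loss {totals.min():.4f}")
```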
This paper studies how a newer kind of language model, the discrete diffusion language model (DLM), improves as it is given more data, bigger models, and more compute.
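For intuition about what such a scaling study measures, here is a minimal sketch (with invented numbers, assuming the standard power-law form L(C) = a·C^(−b) used in LLM scaling work, not the paper's exact fit): loss versus training compute becomes a straight line in log-log space, and the fitted slope says how fast the model improves.

```python
# Minimal sketch (invented data, assumed power-law form): fit L(C) = a * C^(-b)
# to loss-vs-compute measurements; the exponent b is the headline of a scaling law.
import numpy as np

# Hypothetical (compute, loss) measurements from a model-size/data sweep.
compute = np.array([1e18, 1e19, 1e20, 1e21])
loss = np.array([3.10, 2.72, 2.40, 2.12])

# Linear regression in log-log space recovers the exponent.
b, log_a = np.polyfit(np.log(compute), np.log(loss), 1)
print(f"fitted exponent: {-b:.3f}")  # how fast loss falls per decade of compute
print(f"predicted loss at 1e22 FLOPs: {np.exp(log_a) * 1e22 ** b:.2f}")
```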