ConceptMoE teaches a language model to group easy, similar tokens into bigger ideas called concepts, so it spends more brainpower on the hard parts.
This paper builds DiRL, a fast and careful way to finish training diffusion language models so they reason better.
Long texts make standard attention in large language models very slow because it compares every token with every other token, so the cost grows with the square of the text length.
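
To make the slowdown concrete, here is a minimal sketch of standard softmax attention in plain Python that also counts how many token-to-token comparisons it performs. It is not taken from the paper being summarized. The names `naive_attention` and `count_comparisons` and the toy token vectors are assumptions made for illustration only.

```python
import math


def naive_attention(queries, keys, values):
    """Full softmax attention over lists of equal-length vectors.

    Returns the output vectors and the number of query-key comparisons made.
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    comparisons = 0
    outputs = []
    for q in queries:
        # Score this token against every token in the sequence.
        scores = []
        for k in keys:
            scores.append(scale * sum(qi * ki for qi, ki in zip(q, k)))
            comparisons += 1
        # Numerically stable softmax over the scores.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted average of the value vectors.
        width = len(values[0])
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(width)]
        )
    return outputs, comparisons


def count_comparisons(n, dim=4):
    """Run attention on n hypothetical toy tokens and return the comparison count."""
    tokens = [[float((i + j) % 3) for j in range(dim)] for i in range(n)]
    _, comparisons = naive_attention(tokens, tokens, tokens)
    return comparisons


if __name__ == "__main__":
    for n in (8, 16, 32):
        print(n, count_comparisons(n))
```

In this sketch, doubling the text from 8 to 16 tokens raises the number of comparisons from 64 to 256, which is four times as many. That quadratic growth is why long inputs become expensive.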