SageBwd makes Transformer attention both fast and trainable by performing most of its large matrix multiplications in 8-bit precision instead of full precision.
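To make the 8-bit idea concrete, here is a minimal sketch of a quantized matrix multiplication in plain Python. It is not SageBwd's actual kernel: the per-tensor symmetric scaling, the helper names (`quantize_int8`, `int8_matmul`, `float_matmul`), and the toy matrices are all assumptions chosen for illustration.

```python
def quantize_int8(matrix):
    """Symmetric per-tensor quantization (an assumed scheme, for illustration).

    Returns the integer matrix and the scale that maps integers back to floats.
    """
    max_abs = max(abs(x) for row in matrix for x in row)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [
        [max(-127, min(127, round(x / scale))) for x in row] for row in matrix
    ]
    return quantized, scale


def int8_matmul(a, b):
    """Multiply a @ b using 8-bit integer operands and one float rescale."""
    qa, scale_a = quantize_int8(a)
    qb, scale_b = quantize_int8(b)
    out = []
    for i in range(len(a)):
        row = []
        for j in range(len(b[0])):
            acc = 0  # integer accumulator (wider than 8 bits on real hardware)
            for k in range(len(b)):
                acc += qa[i][k] * qb[k][j]
            row.append(acc * scale_a * scale_b)  # dequantize the result
        out.append(row)
    return out


def float_matmul(a, b):
    """Full-precision reference for comparison."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


a = [[0.5, -1.0], [2.0, 0.25]]
b = [[1.0, 0.0], [-0.5, 3.0]]
approx = int8_matmul(a, b)
exact = float_matmul(a, b)
max_error = max(
    abs(x - y) for r1, r2 in zip(approx, exact) for x, y in zip(r1, r2)
)
print(max_error)  # small, but not zero: quantization trades accuracy for speed
```

The integer inner loop is where hardware gains its speed; the single floating-point rescale at the end recovers an approximation of the full-precision product.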
This paper introduces Log-linear Sparse Attention (LLSA), which lets Diffusion Transformers attend only to the most useful information, selected through a hierarchical, layered search.
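The layered search can be pictured with a simplified two-level sketch: first score coarse block summaries, then attend only to the tokens inside the best-scoring blocks. This is an illustration under stated assumptions, not the paper's algorithm: the mean-pooled block summaries, the function name `sparse_attention`, and the `block_size` and `top_k` parameters are all hypothetical.

```python
import math


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def mean_vector(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def sparse_attention(query, keys, values, block_size=2, top_k=1):
    """Two-level sparse attention for one query (illustrative assumption).

    Coarse level: rank blocks of keys by their mean-pooled summary.
    Fine level: run softmax attention only over tokens in the top_k blocks.
    """
    blocks = [
        list(range(start, min(start + block_size, len(keys))))
        for start in range(0, len(keys), block_size)
    ]
    summaries = [mean_vector([keys[i] for i in block]) for block in blocks]
    block_scores = [dot(query, s) for s in summaries]
    chosen = sorted(
        range(len(blocks)), key=lambda b: block_scores[b], reverse=True
    )[:top_k]
    selected = sorted(i for b in chosen for i in blocks[b])

    # Standard scaled softmax attention, restricted to the selected tokens.
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[i]) * scale for i in selected]
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    output = [
        sum(w * values[i][d] for w, i in zip(weights, selected)) / total
        for d in range(len(values[0]))
    ]
    return output, selected


keys = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
values = [[1.0], [2.0], [3.0], [4.0]]
output, selected = sparse_attention([0.0, 1.0], keys, values)
print(selected)  # indices of the tokens that were actually attended to
```

Only the tokens in the chosen blocks enter the softmax, so the cost of the fine level depends on `top_k * block_size` rather than on the full sequence length.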
This paper speeds up diffusion language models (dLLMs) by changing the order in which they fill in missing words.