CASA is a new way to mix images and text inside a language model that stays fast and memory-efficient while keeping accuracy high.
Kling-Omni is a single, unified model that can understand text, images, and videos together and then make or edit high-quality videos from those mixed instructions.
This paper introduces Log-linear Sparse Attention (LLSA), a new way for Diffusion Transformers to focus only on the most useful information using a smart, layered search.
This paper speeds up diffusion language models (dLLMs) by changing the order in which they fill in missing words.
Recursive transformers save memory by reusing the same layer over and over, but that makes them less expressive and hurts accuracy.
The paper shows that video AIs do not need long, human-like chains of thought to reason well.
Large language models use RoPE to encode word order, but its attention score keeps only the real part of a complex-valued product and throws away the imaginary part.
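
To make the RoPE point concrete, here is a minimal sketch of the standard complex-number view of RoPE. It is not code from the paper. The helper names (`rope_complex`, `rope_real`, `complex_score`), the interleaved pairing of dimensions, and the base of 10000 are assumptions chosen for illustration. The sketch shows that the usual real-valued RoPE dot product equals the real part of a complex product, so the imaginary part is computed implicitly and then dropped.

```python
import numpy as np

def _freqs(d, base=10000.0):
    # One rotation frequency per pair of dimensions (assumed convention).
    return base ** (-np.arange(d // 2) * 2.0 / d)

def rope_complex(x, pos):
    """View a real vector of even length d as d/2 complex numbers and rotate them."""
    z = x[0::2] + 1j * x[1::2]
    return z * np.exp(1j * pos * _freqs(x.shape[-1]))

def rope_real(x, pos):
    """The same rotation written with real sines and cosines."""
    theta = pos * _freqs(x.shape[-1])
    cos, sin = np.cos(theta), np.sin(theta)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

def complex_score(q, k, m, n):
    """Full complex score between a query at position m and a key at position n."""
    return np.sum(rope_complex(q, m) * np.conj(rope_complex(k, n)))

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)

s = complex_score(q, k, m=5, n=2)
real_dot = rope_real(q, 5) @ rope_real(k, 2)

print("real part used by attention:", s.real)
print("ordinary RoPE dot product:  ", real_dot)
print("imaginary part discarded:   ", s.imag)
```

The first two printed numbers match, and the third is the quantity that ordinary attention never sees. The score also depends only on the distance between the two positions, so shifting both positions by the same amount leaves it unchanged.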