This paper finds that about 1 out of every 4 attention heads in autoregressive video diffusion models mostly looks only at the current frame and almost ignores the past, wasting memory and time.
This paper shows how to add a tiny helper (a probe) to a big language model so it can classify things like safety or sentiment during the same pass it already does to answer you.