NanoQuant is a new way to shrink large language models down to 1-bit and even less than 1-bit per weight without retraining on huge datasets.
OmniSIFT is a new way to shrink (compress) audio and video tokens so omni-modal language models can think faster without forgetting important details.
Transformers slow down on very long inputs because standard attention looks at every token pair, which is expensive.