Papers (3)

#straight-through estimator

NanoQuant: Efficient Sub-1-Bit Quantization of Large Language Models

Hyochan Chong, Dongkyu Kim et al. · Feb 6 · arXiv

NanoQuant is a post-training quantization method that compresses large language models to 1 bit, and even below 1 bit, per weight without retraining on huge datasets.

#post-training quantization #sub-1-bit quantization #binary LLMs


OmniSIFT: Modality-Asymmetric Token Compression for Efficient Omni-modal Large Language Models

Intermediate

Yue Ding, Yiyan Ji et al. · Feb 4 · arXiv

OmniSIFT compresses audio and video tokens so that omni-modal language models can run faster without losing important details.

#Omni-LLM #token compression #modality-asymmetric


Elastic Attention: Test-time Adaptive Sparsity Ratios for Efficient Transformers

Beginner

Zecheng Tang, Quantong Qiu et al. · Jan 24 · arXiv

Transformers slow down on very long inputs because standard attention looks at every token pair, which is expensive.

#elastic attention #sparse attention #full attention
