NanoQuant is a new method for compressing large language models to 1 bit per weight, and even below 1 bit per weight, without retraining on huge datasets.
Mixture-of-Experts (MoE) models often send far more tokens to a few "favorite" experts, which overloads some GPUs while others sit idle.
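To make the imbalance concrete, here is a minimal sketch of top-1 routing, assuming each token simply goes to its highest-scoring expert. The router scores, the helper names (`route_top1`, `expert_load`, `imbalance_ratio`), and the 4-expert setup are all hypothetical and chosen for illustration; they do not come from any particular MoE implementation.

```python
from collections import Counter


def route_top1(router_scores):
    """Return, for each token, the index of its highest-scoring expert."""
    return [max(range(len(s)), key=s.__getitem__) for s in router_scores]


def expert_load(assignments, num_experts):
    """Count how many tokens were assigned to each expert."""
    counts = Counter(assignments)
    return [counts.get(e, 0) for e in range(num_experts)]


def imbalance_ratio(load):
    """Busiest expert's load divided by the perfectly even load."""
    mean_load = sum(load) / len(load)
    return max(load) / mean_load


# Hypothetical router scores: 8 tokens, 4 experts (illustrative values only).
scores = [
    [2.0, 0.1, 0.3, 0.2],
    [1.8, 0.4, 0.1, 0.0],
    [1.5, 0.2, 0.9, 0.1],
    [0.3, 1.1, 0.2, 0.4],
    [2.2, 0.3, 0.1, 0.5],
    [1.9, 0.0, 0.4, 0.2],
    [0.1, 0.2, 0.3, 1.4],
    [1.7, 0.6, 0.2, 0.3],
]

assignments = route_top1(scores)
load = expert_load(assignments, num_experts=4)
print("tokens per expert:", load)
print("imbalance ratio:", imbalance_ratio(load))
```

If each expert lived on its own GPU, the device hosting expert 0 would process most of the batch while the device hosting expert 2 received nothing, which is the idle-versus-overloaded pattern described above.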
Transformers are powerful but slow because standard self-attention compares every token with every other token, so its cost grows quadratically with sequence length and becomes prohibitive for long sequences.
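The all-pairs comparison can be sketched in a few lines of plain Python. This is a simplified illustration of scaled dot-product scores only (no softmax, no value projection, no batching), and the function names `attention_scores` and `num_pairwise_scores` are hypothetical.

```python
import math


def attention_scores(queries, keys):
    """Compute the scaled dot product of every query with every key.

    For n tokens this returns an n x n matrix, one score per token pair.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    return [
        [scale * sum(q_i * k_i for q_i, k_i in zip(q, k)) for k in keys]
        for q in queries
    ]


def num_pairwise_scores(seq_len):
    """Number of query-key scores full self-attention must compute."""
    return seq_len * seq_len


# Three toy 2-dimensional token vectors (illustrative values only).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = attention_scores(tokens, tokens)
print(len(scores), "x", len(scores[0]), "score matrix")

# Doubling the sequence length quadruples the number of scores.
for n in (1024, 2048, 4096):
    print(n, "tokens ->", num_pairwise_scores(n), "scores")
```

The nested loop over queries and keys is where the quadratic cost comes from: every extra token adds a full row and a full column to the score matrix.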