LycheeDecode: Accelerating Long-Context LLM Inference via Hybrid-Head Sparse Decoding
Intermediate · Gang Lin, Dongfang Li et al. · Feb 4 · arXiv
Long inputs make language models slow because, for every new token they generate, the models must keep and re-read a large memory of past context called the KV cache.
#long-context LLM · #sparse attention · #head specialization