How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (2)


LycheeDecode: Accelerating Long-Context LLM Inference via Hybrid-Head Sparse Decoding

Intermediate
Gang Lin, Dongfang Li et al. · Feb 4 · arXiv

Long texts make language models slow because, for every new word they write, they must keep and re-read a huge memory called the KV cache, which grows with the length of the input; see the sketch below.

#long-context LLM #sparse attention #head specialization
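
The summary above is about the cost of re-reading the entire KV cache at every decoding step. As a minimal sketch of the general hybrid-head idea named in the title (not LycheeDecode's actual algorithm), the toy below lets a few designated "full" heads attend over the whole cache while every other head only attends to a recent window; the `full_heads` indices, the `window` size, and the function name are all hypothetical choices for illustration.

```python
import numpy as np

def decode_step(q, K, V, window=128, full_heads=(0,)):
    """One toy decoding step over a cached K/V of shape (heads, seq, dim).

    Heads listed in `full_heads` (stand-ins for "retrieval"-style heads)
    attend to the entire KV cache; all other heads only attend to the last
    `window` cached positions, so their per-step cost stays constant as the
    context grows.
    """
    n_heads, _, dim = K.shape
    out = np.empty((n_heads, dim))
    for h in range(n_heads):
        # Pick either the full cache or just the recent window for this head.
        k, v = (K[h], V[h]) if h in full_heads else (K[h, -window:], V[h, -window:])
        scores = k @ q[h] / np.sqrt(dim)       # one score per kept position
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()               # softmax over kept positions
        out[h] = weights @ v                   # weighted sum of cached values
    return out

# Toy example: 8 heads, 4096 cached tokens, head dimension 64.
rng = np.random.default_rng(0)
H, S, D = 8, 4096, 64
q = rng.standard_normal((H, D))
K = rng.standard_normal((H, S, D))
V = rng.standard_normal((H, S, D))
print(decode_step(q, K, V).shape)  # (8, 64)
```

In this toy setup only the full heads still scale with context length; the windowed heads do a fixed amount of work per step, which is where a hybrid-head scheme would get its decoding speedup.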

Elastic Attention: Test-time Adaptive Sparsity Ratios for Efficient Transformers

Beginner
Zecheng Tang, Quantong Qiu et al. · Jan 24 · arXiv

Transformers slow down on very long inputs because standard attention compares every pair of tokens, so its cost grows quadratically with input length; see the sketch below.

#elastic attention #sparse attention #full attention
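
To make the "every token pair" cost concrete, here is a back-of-the-envelope operation count comparing full attention with a fixed sparsity ratio. The `keep_ratio` knob only stands in for the kind of sparsity ratio an adaptive scheme might choose at test time; it is not Elastic Attention's actual mechanism, and the constants are rough.

```python
def attention_ops(seq_len, dim, keep_ratio=1.0):
    """Rough multiply-add count for one attention layer over `seq_len` tokens.

    With keep_ratio=1.0 every query scores every key (the n^2 term); a ratio
    below 1.0 only scores that fraction of the pairs.
    """
    pairs = seq_len * int(seq_len * keep_ratio)
    return 2 * pairs * dim  # QK^T scores plus the weighted sum over V

for n in (1_000, 10_000, 100_000):
    full = attention_ops(n, dim=128)
    sparse = attention_ops(n, dim=128, keep_ratio=0.1)
    print(f"{n:>7} tokens: full {full:.2e} ops, 10% sparse {sparse:.2e} ops")
```

Running it shows the full-attention count growing quadratically (100x more tokens means about 10,000x more score computations), while a 10% keep ratio cuts the pair count tenfold at every length; an adaptive scheme would pick that ratio per input instead of fixing it.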