Papers5

#Sliding Window Attention

Arcee Trinity Large Technical Report

Varun Singh, Lucas Krauss et al.Feb 19arXiv

Trinity is a family of open language models that are huge on the inside but only wake up a few 'experts' for each word, so they are fast and affordable to run.

#Mixture-of-Experts#SMEBU#Gated Attention

HySparse: A Hybrid Sparse Attention Architecture with Oracle Token Selection and KV Cache Sharing

Intermediate

Yizhao Gao, Jianyu Wei et al.Feb 3arXiv

HySparse is a new way for AI models to pay attention that mixes a few full attention layers with many fast, memory‑saving sparse layers.

#Hybrid Sparse Attention#Oracle Token Selection#KV Cache Sharing

MiMo-V2-Flash Technical Report

Intermediate

Xiaomi LLM-Core Team, : et al.Jan 6arXiv

MiMo-V2-Flash is a giant but efficient language model that uses a team-of-experts design to think well while staying fast.

#Mixture-of-Experts#Sliding Window Attention#Global Attention

K-EXAONE Technical Report

Intermediate

Eunbi Choi, Kibong Choi et al.Jan 5arXiv

K-EXAONE is a super-sized language model that speaks six languages and can read very long documents (up to 256,000 tokens) without forgetting important details.

#Mixture-of-Experts#Hybrid Attention#Sliding Window Attention

SWAA: Sliding Window Attention Adaptation for Efficient Long-Context LLMs Without Pretraining

Intermediate

Yijiong Yu, Jiale Liu et al.Dec 11arXiv

Long texts make standard attention in large language models very slow because it checks every word against every other word.

#Sliding Window Attention#SWAA#FA Decode