Papers (3)

#KV-Cache

DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference

Yongtong Wu, Shaoyuan Chen et al. · Feb 25 · arXiv

Agent-style LLMs chat with tools over many short turns, so most tokens in each request repeat earlier context, and the system spends more time loading the stored KV-Cache than computing new answers.

#KV-Cache #prefill-decode disaggregation #dual-path loading


MiniCPM-SALA: Hybridizing Sparse and Linear Attention for Efficient Long-Context Modeling

Intermediate

MiniCPM Team, Wenhao An et al. · Feb 12 · arXiv

MiniCPM-SALA is a 9B-parameter language model that mixes two kinds of attention—sparse and linear—to read very long texts quickly and accurately.

#long-context modeling #sparse attention #linear attention


Sparse-LaViDa: Sparse Multimodal Discrete Diffusion Language Models

Beginner

Shufan Li, Jiuxiang Gu et al. · Dec 16 · arXiv

Sparse-LaViDa makes diffusion-style AI models much faster by skipping unhelpful masked tokens during generation while keeping quality the same.

#Masked Discrete Diffusion #Sparse Parameterization #Register Tokens
