How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (14)

HySparse: A Hybrid Sparse Attention Architecture with Oracle Token Selection and KV Cache Sharing

Intermediate
Yizhao Gao, Jianyu Wei et al. · Feb 3 · arXiv

HySparse is a new way for AI models to pay attention that mixes a few full attention layers with many fast, memory‑saving sparse layers.

#Hybrid Sparse Attention · #Oracle Token Selection · #KV Cache Sharing
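
To make the idea concrete, here is a rough sketch (not HySparse's actual design) of a hybrid stack where every fourth layer uses full causal attention and the rest use a cheap sliding-window mask; the `full_every` schedule and window size are made up for illustration.

```python
# Minimal sketch (assumed, not the paper's implementation): a hybrid stack where
# a few layers use full attention and the rest use a cheap sliding-window mask.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask):
    # q, k, v: (seq, dim); mask: (seq, seq) with True = may attend
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -1e9)
    return softmax(scores, axis=-1) @ v

def full_mask(n):
    return np.tril(np.ones((n, n), dtype=bool))            # causal, all past tokens

def window_mask(n, window=64):
    causal = np.tril(np.ones((n, n), dtype=bool))
    near = np.triu(np.ones((n, n), dtype=bool), -window + 1)
    return causal & near                                    # causal, last `window` tokens only

# Hypothetical layer schedule: one full-attention layer for every four sparse layers.
def layer_masks(num_layers, seq_len, full_every=4):
    return [full_mask(seq_len) if i % full_every == 0 else window_mask(seq_len)
            for i in range(num_layers)]

if __name__ == "__main__":
    n, d = 256, 32
    q = k = v = np.random.randn(n, d)
    for mask in layer_masks(num_layers=8, seq_len=n):
        out = attention(q, k, v, mask)   # in a real model q, k, v differ per layer
    print(out.shape)                      # (256, 32)
```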

Token Sparse Attention: Efficient Long-Context Inference with Interleaved Token Selection

Intermediate
Dongwon Jo, Beomseok Kang et al. · Feb 3 · arXiv

This paper speeds up how AI models read very long texts by carefully choosing which words (tokens) to focus on at each step.

#Token Sparse Attention · #Dynamic Token Coverage · #Representation Drift
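
As a rough illustration of token selection (the paper's interleaved selection rule is more involved), the sketch below lets each query attend only to its top-k highest-scoring past tokens; the scoring and the `top_k` value here are assumptions.

```python
# Minimal sketch (assumed, not the paper's method): each query attends only to
# the top_k past tokens ranked by raw attention score.
import numpy as np

def topk_sparse_attention(q, k, v, top_k=32):
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)                        # (n, n)
    causal = np.tril(np.ones((n, n), dtype=bool))
    scores = np.where(causal, scores, -np.inf)
    # Keep only the top_k highest-scoring keys per query, drop the rest.
    kth = np.sort(scores, axis=-1)[:, -top_k][:, None]   # k-th largest score per row
    keep = scores >= kth
    scores = np.where(keep, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

if __name__ == "__main__":
    n, d = 128, 16
    q, k, v = (np.random.randn(n, d) for _ in range(3))
    print(topk_sparse_attention(q, k, v).shape)           # (128, 16)
```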

Fast Autoregressive Video Diffusion and World Models with Temporal Cache Compression and Sparse Attention

Intermediate
Dvir Samuel, Issar Tzachor et al. · Feb 2 · arXiv

The paper makes long video generation much faster and lighter on memory by cutting out repeated work in attention.

#autoregressive video diffusion · #KV cache compression · #sparse attention
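
One hedged way to picture temporal cache compression (not the paper's actual scheme): keep the KV entries of recent frames in full and subsample older frames, as in this toy sketch where `keep_recent` and `older_stride` are invented parameters.

```python
# Minimal sketch (assumed, not the paper's scheme): keep the KV entries of the
# most recent frames in full, and subsample older frames so the cache stays small.
import numpy as np

def compress_temporal_cache(kv_per_frame, keep_recent=8, older_stride=4):
    """kv_per_frame: list of (K, V) arrays, one per generated frame."""
    n = len(kv_per_frame)
    if n <= keep_recent:
        return kv_per_frame
    older, recent = kv_per_frame[:-keep_recent], kv_per_frame[-keep_recent:]
    return older[::older_stride] + recent          # drop most of the old frames

if __name__ == "__main__":
    d = 16
    cache = [(np.random.randn(4, d), np.random.randn(4, d)) for _ in range(40)]
    compressed = compress_temporal_cache(cache)
    print(len(cache), "->", len(compressed))       # 40 -> 16 frames of KV kept
```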

LRAgent: Efficient KV Cache Sharing for Multi-LoRA LLM Agents

Intermediate
Hyesung Jeon, Hyeongju Ha et al. · Feb 1 · arXiv

Multi-agent LLM systems often use LoRA adapters so each agent has a special role, but they all rebuild almost the same KV cache, wasting memory and time.

#LoRA · #Multi-LoRA · #KV cache
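
A toy sketch of the general idea of KV cache sharing (the names and structure are illustrative, not LRAgent's API): agents built from different LoRA adapters on one base model reuse a single cache for their shared prompt prefix instead of each rebuilding it.

```python
# Minimal sketch (assumed): several LoRA "agents" on one base model reuse a single
# KV cache computed for the shared system-prompt prefix, instead of each agent
# re-encoding it. Names and structure here are hypothetical.
from dataclasses import dataclass, field

@dataclass
class KVCache:
    keys: list = field(default_factory=list)     # one entry per cached token
    values: list = field(default_factory=list)

class SharedPrefixCache:
    def __init__(self):
        self._by_prefix = {}                      # prefix text -> KVCache

    def get_or_build(self, prefix: str, encode):
        if prefix not in self._by_prefix:         # built once, shared by every agent
            self._by_prefix[prefix] = encode(prefix)
        return self._by_prefix[prefix]

def encode_with_base_model(prefix: str) -> KVCache:
    # Stand-in for the base model's prefill pass; real K/V would be tensors.
    toks = prefix.split()
    return KVCache(keys=[f"k({t})" for t in toks], values=[f"v({t})" for t in toks])

if __name__ == "__main__":
    shared = SharedPrefixCache()
    system_prompt = "You are a helpful multi-agent system ..."
    for agent in ("planner-lora", "coder-lora", "critic-lora"):
        cache = shared.get_or_build(system_prompt, encode_with_base_model)
        # Each agent only extends the cache with its own role-specific tokens.
        print(agent, "reuses", len(cache.keys), "cached prefix tokens")
```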

Towards Automated Kernel Generation in the Era of LLMs

Intermediate
Yang Yu, Peiyu Zang et al. · Jan 22 · arXiv

AI programs called LLMs can now help write the tiny, super-fast pieces of code (kernels) that make GPUs run AI models efficiently.

#LLM-driven kernel generation · #GPU kernels · #CUDA

HeartMuLa: A Family of Open Sourced Music Foundation Models

Intermediate
Dongchao Yang, Yuxin Xie et al. · Jan 15 · arXiv

HeartMuLa is a family of open-source music AI models that can understand and generate full songs with clear lyrics and strong musical structure.

#music generation · #audio tokenizer · #residual vector quantization

MHLA: Restoring Expressivity of Linear Attention via Token-Level Multi-Head

Intermediate
Kewei Zhang, Ye Huang et al. · Jan 12 · arXiv

Transformers are powerful but slow because regular self-attention compares every token with every other token, which grows too fast for long sequences.

#Multi-Head Linear Attention · #Linear Attention · #Self-Attention
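
The quadratic cost is easy to see in code. The sketch below contrasts standard softmax attention, which builds an n-by-n score matrix, with a generic kernelized linear attention that reorders the matrix products so cost grows linearly in sequence length; this is the textbook linear-attention trick, not MHLA's specific multi-head formulation.

```python
# Minimal sketch (illustrative, not MHLA itself): softmax attention builds an
# n x n score matrix, while kernelized linear attention reassociates the matmuls
# so cost grows linearly in sequence length.
import numpy as np

def softmax_attention(q, k, v):
    scores = q @ k.T / np.sqrt(q.shape[-1])       # (n, n)  -> quadratic in n
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def linear_attention(q, k, v, phi=lambda x: np.maximum(x, 0) + 1e-6):
    # phi is a positive feature map (ELU+1 style); other choices are possible.
    qf, kf = phi(q), phi(k)
    kv = kf.T @ v                                  # (d, d)  -> independent of n
    z = qf @ kf.sum(axis=0)                        # (n,)    per-query normalizer
    return (qf @ kv) / z[:, None]

if __name__ == "__main__":
    n, d = 512, 64
    q, k, v = (np.random.randn(n, d) for _ in range(3))
    print(softmax_attention(q, k, v).shape, linear_attention(q, k, v).shape)
```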

FocusUI: Efficient UI Grounding via Position-Preserving Visual Token Selection

Intermediate
Mingyu Ouyang, Kevin Qinghong Lin et al. · Jan 7 · arXiv

FocusUI makes computer-using AI faster and still accurate by looking only at the important parts of a screen.

#UI grounding · #vision-language models · #visual token pruning
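
A hedged sketch of position-preserving token selection (the scoring function and ratio here are assumptions, not FocusUI's method): keep only the highest-scoring screen patches but remember each kept patch's original index so layout information survives.

```python
# Minimal sketch (assumed): keep only the top-scoring visual patches of a screen,
# but retain each kept patch's original position index so layout survives.
import numpy as np

def prune_visual_tokens(patch_features, importance, keep_ratio=0.25):
    """patch_features: (H*W, d); importance: (H*W,) relevance score per patch."""
    n = patch_features.shape[0]
    k = max(1, int(n * keep_ratio))
    kept = np.argsort(importance)[-k:]            # indices of the k most important patches
    kept.sort()                                   # preserve original scan order / positions
    return patch_features[kept], kept             # features plus their position indices

if __name__ == "__main__":
    H, W, d = 16, 16, 32
    feats = np.random.randn(H * W, d)
    scores = np.random.rand(H * W)                # stand-in for a learned relevance score
    pruned, positions = prune_visual_tokens(feats, scores)
    print(pruned.shape, positions[:5])            # (64, 32) plus preserved positions
```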

InfiniteVGGT: Visual Geometry Grounded Transformer for Endless Streams

Intermediate
Shuai Yuan, Yantai Yang et al. · Jan 5 · arXiv

InfiniteVGGT is a streaming 3D vision system that can keep working forever on live video without running out of memory.

#InfiniteVGGT · #rolling memory · #causal attention
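
A toy illustration of a rolling memory (not InfiniteVGGT's actual mechanism): a fixed-capacity buffer that drops the oldest entries, so an endless stream never grows the model's memory.

```python
# Minimal sketch (assumed): a fixed-size rolling memory for a streaming model,
# so an endless stream of frames never grows the cache past `capacity`.
from collections import deque

class RollingMemory:
    def __init__(self, capacity: int):
        self.entries = deque(maxlen=capacity)   # oldest entries are dropped automatically

    def add(self, frame_features):
        self.entries.append(frame_features)

    def context(self):
        return list(self.entries)               # what attention would read from

if __name__ == "__main__":
    memory = RollingMemory(capacity=4)
    for t in range(10):                          # pretend: 10 incoming video frames
        memory.add(f"features(frame {t})")
    print(memory.context())                      # only the last 4 frames remain
```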

CASA: Cross-Attention via Self-Attention for Efficient Vision-Language Fusion

Intermediate
Moritz Böhle, Amélie Royer et al. · Dec 22 · arXiv

CASA is a new way to mix images and text inside a language model that keeps speed and memory low while keeping accuracy high.

#CASA · #cross-attention · #self-attention

Kling-Omni Technical Report

Intermediate
Kling Team, Jialu Chen et al. · Dec 18 · arXiv

Kling-Omni is a single, unified model that can understand text, images, and videos together and then make or edit high-quality videos from those mixed instructions.

#multimodal visual language · #MVL · #prompt enhancer

Improving Recursive Transformers with Mixture of LoRAs

Intermediate
Mohammadmahdi Nouriborji, Morteza Rohanian et al. · Dec 14 · arXiv

Recursive transformers save memory by reusing the same layer over and over, but that makes them less expressive and hurts accuracy.

#Mixture of LoRAs · #recursive transformers · #parameter sharing
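
A minimal sketch of the combination (illustrative only, not the paper's architecture): one shared weight matrix is reused at every recursion step, but each step adds its own low-rank LoRA delta so the repeated passes are no longer identical.

```python
# Minimal sketch (illustrative): one shared weight matrix is reused for every
# recursion step, but each step adds its own low-rank LoRA update.
import numpy as np

rng = np.random.default_rng(0)
d, rank, steps = 64, 8, 4

W_shared = rng.standard_normal((d, d)) / np.sqrt(d)            # reused every step
loras = [(rng.standard_normal((d, rank)) * 0.01,                # A_i
          rng.standard_normal((rank, d)) * 0.01)                # B_i
         for _ in range(steps)]

def recursive_forward(x):
    h = x
    for A, B in loras:                        # same base layer, per-step LoRA delta
        W_step = W_shared + A @ B
        h = np.tanh(h @ W_step)               # stand-in for a full transformer block
    return h

if __name__ == "__main__":
    x = rng.standard_normal((2, d))           # batch of 2 token vectors
    print(recursive_forward(x).shape)         # (2, 64)
```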