Papers (3)

#throughput

DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference

Yongtong Wu, Shaoyuan Chen et al. · Feb 25 · arXiv

Agent-style LLMs chat with tools over many short turns, so most tokens are repeats and the system spends more time fetching old memories (KV-Cache) than computing new answers.

#KV-Cache #prefill-decode disaggregation #dual-path loading

VoxServe: Streaming-Centric Serving System for Speech Language Models

Intermediate

Keisuke Kamahori, Wei-Tzu Lee et al. · Jan 30 · arXiv

VoxServe is a new serving system that makes voice AIs respond fast and smoothly when streaming audio to users.

#Speech Language Models #streaming #Time-To-First-Audio

Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed

Intermediate

Yonggan Fu, Lexington Whalen et al. · Dec 16 · arXiv

Autoregressive (AR) models write one word at a time, which is accurate but slow, especially when your computer or GPU can’t keep many tasks in memory at once.

#diffusion language models #autoregressive models #AR-to-dLM conversion
