Agent-style LLMs chat with tools over many short turns, so most tokens are repeats and the system spends more time fetching old memories (KV-Cache) than computing new answers.
VoxServe is a new serving system that makes voice AIs respond fast and smoothly when streaming audio to users.
Autoregressive (AR) models write one word at a time, which is accurate but slow, especially when your computer or GPU canβt keep many tasks in memory at once.