The paper fixes a big problem in long video generation: models either forget what happened or slowly drift off-topic over time.
Long texts make language models slow because, for every new word they write, they must store and re-read a growing memory called the KV cache.
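To make that cost concrete, here is a toy numpy sketch (not any specific paper's method) of decoding with a KV cache: every new token appends one row to the cache and then attends over the entire cache, so per-token work grows with context length.

```python
import numpy as np

d = 64                          # head dimension (toy size)
k_cache = np.empty((0, d))      # cached keys: one row per past token
v_cache = np.empty((0, d))      # cached values

def decode_step(q, k, v):
    """One decoding step: append to the cache, then attend over ALL of it."""
    global k_cache, v_cache
    k_cache = np.vstack([k_cache, k])      # cache grows by one row per token
    v_cache = np.vstack([v_cache, v])
    scores = k_cache @ q / np.sqrt(d)      # O(context length) work, every step
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v_cache               # attention output for this token

for _ in range(1000):                      # by token 1000, each step re-reads
    q, k, v = np.random.randn(3, d)        # a 1000-row cache
    decode_step(q, k, v)
```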
Multi-agent LLM systems often use LoRA adapters to give each agent a specialized role, but every agent rebuilds a nearly identical KV cache, wasting memory and time.
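A hypothetical sketch of the prefix-cache sharing this observation suggests; build_prefix_cache and run_agent are illustrative names, not the paper's API. The shared context is encoded once, and every agent reuses it, paying only for its role-specific suffix.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def build_prefix_cache(shared_prompt: str) -> tuple:
    # Stand-in for one expensive pass of the base model over the shared context.
    print(f"encoding {len(shared_prompt)} chars of shared context (done once)")
    return ("kv-cache-for", shared_prompt)

def run_agent(role: str, shared_prompt: str, task: str) -> str:
    cache = build_prefix_cache(shared_prompt)   # cache hit after the first agent
    # Only the short role-specific suffix would need fresh computation here.
    return f"[{role}] handled {task!r} reusing {cache[0]}"

context = "...long shared project brief..."
for role in ["planner", "coder", "critic"]:
    print(run_agent(role, context, "summarize the design"))
```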
Robots used to copy actions from videos without truly understanding how the world changes, so they often failed at long, multi-step tasks.
This paper fixes a big problem in long video-making AIs where the video keeps snapping back to the beginning, like a movie stuck on rewind.
HERMES is a training-free way to make video-language models understand live, streaming video quickly and accurately.
The paper proposes Diffusion in Diffusion, a draft-then-revise method that brings back global coherence to fast, block-based diffusion language models.
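A schematic sketch of the draft-then-revise pattern; draft_block and revise_globally are placeholder functions standing in for the fast block-level diffusion pass and the slower global refinement pass.

```python
def draft_block(context: list[str]) -> list[str]:
    """Fast block-local pass: propose the next block of tokens (toy size 4)."""
    return [f"tok{len(context) + i}" for i in range(4)]

def revise_globally(draft: list[str]) -> list[str]:
    """Slower second pass over the whole draft to restore global coherence."""
    return list(draft)   # a real reviser would re-denoise all tokens jointly

draft: list[str] = []
for _ in range(3):                  # stage 1: cheap block-by-block drafting
    draft += draft_block(draft)
final = revise_globally(draft)      # stage 2: one global revision pass
print(final)
```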
This paper introduces PCED, a way to use many documents as separate 'experts' in parallel so an AI can stitch answers together without stuffing everything into one giant prompt.
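A hypothetical sketch of the parallel-experts idea; answer_with_doc and merge_answers stand in for real model calls. Each document is consulted independently, and a final step stitches the partial answers together.

```python
from concurrent.futures import ThreadPoolExecutor

def answer_with_doc(question: str, doc: str) -> str:
    # Stand-in for prompting the model with ONE document at a time.
    return f"partial answer to {question!r} from {doc}"

def merge_answers(partials: list[str]) -> str:
    # Stand-in for a final aggregation call that stitches partials together.
    return " | ".join(partials)

docs = ["doc_a.txt", "doc_b.txt", "doc_c.txt"]
question = "What changed in v2?"
with ThreadPoolExecutor() as pool:          # each 'expert' runs in parallel
    partials = list(pool.map(lambda d: answer_with_doc(question, d), docs))
print(merge_answers(partials))              # no single giant prompt needed
```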
This paper shows how to get strong text embeddings from decoder-only language models without any training.
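One common training-free recipe is to mean-pool the model's last hidden states over non-padding tokens; whether this matches the paper's exact pooling strategy is an assumption. The sketch below uses Hugging Face transformers with GPT-2 as a stand-in decoder-only model.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token                # GPT-2 has no pad token by default
model = AutoModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def embed(texts: list[str]) -> torch.Tensor:
    batch = tok(texts, padding=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state        # (batch, seq, dim)
    mask = batch["attention_mask"].unsqueeze(-1)     # ignore padding positions
    summed = (hidden * mask).sum(dim=1)
    return summed / mask.sum(dim=1)                  # mean over real tokens

vecs = embed(["a cat sat on the mat", "feline resting on a rug"])
sim = torch.nn.functional.cosine_similarity(vecs[0], vecs[1], dim=0)
print(float(sim))                                    # related texts score higher
```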
SpotEdit is a training-free way to edit only the parts of an image that actually change, instead of re-generating the whole picture.
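A generic sketch of mask-guided partial editing, not SpotEdit's actual pipeline: generator output is kept inside the edit mask, and original pixels are kept everywhere else.

```python
import numpy as np

def edit_where_changed(original: np.ndarray, edited: np.ndarray,
                       mask: np.ndarray) -> np.ndarray:
    """Blend: generator output inside the mask, original pixels outside."""
    mask = mask[..., None].astype(original.dtype)    # (H, W) -> (H, W, 1)
    return mask * edited + (1.0 - mask) * original

h, w = 64, 64
original = np.random.rand(h, w, 3)
edited = np.random.rand(h, w, 3)                     # stand-in for model output
mask = np.zeros((h, w))
mask[16:32, 16:32] = 1.0                             # only this region changes
out = edit_where_changed(original, edited, mask)
assert np.allclose(out[0, 0], original[0, 0])        # untouched pixels preserved
```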
HiStream makes 1080p video generation much faster by removing redundant computation across space, time, and denoising steps.
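A toy sketch of the general reuse pattern (recompute a block only when its inputs change); this is a generic memoization illustration, not HiStream's actual caching schedule.

```python
from typing import Callable

cache: dict = {}

def maybe_recompute(block_id: str, inputs: tuple, fn: Callable):
    """Recompute a block only if its inputs changed since the last step."""
    key = (block_id, inputs)
    if key not in cache:            # unchanged regions/steps served from cache
        cache[key] = fn(*inputs)
    return cache[key]

# Across two denoising steps, the static background is computed only once.
maybe_recompute("background", (1, 2), lambda a, b: a + b)   # computed
maybe_recompute("background", (1, 2), lambda a, b: a + b)   # cache hit
```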
Kling-Omni is a single, unified model that can understand text, images, and videos together and then make or edit high-quality videos from those mixed instructions.