Proact-VL is an AI that talks over live video and knows not only what to say but also when to say it, like a great sports commentator.
dLLM is a single, open-source toolbox that standardizes how diffusion language models are trained, run, and tested.
The paper studies Mamba-2 (a fast, linear-time alternative to standard attention) and pares it down to the pieces that truly boost accuracy.
Voxtral Realtime is a speech-to-text model that types what you say almost instantly, while keeping accuracy close to the best offline systems.
The paper fixes a common problem in video world models: scenes slowly change or “drift” when the camera moves and comes back.
The paper fixes a big problem in long video generation: models either forget what happened or slowly drift off-topic over time.
Long texts make language models slow because they must keep and re-check a huge memory called the KV cache for every new word they write.
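A generic sketch (not tied to any one paper above) of why that memory hurts: the cache stores a key and a value for every layer, head, and past token, so it grows linearly with text length, and each new token must attend over all of it. The model shape below is a hypothetical 7B-class configuration chosen only for illustration.

```python
def kv_cache_bytes(num_layers, num_heads, head_dim, seq_len, bytes_per_value=2):
    """Memory held by the KV cache: one key and one value per layer,
    head, and past token (fp16 -> 2 bytes per stored number)."""
    per_token = num_layers * num_heads * head_dim * 2  # 2 = key + value
    return per_token * seq_len * bytes_per_value

# Hypothetical 7B-class shape: 32 layers, 32 heads, 128-dim heads.
short = kv_cache_bytes(32, 32, 128, 1_000)     # ~0.5 GB at 1k tokens
long = kv_cache_bytes(32, 32, 128, 100_000)    # ~52 GB at 100k tokens
print(long // short)  # 100: cache memory scales linearly with length
```

Run it and the ratio comes out to exactly 100, which is why context length, not model size, is often what exhausts GPU memory during generation.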
Multi-agent LLM systems often use LoRA adapters so each agent has a special role, but they all rebuild almost the same KV cache, wasting memory and time.
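The waste is easy to see with a toy prefix cache (this is a generic illustration of shared-prefix caching, not the paper's actual mechanism; the class and names here are made up): if every agent's prompt starts with the same base text, one cached entry can serve all of them.

```python
class PrefixKVCache:
    """Toy cache keyed by the shared prompt prefix; the string value
    stands in for the real key/value tensors."""
    def __init__(self):
        self.store = {}
        self.computations = 0  # how many caches we actually built

    def get(self, prompt, shared_prefix_len):
        prefix = prompt[:shared_prefix_len]
        if prefix not in self.store:
            self.computations += 1
            self.store[prefix] = f"kv({prefix})"
        return self.store[prefix]

cache = PrefixKVCache()
base = "You are a helpful agent. Tools: search, code.\n"
for role in ["planner: ", "coder: ", "critic: "]:
    cache.get(base + role, shared_prefix_len=len(base))
print(cache.computations)  # 1: three agents reuse one shared-prefix cache
```

Without sharing, each of the three agents would rebuild the same base-prompt cache, tripling both the memory and the prefill compute.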
Robots used to copy actions from videos without truly understanding how the world changes, so they often messed up long, multi-step jobs.
This paper fixes a big problem in long video-making AIs where the video keeps snapping back to the beginning, like a movie stuck on rewind.
HERMES is a training-free way to make video-language models understand live, streaming video quickly and accurately.
The paper proposes Diffusion in Diffusion, a draft-then-revise method that brings back global coherence to fast, block-based diffusion language models.