Multi-step RAG systems often struggle with long documents because their memory is just a pile of isolated facts, not a connected understanding.
This paper teaches a video model to squeeze a long video history into a tiny memory while still keeping sharp details in individual frames.
This paper makes diffusion-based video super-resolution (VSR) practical for live, low-latency use by removing the need for future frames and cutting denoising from ~50 steps down to just 4.
This paper teaches AI to write strong research plans by letting it grade its own work against checklists (rubrics) pulled from real scientific papers.
Transparent and shiny objects confuse normal depth cameras, but video diffusion models have already learned how light bends through and reflects off them.
This paper introduces Web World Models (WWMs), a way to build huge, explorable worlds by putting strict rules in code and letting AI write the fun details.
This paper shows how a language model can keep learning while you use it, so it handles very long inputs without slowing down.
This paper teaches AI helpers to browse the web more like people do, not just by grabbing static snippets.
This paper introduces OmniAgent, a smart video-and-audio detective that actively decides when to listen and when to look.
LiveTalk turns slow, many-step video diffusion into a fast, 4-step, real-time system for talking avatars that listen, think, and respond with synchronized video.
ProGuard is a safety guard for text and images that doesn’t just spot known problems; it can also recognize and name new, never-before-seen risks.
Robots often get confused on long, multi-step tasks when they only see the final goal image and try to guess the next move directly.