Multi-step RAG systems often struggle with long documents because their memory is just a pile of isolated facts, not a connected understanding.
The paper teaches a video model to squeeze long video history into a tiny memory while still keeping sharp details in single frames.
This paper makes diffusion-based video super-resolution (VSR) practical for live, low-latency use by removing the need for future frames and cutting denoising from ~50 steps down to just 4.
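The two changes named above (no future frames, 4 denoising steps instead of ~50) can be sketched structurally. This is a hedged, generic illustration, not the paper's architecture: `denoise_step`, `stream_vsr`, and the context length are all hypothetical placeholders; the point is the causal context and the short inner loop.

```python
import numpy as np

NUM_STEPS = 4  # few-step schedule, vs. ~50 in standard diffusion VSR

def denoise_step(x, context, t):
    """Placeholder for one denoiser call (hypothetical); a real model
    would predict and remove noise conditioned on past-frame context."""
    return 0.5 * x + 0.5 * context.mean(axis=0)

def stream_vsr(frames, context_len=2):
    outputs = []
    for i, frame in enumerate(frames):
        # Causal context: only frames already emitted -- the loop never
        # waits for future frames, which is what enables low latency.
        past = np.stack(outputs[-context_len:] or [frame])
        x = frame + np.random.default_rng(i).normal(scale=0.1, size=frame.shape)
        for t in range(NUM_STEPS):  # 4 denoising steps per frame
            x = denoise_step(x, past, t)
        outputs.append(x)
    return outputs

frames = [np.ones((4, 4)) * k for k in range(5)]
out = stream_vsr(frames)
print(len(out), out[0].shape)
```

Each frame costs 4 denoiser calls instead of ~50, so per-frame compute drops by roughly 12x, which is what makes live streaming use plausible.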
The paper teaches AI to write strong research plans by letting it grade its own work using checklists (rubrics) pulled from real scientific papers.
Transparent and shiny objects confuse normal depth cameras, but video diffusion models have already learned how light bends through and reflects off such surfaces.
This paper introduces Web World Models (WWMs), a way to build huge, explorable worlds by putting strict rules in code and letting AI write the fun details.
This paper shows how a language model can keep learning while you use it, so it handles very long inputs without slowing down.
This paper introduces OmniAgent, a smart video-and-audio detective that actively decides when to listen and when to look.
ProGuard is a safety guard for text and images that doesn’t just spot known problems—it can also recognize and name new, never-seen-before risks.
Robots often get confused on long, multi-step tasks when they see only the final goal image and must guess the next move directly.
Mixture-of-Experts (MoE) models use many small specialist networks (experts) and a router to pick which experts handle each token, but the router isn’t explicitly taught what each expert is good at.
MindWatcher is a smart AI agent that can think step by step and decide when to use tools like web search, image zooming, and a code calculator to solve tough, multi-step problems.