GR-Dexter is a full package—new robot hands, a smart AI brain, and lots of carefully mixed data—that lets a two-handed robot follow language instructions to do long, tricky tasks.
GARDO is a new way to fine-tune text-to-image diffusion models with reinforcement learning without getting tricked by bad reward signals.
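The summary doesn't give GARDO's exact objective, so here is a minimal sketch of the standard defense it alludes to: penalizing drift from a frozen reference model so the fine-tuned model can't chase a flawed reward too far. All names and numbers are illustrative, not from the paper.

```python
import torch

def regularized_reward(reward, logp_new, logp_ref, beta=0.1):
    # Penalize drift from a frozen reference model. Keeping the fine-tuned
    # model close to the reference is a common guard against reward hacking.
    kl_estimate = logp_new - logp_ref   # simple per-sample KL estimate
    return reward - beta * kl_estimate

# Mock per-image values: reward-model scores and denoising log-probs.
reward = torch.tensor([0.9, 0.4])
logp_new = torch.tensor([-10.2, -11.0])   # under the fine-tuned model
logp_ref = torch.tensor([-10.8, -11.1])   # under the frozen base model
print(regularized_reward(reward, logp_new, logp_ref))
```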
This paper teaches video-language models to first find when the evidence appears in a video and then answer using that evidence, instead of mixing both steps together.
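As a hedged sketch of the two-stage idea (grounder and answerer here are hypothetical stand-ins, not the paper's models): locate the evidence window first, then answer from that clip alone.

```python
def answer_with_evidence(timed_frames, question, grounder, answerer):
    # Stage 1: localize the time window that contains the evidence.
    start, end = grounder(timed_frames, question)
    clip = [(t, f) for t, f in timed_frames if start <= t <= end]
    # Stage 2: answer from the grounded clip only, not the whole video.
    return answerer(clip, question), (start, end)

# Stub models so the sketch runs; real ones would be video-language models.
grounder = lambda frames, q: (2.0, 4.0)
answerer = lambda clip, q: f"answer based on {len(clip)} evidence frames"
frames = [(float(t), f"frame{t}") for t in range(8)]
print(answer_with_evidence(frames, "what happens?", grounder, answerer))
```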
Multi-step RAG systems often struggle with long documents because their memory is just a pile of isolated facts, not a connected understanding.
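To make the contrast concrete, here is a toy sketch (not the paper's design) of a memory that links facts through shared entities instead of keeping them as a flat list, so retrieval can follow connections.

```python
from collections import defaultdict

class LinkedMemory:
    """Toy memory that connects facts through shared entities,
    instead of storing them as an unordered pile."""
    def __init__(self):
        self.facts = []
        self.by_entity = defaultdict(set)   # entity -> fact indices

    def add(self, fact, entities):
        idx = len(self.facts)
        self.facts.append(fact)
        for e in entities:
            self.by_entity[e].add(idx)

    def related(self, entity):
        # One-hop traversal: every fact touching the entity.
        return [self.facts[i] for i in sorted(self.by_entity[entity])]

mem = LinkedMemory()
mem.add("Alice founded Acme in 2010", {"Alice", "Acme"})
mem.add("Acme acquired Beta in 2015", {"Acme", "Beta"})
print(mem.related("Acme"))   # both facts, connected via "Acme"
```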
The paper teaches a video model to squeeze long video history into a tiny memory while still keeping sharp details in individual frames.
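A minimal sketch of the general recipe this suggests, with made-up shapes and a simple mean-pool standing in for whatever compression the paper actually uses: distant frames collapse to one coarse token each, while recent frames keep full detail.

```python
import torch

def compress_history(frame_feats, keep_recent=4):
    """frame_feats: (T, N, D) patch tokens for T frames.
    Distant frames collapse to one token each (coarse memory);
    the most recent frames keep all N tokens (sharp detail)."""
    old, recent = frame_feats[:-keep_recent], frame_feats[-keep_recent:]
    coarse = old.mean(dim=1)                      # (T_old, D)
    sharp = recent.reshape(-1, recent.shape[-1])  # (keep_recent * N, D)
    return torch.cat([coarse, sharp], dim=0)

feats = torch.randn(64, 256, 768)   # 64 frames, 256 patches, dim 768
mem = compress_history(feats)
print(mem.shape)   # ~1084 tokens instead of 64 * 256 = 16384
```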
This paper makes diffusion-based video super-resolution (VSR) practical for live, low-latency use by removing the need for future frames and cutting denoising from ~50 steps down to just 4.
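The loop structure this implies can be sketched in a few lines; the denoiser below is a stub, and the real model's conditioning surely differs. Both ingredients are visible: each frame sees only past frames, and the inner loop runs just 4 steps.

```python
import torch

def stream_vsr(lr_frames, denoise_step, num_steps=4, context=2):
    """Causal streaming VSR: no future frames, few denoising steps."""
    outputs = []
    for t, lr in enumerate(lr_frames):
        past = lr_frames[max(0, t - context):t]   # past frames only
        x = torch.randn_like(lr)                  # start from noise
        for step in reversed(range(num_steps)):   # 4 steps, not ~50
            x = denoise_step(x, lr, past, step)
        outputs.append(x)
    return outputs

# Stub denoiser just to make the sketch runnable.
fake_step = lambda x, lr, past, s: 0.5 * x + 0.5 * lr
frames = [torch.randn(3, 64, 64) for _ in range(5)]
sr = stream_vsr(frames, fake_step)
print(len(sr), sr[0].shape)
```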
The paper teaches AI to write strong research plans by letting it grade its own work using checklists (rubrics) pulled from real scientific papers.
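Here is a toy version of rubric-based grading (the criteria below are keyword checks purely for illustration; the paper presumably uses model-judged criteria mined from real papers). The resulting score can serve as a training reward.

```python
def rubric_score(plan_text, rubric):
    """Score a research plan against a checklist of yes/no criteria;
    the mean pass rate can act as a reward signal."""
    checks = [criterion(plan_text) for criterion in rubric]
    return sum(checks) / len(checks)

# Toy rubric: real criteria would be judged by an LLM, not by keywords.
rubric = [
    lambda p: "baseline" in p.lower(),   # compares against baselines?
    lambda p: "ablation" in p.lower(),   # plans ablations?
    lambda p: "dataset" in p.lower(),    # names evaluation data?
]
plan = "We evaluate on the XYZ dataset, compare to two baselines, and run ablations."
print(rubric_score(plan, rubric))   # 1.0
```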
Transparent and shiny objects confuse normal depth cameras, but video diffusion models have already learned how light bends through and reflects off them.
This paper introduces Web World Models (WWMs), a way to build huge, explorable worlds by putting strict rules in code and letting AI write the fun details.
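The division of labor is easy to sketch: the hard rules live in ordinary code, and a model only writes flavor text. Everything below is a toy stand-in, not the WWM implementation.

```python
# Hard rules live in code: the model can never break them.
WORLD = {"rooms": {"cave": {"exits": ["forest"]},
                   "forest": {"exits": ["cave"]}}}

def move(state, direction):
    room = WORLD["rooms"][state["room"]]
    if direction not in room["exits"]:
        return state, "You can't go that way."   # rule enforced in code
    state = {**state, "room": direction}
    return state, describe(state)                # flavor text from the model

def describe(state):
    # Stand-in for an LLM call that writes the "fun details";
    # only the description is generated, never the world rules.
    return f"You arrive at the {state['room']}. (description would be LLM-written)"

state = {"room": "cave"}
state, text = move(state, "forest")
print(text)
```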
This paper shows how a language model can keep learning while you use it, so it handles very long inputs without slowing down.
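One family of methods that fits this description is test-time training: a fixed-size memory is updated as tokens stream in, so per-token cost stays flat however long the input grows. A toy sketch (the update rule is illustrative, not necessarily the paper's):

```python
import torch

def run_stream(tokens, W, lr=0.1):
    """Toy test-time learning: a fixed-size memory matrix W is updated
    while the input streams in, so cost per token stays constant."""
    for x in tokens:                        # x: (d,) current token embedding
        pred = W @ x                        # read from memory
        err = x - pred                      # surprise signal
        W = W + lr * torch.outer(err, x)    # gradient-like memory update
    return W

W = torch.zeros(16, 16)
stream = [torch.randn(16) for _ in range(1000)]   # arbitrarily long input
W = run_stream(stream, W)
print(W.shape)   # memory size never grows
```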
This paper introduces OmniAgent, a smart video-and-audio detective that actively decides when to listen and when to look.
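A hedged sketch of the active-perception loop (policy, listen, and look are hypothetical stubs, not OmniAgent's components): the agent picks one sense per step and stops when it has enough evidence, instead of processing all audio and video upfront.

```python
def investigate(question, clip, listen, look, policy, max_steps=4):
    """Toy active perception: choose which sense to query next."""
    notes = []
    for _ in range(max_steps):
        action = policy(question, notes)
        if action == "listen":
            notes.append(listen(clip, notes))   # e.g. transcribe a segment
        elif action == "look":
            notes.append(look(clip, notes))     # e.g. caption some frames
        else:                                   # enough evidence gathered
            break
    return notes

# Stubs so the loop runs; real versions would call audio/vision models.
listen = lambda clip, notes: "heard: glass breaking"
look = lambda clip, notes: "saw: a dropped bottle"
policy = lambda q, notes: ["listen", "look"][len(notes)] if len(notes) < 2 else "answer"
print(investigate("what broke?", "clip.mp4", listen, look, policy))
```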
ProGuard is a safety guard for text and images that doesn’t just spot known problems—it can also recognize and name new, never-seen-before risks.
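As a toy illustration only (keyword cues and stubs, nothing from ProGuard itself): match known risk categories first, then route unmatched-but-unsafe inputs to a model that names the new risk instead of silently passing it.

```python
KNOWN_RISKS = {
    "violence": ["weapon", "fight"],
    "fraud": ["phishing", "scam"],
}

def guard(text, unsafety_score, name_new_risk, threshold=0.5):
    """Toy open-set guard: known categories by cue matching; anything
    unsafe that matches none of them gets a freshly generated name."""
    low = text.lower()
    for category, cues in KNOWN_RISKS.items():
        if any(cue in low for cue in cues):
            return category
    if unsafety_score(low) > threshold:
        return name_new_risk(low)   # e.g. an LLM writes the new label
    return "safe"

# Stubs so the sketch runs; real versions would be learned models.
unsafety_score = lambda t: 0.9 if "deepfake" in t else 0.0
name_new_risk = lambda t: "novel-risk: deepfake impersonation"
print(guard("tutorial for deepfake voice impersonation",
            unsafety_score, name_new_risk))
```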