SenseNova-MARS is a vision-language model that can think step-by-step and use three tools—text search, image search, and image cropping—during its reasoning.
FIGR is a new way for AI to ‘think by drawing,’ using code to build clean, editable diagrams while it reasons.
Multimodal Large Language Models (MLLMs) often hallucinate on videos by trusting words and common sense more than what the frames really show.
GR-Dexter is a full package—new robot hands, a smart AI brain, and lots of carefully mixed data—that lets a two-handed robot follow language instructions to do long, tricky tasks.
GARDO is a new way to fine-tune text-to-image diffusion models with reinforcement learning without getting tricked by bad reward signals.
This paper teaches video-language models to first find when the proof happens in a video and then answer with that proof, instead of mixing both steps together.
Multi-step RAG systems often struggle with long documents because their memory is just a pile of isolated facts, not a connected understanding.
The paper teaches a video model to squeeze long video history into a tiny memory while still keeping sharp details in single frames.
This paper makes diffusion-based video super-resolution (VSR) practical for live, low-latency use by removing the need for future frames and cutting denoising from ~50 steps down to just 4.
The paper teaches AI to write strong research plans by letting it grade its own work using checklists (rubrics) pulled from real scientific papers.
Transparent and shiny objects confuse normal depth cameras, but video diffusion models already learned how light bends and reflects through them.
This paper introduces Web World Models (WWMs), a way to build huge, explorable worlds by putting strict rules in code and letting AI write the fun details.