RoboVIP is a plug-and-play tool that turns ordinary robot videos into many new, realistic, multi-view training videos without changing the original robot actions.
COMPASS is a new framework that turns a company’s rules into thousands of smart test questions to check if chatbots follow those rules.
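To make the idea concrete, here is a minimal sketch of what turning rules into test questions could look like. The names (`ask_llm`, `generate_probes`, `audit`) and the prompts are illustrative assumptions, not COMPASS's actual API:

```python
# Hypothetical sketch: expand each written policy rule into many probing
# test questions, then check a chatbot's answers against the rule.

def ask_llm(prompt: str) -> str:
    """Placeholder for a call to any chat-completion LLM API."""
    raise NotImplementedError

def generate_probes(rule: str, n: int = 5) -> list[str]:
    # Ask the LLM for n user messages that tempt a chatbot to break `rule`.
    prompt = (
        f"Write {n} distinct user messages that would tempt a customer "
        f"support chatbot to violate this policy rule:\n{rule}\n"
        "Return one message per line."
    )
    return [line.strip() for line in ask_llm(prompt).splitlines() if line.strip()]

def audit(chatbot, rules: list[str]) -> dict[str, float]:
    # Send the probes to the chatbot, then ask a judge LLM whether each
    # reply complied. Returns a per-rule compliance rate.
    scores = {}
    for rule in rules:
        probes = generate_probes(rule)
        passed = 0
        for probe in probes:
            reply = chatbot(probe)
            verdict = ask_llm(
                f"Rule: {rule}\nUser: {probe}\nBot: {reply}\n"
                "Does the bot's reply comply with the rule? Answer yes or no."
            )
            passed += verdict.strip().lower().startswith("yes")
        scores[rule] = passed / max(len(probes), 1)
    return scores
```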
This paper teaches video-language models to first find the moment in a video where the evidence appears, and then answer using that evidence, instead of mixing both steps together.
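The two-step split is easy to picture in code. This sketch assumes hypothetical `model.ground` and `model.answer` calls (illustrative names, not the paper's interface):

```python
# Hypothetical sketch of "ground first, then answer": locate the video
# segment that contains the evidence, then answer using only that segment.

def answer_with_evidence(model, video_frames, question, fps=2.0):
    # Step 1: temporal grounding -- ask the model which time span holds
    # the evidence for this question (assumed call, not a real API).
    start_s, end_s = model.ground(video_frames, question)

    # Step 2: answer using only the grounded clip, so the answer is tied
    # to the evidence instead of the whole video.
    lo, hi = int(start_s * fps), int(end_s * fps)
    evidence_clip = video_frames[lo:hi]
    answer = model.answer(evidence_clip, question)
    return answer, (start_s, end_s)
```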
This paper teaches a video model to squeeze long video history into a tiny memory while still keeping sharp details in individual frames.
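One common way to get this behavior is a small, fixed-size memory that summarizes past frames while the current frame keeps its full detail. A toy sketch under that assumption (not the paper's actual architecture):

```python
import numpy as np

# Toy sketch: a fixed-size memory vector summarizes all past frame
# features, while the current frame keeps its full feature map.

def update_memory(memory: np.ndarray, frame_feat: np.ndarray,
                  decay: float = 0.9) -> np.ndarray:
    # Exponential moving average: old history fades, the new frame is
    # folded in. The memory stays the same small size no matter how
    # long the video gets.
    return decay * memory + (1.0 - decay) * frame_feat.mean(axis=0)

mem = np.zeros(256)                     # compact history summary
for _ in range(1000):                   # long video, constant memory cost
    frame = np.random.randn(196, 256)   # e.g., 14x14 patch tokens per frame
    mem = update_memory(mem, frame)
    # `frame` itself stays full-resolution for per-frame detail;
    # `mem` carries the long-range context.
```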
Transparent and shiny objects confuse normal depth cameras, but video diffusion models have already learned how light bends through and reflects off them.
Robots often get confused on long, multi-step tasks when they only see the final goal image and try to guess the next move directly.
JavisGPT is a single AI that can both understand videos with sound (audio and video together) and create new ones where the sound and picture stay in sync.
Large Multimodal Models (LMMs) are great at reading text and looking at pictures, but they usually do most of their thinking in words, which limits deep visual reasoning.
StoryMem is a new way to make minute‑long, multi‑shot videos that keep the same characters, places, and style across many clips.
ReCo is a new way to edit videos just by telling the computer in words what to change, with no extra masks needed.
InsertAnywhere is a two-stage system that lets you add a new object into any video so it looks like it was always there.
AniX is a system that lets you place any character into any 3D world and control them with plain language, like “run forward” or “play a guitar.”