This paper shows how to keep training a language model while it is solving one hard, real problem, so it can discover a single, truly great answer instead of many average ones.
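To make that idea concrete, here is a minimal, hypothetical sketch of such a test-time loop (the function names and the scoring rule are placeholders, not the paper's actual method): sample several attempts at the one problem, keep the best, briefly fine-tune on them, and try again.

```python
# Hypothetical sketch of single-problem test-time training, not the paper's method:
# keep updating the model on its own best attempts at one hard task.

def generate_candidates(model, problem, n=8):
    """Sample n candidate solutions from the current model (stub)."""
    return [f"candidate answer {i}" for i in range(n)]  # placeholder outputs

def score(candidate, problem):
    """Score a candidate, e.g. with unit tests or a verifier (stub)."""
    return len(candidate)  # placeholder reward

def finetune(model, examples):
    """One training step on the highest-scoring attempts (stub)."""
    return model  # in practice: a supervised or RL update

def solve_with_test_time_training(model, problem, rounds=5):
    best = None
    for _ in range(rounds):
        candidates = generate_candidates(model, problem)
        ranked = sorted(candidates, key=lambda c: score(c, problem), reverse=True)
        best = ranked[0]
        # Train on the top attempts so later rounds build on what worked.
        model = finetune(model, ranked[:2])
    return best  # one refined answer instead of many average ones

print(solve_with_test_time_training(model=None, problem="a hard, real problem"))
```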
FutureOmni is the first benchmark that tests if multimodal AI models can predict what happens next from both sound and video, not just explain what already happened.
CoDance is a new way to animate many characters in one picture using just one pose video, even if the picture and the video do not line up perfectly.
MoCha is a new AI that swaps a person in a video with a new character using only one mask on one frame and a few reference photos.
Large Vision-Language Models (LVLMs) look great on single images but often stumble when they must reason across multiple images.
MeepleLM is a special AI that reads a board game’s rulebook and pretends to be different kinds of players to give helpful, honest feedback.
RoboVIP is a plug-and-play tool that turns ordinary robot videos into many new, realistic, multi-view training videos without changing the original robot actions.
COMPASS is a new framework that turns a company’s rules into thousands of smart test questions to check if chatbots follow those rules.
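As a rough illustration of the rule-to-test idea (the helper names below are invented, not COMPASS's API), one can expand each written rule into probing questions and flag replies that break the rule:

```python
# Hedged sketch: turn each policy rule into probes, then audit the chatbot's replies.

POLICY_RULES = [
    "Never share customer account numbers.",
    "Always offer a human agent for billing disputes.",
]

def generate_probes(rule, n=3):
    """Stand-in for an LLM call that writes n test questions targeting one rule."""
    return [f"Probe {i + 1} for rule: {rule}" for i in range(n)]

def violates(rule, reply):
    """Stand-in for an automatic judge that flags rule violations."""
    return False

def audit(chatbot, rules=POLICY_RULES):
    failures = []
    for rule in rules:
        for probe in generate_probes(rule):
            reply = chatbot(probe)
            if violates(rule, reply):
                failures.append((rule, probe, reply))
    return failures

print(audit(lambda question: "I can't share that information."))
```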
This paper teaches video-language models to first find when the evidence appears in a video and then answer using that evidence, instead of mixing both steps together.
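A minimal way to picture that two-step recipe (the function names here are hypothetical, not the paper's interface): first localize the supporting segment, then answer using only that clip.

```python
# Illustrative two-stage pipeline: stage 1 finds when the evidence occurs,
# stage 2 answers from only that grounded clip.

def localize_evidence(video_frames, question):
    """Return (start, end) frame indices of the segment that supports the answer (stub)."""
    return 10, 25  # placeholder: in practice a grounding model predicts this

def answer_from_clip(clip_frames, question):
    """Answer the question using only the grounded clip (stub)."""
    return f"answer based on {len(clip_frames)} evidence frames"

def grounded_qa(video_frames, question):
    start, end = localize_evidence(video_frames, question)      # step 1: when
    answer = answer_from_clip(video_frames[start:end], question)  # step 2: what
    return answer, (start, end)

frames = list(range(100))  # stand-in for decoded video frames
print(grounded_qa(frames, "What causes the spill?"))
```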
The paper teaches a video model to squeeze long video history into a tiny memory while still keeping sharp details in single frames.
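One crude way to picture this split (a sketch under assumed sizes, not the paper's architecture) is a small fixed set of memory slots that absorb all past frames, kept alongside the full-detail current frame:

```python
# Minimal sketch: a tiny fixed-size memory for the whole history, plus the
# uncompressed latest frame. Slot count and feature size are made-up numbers.
import numpy as np

MEMORY_SLOTS = 8   # tiny budget for the entire past
FEATURE_DIM = 64   # per-frame feature size

def update_memory(memory, counts, frame_feat, t):
    """Fold frame t's features into one slot by running average
    (a crude stand-in for a learned compression module)."""
    slot = t % MEMORY_SLOTS
    counts[slot] += 1
    memory[slot] += (frame_feat - memory[slot]) / counts[slot]
    return memory, counts

def encode_video(frame_feats):
    """Return (compressed history, sharp current frame)."""
    memory = np.zeros((MEMORY_SLOTS, FEATURE_DIM))
    counts = np.zeros(MEMORY_SLOTS)
    for t, feat in enumerate(frame_feats[:-1]):
        memory, counts = update_memory(memory, counts, feat, t)
    return memory, frame_feats[-1]

history, current = encode_video(np.random.randn(500, FEATURE_DIM))
print(history.shape, current.shape)  # (8, 64) for 499 old frames vs (64,) for the latest one
```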
Transparent and shiny objects confuse normal depth cameras, but video diffusion models have already learned how light bends and reflects through them.
Robots often get confused on long, multi-step tasks when they only see the final goal image and try to guess the next move directly.