Robust-R1 teaches vision-language models to notice how a picture is damaged, think through what that damage hides, and then answer as if the picture were clear.
Reasoning Palette gives a language or vision-language model a tiny hidden “mood” (a latent code) before it starts answering, so it chooses a smarter plan rather than just rolling dice on each next word.
AuditDM is a friendly “auditor” model that hunts for where vision-language models get things wrong and then generates targeted practice data to fix those mistakes.
AdaTooler-V teaches an image-and-video AI to first ask, “Do I really need a tool?” before using one, which saves time and boosts accuracy.
RePlan is a plan-then-execute system that first figures out exactly where to edit in a picture and then makes clean changes there.
Skyra is a detective-style AI that spots tiny visual mistakes (artifacts) to tell whether a video is real or AI-generated, and it explains its decision by pointing to specific times and places in the video.
This paper teaches vision-language models to reason about pictures using puzzles instead of expensive human labels.
CRISP turns a normal phone video of a person into a clean 3D world and a virtual human that can move in it without breaking physics.
Robots are usually trained by imitating many demonstrations, which is expensive to collect and leaves them brittle when conditions change.
SAGE is a smart video-watching agent that decides when to answer quickly and when to take multiple steps, much as people skim or rewind long videos.
Diffusion Preview is a two-step “preview-then-refine” workflow that shows you a fast draft image first and only spends full compute after you like the draft.
ShowTable is a new way for AI to turn a data table into a beautiful, accurate infographic using a think–make–check–fix loop.