Memory-T1 teaches conversational AI agents to keep track of when things happened across many chat sessions.
Autoregressive (AR) image models make pictures by choosing tokens one-by-one, but they have typically been judged only on picking likely tokens, not on how good the final picture looks in pixels.
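To make “one token at a time” concrete, here is a minimal sketch of autoregressive sampling over a visual codebook. Everything here (the toy model, the codebook size, the `next_token_logits` method) is an illustrative assumption, not this paper's interface; a real system would decode the finished token sequence back into pixels with a VQ-style decoder.

```python
import torch

VOCAB = 1024  # size of the visual codebook (assumed for illustration)

class ToyARModel:
    """Stand-in for a real AR image model: returns random logits."""
    def next_token_logits(self, tokens):
        return torch.randn(VOCAB)

def ar_generate(model, seq_len=16, temperature=1.0):
    """Build an image as a token sequence, sampling one token at a time."""
    tokens = []
    for _ in range(seq_len):
        logits = model.next_token_logits(tokens) / temperature
        probs = torch.softmax(logits, dim=-1)
        tokens.append(torch.multinomial(probs, 1).item())
    return tokens  # likelihood training scores these tokens, never the decoded pixels

print(ar_generate(ToyARModel()))
```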
This paper builds a tough new test called O3-BENCH to check whether AI can truly think with images, not just spot objects.
Robust-R1 teaches vision-language models to notice how a picture is damaged, think through what that damage hides, and then answer as if the picture were clear.
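As a rough sketch of that detect-then-reason-then-answer recipe (every class and method name below is hypothetical, not Robust-R1's actual API):

```python
class ToyRobustVLM:
    """Stand-in illustrating the three-step recipe, not the paper's model."""
    def describe_degradation(self, image):
        return "heavy blur; lower-left corner occluded"  # step 1: name the damage

    def reason(self, question, image, damage):
        # step 2: think through what the damage could be hiding
        return f"Given '{damage}', the hidden region may matter for: {question}"

    def answer(self, question, image, reasoning):
        return "answer as if the picture were clear"     # step 3: degradation-aware answer

def robust_answer(vlm, image, question):
    damage = vlm.describe_degradation(image)
    reasoning = vlm.reason(question, image, damage)
    return vlm.answer(question, image, reasoning)

print(robust_answer(ToyRobustVLM(), image=None, question="What is on the table?"))
```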
Reasoning Palette gives a language or vision-language model a tiny hidden “mood” (a latent code) before it starts answering, so it chooses a smarter plan rather than just rolling dice on each next word.
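The general idea, sketched minimally below, is to sample one latent code up front and condition every subsequent token on it; the toy model and all names are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ToyPaletteModel(nn.Module):
    """Stand-in showing the shape of the idea."""
    def __init__(self, num_styles=8, dim=32):
        super().__init__()
        self.style_table = nn.Embedding(num_styles, dim)  # one vector per "mood"

    def generate(self, inputs):
        return inputs.mean(0)  # placeholder for real token-by-token decoding

def answer_with_palette(model, prompt_emb, num_styles=8):
    style = torch.randint(num_styles, (1,))      # roll the dice once, before answering
    style_emb = model.style_table(style)         # embed the code like an extra prefix token
    inputs = torch.cat([style_emb, prompt_emb])  # every output token sees the same code
    return model.generate(inputs)

out = answer_with_palette(ToyPaletteModel(), prompt_emb=torch.randn(5, 32))
```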
AuditDM is a friendly “auditor” model that hunts for where vision-language models go wrong and then generates targeted practice data to fix them.
AdaTooler-V teaches an image-and-video AI to first ask, “Do I really need a tool?” before using one, which saves time and boosts accuracy.
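A minimal sketch of that gate, assuming a hypothetical scoring head and tool interface (none of these names come from the paper):

```python
def answer(model, question, image, tool, threshold=0.5):
    """Only pay for the tool call when the model says it needs help."""
    if model.tool_need_score(question, image) > threshold:
        evidence = tool(image)                   # e.g. a zoom-in crop, OCR, frame sampler
        return model.respond(question, image, evidence)
    return model.respond(question, image, None)  # skip the tool: cheaper, often as accurate

class ToyVLM:
    """Stand-in with the two hypothetical methods the gate needs."""
    def tool_need_score(self, question, image):
        return 0.9 if "read" in question else 0.1  # pretend score: reading tasks need OCR

    def respond(self, question, image, evidence):
        return f"answer (evidence: {evidence})"

print(answer(ToyVLM(), "read the sign", image=None, tool=lambda img: "OCR text"))
```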
RePlan is a plan-then-execute system that first figures out exactly where to edit in a picture and then makes clean changes there.
Skyra is a detective-style AI that spots tiny visual mistakes (artifacts) to tell whether a video is real or AI-generated, and it explains its verdict by pointing to specific times and places in the video.
This paper teaches vision-language models to reason about pictures using puzzles instead of expensive human labels.
CRISP turns a normal phone video of a person into a clean 3D world and a virtual human that can move in it without breaking physics.
Robots usually learn by copying many demonstrations, which is expensive to collect and leaves them brittle when conditions change.