Long Video Understanding (LVU) is hard because the important clues are tiny, far apart in time, and buried in hours of mostly unimportant footage.
Reinforcement learning (RL) can make large language models smarter, but off-policy training often pushes updates too far from the "safe zone" around the policy that generated the data, destabilizing learning.
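
To make that failure mode concrete, here is a minimal sketch of the standard remedy, PPO-style importance-ratio clipping, which caps how far each update can stray from the data-generating policy. This illustrates the general idea, not this paper's specific fix.

```python
# Minimal sketch of PPO-style ratio clipping: the textbook way to keep
# off-policy updates inside a "safe zone" around the policy that
# generated the data. Illustrative only, not the paper's method.
import torch

def clipped_policy_loss(logp_new: torch.Tensor,
                        logp_old: torch.Tensor,
                        advantages: torch.Tensor,
                        eps: float = 0.2) -> torch.Tensor:
    ratio = torch.exp(logp_new - logp_old)  # importance weight pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # The elementwise min makes the objective pessimistic: moves far from
    # the old policy earn no extra credit, which stabilizes training.
    return -torch.min(unclipped, clipped).mean()
```
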
ProPhy is a new two-step method that helps video generators follow real-world physics, not just make pretty pictures.
BEAVER is a new way to check, with formal guarantees, how likely a language model is to give answers that obey important rules.
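
As a rough illustration of what such a guarantee can look like (not BEAVER's actual algorithm), an exact binomial lower bound turns sampled model outputs into a rule-compliance rate that holds with a chosen confidence:

```python
# Illustration only: an exact (Clopper-Pearson) lower confidence bound
# on a model's rule-compliance rate, estimated from samples. This shows
# the flavor of a certified check; BEAVER's own procedure may differ.
from scipy.stats import beta

def compliance_lower_bound(num_compliant: int, num_samples: int,
                           confidence: float = 0.95) -> float:
    """With probability >= `confidence`, the true compliance rate is at
    least the returned value (one-sided exact binomial bound)."""
    if num_compliant == 0:
        return 0.0
    alpha = 1.0 - confidence
    return float(beta.ppf(alpha, num_compliant,
                          num_samples - num_compliant + 1))

# e.g. 970 rule-obeying answers out of 1000 sampled responses:
print(compliance_lower_bound(970, 1000))  # ~0.96 at 95% confidence
```
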
This paper argues that the fastest and safest path to super-smart AI is for humans and AIs to improve together, not for AI to improve alone.
SpaceControl lets you steer a powerful 3D generator with simple shapes you draw, without retraining the model.
This paper builds TAD, a new benchmark that tests whether AI models understand how events unfold over time in real driving videos.
This paper teaches a computer to turn a single picture into a moving 3D scene that stays consistent from every camera angle.
ARBITRAGE speeds up AI on step-by-step problems by calling the big, slow model only when it is predicted to truly help.
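
A hypothetical sketch of that routing idea (the model names and the `predict_gain` helper are stand-ins, not ARBITRAGE's real interface): escalate a reasoning step to the large model only when a cheap predictor expects the quality gain to be worth the cost.

```python
# Hypothetical sketch of predicted-benefit routing; `small_model`,
# `big_model`, and `predict_gain` are illustrative stand-ins, not
# ARBITRAGE's actual interface.
from typing import Callable

def route_step(prompt: str,
               small_model: Callable[[str], str],
               big_model: Callable[[str], str],
               predict_gain: Callable[[str], float],
               threshold: float = 0.5) -> str:
    """Answer one reasoning step, escalating to the big model only when
    the predicted quality gain justifies its extra latency and cost."""
    if predict_gain(prompt) > threshold:
        return big_model(prompt)   # slow path: predicted to truly help
    return small_model(prompt)     # fast path: default for easy steps
```
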
EMMA is one unified AI model that understands images, writes about them, creates new images from text, and edits images.
Large language models forget or misuse new facts if you edit their weights only once; EtCon fixes this with a two-step plan.
COOPER is a single AI model that both "sees better" (perceiving depth and object boundaries) and "thinks smarter" (reasoning step by step) to answer spatial questions about images.