MAI-UI is a family of AI agents that can see, understand, and control phone and computer screens using plain language.
SmartSnap teaches an agent not only to finish a phone task but also to prove that it did, using a few well-chosen snapshots it picks itself.
Coding agents that fix software rely on feedback, but unit tests give only pass/fail signals that are often noisy or missing.
TimeBill is a way to help big AI models finish their answers on time without ruining answer quality.
The paper shows that when vision-language models write captions, only a small set of uncertain words (about 20%) act like forks that steer the whole sentence.
This paper introduces Knot Forcing, a way to make talking-head videos that look great while being generated live, frame by frame.
The paper shows that many AI systems work best when a small 'compressor' model first shrinks long text into a short, info-packed summary and a bigger 'predictor' model then reasons over that summary.
This paper teaches AI to notice not just what is in a picture, but how the picture looks and feels to people.
SVBench is the first benchmark that checks whether video generation models can show realistic social behavior, not just pretty pictures.
HiStream makes 1080p video generation much faster by removing repeated work across space, time, and denoising steps.
The paper builds YearGuessr, a giant, worldwide photo-and-text dataset of 55,546 buildings with their construction years (1001–2024), GPS, and popularity (page views).
Streamo is a real-time video assistant that knows when to stay quiet, when to wait, and when to speak, all while a video is still playing.