EmbodMocap is a low-cost, portable way to capture people moving inside real places using just two iPhones, so computers and robots can learn from real life instead of studios.
This paper shows that letting an AI search many places at the same time (in parallel) can beat making it think in long, slow chains.
GUI-Libra is a training recipe that helps computer-using AI agents both think carefully and click precisely on screens.
LongVideo-R1 is a smart video-watching agent that jumps to the right moments in long videos instead of scanning everything.
PyVision-RL teaches vision-language models to act like curious agents that think in multiple steps and use Python tools to inspect images and videos.
The paper builds a Computer-Using World Model (CUWM) that lets an AI “imagine” what a desktop app (like Word/Excel/PowerPoint) will look like after a click or keystroke—before doing it for real.
This paper teaches AI agents to make smart choices about when to explore for more information and when to act right away.
Big idea: Make image-making AIs stop, think, check, and fix their own work so they get better at both creating pictures and understanding them.
This paper builds GUI-Owl-1.5, an AI that can use phones, computers, and web browsers like a careful human helper.
Sci-CoE is a two-stage training method that helps one language model learn to both solve science problems and check those solutions with very little labeled data.
This paper introduces P-GenRM, a personalized generative reward model that judges AI answers using a custom scorecard built just for each user and situation.
DataChef teaches a large language model to be a smart data chef: it plans and codes full data pipelines that turn messy datasets into great training meals for other models.