CoVe is a way to create training conversations for AI agents that use tools, while guaranteeing the conversations are both challenging and correct.
This paper asks a simple question: does reinforcement learning (RL) truly make medical vision-language models (VLMs) smarter, or does it just help them pick the best answer from ones they already know?
This paper tests whether AI can realistically predict what a specific social media user would comment on a new post.
This paper shows that letting an AI search many places at the same time (in parallel) can beat making it think in long, slow chains.
GUI-Libra is a training recipe that helps computer-using AI agents both think carefully and click precisely on screens.
This paper shows that you can vastly improve a model’s command-line (terminal) skills by carefully engineering the training data, not just by using a bigger model.
LongVideo-R1 is a smart video-watching agent that jumps to the right moments in long videos instead of scanning everything.
Big language models can get stuck after fine-tuning because they become too sure of themselves, so normal training stops helping.
This paper builds an AI team that can make real full-stack websites (frontend, backend, and database) from plain English instructions.
SpatiaLab is a new test that checks if vision-language models (VLMs) can understand real-world spatial puzzles, like what’s in front, behind, bigger, or reachable.
This paper teaches AI how to fix broken Lean math proofs by learning from the compiler’s feedback, not just from finished, perfect proofs.
Re-TRAC is a new way for AI search agents to learn from each try, write a clean summary of what happened, and then use that summary to do better on the next try.