CoVe is a way to create training conversations for AI agents that use tools, while guaranteeing the conversations are both challenging and correct.
This paper asks a simple question: does reinforcement learning (RL) truly make medical vision-language models (VLMs) smarter, or just help them pick better from answers they already know?
CHIMERA is a small (about 9,000 examples) but very carefully built synthetic dataset that teaches AI to solve hard problems step by step.
This paper tests whether AI can realistically guess what a specific social media user would comment when they see a new post.
This paper shows that letting an AI search many places at the same time (in parallel) can beat making it think in long, slow chains.
GUI-Libra is a training recipe that helps computer-using AI agents both think carefully and click precisely on screens.
This paper shows that you can vastly improve a model’s command-line (terminal) skills by carefully engineering the training data, not just by using a bigger model.
LongVideo-R1 is a smart video-watching agent that jumps to the right moments in long videos instead of scanning everything.
Big language models can get stuck after fine-tuning because they become too sure of themselves, so normal training stops helping.
This paper builds an AI team that can make real full-stack websites (frontend, backend, and database) from plain English instructions.
SpatiaLab is a new test that checks if vision-language models (VLMs) can understand real-world spatial puzzles, like what’s in front, behind, bigger, or reachable.
AOrchestra is like a smart conductor that builds the right mini-helpers (sub-agents) on demand to solve big, multi-step tasks.