The paper fixes a big flaw in test-time reinforcement learning (TTRL): when many sampled answers agree on the same wrong result, the majority-vote reward reinforces the mistake and the model gets stuck.
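To see the failure mode concretely, here is a minimal sketch of majority-vote pseudo-rewards in the spirit of TTRL; the function name `majority_vote_rewards` and the toy answers are illustrative, not from the paper:

```python
from collections import Counter

def majority_vote_rewards(answers):
    """Each sampled answer gets reward 1 if it matches the majority
    answer, else 0. If the majority itself is wrong, the wrong
    answer is the one being rewarded."""
    majority, _ = Counter(answers).most_common(1)[0]
    return majority, [1 if a == majority else 0 for a in answers]

# Failure mode: 3 of 5 samples agree on a wrong answer ("12"),
# so the wrong samples each receive reward 1.
maj, rewards = majority_vote_rewards(["12", "12", "12", "7", "7"])
```

With no outside signal to break the tie, training on these rewards pushes the model further toward the consensus answer, right or wrong.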
TAROT teaches code-writing AI the way good teachers teach kids: start at the right level and raise the bar at the right time.
Gaia2 is a new test that measures how well AI agents handle real-life messiness like changing events, deadlines, and team coordination.
GameDevBench is a new test that checks if AI agents can actually make parts of video games, not just write code in one file.
Big AI reasoning models often keep thinking long after they already found the right answer, wasting time and tokens.
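A toy way to picture the waste (not the paper's actual metric): count how many tokens a reasoning trace generates after the correct answer first shows up. The function `overthinking_overhead` and the example trace are made up for illustration:

```python
def overthinking_overhead(trace_tokens, correct_answer):
    """Count tokens generated after the correct answer first appears
    in a reasoning trace -- tokens the model arguably did not need."""
    for i, tok in enumerate(trace_tokens):
        if tok == correct_answer:
            return len(trace_tokens) - (i + 1)
    return 0  # answer never appeared

# The model lands on "42" early, then keeps re-checking anyway.
trace = ["...", "so", "42", "wait", "let", "me", "re-check", "42"]
overhead = overthinking_overhead(trace, "42")
```

Here 5 of the 8 tokens come after the first correct "42" -- the kind of redundant thinking the paper aims to cut.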
This paper teaches AI how to fix broken Lean math proofs by learning from the compiler’s feedback, not just from finished, perfect proofs.
Nemotron-Math is a giant math dataset with 7.5 million step-by-step solutions created in three thinking styles and with or without Python help.
ReFusion is a new way for AI to write text faster by planning in chunks (called slots) and then filling each chunk carefully.