The paper shows that when a model compares two of its own answers head-to-head, it picks the right one more often than when it judges each answer alone.
OmniGAIA is a new test that checks if AI can watch videos, look at images, listen to audio, and use web and code tools in several steps to find a verified answer.
ThinkRouter teaches a model to switch how it "thinks" based on how sure it feels, so it stays accurate without talking forever.
The paper introduces UCoder, a way to teach a code-generating AI to get better without using any outside datasets, not even unlabeled code.
JustRL shows that a tiny, steady recipe for reinforcement learning (RL) can make a 1.5B-parameter language model much better at math without fancy tricks.
Different programming languages scale differently when training code AI models, so treating them all the same wastes compute and lowers performance.