This paper shows that when a model compares two of its own answers head-to-head, it picks the right one more often than when it judges each answer alone.
PRISM is a new way to help AI think through hard problems by checking each step, not just the final answer.
Modern music AIs can follow text, lyrics, and even example audio, but judges that score these songs have not kept up.
AgentVista is a new test (benchmark) that checks whether AI agents can solve tough, real-life picture-based problems by using multiple tools over many steps.
Diffusion Large Language Models (dLLMs) can write many parts of an answer at once, not just left to right like usual chatbots.
This paper introduces P-GenRM, a personalized generative reward model that judges AI answers using a custom scorecard built just for each user and situation.
Most image search systems judge each photo by itself, which fails when clues are split across many photos taken over time.
This paper fixes a big problem in long video generation: tiny mistakes that snowball over time and make the video drift and flicker.
This paper teaches AI to pay attention better by training its focus, not just its words.
Parallel-Probe is a simple add-on that lets many AI “thought paths” run at once but stop early once they already agree.
SWE-World lets code-fixing AI agents practice and learn without heavy Docker containers by using smart models that stand in for the computer and the tests.
Re-TRAC is a new way for AI search agents to learn from each try, write a clean summary of what happened, and then use that summary to do better on the next try.