The paper shows that when a model compares two of its own answers head-to-head, it picks the right one more often than when it judges each answer alone.
Multi-agent systems are like teams of smart helpers, but one bad message can mislead the whole team.
TAROT teaches code-writing AI the way good teachers teach kids: start at the right level and raise the bar at the right time.
GameDevBench is a new test that checks if AI agents can actually make parts of video games, not just write code in one file.
Big language models can get stuck after fine-tuning because they become too sure of themselves, so normal training stops helping.
AIRS-Bench is a new test suite that checks whether AI research agents can do real machine learning research from start to finish, not just answer questions.
Most language models are trained on text that a tokenizer has compressed into tokens, which makes training fast but ties the model to that specific tokenizer.
LatentMem is a new memory system that helps teams of AI agents remember the right things for their specific jobs without overloading them with text.
Diffusion language models (dLLMs) can write text in any order, but common decoding methods still prefer left-to-right generation, which wastes their superpower.
BatCoder teaches a code model to write both code and its documentation by doing a round trip: from code to docs and back to code.
Stable-DiffCoder is a code-focused diffusion language model that learns to write and edit programs by filling in masked pieces, not just predicting the next token.
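
To make the "filling in masked pieces" idea concrete, here is a minimal Python sketch of how a masked-infilling training example differs from a next-token one. It is an illustration only, not Stable-DiffCoder's actual pipeline: the `<mask>` symbol, the `mask_ratio` value, and both helper functions are assumptions made up for this example.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol, not taken from the paper


def make_infill_example(tokens, mask_ratio=0.3, seed=0):
    """Hide a random subset of tokens.

    Returns the masked sequence and a dict mapping each hidden
    position to the original token the model should recover.
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_ratio))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}
    for i in positions:
        targets[i] = tokens[i]
        masked[i] = MASK
    return masked, targets


def make_next_token_examples(tokens):
    """For contrast: each next-token example sees only the left context."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


tokens = ["def", "add", "(", "a", ",", "b", ")", ":", "return", "a", "+", "b"]
masked, targets = make_infill_example(tokens)
next_token_examples = make_next_token_examples(tokens)
```

In the infilling example the model sees context on both sides of every hidden position, which is what makes editing the middle of a program natural. In the next-token examples the model only ever sees what came before.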