This paper studies how AI agents that use tools talk about how sure they are and finds a split: some tools make them too sure, while others help them be honest.
MeepleLM is a special AI that reads a board game’s rulebook and pretends to be different kinds of players to give helpful, honest feedback.
The paper shows that when we give AI lots of extra text, even harmless extra text, it can get badly confused—sometimes losing up to 80% of its accuracy.
Dr. Zero is a pair of AI agents (a Proposer and a Solver) that teach each other to do web-search-based reasoning without any human-written training data.
Solar Open is a giant bilingual AI (102 billion parameters) that focuses on helping underserved languages like Korean reach English-level AI quality.
RealMem is a new benchmark that tests how well AI assistants remember and manage long, ongoing projects across many conversations.
X-Coder shows that models can learn expert-level competitive programming using data that is 100% synthetic—no real contest problems needed.
VideoDR is a new benchmark that tests if AI can watch a video, pull out key visual clues, search the open web, and chain the clues together to find one verifiable answer.
ET-Agent is a training framework that teaches AI agents to use tools (like search and code) more wisely, not just to get the right answer.
This paper introduces Laser, a new way for vision-language models to think in their hidden space before speaking, so they see the whole “forest” before picking out the “trees.”
MemGovern teaches code agents to learn from past human fixes on GitHub by turning messy discussions into clean, reusable “experience cards.”
This paper teaches AI models not just how to solve problems but also how to tell when their own answers might be wrong.