Big AI reasoning models often keep thinking long after they have already found the right answer, wasting time and tokens.
This paper teaches an AI to invent its own 'break-and-fix' strategies (called LNS operators) for tough puzzles like delivery routes and city tours.
Visual spatial reasoning often fails when a model only looks at one picture and must imagine new viewpoints.
Big language models can get stuck after fine-tuning because they become too sure of themselves, so normal training stops helping.
Agents in vast, open-ended games often learn a little and then get stuck because the next good practice steps are missing.
VidVec shows that video-capable multimodal language models already hide strong matching signals between videos and sentences inside their middle layers.
LLMs can think for many steps, but when they keep every step forever, the extra tokens turn into noise and make answers worse, not better.
MIND is a new benchmark that fairly tests two core skills of world models: remembering the world over time (memory consistency) and following controls exactly (action control).
LOCA-bench is a test that challenges AI agents to work correctly as their to-do list and background information grow very, very long.
Bielik Guard is a pair of small but strong Polish language safety models that check text for five kinds of risky content: hate/aggression, vulgar language, sexual content, crime, and self-harm.
This paper asks a simple question: do tests written by AI coding agents actually help them fix real software bugs, or do they just look helpful?
The paper fixes a common problem in video world models: scenes slowly change or “drift” when the camera moves and comes back.