Longer explanations are not always better; the shape of thinking matters.
RubricBench is a new benchmark that checks whether AI judges can use clear, checklist-style rules (rubrics) the way humans do.
ReGFT is a simple pre-RL step: it shows the model partial human hints, then has it finish solving the problems in its own words, producing correct, model-style solutions to hard questions.
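The hint-then-verify loop this summary describes can be pictured with a tiny toy. Everything here is illustrative, not from the paper: `toy_model_solve` stands in for an LLM conditioned on a partial hint, and only solutions whose final answer checks out are kept as training data.

```python
# Illustrative sketch (not ReGFT's actual code): condition a model on a
# partial human hint, let it finish in its own words, then keep only the
# solutions that pass a correctness check.

def toy_model_solve(question, hint):
    """Stand-in for an LLM: completes a solution given a partial hint."""
    # The toy 'model' just adds the two numbers to finish the solution.
    a, b = question
    return f"{hint} so the answer is {a + b}"

def is_correct(solution, answer):
    """Verifier: does the solution end with the known answer?"""
    return solution.strip().endswith(str(answer))

def collect_sft_data(problems):
    """Keep only (question, solution) pairs whose final answer checks out."""
    data = []
    for question, hint, answer in problems:
        solution = toy_model_solve(question, hint)
        if is_correct(solution, answer):
            data.append((question, solution))
    return data

problems = [((2, 3), "Add the two numbers:", 5),
            ((10, 7), "Add the two numbers:", 17)]
print(len(collect_sft_data(problems)))  # both toy solutions pass → 2
```

The key point the sketch captures: the hints shape the reasoning, but the retained solutions are written in the model's own style, so they can be used for fine-tuning without the hints.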
This paper introduces a new test, Ref-Adv, that checks whether an AI can match tricky referring sentences to the right object in a picture.
The paper shows that when training reasoning AIs with reinforcement learning, treating every wrong answer the same makes the AI overconfident in some bad paths and less diverse overall.
The paper teaches language models to explore more ideas while thinking, so they can solve harder problems.
The paper shows that, when teaching a reasoning AI with step-by-step examples, repeating a small set many times can beat using a huge set only once.
LatentChem lets AI do chemistry thinking quietly inside continuous vectors instead of writing long step-by-step sentences.
SwimBird is a multimodal AI that can switch how it thinks: only in text, only in vision (with hidden picture-like thoughts), or a mix of both.
BABE is a new benchmark that tests if AI can read real biology papers and reason from experiments like a scientist, not just recall facts.
This paper teaches AI to pay attention better by training its focus, not just its words.
The paper shows how to train a language model with special extra hints (privileged information) during practice so it can still do well later without any hints.
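The train-with-hints, test-without-hints idea can be shown with a toy. This is a generic illustration of learning with privileged information, not the paper's method: a teacher that sees a privileged signal produces labels, and a student that only sees the regular input is fit to imitate them, so at test time no hint is needed.

```python
# Illustrative sketch (generic privileged-information setup, not the
# paper's method): the teacher uses an extra signal during training; the
# student learns to match it using only the regular input x.

def teacher_label(x, privileged):
    """Teacher sees both the input and the privileged signal."""
    return 1.0 if (x + privileged) > 0 else 0.0

def fit_student(data):
    """'Train' a one-parameter threshold student on the teacher's labels:
    pick the threshold over x that best matches them."""
    best_t, best_acc = 0.0, -1.0
    for t in (d[0] for d in data):
        acc = sum((x > t) == (y > 0.5) for x, _, y in data) / len(data)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

# (x, privileged) pairs; teacher labels use both, student will only see x.
pairs = [(-2, 1), (-1, 0.5), (0, -1), (1, 0.5), (2, 1)]
data = [(x, p, teacher_label(x, p)) for x, p in pairs]
threshold = fit_student(data)

# At test time the student uses only x, no privileged signal:
matches = sum((x > threshold) == (y > 0.5) for x, _, y in data)
print(threshold, matches)  # the student recovers all 5 teacher labels
```

The design choice this illustrates: the privileged signal never appears in the student's input, only in the targets it was trained toward, which is what lets it "do well later without any hints."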