This paper teaches AI to name things in pictures very specifically (like “golden retriever” instead of just “dog”) without making more mistakes.
This paper introduces HACRL, a way for different kinds of AI agents to learn together during training but still work alone during use.
Tool-R0 teaches a language model to use software tools (like APIs) with zero human-made training data.
DeepVision-103K is a new dataset of 103,000 picture-and-text math problems designed to help AI think better, using rewards that can be checked automatically.
This paper teaches a computer to find buttons, text, and icons on screens so it can click and type in the right places, a skill called GUI grounding.
TRIT is a new training method that teaches AI to translate and think at the same time so it can solve hard problems in many languages without extra helper models.
The paper discovers that popular RLVR methods for training language and vision-language models secretly prefer certain answer lengths, which can hurt learning.
This paper introduces Foundation-Sec-8B-Reasoning, a small (8 billion parameter) AI model that is trained to “think out loud” before answering cybersecurity questions.
When training language models with RL that uses right-or-wrong rewards, learning can stall on “saturated” problems that the model almost always solves.
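The stall described above is easy to see in a tiny sketch (illustrative only, not the paper's code): with 0/1 rewards and a simple group-relative baseline, a saturated problem gives every sampled answer the same reward, so every advantage is zero and the update carries no learning signal. The `group_advantages` helper here is hypothetical.

```python
def group_advantages(rewards):
    """Group-relative advantage: each reward minus the group mean
    (a simplified stand-in for GRPO-style baselines)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

# Saturated problem: all 4 sampled answers are correct -> zero signal.
print(group_advantages([1, 1, 1, 1]))  # [0.0, 0.0, 0.0, 0.0]

# Unsaturated problem: mixed outcomes still produce a gradient signal.
print(group_advantages([1, 0, 1, 0]))  # [0.5, -0.5, 0.5, -0.5]
```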
Diffusion language models can write tokens in any order, but that freedom can accidentally hurt their ability to reason well.
STEP3-VL-10B is a small (10 billion parameter) open multimodal model that sees images and reads text, yet scores like much larger models.
JudgeRLVR teaches a model to be a strict judge of answers before it learns to generate them, which trims bad ideas early.