SLATE is a new way to teach AI to think step by step while using a search engine, giving feedback at each step instead of only at the end.
This paper teaches AI models to learn like good students: try, think about what went wrong, fix it, and remember the fix.
Training big language models with reinforcement learning can wobble because the per-token importance-sampling (IS) ratios swing wildly.
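To make the "swinging" concrete: a per-token IS ratio is the new policy's probability of a token divided by the old policy's, which equals exp(new log-prob minus old log-prob). The sketch below uses made-up log-probabilities (the numbers and the helper name `token_is_ratios` are illustrative assumptions, not taken from the paper) to show how modest log-prob shifts turn into very uneven ratios. It does not show the paper's remedy, which this summary does not describe.

```python
import math

def token_is_ratios(new_logps, old_logps):
    """Per-token importance-sampling ratios: pi_new(token) / pi_old(token)."""
    return [math.exp(n - o) for n, o in zip(new_logps, old_logps)]

# Hypothetical log-probabilities for four tokens of one sampled response.
old_logps = [-1.0, -2.0, -0.5, -3.0]
new_logps = [-0.9, -3.5, -0.5, -1.2]

ratios = token_is_ratios(new_logps, old_logps)
spread = max(ratios) / min(ratios)  # how unevenly tokens get weighted

for i, r in enumerate(ratios):
    print(f"token {i}: ratio = {r:.3f}")
print(f"largest / smallest ratio = {spread:.1f}")
```

Because each token's gradient is scaled by its ratio, a few tokens with extreme ratios can dominate an update, which is one way to picture the wobble.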
The paper finds a hidden symmetry inside GRPO’s advantage calculation that unintentionally keeps models from exploring new good answers and from giving easy and hard problems the right amount of attention at the right stage of training.
The paper discovers that popular RLVR (reinforcement learning with verifiable rewards) methods for training language and vision-language models secretly prefer certain answer lengths, which can hurt learning.
The paper builds a simple, math-light rule to predict whether training makes a language model more open-minded (higher entropy) or more sure of itself (lower entropy).
When rewards are rare, a popular training method for language models (GRPO) often stops learning because every try in a group gets the same score, so there is nothing to compare.
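The stall is easy to see in numbers. GRPO scores each try relative to its group, commonly as (reward minus group mean) divided by group standard deviation. The sketch below assumes that standard form plus a small `eps` to avoid dividing by zero; the function name `group_advantages` and the example rewards are illustrative, not the paper's code.

```python
from statistics import fmean, pstdev

def group_advantages(rewards, eps=1e-8):
    """Group-relative advantages: (reward - group mean) / (group std + eps)."""
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# A group with mixed outcomes: successes are pushed up, failures pushed down.
mixed = group_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every try fails (rare rewards): every advantage is zero,
# so the policy gradient for this group vanishes.
all_fail = group_advantages([0.0, 0.0, 0.0, 0.0])

print("mixed group:   ", [round(a, 3) for a in mixed])
print("all-fail group:", all_fail)
```

The same zero signal appears when every try succeeds, so groups that are uniformly too hard or uniformly too easy both contribute nothing to learning.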
LatentMem is a new memory system that helps teams of AI agents remember the right things for their specific jobs without overloading them with text.
RLAnything is a new reinforcement learning (RL) framework that trains three things together: the policy (the agent), the reward model (the judge), and the environment (the tasks).
The paper trains language models to solve hard problems by first breaking them into smaller parts and then solving those parts, instead of only thinking in one long chain.
Most reinforcement learning agents only get a simple pass/fail reward, which hides how good or bad their attempts really were.
Cities are full of places defined by people, like schools and parks, which are hard to see clearly from space without extra clues.