This paper introduces HACRL, a way for different kinds of AI agents to learn together during training but still work alone during use.
Reinforcement learning (RL) trains language models by letting them try answers and learn from rewards, but training is slow if we pick the wrong practice questions.
ReGFT is a simple pre-RL step that shows the model partial human hints, then makes it solve problems in its own words, creating correct, model-style solutions for hard questions.
LLMs trained with simple rewards often latch onto just a few ways of solving problems and stop exploring, which hurts their ability to find other correct answers.
DeepVision-103K is a new 103,000-example picture-and-text math dataset designed to train AI reasoning with rewards that can be checked automatically.
Gaia2 is a new test that measures how well AI agents handle real-life messiness like changing events, deadlines, and team coordination.
PhyCritic is a judge model that checks other AI models’ answers about the physical world, like cooking steps, robot actions, or driving choices.
Step 3.5 Flash is a huge but efficient AI that has 196 billion total parameters but only wakes up about 11 billion per token, so it stays capable while running fast.
The paper teaches large language models to do what good students do: find where they went wrong, turn that lesson into a rule, and remember it for next time.
This paper teaches language models not just to get the final answer right but to think in a way others can reliably follow.
Big AI reasoning models often keep thinking long after they have already found the right answer, wasting time and tokens.
TRIT is a new training method that teaches AI to translate and think at the same time so it can solve hard problems in many languages without extra helper models.
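To put the Step 3.5 Flash numbers above in perspective, here is a minimal sketch of the arithmetic behind "196 billion total, about 11 billion per token". It uses only the two figures quoted in that entry; the variable and function names are made up for illustration and do not come from the model's code.

```python
# Figures quoted in the Step 3.5 Flash entry (both approximate, in billions).
TOTAL_PARAMS_B = 196   # total parameters
ACTIVE_PARAMS_B = 11   # parameters used per token


def active_fraction(active: float, total: float) -> float:
    """Return the share of parameters used for each token."""
    if total <= 0:
        raise ValueError("total must be positive")
    return active / total


fraction = active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)
print(f"Active per token: {fraction:.1%}")  # prints "Active per token: 5.6%"
```

In other words, roughly one parameter in eighteen does work on any given token, which is where the speed comes from.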