This paper introduces HACRL, a way for different kinds of AI agents to learn together during training but still work alone during use.
This paper teaches a language-model agent to explore more effectively by combining two ways of learning (on-policy and off-policy) with a simple memory that the agent writes itself.
VESPO is a new, stable way to train language models with reinforcement learning even when training data comes from older or mismatched policies.
The paper studies a simple way to train large language models with reinforcement learning: replace a hard-to-compute term (the log-partition function) with something easy to compute, the mean reward.
The paper shows how to speed up reinforcement learning (RL) for large language models (LLMs) by using lower-precision 8-bit floating-point numbers (FP8) without destabilizing training.
LLM judges are cheap but biased; without calibration they can completely flip which model looks best.
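
To make the ranking flip concrete, here is a minimal sketch of one standard correction, the Rogan-Gladen estimator. It recovers a true pass rate from a judge's observed pass rate, given the judge's sensitivity and specificity as measured on a small human-labeled subset. This is not necessarily the calibration method the paper uses. The function name, the model names, and all numbers are hypothetical and chosen only for illustration.

```python
def corrected_pass_rate(observed, sensitivity, specificity):
    """Rogan-Gladen correction of a judge's observed pass rate.

    observed:    fraction of outputs the judge marked as passing
    sensitivity: P(judge says pass | output truly passes)
    specificity: P(judge says fail | output truly fails)
    """
    denom = sensitivity + specificity - 1.0
    if denom <= 0:
        raise ValueError("judge is no better than chance; cannot calibrate")
    estimate = (observed + specificity - 1.0) / denom
    # Clip to [0, 1]: sampling noise can push the raw estimate outside it.
    return min(1.0, max(0.0, estimate))


# Hypothetical numbers: the judge is lenient toward model_a's failures
# (low specificity) and comparatively strict with model_b.
judge_stats = {
    "model_a": {"observed": 0.70, "sensitivity": 0.95, "specificity": 0.60},
    "model_b": {"observed": 0.65, "sensitivity": 0.90, "specificity": 0.90},
}

raw = {name: s["observed"] for name, s in judge_stats.items()}
calibrated = {
    name: corrected_pass_rate(s["observed"], s["sensitivity"], s["specificity"])
    for name, s in judge_stats.items()
}

raw_best = max(raw, key=raw.get)
calibrated_best = max(calibrated, key=calibrated.get)
print("best by raw judge score: ", raw_best)
print("best after calibration:  ", calibrated_best)
```

With these made-up inputs, the raw judge scores favor model_a, while the corrected estimates favor model_b. A flip like this requires the judge's error rates to differ between models. If the same sensitivity and specificity applied to both, the correction would rescale the scores but keep their order.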