The paper shows that the popular PPO (Proximal Policy Optimization) method for training language models is unfair to rare words and too gentle with very common words, which makes learning slow and unstable.
The paper shows how to train a language model with special extra hints (privileged information) during practice so it can still do well later without any hints.
Large language models learn better when we spend more practice time on the right questions at the right moments.
The paper shows that a model that looks great after supervised fine-tuning (SFT) can actually end up worse after the same reinforcement learning (RL) training than a model that looked weaker after SFT.
Big language models are great at words but waste lots of time and energy when they try random actions in non-language games like Sudoku, Sokoban, 2048, FrozenLake, and Rubik’s Cube.
OCRVerse is a new AI model that can read both plain text in documents and the visual structures in charts, webpages, and science plots, all in one system.
This paper shows that giving an AI a safe, tiny virtual computer (a sandbox) lets it solve many kinds of problems better, not just coding ones.
This paper explains how to turn large language models (LLMs) from quiet students that only answer questions into active agents that can plan, act, and learn over time.
Big language models can learn new facts with simple tutoring (SFT), but that doesn’t automatically teach them how to use those facts well.
MatchTIR teaches AI agents to judge each tool call step-by-step instead of giving the same reward to every step.
Re-Align is a new way for AI to make and edit pictures by thinking in clear steps before drawing.
SmartSearch teaches search agents to fix their own bad search queries while they are thinking, not just their final answers.