The paper shows how to speed up reinforcement learning (RL) for large language models (LLMs) by storing and computing numbers in a compact 8-bit floating-point format (FP8) during rollout, without destabilizing training.
RL for LLMs is slow because the rollout (text generation) stage can consume more than 70% of total training time, especially for long, step-by-step answers.
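To make the FP8 idea concrete, here is a minimal PyTorch sketch of per-tensor FP8 quantization. The function names and the per-tensor scaling scheme are illustrative assumptions, not the paper's exact recipe: higher-precision weights are rescaled so their largest magnitude fits FP8's range, cast down to 8 bits for fast rollout, and dequantized wherever full precision is needed.

```python
import torch

FP8_MAX = 448.0  # largest finite value representable in FP8 E4M3

def quantize_to_fp8(w: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Cast a BF16/FP32 weight tensor to FP8 with a per-tensor scale.

    Hypothetical helper: the paper's actual quantization granularity
    (per-tensor, per-channel, or per-block) may differ.
    """
    scale = w.abs().max().clamp(min=1e-12) / FP8_MAX
    w_fp8 = (w / scale).to(torch.float8_e4m3fn)  # requires PyTorch >= 2.1
    return w_fp8, scale

def dequantize_from_fp8(w_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximate BF16 tensor for use outside FP8 kernels."""
    return w_fp8.to(torch.bfloat16) * scale

# Usage: quantize once per policy update, then reuse the FP8 copy
# across many rollout (generation) steps.
w = torch.randn(4096, 4096, dtype=torch.bfloat16)
w_fp8, s = quantize_to_fp8(w)
w_approx = dequantize_from_fp8(w_fp8, s)
print((w - w_approx).abs().max())  # small per-element quantization error
```

The payoff of this kind of cast is that FP8 halves memory traffic relative to BF16 and unlocks faster matrix-multiply hardware paths on recent GPUs, which is exactly where rollout spends most of its time.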