Jet-RL: Enabling On-Policy FP8 Reinforcement Learning with Unified Training and Rollout Precision Flow
Intermediate | Haocheng Xi, Charlie Ruan et al. | Jan 20 | arXiv
Reinforcement learning (RL) for large language models is slow because the rollout (text generation) stage can take more than 70% of training time, especially for long, step-by-step answers.
#FP8 quantization #on-policy reinforcement learning #precision flow