Online Causal Kalman Filtering for Stable and Effective Policy Optimization
Intermediate · Shuo He, Lang Feng et al. · Feb 11 · arXiv
Training large language models with reinforcement learning can become unstable because the per-token importance-sampling (IS) ratios fluctuate sharply from one token to the next.
#Kalman filter #importance sampling ratio #policy optimization
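
To make the idea of noisy per-token IS ratios concrete, here is a minimal sketch of how such ratios are computed and how a generic scalar Kalman filter could smooth them causally. This is not the paper's algorithm: the function names, the random-walk state model, and the values of `process_var` and `obs_var` are all assumptions chosen for illustration.

```python
import math


def is_ratios(new_logps, old_logps):
    """Per-token IS ratios pi_new / pi_old, computed from token log-probabilities."""
    return [math.exp(n - o) for n, o in zip(new_logps, old_logps)]


def causal_kalman_smooth(log_ratios, process_var=1e-2, obs_var=1e-1):
    """Generic scalar random-walk Kalman filter over per-token log-ratios.

    Hypothetical illustration, not the authors' method. The filter runs left
    to right, so each output depends only on the current and earlier tokens.
    """
    mean, var = 0.0, 1.0  # assumed prior: log-ratio near 0, i.e. ratio near 1
    smoothed = []
    for z in log_ratios:
        var += process_var            # predict: the state drifts as a random walk
        gain = var / (var + obs_var)  # Kalman gain: trust in the new observation
        mean += gain * (z - mean)     # update the estimate toward the observation
        var *= 1.0 - gain             # posterior variance shrinks after the update
        smoothed.append(mean)
    return smoothed


if __name__ == "__main__":
    old = [-1.2, -0.7, -2.1, -0.9]  # made-up log-probs under the old policy
    new = [-0.8, -1.3, -1.6, -1.4]  # made-up log-probs under the new policy
    raw = [n - o for n, o in zip(new, old)]
    print("raw ratios:     ", is_ratios(new, old))
    print("filtered ratios:", [math.exp(s) for s in causal_kalman_smooth(raw)])
```

Filtering in log space keeps the smoothed ratios positive after exponentiation. A larger `obs_var` relative to `process_var` makes the filter trust individual tokens less and smooth more heavily.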