Training large language models with reinforcement learning can become unstable because the per-token importance-sampling (IS) ratios swing wildly.
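To make the mechanism concrete, here is a minimal sketch (not any specific paper's method) of how per-token IS ratios are typically computed and then bounded with PPO-style clipping; the tensor names and the clip range are illustrative assumptions.

```python
import torch

def clipped_policy_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Per-token PPO-style loss. logp_new / logp_old are log-probs of the
    sampled tokens under the current and behavior policies, shape
    [batch, seq_len]; advantages has the same shape."""
    # Per-token IS ratio pi_new(a_t|s_t) / pi_old(a_t|s_t), computed in
    # log space and exponentiated. Small log-prob drift compounds, so
    # raw ratios can swing far from 1 -- the instability noted above.
    ratio = torch.exp(logp_new - logp_old)
    # Clipping bounds how far any single token can push the update.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```

In practice, monitoring the spread of `ratio` (e.g., its max and variance per batch) is a cheap way to see the swings this summary refers to.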
Millions of public AI models exist, but downloads are concentrated on a tiny set of "official" checkpoints, which are not always the best performers.