Training large language models with reinforcement learning can be unstable because the per-token importance-sampling (IS) ratios fluctuate widely.
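To make the per-token IS ratio concrete, here is a minimal sketch. It assumes the usual definition, the probability of a token under the current policy divided by its probability under the policy that generated the sample, computed from log-probabilities. The function name and the numbers are hypothetical and chosen only for illustration.

```python
import math

def per_token_is_ratios(logp_new, logp_old):
    """Return exp(logp_new[t] - logp_old[t]) for each token position t."""
    if len(logp_new) != len(logp_old):
        raise ValueError("log-prob sequences must have the same length")
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]

# Hypothetical log-probabilities for a 4-token response (illustrative only).
logp_old = [-1.0, -2.0, -0.5, -3.0]   # under the sampling policy
logp_new = [-1.0, -0.5, -0.6, -4.5]   # under the policy being updated

ratios = per_token_is_ratios(logp_new, logp_old)

# Ratio between the largest and smallest per-token weight in this one response.
spread = max(ratios) / min(ratios)
```

Even within a single short response, small shifts in log-probability produce weights that differ by more than an order of magnitude, which is one way to picture the instability the sentence describes.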
This paper addresses two problems in image-generation models that build pictures step by step: during training the model practices with perfect answers (teacher forcing) but at generation time it must rely on its own imperfect guesses, and the earliest coarse steps are much harder than the later fine steps.
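The gap between practicing on perfect answers and performing on one's own guesses can be illustrated with a toy example. This is not the paper's model: `imperfect_step`, its constant bias, and the number sequence are all hypothetical stand-ins for a step-by-step generator that is slightly wrong at every step.

```python
def imperfect_step(x, bias=0.1):
    """Toy predictor: the true rule is x + 1, but this model overshoots by `bias`."""
    return x + 1 + bias

def rollout(truth, teacher_forcing):
    """Predict truth[1:] step by step and return the absolute error at each step."""
    errors = []
    prev = truth[0]
    for target in truth[1:]:
        pred = imperfect_step(prev)
        errors.append(abs(pred - target))
        # Teacher forcing feeds the ground truth forward;
        # otherwise the model's own (imperfect) guess is fed forward.
        prev = target if teacher_forcing else pred
    return errors

truth = [0.0, 1.0, 2.0, 3.0, 4.0]
tf_errors = rollout(truth, teacher_forcing=True)     # error stays constant
free_errors = rollout(truth, teacher_forcing=False)  # error accumulates
```

With teacher forcing, each step starts from a correct input, so the error never grows. When the model consumes its own outputs, each mistake is carried into the next step, which is the mismatch the summary describes.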