Large language model (LLM) post-training has uneven work per GPU because some text sequences are much longer than others.
Group-based reinforcement learning for reasoning (like GRPO) uses the group's average reward as a baseline, but that makes its 'advantage' estimates biased.