Length-Unbiased Sequence Policy Optimization: Revealing and Controlling Response Length Variation in RLVR
Key Summary
- The paper discovers that popular RLVR methods for training language and vision-language models secretly prefer certain answer lengths, which can hurt learning.
- GRPO and GSPO, two common algorithms, accidentally push models toward shorter or longer replies because of how they average or clip the training signal.
- This hidden 'length bias' can even cause response length to collapse, especially with GSPO in multimodal models, limiting reasoning and exploration.
- The authors propose LUSPO, a tiny but powerful fix: multiply each sequence's loss by its own length so longer and shorter answers are treated fairly.
- LUSPO stops length collapse, grows response lengths faster, and improves accuracy across math and multimodal benchmarks versus GRPO and GSPO.
- On AIME24, LUSPO beats GSPO by up to 2.9% (7B dense model) and 6.9% (30B MoE model); on MathVista-mini it tops GRPO by 1.6% and GSPO by 0.5%.
- In training, LUSPO increases average response length to about 1.5× that of GSPO, opening more room for step-by-step reasoning.
- The fix works on both dense and Mixture-of-Experts models, and for text-only and vision-language settings, with more stable training curves.
- The insight is simple: don't let the training math confuse 'short' with 'good'. Be length-unbiased so models reason as long as needed.
- This has real impact for tasks like math, coding, and science where clear, multi-step explanations matter.
Why This Research Matters
When AI can explain its thinking fully, it becomes more reliable for homework help, coding, and science questions. If training secretly punishes long answers, models may skip important steps and make more mistakes. LUSPO fixes this by being fair to both short and long answers, so the model chooses the length it truly needs. That leads to clearer explanations and better accuracy on tough problems. It also prevents training collapse in multimodal models, helping them combine text and images more effectively. This means smarter tutors, better debugging assistants, and more trustworthy reasoning tools for everyone.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're writing answers for a math contest. If the teacher graded you mostly by how short or long your answer is, instead of how correct and clear it is, you might start changing your writing style to match the grading trick. That wouldn't be fair, right?
The Concept (RLVR: Reinforcement Learning with Verifiable Rewards):
- What it is: RLVR is a way to train AI where the model tries answers, gets a reward that can be checked (like "correct/incorrect"), and then learns to do better next time.
- How it works:
- The model sees a question and writes several possible answers.
- A checker (the verifier) scores each answer based on rules (correctness, format, etc.).
- The model updates itself to make good answers more likely in the future.
- Why it matters: Without RLVR, models may sound fluent but struggle with tricky reasoning like math or multi-step logic. Anchor: Think of a quiz game where each answer can be checked; the AI gets a point for right answers and learns to score more points over time.
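To make the reward loop concrete, here is a minimal sketch of a verifiable reward check, assuming a simple numeric-answer task; the extraction rule and the function name are illustrative, not the paper's implementation.

```python
import re

def verifiable_reward(response: str, ground_truth: str) -> float:
    """Toy verifier: reward 1.0 if the last number in the response matches the ground truth."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    return 1.0 if numbers and numbers[-1] == ground_truth else 0.0

print(verifiable_reward("12*13 = 156, so the answer is 156.", "156"))  # 1.0
print(verifiable_reward("Maybe it is 158?", "156"))                    # 0.0
```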
Hook: You know how a group of friends can help you judge which solution seems best by comparing them all side by side?
The Concept (GRPO: Group Relative Policy Optimization):
- What it is: GRPO improves a model by comparing a group of answers to the same question and rewarding those that beat the group average.
- How it works:
- Generate G answers to one question.
- Score each answer.
- Compute each answer's "advantage" as how much it beats (or loses to) the group average.
- Adjust the model so answers with higher advantage become more likely.
- Why it matters: It removes the need for a separate value model and focuses learning on relative quality. Anchor: Like a science fair where judges compare projects side-by-side and push the winners' ideas forward.
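As a sketch of the group comparison just described, the snippet below normalizes each answer's reward against its group's mean and spread (a standard formulation; the epsilon value and tensor shapes are assumptions for illustration).

```python
import torch

def group_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: shape (G,), one scalar reward per sampled answer to the same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

rewards = torch.tensor([1.5, 1.5, 0.0, 0.5])
print(group_advantages(rewards))  # positive for above-average answers, negative for below-average
```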
Hook: Imagine grading a whole story by looking at the story as one piece, not by checking each word one by one.
The Concept (GSPO: Group Sequence Policy Optimization):
- What it is: GSPO is like GRPO but scores and adjusts at the whole-answer (sequence) level instead of token-by-token.
- How it works:
- Treat the entire answer as a single unit.
- Compare how likely the old model vs. the new model thinks that whole answer is (importance ratio at sequence level).
- Clip extreme ratios for stability (so updates don't explode).
- Push up good whole answers; push down bad ones.
- Why it matters: GSPO is more stable, especially for MoE models, where token-level updates can get noisy. Anchor: It's like coaching a basketball team by reviewing each full game tape, not just single plays.
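The following sketch shows a sequence-level importance ratio with clipping in the spirit of GSPO: the ratio is the length-normalized (geometric-mean) likelihood ratio of the whole answer, and asymmetric clip bounds stand in for Clip-Higher. The epsilon values and the function name are illustrative assumptions.

```python
import torch

def gspo_sequence_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                       advantage: torch.Tensor, eps_low: float = 0.2,
                       eps_high: float = 0.28) -> torch.Tensor:
    """logp_new/logp_old: (T,) per-token log-probs of one full answer; advantage: scalar."""
    # Geometric mean of token-level ratios = exp(mean log-ratio), keeping the signal sequence-level.
    seq_ratio = torch.exp((logp_new - logp_old.detach()).mean())
    clipped = torch.clamp(seq_ratio, 1 - eps_low, 1 + eps_high)  # Clip-Higher-style asymmetric range
    return -torch.min(seq_ratio * advantage, clipped * advantage)
```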
Hook: Suppose your class starts giving extra credit if your essay is short. Suddenly, students might write super short essays, even if the ideas need more space!
The Concept (Response length bias):
- What it is: A hidden push in the training math that makes the model prefer shorter (or longer) answers, unrelated to real quality.
- How it works:
- In GRPO, averaging over tokens makes each token in short answers count more, so short good answers get stronger boosts.
- In GSPO, sequence-level clipping and Clip-Higher prune many gradients, often leaving more short, positive samples to dominate.
- Over time, models drift toward shorter replies, even when longer reasoning is helpful.
- Why it matters: If the model "learns" that short equals good, it may stop explaining steps, hurting complex reasoning. Anchor: Like grading essays mostly by length; students game the system instead of learning to reason clearly.
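A back-of-the-envelope illustration of the token-averaging effect (the numbers are made up for the example): under a per-sequence token mean, each token of a short correct answer carries a much larger share of the update than each token of an equally correct long answer.

```python
advantage = 1.0                                  # same quality signal for both answers
short_len, long_len = 10, 300
push_per_token_short = advantage / short_len     # 0.1
push_per_token_long = advantage / long_len       # ~0.0033
print(push_per_token_short / push_per_token_long)  # 30.0: the short answer's tokens get a 30x louder push
```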
Hook: Picture a school with many teachers, each great at a different subject, teaming up to help you.
The Concept (Mixture-of-Experts, MoE):
- What it is: A model made of many "experts," where only some experts are activated for each input.
- How it works:
- A router picks a few best experts for the current question.
- Those experts process the input.
- Their outputs are combined into the final answer.
- Why it matters: It's efficient and powerful, but training can get unstable if updates swing too hard. Anchor: Like calling the math teacher for math and the art teacher for art, instead of having every teacher speak at once.
The World Before:
- AI models were good at sounding fluent but not at careful step-by-step reasoning.
- RLVR helped by rewarding correct, checkable answers, but the training math accidentally introduced a bias toward certain answer lengths.
The Problem:
- GRPO and GSPO, while strong, made each token of a long answer count less than each token of a short one (GRPO) and clipped too many gradients in a way that favored short answers (GSPO). This caused "length collapse," especially in vision-language training, where answers shrank over time.
Failed Attempts:
- Switching from token-level (GRPO) to sequence-level (GSPO) updates improved stability but didn't remove the core length bias.
- Techniques like Clip-Higher improved some stability aspects but unintentionally amplified the bias toward short positives.
The Gap:
- We needed an objective that is fair to both short and long answers, so models can write as long as needed to reason well, without gaming the training math.
Real Stakes:
- In math, coding, and science questions, solutions need multiple steps. If AI cuts explanations short, it makes more mistakes and is less trustworthy.
- In multimodal tasks (with images), shrinking answers means less exploration of visual clues, hurting performance.
02 Core Idea
Hook: You know how race judges sometimes adjust times if the track is longer or shorter, so the scores are fair? Training AI needs that kind of fairness for answer length.
The Concept (LUSPO: Length-Unbiased Sequence Policy Optimization):
- What it is: LUSPO is a simple change to GSPO that makes training fair to answers of all lengths by scaling each sequence's loss by its own length.
- How it works:
- Keep GSPO's idea of treating each answer as a whole sequence.
- Compute how much to push up or down the sequence (using its reward and importance ratio).
- Multiply that push by the answerâs length so long and short answers contribute fairly overall.
- Apply clipping as usual, then update the model.
- Why it matters: Without this correction, models learn to shrink answers; with it, models safely explore longer, step-by-step reasoning and perform better. Anchor: It's like grading essays by quality and normalizing for length, so a great short essay and a great long essay are both judged fairly.
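In code, the LUSPO correction is essentially one extra multiplication, sketched below on top of already-clipped sequence-level losses (variable names and batch layout are assumptions; this is not the paper's released implementation).

```python
import torch

def luspo_batch_loss(seq_losses: torch.Tensor, lengths: torch.Tensor) -> torch.Tensor:
    """seq_losses: (B,) clipped sequence-level losses; lengths: (B,) response lengths in tokens."""
    # Length scaling: each sequence's signal is weighted by its own length,
    # so long answers are no longer muted relative to short ones.
    return (seq_losses * lengths).mean()

seq_losses = torch.tensor([0.2, 0.5, 0.1])
lengths = torch.tensor([300.0, 100.0, 50.0])
print(luspo_batch_loss(seq_losses, lengths))
```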
The "Aha!" Moment in one sentence:
- If we scale each sequence's learning signal by its length, we remove the hidden push that punishes long answers, unlocking better reasoning.
Three Analogies:
- Essay Fairness: Don't let word count decide the grade; judge ideas fairly by adjusting for length.
- Team Highlights: When comparing whole games, give equal credit whether the team played 30 or 60 minutes, scaling the evaluation so it's fair.
- Marathon vs. Sprint: Normalize scores so long races and short races are judged on performance, not distance.
Before vs. After:
- Before: GSPO often clipped gradients so that short, positive examples dominated. Responses got shorter over time, hurting reasoning (especially in multimodal tasks).
- After: LUSPO restores balance. Response length grows healthily, exploration increases, and benchmarks improve across dense and MoE models.
Why It Works (intuition, not math):
- GRPO averaged token signals, making each token in a short answer louder than a token in a long answer.
- GSPO looked at the whole answer but clipped in a way that often removed more negative samples than positive ones, over-weighting short positive answers.
- LUSPO evens the scales: multiplying by length stops long answers from being muted and prevents short replies from being over-amplified.
Building Blocks:
- Sequence-level thinking (from GSPO): judge the whole answer for stability.
- Verifiable rewards: keep correctness as the north star.
- Length scaling (new): ensure the learning signal is unbiased by length.
- Clipping for safety: keep updates stable.
- Works with both dense and MoE: robust across architectures.
03 Methodology
High-Level Recipe (Input → Steps → Output):
- Input: A batch of questions (text or image+text), and a current model.
- Steps:
- Sample G answers per question from the old policy.
- Score each answer with verifiable rewards (correctness, format, overlong penalty).
- Compute group advantages (how much each answer beats the group average).
- Compute sequence-level importance ratios (how much the new policy likes that whole answer vs the old one), then clip for stability.
- Multiply each sequence's training signal by its own length (the LUSPO fix).
- Average within the batch and update the model.
- Output: A new model that learns to produce clear, correct, and appropriately long reasoning.
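Putting the recipe together, here is a condensed, illustrative sketch of one update; `sample_responses`, `verifiable_reward`, and the `logprobs` method are hypothetical placeholders, and the clip bounds are assumptions rather than the paper's exact settings.

```python
import torch

def luspo_training_step(policy, old_policy, prompts, ground_truths,
                        G: int = 8, eps_low: float = 0.2, eps_high: float = 0.28):
    losses = []
    for prompt, truth in zip(prompts, ground_truths):
        responses = sample_responses(old_policy, prompt, n=G)                 # G rollouts per prompt (placeholder)
        rewards = torch.tensor([verifiable_reward(r, truth) for r in responses])
        advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)      # group-relative advantage
        for response, adv in zip(responses, advantages):
            logp_new = policy.logprobs(prompt, response)                      # (T,) token log-probs (placeholder API)
            logp_old = old_policy.logprobs(prompt, response).detach()
            seq_ratio = torch.exp((logp_new - logp_old).mean())               # sequence-level importance ratio
            clipped = torch.clamp(seq_ratio, 1 - eps_low, 1 + eps_high)       # clip for stability
            seq_loss = -torch.min(seq_ratio * adv, clipped * adv)
            losses.append(seq_loss * logp_new.shape[0])                       # LUSPO: scale by response length
    return torch.stack(losses).mean()  # backpropagate this and step the optimizer
```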
Hook: Think of cooking a big stew; you taste the whole stew (sequence), not each tiny grain of salt (token), and you adjust seasoning fairly whether the pot is big or small (length scaling).
The Concept (Sequence-level importance weighting and clipping, with length scaling):
- What it is: Measure how much the new model agrees with a full answer, keep extreme signals safe with clipping, and then correct for length.
- How it works:
- Whole-answer comparison keeps noise low.
- Clipping prevents wild jumps.
- Length scaling removes unfair preferences for short answers.
- Why it matters: Without it, training stability fights against reasoning depth. Anchor: Like adjusting the spice level of a full pot, then making sure a big pot and a small pot get fair comparisons by portion size.
Each Step in Detail:
- Sampling responses (G per question)
- What happens: For each prompt, generate multiple candidate answers with temperature and top-p sampling.
- Why it exists: More candidates mean better comparison and a clearer signal of whatâs good.
- Example: For a geometry problem, produce 8 different solution attempts.
- Rewarding answers (verifiable rewards)
- What happens: Compute a reward that includes accuracy (0/1), format (0 or 0.5), and an overlong penalty that softly discourages exceeding a buffer near the max length.
- Why it exists: Rewards define the goalâcorrect, well-formatted, not absurdly long answers.
- Example: An answer that's correct, well-formatted, and within the length buffer might get 1.5; a wrong, messy, too-long answer might get close to 0 or slightly negative after the penalty (see the reward sketch after this list).
- Group advantage (relative to peers)
- What happens: For the 8 answers to the same question, compute each answer's advantage as how much it beats the group average (normalized by group spread).
- Why it exists: Removes the need for a separate value model and focuses learning on "better than your siblings."
- Example: If your answer scores highest among the 8, its advantage is positive; the worst answer's advantage is negative.
- Sequence-level importance ratio and clipping
- What happens: Compare how likely the whole answer is under the new model vs. the old one; then clip the ratio within a safe range (with optional Clip-Higher tweaks).
- Why it exists: Prevents unstable updates, especially in MoE models.
- Example: If the new model over-favors a bad answer, clipping stops a harmful update.
- Length scaling (the LUSPO fix)
- What happens: Multiply each sequence's (already clipped) learning signal by the answer's length.
- Why it exists: Neutralizes the hidden math that punishes long answers and over-rewards short ones.
- Example: A 300-token correct solution and a 100-token correct solution now have fair total influence.
- Update the model
- What happens: Average signals across sequences and backpropagate to adjust parameters using AdamW with a small learning rate and warm-up.
- Why it exists: This is how the model actually learns from the signals.
- Example: After several steps, the model shows longer, clearer step-by-step reasoning.
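Here is the reward sketch promised in the "Rewarding answers" step: it combines an accuracy term, a format bonus, and a soft overlong penalty that only kicks in inside a buffer before the maximum length. The exact weights and penalty shape are assumptions for illustration.

```python
def composite_reward(correct: bool, well_formatted: bool, length: int,
                     max_len: int = 4096, buffer: int = 512) -> float:
    """Accuracy (0/1) + format bonus (0/0.5) - soft overlong penalty near max_len."""
    reward = (1.0 if correct else 0.0) + (0.5 if well_formatted else 0.0)
    overshoot = length - (max_len - buffer)
    if overshoot > 0:
        reward -= min(overshoot / buffer, 1.0)   # ramps from 0 to -1 across the buffer zone
    return reward

print(composite_reward(True, True, length=1000))   # 1.5: correct, formatted, well within the buffer
print(composite_reward(True, True, length=4000))   # ~0.69: penalized for entering the buffer zone
```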
Concrete Mini Example:
- Prompt: "Compute 12×13."
- Candidates:
- A: "156" (length 1, correct, good format) → reward 1.5
- B: "First, 12×10=120, 12×3=36, add to get 156. So 156." (length 22, correct, good format) → reward 1.5
- C: "Uh... maybe 158?" (length 5, wrong) → reward 0
- In plain GSPO: Short correct A can dominate because of clipping and no length correction; over time, replies may shrink.
- In LUSPO: A and B get equal reward quality, but B's longer reasoning isn't muted; length scaling keeps it fairly weighted. C is pushed down.
Training and Implementation Notes (from the paper):
- Models: Qwen2.5-7B-Base (dense), Qwen3-30B-A3B-Instruct (MoE), Qwen2.5-VL-7B-Instruct (vision-language).
- Compute: 8 H800 GPUs for 7B and VL; 4×8 H800 GPUs for 30B MoE.
- Rollout: 128 prompts per batch, 8 responses per prompt; top-p 0.7, temperature 1.0.
- Optimizer and schedule: AdamW, LR 1e-6, linear warm-up 20 steps.
- Max generation length: 32,768 (text), 4,096 (VL); overlong penalty uses a buffer (e.g., 512 for VL, 4096 for text) to discourage only very long outputs.
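A minimal sketch of the reported optimizer setup (AdamW, learning rate 1e-6, linear warm-up over 20 steps); `policy` is a placeholder for the model being trained, and the scheduler choice is an assumption about how the warm-up is implemented.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-6)   # policy: placeholder for the trained model
warmup_steps = 20
scheduler = LambdaLR(optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps))
# Call scheduler.step() after each optimizer.step() during training.
```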
The Secret Sauce:
- A one-line conceptual change (multiply each sequence's signal by its own length) unlocks fair learning across lengths, prevents length collapse, and keeps GSPO's stability benefits.
04 Experiments & Results
The Test: What and Why
- They measured three main things: accuracy (did the model get the problem right?), response length (is the model explaining enough?), and validation curves (does training generalize beyond the training set?).
- Why: Longer, thoughtful answers often mean better reasoning, but only if length is used wisely. We want more correct answers and stable training without shrinking replies.
The Competition (Baselines)
- GRPO: Strong performance but token-averaging introduces length bias; unstable for MoE.
- GSPO: More stable via sequence-level updates, but amplifies length bias, sometimes causing length collapse.
- LUSPO: The proposed method removes length bias by length scaling.
The Scoreboard (with context)
- Text-only, Qwen2.5-7B-Base (dense):
- AIME24: LUSPO beats GSPO by +2.9 points (like moving from a low B to a solid B+).
- AIME25: +2.7; AMC23 and MATH500 also improve, with average gain about +4.0.
- Text-only, Qwen3-30B-A3B-Instruct (MoE):
- AIME24: LUSPO +6.9 over GSPO (a big jump, like going from a B to an A-).
- AIME25: +17.1 (a very large improvement, like leaping from a C+ to a B+/A- range).
- Vision-language, Qwen2.5-VL-7B-Instruct:
- MathVista-mini: LUSPO +1.6 over GRPO and +0.5 over GSPO.
- WeMath and LogicVista: LUSPO +5.1 and +6.0 over GSPO, respectively (major gains on tough reasoning).
Response Length Growth
- LUSPO raises average response length to about 1.5× that of GSPO across models (e.g., 3940 vs. 2611 tokens on the dense model and 10114 vs. 6757 tokens on the MoE model, LUSPO vs. GSPO), showing increased exploration space.
- In multimodal training, GSPO suffers length collapse; LUSPO avoids this and maintains healthy lengths.
Training Dynamics and Validation
- Accuracy reward during training is consistently higher for LUSPO than GSPO (dense, MoE, and VL), meaning better progress per step.
- Validation curves (e.g., avg@32 on AIME24) climb higher with LUSPO, suggesting gains are not just memorization.
Surprising Findings
- GSPO sometimes underperforms GRPO on multimodal benchmarks due to amplified length bias from sequence-level clipping and Clip-Higher.
- A simple length scaling in LUSPO reverses this: stability is preserved while bias is removed.
Bottom Line
- LUSPO provides state-of-the-art optimization among tested methods, giving both longer, meaningful reasoning chains and better scores across diverse settings.
05 Discussion & Limitations
Limitations
- LUSPO improves fairness across lengths, but it still relies on reward design. If rewards overly favor verbosity or extreme brevity, models may drift again.
- Very long sequences increase compute and memory costs; even with fair training, practical budgets may cap length.
- Length scaling interacts with clipping; extreme settings of Clip-Higher or epsilon could still distort signals.
Required Resources
- Substantial compute: training used clusters of H800 GPUs and long max sequence lengths (up to 32K tokens for text).
- Quality verifiers and datasets: performance depends on reliable correctness checks and suitable prompts.
When NOT to Use
- Tasks where strict brevity is essential (SMS replies, on-device assistants with tight latency/memory) may not benefit from promoting longer reasoning.
- If the verifier is noisy or easily gamed, longer answers might "farm" reward without real quality.
Open Questions
- Can we adaptively scale by "useful steps" instead of raw token count (e.g., detect genuine reasoning structure)?
- How should length scaling interact with different clipping schemes or soft-gating methods (e.g., SAPO)?
- What are the best settings for multimodal tasks with varied input complexities?
- Can we prove stronger theoretical guarantees about convergence and bias removal under different data distributions?
06 Conclusion & Future Work
Three-Sentence Summary
- The paper finds that GRPO and GSPO have a hidden length bias that skews training toward shorter answers, which can harm complex reasoning and even cause length collapse.
- It introduces LUSPO, a length-unbiased variant of GSPO that multiplies each sequence's learning signal by its own length, restoring fairness and stability.
- Across dense, MoE, text-only, and vision-language models, LUSPO yields longer, more effective reasoning and higher accuracy than GRPO and GSPO.
Main Achievement
- A minimal, principled fix (length scaling at the sequence level) that neutralizes response length bias and unlocks better reasoning performance.
Future Directions
- Explore adaptive scaling based on reasoning structure, integrate with soft clipping methods, and refine reward design for richer step-by-step verification.
- Extend to domains like code generation, scientific writing, and step-by-step planning, where controlled length matters.
Why Remember This
- It shows that tiny tweaks in the training objective can have outsized effects on how models think. By making training fair to length, we let models explain their thoughts clearly, and that leads to smarter, more trustworthy AI.
Practical Applications
- Build math tutors that show every step without being nudged into too-short explanations.
- Train coding assistants that write complete, well-commented solutions instead of terse, fragile snippets.
- Improve vision-language assistants that reason over charts, tables, and diagrams without shrinking their analysis.
- Use LUSPO in scientific QA systems so they provide methodical, reproducible reasoning chains.
- Enhance multi-step planning agents (e.g., for robotics or data workflows) by allowing appropriately long plans.
- Stabilize MoE model training for reasoning tasks without complex extra tricks.
- Reduce reward hacking where models game the training by writing unusually short answers to score points.
- Support education apps that require structured formats (e.g., final boxed answers) while keeping full derivations.
- Improve evaluation fairness across datasets with varied typical answer lengths.
- Enable safer deployment by avoiding length collapse that hides missing reasoning steps.