
Evaluating Parameter Efficient Methods for RLVR

Intermediate
Qingyu Yin, Yulun Wu, Zhennan Shen et al. · 12/29/2025
arXiv · PDF

Key Summary

  • The paper asks which small, add-on training tricks (PEFT) work best when we teach language models with yes/no rewards we can check (RLVR).
  • It tests more than a dozen PEFT methods on strong reasoning models and math benchmarks, keeping training settings fair and the same for everyone.
  • Structural adapters like DoRA, AdaLoRA, and MiSS beat the usual LoRA, and sometimes even beat full-model training.
  • Tricks that start from SVD (PiSSA, MiLoRA) often crash because they push updates into the wrong directions for RLVR, a problem the authors call spectral misalignment.
  • There’s a minimum “expressivity floor”: ultra-tiny adapters (VeRA, rank-1, or only LayerNorm tuning) choke off the model’s ability to improve its reasoning.
  • Results hold across different batch sizes, learning rates, adapter ranks, and a bigger 7B model, so the story is robust, not a fluke.
  • LoRA+ (which adjusts learning-rate ratios) is stable and competitive, showing that smart optimization helps even without big architectural changes.
  • Bottom line: stop defaulting to plain LoRA for RL with verifiable rewards; use structure-aware adapters like DoRA or careful optimization like LoRA+.
  • This matters because it lets teams train better reasoning models faster and cheaper without touching all the model’s weights.

Why This Research Matters

Better adapters for RLVR mean we can teach models to reason more accurately while spending less time and money. This helps build math tutors, coding helpers, and science assistants that are both smarter and cheaper to train. Teams without giant budgets can still achieve high-quality results by picking DoRA or LoRA+ instead of defaulting to plain LoRA. Avoiding unstable methods (like some SVD-inits) prevents wasted compute and frustrating crashes. Respecting the expressivity floor ensures models don’t “forget how to think” by using adapters that are too tiny. Overall, this turns RLVR from a fragile, expensive trick into a practical, reliable tool for everyday intelligent systems.

Detailed Explanation


01Background & Problem Definition

🍞 Top Bread (Hook): Imagine you’re practicing math with a teacher who only says “correct” or “wrong,” but never shows the full solution. You still get better because each checkmark is trustworthy.

🥬 Filling (The Actual Concept)

  • What it is: Reinforcement Learning with Verifiable Rewards (RLVR) is a way to train AI where it tries answers and gets a binary reward (1 or 0) from a reliable checker (like a math grader or a code runner).
  • How it works:
    1. The model generates several answers to a question.
    2. A verifier checks if each final answer is correct.
    3. The model gets a reward (1 for correct, 0 for wrong) and nudges itself to make correct answers more likely next time.
  • Why it matters: Without verifiable rewards, the AI might chase fuzzy signals; with them, it can build strong reasoning habits that can be checked and trusted (a tiny toy version of this loop is sketched below).

🍞 Bottom Bread (Anchor): Think of a calculator app that also marks your final answer as right or wrong. If the AI sees enough right/wrong checks, it learns to aim for right more often.
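
To make this concrete, here is a tiny, runnable toy of the reward loop. The "model" is just a preference table over canned answers (a stand-in for an LLM's output probabilities), so this illustrates the binary-reward idea rather than the paper's actual algorithm.

```python
import random

# Toy RLVR-style loop: sample an answer, get a 0/1 reward from a trusted checker,
# and nudge the preference for rewarded answers upward. Purely illustrative.
question = "What is 2 + 2 * 3?"
candidates = ["8", "10", "12"]
prefs = {c: 1.0 for c in candidates}          # stand-in "policy": unnormalised preferences

def verifier(answer: str) -> float:           # reliable checker returning a binary reward
    return 1.0 if answer == "8" else 0.0

for _ in range(500):
    answer = random.choices(candidates, weights=list(prefs.values()))[0]
    prefs[answer] += 0.1 * verifier(answer)   # correct answers become more likely over time

print(prefs)                                  # "8" ends up with by far the largest preference
```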

🍞 Top Bread (Hook): You know how you can improve a bike without buying a new one by just adding a better seat and lights? That’s being efficient.

🥬 Filling (The Actual Concept)

  • What it is: Parameter-Efficient Fine-Tuning (PEFT) is upgrading a model by training only small add-on parts instead of changing the whole thing.
  • How it works:
    1. Freeze the big model so it stays stable.
    2. Attach tiny “adapters” to certain layers.
    3. Train just those little adapters to learn the new skill.
  • Why it matters: It saves memory, time, and money, letting more people tune big models.

🍞 Bottom Bread (Anchor): Like snapping a small focusing lens onto a camera to get better close-ups without rebuilding the whole camera.
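
As a rough picture of "freeze the big thing, train a tiny add-on," here is a minimal PyTorch-style sketch. The layer sizes and the two-layer adapter are illustrative choices, not the paper's setup.

```python
import torch
import torch.nn as nn

base = nn.Linear(4096, 4096)                  # stands in for one frozen pretrained layer
for p in base.parameters():
    p.requires_grad = False                   # step 1: the big model stays untouched

adapter = nn.Sequential(                      # step 2: a tiny trainable add-on
    nn.Linear(4096, 16, bias=False),
    nn.Linear(16, 4096, bias=False),
)

def forward(x):                               # base output plus the learned correction
    return base(x) + adapter(x)

# step 3: only the adapter's few parameters are ever optimised
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)
```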

🍞 Top Bread (Hook): Picture improving a guitar’s sound by tuning only two strings that matter most for a song, not all of them.

🥬 Filling (The Actual Concept)

  • What it is: Low-Rank Adaptation (LoRA) is a PEFT method that changes only a small, low-rank part of each weight matrix.
  • How it works:
    1. Keep the original weight W frozen.
    2. Add a tiny update BA where B and A are skinny matrices (low rank).
    3. Train B and A so BA learns the needed change.
  • Why it matters: You get big improvements with very few trainable parameters.

🍞 Bottom Bread (Anchor): It’s like adding a slim, adjustable cushion to your chair—tiny change, big comfort.
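
In code, a LoRA-style layer can be sketched like this (the rank, scaling, and initialisation follow the common W + (alpha/r)·BA recipe; it is a simplified illustration, not the authors' implementation):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Simplified LoRA layer: frozen W plus a trainable low-rank update BA."""
    def __init__(self, in_dim, out_dim, r=32, alpha=64):
        super().__init__()
        self.W = nn.Parameter(torch.randn(out_dim, in_dim), requires_grad=False)  # frozen
        self.A = nn.Parameter(torch.randn(r, in_dim) * 0.01)   # skinny "down" matrix
        self.B = nn.Parameter(torch.zeros(out_dim, r))          # skinny "up" matrix, starts at zero
        self.scale = alpha / r

    def forward(self, x):
        return x @ self.W.T + (x @ self.A.T @ self.B.T) * self.scale  # W x + scale * (BA) x
```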

Before this paper, people had learned that RL (and RLVR) can boost reasoning in language models, especially for math and code. But RL training is expensive and tricky, and most teams simply used standard LoRA because it was the popular PEFT choice. That raised a big question: is plain LoRA actually the best choice for the unique, yes/no, sparse-reward world of RLVR? Also, earlier attempts to be extra-efficient—like turning the adapters super tiny—sometimes worked in supervised fine-tuning, but might not carry over to RLVR, where the learning signals are sparse and the model must reshape its reasoning pathways.

The problem: No one had run a broad, side-by-side test of many PEFT methods under the same RLVR setup. Without that, we couldn’t tell which adapters make the most of the scarce, verifiable rewards; we also didn’t know which training tricks cause crashes.

Failed attempts: Some teams tried SVD-based initializations (like PiSSA and MiLoRA) to aim updates at “important” directions of the model’s weights. But in RLVR, these often collapsed or underperformed. Others tried extreme parameter reduction (rank-1, VeRA, or only LayerNorm tuning), which sometimes choked off the model’s ability to grow better reasoning.

The gap: We needed a fair, comprehensive testbed comparing many PEFTs on the same models, data, and RLVR algorithms. The paper fills that gap by testing 12+ methods, measuring not just accuracy but stability over time, and exploring ablations (batch size, rank, learning rate) and scaling to a bigger model.

Real stakes: If we pick the wrong adapter, we waste compute, stall training, or get worse reasoners. If we pick the right one, math solvers, coding assistants, and tutoring bots get smarter, faster, and cheaper—helping students, developers, and researchers in daily life.

02Core Idea

🍞 Top Bread (Hook): Imagine you’re organizing a toolbox. Some tools (hammers) are great for nails, but you might need a screwdriver for screws. Using the wrong tool makes the job harder.

🥬 Filling (The Actual Concept)

  • What it is: The key insight is that the best adapters for RLVR are not the usual ones—structural variants like DoRA, AdaLoRA, and MiSS work better than plain LoRA, and SVD-based starts (PiSSA, MiLoRA) often misalign with how RLVR actually learns.
  • How it works:
    1. Test many PEFT methods under the same RLVR recipe and data.
    2. Track accuracy and stability over training.
    3. Analyze where updates go in “spectral space” (which directions in the model get changed).
  • Why it matters: Picking the right adapter unlocks stronger reasoning with less compute; picking the wrong one can crash learning.

🍞 Bottom Bread (Anchor): It’s like discovering that a ratcheting screwdriver (DoRA) tightens screws faster and cleaner than your basic flat-head (LoRA) for this specific job.

🍞 Top Bread (Hook): You know how pushing a swing needs two choices—how hard you push and the direction you push?

🥬 Filling (The Actual Concept)

  • What it is: Magnitude–Direction Decoupling (as in DoRA) separates how much to change from which way to change weights.
  • How it works:
    1. Keep track of the vector’s direction (where to go).
    2. Separately learn the magnitude (how strongly to go there).
    3. Combine them during the forward pass.
  • Why it matters: In RLVR’s sparse-reward world, this flexibility helps the model make precise, stable shifts in reasoning.

🍞 Bottom Bread (Anchor): Like steering a bike (direction) while also deciding how hard to pedal (magnitude) so you don’t wobble or stall.
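
A condensed sketch of that decoupling, in the spirit of DoRA (simplified for illustration; the real method's normalization details differ, and this is not the official code):

```python
import torch
import torch.nn as nn

class DoRAStyleLinear(nn.Module):
    """Illustrative magnitude/direction decoupling; not the official DoRA implementation."""
    def __init__(self, in_dim, out_dim, r=32):
        super().__init__()
        W0 = torch.randn(out_dim, in_dim)
        self.W = nn.Parameter(W0, requires_grad=False)            # frozen base weight
        self.A = nn.Parameter(torch.randn(r, in_dim) * 0.01)      # low-rank "down"
        self.B = nn.Parameter(torch.zeros(out_dim, r))            # low-rank "up"
        self.m = nn.Parameter(W0.norm(dim=1, keepdim=True))       # learned magnitude per output unit

    def forward(self, x):
        merged = self.W + self.B @ self.A                         # candidate update
        direction = merged / merged.norm(dim=1, keepdim=True)     # "which way" (unit length)
        return x @ (self.m * direction).T                         # "how strongly" via m
```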

🍞 Top Bread (Hook): Think of sound waves on a music equalizer—some sliders are high, some are low. Looking at which sliders move tells you how the sound is changing.

🥬 Filling (The Actual Concept)

  • What it is: Spectral Analysis is inspecting weight changes along directions revealed by SVD (like the “loudest” versus “quietest” components).
  • How it works:
    1. Decompose a weight matrix into directions (singular vectors) and strengths (singular values).
    2. Measure which directions the training updates prefer.
    3. Compare different methods’ update patterns.
  • Why it matters: If RLVR prefers off-principal directions, methods that force principal directions can fight the learning signal and collapse.

🍞 Bottom Bread (Anchor): It’s like noticing your treble slider keeps jumping when the song actually needs more bass—no wonder it sounds off.

Before vs After:

  • Before: Plain LoRA was the default; SVD-based inits seemed elegant; tiny adapters looked attractive for savings.
  • After: Structure-aware adapters (DoRA, AdaLoRA, MiSS) consistently win; SVD-initialized methods can misalign with RLVR’s off-principal learning and crash; there’s a hard “expressivity floor”—too few learnable parameters strangle reasoning.

Why it works (intuition): RLVR uses sparse, outcome-only signals. To improve, the model must carefully reshape reasoning paths without wrecking pre-trained knowledge. DoRA’s decoupling allows finer control; AdaLoRA/MiSS flex the capacity where needed. SVD-based inits assume “most important” directions are best to change, but RLVR often learns in quieter, non-dominant directions to stay stable—so pushing on the loudest sliders backfires.

Building blocks:

  • Structural Variants (DoRA, AdaLoRA, MiSS) that add flexibility.
  • Optimization Tweaks (LoRA+) that tune learning-rate balance for stability.
  • Caution Flags (PiSSA, MiLoRA) where spectral bias clashes with RLVR dynamics.
  • Expressivity Floor: don’t shrink adapters past what reasoning needs.

03Methodology

At a high level: Question → Generate multiple answers → Check with a verifier (reward 0/1) → Compute group-based advantages (DAPO/GRPO family) → Update only adapter parameters (PEFT) → Repeat.
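
In pseudocode, that loop looks roughly like the following; every helper name here (`policy.generate`, `group_relative_advantages`, `policy_gradient_loss`, `make_optimizer`) is a placeholder for whatever RLVR stack is actually used, such as TRL.

```python
# Pseudocode for the overall pipeline; all helper names are illustrative placeholders.
def train_rlvr_with_peft(policy, adapter_params, dataset, verifier, num_steps, G=8):
    optimizer = make_optimizer(adapter_params)        # only adapter weights are trainable
    for _ in range(num_steps):
        question, answer_key = sample(dataset)
        rollouts = [policy.generate(question) for _ in range(G)]       # explore G answers
        rewards = [float(verifier(r, answer_key)) for r in rollouts]   # verifier returns 0/1
        advantages = group_relative_advantages(rewards)                # GRPO/DAPO-style scoring
        loss = policy_gradient_loss(policy, question, rollouts, advantages)
        loss.backward()                                # gradients flow only into the adapters
        optimizer.step()
        optimizer.zero_grad()
```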

Step 1: Prepare the models and adapters

  • What happens: Choose base models (e.g., DeepSeek-R1-Distill-Qwen 1.5B and 7B). Attach different PEFT adapters (LoRA, DoRA, AdaLoRA, MiSS, LoRA+, etc.) to key linear layers.
  • Why this step exists: We want a fair, apples-to-apples test of many PEFTs under the same backbone and training recipe.
  • Example: Add a DoRA module to attention and MLP projection layers so the model can adjust magnitude and direction separately.

Step 2: Generate multiple candidate answers per question (rollouts)

  • What happens: For each math prompt, the model samples G answers (e.g., 8) with temperature/top-p to explore different reasoning paths.
  • Why it matters: RLVR needs exploration because the reward is sparse—trying several paths increases the chance of finding a correct one.
  • Example: For a geometry question, the model produces eight detailed chains-of-thought with final boxed answers.

🍞 Top Bread (Hook): Imagine a referee who only says “goal” or “no goal,” but is perfectly accurate.

🥬 Filling (The Actual Concept)

  • What it is: The Verifier gives a binary reward—1 if the final answer is correct, 0 otherwise.
  • How it works:
    1. Extract the model’s final boxed answer.
    2. Use tools (like latex2sympy) to check equivalence with the ground truth.
    3. Assign 1/0 reward.
  • Why it matters: This clean, trustworthy signal guides learning even without token-by-token supervision.

🍞 Bottom Bread (Anchor): Like a spellchecker that marks only the final word as correct or wrong; simple but solid.
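
A minimal verifier sketch is below. The paper points to latex2sympy-style tooling; this version just uses plain sympy for the equivalence check, and the regex for the boxed answer is an assumption about the output format.

```python
import re
import sympy

def verify(model_output: str, ground_truth: str) -> int:
    """Return 1 if the final boxed answer is symbolically equal to the ground truth, else 0."""
    match = re.search(r"\\boxed\{([^}]*)\}", model_output)   # 1) extract the final boxed answer
    if match is None:
        return 0
    try:
        pred = sympy.sympify(match.group(1))                  # 2) parse both sides
        gold = sympy.sympify(ground_truth)
        return int(sympy.simplify(pred - gold) == 0)          # 3) binary reward
    except (sympy.SympifyError, TypeError):
        return 0
```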

Step 3: Compute group-based advantages (DAPO/GRPO family)

🍞 Top Bread (Hook): Think of a coach comparing players on the same team to see who did better this round.

🥬 Filling (The Actual Concept)

  • What it is: Group Relative Policy Optimization (GRPO) and its variants (DAPO, Dr. GRPO) score each answer relative to the other answers generated for the same prompt.
  • How it works:
    1. Collect rewards for the G answers to the same prompt.
    2. Standardize or adjust those rewards using group statistics.
    3. Update the model to favor relatively better answers while keeping exploration.
  • Why it matters: This avoids needing a separate critic network and stabilizes learning from sparse signals.

🍞 Bottom Bread (Anchor): It’s like ranking several free-throws you just shot and practicing the form that made the best ones.
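
A small sketch of the group-relative scoring step: standardize each rollout's reward against the other rollouts for the same prompt. The epsilon is a common stabilizer; exact details differ slightly between GRPO, DAPO, and Dr. GRPO.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one prompt's group of rollouts."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)                 # spread within the group
    return [(r - mean) / (std + eps) for r in rewards]

# Example: 8 rollouts, 3 correct -> correct answers get positive advantages, wrong ones negative.
print(group_relative_advantages([1, 1, 1, 0, 0, 0, 0, 0]))
```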

Step 4: Apply the PEFT update

  • What happens: Only the adapter parameters (small add-ons) get updated; the big pre-trained weights stay frozen.
  • Why it matters: Keeps compute low and preserves the model’s general knowledge while refining reasoning.
  • Example: In DoRA, direction and magnitude parts update in a coordinated way; in LoRA+, the B and A parts learn with different learning rates to stay stable.
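
The LoRA+ trick from the last example can be sketched as optimizer parameter groups; the `lora_A`/`lora_B` naming follows the common PEFT convention, and the 16x ratio is just an illustrative choice.

```python
import torch

def build_lora_plus_optimizer(model, base_lr=1e-5, lr_ratio=16.0):
    """Give the B matrices a larger learning rate than the A matrices (LoRA+-style sketch)."""
    a_params = [p for n, p in model.named_parameters() if p.requires_grad and "lora_A" in n]
    b_params = [p for n, p in model.named_parameters() if p.requires_grad and "lora_B" in n]
    return torch.optim.AdamW([
        {"params": a_params, "lr": base_lr},
        {"params": b_params, "lr": base_lr * lr_ratio},   # B learns faster, which helps stability
    ])
```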

Step 5: Training logistics and fairness

  • What happens: Use the same ranks, learning rates, batch sizes, rollout settings, and training steps across methods unless ablation-testing those factors.
  • Why it matters: A fair race ensures performance differences come from the adapter design, not from better hyperparameters.
  • Example: Rank=32, alpha=64, dropout=0.05 for most adapters; batch sizes fixed; same temperature/top-p for sampling.

🍞 Top Bread (Hook): Imagine sorting music by loudness and pitch so you can see which parts change after you remix a song.

🥬 Filling (The Actual Concept)

  • What it is: Spectral Analysis (via SVD) to see whether updates prefer principal (loud) or off-principal (quiet) directions.
  • How it works:
    1. Decompose layer weights into singular vectors/values.
    2. Project updates onto these directions.
    3. Plot how much energy goes into top components versus the tail.
  • Why it matters: RLVR tends to learn off-principal; methods that force principal updates can conflict and collapse.

🍞 Bottom Bread (Anchor): If your remix keeps cranking treble but the song needs bass, the spectrogram will show the mismatch.
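
A rough sketch of that measurement: express a weight update in the frozen weight's singular basis and ask how much of its energy lands in the top (principal) block. Purely illustrative.

```python
import torch

def principal_energy_fraction(W_base, delta_W, top_k=32):
    """Fraction of an update's energy that falls in the base weight's top singular directions."""
    U, S, Vh = torch.linalg.svd(W_base, full_matrices=False)
    coeffs = U.T @ delta_W @ Vh.T              # update expressed in the singular basis
    energy = coeffs.pow(2)
    return (energy[:top_k, :top_k].sum() / energy.sum()).item()

# Usage idea: compare principal_energy_fraction(W0, W_after - W0) across PEFT methods.
```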

The secret sauce

  • Structural variants (DoRA, AdaLoRA, MiSS) add flexibility right where RLVR needs it—fine-grained shifts without overstepping.
  • Learning-rate-aware methods (LoRA+) stabilize optimization when signals are sparse.
  • Careful ablations of batch size, rank, and LR expose the “expressivity floor”: too-tiny adapters starve reasoning capacity.

Concrete data example

  • Input: “AIME-style algebra problem.”
  • Model outputs: 8 chains-of-thought; 2 are correct (reward=1), 6 are wrong (reward=0).
  • GRPO/DAPO: Emphasize the 2 correct paths; slightly raise probabilities of their key reasoning steps.
  • Adapter update: DoRA tweaks direction/magnitude in attention/MLP to make those steps more likely next time.
  • Next round: Correct paths become more common; accuracy trend rises if the adapter matches RLVR’s dynamics.
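
Running the numbers for this example with group-relative scoring (same recipe as the sketch in the methodology section; the arithmetic is illustrative):

```python
rewards = [1, 1, 0, 0, 0, 0, 0, 0]                                    # 2 correct, 6 wrong
mean = sum(rewards) / len(rewards)                                     # 0.25
std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5    # ~0.433
advantages = [(r - mean) / std for r in rewards]
# Correct rollouts score about +1.73, wrong ones about -0.58, so the adapter
# update leans toward the reasoning steps used in the two correct paths.
```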

04Experiments & Results

The test

  • Benchmarks: AIME24/25, AMC, HMMT, MATH-500, Minerva—classic math reasoning suites.
  • Metrics: Avg@k (average accuracy across k generations) and Pass@k (at least one correct in k tries), reflecting both quality and diversity.
  • Setup: Same RLVR recipe (DAPO by default), same sampling, same adapter ranks; ablations change one thing at a time.
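
The two metrics can be read off a list of per-sample correctness flags; a tiny sketch matching the definitions above (some papers use an unbiased Pass@k estimator instead, which is more involved):

```python
def avg_at_k(correct):           # e.g. [1, 0, 1, 0] -> 0.5
    return sum(correct) / len(correct)

def pass_at_k(correct):          # e.g. [1, 0, 1, 0] -> 1 (at least one correct)
    return int(any(correct))
```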

The competition

  • Baselines: Full-parameter fine-tuning (the most expensive option, with every weight trainable) and standard LoRA (the popular PEFT default).
  • Challengers: Structural variants (DoRA, AdaLoRA, MiSS), efficiency-focused (LoRA-FA, VeRA), SVD-initialized (PiSSA, MiLoRA), and optimization-tuned (LoRA+, rsLoRA), plus IA3/LN-tuning.

The scoreboard (with context)

  • DoRA tops the chart on the 1.5B model: around 46.6% average accuracy, even beating full-parameter fine-tuning (~44.9%). That’s like scoring an A when the usual best is an A-.
  • AdaLoRA and MiSS also beat standard LoRA (LoRA ~42.5%). These are solid B+/A- versus LoRA’s B.
  • LoRA+ performs strongly and stably—proof that smart learning-rate balance helps even without structural changes.
  • Extreme shrinkers stumble: VeRA (~40.7%) and IA3 (~22.3%) fall behind, signaling an expressivity floor; shrink too far and reasoning stalls.
  • SVD-based inits: PiSSA collapses near zero; MiLoRA starts okay but fails to converge (~18%). This is the “spectral misalignment” story: RLVR wants off-principal moves; SVD-inits drag you toward principal components, causing instability or collapse.

Training dynamics and curves

  • Over steps, DoRA’s accuracy climbs smoothly; LoRA rises but lower; PiSSA/MiLoRA show early lift or flatness then crash. The learning curves mirror the spectral analysis: the stable winners don’t over-concentrate updates in top singular directions.

Surprising findings

  • DoRA sometimes beats full finetuning, highlighting how the right structure can guide updates better than brute force.
  • MiLoRA, designed to favor minor components, still ends up spiking principal directions: its tiny initial magnitudes let gradients steer updates back toward the loudest components.
  • Moderate parameter cuts (LoRA-FA) are okay; extreme cuts (VeRA, rank-1) cross the line where the adapter can’t express the needed reasoning changes.

Ablations

  • Batch size: Smaller can help slightly, but the effect is mild in RLVR compared to supervised fine-tuning; the sparse reward changes the game.
  • Learning rate: Careful scaling matters a lot; wrong LR can destabilize training even for good adapters.
  • Rank: Higher ranks (16–32) clearly outperform rank-1; the cost is tiny relative to the base model, so don’t starve rank.
  • Objective variants: Results are consistent across GRPO, DAPO, and Dr. GRPO—so the adapter story generalizes across RLVR algorithms.

Scaling up

  • On the 7B model, the pattern holds: DoRA and LoRA+ edge out plain LoRA again (around 55% vs 54.8%), with strong scores on tough subsets (e.g., AMC, AIME25). This shows the findings aren’t an artifact of small models.

Bottom line

  • The best bet for RLVR is structure-aware adapters like DoRA (and strong contenders like AdaLoRA/MiSS) or optimization-savvy LoRA+.
  • Avoid SVD-based inits in RLVR—they fight the optimization landscape.
  • Don’t undersize adapters beyond the expressivity floor if you care about reasoning.

05Discussion & Limitations

Limitations

  • Domain scope: Tests focus on mathematical reasoning. Other domains (e.g., long coding projects, scientific QA) might differ in how much expressivity or structure they need.
  • Model family: Evaluations center on DeepSeek-R1-Distill variants. Results should generalize, but confirmation on diverse architectures is needed.
  • Training horizon: Experiments use practical training lengths. Very long training could shift rankings or expose late-stage effects.
  • Infrastructure: TRL made broad comparisons easy, but ultra-large-scale distributed RL might favor different engineering choices.

Required resources

  • You need GPU memory for rollouts (multiple generations per prompt), verifiers (e.g., symbolic math tools), and logging. Even with PEFT, RLVR’s sampling cost dominates.
  • Good dataset curation with reliable verifiers is essential—garbage rewards hurt all methods.

When not to use these methods

  • If you only need shallow style tweaks (not deep reasoning), simple supervised fine-tuning or tiny adapters may suffice.
  • If verifiers are weak or noisy, RLVR’s advantage shrinks; then PEFT rankings may change because the signal guiding training is unreliable.
  • If you can’t afford multiple rollouts per prompt, the exploration benefit of RLVR drops and some adapters’ strengths may not appear.

Open questions

  • Theory: Can we precisely characterize why RLVR prefers off-principal updates and formalize which structures preserve that trajectory?
  • Adapter design: Are there even better magnitude–direction schemes, or hybrids that adapt rank on-the-fly without instability?
  • Stability: Can we invent initialization methods that keep updates in the off-principal regime without collapsing to principal spikes?
  • Transfer: How do findings extend to multimodal RLVR, multi-turn dialogue with verifiers, or code execution tasks with long horizons?
  • Engineering: What merging/inference tricks keep adapters stable and efficient in production?

06Conclusion & Future Work

Three-sentence summary

  • This paper runs the first big, fair race of many parameter-efficient adapters under RL with verifiable rewards and finds that plain LoRA isn’t the best choice. Structural variants like DoRA, AdaLoRA, and MiSS consistently beat LoRA and sometimes even full-model training, while SVD-based initializations often fail due to spectral misalignment. There’s also a clear expressivity floor: shrink adapters too much and reasoning quality collapses.

Main achievement

  • A clear, evidence-backed roadmap for PEFT in RLVR: pick structure-aware adapters (DoRA et al.) or stable optimization tweaks (LoRA+), avoid SVD-based inits, and don’t cut rank/parameters past the point where reasoning can still flex.

Future directions

  • Build theory for why RLVR learns off-principal directions and design initializations that respect that. Test broader domains (coding, science), longer training, and more model families. Improve infrastructure for scalable, stable RLVR and robust deployment.

Why remember this

  • Because it changes the default: don’t automatically choose plain LoRA for RLVR. If you want stronger reasoning per dollar, use structure-aware adapters and respect the expressivity floor. This guidance can save compute, avoid crashes, and produce smarter, more reliable reasoning models in practice.

Practical Applications

  • Train math-tutoring chatbots with DoRA to reach higher accuracy using fewer resources.
  • Build reliable code-verification assistants by pairing RLVR with LoRA+ for stable learning.
  • Upgrade enterprise QA systems by fine-tuning only adapters, cutting compute budgets without sacrificing reasoning.
  • Deploy classroom assistants that solve competition-style problems while keeping training costs manageable.
  • Speed up research prototyping: swap adapters (DoRA, AdaLoRA, MiSS) to quickly test what works best for a new domain.
  • Avoid training crashes in RLVR by steering clear of SVD-initialized adapters and using stable LR settings.
  • Scale from 1.5B to 7B models confidently, knowing the adapter rankings remain similar.
  • Set adapter rank to 16–32 to stay above the expressivity floor for complex reasoning tasks.
  • Use LoRA-FA when memory is tight, but avoid extreme vector-only methods (VeRA) for deep reasoning.
  • Design evaluation pipelines with multiple rollouts and strict verifiers to provide clean RLVR signals.
#RLVR #parameter-efficient fine-tuning #LoRA #DoRA #AdaLoRA #MiSS #LoRA+ #SVD initialization #PiSSA #MiLoRA #spectral analysis #expressivity floor #GRPO #DAPO #reasoning benchmarks
Version: 1