Learning Rate Matters: Vanilla LoRA May Suffice for LLM Fine-tuning
Key Summary
- When you tune the learning rate carefully, plain old LoRA fine-tuning works about as well as fancy new versions.
- Across math and code tasks and several model sizes, all LoRA-style methods reached nearly the same best accuracy (usually within 1–2%).
- Different LoRA variants prefer different learning rate ranges, so judging them at a single learning rate can be misleading.
- PiSSA generally needs a smaller learning rate than vanilla LoRA because its loss surface is sharper (bigger top Hessian eigenvalue).
- Once the learning rate is right, batch size matters less than most people think for these setups.
- Small rank changes can flip who ‘wins’: some variants look better at low rank but not at high rank, and vice versa.
- Careful hyperparameter search (especially learning rate) is essential for fair comparisons among LoRA methods.
- This study suggests vanilla LoRA remains a strong baseline and should not be dismissed without proper tuning.
- A second-order (Hessian) analysis explains why each variant’s best learning rate differs: sharper curvature → lower safe learning rate.
- Reported big gains from new LoRA variants may vanish after tuning, so future papers should search hyperparameters broadly.
Why This Research Matters
This study shows that careful tuning can save time, money, and complexity when adapting large models. Teams may not need a brand-new method if plain LoRA, properly tuned, gets almost the same performance. That means faster deployment cycles and simpler maintenance for real-world apps like coding assistants or math tutors. It also encourages fairer, more trustworthy research by pushing for wide hyperparameter sweeps. Understanding the Hessian link helps practitioners pick safer learning rates faster. Overall, it empowers both researchers and engineers to make smarter, more cost-effective choices.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook) You know how when you’re learning to ride a bike, the speed you go matters a lot? Too slow and you wobble. Too fast and you might crash. Finding the right speed helps you learn safely and quickly.
🥬 Filling (The Actual Concept)
- What it is: This paper is about teaching big language models new skills using a small, efficient set of knobs called LoRA adapters, and discovering that the most important knob to tune is the learning rate.
- How it works:
- Start with a big pre-trained model (like a super-smart helper).
- Attach tiny add-on pieces (LoRA adapters) so you don’t need to change the whole model.
- Train those tiny pieces using a learning rate (how big each learning step is).
- Try different learning rates to find the sweet spot.
- Why it matters: If you pick the wrong learning rate, you might think a fancy method is better, when really the plain method just needed a better speed.
🍞 Bottom Bread (Anchor) Imagine two kids learning to ride: one on a shiny new bike and one on a basic bike. If the shiny-bike kid rides too fast and falls, while the basic-bike kid rides at a comfy speed and succeeds, it doesn’t mean the shiny bike is worse—it means speed (learning rate) mattered most.
— New Concepts (explained in the order they’re needed) —
- 🍞 Top Bread (Hook) You know how a huge library holds millions of books, but you don’t read them all at once? 🥬 Filling
- What it is: Large Language Models (LLMs) are super-sized text predictors trained on tons of text.
- How it works: They learn patterns in sentences so they can continue or answer questions.
- Why it matters: They’re powerful but expensive to retrain fully. 🍞 Bottom Bread (Anchor) Asking an LLM “What’s the capital of France?” and getting “Paris” is like it using its giant book of patterns to answer quickly.
- 🍞 Top Bread (Hook) Imagine you already know how to ride a bike, and you only need a quick tune-up for a race. 🥬 Filling
- What it is: Fine-tuning adjusts a pre-trained model to do a new task better.
- How it works: Start from a strong model, then nudge it using examples from the new task.
- Why it matters: It’s faster and cheaper than starting from scratch. 🍞 Bottom Bread (Anchor) A model good at general English can be fine-tuned to write Python code or solve math problems.
- 🍞 Top Bread (Hook) Think of adding a small gadget to a bike instead of replacing the whole frame. 🥬 Filling
- What it is: Parameter-Efficient Fine-Tuning (PEFT) changes only a few extra parts, not the whole model.
- How it works: Freeze big weights, train small adapters.
- Why it matters: Saves memory, time, and money. 🍞 Bottom Bread (Anchor) You attach a tiny new handle switch (adapter) instead of rebuilding the entire bike.
- 🍞 Top Bread (Hook) Like using two thin ropes to steer a heavy ship instead of moving the whole engine. 🥬 Filling
- What it is: LoRA (Low-Rank Adaptation) is a PEFT method that adds two small matrices (A and B) to approximate needed updates.
- How it works: Keep original weights frozen; only learn A and B; then fold them back in.
- Why it matters: Gives big improvements at tiny cost. 🍞 Bottom Bread (Anchor) You press small buttons (A and B) to nudge a giant robot arm instead of remaking the arm.
- 🍞 Top Bread (Hook) When baking cookies, the oven temperature changes everything. 🥬 Filling
- What it is: Learning rate is how big each step of learning is.
- How it works: Too big → wobble or crash; too small → crawling progress; just right → fast, stable learning.
- Why it matters: It often decides success or failure. 🍞 Bottom Bread (Anchor) Turn the dial too hot and cookies burn; too cold and they never bake.
- 🍞 Top Bread (Hook) When packing a suitcase, ‘rank’ is like how many kinds of clothes you bring. 🥬 Filling
- What it is: LoRA rank is the adapter’s size (how many directions it can adjust in).
- How it works: Low rank = fewer knobs; high rank = more expressive knobs.
- Why it matters: Rank changes capacity and can flip which method looks better. 🍞 Bottom Bread (Anchor) With rank 8, you bring basics; with rank 256, you can dress for almost any occasion.
- 🍞 Top Bread (Hook) A bumpy road needs slower biking; a smooth road can handle faster speeds. 🥬 Filling
- What it is: The Hessian matrix describes how curved (bumpy) the loss surface is. Its biggest eigenvalue measures the sharpest bump.
- How it works: Sharper curvature → need smaller learning rate; flatter curvature → can use larger learning rate.
- Why it matters: It tells us why different LoRA variants prefer different learning rates. 🍞 Bottom Bread (Anchor) If the hill is very steep (big eigenvalue), you downshift and go slow (smaller learning rate).
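To make the LoRA mechanics above concrete, here is a minimal sketch of a LoRA-wrapped linear layer in PyTorch. The class name, init scale, and defaults are illustrative choices of ours, not the paper's code:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper: output = frozen base(x) + (alpha/r) * x A^T B^T."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the big pre-trained weights
        # Small trainable factors: A is (r, in), B is (out, r).
        self.A = nn.Parameter(0.01 * torch.randn(r, base.in_features))
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # B = 0, so training starts from the base model
        self.scaling = alpha / r  # alpha = r gives scaling 1, matching the paper's protocol

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)
```

“Folding them back in” then just means replacing the base weight W with W + scaling · (B @ A) after training, so inference costs nothing extra.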
The World Before
- LLMs were already powerful, but fully fine-tuning billions of weights was costly. PEFT—and especially LoRA—became the go-to.
- New LoRA variants (like PiSSA, MiLoRA, Init[AB], DoRA) claimed big gains, but many comparisons used a single, fixed learning rate.
The Problem
- Neural nets are sensitive to training settings, especially learning rate. Fixed settings can make one method look unfairly better.
Failed Attempts
- Many studies reused hyperparameters from prior work or searched very narrow ranges. That can accidentally reward methods that happen to like that specific learning rate.
The Gap
- We needed a broad, systematic hyperparameter search across methods to judge fairly, and a theory (Hessian analysis) to explain why best learning rates differ.
Real Stakes
- Teams might choose a complex method, spend more compute, or even deploy worse models—simply because learning rate wasn’t tuned. This affects everyday tools like code assistants, math solvers, and chatbots.
02 Core Idea
🍞 Top Bread (Hook) Imagine a race where every runner must wear shoes in just one size. Some runners do great; others stumble. But if you let each runner pick the right shoe size, the race becomes fair—and the finish times get much closer.
🥬 Filling (The Actual Concept)
- What it is: The key insight is that learning rate selection is so important that, once tuned, vanilla LoRA performs almost as well as advanced LoRA variants across tasks and models.
- How it works:
- Compare vanilla LoRA with 4 popular variants under a wide grid of learning rates, ranks, and batch sizes.
- Record the best (peak) performance each method can achieve.
- Analyze the Hessian’s top eigenvalue to explain why each method’s ‘best’ learning rate differs.
- Why it matters: Without tuning, you might conclude a variant is better. With tuning, their best scores converge, changing which method you’d pick in practice.
🍞 Bottom Bread (Anchor) It’s like testing sneakers: if you only try a size 8, you may think Brand A is better. But once each person tries their best size, the brands perform similarly.
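As a sketch, the whole comparison protocol fits in a few lines; `train_and_eval` below is a hypothetical placeholder for a full fine-tune-then-evaluate pipeline, not an API from the paper:

```python
import numpy as np

def train_and_eval(method: str, lr: float) -> float:
    """Placeholder: fine-tune with `method` at learning rate `lr`, return test accuracy."""
    raise NotImplementedError("plug in your training and evaluation pipeline here")

methods = ["lora", "pissa", "milora", "init_ab", "dora"]
lrs = np.logspace(-6, -3, num=13)  # 1e-6 .. 1e-3, four points per decade

# Judge each method at its own best learning rate, not at one shared value.
peaks = {m: max(train_and_eval(m, lr) for lr in lrs) for m in methods}
```

The paper's finding is that the values in `peaks` end up within roughly 1–2% of each other, even when any single shared learning rate would have spread them far apart.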
Multiple Analogies (3 ways to see the same idea)
- Cooking: If you always bake at 350°F, some recipes taste great, others flop. Let each recipe pick its own best temperature, and many dishes become equally tasty.
- Radio Tuning: If you keep the dial fixed, one station sounds clear, others are static. Turn the dial to each station’s sweet spot, and many stations sound crisp.
- Bike Gears: One fixed gear makes certain hills manageable, others impossible. Shift gears (learning rate) to match the slope (curvature), and rides even out.
Before vs After
- Before: Advanced LoRA variants looked clearly superior in some papers—but under fixed or narrow hyperparameters.
- After: With a broad learning rate search, methods cluster within 1–2% at their peaks, and vanilla LoRA is often just as strong.
- New understanding: Differences often reflect preferred learning rate ranges shaped by each method’s loss curvature (Hessian).
Why It Works (intuition, not equations)
- The loss surface can be flat or steep along different directions. The steeper the surface (bigger top eigenvalue), the smaller your steps must be to avoid bouncing or diverging.
- Different initializations (like PiSSA vs vanilla LoRA) start you at different curvatures, changing how big a safe step is. That’s why PiSSA generally prefers smaller learning rates.
- Once you match step size to curvature, performance across methods evens out.
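For readers who want the textbook version of this intuition: on a quadratic model of the loss, gradient descent is stable only below a threshold set by the sharpest curvature direction. This is a classical result we state for background, not a formula quoted from the paper:

```latex
% Quadratic model of the loss around a point, with Hessian H:
%   L(w) = \tfrac{1}{2} w^{\top} H w
% One gradient step contracts w only if every eigen-direction contracts:
w_{t+1} = w_t - \eta \nabla L(w_t) = (I - \eta H)\, w_t
\quad\Longrightarrow\quad
\text{stable} \iff \eta < \frac{2}{\lambda_{\max}(H)}
```

So a larger top eigenvalue directly shrinks the range of safe learning rates, which is exactly the PiSSA-vs-LoRA pattern described above.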
Building Blocks (each with a mini sandwich)
- Learning Rate 🍞 Hook: Turn the music volume too high and it hurts; too low and you can’t hear. 🥬 Concept: Learning rate sets step size. Right size → stable, fast learning. Wrong size → stalls or crashes. 🍞 Anchor: Picking 2e-4 instead of 2e-5 can turn a failing run into a winning one—or vice versa.
- Rank (capacity) 🍞 Hook: A tiny toolbox vs a full toolbox. 🥬 Concept: Rank controls how flexible adapters are. Lower rank = fewer tools; higher rank = more tools. 🍞 Anchor: At rank 8, DoRA may look better; at rank 256, vanilla LoRA can catch up or win.
- Hessian/Sharpness 🍞 Hook: Walking on a sharp ridge requires baby steps. 🥬 Concept: Larger top eigenvalue = sharper ridge = smaller safe learning rate. 🍞 Anchor: PiSSA showed larger eigenvalues than LoRA, matching its need for smaller learning rates.
- Fair Tuning 🍞 Hook: Don’t compare apples at room temperature to oranges fresh from the freezer. 🥬 Concept: Each method deserves its own tuning sweep. 🍞 Anchor: After sweeping, peak scores across methods cluster tightly.
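A tiny runnable illustration of the Learning Rate and Sharpness blocks together, using the 1-D toy loss L(w) = w² (our example, not the paper's). Here the Hessian is the constant 2, so the stability threshold is η < 2/2 = 1:

```python
def run(eta: float, steps: int = 5, w: float = 1.0) -> float:
    """Gradient descent on L(w) = w**2, whose gradient is 2*w."""
    for _ in range(steps):
        w = w - eta * 2 * w  # each step multiplies w by (1 - 2*eta)
    return w

print(run(0.1))  # ~0.33: stable but slow (step too small)
print(run(0.4))  # ~0.0003: fast and stable (near the sweet spot)
print(run(1.5))  # -32.0: diverges, |w| doubles each step (past the threshold)
```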
03 Methodology
🍞 Top Bread (Hook) Imagine judging five cookies by baking all of them at exactly 350°F for 7 minutes. That’s not fair—some cookies need 325°F, some 375°F, and some need more or less time. You should try a range for each recipe.
🥬 Filling (The Actual Concept)
- What it is: A head-to-head re-evaluation of vanilla LoRA and four variants, each given a wide learning rate grid (and selected sweeps of ranks and batch sizes) under the same training protocol.
- How it works (high level): Input (datasets + pre-trained models) → Choose a LoRA-style method → Sweep learning rates (and sometimes ranks, batch sizes) → Train adapters with a fixed protocol → Measure accuracy on standard test sets → Pick the best result per method → Compare peaks fairly.
- Why it matters: It isolates method quality from tuning luck, letting us see whether any method truly has a consistent advantage.
Step-by-step (like a recipe)
1. Pick base models
- What happens: Use Qwen3-0.6B, Gemma-3-1B, and Llama-2-7B (decoder-only, non-instruction-tuned).
- Why it exists: To test across different model sizes and ages.
- Example: Llama-2-7B is a popular open model; Gemma-3-1B is smaller and newer.
2. Choose tasks and data
- What happens: Train on MetaMathQA for math and CodeFeedback (Python) for coding. Test on GSM8K + MATH (math) and HumanEval + MBPP (code) using accuracy/pass@1.
- Why it exists: Math and code are challenging and widely used; using standard benchmarks makes results meaningful.
- Example: Llama-2-7B reaching ~35% on the math benchmarks is like a solid B on a tough test.
3. Define the competitors
- What happens: Compare 5 PEFT methods:
  - Vanilla LoRA (baseline)
  - PiSSA (SVD-based init with principal components)
  - MiLoRA (SVD-based init with minor components)
  - Init[AB] (random non-zero init for both A and B)
  - DoRA (separates direction and magnitude updates)
- Why it exists: These represent popular families: vanilla, initialization tweaks, and architecture tweaks.
- Example: Think of five different cookie recipes.
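To make the initialization difference between PiSSA and MiLoRA concrete, here is a hedged sketch of SVD-based adapter init. It follows the published descriptions in spirit (principal vs. minor singular components); the function name and details are ours:

```python
import torch

def svd_init(W: torch.Tensor, r: int, principal: bool = True):
    """Split W into a frozen residual plus trainable LoRA factors, W ≈ W_res + B @ A.

    principal=True keeps the top-r singular components in B @ A (PiSSA-style);
    principal=False keeps the bottom-r, minor components (MiLoRA-style).
    """
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    idx = slice(0, r) if principal else slice(-r, None)
    B = U[:, idx] * S[idx].sqrt()                # (out, r)
    A = S[idx].sqrt().unsqueeze(1) * Vh[idx, :]  # (r, in)
    W_res = W - B @ A  # frozen residual; holds everything not given to the adapter
    return W_res, A, B
```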
4. Set a shared training protocol
- What happens: Same optimizer (AdamW), same scheduler family (cosine with warmup), same adapter placements (all linear layers), same precision rules, one epoch, scaling factor α=r (so overall LoRA scaling γ_r=1), fixed max sequence length, and no dropout.
- Why it exists: To ensure differences come from methods or tuned hyperparameters, not from hidden training tricks.
- Example: Bake all cookies in the same oven with the same pans and same timer—only change temperature and time as part of the tuning.
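Restated as an illustrative config (the field names are ours; the paper fixes these settings but does not prescribe this exact structure):

```python
rank = 128  # one of the swept values
protocol = {
    "optimizer": "AdamW",
    "lr_schedule": "cosine_with_warmup",
    "adapter_placement": "all_linear_layers",
    "epochs": 1,
    "lora_rank": rank,
    "lora_alpha": rank,   # alpha = r, so the overall scaling gamma_r = alpha / r = 1
    "lora_dropout": 0.0,  # no dropout
    # max sequence length is fixed across runs; the exact value is not restated here
}
```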
5. Sweep key hyperparameters
- What happens: Learning rates are swept log-uniformly from 1e-6 to 1e-3 (or from 2e-5 to 3.6e-3 for Llama)—four values per decade; ranks {4,8,16,32,64,128,256} (on selected setups); batch sizes {16,64,128} (selected setups).
- Why it exists: Broad searches uncover each method’s true sweet spot.
- Example: On Gemma-3-1B, r=128, B in {16, 64, 128}, η across 12+ points.
6. Train, evaluate, and record peaks
- What happens: For each config, fine-tune, then evaluate on test sets; for Qwen and Gemma use three random seeds and report mean±std. Keep the best score per method.
- Why it exists: Averaging reduces luck; best score shows the method’s potential when tuned well.
- Example: If DoRA’s top at η=2e-4 is 20.96% on Gemma math (B=16, r=128), that’s its fair peak.
7. Analyze the Hessian to explain learning rate preferences
- What happens: Estimate the top eigenvalue (λ_max) of the layer-wise Hessian at initialization for trainable LoRA parameters using Lanczos + Hessian-vector products.
- Why it exists: Bigger λ_max means sharper curvature → smaller safe learning rate. This links theory and practice.
- Example: PiSSA shows larger λ_max than LoRA across layers, matching its need for smaller η.
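A compact sketch of that measurement in PyTorch. We use power iteration on Hessian-vector products as a simpler stand-in for the Lanczos procedure the paper names; both estimate the same top eigenvalue:

```python
import torch

def top_hessian_eigenvalue(loss: torch.Tensor, params: list, iters: int = 20) -> float:
    """Estimate lambda_max of the Hessian of `loss` w.r.t. `params` without
    ever forming the Hessian, using only Hessian-vector products."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    for _ in range(iters):
        norm = torch.sqrt(sum((x * x).sum() for x in v))
        v = [x / norm for x in v]
        # Hessian-vector product: differentiate (grad . v) w.r.t. the params.
        hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)
        eig = sum((h * x).sum() for h, x in zip(hv, v))  # Rayleigh quotient (v is unit norm)
        v = [h.detach() for h in hv]
    return eig.item()
```

Here `params` would be the trainable LoRA tensors (A and B) of one layer at initialization, and `loss` a batch loss computed with the graph retained.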
The Secret Sauce (what makes it clever)
- Equal footing: Every method gets its own best learning rate.
- Wide sweeps: Many learning rates per decade reduce cherry-picking.
- Second-order view: Hessian eigenvalues connect the ‘why’ behind tuning differences.
- Rank coverage: Checking multiple ranks catches flip-flop performance (who wins at low vs high rank).
Concrete data walkthrough
- Example: Llama-2-7B, math, r=128, B=128.
  - LoRA peaks ~35.66% around η≈6.3e-4.
  - DoRA peaks ~36.41% around η≈1.1e-4 to 2e-4.
  - PiSSA peaks at smaller η and, in some cases, stays stable longer at very large η before collapsing.
  - Yet the peak gaps are small (≈1% range), underscoring that tuning compresses differences.
- Example: Gemma-3-1B, math, r=128.
  - Best per method: LoRA ~20.32%, DoRA ~20.96%, Init[AB] ~20.66%, MiLoRA ~19.99%, PiSSA ~20.65%.
  - Spread within ≈1%; all tightly clustered when tuned.
What breaks without each step
- No shared protocol: Hidden differences (like dropout or placement) could skew results.
- No broad learning rate sweep: A method may look unfairly strong or weak.
- No Hessian check: We’d see different best η’s but lack a satisfying reason why.
- No rank sweeps: We’d miss the flip that happens at low vs high rank.
04 Experiments & Results
🍞 Top Bread (Hook) If you only let runners wear size-8 shoes, your results may just be measuring who fits size 8, not who runs fastest. Let each runner choose the right shoe size, and the race becomes real.
🥬 Filling (The Actual Concept)
- What it is: A broad empirical study showing that, when learning rates are tuned, vanilla LoRA and multiple variants achieve almost the same best accuracy across math and coding tasks.
- How it works: Models (Qwen3-0.6B, Gemma-3-1B, Llama-2-7B) are fine-tuned with each method over a wide learning rate grid. Ranks and batch sizes are also explored in selected setups. We report peak results per method.
- Why it matters: Reported big wins for new methods often vanish under tuning; vanilla LoRA remains a competitive baseline.
🍞 Bottom Bread (Anchor) Think of tuning as letting every runner adjust their laces and socks. Once everyone is comfy, their times cluster closely.
The Test (what was measured and why)
- Accuracy on standard test sets:
  - Math: GSM8K, MATH.
  - Code: HumanEval, MBPP (pass@1).
- Why accuracy: It directly reflects how helpful the fine-tuned model is at these tasks.
The Competition (who was compared)
- Vanilla LoRA.
- PiSSA (principal components init).
- MiLoRA (minor components init).
- Init[AB] (both A and B non-zero random init).
- DoRA (direction + magnitude decomposition).
The Scoreboard (with context)
- Qwen3-0.6B (math, r=128, B=64): All five methods peaked around 49–49.6% accuracy with proper learning rates—like everyone scoring an A while differing by a fraction of a point.
- Gemma-3-1B (math, r=128): LoRA ≈20.32%; DoRA ≈20.96%; Init[AB] ≈20.66%; MiLoRA ≈19.99%; PiSSA ≈20.65%. That’s a spread within about 1%, like classmates scoring 90, 91, and 92 on the same test.
- Llama-2-7B (math and code, r=128, B=128): Peaks within roughly 0.5–1.8% across methods, again very tight.
- Rank sweeps: At low ranks (e.g., r=8), DoRA or MiLoRA may edge out LoRA by up to ≈1%; at higher ranks (e.g., r=128–256), LoRA or PiSSA may slightly lead. The ‘winner’ can switch with rank.
Surprising Findings
- PiSSA often needs a notably smaller learning rate than LoRA (and sometimes stays stable at very large rates longer). Hessian analysis showed PiSSA’s loss is sharper initially (bigger top eigenvalue), explaining the smaller best learning rate.
- Learning rate tuning mattered far more than batch size in these settings. With a fixed, suboptimal learning rate, results could drop from a near-top score to a mediocre one.
- The optimal learning rate tends to grow with batch size (classic scaling rule), clarifying past reports where larger batches looked worse—likely due to untuned learning rates.
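One common form of that classic rule, stated as a heuristic starting point rather than a formula from this paper (linear scaling; a square-root variant is also widely used):

```latex
% Linear-scaling heuristic: if a learning rate \eta_0 works at batch size B_0,
% start the search for a new batch size B near
\eta(B) \approx \eta_0 \cdot \frac{B}{B_0}
% and still sweep around it rather than trusting the prediction blindly.
```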
Big Picture Takeaway
- After fair tuning, methods’ peak performances cluster tightly. Reported large gains from single-setting comparisons often don’t generalize. Vanilla LoRA remains a strong, simple choice when learning rate is tuned.
Hessian Lens (why η differs)
- Block-wise Hessian estimates show PiSSA has larger top eigenvalues than LoRA at initialization, matching its preference for smaller learning rates. Init[AB] and MiLoRA are closer to LoRA on Gemma and Llama, but slightly larger on Qwen—consistent with slightly smaller preferred η there.
05 Discussion & Limitations
🍞 Top Bread (Hook) Think of building a treehouse with a small toolbox. You can do a lot—but there are still some jobs that need more tools or more time.
🥬 Filling (The Actual Concept)
- Limitations (what this CAN’T do):
  - Only decoder-only models up to 7B were tested; results may differ on larger or multi-modal models.
  - Only math and code tasks were evaluated; other tasks (dialogue safety, retrieval-augmented QA) might behave differently.
  - Not all LoRA variants in the literature were included; a few may truly help in specialized conditions.
  - The Hessian was analyzed at initialization and per-layer blocks; full-trajectory, whole-model analyses could reveal further nuances (especially for DoRA’s evolving curvature).
- Required Resources:
  - Multiple GPUs (e.g., 4× A6000/3090), DeepSpeed for training, vLLM for evaluation.
  - Time to run broad learning rate sweeps; three seeds for stability on smaller models.
- When NOT to Use:
  - If you cannot afford any tuning at all and must trust defaults, your results may be misleading.
  - If you need extreme low-rank adapters in a very data-limited regime, a variant tailored for that corner case might still be worth trying (but tune the learning rate!).
- Open Questions:
  - Do these findings scale to 70B+ models and instruction-tuned backbones?
  - How do results change for safety, multilingual, or long-context tasks?
  - Can we predict the right learning rate from quick Hessian probes and skip most of the sweep?
  - How do optimizer choices (e.g., AdamW vs Lion), warmup, and schedulers interact with each variant’s curvature?
🍞 Bottom Bread (Anchor) It’s like saying, “Our recipe works great in a home oven; we still need to test it in a commercial bakery and for different desserts.”
06 Conclusion & Future Work
🍞 Top Bread (Hook) If shoes fit, lots of runners can run fast. If not, even great runners stumble.
🥬 Filling (The Actual Concept)
- 3-Sentence Summary
- This paper re-tests vanilla LoRA and four advanced variants under broad learning rate sweeps (plus selected rank and batch size sweeps) across math and coding tasks.
- After tuning, all methods reach similar best performance (within about 1–2%), and vanilla LoRA remains a strong baseline.
- A Hessian analysis explains why each method prefers a different learning rate: sharper curvature means a smaller safe step.
- Main Achievement: Showing that learning rate tuning largely closes the performance gaps among LoRA-style methods, reframing many previously reported advantages.
- Future Directions: Scale to larger models and more tasks; develop fast predictors of the best learning rate (e.g., quick Hessian probes); study optimizer and scheduler interactions; expand to more variants.
- Why Remember This: Because careful tuning, especially of the learning rate, can matter more than picking a fancy method. Fair comparisons require fair sweeps, and vanilla LoRA shouldn’t be underestimated.
Practical Applications
- Start any LoRA fine-tuning run with a broad learning rate sweep (e.g., 1e-6 to 1e-3, 4 points/decade) before judging methods.
- Use vanilla LoRA as a strong baseline and only switch to a variant after confirming a consistent, tuned advantage.
- Scale the learning rate with batch size (try increasing η when you increase the batch size) instead of keeping η fixed.
- If you try PiSSA, begin with smaller learning rates than you would for vanilla LoRA.
- At very low ranks (e.g., r≤8), try DoRA and MiLoRA, but still sweep the learning rate to validate gains.
- At medium/high ranks (e.g., r≥64), re-check vanilla LoRA and PiSSA; run a fresh sweep to see who wins there.
- Keep adapter placement and other training settings fixed across methods to avoid hidden confounds.
- Run multiple seeds on smaller models to ensure results are stable before scaling up.
- If resources allow, estimate sharpness (e.g., quick Hessian probes) to guide initial learning rate choices.
- Document your sweep ranges and report peak results for each method to ensure fair, reproducible comparisons.