
JustRL: Scaling a 1.5B LLM with a Simple RL Recipe

Intermediate
Bingxiang He, Zekai Qu, Zeyuan Liu et al. Ā· 12/18/2025
arXiv Ā· PDF

Key Summary

  • JustRL shows that a tiny, steady recipe for reinforcement learning (RL) can make a 1.5B-parameter language model much better at math without fancy tricks.
  • It uses single-stage training, fixed hyperparameters, a simple rule-based checker for answers, and one stabilizer (ā€œclip higherā€)—and skips length penalties, curricula, and dynamic schedules.
  • On two different 1.5B models, the same fixed settings work out-of-the-box and train smoothly for thousands of steps without crashes or weird jumps.
  • JustRL-DeepSeek-1.5B reaches a 54.9% average across nine math benchmarks, beating more complex methods while using about half the compute.
  • JustRL-Nemotron-1.5B reaches a 64.3% average, slightly topping a curriculum-based rival while using roughly 2Ɨ less compute.
  • Adding ā€œstandard tricksā€ like explicit length penalties or more forgiving verifiers actually hurts performance by squashing exploration.
  • Training dynamics (entropy, reward, and response length) are healthy: steady rewards, stable exploration, and natural length compression without penalties.
  • The paper argues to start simple, scale stably, and only add complexity when a clear problem appears.
  • Models and code are released to set an easy, trustworthy baseline for the community.

Why This Research Matters

Simple, steady training that works out-of-the-box lowers the barrier for schools, startups, and researchers without giant budgets. If a 1.5B model can learn strong math reasoning without complicated schedules, more people can build and improve helpful AI tutors and tools. Stable training curves save time: fewer crashes and band-aid fixes mean faster progress and clearer science. A shared, minimal baseline also makes studies easier to compare, so the community learns what truly matters. Finally, using roughly half the compute for the same or better results is good for the planet and pocketbooks—greener, cheaper, and more accessible AI.

Detailed Explanation


01Background & Problem Definition

šŸž Top Bread (Hook): You know how some kids study with color-coded planners, timers, and five different notebooks, while others just stick to a simple routine—and still ace the test? Sometimes, keeping it simple works best.

🄬 Filling (The Actual Concept): The world before this paper was full of very complicated training plans to teach small language models how to solve tough math problems. Many groups believed you had to use multi-stage training, change settings mid-flight, and carefully control answer lengths to keep training from going ā€œoff the rails.ā€

Why it matters: If we add too many knobs and levers, we can’t tell which ones actually help, and smaller labs can’t afford all that complexity.

šŸž Bottom Bread (Anchor): Imagine trying to bake cookies with 12 different oven settings, switching trays every 3 minutes. It might work, but do you really need all that if a steady 350°F for 12 minutes makes perfect cookies?

šŸž Top Bread (Hook): Imagine a puppy learning tricks—it tries things, and when it sits on command, it gets a treat.

🄬 Filling (The Actual Concept): Reinforcement Learning (RL) is a way for models to learn by trying answers and getting a reward when they’re right.

  • How it works: (1) The model proposes an answer, (2) A checker gives a thumbs-up or thumbs-down, (3) The model tweaks itself to make thumbs-up more likely next time.
  • Why it matters: Without RL, small models that only copy big ā€œteacherā€ models (distillation) hit a ceiling—once they learn everything the teacher knows, they stop improving.

šŸž Bottom Bread (Anchor): Like a puppy that learned by watching an older dog (distillation) but only truly improves by getting its own practice and treats (RL).

šŸž Top Bread (Hook): Think of a smart parrot that can talk in long sentences and explain steps.

🄬 Filling (The Actual Concept): A Large Language Model (LLM) is a program that reads and writes language, including solving math if trained properly.

  • How it works: It predicts the next word over and over, guided by training. In RL, it also gets rewards for correct solutions.
  • Why it matters: LLMs are powerful but need good training signals to reason reliably.

šŸž Bottom Bread (Anchor): When you ask, ā€œWhat’s 17Ɨ19?ā€, an LLM that reasons well can explain the steps, not just guess.

šŸž Top Bread (Hook): Imagine running a marathon without breaking it into stages—just start steady and keep going.

🄬 Filling (The Actual Concept): Single-Stage Training means training in one continuous run instead of multiple stages with different settings.

  • How it works: Pick one setup; keep it throughout training; no phase changes.
  • Why it matters: Stage switches can introduce instability or confusion, making it harder to know what helped or hurt.

šŸž Bottom Bread (Anchor): It’s like finishing your homework in one focused session rather than chopping it into tiny bits with different rules each time.

šŸž Top Bread (Hook): Picture setting your oven to one temperature and not touching the dial.

🄬 Filling (The Actual Concept): Fixed Hyperparameters means all the training settings (like learning rate or batch size) stay the same from start to end.

  • How it works: Choose sensible values once; don’t schedule or adapt them during training.
  • Why it matters: Changing too many settings mid-training can cause weird jumps, make results hard to reproduce, and hide the real source of gains.

šŸž Bottom Bread (Anchor): It’s like riding a bike at a steady speed instead of speeding up and slamming the brakes randomly.

šŸž Top Bread (Hook): Think of a math teacher who checks if your final boxed answer exactly matches the correct one.

🄬 Filling (The Actual Concept): A Rule-based Verifier is a simple checker that reads the model’s final boxed answer and marks it correct or not.

  • How it works: The model must put its final answer inside \boxed{...}; the checker compares it to the right answer.
  • Why it matters: A clean, simple reward signal keeps training stable and fast.

šŸž Bottom Bread (Anchor): Like a multiple-choice test that’s graded instantly by scanning the bubbles.

šŸž Top Bread (Hook): Imagine playing a game where you should try new moves, but not go totally wild.

🄬 Filling (The Actual Concept): Entropy Control keeps model exploration at a healthy level—neither too random nor too repetitive.

  • How it works: Track how unpredictable the model’s choices are and prevent collapse. In this paper, a simple ā€œclip higherā€ tweak ensures stable updates.
  • Why it matters: If exploration collapses, the model stops learning new, better ways to solve problems.

šŸž Bottom Bread (Anchor): Like a coach who says, ā€œTry new plays, but stick to the sport’s rules.ā€

šŸž Top Bread (Hook): Think of a teacher who starts with easy problems and slowly makes them harder.

🄬 Filling (The Actual Concept): Curriculum Learning organizes training from easy to hard.

  • How it works: Stage 1 uses simpler items; later stages add harder ones.
  • Why it matters: Helpful sometimes, but it adds complexity and can hide which parts truly matter.

šŸž Bottom Bread (Anchor): Like learning multiplication tables before algebra—good idea, but not always required if the student already has the basics.

šŸž Top Bread (Hook): Imagine turning the stove up and down while cooking because you’re worried the soup will boil over.

🄬 Filling (The Actual Concept): Dynamic Hyperparameter Scheduling changes settings (like temperature or learning rate) during training.

  • How it works: Algorithms adjust knobs in response to metrics.
  • Why it matters: It can help, but also adds complexity and can cause instability if misapplied.

šŸž Bottom Bread (Anchor): If your food keeps burning or staying cold, constant fiddling may be the problem, not the solution.

šŸž Top Bread (Hook): Think of a writing contest where super-long essays get a penalty.

🄬 Filling (The Actual Concept): Length Penalties punish overly long responses.

  • How it works: Add a negative reward if outputs exceed a certain length.
  • Why it matters: Can backfire by making the model cut corners before it has learned solid reasoning.

šŸž Bottom Bread (Anchor): Like docking points for wordy essays may make students write too short and skip important steps.

šŸž Top Bread (Hook): Picture a roller coaster that should go smoothly up and down—but instead it jerks wildly.

🄬 Filling (The Actual Concept): Training Instability is when learning oscillates, collapses, or drifts.

  • How it works: Things like reward collapse, entropy drift, or length explosion appear, often prompting emergency fixes.
  • Why it matters: Instability wastes compute and hides what truly helps.

šŸž Bottom Bread (Anchor): Like a shaky table makes it hard to write neatly—everything else becomes harder too.

šŸž Top Bread (Hook): Think of a kid who discovers one safe game and refuses to try anything else.

🄬 Filling (The Actual Concept): Exploration Collapse is when the model stops trying new solutions.

  • How it works: Entropy falls too low; the model repeats itself.
  • Why it matters: Without trying variations, it can’t discover better reasoning paths.

šŸž Bottom Bread (Anchor): Like always guessing the same multiple-choice option and never learning the material.

šŸž Top Bread (Hook): Imagine a report card that averages all your grades to show overall performance.

🄬 Filling (The Actual Concept): Average Accuracy summarizes how often the model is correct across tasks.

  • How it works: Compute percent correct per benchmark, then average.
  • Why it matters: A single number lets us compare methods clearly.

šŸž Bottom Bread (Anchor): Like seeing you got an A- average this semester across all subjects.

02Core Idea

šŸž Top Bread (Hook): Picture a messy toolbox with fancy gadgets you barely use—then someone fixes the bike with just a wrench and a screwdriver.

🄬 Filling (The Actual Concept): The ā€œAha!ā€ of JustRL is: a simple, single-stage RL setup with fixed hyperparameters, a rule-based reward, and one stability tweak (ā€œclip higherā€) can scale a 1.5B reasoning model to state-of-the-art results—no curricula, no dynamic schedules, no length penalties.

  • How it works: (1) Keep training in one stage, (2) Don’t change hyperparameters, (3) Use a simple boxed-answer checker for rewards, (4) Apply GRPO with clip-higher to keep updates stable, (5) Let lengths settle naturally under a generous 16K context.
  • Why it matters: If simple wins—saving compute and avoiding instability—then many complex tricks might be optional, not essential.

šŸž Bottom Bread (Anchor): It’s like winning the science fair with clean, sturdy engineering instead of a flashy but fragile contraption.

Three analogies (same idea, new angles):

  1. Cooking: Instead of juggling five burners and three timers, you set one steady heat, stir occasionally, and the stew turns out perfect.
  2. Sports: Rather than rewriting the playbook mid-game, you run a solid, well-practiced game plan with consistent coaching.
  3. Studying: Skip the rainbow-highlighted schedule—make a simple study habit, do enough reps, and your grades climb week after week.

Before vs After:

  • Before: People believed small models needed many stages, dynamic knobs, and length-shaping penalties to avoid crashes. Training curves often wobbled; compute costs ballooned.
  • After: With JustRL, two different 1.5B models train smoothly for thousands of steps using the same fixed settings—and match or beat complex rivals with about 2Ɨ less compute.

Why it works (intuition without equations):

  • Stable rewards: A binary, rule-based checker provides a clean, fast signal.
  • Healthy exploration: Clip-higher curbs oversized policy updates without strangling exploration. Entropy stays in a safe band, so the model keeps trying new solution paths.
  • Natural length settling: A big enough context window lets the model first explore verbose reasoning and then compress on its own—no penalty tug-of-war needed.
  • Less interference: Removing overlapping tricks avoids pushing the model in conflicting directions (e.g., explore more vs. talk shorter).
  • Scale over tweaks: Enough steps, batch size, and rollouts give learning room to breathe, so simplicity + scale beats clever micromanagement.

Building blocks (each as a mini concept):

šŸž Hook: Imagine grading by a clear answer key. 🄬 Concept: Rule-based Verifier gives binary rewards by checking a boxed final answer. How: compare \boxed{...} to ground truth; reward 1 if match else 0. Why: Simple, fast signal. šŸž Anchor: Like a scantron machine grading multiple-choice answers.

šŸž Hook: Think of practicing a move 8 times and then keeping the best idea. 🄬 Concept: Rollouts are multiple sampled solutions per problem (here, 8). How: sample 8 attempts; score each; learn from them. Why: Seeing several tries helps the model understand which attempts work. šŸž Anchor: Like rehearsing a speech several times to keep the strongest parts.

šŸž Hook: Picture a careful coach who won’t let risky changes ruin the game plan. 🄬 Concept: GRPO (a policy-optimization algorithm variant) updates the model using advantages from reward signals. How: compare sampled answers’ rewards, increase the chance of better ones; apply clip-higher to avoid too-big updates. Why: Prevents unstable jumps. šŸž Anchor: Like tuning an instrument gently so a string doesn’t snap.

šŸž Hook: Imagine a librarian who never rearranges shelves mid-day. 🄬 Concept: Fixed Hyperparameters keep training steady. How: one learning rate, one batch size, no mid-course resets. Why: Reduces surprises and makes results reproducible. šŸž Anchor: Like a classroom with a predictable routine that helps everyone focus.

šŸž Hook: Think of a generous notebook where you can first write long drafts, then edit down. 🄬 Concept: 16K Context without length penalties. How: allow long reasoning and let the model self-compress. Why: Penalties can force premature shortcuts and harm exploration. šŸž Anchor: Draft first, polish later—quality improves naturally.

Secret insight: When you remove clashes between tricks, a clean, scaled baseline can discover strong behaviors on its own. That’s the core bet JustRL makes—and the results back it up.

03Methodology

At a high level: Math problem + simple prompt → sample 8 solutions (rollouts) → rule-based checker gives binary rewards → GRPO updates the model with clip-higher → repeat for thousands of steps.
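
In code form, the whole pipeline is roughly the loop below. This is a sketch, not the released implementation: sample_batch, sample_rollouts, verify, and grpo_update are hypothetical helpers standing in for the data loader, the sampler, the rule-based checker, and the policy update.

```python
# Rough sketch of the training loop described above; helper names are
# placeholders, not the authors' code.
PROMPT_SUFFIX = ("Please reason step by step, and put your final answer "
                 "within \\boxed{}.")

def train(model, tokenizer, problems, num_steps, rollouts_per_problem=8):
    for step in range(num_steps):                       # thousands of steps, one stage
        batch = sample_batch(problems, batch_size=256)  # fixed batch size
        groups = []
        for problem in batch:
            prompt = problem.question + "\n" + PROMPT_SUFFIX
            # 8 sampled solutions per problem at temperature 1.0
            responses = sample_rollouts(model, tokenizer, prompt,
                                        n=rollouts_per_problem,
                                        temperature=1.0, max_tokens=15_000)
            rewards = [verify(r, problem.answer) for r in responses]  # binary 0/1
            groups.append((prompt, responses, rewards))
        # one GRPO update with "clip higher"; no schedules, no KL term, no resets
        grpo_update(model, groups, lr=1e-6, clip_range=(0.8, 1.28))
```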

Step 0 — Data and prompt

  • What happens: Use DAPO-Math-17k problems. Append a simple suffix prompt: ā€œPlease reason step by step, and put your final answer within \boxed{}.ā€ Max total tokens set to 16K.
  • Why this step exists: Standardized, clean input keeps training predictable; the box rule aligns generation with the verifier.
  • Example: Problem: ā€œIf 2x+3=11, find x.ā€ Model writes reasoning and ends with \boxed{4}.

šŸž Hook: Like trying a riddle several times to see which explanation works best. 🄬 Concept: Rollouts (N=8) are multiple attempts per problem.

  • How it works: For each problem, the model samples 8 different solutions at temperature 1.0 during training.
  • Why it matters: Multiple tries reveal which styles lead to correct boxed answers; learning uses this diversity. šŸž Anchor: Think of shooting 8 free throws; you analyze which form scores.
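
For a sense of what sampling 8 rollouts might look like in practice, here is a sketch using a Hugging Face-style model.generate call; the exact interface, tokenizer handling, and token limits are assumptions for illustration, not the paper's code.

```python
import torch

def sample_rollouts(model, tokenizer, prompt, n=8, temperature=1.0, max_tokens=15_000):
    """Sample n independent solutions for one prompt (illustrative sketch)."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            do_sample=True,               # stochastic sampling, not greedy decoding
            temperature=temperature,      # 1.0 during training rollouts
            num_return_sequences=n,       # 8 attempts per problem
            max_new_tokens=max_tokens,
        )
    prompt_len = inputs["input_ids"].shape[1]
    return [tokenizer.decode(out[prompt_len:], skip_special_tokens=True)
            for out in outputs]
```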

Step 1 — Rewarding with a simple checker

  • What happens: The rule-based verifier checks the final \boxed{...} answer and returns 1 (correct) or 0 (incorrect).
  • Why this step exists: A crisp, low-noise reward helps steady learning; no heavy math libraries needed.
  • Example: If the problem’s answer is 17 and the model ends with \boxed{17}, reward = 1; else 0.

šŸž Hook: Imagine updating your strategy based on which attempts scored. 🄬 Concept: GRPO update with clip-higher.

  • How it works (friend explanation): Compare the 8 attempts; those with reward 1 get boosted; those with 0 get less chance next time. Clip-higher softly caps overly aggressive changes, keeping updates within a safe band.
  • Why it matters: Prevents destabilizing jumps that cause oscillations or collapse. šŸž Anchor: Like learning to kick a soccer ball farther while avoiding wild swings that send it into the stands.
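
A simplified sketch of the two ingredients, group-relative advantages and the asymmetric clip, is shown below. Real GRPO works on per-token log-probability ratios over full responses; this version strips that away to show the shape of the computation, and the example rewards are made up.

```python
import torch

def grpo_advantages(rewards):
    """Group-relative advantages: compare each rollout to its group's mean reward."""
    r = torch.tensor(rewards, dtype=torch.float32)
    return (r - r.mean()) / (r.std() + 1e-6)

def clipped_objective(ratio, advantage, clip_low=0.8, clip_high=1.28):
    """PPO-style clipped surrogate with an asymmetric ("clip higher") range.

    `ratio` is new_policy_prob / old_policy_prob. Letting the upper bound sit
    above the usual symmetric value gives low-probability but promising tokens
    room to grow, which helps keep exploration alive.
    """
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, clip_low, clip_high) * advantage
    return torch.minimum(unclipped, clipped)   # keep the more conservative term

rewards = [1, 0, 0, 1, 0, 0, 0, 1]             # verifier outcomes for 8 rollouts
print(grpo_advantages(rewards))                 # positive for the correct attempts
```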

Step 2 — Keep settings steady

  • What happens: Single-stage training, fixed hyperparameters for both backbones: learning rate 1e-6, batch size 256, mini-batch 64, clip range [0.8, 1.28], no KL loss, temperature 1.0, max response 15K tokens, max prompt 1K.
  • Why this step exists: Simplicity improves stability and makes results transferable (same settings worked on two different 1.5B models).
  • Example: No mid-training resets, no adaptive schedules—press go and let it run.
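
Written out as one static config, the settings above look like this; the dictionary form and key names are ours, but the values are the ones listed in this step.

```python
# The fixed, never-changing settings listed above, as one static config.
JUSTRL_CONFIG = {
    "learning_rate": 1e-6,
    "batch_size": 256,
    "mini_batch_size": 64,
    "clip_range": (0.8, 1.28),     # "clip higher": asymmetric upper bound
    "kl_loss_coef": 0.0,           # no KL loss
    "temperature": 1.0,            # rollout sampling temperature
    "rollouts_per_problem": 8,
    "max_prompt_tokens": 1_000,
    "max_response_tokens": 15_000,
    "stages": 1,                   # single-stage: no schedules, no mid-training resets
}
```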

šŸž Hook: Think of a teacher who doesn’t punish long drafts but does expect clarity in the final answer. 🄬 Concept: No explicit length penalties; rely on natural compression.

  • How it works: Provide enough space (16K) so the model can reason fully at first; as it learns, it tends to become more concise on its own.
  • Why it matters: Penalties in ablations crushed exploration and worsened final accuracy. šŸž Anchor: Like first outlining a big essay, then trimming fluff once you understand the material.

Step 3 — Training duration and scale

  • What happens:
    • DeepSeek-R1-Distill-Qwen-1.5B: ~4,380 steps on 32ƗA800-80GB GPUs (~15 days).
    • OpenMath-Nemotron-1.5B: ~3,440 steps with the same settings.
  • Why this step exists: Sufficient scale allows stable, monotonic learning without complicated interventions.
  • Example: The DeepSeek-based model’s AIME24 avg@32 rises from ~28% to ~58% over ~4,000 steps.

Step 4 — Evaluation

  • What happens: Test on nine math benchmarks (AIME24/25, AMC23, MATH-500, Minerva, OlympiadBench, HMMT Feb 2025, BRUMO 2025, CMIMC 2025). Use Pass@1 averaged over N samples (N=32 for most, N=4 for MATH-500, Minerva, Olympiad), decoding with temperature 0.7 and top-p 0.9; allow up to 32K tokens at test-time. Use CompassVerifier-3B to catch false negatives on evaluation only.
  • Why this step exists: Common, reproducible scripts make comparisons fair; extra verifier reduces accidental misgrading.
  • Example: A problem with answer 2025: the model must end with \boxed{2025} to count as correct.
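
The headline metric is simple enough to write down directly. The sketch below computes Pass@1 averaged over N samples per problem (often written avg@N); in the actual evaluation, each sample's correctness comes from the rule-based checker plus CompassVerifier-3B, which this toy example just takes as given 0/1 values.

```python
def pass_at_1_avg(sample_correctness):
    """Pass@1 averaged over N samples per problem (avg@N), as a percentage.

    sample_correctness maps each problem to a list of 0/1 outcomes for its
    N sampled solutions (N=32 for most benchmarks here, N=4 for a few).
    """
    per_problem = [sum(outcomes) / len(outcomes)
                   for outcomes in sample_correctness.values()]
    return 100.0 * sum(per_problem) / len(per_problem)

# Toy example: 3 problems, 4 samples each
scores = {"p1": [1, 1, 0, 1], "p2": [0, 0, 0, 0], "p3": [1, 1, 1, 1]}
print(round(pass_at_1_avg(scores), 2))   # 58.33
```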

What breaks without each piece:

  • Without rollouts: The model sees too few solution variations; learning signal is weaker.
  • Without a simple verifier: Rewards get noisy or slow; training can wobble.
  • Without clip-higher: Policy updates can spike; entropy may drift and cause instability.
  • Without fixed settings: Hard to debug; interventions may mask real causes.
  • With length penalties: Exploration can collapse early; accuracy drops.

šŸž Hook: Imagine making steady progress in a video game without switching controllers mid-level. 🄬 Concept: Policy Entropy tracking keeps exploration healthy.

  • How it works: Monitor how varied the model’s token choices are; healthy band ~1.2–1.4 later in training.
  • Why it matters: Too low = repeats itself; too high = random babble. JustRL stays in the sweet spot. šŸž Anchor: Like playing boldly but not recklessly.
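
If you want to reuse this health check in your own runs, the entropy itself is cheap to compute from the policy's logits. The sketch below is a generic monitoring helper, not the paper's code; the shapes and the alert threshold are illustrative.

```python
import torch
import torch.nn.functional as F

def mean_token_entropy(logits, response_mask):
    """Average per-token entropy of the policy over generated response tokens.

    logits: [batch, seq_len, vocab]; response_mask: [batch, seq_len] with 1.0
    on response tokens. A value sliding toward 0 is an early warning of
    exploration collapse; the healthy band reported here is roughly 1.2-1.4.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)   # [batch, seq_len]
    return (entropy * response_mask).sum() / response_mask.sum()

# Usage idea: log this every step and flag the run if it stays below ~1.0
# for many consecutive steps.
```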

Secret sauce (what’s clever here):

  • Do less, scale more: A minimal, non-fighting set of choices (single-stage, fixed hyperparams, simple rewards) lets the model self-organize.
  • Clip-higher > micromanagement: One stabilizer beats a tangle of competing tricks.
  • Natural compression: Letting length settle organically avoids the adversarial tug-of-war caused by penalties.

Concrete data walkthrough:

  • Suppose the problem is: ā€œFind the number of integer solutions to x^2 āˆ’ 5x + 6 = 0.ā€ Ground truth: 2 (the roots are x = 2 and x = 3). The model samples 8 solutions; some end with \boxed{3} or \boxed{2, 3} (the wrong count, or the roots instead of their count), while, say, 2 of 8 end with exactly \boxed{2}. The verifier returns reward 1 only for those two exact matches and 0 for the rest. GRPO then increases the probability of the reasoning paths that led to the correct endings. Over time, more samples hit the exact correct boxed answer consistently.

04Experiments & Results

The test (what and why):

  • Benchmarks: Nine respected math sets measured with Pass@1, averaging multiple samples per problem. Why: These reflect challenging, competition-style reasoning, not just routine arithmetic.
  • Metrics: Per-benchmark accuracy and overall average. Why: A clear scoreboard across diverse task styles.

The competition (who we compare against):

  • DeepSeek backbone: DeepScaleR-1.5B, ProRL-V2, BroRL (some results partial).
  • Nemotron backbone: QuestA (curriculum via question augmentation).
  • They use multi-stage training, dynamic schedules, length controls, or many rollouts; JustRL uses none of that, except entropy-friendly clip-higher.

The scoreboard (with context):

  • JustRL-DeepSeek-1.5B: 54.87% average across nine benchmarks, surpassing ProRL-V2’s 53.08%. That’s like earning an A- while a complex classmate gets a solid B+. It also leads on six of nine benchmarks, showing broad gains, not cherry-picking.
  • JustRL-Nemotron-1.5B: 64.32% average, slightly above QuestA’s 63.81%, akin to winning a close race by a stride—impressive because QuestA uses a clever curriculum and extra engineered data.

Compute efficiency (why it matters):

  • DeepSeek track: JustRL roughly halves compute vs ProRL-V2; BroRL uses ~4.9Ɨ more compute by expanding to hundreds of rollouts per example. JustRL reaches strong results without this cost.
  • Nemotron track: Despite more steps, JustRL-Nemotron uses about 2Ɨ less compute than QuestA thanks to simpler, steadier training.

Training dynamics (surprisingly smooth):

  • Entropy: Stays around 1.2–1.4 later in training—no drift up (chaos) or down (stagnation). Translation: exploration is healthy.
  • Reward: Climbs steadily from negative to ~+0.4—no long plateaus or crashes. Translation: the model keeps learning without emergency fixes.
  • Length: Starts verbose (~7–8K tokens) and naturally compresses to ~4–5K by ~1,000 steps without explicit penalties—no length explosion.

Ablations (what happened when we added standard tricks):

  • Overlong penalty added: AIME24 plateau drops to ~50% (vs ~55% baseline). Entropy collapses to ~0.5–0.6—exploration got choked.
  • Overlong penalty + robust verifier: AIME24 falls further to ~45%. Hypothesis: a permissive verifier reduces the ā€œrichnessā€ of learning signals and removes pressure for precise internal computation and formatting.

Takeaways from the numbers:

  • Simplicity scales: Same fixed hyperparameters work on two separate backbones, transferring without tuning.
  • Stable beats clever: Monotonic progress across thousands of steps means the method is less likely to need rescue maneuvers.
  • Context matters: Tricks that help in one setting (penalties, dynamic schedules) can hurt in another by collapsing exploration or adding interference.

05Discussion & Limitations

Limitations (be specific):

  • Scope: Results focus on math reasoning at 1.5B scale; we don’t know if coding, QA, or larger/smaller models behave the same.
  • Attribution: We can’t cleanly separate which pieces (hyperparameters, dataset, verifier) are most critical.
  • Resources: Though simpler and cheaper than rivals, training still used 32ƗA800 GPUs for ~2 weeks—heavy for small labs.
  • Horizon: It’s unclear whether advantages hold at much longer training or when pushing for even higher ceilings.

Required resources:

  • Hardware: Multi-GPU clusters (e.g., 32ƗA800-80GB), efficient RLHF tooling (veRL/HybridFlow-like), and storage for long-context logs.
  • Data: DAPO-Math-17k or similar QA-style math problems; standardized evaluation scripts.

When NOT to use this:

  • If your domain has very noisy or ambiguous rewards (e.g., open-ended creative writing) where a simple rule-based verifier can’t provide reliable signals.
  • If compute is so limited that even 8 rollouts or long-context generations are unaffordable—alternative lightweight strategies might be necessary.
  • If you specifically need curriculum signals (e.g., pedagogy-oriented instruction) or must enforce strict length budgets.

Open questions:

  • Generalization: Will the simple recipe transfer to coding, multi-step tool use, or multimodal reasoning?
  • Scaling laws: How do entropy bands, rollout counts, and clip ranges interact at different model sizes?
  • Verifier design: Can we craft simple-yet-rich reward signals that stay fast but encourage deeper math precision?
  • Robustness: Under adversarial or distribution-shifted problems, does the no-penalty approach still avoid collapse?
  • Efficiency: What’s the minimal rollout N and batch size that still preserve monotonic gains?

Bottom line: The work suggests that some complex interventions may fix problems caused by other complex choices. Start with a steady, simple baseline, then add complexity only to solve a specific, observed failure.

06Conclusion & Future Work

Three-sentence summary:

  • JustRL shows that a minimal RL recipe—single-stage training, fixed hyperparameters, a simple rule-based reward, and clip-higher stabilization—scales 1.5B reasoning models smoothly.
  • It matches or beats more complex pipelines on nine math benchmarks while using about half the compute and avoiding length penalties or dynamic schedules.
  • Attempts to add common tricks (length penalties, more forgiving verifiers) actually harmed performance by collapsing exploration.

Main achievement:

  • Establishing that a clean, stable, reproducible baseline can reach state-of-the-art small-model math reasoning without multi-stage complexity—and that the exact same settings transfer across backbones.

Future directions:

  • Test transfer to coding, QA, and larger/smaller models; probe how rollouts, entropy, and clip ranges scale.
  • Explore simple-but-richer verifiers that stay fast yet reward precision.
  • Find the minimal compute setting (steps, batch, rollouts) that preserves monotonic improvement.

Why remember this:

  • It reframes the field’s default: before stacking tricks, first try a strong, simple baseline at adequate scale. If it trains stably and climbs steadily, you’ve saved time, compute, and complexity—and you’ll better understand which interventions truly help.

Practical Applications

  • Build a classroom math assistant using a 1.5B model trained with JustRL to explain solutions step by step.
  • Fine-tune small models for company-specific math or analytics tasks without designing complex curricula.
  • Use the JustRL recipe as a baseline to test new reward functions or datasets and measure real lift over a stable control.
  • Deploy long-context reasoning (up to 16K tokens) for technical reports that require step-by-step derivations.
  • Create lightweight math validators using simple boxed-answer rules to speed up RL training loops.
  • Run ablations safely: start from JustRL and add one change at a time to see if it genuinely helps.
  • Train model variants on limited budgets by keeping rollouts modest (e.g., 8) and skipping dynamic schedules.
  • Use the stable entropy band as a health check for other RL projects to detect exploration collapse early.
  • Port the fixed hyperparameters to similar small models in new domains as a strong first try before tuning.
  • Establish a reproducible benchmark pipeline (the nine math sets) for fair comparisons across teams.
#Reinforcement Learning Ā· #GRPO Ā· #Policy Entropy Ā· #Clip Higher Ā· #Rule-based Verifier Ā· #Single-Stage Training Ā· #Fixed Hyperparameters Ā· #Length Penalty Ā· #Exploration Collapse Ā· #Pass@1 Ā· #Mathematical Benchmarks Ā· #DAPO-Math-17k Ā· #Compute Efficiency Ā· #Small Language Models Ā· #Scaling Laws
Version: 1