
Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning

Beginner
Tong Wu, Yang Liu, Jun Bai et al. · 12/8/2025
arXiv · PDF

Key Summary

  • This paper teaches a language model to think along several paths at the same time instead of one step after another.
  • It does this without any stronger teacher model, using a careful three-stage training recipe that learns from its own good examples.
  • A new rule set for reinforcement learning, called PAPO, helps the model decide when and how to branch its thoughts in parallel.
  • A sturdy engine fixes memory and timing issues so parallel thinking runs correctly and fast on real hardware.
  • Across eight tough reasoning tests, the method delivers up to a 4.6× speedup while also getting more correct answers.
  • Unlike earlier methods that secretly fell back to regular one-by-one decoding, this system stays truly parallel 100% of the time.
  • The model’s own self-made training data beats teacher-made data by about 10 points on average in some comparisons.
  • The approach turns parallel thinking into a reliable, reusable skill for many kinds of reasoning problems.
  • It shows that parallel reasoning can be learned natively, not just imitated, opening a path toward faster and smarter AI.
  • The full pipeline works on modest 4B-parameter backbones, showing the idea is efficient and scalable.

Why This Research Matters

Many real problems have several good ways to solve them, and trying them one by one wastes time and risks getting stuck. Native parallel reasoning lets models explore multiple ideas at once, so they can be both faster and more reliable. Because this method is teacher-free, it lowers the cost and complexity of building strong reasoners and avoids copying a teacher’s limitations. The robust engine and PAPO make parallelism safe and consistent, which is crucial for real-world deployment. Faster, more accurate models mean better math help, stronger coding assistance, and more trustworthy tools. As tasks grow harder, the speedups grow too, making this approach increasingly valuable. This work turns “thinking wide” into a dependable skill that can be reused across many domains.

Detailed Explanation


01 Background & Problem Definition

🍞 Top Bread (Hook): You know how a team of friends can split a big project into parts—one writes, one edits, one designs—and finish way faster than one person doing everything in order? Our brains and teams often work in parallel.

🥬 Filling (The Actual Concept): Before this paper, most language models solved problems one token at a time in a strict left-to-right line, called autoregressive (AR) decoding. That is simple and reliable, but it makes the model commit early to one path and slows down tough reasoning where exploring several ideas at once would help. How it works: 1) The model reads the question. 2) It writes the next word based only on all prior words. 3) It repeats this until it reaches an answer. 4) If the early path is wrong, it has to backtrack or start over. Why it matters: Without the ability to branch, the model can get stuck on one idea and waste time. 🍞 Bottom Bread (Anchor): Imagine guessing a math answer by following only one plan. If that plan is off by a little, you might finish quickly but be wrong. If you could try three short plans at once, you could pick the best and be both faster and more accurate.
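
To make the one-token-at-a-time loop concrete, here is a minimal sketch of greedy autoregressive decoding. The `next_token_logits` function is a hypothetical stand-in for a real language model, so this shows only the control flow (commit to one token, then the next), not the paper's setup.

```python
import random

def next_token_logits(tokens):
    # Hypothetical stand-in for a real model: one score per vocabulary entry,
    # computed from all prior tokens.
    random.seed(len(tokens))
    return [random.random() for _ in range(100)]

def greedy_decode(prompt_tokens, max_new_tokens=20, eos_id=0):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = next_token_logits(tokens)                          # score candidates
        next_id = max(range(len(logits)), key=logits.__getitem__)   # commit to one token
        tokens.append(next_id)
        if next_id == eos_id:                                       # stop at end-of-sequence
            break
    return tokens

print(greedy_decode([42, 7, 13]))
```

Because every step depends on all earlier steps, there is only ever one path in flight, which is exactly the limitation the paper targets.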

🍞 Top Bread (Hook): Think of computer classes where you learned to run tasks on many processors at the same time. Parallel processing is like many cooks in a kitchen, each handling a dish. 🥬 The Concept: Parallel reasoning lets a model explore multiple independent reasoning steps at once when those steps don’t depend on each other. How it works: 1) Split a big problem into subproblems. 2) Work on each subproblem in its own branch simultaneously. 3) Merge the results at the end to decide the final answer. Why it matters: Without parallel reasoning, you pay a time cost for exploring options one by one and risk sticking to a single, possibly bad idea. 🍞 Anchor: When solving a geometry problem, one branch can try angle chasing, another can try similar triangles, and a third can test a known formula. Then the system compares results and picks the consistent one.

🍞 Top Bread (Hook): Imagine learning to skateboard without a coach—you try, fall, adjust, and slowly figure out good moves. 🥬 The Concept: Reinforcement learning (RL) teaches a model by giving rewards for good outcomes and penalties for bad ones, letting it improve through trial and error. How it works: 1) The model tries to answer. 2) A verifier checks correctness and format. 3) The model gets a reward or penalty. 4) It changes its policy to earn higher future rewards. Why it matters: Without RL, the model mostly imitates; with RL, it can discover better strategies. 🍞 Anchor: If the model answers a math question correctly and formats its steps properly, it gets a thumbs-up and moves its behavior slightly toward what worked.

🍞 Top Bread (Hook): You know how teachers sometimes give you your own past best work as study material because it already shows how you think? That’s self-study. 🥬 The Concept: Self-distillation means the model generates many attempts and then keeps only the correct, well-structured ones to teach itself later. How it works: 1) Sample multiple solutions from the model. 2) Filter by correctness and proper parallel structure. 3) Use those good samples as training data. Why it matters: Without self-distillation, you’d need expensive teacher models to create examples, limiting growth and originality. 🍞 Anchor: The model writes 8 tries to a problem, 2 are correct and cleanly structured; it saves those 2 as examples to learn from next time.
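
A minimal sketch of that filter, assuming each trace ends with a line like "Answer: <value>". The tag names follow the schema this article introduces next; the answer format and helpers are illustrative, not the paper's exact code.

```python
import re

REQUIRED_TAGS = ("guideline", "step", "takeaway")

def has_valid_tags(trace: str) -> bool:
    # Keep a trace only if every required tag is opened and closed.
    return all(f"<{t}>" in trace and f"</{t}>" in trace for t in REQUIRED_TAGS)

def extract_answer(trace: str):
    match = re.search(r"Answer:\s*([^\s<]+)", trace)
    return match.group(1) if match else None

def self_distill(samples, ground_truth):
    """samples: model-generated traces for one question; keep correct, well-formed ones."""
    return [s for s in samples if has_valid_tags(s) and extract_answer(s) == ground_truth]

good = self_distill(
    ["<guideline>plans</guideline><step>work</step><takeaway>Answer: 42</takeaway>",
     "<step>unfinished branch"],
    ground_truth="42",
)
print(len(good))   # -> 1: only the correct, well-tagged trace survives
```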

🍞 Top Bread (Hook): Think of Map–Process–Reduce like a class project: plan tasks, do them in groups, then combine results. 🥬 The Concept: A simple, tag-based schema (<guideline>, <plan>, <step>, <takeaway>) structures how to branch, think in parallel, and then merge. How it works: 1) <guideline> lists brief plans. 2) Each <step> executes one plan independently. 3) <takeaway> compares and fuses the results. Why it matters: Without a clear schema, branches get tangled and hard to verify or train on. 🍞 Anchor: For a word problem, plans might be “try algebra,” “test small numbers,” and “draw a diagram,” each run in its own step, then the takeaway picks the consistent solution.
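
To make the schema concrete, here is a made-up toy trace using the tag names quoted above; the exact nesting and wording in the paper's outputs may differ.

```xml
<guideline>
  <plan>1. Solve with algebra.</plan>
  <plan>2. Test small numbers.</plan>
</guideline>
<step>Algebra: 2x + 3 = 15, so x = 6.</step>
<step>Testing 1-10: only x = 6 satisfies 2x + 3 = 15.</step>
<takeaway>Both branches agree: x = 6.</takeaway>
```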

🍞 Top Bread (Hook): Imagine a highway with lanes that keep cars from crashing into each other. 🥬 The Concept: A parallel attention mask and special position IDs act like traffic rules so branches can run side by side without interfering. How it works: 1) Detect branch spans by tags. 2) Mask attention across branches to prevent cross-talk. 3) Align positions so parallel steps share a synchronized timeline. Why it matters: Without these, parallel branches would bleed into each other and break correctness. 🍞 Anchor: While step 1 explores “factorization,” it can’t peek at tokens in step 2 “mod equations,” so each idea stays clean until the merge.
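
A minimal sketch of these "traffic rules", assuming we already know the token spans of each <step>. The paper's exact mask and position-ID construction may differ; this only illustrates the idea of blocking cross-branch attention and giving branches a shared timeline.

```python
import torch

def parallel_mask(seq_len, branch_spans):
    """branch_spans: list of (start, end) token-index pairs, one per <step> branch."""
    idx = torch.arange(seq_len)
    mask = idx[None, :] <= idx[:, None]                  # ordinary causal mask: attend to the past
    for a_start, a_end in branch_spans:
        for b_start, b_end in branch_spans:
            if (a_start, a_end) != (b_start, b_end):
                mask[a_start:a_end, b_start:b_end] = False   # no cross-branch attention
    return mask                                          # True = attention allowed

def branch_position_ids(seq_len, prefix_end, branch_spans):
    # Branches share a synchronized timeline: each restarts counting after the prefix.
    pos = torch.arange(seq_len)
    for start, end in branch_spans:
        pos[start:end] = prefix_end + torch.arange(end - start)
    return pos

# Toy layout: tokens 0-3 are the shared prompt, 4-7 are <step> 1, 8-11 are <step> 2.
print(parallel_mask(12, [(4, 8), (8, 12)])[9, 5].item())   # False: step 2 cannot see step 1
print(branch_position_ids(12, 4, [(4, 8), (8, 12)]))        # both branches count 4..7
```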

🍞 Top Bread (Hook): A referee who knows the game rules must watch special signals closely. 🥬 The Concept: DAPO is a reinforcement learning setup that updates the model using groups of sampled answers, balancing exploration and stability. How it works: 1) Generate a group of attempts. 2) Score them. 3) Update token probabilities with clipped ratios and normalized advantages. Why it matters: Without careful updates, training becomes unstable or collapses to one pattern. 🍞 Anchor: The model writes several candidate solutions, then slightly boosts tokens in winning ones and softens tokens in weaker ones.
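
A simplified sketch of a group-based, clipped update in the spirit described above (sample a group, score it, normalize advantages, clip the ratios). The real DAPO objective includes further refinements not shown here, so treat this as an illustration only.

```python
import torch

def group_update_loss(logp_new, logp_old, rewards, clip_eps=0.2):
    """One group of sampled answers to the same question.
    logp_new / logp_old: per-token log-prob tensors under the new vs. sampling policy.
    rewards: one scalar per answer."""
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)   # group-normalized advantages
    losses = []
    for lp_new, lp_old, a in zip(logp_new, logp_old, adv):
        ratio = torch.exp(lp_new - lp_old)                      # per-token probability ratio
        clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
        losses.append(-torch.min(ratio * a, clipped * a).mean())
    return torch.stack(losses).mean()

# Toy usage: three attempts, two rewarded (+1) and one penalized (-1).
new = [torch.randn(5, requires_grad=True) for _ in range(3)]
old = [lp.detach() + 0.01 * torch.randn(5) for lp in new]
group_update_loss(new, old, rewards=[1.0, 1.0, -1.0]).backward()
```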

🍞 Top Bread (Hook): Now imagine a traffic controller who is aware of many intersections at once and updates the rules accordingly. 🥬 The Concept: PAPO (Parallel-Aware Policy Optimization) is a new RL objective that preserves gradients on special branching tokens and respects true parallel semantics. How it works: 1) Generate strictly structured parallel rollouts. 2) Filter malformed ones. 3) Normalize advantages at batch level. 4) Keep gradients on special tags; use on-policy updates for stability. Why it matters: Without PAPO, the model forgets how to branch or drifts into fake parallel behavior. 🍞 Anchor: PAPO ensures the tokens that open and close branches remain learnable, so the model reliably plans, explores, and merges.
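
The sketch below illustrates two PAPO ingredients named here under simplifying assumptions (it is not the paper's exact objective): advantages normalized over the whole filtered batch, and a loss that keeps, rather than drops, the special branching tokens.

```python
import torch

def batch_advantages(rewards, keep_mask):
    """Normalize advantages over the whole filtered batch, not per group."""
    r = torch.as_tensor(rewards, dtype=torch.float32)[torch.as_tensor(keep_mask)]
    return (r - r.mean()) / (r.std() + 1e-6)

def papo_token_loss(logp_new, advantage, special_token_mask, drop_special_tokens=False):
    per_token = -advantage * logp_new          # simple on-policy term: no importance ratio
    if drop_special_tokens:
        # What PAPO avoids: zeroing the loss on <step>/<takeaway> tags would let the
        # model slowly forget how to open and close branches.
        per_token = per_token * (~special_token_mask)
    return per_token.mean()

# Hypothetical usage: four rollouts, one malformed and filtered out.
adv = batch_advantages([1.0, -1.0, 1.0, -1.0], keep_mask=[True, True, True, False])
logp = torch.randn(6, requires_grad=True)
tags = torch.tensor([True, False, False, False, False, True])   # branch open/close tokens
papo_token_loss(logp, adv[0], tags).backward()
```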

🍞 Top Bread (Hook): A well-run kitchen needs the right appliances, or even great chefs trip over each other. 🥬 The Concept: The NPR Engine is the upgraded runtime that fixes memory leaks, token budgeting, and illegal branch states so parallel RL runs fast and safely. How it works: 1) Careful KV-cache reclamation avoids corruption. 2) Branch-aware token accounting enforces length limits. 3) Pre-branch validation prevents bad states. 4) Gentle repetition control keeps steps clear. Why it matters: Without a solid engine, parallel training crashes or produces junk. 🍞 Anchor: With the engine’s safeguards, eight branches can sprint in parallel on the GPU without stepping on each other’s memory.
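
Branch-aware token accounting is the easiest of these safeguards to sketch; the class below is illustrative, with made-up names and limits rather than the engine's actual internals.

```python
class BranchBudget:
    """Track tokens per branch so neither a branch nor the whole rollout overruns."""

    def __init__(self, max_total_tokens, max_branch_tokens):
        self.max_total = max_total_tokens
        self.max_branch = max_branch_tokens
        self.used = {}                                   # branch id -> tokens emitted so far

    def can_emit(self, branch_id, n=1):
        total = sum(self.used.values()) + n
        branch = self.used.get(branch_id, 0) + n
        return total <= self.max_total and branch <= self.max_branch

    def emit(self, branch_id, n=1):
        if not self.can_emit(branch_id, n):
            raise RuntimeError(f"branch {branch_id} would exceed its token budget")
        self.used[branch_id] = self.used.get(branch_id, 0) + n

budget = BranchBudget(max_total_tokens=4096, max_branch_tokens=1024)
budget.emit("step-1", 3)   # the engine would check this before every decode step
```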

Putting it all together, the world before this work leaned on slow, single-lane reasoning or teacher-made parallel demos. That failed to give models native, dependable parallel skills. The gap was a teacher-free, end-to-end system that could discover and strengthen real parallelism. This paper fills that gap with self-distillation, PAPO, and a robust engine—so the model can split, explore, and merge ideas faster and more accurately, like a well-coached team.

02 Core Idea

🍞 Top Bread (Hook): Imagine solving a mystery with several mini-detectives checking different clues at the same time, then meeting to agree on the truth.

🥬 The Concept (One-sentence Aha!): Teach a language model to natively split a problem into parallel branches, explore them simultaneously, and merge the best ideas—without any teacher model—using a three-stage, self-distilled reinforcement learning pipeline plus a purpose-built engine.

Multiple analogies:

  1. Kitchen analogy: One head chef (<guideline>) assigns dishes (<plan>), each cook (<step>) works independently, and the head chef tastes everything (<takeaway>) to serve the best plate.
  2. School project analogy: The leader writes a plan, teammates research different parts in parallel, and the group summary combines the strongest facts.
  3. Sports strategy analogy: Coaches draw multiple plays, runners try them simultaneously in practice, and the team keeps the play that scores most reliably.

Before vs After:

  • Before: Models mostly walked a single path; when they tried “parallel,” they often simulated it or fell back to sequential decoding. Accuracy and speed suffered on hard problems.
  • After: The model learns branching as a native skill: it plans multiple routes, runs them in true parallel (100% of the time in tests), and speeds up wall-clock time by up to 4.6× while improving correctness.

Why it works (intuition, no equations):

  • Structure makes parallel safe: tags and masks create clean, non-overlapping lanes so ideas don’t mix until the merge.
  • Self-distillation finds real, data-efficient patterns: keeping only correct, well-structured traces gives a compact, high-signal training set from the model’s own distribution.
  • PAPO protects the “branch tokens”: by keeping gradients on special tags and using on-policy updates, the model doesn’t lose its ability to open, run, and close branches.
  • Engine stability = reliable learning: fixing memory and accounting issues stops hidden crashes and weird behavior that would otherwise confuse RL.

Building blocks (bite-sized): 🍞 Top Bread (Hook): You know how a to-do list helps you split chores into clear tasks. 🥬 NPR Schema: <guideline> to list plans, <step> for each branch’s reasoning, and <takeaway> to compare results and synthesize. How it works: plan → parallel steps → merge. Why it matters: Without a schema, branches tangle and can’t be verified. 🍞 Anchor: Three plans to solve a puzzle run at once; the takeaway picks the consistent answer.

🍞 Top Bread (Hook): Trying bike tricks and keeping the ones that work is how you improve fast. 🥬 Stage 1 (NPR-ZERO): Use DAPO-style RL with a reward that checks both format and answer to discover valid parallel-shaped outputs without a teacher. How it works: sample, score, update, repeat. Why it matters: Without this warmup, the model won’t learn the tags or structure. 🍞 Anchor: After many tries, the model starts producing well-tagged, sometimes-correct parallel traces.

🍞 Top Bread (Hook): Choose your best practice essays as study guides. 🥬 Stage 2 (NPR-BETA): Rejection-sample only correct, well-formatted traces and do supervised fine-tuning with parallel masks and position IDs so branches run truly in parallel. How it works: filter → SFT with parallel mechanics → stable parallel skills. Why it matters: Without this, the model stays in “pretend parallel,” still relying on sequential visibility. 🍞 Anchor: The model now emits clean branches that the engine executes in parallel.

🍞 Top Bread (Hook): A coach not only praises goals but also keeps the playbook’s signals crisp. 🥬 Stage 3 (Native-Parallel RL with PAPO): Run parallel rollouts with the NPR Engine, filter malformed samples, normalize advantages at batch level, preserve gradients on special tokens, and use on-policy objectives. How it works: stable, structure-aware updates directly inside the parallel graph. Why it matters: Without PAPO, branching weakens or collapses to AR. 🍞 Anchor: PAPO steadily improves how often and how well the model branches and merges to reach correct answers.

🍞 Top Bread (Hook): Tools matter; dull knives slow down great chefs. 🥬 NPR Engine: Memory-safe KV-cache reuse, branch-aware token budgeting, pre-branch validation, and selective repetition penalties. How it works: protect caches, count tokens per branch, reject illegal forms, keep steps concise. Why it matters: Without the engine, training is fragile and unreliable. 🍞 Anchor: Eight branches run together without memory crashes or runaway lengths.

The core idea turns parallelism from a brittle script into a native habit: the model learns when to branch, how wide to branch, and how to reconcile branches—all while getting faster and more accurate.

03 Methodology

At a high level: Input question → Stage 1 (Format-Follow RL to find structure) → Stage 2 (Parallel SFT warmup on self-distilled data) → Stage 3 (Native-Parallel RL with PAPO and NPR Engine) → Parallel answers and final merged solution.

Stage 1: Format-Follow RL (NPR-ZERO)

  • What happens: The model starts from a seed (e.g., Qwen3-4B-Instruct). It generates groups of answers. A verifier gives two signals: structure and correctness. If the output follows the tag schema (<guideline>/<plan>/<step>/<takeaway>), it passes the format check; then correctness gives +1 or -1. If format fails, a penalty applies. The RL objective (DAPO-style) nudges the model toward producing structurally valid, sometimes-correct parallel outputs. A minimal sketch of this reward appears after this list.
  • Why this step exists: Without learning the format, later parallel execution can’t be enforced, and branches will be messy. Also, we avoid any stronger teacher so the model can discover its own native parallel habits.
  • Example with data: On a math item from the 8k ORZ subset, the model samples 8 candidate solutions. Two are well-tagged and correct, three are well-tagged but wrong, and three are malformed. Rewards push the next training step to produce more well-tagged, correct traces.
  • What breaks without it: Skipping this stage often yields low compliance with tags, causing unstable or non-parallel behavior later.
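
A minimal sketch of the Stage 1 reward described in the first bullet above; the tag list and penalty value are illustrative rather than the paper's exact constants.

```python
REQUIRED_TAGS = ("guideline", "plan", "step", "takeaway")

def format_ok(trace: str) -> bool:
    # Structure signal: every required tag must be opened and closed.
    return all(f"<{t}>" in trace and f"</{t}>" in trace for t in REQUIRED_TAGS)

def stage1_reward(trace: str, answer_is_correct: bool, format_penalty: float = -1.0) -> float:
    if not format_ok(trace):
        return format_penalty                     # malformed structure is penalized outright
    return 1.0 if answer_is_correct else -1.0     # otherwise reward answer correctness
```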

Stage 2: Rejection Sampling and Parallel Warmup (NPR-BETA)

  • What happens: We gather many samples from NPR-ZERO. We keep only those that pass both correctness and format checks (rejection sampling). We then run supervised fine-tuning with a parallel attention mask and special position IDs so each <step> attends within itself and shares context safely, enabling genuine parallel execution.
  • Why this step exists: SFT on self-distilled, high-quality data stabilizes the learned structure and transitions the model from “simulated parallel” to real parallel computation.
  • Example with data: For each question, we accept only samples where the final numeric answer matches ground truth and the tags are correct. These accepted pairs train the model with NLL loss while the engine enforces parallel masking.
  • What breaks without it: If you do RL straight into parallel execution, instability increases; if you do SFT without rejection sampling, noisy or malformed data confuses structure learning.

Stage 3: Native-Parallel RL with PAPO

  • What happens: Starting from NPR-BETA, we now improve reasoning quality using PAPO and the NPR Engine. We generate strictly structured parallel rollouts using engine rules, filter any rare malformed outputs, then compute rewards based solely on correctness. We normalize advantages at the batch level, keep gradients on special tags, and update with an on-policy objective to avoid unstable importance weights.
  • Why this step exists: SFT locks in structure but can be narrow; RL expands and adapts the policy to perform better on unseen tasks and to optimize branching decisions directly.
  • Example with data: On AIME-grade problems, the model decides to use 2–5 branches depending on difficulty, tries varied strategies (algebraic manipulation, numerical probing, geometric relations), and the final takeaway picks the consistent solution. Correct runs push the policy to reuse good branching patterns.
  • What breaks without it: Without PAPO’s protected gradients on special tokens, the model gradually forgets how to branch; without on-policy updates, ratios destabilize training; without batch-level normalization, advantage signals collapse after filtering.

NPR Engine (the secret sauce for stability and speed)

  • What happens: The engine redesigns core execution: deterministic KV-cache reclamation to prevent double-free, branch-aware token budgeting to prevent overruns, a pre-branch validator to avoid undefined states, and selective repetition penalties inside <step> to keep reasoning crisp.
  • Why this step exists: Parallel RL stresses memory and control flow in ways typical engines don’t handle. A robust engine is required for large-scale, reliable training and evaluation.
  • Example with data: During an 8-way parallel rollout, the engine tracks tokens per branch so overall max_new_tokens isn’t exceeded, and it avoids cache corruption when branches fork and join.
  • What breaks without it: Memory leaks, corrupted context, runaway sequences, and hidden AR fallback can all appear, derailing learning.

Concrete walkthrough with a toy problem

  • Input: “How many positive two-digit integers are factors of both 100 and 150?” (a short runnable check of this example appears after the list)
  • Stage 1: The model learns to output a <guideline> with plans like “list factors of 100,” “list factors of 150,” “take intersection and filter two-digit.” It begins to format <step> blocks but accuracy may be mixed.
  • Stage 2: Using only correct, well-tagged traces, SFT with parallel masking lets “list factors of 100” and “list factors of 150” run in parallel, then the takeaway intersects and counts two-digit numbers.
  • Stage 3: PAPO refines how many branches to use (e.g., add a cross-check step), when to stop a branch, and how to synthesize reliably, improving both speed and accuracy.
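
For readers who want to verify the toy problem, here is a short runnable version that mirrors the branch-then-merge flow (plain Python, not the model's actual output).

```python
def factors(n):
    return {d for d in range(1, n + 1) if n % d == 0}

branch_1 = factors(100)                              # <step> 1: factors of 100
branch_2 = factors(150)                              # <step> 2: factors of 150
takeaway = sorted(d for d in branch_1 & branch_2 if 10 <= d <= 99)
print(takeaway, len(takeaway))                       # [10, 25, 50] 3
```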

Summary flow (recipe style)

  • Input → Plan (<guideline>/<plan>) → Parallel Steps (<step>×N with mask and positions) → Merge (<takeaway>) → Final answer
  • Train: Stage 1 RL for format discovery → Stage 2 SFT for stable parallel execution → Stage 3 RL (PAPO) for adaptive, accurate parallel reasoning
  • Engine: Safe memory, correct token accounting, structural validation, clean step text

The secret sauce

  • Self-distillation removes dependence on teachers and matches the model’s native distribution.
  • PAPO keeps the branching machinery learnable and stable.
  • The engine guarantees that what we train is what we run, making parallelism genuine, fast, and reliable.

04 Experiments & Results

🍞 Top Bread (Hook): When you try out for a team, you’re judged on both speed and accuracy, not just one.

🥬 The Test: The researchers measured two big things across eight reasoning benchmarks: (1) accuracy (avg@k and best@k, especially avg@8 on small sets) and (2) speed (tokens per second and speedup vs. a sequential baseline). They also checked a “parallel trigger rate” to see if the model really runs in parallel, not secretly switching back to one-by-one decoding.
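
For reference, one common reading of these two accuracy metrics, sketched with hypothetical inputs; the paper's exact evaluation scripts may differ.

```python
def avg_at_k(results):
    # Mean accuracy over all k samples of every problem.
    return sum(sum(r) / len(r) for r in results.values()) / len(results)

def best_at_k(results):
    # Fraction of problems where at least one of the k samples is correct.
    return sum(any(r) for r in results.values()) / len(results)

# `results` maps a problem id to its k pass/fail outcomes (made-up data).
runs = {"p1": [True, False, True], "p2": [False, False, False]}
print(avg_at_k(runs), best_at_k(runs))   # ~0.333 and 0.5
```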

The Competition: They compared against strong sequential models (Qwen2.5-32B-Instruct, Qwen3-4B-Instruct-2507, Qwen3-4B non-thinking) and recent parallel systems (Multiverse-32B, a reproduced Multiverse-4B). They also compared to their own sequential versions (SR-BETA and SR) to isolate the effect of parallel methods.

The Scoreboard (with context):

  • Accuracy gains: On AIME-style and other math/logic sets, NPR consistently improved over both Multiverse and sequential baselines. For example, with Qwen3-4B-Instruct as the base, NPR achieved 50.4% on AIME25 and 63.3% on AIME24, outperforming Multiverse-4B and even Multiverse-32B trained on teacher-distilled data. On ZebraLogic and MATH500, parallel SFT from self-distilled data boosted performance by large margins (e.g., +15.9 points on ZebraLogic vs. the reproduced Multiverse-4B SFT). Overall averages rose notably, with up to about 24.5% improvement on the non-thinking backbone across tasks.
  • Speedups: NPR delivered up to 4.6× wall-clock speedup over autoregressive decoding, and 1.3×–2.4× faster than Multiverse in tokens-per-second across selected benchmarks. The harder the task, the bigger the speed win (e.g., AIME25: 4.6×; HMMT25: 4.1×; AMC23: 2.9×). That’s like running a race not just faster, but even more so on the toughest courses.
  • Parallel trigger rate: NPR showed 100% genuine parallel execution across all eight benchmarks, while Multiverse’s rate varied by dataset (e.g., lower on logic-heavy ZebraLogic). This means NPR consistently uses its parallel skills instead of falling back to AR.

Surprising findings:

  • Self-distilled data beats teacher-distilled: The self-made parallel dataset produced by NPR-ZERO and filtered in Stage 2 outperformed teacher-generated trajectories from prior work by about 10 points on average in some head-to-head comparisons. This suggests learning from your own successful patterns can be cleaner and more tailored than copying a stronger teacher’s sequential habits.
  • Scalability at test time: best@8 improved notably after NPR, especially when the base model started weaker. This shows that native parallel branching increases the chance that at least one branch nails the solution.
  • Stability matters: The NPR Engine fixes were not cosmetic; without them, parallel RL suffered from memory issues and control-flow glitches, undermining learning. With the fixes, training and inference were both faster and more dependable.

🍞 Bottom Bread (Anchor): Imagine a quiz where you can try up to eight approaches at once and keep the best. NPR not only tries those eight faster, it also makes more of them correct—especially on the trickiest questions—because its branches are clean, independent, and reliably merged.

05 Discussion & Limitations

Limitations:

  • Model choice sensitivity: The method worked best on Qwen3-4B-Instruct and Qwen3-4B (non-thinking). Heavily RL-trained “thinking-mode” models were hard to reformat; special tokens fragmented during SFT. Porting to very different backbones may require extra tuning.
  • Verifiable domains: The reward is mainly answer correctness (e.g., math), which is easy to check. Tasks without clear verifiers (e.g., open-ended creativity) may see weaker RL signals, making gains smaller.
  • Engine dependency: Genuine parallel RL depends on the NPR Engine’s stability. Without robust memory and length accounting, training can falter. This is an engineering investment some teams must plan for.
  • Data filtering trade-offs: Rejection sampling discards malformed or wrong traces. While this cleans data, it might remove rare but potentially useful reasoning styles, slightly narrowing diversity.
  • Long-sequence budgets: Parallel branches can increase total token use. Although the engine tracks budgets, teams must still set reasonable max lengths and branching degrees.

Required resources:

  • A capable 4B-class backbone with good instruction following.
  • GPUs with enough memory for parallel branches and KV-cache reuse.
  • The NPR Engine integrated with a structured inference stack.
  • A modest, verifiable dataset (e.g., an 8k subset of ORZ) to bootstrap the pipeline.

When NOT to use:

  • Tiny devices or strict latency ceilings where any branching is impossible.
  • Non-verifiable or highly subjective tasks (pure storytelling) where correctness rewards aren’t clear.
  • Models whose internal “thinking” tokens are tightly fixed by prior RL; reformatting may be brittle.

Open questions:

  • How to extend verifiable rewards to broader domains (program synthesis with tests is one step; what about science reasoning with simulators)?
  • Can PAPO be combined with uncertainty estimates to auto-tune branch widths per problem difficulty more precisely?
  • How to co-train the schema itself—learning when to introduce sub-blocks or multi-stage verification dynamically?
  • Can we share partial branch results safely to reduce compute without breaking independence?
  • How will scaling to larger backbones and multimodal inputs change the optimal branching and merging strategies?

06 Conclusion & Future Work

Three-sentence summary: This paper builds a teacher-free way for language models to think in parallel: plan multiple approaches, explore them at the same time, and merge their results. It uses a three-stage recipe—self-distilled format discovery, parallel warmup, and PAPO-based RL—plus a robust engine that makes parallel execution safe and fast. The result is higher accuracy, up to 4.6× speedups, and 100% genuine parallelism across diverse benchmarks.

Main achievement: Turning parallelism from a scripted imitation into a native, reliable habit using self-distillation and a parallel-aware RL objective that protects the branching machinery.

Future directions: Broaden verifiable rewards to more domains, scale to larger and multimodal models, refine automatic branch-width control, and explore partial result sharing without leakage. Investigate how the schema can evolve on its own and how PAPO interacts with other policy-optimization tricks.

Why remember this: It shows that models can learn to truly “think wide,” not just “think deep,” and that doing so can make them both faster and more accurate. This work offers a clean, scalable path to agentic reasoning that uses parallel exploration as a first-class skill rather than an afterthought.

Practical Applications

  • Math tutoring systems that try multiple solution paths in parallel and explain the most consistent one to students.
  • Coding assistants that attempt several algorithmic strategies at once and present the fastest correct implementation.
  • Data analysis tools that run parallel hypotheses on datasets and converge on the strongest explanation.
  • Scientific reasoning assistants that explore alternative models or parameter settings concurrently and summarize validated outcomes.
  • Logic puzzle solvers that branch on key assumptions, then cross-check results to avoid hidden contradictions.
  • Operations research planners that evaluate multiple schedules or routes in parallel and pick the most cost-effective plan.
  • Customer support bots that form independent interpretations of a query and choose the best-matching answer.
  • Financial modeling tools that test parallel what-if scenarios and synthesize robust recommendations.
  • Education platforms that show different solution methods side by side, helping learners compare and understand.
  • Automated grading or verification systems that concurrently test multiple solution checks to ensure fairness and accuracy.
Tags: parallel reasoning · reinforcement learning for LLMs · self-distillation · PAPO · DAPO · native parallel cognition · multiverse attention · parallel attention mask · KV-cache reuse · teacher-free training · structured reasoning · avg@k · inference speedup · NPR Engine · best@k