
PaCoRe: Learning to Scale Test-Time Compute with Parallel Coordinated Reasoning

Intermediate
Jingcheng Hu, Yinmin Zhang, Shijie Shang et al. · 1/9/2026
arXiv · PDF

Key Summary

  • PaCoRe is a way for AI to think in many parallel paths and then coordinate them, so it can use a lot more brainpower at test time without running out of context window space.
  • It works in rounds: explore in parallel, shrink each path into a short message, then combine those messages to guide the next round.
  • Reinforcement-learning training teaches the model to synthesize (not just vote) across conflicting messages to reach a better final answer.
  • Because only short messages are kept, PaCoRe can effectively use millions of tokens of compute while staying within a normal context window.
  • On hard math, an 8B PaCoRe model hit 94.5% on HMMT 2025, beating GPT-5’s 93.2%, by scaling effective test-time compute to roughly two million tokens.
  • PaCoRe outperforms simple self-consistency (majority voting), which quickly saturates and can’t keep improving with more samples.
  • Ablations show both parallel exploration and message passing are essential; without compaction, performance hits the context limit.
  • The training data and method also generalize: PaCoRe improves coding and multi-turn tasks and even boosts other RL training when its data are used.

Why This Research Matters

PaCoRe shows how to grow a model’s problem-solving power at answer time without needing a bigger context window. That means stronger help on tasks like competitive math, debugging tricky code, or making careful plans. It makes extra compute pay off by coordinating many attempts and keeping only the most useful notes. This approach can transfer to multi-turn conversations and software engineering, improving reliability when details matter. Open-sourcing the model, data, and pipeline invites others to build on it, accelerating progress in practical reasoning AI.

Detailed Explanation


01Background & Problem Definition

🍞 Top Bread (Hook) Imagine you’re solving a big puzzle with your friends. If only one friend can talk at a time and you must write every single step on a tiny notepad, you’ll run out of space before you finish. But if lots of friends work in parallel on different parts, and each person reports just the key clue, you can fit more total thinking into the same tiny notepad.

🥬 Filling (The Actual Concept)

  • What it is: PaCoRe is a new way for AI to think that shifts from one long, crowded chain of thought to many parallel attempts that share short, helpful messages between rounds.
  • How it works (step by step):
    1) Start with a hard problem. 2) Launch many parallel reasoning paths. 3) Compress each path into a short message (like a punchline). 4) Feed all these short messages back to the model. 5) Use them to guide the next round of parallel exploration. 6) Repeat for a few rounds until a final answer emerges.
  • Why it matters: Standard models squeeze all steps into a single growing chain inside a fixed context window; once it fills up, thinking must stop. PaCoRe decouples how much total thinking is done from how much space the window has by storing only compact messages between rounds.

🍞 Bottom Bread (Anchor) Think of a group science project. Instead of having one student write every idea in a huge essay, each teammate explores a clue, then writes a one-sentence takeaway. The group reads these takeaways, plans the next steps, and repeats. The final report is better and still fits on a page.
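The round-based loop described above can be sketched in a few lines of Python. This is a toy illustration, not the paper's implementation: `generate` is a hypothetical stand-in for one model call, and the widths per round are arbitrary.

```python
import random

def generate(problem, messages, seed):
    """Hypothetical stand-in for one model call: returns a long
    reasoning trajectory that ends in a one-line conclusion."""
    random.seed(seed)
    guess = random.choice(["n=7", "n=-2", "n=7"])
    return f"...long reasoning about {problem}...\nAnswer: {guess}"

def compact(trajectory):
    """Message compaction: keep only the final conclusion line."""
    return trajectory.strip().splitlines()[-1]

def pacore(problem, widths=(8, 4, 1)):
    """Parallel Coordinated Reasoning loop.
    widths[r] = number of parallel trajectories in round r;
    the final round uses width 1 to emit a single answer."""
    messages = []
    for r, k in enumerate(widths):
        # parallel exploration: k independent attempts on the same input
        trajectories = [generate(problem, messages, seed=100 * r + i)
                        for i in range(k)]
        # carry only short messages into the next round
        messages = [compact(t) for t in trajectories]
    return messages[0]  # the final compact message is the answer
```

Note how the context only ever holds the problem plus the short messages, no matter how many trajectories were explored in total.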

— New Concepts Introduced Here —

  1. Test-time Compute (TTC) 🍞 Top Bread (Hook) You know how you can spend more time checking your math on a test to be surer of your answer? 🥬 Filling
  • What it is: TTC is how much thinking a model does on a single problem after it’s trained.
  • How it works: 1) The model can run more steps, 2) try multiple attempts, and 3) cross-check them before answering.
  • Why it matters: More TTC often means better answers on hard problems; without it, the model may stop early and miss the solution. 🍞 Bottom Bread (Anchor) Like taking extra scratch-paper attempts on a tricky math question before you circle your final answer.
  2. Context Window 🍞 Top Bread (Hook) Imagine writing notes on a whiteboard that only has so much space. 🥬 Filling
  • What it is: The context window is the text space the model can see at once.
  • How it works: 1) Put the question and clues on the board, 2) the model reads them to think, 3) but it can’t see anything beyond the board.
  • Why it matters: If every thought step must stay on the board, you run out of space and must stop thinking. 🍞 Bottom Bread (Anchor) If your whiteboard fits 100 lines, you can’t list 1,000 steps there, even if you have time to think more.
  3. Message-Passing Architecture 🍞 Top Bread (Hook) Think of passing short sticky notes between teammates during a project. 🥬 Filling
  • What it is: A way for different reasoning paths to share short summaries with each other between rounds.
  • How it works: 1) Each path creates a brief message, 2) messages are gathered, 3) the next round reads these to plan better searches.
  • Why it matters: Without message passing, the team can’t coordinate and keeps repeating work. 🍞 Bottom Bread (Anchor) Everyone explores a clue, then passes a one-line finding. The next meeting starts smarter.
  4. Parallel Exploration (Parallel Reasoning Trajectories) 🍞 Top Bread (Hook) It’s like having many detectives chase different leads at the same time. 🥬 Filling
  • What it is: Running multiple independent solution paths in parallel for the same problem.
  • How it works: 1) Spawn many tries, 2) let each pursue a hypothesis, 3) later compare what they found.
  • Why it matters: A single path can get stuck; many paths raise the chance someone finds the key. 🍞 Bottom Bread (Anchor) Multiple classmates solve the same riddle differently; one of them spots the trick.
  5. Message Compaction 🍞 Top Bread (Hook) Imagine squashing a long story into a headline that keeps the key fact. 🥬 Filling
  • What it is: Turning each long reasoning path into a short final message that fits in the context.
  • How it works: 1) Read the path, 2) extract the conclusion, 3) discard the bulky steps, 4) keep it short and clear.
  • Why it matters: Without compaction, messages become too long and overflow the context window. 🍞 Bottom Bread (Anchor) Instead of pasting full essays into a group doc, you paste just the final bullet points.
  6. Reasoning Synthesis 🍞 Top Bread (Hook) Like a judge listening to many witnesses and then deciding what truly happened. 🥬 Filling
  • What it is: The skill of combining many messages—some conflicting—into a single, better answer.
  • How it works: 1) Check where messages agree, 2) spot contradictions, 3) weigh evidence, 4) build a unified plan or conclusion.
  • Why it matters: If you just vote or ignore the notes, you waste the group’s work and stay average. 🍞 Bottom Bread (Anchor) From five different tips on a math problem, you stitch together the steps that actually solve it.
  7. Reinforcement Learning (Outcome-Based RL) 🍞 Top Bread (Hook) When you practice free throws and only count points when the ball goes in, you learn what works. 🥬 Filling
  • What it is: A way for the model to learn by trying, getting a reward if the final answer is correct, and adjusting to do better next time.
  • How it works: 1) Model reads the problem plus messages, 2) proposes a solution path, 3) receives a reward if the extracted final answer is right, 4) updates itself to favor helpful synthesis.
  • Why it matters: Without RL, models often ignore context and try to solve from scratch (reasoning solipsism), wasting the team’s notes. 🍞 Bottom Bread (Anchor) Like a dog learning tricks from treats, the model learns to use the notes because that wins rewards.

The World Before PaCoRe

Before this work, language models improved with chain-of-thought and sometimes with self-consistency (sample many answers and vote). These methods still tried to pack all thinking into one growing text inside a fixed context window. On tricky tasks like competition math or coding, the window became the ceiling: once full, thinking stopped. Voting could help only when a short, easy-to-compare answer existed; for messy or multi-step problems, models often ignored shared context and started over.

The Gap

We needed a general way to: (1) unleash massive parallel exploration, (2) coordinate those efforts compactly, and (3) actually synthesize better answers from the shared notes. PaCoRe fills this gap by interleaving broad parallel search with tight message compaction and RL-trained synthesis.

Real Stakes

This matters in everyday ways: better math help, stronger code debugging, more reliable planning, and clearer multi-turn conversations. More importantly, it shows how to grow a model’s effective brainpower at answer time, without needing a bigger context window or a bigger model.

02Core Idea

Aha! Moment in One Sentence

If we keep only short, critical messages between rounds and train the model to coordinate them, we can scale how much total thinking it does at test time far beyond the context window.

Three Analogies (Different Lenses)

  • Study Group: Many classmates solve a problem in their own notebooks. Each posts a sticky-note takeaway to a shared board. The group reads the stickies, adjusts, and repeats until they ace the question.
  • Detective Squad: Investigators chase different leads in parallel. After each sweep, they file brief reports. The chief compares them, resolves conflicts, and directs the next sweep.
  • Cooking Showdown: Multiple chefs try variations of a recipe at once. Each leaves a tasting note. The head chef blends the best ideas into a winning final dish.

Before vs After

  • Before: One long, linear chain of thought stuffed into a fixed window. More steps = more context used, until it overflows. Self-consistency helps only when answers are easy to compare.
  • After: Many parallel tries per round, compact messages, and a synthesis brain that actually uses the notes. Now we can pour in millions of tokens of total exploration while the context stays small because it holds only the problem and short messages.

Why It Works (Intuition, Not Equations)

  • Breadth beats stuckness: Multiple parallel paths reduce the chance all get stuck the same way.
  • Compaction beats the window: By turning long paths into short conclusions, you shift storage off the main context and keep coordination lightweight.
  • Synthesis beats voting: On complex tasks, the best final answer often requires merging complementary hints, not just counting identical guesses.
  • RL beats habits: Rewarding only good final outcomes pushes the model to actually read and leverage the messages rather than ignoring them.

Building Blocks (Explained with Sandwich Pattern)

  1. PaCoRe (the Framework) 🍞 Top Bread (Hook) You know how a relay team wins by coordinating smooth handoffs, not just running fast? 🥬 Filling
  • What it is: An iterative framework that alternates between parallel exploration and message-based coordination.
  • How it works: 1) Run many attempts, 2) compress each into a message, 3) feed messages into the next round, 4) end with a single, synthesized answer.
  • Why it matters: It breaks the link between total thinking and context size, enabling multi-million-token effective compute. 🍞 Bottom Bread (Anchor) A debate club runs multiple arguments, then each captain posts a one-liner; the coach fuses them into the winning case.
  2. Parallel Exploration 🍞 Top Bread (Hook) Many treasure hunters search different parts of the map at once. 🥬 Filling
  • What it is: Launching multiple independent solution paths per round.
  • How it works: 1) Sample different trajectories, 2) let them roam, 3) harvest their conclusions.
  • Why it matters: Diversity uncovers breakthroughs that a single path may miss. 🍞 Bottom Bread (Anchor) Trying ten approaches to a geometry proof; one reveals the hidden angle relation.
  3. Message Passing 🍞 Top Bread (Hook) Short postcards between expeditions keep everyone aligned. 🥬 Filling
  • What it is: Sharing compact notes between rounds.
  • How it works: 1) Collect conclusions, 2) bundle them with the problem, 3) feed them to the model to plan the next sweep.
  • Why it matters: Coordination turns many solo searches into a team. 🍞 Bottom Bread (Anchor) A captain reads all one-line findings and directs where to sail next.
  4. Message Compaction 🍞 Top Bread (Hook) Summaries beat transcripts when space is tight. 🥬 Filling
  • What it is: Extracting just the final conclusion from each trajectory.
  • How it works: 1) Parse the trajectory, 2) keep the conclusion segment, 3) discard lengthy reasoning.
  • Why it matters: Prevents context overflow so more total compute can be used. 🍞 Bottom Bread (Anchor) Keep the final answer, not the entire scratch work.
  5. Reasoning Synthesis 🍞 Top Bread (Hook) A conductor blends different instruments into harmony. 🥬 Filling
  • What it is: Combining messages (even conflicting ones) into a stronger final solution.
  • How it works: 1) Cross-check, 2) reconcile disagreements, 3) plan a smarter next step or final answer.
  • Why it matters: Beats naive voting and unlocks improvements as compute scales. 🍞 Bottom Bread (Anchor) Two partial algebra hints fuse into the full factorization.
  6. Outcome-Based Reinforcement Learning 🍞 Top Bread (Hook) Scoreboard learning: you practice what actually wins points. 🥬 Filling
  • What it is: Training that rewards only correct end results to push real synthesis.
  • How it works: 1) Provide problem + messages, 2) model answers, 3) give reward if correct, 4) update policy.
  • Why it matters: Prevents “reasoning solipsism” where the model ignores the messages and starts over. 🍞 Bottom Bread (Anchor) The model learns that reading teammates’ notes boosts its chance to score.

Net Effect

This combination lets an 8B model rival and surpass frontier systems on math by using huge effective TTC (about two million tokens) without exceeding a normal context window.

03Methodology

High-Level Recipe: Input → Parallel Exploration & Synthesis → Message Compaction → Next Round (repeat R−1 times) → Final Single-Trajectory Answer

Step 0: Prepare the Input

  • What happens: The problem x is paired with a set of compact messages M from the previous round (or empty for round 1). A prompt function formats them into a structured input: the problem followed by “Reference Responses” (the messages).
  • Why this exists: Structure makes it easy for the model to treat messages as evidence to consider, not noise.
  • Example: Problem: “Find the value of n for which …” Messages: “Ref 1: n=12 by parity,” “Ref 2: n=10 by factorization,” etc.
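The prompt function for Step 0 can be sketched as a small helper. The exact template is the paper's; the function name and "Ref N" labels here are illustrative, chosen to match the example above.

```python
def format_input(problem, messages):
    """Format the problem plus the previous round's compact messages
    into one structured input. With no messages (round 1), the input
    is just the problem itself."""
    if not messages:
        return problem
    refs = "\n".join(f"Ref {i + 1}: {m}" for i, m in enumerate(messages))
    return f"{problem}\n\nReference Responses:\n{refs}"
```

Labeling each message ("Ref 1", "Ref 2", ...) lets later trajectories refer back to specific pieces of evidence, which is exactly the cross-checking behavior the training encourages.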

Step 1: Synthesis and Parallel Exploration (per Round r) 🍞 Top Bread (Hook) Like starting a new treasure hunt after reading postcards from the last trip. 🥬 Filling

  • What it is: Generating K_r independent trajectories from the formatted input.
  • How it works: 1) Invoke the same model in parallel on the same input, 2) each run explores different steps (sampling), 3) each ends with a conclusion.
  • Why it matters: Parallelism increases diversity and reduces the risk of all paths failing the same way. 🍞 Bottom Bread (Anchor) Six code solutions attempt different data structures; one nails both performance and correctness.

Details in Plain Language

  • What happens: The model reads the problem and short messages, then spawns K_r separate solution attempts. Sampling (via temperature/top-p) creates variation so paths don’t collapse into clones.
  • Why this step exists: New ideas need fresh tries; without this, you just recycle the same thought.
  • Tiny data example: Problem: “Given array [2,1,4], make its sum divisible by 3.” Messages: “Ref 1: remove 1,” “Ref 2: change 4→3.” Parallel attempts might test removal vs modification strategies. (With sum 7, both hints check out: removing 1 leaves 6, and changing 4→3 also gives 6.)

Step 2: Message Compaction (per Round r) 🍞 Top Bread (Hook) Turn long expedition journals into short radio check-ins. 🥬 Filling

  • What it is: Extract just the conclusion from each trajectory to form a short message set M_r.
  • How it works: 1) Parse each trajectory ω_r^(i), 2) keep the final conclusion segment, 3) discard the internal steps.
  • Why it matters: Keeps the coordination traffic tiny, so the next round’s context fits. 🍞 Bottom Bread (Anchor) Instead of copying pages of derivations, you paste: “Final: n=12.”

Details in Plain Language

  • What happens: A simple extraction picks the last, clearly marked answer span. The resulting messages are short enough to include many of them next round.
  • Why this step exists: Without compaction, messages swell and crush the context. Scaling stalls.
  • Tiny data example: From a 500-token reasoning, keep only: “Answer: 17.”
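A minimal compaction extractor might look like the sketch below. The "Answer:" marker convention is an assumption for illustration; the paper simply describes keeping the final, clearly marked answer span.

```python
import re

def compact_message(trajectory):
    """Keep only the final marked answer span of a trajectory.
    Assumes conclusions end with a line like 'Answer: ...'; falls
    back to the last line if no marker is found."""
    match = re.search(r"Answer:\s*(.+)\s*$", trajectory.strip())
    if match:
        return f"Answer: {match.group(1)}"
    return trajectory.strip().splitlines()[-1]
```

Whatever the extraction rule, the point is the ratio: a 500-token trajectory collapses to a few tokens, so dozens of messages fit comfortably in the next round's context.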

Step 3: Iterate Rounds and Converge

  • What happens: Repeat Steps 1–2 for R−1 coordination rounds. On the final round, set K_R=1 to produce a single, final compact message y (the system’s answer).
  • Why this exists: Multiple cycles let evidence accumulate and conflicts be resolved by synthesis.
  • Example: Round 1 (K=32) finds varied leads; Round 2 (K=4) zooms in on the best; Final (K=1) delivers the answer.

The Training Procedure (Teaching Synthesis) 🍞 Top Bread (Hook) Practice matters: scrimmages with scorekeeping turn a team into a coordinated unit. 🥬 Filling

  • What it is: Outcome-based reinforcement learning (RL) that treats a single PaCoRe round as an episode.
  • How it works: 1) Sample a problem and a set of messages M, 2) the model proposes a trajectory, 3) extract its final answer, 4) reward = 1 if correct else 0, 5) update the policy via PPO to increase future rewards.
  • Why it matters: It pushes the model to truly use M (not ignore it) and to reconcile disagreements. 🍞 Bottom Bread (Anchor) If all inputs are wrong but contain useful hints, the trained model can still recover a correct solution—proof of real synthesis beyond voting.
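The reward signal itself is deliberately simple: a binary check on the extracted final answer. The sketch below assumes the same "Answer:" marker convention used elsewhere on this page; the real pipeline's extraction and verification are more involved.

```python
def outcome_reward(trajectory, ground_truth):
    """Outcome-based reward for one RL episode: 1.0 if the extracted
    final answer matches the ground truth, else 0.0. No credit is
    given for intermediate steps, which is what forces the policy to
    learn synthesis strategies that actually change the outcome."""
    final_line = trajectory.strip().splitlines()[-1]
    answer = final_line.split("Answer:")[-1].strip()
    return 1.0 if answer == ground_truth else 0.0
```

Because the reward ignores whether the model read the messages, the only way reading them pays off is if it raises the chance of a correct final answer, which is precisely the pressure against reasoning solipsism.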

Data Curation for Learning Synthesis

  • What happens: The training pool is filtered to prefer hard cases where naive aggregation (like majority voting) fails. Messages are sampled from a cache of model-generated attempts (size 16–24), ensuring diversity.
  • Why this exists: If training was full of easy cases where voting works, the model would never need to learn deep synthesis.
  • Example: Coding tasks with multiple failing hints; math problems where partial steps conflict.

Secret Sauce

  • Coordinated breadth: Parallel exploration plus message passing means more total thinking without bloating context.
  • Compact handoffs: Only conclusions are carried; this is the key to scaling.
  • RL-for-synthesis: Rewards for correct final outcomes pressure the model to use the handoffs wisely, not to restart from scratch.

Concrete Walkthrough (Mini Math Example)

  • Input: “Solve: Find integer n such that n^2−5n=14.” (Messages: none in Round 1)
  • Round 1 (K=8): Paths try factoring, completing square, trying integers. Conclusions vary: n=7, n=−2, etc.
  • Compaction: Keep short messages: “Ans: 7,” “Ans: −2.”
  • Round 2 (K=4): Reads both results, checks they both satisfy the equation; synthesizes: “Two solutions: n=7 and n=−2.”
  • Final (K=1): Outputs: “{−2, 7}.”
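The walkthrough's synthesized answer can be cross-checked mechanically with a brute-force scan:

```python
# Cross-check the walkthrough: integer solutions of n^2 - 5n = 14
# (equivalently, roots of (n - 7)(n + 2) = 0)
solutions = [n for n in range(-50, 51) if n * n - 5 * n == 14]
print(solutions)  # → [-2, 7], matching the synthesized answer {-2, 7}
```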

Concrete Walkthrough (Mini Code Example)

  • Input: “Given array, make sum divisible by 3 with minimal changes.” Messages from Round 1 suggest removing different elements.
  • Round 2: Synthesis notices that removing an element whose remainder (mod 3) matches the sum’s remainder fixes divisibility; it then tries minimal-change variants.
  • Final: Picks the cheapest valid operation.
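The check the synthesis round performs can be sketched for the removal-only case. This is an illustrative simplification of the walkthrough's task, not the paper's code; the function name is made up.

```python
def min_removals_for_div3(arr):
    """Fewest single-element removals so that sum(arr) % 3 == 0."""
    r = sum(arr) % 3
    if r == 0:
        return 0  # already divisible
    # removing one element whose remainder equals r fixes the sum
    if any(x % 3 == r for x in arr):
        return 1
    # otherwise the sum's remainder comes from elements with the
    # complementary remainder, and two of them must be removed
    return 2
```

For example, [2, 1, 4] has sum 7 (remainder 1), and removing either remainder-1 element (1 or 4) leaves a multiple of 3, so one removal suffices.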

Putting It All Together

PaCoRe is an inference-time loop that turns many long thoughts into a few short messages, again and again, while training teaches the model to trust, check, and merge those messages. That is how it reaches multi-million-token effective TTC with a normal context window.

04Experiments & Results

The Test: What and Why

  • What they measured: Accuracy on hard math (AIME 2025, HMMT 2025, IMOAnswerBench), coding (LiveCodeBench), science and multi-turn tasks (HLE text, MultiChallenge). They also tracked effective test-time compute (TTC) per problem.
  • Why it matters: Higher accuracy with higher TTC shows that the method turns extra compute into better answers, not just more text.

The Competition (Baselines)

  • Frontier proprietary: GPT-5.
  • Strong open systems: Qwen3-235B-Thinking, GLM-4.6, DeepSeek-V3.1-Terminus.
  • A strong starting checkpoint: RLVR-8B (the base PaCoRe is built from).
  • Majority-vote style: Self-Consistency (SC) sampling as a classic test-time baseline.

The Scoreboard (With Context)

  • HMMT 2025 (Math): PaCoRe-8B reaches 94.5% with high test-time effort (about 1.8–2.0 million tokens effective TTC), surpassing GPT-5’s 93.2%. That’s like scoring an A+ when the previous top student had an A.
  • AIME 2025 (Math): 93.7% at high effort—again top-tier.
  • IMOAnswerBench (Math): 78.4% at high effort—strong gains over the base model.
  • LiveCodeBench (Coding): 78.2% at high effort—competitive with GLM-4.6 and Kimi-K2-Thinking.
  • Apex (extremely hard): RLVR-8B got 0.0%; PaCoRe-8B reaches 2.3% at high effort—small number, but meaningful on a punishing benchmark.

Test-Time Scaling That Actually Scales

  • Self-Consistency (majority voting) quickly plateaus: sampling more answers adds little after a point, even if TTC skyrockets.
  • PaCoRe keeps improving with more compute: Increase parallel width and coordination rounds → steady gains.
  • Translation: If you spend more thinking tokens, PaCoRe makes them count.
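For contrast, the self-consistency baseline reduces to a frequency count over sampled answers, a sketch of why it saturates: once the counts stabilize, extra samples cannot change the winner or recover an answer nobody sampled.

```python
from collections import Counter

def self_consistency(answers):
    """Majority voting over sampled final answers (the SC baseline).
    Unlike PaCoRe's learned synthesis, it can only pick the most
    frequent answer; it cannot merge partial hints into a new one."""
    return Counter(answers).most_common(1)[0][0]
```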

Ablations (What’s Essential?)

  • Parallel vs Sequential: With the same number of total attempts, parallel coordinated reasoning beats purely sequential deep chains. It’s like having many scouts instead of one very long march.
  • Message Passing vs None: Without compaction (passing full trajectories), performance degrades and hits the context limit. With compaction, effective TTC scales beyond the window—no hard ceiling.

Surprising/Notable Findings

  • Emergent Correctness: Even when all input messages are wrong, the trained model’s chance of producing a correct final answer rises during training—evidence of true synthesis, not just voting.
  • Linguistic Signals of Synthesis: The model starts using more cross-checking language (“reference”, “ref 2”, etc.) over training, especially in code, where it was near zero at the start.
  • Generalization: Without special tuning, PaCoRe’s low-effort variant improves SWE-Verified (34.0% vs 29.8% baseline) and boosts MultiChallenge substantially under higher TTC.

Compute and Practicalities

  • Inference uses normal context windows (e.g., 131k tokens) but achieves multi-million-token effective TTC via parallelism + compaction.
  • Caching: For evaluation efficiency, they pre-cached a pool of first-round trajectories and sampled seeds for later rounds—empirically equivalent to generating everything fresh.

Bottom Line

Across tasks, PaCoRe turns extra test-time compute into real accuracy gains. In math, it surpasses even GPT-5 by coordinating massive parallel exploration within a fixed context window, exactly the promised benefit.

05Discussion & Limitations

Limitations (Be Specific)

  • Latency and Cost: Parallel rounds mean more compute and coordination overhead. For easy questions, this is overkill.
  • Domain Dependence: The biggest gains show up on complex, verifiable reasoning (math, coding). Open-ended creative writing may benefit less.
  • Message Quality: Compaction keeps only conclusions; if a conclusion is misleading without its reasoning, synthesis can stumble.
  • Training Demand: Outcome-based RL at scale needs infrastructure and careful data curation to avoid learning shortcuts.

Required Resources

  • A capable base model (e.g., 8B parameters or larger) with long-context support.
  • Parallel generation infrastructure (many trajectories per round) and a message cache.
  • RL training pipeline (PPO-like) with verifiable rewards or high-quality judges.

When NOT to Use

  • Trivial or low-stakes tasks where a single pass is already near-perfect; the extra rounds add delay without benefit.
  • Strict latency settings (real-time chatbots with tight budgets) unless you use a low-effort PaCoRe setting.
  • Tasks without a checkable target where outcome-based RL is hard to apply and synthesis quality is tough to measure.

Open Questions

  • Richer Messages: Can we compress more informative snippets than just final answers (e.g., short rationales) without breaking the context budget?
  • Adaptive Budgets: How to predict the smallest parallel width/rounds needed per problem to save compute while keeping accuracy?
  • Multi-Agent Learning: What emerges if both the explorers and the synthesizer are trained jointly with communication protocols?
  • Beyond Text: How does PaCoRe extend to multimodal tasks (vision, tools) where evidence lives outside text?
  • Safety and Robustness: How to detect and downweight persuasive but wrong messages during synthesis?

06Conclusion & Future Work

Three-Sentence Summary

PaCoRe is a framework that scales test-time compute by coordinating many parallel reasoning paths across multiple rounds, while keeping only compact messages in the context. Trained with outcome-based reinforcement learning, the model learns to synthesize conflicting messages into better final answers. This enables multi-million-token effective compute without exceeding the context window and leads to state-of-the-art results on hard reasoning tasks like HMMT 2025.

Main Achievement

An 8B PaCoRe model reaches 94.5% on HMMT 2025, surpassing GPT-5, by turning massive parallel exploration and message-passing synthesis into real, scalable accuracy gains.

Future Directions

  • Scale breadth and depth further, apply to agentic and multimodal tasks, and learn more efficient division of labor across trajectories.
  • Jointly train exploration and communication to study emergent multi-agent intelligence.
  • Use PaCoRe itself to generate better synthetic data for pretraining and post-training.

Why Remember This

PaCoRe shows a practical path to “thinking more at answer time” without needing a larger context window or a bigger model. It converts extra compute into better decisions through parallelism, compaction, and learned synthesis, a pattern that could shape how future AI systems plan, reason, and collaborate.

Practical Applications

  • Competition math assistants that coordinate many proof attempts and deliver verified final answers.
  • Code generation and debugging tools that run multiple design ideas in parallel and synthesize a robust fix.
  • Educational tutors that present cross-checked solutions and explain why certain approaches win over others.
  • Research helpers that explore diverse hypotheses in parallel and summarize the most promising directions.
  • Planning agents that integrate multiple candidate plans into a coherent, low-risk final plan.
  • Customer support bots that reconcile conflicting knowledge snippets to produce accurate guidance.
  • Data cleaning and validation systems that test parallel rules and synthesize the most consistent dataset.
  • Legal or policy draft assistants that merge arguments from different angles into a balanced final brief.
  • Scientific computing workflows that try different parameter sweeps in parallel and summarize key findings.
#Parallel Coordinated Reasoning #Test-time compute scaling #Message passing #Message compaction #Reasoning synthesis #Reinforcement learning (PPO) #Self-consistency vs synthesis #Long-horizon reasoning #Context window limits #Multi-round inference #Emergent correctness #Ablation studies #Mathematical reasoning #Code generation #Open-source pipeline