Long-horizon Reasoning Agent for Olympiad-Level Mathematical Problem Solving
Key Summary
- •This paper builds a math problem–solving agent, Intern-S1-MO, that thinks in multiple rounds and remembers proven mini-results called lemmas so it can solve very long, Olympiad-level problems.
- •Instead of cramming everything into one super-long message, the agent stores key steps in a compact memory and reuses them, like notes in a notebook.
- •A team of mini-helpers (reasoner, summarizer, verifier) work together each round to explore, compress to lemmas, and check correctness.
- •A special training method, OREAL-H, uses reinforcement learning that rewards not just a correct final answer but also well-checked reasoning steps.
- •The system uses two verifiers: a theorem verifier to trust lemmas and a process verifier to point out errors in whole proofs and guide revisions.
- •To assign credit fairly, it builds a lemma dependency graph that tracks which lemmas actually help finish the proof.
- •It achieves 26/35 on non-geometry IMO 2025 problems (silver medal level), 96.6% on AIME2025, 95% on HMMT2025, and 102/126 on CMO2025 (above the gold threshold).
- •An ablation study shows every component (multi-round reasoning, lemma memory, verifiers, and RL) adds meaningful gains.
- •The approach breaks free from fixed context limits (e.g., 64k tokens) by using multi-round exploration with structured memory.
- •This points to a future where AI mathematicians reason more like humans: step-by-step, carefully checked, and building on past progress.
Why This Research Matters
This work makes AI better at careful, multi-step thinking, not just quick guesses. By saving and checking mini-results (lemmas), the agent can handle puzzles too long for one-shot answers, much like top students do. Reliable verifiers and principled rewards make the system’s reasoning more trustworthy and easier to review. That opens doors to smarter tutoring, where the AI shows clean, checked steps, and to research help, where the AI proposes and verifies sub-claims. It also helps evaluate long solutions fairly and consistently, reducing the risk of lucky passes. In short, it’s a practical path toward AI that thinks deeply, transparently, and well beyond context limits.
Detailed Explanation
01Background & Problem Definition
You know how tough puzzles sometimes need a fresh page, a list of helpful facts, and several tries before the final answer clicks? That’s how many International Mathematical Olympiad (IMO) problems feel: they’re so long and twisty that one straight-through attempt usually isn’t enough.
🍞 Top Bread (Hook): Imagine trying to finish a giant jigsaw puzzle in one sitting without ever saving your progress. If you sneeze and the pieces scatter, you start over. 🥬 The Concept: Chain-of-Thought (CoT) is when AI writes out its thinking steps instead of jumping to the answer.
- How it works:
- The model lists small steps.
- Uses each step to build the next one.
- Reaches the final answer with a readable trail.
- Why it matters: Without CoT, the AI often guesses, and we can’t see where it went wrong. 🍞 Bottom Bread (Anchor): If the question is “What’s 27 × 14?”, CoT writes 27 × 10 = 270, 27 × 4 = 108, add to get 378, instead of blurting a random number.
🍞 Top Bread (Hook): Think of a video game where you get coins for beating levels. If the game only gives you a coin at the very end, it’s hard to learn which moves were good. 🥬 The Concept: Reinforcement Learning (RL) teaches AI by rewarding good actions.
- How it works:
- Try something.
- Get a reward or not.
- Do more of what earned rewards.
- Why it matters: Without rewards, the model won’t know what behaviors to repeat. 🍞 Bottom Bread (Anchor): Like a dog learning to sit with treats, RL helps models learn better steps.
🍞 Top Bread (Hook): Now imagine giving coins only if the last level is perfect—nothing for the early levels. 🥬 The Concept: RL with Verifiable Rewards (RLVR) gives rewards when the final answer is provably correct.
- How it works:
- Generate a solution.
- Check correctness with a verifier or a judge.
- Reward only if it’s correct.
- Why it matters: Without a strict check, the AI might celebrate lucky guesses. 🍞 Bottom Bread (Anchor): If the answer “42” is wrong, RLVR gives zero reward—no shortcuts.
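If you like seeing rules as code, the all-or-nothing nature of RLVR fits in a few lines of Python. This is only a minimal sketch, assuming final answers are wrapped in a \boxed{...} marker; the helper name and parsing rule are illustrative, not the paper's actual grader.

```python
import re

def extract_final_answer(solution: str):
    """Grab the last \\boxed{...} expression, a common convention for final answers."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", solution)
    return matches[-1].strip() if matches else None

def rlvr_reward(solution: str, reference: str) -> float:
    """All-or-nothing reward: 1.0 only when the final answer provably matches."""
    return 1.0 if extract_final_answer(solution) == reference.strip() else 0.0

print(rlvr_reward("27*10=270 and 27*4=108, so \\boxed{378}", "378"))  # 1.0
print(rlvr_reward("My best guess is \\boxed{42}", "378"))             # 0.0
```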
🍞 Top Bread (Hook): A super-long maze needs many turns; you can’t see the exit from the start. 🥬 The Concept: Long-horizon reasoning means planning and thinking across many steps.
- How it works:
- Break the big problem into smaller steps.
- Keep track of progress.
- Continue until the goal.
- Why it matters: Without it, models give up early or get lost. 🍞 Bottom Bread (Anchor): Solving a big algebra proof sometimes needs 20+ mini-claims chained together.
Before this paper, models tried to stretch their single “message” longer and longer (like bigger jigsaw tables) using long contexts (e.g., 64k–128k tokens). But IMO-level problems can need way more space and careful reuse of prior progress. Some teams tried multi-round chatting or tree searches, but often they didn’t keep a clean, reusable memory; they still treated each attempt like a separate run. Others used formal proof systems that are powerful but slow, hard to set up, and not very friendly for human readers.
🍞 Top Bread (Hook): Think of teamwork: a thinker, a note-taker, and a checker working together. 🥬 The Concept: A multi-agent system splits roles: one reasons, one summarizes into lemmas, and one verifies.
- How it works:
- Reasoner explores and writes a solution attempt.
- Summarizer distills key mini-results (lemmas).
- Verifier checks those lemmas and the final proof.
- Why it matters: Without roles, work gets messy and mistakes sneak through. 🍞 Bottom Bread (Anchor): Like a scientist (thinker), a lab notebook (summarizer), and a reviewer (checker) ensuring high-quality results.
🍞 Top Bread (Hook): In math class, when you prove a small claim you’ll use later, that’s a lemma. 🥬 The Concept: A lemma is a mini-proofed fact you can reuse to build the final solution.
- How it works:
- State a useful sub-claim.
- Prove it carefully.
- Reuse it later without re-proving.
- Why it matters: Without lemmas, long proofs become tangled and repetitive. 🍞 Bottom Bread (Anchor): Proving “If a number is even, its square is even” can be a lemma used in bigger number theory problems.
The gap this paper fills: Instead of one giant context, it enables cumulative, multi-round exploration with a memory of lemmas and verifiers that guide the search. It also trains this behavior with a new RL method (OREAL-H) that uses process-level feedback—not just final answers—so the agent learns which steps truly help.
Why should you care? Because this moves AI from short “guess and check” thinking to sturdy, buildable reasoning, the way top students and mathematicians work. It means AI can tackle problems that previously needed 1.5 hours of human-grade, step-by-step insight, which could help in education, science, and any task that needs long, carefully checked chains of thought.
02Core Idea
Aha! Instead of forcing one mega-thought, the agent thinks in rounds, turns progress into reusable lemmas, checks them, and learns which steps matter most using a special RL method—so it can go much farther than context limits allow.
🍞 Top Bread (Hook): Picture climbing a mountain with basecamps. You don’t carry everything at once; you set up camps (notes), check safety (verifiers), and then climb higher. 🥬 The Concept: Lemma-based memory management stores compact, verified mini-results to reuse in future rounds.
- How it works:
- After each round, the agent extracts lemmas from its attempt.
- A theorem verifier checks these lemmas with multiple tries to assign confidence.
- Trusted lemmas go into a library that the next round can use.
- Why it matters: Without this, earlier progress gets lost or repeated, wasting the tiny context budget. 🍞 Bottom Bread (Anchor): Prove a helpful inequality once, store it, and reuse it instead of re-deriving it 5 times.
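Here is a tiny Python sketch of such a lemma library, assuming confidence is simply the fraction of verifier checks that passed; the class names, the 0.75 cutoff, and the prompt format are illustrative stand-ins rather than the paper's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Lemma:
    statement: str
    proof: str
    confidence: float  # fraction of independent theorem-verifier checks that passed

@dataclass
class LemmaLibrary:
    threshold: float = 0.75                      # assumed cutoff, not the paper's value
    lemmas: list = field(default_factory=list)

    def maybe_add(self, lemma: Lemma) -> bool:
        """Store only lemmas the verifier trusts, so later rounds build on solid ground."""
        if lemma.confidence >= self.threshold:
            self.lemmas.append(lemma)
            return True
        return False

    def as_context(self) -> str:
        """Compact text block prepended to the next round's prompt."""
        return "\n".join(f"Lemma {i + 1}: {lem.statement}" for i, lem in enumerate(self.lemmas))

# 9 of 10 parallel checks passed, so the parity lemma enters the library for the next round.
library = LemmaLibrary()
library.maybe_add(Lemma("If a is even, then a = 2k for some integer k.", "...", confidence=0.9))
print(library.as_context())
```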
🍞 Top Bread (Hook): Think of a school newspaper team: writer, editor, fact-checker. 🥬 The Concept: Multi-agent reasoning organizes work into reasoning (writer), summarizing (editor), and verifying (fact-checker).
- How it works:
- Reasoner explores solutions.
- Summarizer compresses into precise lemmas.
- Verifier checks rigor and flags errors.
- Why it matters: Without roles, errors pile up and progress is hard to trace. 🍞 Bottom Bread (Anchor): The editor trims fluff and keeps only solid facts; the fact-checker ensures sources are correct.
🍞 Top Bread (Hook): Sometimes your teacher doesn’t just grade your final answer, they mark which steps were shaky. 🥬 The Concept: Process verifier gives feedback on which proof steps have issues, not just whether the final answer is correct.
- How it works:
- It scans the proof and marks the first flawed step.
- Repeats checks multiple times to reduce randomness.
- Returns feedback that the agent can use to revise.
- Why it matters: Without process checks, models can pass by luck or fail without learning why. 🍞 Bottom Bread (Anchor): Like getting comments in the margin: “Step 3: you used a theorem wrong—fix this.”
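A toy version of this "check several times and vote" idea is sketched below in Python. The random stand-in for a single verifier call and the majority vote are assumptions for illustration; in the actual system the checker is a trained process-verifier model.

```python
import random
from collections import Counter

def check_once(proof_steps):
    """Stand-in for one noisy verifier call: return the index of the first flawed step,
    or None if the proof looks sound. Here the 'flaw detector' is random on purpose."""
    for i, step in enumerate(proof_steps):
        if "misuses" in step and random.random() < 0.8:   # catches the flaw ~80% of the time
            return i
    return None

def process_verify(proof_steps, n_checks=4):
    """Run several independent checks and keep the most common verdict."""
    verdicts = [check_once(proof_steps) for _ in range(n_checks)]
    verdict, _ = Counter(verdicts).most_common(1)[0]
    return verdict  # None means "approved"; an integer points to the first bad step

proof = ["Assume a is even, so a = 2k.",
         "Then a*b = 2(k*b), which is even.",
         "This step misuses the lemma."]
print(process_verify(proof))  # most likely 2, i.e. the "Step 3" margin comment
```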
🍞 Top Bread (Hook): When many students collaborate on a giant proof, you want to know which mini-claims really helped. 🥬 The Concept: Lemma dependency graph links which lemmas lead to which others and finally to the solution.
- How it works:
- Gather lemmas from many rollouts.
- Connect them when one helps derive another.
- Flow value backward from successful solutions to the lemmas that enabled them.
- Why it matters: Without it, credit gets spread randomly; the model may learn from noisy or useless steps. 🍞 Bottom Bread (Anchor): Like tracing which clues solved the mystery and praising the most important ones.
🍞 Top Bread (Hook): If a referee flips a coin to judge each play, you shouldn’t trust one flip; you average many. 🥬 The Concept: Conjugate reward models the verifier’s pass/fail uncertainty probabilistically (with a Beta distribution), turning multiple checks into a reliable reward.
- How it works:
- Run several independent verifications.
- Use Bayesian updating to estimate true proof quality.
- Reward is the probability your solution beats a “clearly bad” baseline.
- Why it matters: Without denoising, the model might chase lucky passes instead of solid logic. 🍞 Bottom Bread (Anchor): Four passes out of four gives a very high-quality score; one out of four gives a small, cautious score.
🍞 Top Bread (Hook): A boss and a worker: the boss chooses tasks, the worker writes the sentences. 🥬 The Concept: Hierarchical Markov Decision Process (HMDP) separates high-level choices (e.g., extract lemmas?) from low-level token writing.
- How it works:
- High-level actions pick what to do next.
- Low-level policy writes the detailed text.
- Rewards flow back to both levels.
- Why it matters: Without hierarchy, long proofs tangle planning and writing, making training unstable. 🍞 Bottom Bread (Anchor): Decide “Prove Lemma 2 now,” then write the formal proof steps.
What changes because of this idea?
- Before: One long pass tried to do it all; mistakes early on would haunt the rest, and useful bits got lost in the scroll.
- After: Multi-round exploration, lemma memory, and verifier-guided RL let the agent push far beyond context limits, reuse hard-won facts, and steadily improve.
Why it works (intuition):
- Memory: Saving lemmas reduces repeated work and compresses long reasoning chains.
- Verification: Reliable feedback turns fuzzy guesses into solid steps.
- Credit assignment: The lemma graph and conjugate rewards teach the model which steps actually matter, not just which happened to be nearby.
- Hierarchy: Splitting planning from writing simplifies learning and improves stability.
Building blocks:
- Lemma-based memory library.
- Multi-agent workflow (reasoner, summarizer, verifier).
- Process verifier feedback for revisions.
- Lemma dependency graph for value flow.
- OREAL-H RL to train the whole system end-to-end with denoised, step-aware rewards.
03Methodology
At a high level: Problem → Reasoner explores → Summarizer extracts lemmas → Theorem verifier scores lemmas → Update lemma library → Next round repeats → Final solution → Process verifier pinpoints errors → Revise until verified or budget ends.
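Before the detailed recipe, here is that same pipeline as a compact Python sketch. Every callable passed in (reasoner, summarizer, verifiers, reviser) is a hypothetical placeholder for a model call, and the budgets and threshold are illustrative defaults, not the paper's settings.

```python
def solve(problem, reason, summarize, verify_lemma, verify_proof, revise,
          max_rounds=8, max_revisions=3, threshold=0.75):
    """Multi-round loop: explore, distill lemmas, verify them, and reuse them next round."""
    library = []                                        # verified lemmas carried across rounds
    for _ in range(max_rounds):
        attempt, final_proof = reason(problem, library) # reasoner explores with the memory
        for lemma in summarize(attempt):                # summarizer distills candidate lemmas
            if verify_lemma(lemma) >= threshold:        # theorem verifier's confidence vote
                library.append(lemma)
        if final_proof is None:
            continue                                    # no complete proof yet; keep exploring
        proof = final_proof
        for _ in range(max_revisions):                  # process-verifier-guided revision loop
            feedback = verify_proof(proof)              # e.g. "OK" or "STEP3 is flawed"
            if feedback == "OK":
                return proof, library
            proof = revise(proof, feedback, library)
    return None, library                                # budget exhausted without a verified proof
```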
Step-by-step recipe with purpose, failure mode, and example:
- Input and initialization
- What happens: The agent receives a math problem. If this is the first round, the lemma library is empty.
- Why it exists: Start clean and avoid dragging in irrelevant facts.
- Example: “Prove that for integers a, b, if a is even then a·b is even.” No lemmas yet.
- Reasoner’s exploration round
- What happens: The reasoner writes a detailed attempt: tries approaches, splits into cases, derives subclaims.
- Why it exists: Exploration finds useful sub-results; without it, nothing new to summarize.
- What breaks without it: The system would only repeat old lemmas, never discovering new ones.
- Example: The reasoner derives the lemma “If a is even, then a = 2k for some integer k” and uses it to argue about a·b.
- Summarizer distills lemmas
- What happens: The summarizer scans the attempt, extracts only novel, clearly stated lemmas with their proofs, and formats them cleanly.
- Why it exists: Exploration is messy; we must keep just the gold.
- What breaks without it: Future rounds drown in noise; context fills with redundant or half-baked steps.
- Example: It keeps “If a is even, then a=2k” and drops rambling text that didn’t help.
- Theorem verifier (lemma checker) with confidence voting
- What happens: Each lemma is checked multiple times in parallel; the fraction of successful checks becomes a confidence score.
- Why it exists: To avoid trusting a single noisy judgment.
- What breaks without it: One lucky or unlucky check could wrongly promote/demote a lemma, poisoning the library.
- Example: The basic parity lemma gets high confidence (e.g., 0.9), so it goes into the library.
🍞 Top Bread (Hook): Like keeping index cards of facts you’ll reuse. 🥬 The Concept: Lemma-based memory management stores compact, proven facts between rounds.
- How it works:
- Only high-confidence lemmas are saved.
- Each new round gets the problem + current lemma library.
- The reasoner can build on these, going deeper.
- Why it matters: Without it, you re-derive the same facts and hit context limits quickly. 🍞 Bottom Bread (Anchor): Next round starts by using “a=2k” immediately, saving time and space.
- Next round reasoning with memory
- What happens: The reasoner now has the lemma library as input and can push further, proving stronger lemmas.
- Why it exists: Multi-round depth lets the agent reach solutions beyond single-pass limits.
- What breaks without it: The agent stalls on long problems that need dozens of steps.
- Example: Using a=2k, it proves “If b is any integer, then a·b is even.”
- Final solution and process verification
- What happens: When the reasoner proposes a final proof, the process verifier (OPV-style) identifies the first incorrect step or approves it.
- Why it exists: To ensure the whole argument—not just bits—is sound.
- What breaks without it: The agent might combine correct lemmas in a wrong way and still pass.
- Example: If step 3 misuses a lemma, the verifier flags “STEP3,” and the agent revises.
- Iterative revision loop
- What happens: Using verifier feedback, the agent edits and rechecks the proof for a fixed number of rounds.
- Why it exists: Few proofs are perfect on the first try.
- What breaks without it: Small fixable flaws would cause full failure, wasting all progress.
- Example: It corrects the application of a parity rule and passes on the next check.
🍞 Top Bread (Hook): Imagine a coach who watches many practice games and points out the plays that truly lead to wins. 🥬 The Concept: Lemma dependency graph assigns credit to lemmas that actually help finish proofs.
- How it works:
- Collect lemmas across multiple rollouts for the same problem.
- Link lemmas that lead to later ones and to final success.
- Flow value backward so helpful lemmas get higher value.
- Why it matters: Without focused credit, training amplifies noisy branches instead of good ones. 🍞 Bottom Bread (Anchor): If Lemma A appears in many winning proofs, it gets a big thumbs-up; if Lemma Z rarely helps, it’s down-weighted.
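The backward value flow can be made concrete with a small graph sketch in Python. The discount factor, the "take the best child" rule, and the toy graph are simplifying assumptions; the paper's actual credit-assignment scheme may differ in detail.

```python
from collections import defaultdict

def lemma_values(edges, solved, discount=0.9):
    """edges: (parent, child) pairs meaning "parent helped derive child".
    solved: terminal nodes that correspond to verified final proofs.
    Each solved proof is worth 1.0; every lemma inherits a discounted share of the
    best outcome it eventually contributed to, and dead ends get 0.0."""
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)

    def value(node, seen=frozenset()):
        if node in solved:
            return 1.0
        if node in seen or not children[node]:
            return 0.0
        return discount * max(value(c, seen | {node}) for c in children[node])

    nodes = {n for edge in edges for n in edge} | set(solved)
    return {n: value(n) for n in nodes}

# Lemma A enabled Lemma B, which enabled a verified proof; Lemma Z led nowhere.
vals = lemma_values([("A", "B"), ("B", "Proof"), ("Z", "dead end")], solved={"Proof"})
print(vals["A"], vals["B"], vals["Z"])  # ~0.81, 0.9, 0.0
```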
🍞 Top Bread (Hook): If multiple judges review your essay, you want a fair score that respects uncertainty. 🥬 The Concept: Conjugate reward turns multiple noisy verifier votes into a stable, probabilistic reward.
- How it works:
- Run n verifier checks; observe k passes.
- Use a Bayesian Beta model to estimate true quality.
- Reward is the chance your proof is better than a baseline that always fails.
- Why it matters: Without this, the model might chase luck. 🍞 Bottom Bread (Anchor): 4/4 passes yields a huge reward; 1/4 yields a small one; the curve is smooth and principled.
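A worked sketch of this Beta-based reward is below, under the assumptions of a uniform Beta(1, 1) prior and a baseline scored as if it passed 0 of n checks; both are choices of this illustration, not necessarily the paper's exact formulation.

```python
import numpy as np

def conjugate_reward(k, n, prior=(1.0, 1.0), n_samples=200_000, seed=0):
    """Probability that the solution's true pass rate beats an always-failing baseline.

    With a Beta(a, b) prior, k passes out of n checks give the posterior Beta(a + k, b + n - k);
    the baseline is treated as having passed 0 of n. Monte Carlo over both posteriors
    estimates P(pass rate of solution > pass rate of baseline)."""
    rng = np.random.default_rng(seed)
    a, b = prior
    p_solution = rng.beta(a + k, b + n - k, n_samples)
    p_baseline = rng.beta(a, b + n, n_samples)
    return float(np.mean(p_solution > p_baseline))

print(conjugate_reward(4, 4))  # ~0.996: four clean passes, near-certain to beat the baseline
print(conjugate_reward(1, 4))  # ~0.78: one pass in four earns a noticeably more cautious score
print(conjugate_reward(0, 4))  # ~0.5: no passes, no evidence the solution beats the baseline
```

Because the posterior tightens as more checks come in, the reward rises smoothly with the number of passes instead of flipping between 0 and 1.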
🍞 Top Bread (Hook): A principal (high-level) plans the schedule; teachers (low-level) run the classes. 🥬 The Concept: Hierarchical MDP splits planning (which action next?) from token generation (what to write?).
- How it works:
- High-level policy chooses actions like “extract lemmas” or “revise proof.”
- Low-level policy writes the actual text.
- Both learn from rewards.
- Why it matters: Without hierarchy, the model fumbles between plan and prose. 🍞 Bottom Bread (Anchor): Decide to verify now; then write the verification prompt and interpret results.
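A minimal sketch of the two levels follows, assuming a small illustrative action set and a hypothetical `apply` transition on the state; the real action space and both policies live inside the trained model.

```python
from enum import Enum, auto

class HighLevelAction(Enum):
    EXPLORE = auto()          # run another reasoning round
    EXTRACT_LEMMAS = auto()   # summarize the latest round into lemmas
    VERIFY_PROOF = auto()     # call the process verifier
    REVISE = auto()           # edit the proof using verifier feedback
    STOP = auto()             # accept the proof or give up

def rollout(state, high_policy, low_policy, max_steps=20):
    """Hierarchical rollout: the high-level policy picks *what* to do next,
    the low-level policy writes the tokens that do it, and both later learn
    from the same rewards."""
    trajectory = []
    for _ in range(max_steps):
        action = high_policy(state)            # e.g. HighLevelAction.EXTRACT_LEMMAS
        text = low_policy(state, action)       # the detailed prose or proof for that action
        trajectory.append((state, action, text))
        state = state.apply(action, text)      # hypothetical transition to the next state
        if action is HighLevelAction.STOP:
            break
    return trajectory
```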
Training: OREAL-H (Outcome Reward Reinforcement Learning—Hierarchical)
- Cold start with behavior cloning: copy well-formed trajectories so the agent learns the workflow and lemma style.
- Online RL with OREAL-H:
- Use process verifier signals (k of n passes) with conjugate rewards.
- Build lemma dependency graphs to compute progress-conditioned advantages.
- Update low-level generation with advantages focused on rounds that made real progress.
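The low-level update can be pictured as an advantage-weighted likelihood step. The PyTorch sketch below assumes a Hugging Face-style causal LM, a mask over the agent-generated tokens, and a per-trajectory advantage already built from the conjugate reward and lemma-graph values; it is an illustrative REINFORCE-style step, not the paper's exact training code.

```python
import torch.nn.functional as F

def oreal_h_step(policy, optimizer, batch):
    """One advantage-weighted policy-gradient step on the low-level generator.
    batch["input_ids"]: (B, T) token ids; batch["gen_mask"]: (B, T) 1.0 on agent tokens;
    batch["advantage"]: (B,) progress-conditioned advantage per trajectory."""
    logits = policy(batch["input_ids"]).logits                        # (B, T, V)
    logp = F.log_softmax(logits[:, :-1], dim=-1)                      # predict token t+1 from t
    token_logp = logp.gather(-1, batch["input_ids"][:, 1:, None]).squeeze(-1)
    mask = batch["gen_mask"][:, 1:]
    per_seq = (token_logp * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    loss = -(batch["advantage"] * per_seq).mean()                     # positive advantage -> reinforce
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```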
Secret sauce:
- The combination: multi-round structure + lemma memory + dual verifiers + principled RL credit yields deep, steady progress.
- Adaptive budget: For easy problems, it stops early; for hard ones, it spends more rounds and checks.
- Human-aligned: Proofs are readable and step-checked, closer to how students and coaches work.
04Experiments & Results
The Test: The team measures how often the agent solves problems correctly (pass@1, pass@4) and how many points it earns under Olympiad-style grading. They use well-known competitions to prove real-world difficulty: AIME2025 and HMMT2025 (short-answer), plus CNMO2025 and IMO2025 (proof-heavy), and even enter CMO2025 officially under human grading.
🍞 Top Bread (Hook): It’s like trying out for the math team by doing the same contests as everyone else—and letting real judges score you. 🥬 The Concept: Benchmark evaluations compare models on shared test sets with fixed rules.
- How it works:
- Pick public math contests.
- Run each model with the same budgets and rules.
- Score using exact answers or a rubric with partial credit for proofs.
- Why it matters: Without fair tests, we can’t tell if a new method is truly better. 🍞 Bottom Bread (Anchor): On AIME2025, the score is exact final-answer accuracy; on CNMO/IMO, partial credit reflects proof quality.
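For the short-answer contests, pass@k can be estimated from n sampled solutions with the standard unbiased estimator from the code-generation literature; whether the paper uses this exact estimator or simply reruns the agent k times is an assumption here.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k draws (without replacement) from n samples,
    of which c are correct, is correct: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=8, c=2, k=1))  # 0.25: the average single-attempt success rate
print(pass_at_k(n=8, c=2, k=4))  # ~0.79: four attempts make a solve far more likely
```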
The Competition: They pit Intern-S1-MO against strong baselines like Gemini 2.5 Pro, o3-high, Grok 4, GPT-OSS-120B, DeepSeek-R1-0528, and Qwen3-235B-A22B. Some baselines are huge and proprietary; others are open.
The Scoreboard (with context):
- AIME2025: Intern-S1-MO gets 96.6% pass@1—like scoring an A+ when others score A or B.
- HMMT2025: 95% pass@1—again top-tier, showing consistent short-answer strength.
- CNMO2025: 232.4/260—substantially ahead of baselines; this is where proof-building matters.
- IMO2025 (non-geometry, pass@4): 26/35—silver medal human level. That’s elite performance.
- CMO2025 (official, human-graded): 102/126, above the gold threshold of 78. This is the kind of result that turns heads.
Why these numbers are meaningful:
- Short-answer contests (AIME/HMMT) check accuracy and speed. Intern-S1-MO wins there, but the bigger story is proof-heavy contests, where step-by-step rigor counts.
- On CNMO and IMO proof problems, the agent’s lemma memory and verifiers shine, because success depends on building and checking long chains, not just pattern-matching.
Ablations (what changed when parts were removed):
- Single-round with agents: decent, but limited depth.
- Adding multi-round reasoning (no lemma scores): a clear jump; depth helps.
- Adding the theorem verifier: another bump; trustworthy lemmas matter.
- Adding the process verifier: a bigger bump; step-level correctness improves global proofs.
- Adding OREAL-H RL: the biggest gains; the model learns where progress truly comes from.
This shows each piece is necessary, and together they're powerful.
Surprising findings:
- Parameter-efficient variant (Intern-S1-mini-MO) still performs very strongly, especially on CNMO and IMO. This suggests structure and training matter as much as sheer model size.
- The agent sometimes discovers novel solution paths that also earn high human-graded scores—evidence that it’s not just memorizing old tricks.
Takeaway: The results validate the central thesis—cumulative, verified lemma memory and step-aware RL let the agent reach beyond fixed context limits and compete at Olympiad levels.
05Discussion & Limitations
Limitations:
- Geometry excluded: Reported IMO/CMO results avoid geometry problems; geometry often needs diagrams and specialized reasoning, which would require new tools.
- Verifier noise: Even with multiple checks and conjugate rewards, verifiers aren’t perfect; rare subtle errors can still slip by or be unfairly penalized.
- Budget costs: Deep multi-round searches and multiple verifications raise compute costs; though adaptive budgeting helps, full test-time scaling is expensive.
- Dependency on good lemma extraction: If summarization misses or garbles key lemmas, progress stalls or drifts.
Required resources:
- A capable base reasoning model (e.g., Intern-S1 family).
- Verifier models (theorem verifier and process verifier) with parallel sampling.
- Compute for multi-round rollouts and RL (batches, multiple trajectories, long token limits ~64k per round and many rounds).
When not to use:
- Very short problems where single-pass answers work fine—the overhead is unnecessary.
- Domains where formal guarantees are mandatory and natural-language proofs aren’t acceptable—then formal proof systems may be better.
- Tasks dominated by retrieval (memorized facts) rather than reasoning chains—the lemma machinery adds little value.
Open questions:
- Geometry integration: How to blend diagram reasoning, construction tools, and lemma memory cleanly?
- Stronger, cheaper verifiers: Can we further reduce noise and cost while improving recall/precision on step errors?
- Knowledge transfer across problems: How much of the lemma library can be shared across related topics without causing confusion?
- Automated difficulty pacing: Can the agent learn to set its own budgets on the fly more precisely, spending just enough compute?
- Human-in-the-loop: What is the best way for teachers or students to interact with lemma libraries for learning and grading?
06Conclusion & Future Work
In three sentences: This paper presents Intern-S1-MO, a long-horizon math agent that solves Olympiad-level problems by thinking in multiple rounds, saving proven lemmas, and verifying both steps and final proofs. A new RL framework, OREAL-H, turns noisy process checks into stable rewards and uses a lemma dependency graph to credit the steps that truly drive progress. Together, these ideas break free from fixed context limits and deliver silver-medal IMO performance and gold-level CMO results.
Main achievement: Showing that a structured, verifiable, lemma-centric workflow—trained with step-aware reinforcement learning—can push AI beyond single-pass context ceilings into sustained, high-precision reasoning on very hard math problems.
Future directions: Add geometry reasoning with tool support; build stronger, cheaper verifiers; improve adaptive budgeting; and explore cross-problem lemma sharing for faster bootstrapping. Also, extend the approach beyond math to scientific proofs, program verification, and complex planning.
Why remember this: It marks a shift from “longer prompts” to “longer thinking” with memory and checks—closer to how great mathematicians work. That shift could power the next wave of trustworthy AI that solves big, multi-step problems in classrooms, labs, and the real world.
Practical Applications
- •Olympiad training assistant that proposes step-by-step hints and verified lemmas for students.
- •Automated grading helper that pinpoints the first incorrect step and assigns rubric-based partial credit.
- •Proof sketch generator for researchers, building reusable lemmas across related problems.
- •Math curriculum tool that converts messy student work into clean, checked lemma summaries.
- •Competition prep system that adapts compute budgets to hard problems and saves progress across attempts.
- •Safety checker for AI math outputs, filtering solutions through process and lemma verifiers before release.
- •Interactive study coach that lets learners accept/reject lemmas and see how those choices affect the full proof.
- •Content creation for textbooks: multiple solution paths with verifiable intermediate claims.
- •Benchmarking suite that uses verifier ensembles and refined rubrics to evaluate model reasoning fairly.
- •Assistive planning beyond math (e.g., code proofs, protocol design) by reusing the lemma memory pattern.