
Achieving Olympiad-Level Geometry Large Language Model Agent via Complexity-Boosting Reinforcement Learning

Intermediate
Haiteng Zhao, Junhao Shen, Yiming Zhang et al. Ā· 12/11/2025
arXiv Ā· PDF

Key Summary

  • This paper builds InternGeometry, a large language model agent that solves Olympiad-level geometry by talking to a math engine, remembering what worked, and trying smart new ideas.
  • The agent learns like a student with a coach: it proposes steps, gets checked by a symbolic proof engine, and improves based on feedback.
  • A dynamic memory keeps the important parts of a very long conversation (over 200 steps), so the agent doesn’t forget key discoveries.
  • A new training method, Complexity-Boosting Reinforcement Learning (CBRL), steadily raises problem difficulty so the agent always learns at the right challenge level.
  • InternGeometry solves 44 of 50 IMO geometry problems from 2000–2024, beating prior expert systems while using only about 13K training examples.
  • It can invent clever auxiliary constructions that don’t appear in human solutions, showing genuine creativity in geometry reasoning.
  • The system’s success comes from long-horizon interaction, proposition-by-proposition proving, and a tight feedback loop with a geometric proof engine.
  • Ablations show that removing slow thinking, memory compression, or proposition steps hurts performance a lot, demonstrating the importance of these pieces.
  • CBRL’s curriculum matters: training only on easy or only on hard problems performs worse than the adaptive schedule.
  • The approach demonstrates that LLM agents, not just specialist expert systems, can reach medalist-level geometry performance with far less data.

Why This Research Matters

Geometry powers things we see daily—buildings, maps, graphics, robots—and it demands careful spatial reasoning. This paper shows an LLM agent can master that careful reasoning with far less data than expert systems by thinking in words, acting in formal steps, and learning at just the right challenge level. The same loop—plan, act, verify, remember—can generalize to other long, complex tasks beyond geometry. For students and teachers, it hints at future tutors that don’t just give answers but also model smart exploration and proof habits. For engineers and scientists, it suggests AI partners that can explore designs, prove properties, and surface creative constructions humans might miss.

Detailed Explanation


01 Background & Problem Definition

šŸž Hook: Imagine you’re building a Lego castle without the picture on the box. You try a piece here, a piece there, check if it fits, and keep the good ideas in mind while tossing the bad ones. That’s how good problem solvers work.

🄬 The Concept (Reinforcement Learning): Reinforcement Learning is a way for AI to learn by trying actions and getting feedback (rewards or no rewards). How it works:

  1. The AI tries something.
  2. It gets a signal: ā€œThat helpedā€ or ā€œThat didn’t.ā€
  3. It repeats and improves.

Why it matters: Without RL, the AI can’t learn from experience and won’t get better at planning many steps ahead.

šŸž Anchor: Like a video game player who learns which moves beat the boss, the AI learns which geometry moves lead to a proof.
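To make the loop concrete, here is a minimal, self-contained Python sketch. It is not the paper’s training method (that is CBRL, covered later); the move names and success probabilities are invented purely to show ā€œtry, get feedback, prefer what worked.ā€

```python
import random

# Minimal sketch of a reinforcement-learning loop: try a move, get a
# reward signal, and shift preference toward moves that worked.
scores = {"angle_chase": 0.0, "add_midpoint": 0.0, "add_circle": 0.0}
counts = {move: 0 for move in scores}

def engine_feedback(move):
    # Hypothetical stand-in for the proof engine's verdict: pretend
    # "add_circle" helps most often on this imaginary problem family.
    chance = {"angle_chase": 0.2, "add_midpoint": 0.3, "add_circle": 0.6}[move]
    return 1.0 if random.random() < chance else 0.0

for step in range(500):
    # Mostly pick the best-scoring move, but sometimes explore a random one.
    if random.random() < 0.1:
        move = random.choice(list(scores))
    else:
        move = max(scores, key=scores.get)
    reward = engine_feedback(move)
    counts[move] += 1
    # Running average of rewards: the "repeat and improve" step.
    scores[move] += (reward - scores[move]) / counts[move]

print(scores)  # "add_circle" should end up with the highest estimate
```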

šŸž Hook: You know how talking through a math problem out loud helps you think more clearly?

🄬 The Concept (Natural Language Reasoning): Natural Language Reasoning is the AI’s skill to think and explain its plan in everyday words before making a formal move. How it works:

  1. The AI writes out its thoughts.
  2. It picks a concrete action to try.
  3. It reads the result and updates its plan.

Why it matters: Without this, the AI would make random moves without a plan or reflection.

šŸž Anchor: Like explaining your steps to a teacher before you write the final answer, the AI plans in words, then acts carefully.

The World Before: For many math areas, LLM agents already do well by using tools like code runners and formal proof checkers. But geometry is special: fancy Olympiad problems often need very long proofs and, most importantly, new helper drawings called auxiliary constructions (like adding a new point or circle) that aren’t obvious. Expert systems such as AlphaGeometry 2 got great results by training special models on massive synthetic datasets and doing huge searches, but those systems were heavy and less flexible.

The Problem: Auxiliary constructions come with only weak heuristics. There isn’t a simple rule like ā€œalways draw this lineā€ that works reliably. You must explore, try, verify, and backtrack—like a detective testing leads.

šŸž Hook: Think of trying to open a tricky combination lock without the numbers. You test a lot, listen for clicks, and slowly find the pattern.

🄬 The Concept (Auxiliary Constructions): Auxiliary constructions are extra points, lines, or circles you add to make hidden relationships become clear. How it works:

  1. Guess a helpful object to add.
  2. Check what new angles or equalities appear.
  3. Keep the ones that help the proof.

Why it matters: Without them, many Olympiad problems are nearly impossible because the key structure stays hidden.

šŸž Anchor: If two triangles won’t look similar, adding a point on a circle might reveal equal angles that unlock the proof.
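Below is an illustrative Python sketch of that guess-check-keep loop. The candidate constructions and the `engine_new_facts` helper are hypothetical stand-ins; the real system issues DSL commands to a symbolic engine.

```python
# Illustrative sketch: try candidate auxiliary constructions, keep the
# ones that make new relations appear. All names here are invented.
candidates = [
    "midpoint of AB",
    "circle through A, B, X",
    "reflection of D over line AC",
]

def engine_new_facts(construction):
    # Hypothetical engine call: pretend only the circle reveals
    # a useful angle equality on this imaginary diagram.
    return ["angle ABX = angle AKX"] if "circle" in construction else []

kept = []
for c in candidates:
    facts = engine_new_facts(c)   # step 2: what new relations appear?
    if facts:                     # step 3: keep constructions that help
        kept.append((c, facts))

print(kept)
```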

Failed Attempts: People tried fixed sets of constructions, shallow searches, or massive data training. Fixed rules missed creative ideas. Shallow searches got stuck. Massive data worked but was expensive and brittle. What was missing was an agent that could think long, remember well, try fresh ideas, and learn the ā€œright next challenge.ā€

The Gap: We needed a general LLM agent that (1) can talk to a strong geometry engine, (2) can run hundreds of steps without getting lost, (3) can learn from attempts, and (4) can train on problems that get harder at the perfect pace.

Real Stakes: Why care? Because geometry is everywhere—maps, architecture, robotics, computer graphics, and even everyday reasoning about space. If AI can reason visually and logically like this, it can help students learn, help engineers design, and help scientists discover. It also shows a path for AI to handle long, tricky tasks in many fields—not just math.

02 Core Idea

šŸž Hook: Imagine coaching a smart student who solves puzzles by trying moves on a whiteboard that instantly tells them if a step is valid. You keep a neat notebook of what worked so far and steadily give them harder puzzles as they improve.

🄬 The Concept (InternGeometry): InternGeometry is an LLM agent that solves IMO-level geometry by proposing ideas, verifying them with a symbolic engine, remembering key progress, and training on a smartly rising difficulty schedule. How it works:

  1. Think in words, then output a precise action in a geometry code (DSL).
  2. The engine checks it.
  3. The agent stores the important results in memory.
  4. Repeat hundreds of steps until the proof is complete.

Why it matters: Without this close loop and long memory, the agent can’t discover the creative constructions and multi-step chains that Olympiad problems demand.

šŸž Anchor: Like solving a 500-piece puzzle, InternGeometry keeps testing where pieces might fit, marking successes, and building up the final picture.
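The sketch below compresses that loop into runnable Python. `GeometryEngine` and `llm_think_and_act` are hypothetical placeholders for the symbolic prover and the LLM backbone; the real agent emits DSL actions and reads much richer engine feedback.

```python
# A compressed sketch of the think -> act -> verify -> remember loop.
class GeometryEngine:
    def __init__(self, goal):
        self.facts, self.goal = set(), goal
    def execute(self, action):
        # Real engine: run a DSL command, return a verdict plus new facts.
        self.facts.add(action)
        return {"ok": True, "new_facts": [action], "solved": action == self.goal}

def llm_think_and_act(problem, memory):
    # Real agent: reason in natural language, then emit one DSL action.
    # This stand-in just emits placeholder steps, then the goal.
    return f"step-{len(memory)}" if len(memory) < 3 else "goal"

def solve(problem, goal, max_steps=200):
    engine, memory = GeometryEngine(goal), []
    for _ in range(max_steps):            # long horizon: hundreds of turns
        action = llm_think_and_act(problem, memory)
        feedback = engine.execute(action)
        memory.append((action, feedback["ok"]))  # compressed memory entry
        if feedback["solved"]:
            return memory                 # verified proof trace
    return None

print(solve("IMO problem", goal="goal"))
```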

The ā€œAha!ā€ Moment in one sentence: Pair a long-horizon LLM thinker with a geometric proof engine and train it using a difficulty schedule that always targets ā€œjust hard enough.ā€

Three Analogies:

  1. Rock climbing with staged routes: start on easy holds, then move to trickier ones as your grip improves.
  2. Detective work: propose clues, check them, keep the promising ones, and follow the trail.
  3. Cooking school: learn recipes step by step; as your skills grow, the chef gives you more complex dishes.

šŸž Hook: You know how a good notebook keeps only the essentials so you don’t drown in details?

🄬 The Concept (Dynamic Memory Mechanism): Dynamic memory compresses long histories to keep only the crucial actions and outcomes. How it works:

  1. Summarize early turns.
  2. Keep the last full feedback and key facts.
  3. Provide a compact state so the agent can reason far.

Why it matters: Without compression, the agent’s context gets too long and it forgets which tries worked or failed.

šŸž Anchor: Like a tidy lab journal that lists the experiments, results, and next steps—without pages of chatter.
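Here is a minimal sketch of the compression idea, assuming a trivial summarization rule; the real agent writes its own natural-language summaries of earlier turns.

```python
# Sketch of dynamic memory: keep a one-line summary of old turns and the
# full text only for the most recent feedback.
def compress(history, keep_last=1):
    old, recent = history[:-keep_last], history[-keep_last:]
    summary = [f"{turn['action']} -> {'ok' if turn['ok'] else 'failed'}"
               for turn in old]                     # essentials only
    return {"summary": summary, "recent": recent}   # compact state

history = [
    {"action": "add K on both circles", "ok": True,
     "feedback": "ABKX cyclic"},
    {"action": "propose angle ABX = angle CDX", "ok": False,
     "feedback": "not derivable yet"},
    {"action": "propose next subgoal", "ok": True,
     "feedback": "full derivation kept verbatim ..."},
]
print(compress(history))
```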

šŸž Hook: Think of proving a big claim by checking smaller helper facts first.

🄬 The Concept (Proof by Propositions): The agent breaks the final goal into bite-sized propositions it can ask the engine to verify. How it works:

  1. Choose a sub-claim.
  2. Ask the engine to prove it.
  3. Use the new fact to unlock further steps.

Why it matters: Without subgoals, the agent must jump straight to the final theorem, which is too hard.

šŸž Anchor: Like proving two lines are parallel by first showing equal alternate interior angles, then using that to finish the argument.
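A small sketch of the subgoal ladder follows, assuming a toy `engine_prove` that only checks prerequisites; the real engine runs full symbolic deduction.

```python
# Sketch of proposition-by-proposition proving: each verified sub-claim
# becomes a fact that later sub-claims may depend on.
def engine_prove(claim, known):
    # Hypothetical prover: each claim needs its listed prerequisites.
    prereqs = {
        "equal alternate interior angles": set(),
        "AB parallel to CD": {"equal alternate interior angles"},
        "target angle sum": {"AB parallel to CD"},
    }
    return prereqs[claim] <= known

known = set()
ladder = ["equal alternate interior angles",
          "AB parallel to CD",
          "target angle sum"]
for claim in ladder:
    if engine_prove(claim, known):
        known.add(claim)        # unlocked: usable in later steps
print(known)
```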

šŸž Hook: When you train for a race, you don’t jump straight to a marathon—you increase distance step by step.

🄬 The Concept (Complexity-Boosting Reinforcement Learning, CBRL): CBRL is a curriculum that automatically adjusts problem difficulty so the agent learns fastest. How it works:

  1. Generate problems at a target complexity.
  2. Train with RL on them.
  3. Measure how well the agent is doing.
  4. Nudge difficulty up or down to stay near the sweet spot.

Why it matters: Without a tuned curriculum, training is either too easy (no growth) or too hard (no learning).

šŸž Anchor: Like a coach keeping your workout challenging but doable so you keep improving.

šŸž Hook: Picture a factory that can stamp out practice problems at any difficulty you ask for.

🄬 The Concept (Data Synthesis Pipeline): This is a generator that creates geometry problems with controllable complexity based on proof-step length. How it works:

  1. Build a raw configuration.
  2. Add auxiliary constructions.
  3. Use the engine to find nontrivial goals.
  4. Keep items near the target difficulty.

Why it matters: Without the right data at the right level, the agent can’t learn expert skills efficiently.

šŸž Anchor: Like a math teacher crafting a worksheet that’s not too easy, not too hard—just right for today.
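A rough sketch of the filtering stage is below, assuming a fake `random_problem` generator; in the paper, problems come from engine-built configurations and difficulty is measured by proof-step length.

```python
import random

# Sketch of the synthesis idea: generate candidate problems, then keep
# only those whose proof length lands near the target difficulty kappa.
def random_problem():
    # Hypothetical generator: the real pipeline builds a configuration,
    # adds constructions, and lets the engine derive nontrivial goals.
    return {"id": random.randrange(10**6),
            "proof_len": random.randint(1, 60)}

def synthesize(kappa, tolerance=5, n=100):
    batch = []
    while len(batch) < n:
        p = random_problem()
        if abs(p["proof_len"] - kappa) <= tolerance:  # near target level
            batch.append(p)
    return batch

batch = synthesize(kappa=30)
print(len(batch), sum(p["proof_len"] for p in batch) / len(batch))
```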

Before vs After: Before, expert models needed huge datasets and wide search trees. After, an LLM agent with memory, subgoals, and a symbolic checker can reach medalist-level geometry with far less data. It learns to explore creatively instead of memorizing countless patterns.

Why It Works (Intuition): The engine guarantees correctness, the agent provides creativity, the memory supports long plans, and CBRL feeds the right challenges at the right time. This synergy transforms trial-and-error into guided discovery.

Building Blocks: Dynamic memory to keep context manageable; proposition checking to climb the proof ladder; long-horizon interaction to allow hundreds of thoughtful moves; and CBRL to pace learning for rapid, robust gains.

03 Methodology

High-Level Recipe: Input (a geometry problem) → Think (natural language plan) → Action (formal DSL command) → Feedback (symbolic engine verdict) → Memory update (compress essentials) → Repeat many turns → Output (complete verified proof).

šŸž Hook: Imagine you’re playing chess with a coach who immediately tells you if a move is legal and why it helps or not.

🄬 The Concept (Symbolic Engine Interaction): Symbolic Engine Interaction is how the agent talks to a formal geometry prover that checks each step. How it works:

  1. The agent sends a structured action (like ā€œbuild this pointā€ or ā€œprove this angle equalityā€).
  2. The engine executes and returns success/failure and new facts.
  3. The agent uses this to guide the next step.

Why it matters: Without the engine, the agent can’t confirm which moves are truly valid, and it can drift off course.

šŸž Anchor: Like a spellchecker for math moves that says ā€œThis is provenā€ or ā€œThis construction is invalid.ā€
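The sketch below shows one plausible shape for this structured protocol. The tag names <build>, <propose>, and <add> come from the paper’s examples, but the message format and handler logic here are guesses, not the actual DSL.

```python
import re

# Sketch of a structured agent -> engine protocol with tagged actions.
def handle(message, facts):
    match = re.match(r"<(build|propose|add)>(.*)</\1>", message.strip())
    if not match:
        return {"ok": False, "error": "malformed action"}
    kind, body = match.groups()
    if kind == "propose":
        # Real engine: try to derive `body` from current facts via DDAR.
        return {"ok": body in facts, "proposition": body}
    facts.add(body)               # build/add: extend the diagram
    return {"ok": True, "new_facts": [body]}

facts = set()
print(handle("<add>point K on circle ABX and circle CDX</add>", facts))
print(handle("<propose>angle ABX = angle CDX</propose>", facts))
```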

šŸž Hook: Think of building with Lego instructions written in a special code that the builder understands perfectly.

🄬 The Concept (DDAR – Deductive Database Arithmetic Reasoning): DDAR is the engine’s brain that knows geometry rules and does calculations like angle and ratio chasing. How it works:

  1. Store known facts.
  2. Apply geometry theorems exhaustively.
  3. Derive all consequences and check propositions.

Why it matters: Without DDAR, the system can’t expand knowledge from a few facts to the full web of implications needed to solve the problem.

šŸž Anchor: It’s like a giant, always-on geometry textbook that can immediately apply any relevant theorem.
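Here is a toy forward-chaining loop in the spirit of DDAR: apply rules until no new facts appear (a fixpoint). The string-valued facts and two-rule database are invented; the real engine works over geometric predicates and arithmetic reasoning.

```python
# Toy deductive database: rules map premise sets to conclusions.
rules = [
    (frozenset({"AB parallel CD", "line EF crosses both"}),
     "alternate angles equal"),
    (frozenset({"alternate angles equal"}),
     "triangle pair similar"),
]

def saturate(facts):
    facts = set(facts)
    changed = True
    while changed:                       # exhaust every applicable theorem
        changed = False
        for premises, conclusion in rules:
            if premises <= facts and conclusion not in facts:
                facts.add(conclusion)    # derive a new consequence
                changed = True
    return facts

print(saturate({"AB parallel CD", "line EF crosses both"}))
```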

Step-by-Step Details:

  1. Build the initial scene (Action: <build>): The agent formalizes the problem’s given points, lines, circles, and goals in the DSL, then asks the engine to initialize.
  • Why this step exists: It ensures the starting diagram and given facts are unambiguous.
  • Example: For a quadrilateral ABCD with point X inside, the agent encodes the constraints and target (like ā€œāˆ BXA + ∠DXC = 180Ā°ā€).
  2. Think in words, then act (Action: <propose> or <add>): The agent writes its reasoning: what subgoal to try or what helper point to add. Then it outputs a crisp DSL command.
  • Why this step exists: The natural language plan reduces randomness; the formal command ensures precision.
  • Example: ā€œI’ll try to prove ∠ABX = ∠CDXā€ (<propose>) or ā€œPlace K on the circumcircle of A, B, X and also on C, D, Xā€ (<add>). The engine returns success/failure and any new facts.
  3. Memory compression (Dynamic Memory): After each turn, the agent summarizes what worked and what failed, preserving key actions and the latest detailed feedback.
  • Why this step exists: Hundreds of turns would overflow the context; compression keeps the mind sharp.
  • Example: The memory notes: ā€œAdded K on both circles—success. Proposition ā€˜āˆ ABX = ∠CDX’—failed. Known: ABKX cyclic.ā€
  4. Long-horizon interaction: The agent can run over 200 steps on one problem, iteratively refining its plan as new facts appear.
  • Why this step exists: Many Olympiad problems need extended exploration to find the right path.
  • Example: The agent cycles through a few failed propositions, then discovers a crucial cyclic quadrilateral and angle equality that unlocks the finale.

šŸž Hook: When you’re brainstorming, it’s bad to repeat the same wrong idea over and over.

🄬 The Concept (Rejection Sampling Guard): The agent uses a simple rule-based filter to avoid repeated, too-long, or malformed outputs. How it works:

  1. Sample a candidate thought and action.
  2. Check rules (no repeats, reasonable length, valid format, varied action types).
  3. If it fails, resample.

Why it matters: Without this, the agent might get stuck repeating itself.

šŸž Anchor: It’s like a teacher saying, ā€œTry a different approach; you already did that twice.ā€
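A minimal sketch of such a guard follows, assuming made-up thresholds and the action-tag format used earlier; the paper’s exact rules may differ.

```python
import re

# Rule-based rejection guard: resample until a candidate passes checks.
def valid(candidate, history, max_len=2000):
    return (candidate not in history                        # no repeats
            and len(candidate) <= max_len                   # bounded length
            and re.match(r"<(build|propose|add)>", candidate))  # valid format

def sample_with_guard(sampler, history, tries=8):
    for _ in range(tries):
        candidate = sampler()
        if valid(candidate, history):
            return candidate
    return None   # give up; the agent would then rethink its plan

history = ["<propose>angle ABX = angle CDX</propose>"]
sampler = lambda: "<add>midpoint M of AB</add>"   # stand-in for the LLM
print(sample_with_guard(sampler, history))
```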

  5. Proposition-by-proposition proving: The agent sets subgoals that the engine proves or rejects, building a scaffold up to the main theorem.

  • Why this step exists: Climbing small, verified steps is more reliable than leaping for the end all at once.
  • Example: Prove two angles equal; then use that to prove triangles similar; then conclude parallel lines; finally finish the target angle sum.
  6. Auxiliary constructions: The agent adds candidate points/lines/circles that might reveal hidden symmetries or cycles.
  • Why this step exists: Many tasks are impossible without the right construction.
  • Example: Add a point T on AC so that ∠BDA = ∠TDC, leading to isogonal structures that break the stalemate.

šŸž Hook: Training smart is better than just training hard.

🄬 The Concept (CBRL – Complexity-Boosting RL): The agent trains on a stream of synthesized problems whose difficulty is tuned by proof length so that the average success signal stays around a sweet spot. How it works:

  1. Generate a batch near difficulty Īŗ.
  2. Run RL with simple, rule-based rewards (success of steps and final proof).
  3. If it’s too easy or too hard, nudge Īŗ.

Why it matters: Without CBRL, RL either stagnates (too easy) or fails to converge (too hard).

šŸž Anchor: Like an adaptive workout plan that increases the weight when sets get easy and backs off when you’re failing every rep.
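One plausible controller for the difficulty knob Īŗ is sketched below; the Ā±0.1 band, step size, and bounds are illustrative assumptions, not the paper’s constants (the paper ties Īŗ to proof-step length).

```python
# Sketch of a CBRL-style difficulty controller: nudge kappa so the
# measured success rate stays near a target "sweet spot".
def update_kappa(kappa, success_rate, target=0.5, step=2, lo=1, hi=100):
    if success_rate > target + 0.1:
        kappa += step          # too easy: harder problems next batch
    elif success_rate < target - 0.1:
        kappa -= step          # too hard: ease off so learning resumes
    return max(lo, min(hi, kappa))

kappa = 10
for success_rate in [0.9, 0.8, 0.7, 0.4, 0.5, 0.3]:  # fake training signal
    kappa = update_kappa(kappa, success_rate)
    print(f"success={success_rate:.1f} -> kappa={kappa}")
```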

  7. Data Synthesis Pipeline: A two-stage generator crafts problems with adjustable complexity.

  • Why this step exists: Real datasets are imbalanced, with too few expert-level items.
  • Example: Start from random DDAR predicates, add constructions with certain priors, filter by nontrivial provable goals, and select those near target proof length.

Secret Sauce:

  • Tight think–act–check loop with a strong engine keeps reasoning grounded.
  • Dynamic memory lets the agent sustain deep, multi-branch explorations.
  • Proposition proving scaffolds hard goals into manageable steps.
  • CBRL keeps learning always in the high-growth zone.

Result: The method turns open-ended exploration into disciplined discovery, enabling medalist-level geometry with a modest training budget.

04 Experiments & Results

šŸž Hook: Picture a school contest where three students tackle the same 50 toughest puzzles. One of them solves the most, using far fewer practice sheets than the others.

🄬 The Concept (The Test): The researchers used the IMO 50 (all geometry problems from 2000–2024) plus IMO 2025’s geometry problem to measure agent ability. How it works:

  1. The agent gets each problem.
  2. It’s allowed up to 200 interactive steps per attempt.
  3. At test time, it uses best-of-K sampling (Pass@K), with K up to 256.

Why it matters: Without a standard, tough benchmark and controlled budgets, we can’t meaningfully compare systems.

šŸž Anchor: It’s like giving every student the same final exam and the same amount of scratch paper.
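Pass@K is commonly computed with the unbiased estimator below (from n sampled attempts with c successes, following Chen et al., 2021); the paper may use a simpler best-of-K count, so treat this as the standard formula rather than their exact procedure.

```python
from math import comb

# Pass@K: probability that at least one of K sampled attempts succeeds,
# estimated from n total attempts containing c successes.
def pass_at_k(n, c, k):
    if n - c < k:
        return 1.0            # every size-k subset must contain a success
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=256, c=8, k=1))    # chance a single attempt solves it
print(pass_at_k(n=256, c=8, k=256))  # with the full budget: certain here
```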

šŸž Hook: Think of two top chess engines that rely on massive opening books versus a thoughtful player who calculates deeply with a good coach.

🄬 The Concept (The Competition): Baselines were AlphaGeometry 2 and SeedGeometry, state-of-the-art expert-model systems trained on hundreds of millions of examples. How it works:

  1. These systems predict constructions and then run powerful proof searches.
  2. InternGeometry uses an LLM-agent approach with far less training data.

Why it matters: Beating well-established expert systems with tiny data is a big deal for data efficiency.

šŸž Anchor: If a runner with a smart coach beats a team that trained on miles and miles of track, that coach’s plan clearly works.

The Scoreboard (with context):

  • InternGeometry solves 44/50 problems on IMO 50 (Pass@256), surpassing AlphaGeometry 2 (42/50) and SeedGeometry (43/50). That’s like scoring an A when others got solid Aāˆ’/B+.
  • It also solves the 2025 geometry problem, totaling 45/51 when that one is included.
  • Training data: about 13K examples—around 0.004% of AlphaGeometry 2’s and 0.006% of SeedGeometry’s scale. That’s a tiny fraction of the data for stronger or comparable results.
  • Long-horizon helps: Allowing up to 200 steps notably boosts success across sampling budgets. Extending trajectory length gives better returns than just sampling more attempts.

šŸž Hook: Sometimes the biggest surprise is that patience beats brute force.

🄬 The Concept (Surprising Findings): Longer interaction length was more valuable than simply increasing the number of parallel tries; also, the agent discovered novel constructions not seen in human solutions. How it works:

  1. With more steps, the agent can refine its heuristics mid-solve.
  2. Creative constructions emerged from exploration plus engine feedback.

Why it matters: It shows that thoughtful exploration and memory can outperform shallow breadth.

šŸž Anchor: Like solving a Rubik’s Cube by learning from each twist, not by trying a million random scrambles at once.

Ablations (what breaks without key parts):

  • Removing proposition steps or slow thinking or memory compression or the rejection guard all damages performance. For example, without slow thinking and memory compression, results drop from 44/50 to around 20–23/50 in tests—like falling from an A to a low C.
  • CBRL matters: Training only on easy or only on hard synthesized data underperforms the adaptive schedule. Without the curriculum, the agent often doesn’t converge well or fails to generalize to IMO level.

Takeaways:

  • Long-horizon, proposition-first strategies are crucial for geometry.
  • Adaptive difficulty is key to sample-efficient RL.
  • A small but smart training loop can outshine massive-data alternatives.

05 Discussion & Limitations

šŸž Hook: Even the best Swiss Army knife can’t replace every tool in the garage.

🄬 The Concept (Limitations): InternGeometry still struggles with problems that go beyond pure geometric proof into heavy computation or non-geometry analysis. How it works:

  1. The engine and DSL are tuned to classic Euclidean reasoning.
  2. When tasks lean into numeric optimization or advanced analysis, the expressiveness may fall short.

Why it matters: Without extending the toolset, some IMO tasks remain out of reach.

šŸž Anchor: It’s like having a perfect ruler and compass but needing a calculator for a different kind of puzzle.

Resources Needed:

  • A capable LLM backbone (here, InternThinker-32B) for long-horizon thinking.
  • The InternGeometry-DDAR engine with enriched theorems and robust construction support.
  • Compute for training (RL loops, data synthesis) and testing (up to 200 steps Ɨ Pass@K sampling).

When NOT to Use:

  • Problems that primarily require number crunching, calculus, or combinatorial arguments outside geometry.
  • Settings where you can’t run many interaction steps or store/update memory (e.g., ultra-low-latency contexts).

Open Questions:

  • How to generalize this approach to multi-branch mathematics (algebraic geometry, inequalities, combinatorics) with toolboxes beyond Euclidean engines?
  • Can we learn even better construction heuristics that transfer between diagram families?
  • What’s the best balance between model size, memory compression, and engine strength for cost-effective scaling?
  • Can outcome and step rewards be enriched (while still automatic) to accelerate RL even more?

Bottom Line: InternGeometry shows that an LLM agent with strong feedback, memory, and a right-paced curriculum can rival expert systems in geometry. But expanding beyond pure Euclidean reasoning and reducing inference cost further are meaningful next steps.

06 Conclusion & Future Work

Three-Sentence Summary: InternGeometry is a long-horizon LLM agent that solves Olympiad-level geometry by iteratively proposing constructions and propositions, verifying them with a symbolic engine, and remembering key results. It learns efficiently using Complexity-Boosting Reinforcement Learning (CBRL), which adapts problem difficulty to keep training in the high-growth zone. With only about 13K training examples, it solves 44/50 IMO geometry problems (2000–2024), surpassing expert systems that used vastly more data.

Main Achievement: Showing that an agentic LLM, tightly integrated with a geometric proof engine and trained with an adaptive curriculum, can achieve medalist-level geometry using orders-of-magnitude less data than prior expert-model approaches.

Future Directions:

  • Extend the engine and DSL to cover tasks mixing geometry with algebraic or analytic computation.
  • Design richer, still-automatic reward schemes and stronger memory tools for even longer horizons.
  • Generalize CBRL and proposition-first strategies to other branches of mathematics and symbolic reasoning.

Why Remember This: It’s a blueprint for how AI can tackle long, creative reasoning: think in words, act in formal steps, get immediate verification, remember what matters, and practice at just the right difficulty. That combination doesn’t just crack geometry; it points to a broader path for building truly capable reasoning agents.

Practical Applications

  • Create an interactive geometry tutor that proposes helpful auxiliary lines and explains why they work.
  • Assist competition training by generating targeted practice problems that steadily increase in difficulty.
  • Support architects and CAD users by auto-suggesting constructions that guarantee desired angle or length constraints.
  • Verify safety-critical geometric relationships in robotics motion planning and mechanical linkages.
  • Automate proof drafting for math education, turning student sketches into formal, checkable arguments.
  • Accelerate graphics and game design by proposing constructions that ensure symmetry or perspective constraints.
  • Help researchers explore novel constructions in advanced geometry, discovering non-obvious solution paths.
  • Provide curriculum-aligned worksheets with auto-tuned difficulty for classrooms and self-study.
  • Integrate with math contest platforms to offer stepwise feedback on proposed moves and constructions.
  • Benchmark long-horizon reasoning methods by extending the CBRL framework to other formal domains.
#InternGeometry#geometry theorem proving#auxiliary constructions#LLM agents#reinforcement learning#curriculum learning#CBRL#symbolic engine#DDAR#dynamic memory#long-horizon reasoning#IMO geometry#data synthesis#proposition proving#domain-specific language