InftyThink+: Effective and Efficient Infinite-Horizon Reasoning via Reinforcement Learning

Intermediate
Yuchen Yan, Liang Jiang, Jin Jiang et al. · 2/6/2026
arXiv

Key Summary

  • Long chains of thought make AI smarter but also slower, pricier, and limited by memory windows.
  • InftyThink+ teaches an AI to pause, summarize its progress, and continue later, so it can reason for a very long time without running out of space.
  • The trick is reinforcement learning (RL): the model tries strategies, gets rewards for right answers, and learns when to summarize and when to stop.
  • A special efficiency reward nudges the AI to solve problems in fewer iterations without sacrificing correctness.
  • On tough math tests like AIME24, InftyThink+ boosted accuracy by about 21 percentage points over the starting model and beat standard long-CoT RL.
  • It also cut inference time a lot (around 30–70% depending on setup) because each step runs in a small, fixed context.
  • The method starts with a cold start (supervised fine-tuning) to learn the format, then uses RL to learn the strategy.
  • Compared to vanilla long reasoning, InftyThink+ trains faster (up to ~40% speedup per RL step) and generalizes better to new benchmarks.
  • The big idea: optimize the whole journey (trajectory) of thinking, not just the final answer.
  • This makes reasoning models both smarter and more efficient in real-world tasks like math, coding, and science.

Why This Research Matters

This work makes AI both smarter and faster by teaching it how to pause and summarize strategically, just like a good student taking useful notes. It breaks the link between depth of thinking and memory limits, so models can tackle harder, longer problems without grinding to a halt. Because the training optimizes the whole journey—not just the final answer—the AI learns durable habits that generalize across tasks. Real-world users get quicker, more accurate help on math, coding, and science without paying huge compute costs. The approach also speeds up RL training itself, letting teams improve models more rapidly. In short, it helps move AI from “think longer” to “think smarter,” which scales better everywhere from classrooms to research labs.

Detailed Explanation


01Background & Problem Definition

🍞 Hook: Imagine you’re solving a giant maze. If you try to remember every single turn forever, your brain gets overloaded and you slow down. But if you pause sometimes, jot a short note like “left at the red door, then up the stairs,” and keep going, you can handle much bigger mazes.

🥬 The Concept 1: Reinforcement Learning (RL)

  • What it is: RL is a way for AI to learn by trying things and getting rewards for good outcomes.
  • How it works:
    1. The AI tries a strategy.
    2. It gets a reward if it solves the problem correctly.
    3. It updates its behavior to do better next time.
  • Why it matters: Without rewards, the AI just imitates patterns it saw before; it doesn’t learn better strategies. 🍞 Anchor: Like a student who learns which study habits raise their grades because good report cards act like rewards.

🥬 The Concept 2: Iterative Reasoning

  • What it is: A think-in-steps method where the AI pauses, checks progress, summarizes, and continues.
  • How it works:
    1. Think for a bit.
    2. Summarize key points.
    3. Use that summary as the starting point for the next bit of thinking.
  • Why it matters: Without iteration, long problems overflow the AI’s memory window and get messy. 🍞 Anchor: Building a LEGO castle in layers: finish a floor, write a short note on the blueprint, then build the next floor.

🥬 The Concept 3: Task Reward

  • What it is: A score for getting the answer right.
  • How it works:
    1. Check if the final answer matches the ground truth.
    2. Give 1 point if correct, 0 if not.
  • Why it matters: It keeps the AI focused on accuracy, not just style. 🍞 Anchor: Getting a gold star only if your math answer is correct.
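In code, the task reward is just a binary correctness check. A minimal sketch, where exact string matching stands in for the paper's automatic math verifier (real verifiers are more tolerant of formatting):

```python
def task_reward(model_answer: str, ground_truth: str) -> float:
    """Return 1.0 if the final answer matches the ground truth, else 0.0.

    Exact string comparison is an illustrative stand-in for a proper
    math verifier, which would normalize equivalent expressions.
    """
    return 1.0 if model_answer.strip() == ground_truth.strip() else 0.0
```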

🥬 The Concept 4: Efficiency Reward

  • What it is: A score for solving the problem with fewer steps.
  • How it works:
    1. Count the number of reasoning rounds.
    2. Give a higher score when the AI uses fewer rounds (but only if it’s correct).
  • Why it matters: Without it, the AI might always think forever to be safe. 🍞 Anchor: Finishing a race quickly and cleanly gets you bonus points.
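A minimal sketch of this idea, with the correctness gate built in so that rushing to a wrong answer earns nothing. The linear schedule and the default round cap are illustrative choices, not the paper's exact formula:

```python
def efficiency_reward(correct: bool, num_rounds: int, max_rounds: int = 5) -> float:
    """Bonus that shrinks as rounds grow; zero whenever the answer is wrong.

    A linear schedule (1.0 at one round, 0.0 at the cap) is one
    illustrative choice, not the paper's exact formulation.
    """
    if not correct:
        return 0.0
    return max(0.0, (max_rounds - num_rounds) / (max_rounds - 1))
```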

🥬 The Concept 5: Summarization

  • What it is: Compressing many thoughts into the few most important ideas.
  • How it works:
    1. Pick the key steps and conclusions.
    2. Write them concisely so they fit in a small space.
    3. Reuse them to continue thinking later.
  • Why it matters: Without good summaries, later steps forget crucial facts or repeat themselves. 🍞 Anchor: After reading a chapter, you write three bullet points so you remember the plot next time.

The World Before: Big reasoning models got better by “thinking longer” during inference. But this had three big problems:

  • Cost explosion: Transformers’ attention gets much more expensive as text gets longer. Long chains are slow and pricey.
  • Hard limits: Every model has a max context window. If the chain goes past it, the model can’t continue.
  • Lost-in-the-middle: Important early details become hard to use when buried in the middle of a huge context.

Failed Attempts and the Gap: People tried pruning tokens, compressing hidden states, or chopping text into fixed chunks. These helped a bit but missed three key decisions: when to summarize, what to keep, and how to continue. Supervised learning (just imitate examples) teaches the format but not the strategy. If an early summary drops a key constraint, everything after can fail—and SFT can’t fix that because it never sees outcome-based feedback.

What This Paper Fills: InftyThink+ adds end-to-end RL to the iterative reasoning setup. First, it cold-starts with SFT so the model learns how to produce iterations and summaries. Then, RL optimizes the whole journey: it learns the timing, the content of summaries, and how to resume effectively. The reward design includes both correctness (task) and efficient solving (efficiency), applied to the entire trajectory.

Real Stakes: This matters for everyday uses—math homework, coding bugs, science questions, planning long tasks—because we want AI that can think deeply without hitting memory walls or wasting time. With InftyThink+, the model learns to pause smartly, pack the right info into summaries, and move on quickly. That means better answers, faster responses, and lower cost.

🍞 Anchor: Think of preparing for a school debate. Instead of memorizing everything word-for-word, you boil each section down to three bullet points. During the debate, you flip through those bullets to keep your argument sharp and on track—fast, focused, and effective.

02Core Idea

🍞 Hook: You know how great hikers mark their trail with short notes so they can go further without getting lost? Those notes let them travel way beyond what they could remember in their heads.

🥬 The Concept 6: InftyThink Paradigm

  • What it is: A way for AI to reason in multiple rounds, each ending with a compact summary that the next round uses.
  • How it works:
    1. Round i: Reason for a bit within a small, fixed context.
    2. Write a short summary of the essentials.
    3. Start Round i+1 using only the question and the latest summary.
    4. Stop when ready to give the final answer.
  • Why it matters: It separates total depth from per-step memory and cost—so the AI can, in principle, think indefinitely. 🍞 Anchor: Like writing a travel log each day so tomorrow’s plan is simple and you don’t carry your entire diary on every hike.
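The round loop above can be sketched as follows, assuming a hypothetical `generate(prompt)` helper that runs one bounded-context round, and using the `<summary>`/`<history>` tags from the paper's format. A round that emits no summary is treated as the final answer:

```python
import re

def infty_rollout(question: str, generate, max_rounds: int = 5):
    """Run InftyThink-style rounds: carry only the latest summary forward."""
    summary, output = "", ""
    for round_idx in range(max_rounds):
        # Each round sees only the question plus the most recent summary.
        prompt = question if not summary else f"{question}\n<history>{summary}</history>"
        output = generate(prompt)
        match = re.search(r"<summary>(.*?)</summary>", output, re.DOTALL)
        if match is None:             # no summary tag: model gave a final answer
            return output, round_idx + 1
        summary = match.group(1)      # compact state for the next round
    return output, max_rounds         # hit the round cap (phi)
```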

The “Aha!” in one sentence: Don’t just teach the AI to format iterative thinking—use reinforcement learning to optimize the entire journey: when to pause, what to keep, and how to continue.

Three Analogies:

  1. Backpacking: Pack only essentials (summary) each day so you can hike longer without getting weighed down.
  2. Cooking marathon: After each dish, jot a tiny checklist of what worked; use it to speed up the next dish.
  3. Chess tournament: After each game, record key patterns; use those notes to improve the next game without replaying every move in your head.

Before vs. After:

  • Before: Long, single-chain reasoning gets slower and hits context limits; important details get buried.
  • After: Iterative rounds keep each step short, with a summary that preserves what matters; RL tunes timing and content so continuation stays sharp and efficient.

🥬 The Concept 7: Trajectory-level Learning

  • What it is: Learning from the whole sequence of steps, not just the final line.
  • How it works:
    1. Generate multiple rounds (a trajectory) for one problem.
    2. Give a single reward to that whole trajectory (correctness, and optionally efficiency).
    3. Share that reward back to all the tokens across all rounds so early good decisions get credit.
  • Why it matters: Without trajectory learning, great early summaries get zero credit, so the model won’t learn to make them. 🍞 Anchor: Your team wins a relay race; everyone on the team gets a medal because early runners set up the victory.
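A minimal sketch of this shared-advantage idea in a GRPO-like setup: rewards are normalized within the group of trajectories sampled for one question, and each trajectory's single advantage is broadcast to every one of its tokens. The normalization details are illustrative:

```python
def shared_advantages(rewards, token_counts):
    """Group-normalize one reward per trajectory, broadcast to all tokens."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # avoid dividing by zero when all rewards tie
    # Every token in a trajectory receives that trajectory's advantage,
    # so early summaries share credit for a correct final answer.
    return [[(r - mean) / std] * n for r, n in zip(rewards, token_counts)]
```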

Why It Works (intuition, not equations):

  • Information focus: Each summary acts like a tiny “memory” that preserves only the most answer-relevant bits. That reduces clutter, avoids lost-in-the-middle, and gives the model a stable state to build on.
  • Cost control: Because each round reads only the question plus the latest short summary, compute stays bounded per round, even as total depth grows.
  • Strategic timing: RL teaches the model to pause and summarize at helpful moments—often after reaching a mini-conclusion—so the next round starts from a strong position.
  • Safe efficiency: The efficiency reward only applies when the final answer is correct, so the model won’t rush to be short at the cost of being wrong.

Building Blocks of InftyThink+:

  • Cold start (SFT): Teaches the model the structure—how to produce rounds and summaries—so it doesn’t flail.
  • RL rollouts with a cap on rounds: Ensures training stays stable and affordable while exploring strategies.
  • Reward design: Task reward for correctness; multiplicative efficiency reward to favor fewer, smarter rounds only when accurate.
  • Shared advantages: The same advantage signal is broadcast to all tokens in the trajectory so early steps are properly credited.
  • Stability tricks: Masking tokens where backend mismatches happen keeps training steady.

🍞 Anchor: Picture a math bee where you take brief notes after each part-solution. A coach reviews the entire attempt and says, “This whole run was great and fast,” so you learn not just the final statement, but exactly when your note-taking and pacing worked best.

03Methodology

At a high level: Question → (Cold Start SFT to learn the format) → (RL with multi-round rollouts) → Final Answer.

Step-by-step (like a recipe):

  1. Transform Data into Iterative Format (Cold Start Prep)
  • What happens: Take existing long chain-of-thought (CoT) examples and split them into chunks of reasoning, then generate short summaries after each chunk.
  • Why this step exists: The model must first learn how an iterative conversation looks—where summaries appear and how they connect rounds. Without it, RL would struggle because the model can’t even produce well-formed rounds.
  • Example: A math solution of 10,000 tokens is split into ~6,000-token segments (η=6k). For each segment, a small but strong LLM writes a concise summary capped at ~1,000 tokens (γ=1k).
  2. Supervised Fine-Tuning (SFT) to Learn the Format
  • What happens: Fine-tune the model to imitate these iterative samples: Round 1 (reason + summary), Round 2 (use previous summary + reason + new summary), …, Final Round (use last summary + reason + conclusion).
  • Why this step exists: This “cold start” teaches the syntax and rhythm of iterative reasoning. Without it, RL wastes time just learning to output the right structure.
  • Example: The model learns to output <summary>…</summary> at the right times and to place <history>…</history> correctly.
  3. Reinforcement Learning (RL) to Learn Strategy
  • What happens: For each question, the model rolls out multiple rounds, each time seeing only the question and the latest summary. It stops when it decides to give the final answer or when it hits the round cap (φ=5). The whole trajectory gets a reward for correctness, and optionally a bonus that is higher when fewer rounds were used.
  • Why this step exists: Strategy—in particular, when to pause, what to preserve, and how to continue—requires outcome feedback. SFT can’t provide that.
  • Example: If the model solves a problem in 2 rounds (correct), it gets a higher reward than solving it in 5 rounds (also correct) because of the efficiency term, but anything incorrect gets 0.
  4. Optimization Details that Make It Work
  • Shared advantages: Every token in the trajectory gets the same normalized advantage for that query, so early summaries that led to success are rewarded.
  • Clipping: Use a PPO-like clipped objective to prevent unstable updates.
  • History masking: History tokens (summaries carried forward) are not optimized; only newly generated text is trained.
  • Stability (IcePop): Mask tokens when inference/training engines disagree too much on probabilities, improving robustness.
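These optimization details can be sketched as one token-masked, PPO-style clipped objective term. The function below is an illustrative simplification, not the paper's exact loss: masked tokens (carried-over history, or tokens flagged by IcePop-style checks) contribute nothing:

```python
def clipped_objective(ratios, advantage, mask, eps=0.2):
    """Average a PPO-style clipped term over unmasked tokens.

    `ratios` are per-token new/old probability ratios, `advantage` is the
    shared trajectory-level advantage, and `mask` zeroes out history or
    mismatched tokens so only newly generated text is trained.
    """
    total, count = 0.0, 0
    for r, m in zip(ratios, mask):
        if not m:
            continue                          # skip masked tokens entirely
        clipped = min(max(r, 1 - eps), 1 + eps)
        total += min(r * advantage, clipped * advantage)
        count += 1
    return total / max(count, 1)              # maximize this quantity
```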

Inputs and Outputs

  • Input: The user’s question; for rounds >1, also the previous summary (inserted as history).
  • Output per round: A reasoning snippet and either a summary (continue) or a final conclusion (stop).
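The cold-start data transformation from step 1 (splitting a long solution into segments, each paired with a capped summary) can be sketched as follows. Here `summarize` is a hypothetical stand-in for the small LLM the paper uses, and the tail-of-segment fallback is purely illustrative:

```python
def to_iterative_samples(cot_tokens, eta=6000, gamma=1000, summarize=None):
    """Split a long chain of thought into round-by-round training samples.

    Each sample carries the previous summary as history, at most `eta`
    reasoning tokens, and a summary capped at `gamma` tokens.
    """
    samples, prev_summary = [], ""
    for start in range(0, len(cot_tokens), eta):
        segment = cot_tokens[start:start + eta]
        # Illustrative fallback: keep the segment's tail if no LLM is wired in.
        summary = summarize(segment, gamma) if summarize else segment[-gamma:]
        samples.append({"history": prev_summary,
                        "reasoning": segment,
                        "summary": summary})
        prev_summary = summary
    return samples
```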

Concrete Walkthrough (with numbers):

  • Suppose a geometry problem needs 3 key steps: define variables, derive a relation, and compute the numeric answer. Round 1: The model defines variables and gets a mini result. It writes a short summary: “Set x=…, found relation A>B.” Round 2: It uses that summary, derives a new relation, and writes: “Combined with earlier, we got x=… .” Round 3: It uses the latest summary to compute the final number and outputs the conclusion instead of a summary.

Secret Sauce

  • Decoupling format from strategy: First learn the dance steps (SFT), then learn when to dance fast or slow (RL). This separation makes training stable yet powerful.
  • Multiplicative reward: Efficiency counts only when the answer is correct, preventing the model from gaming the system by stopping too early.
  • Trajectory-level credit: Early choices (like a crisp summary) get credit, which is vital for learning when/what to summarize.

Extra Sandwiches for Key Pieces:

🍞 Hook: You know how a coach reviews your whole season, not just the championship game?

🥬 Trajectory-level Learning

  • What it is: Learning from the entire path of attempts, not just the final moment.
  • How it works:
    1. Roll out several rounds.
    2. Score the whole journey.
    3. Share that score with all steps that led there.
  • Why it matters: Early good habits get reinforced; early mistakes get fixed. 🍞 Anchor: Every runner in a relay earns the team trophy.

🍞 Hook: You know how you first learn the rules of chess before mastering strategy?

🥬 Cold Start (SFT)

  • What it is: A supervised stage that teaches the model the structure and signals of iterative reasoning.
  • How it works:
    1. Convert long solutions into round-by-round samples.
    2. Fine-tune the model to output the right tags, summaries, and conclusions.
  • Why it matters: Without knowing the basics, the model can’t benefit from RL. 🍞 Anchor: Drill the moves before playing in a tournament.

04Experiments & Results

The Test: The authors asked, “Does optimizing the whole iterative journey make models more accurate and faster?” They measured:

  • Accuracy: Did the model get the right answer?
  • Tokens: How many tokens were generated (a proxy for reasoning length)?
  • Latency: How many seconds did inference take?

They used tough math and science benchmarks like AIME24, AIME25, MATH500, and GPQA_diamond.

The Competition: They compared five settings on two base models (DeepSeek-R1-Distill-Qwen-1.5B and Qwen3-4B-Base):

  • Vanilla (no iteration), SFT only.
  • Vanilla + RL (task reward only).
  • InftyThink+ (iterative), SFT only.
  • InftyThink+ + RL (task reward only).
  • InftyThink+ + RL (task + efficiency rewards).

Scoreboard with Context (DeepSeek-R1-Distill-Qwen-1.5B):

  • AIME24 accuracy:
    • Vanilla SFT: 26.67% (like a D).
    • Vanilla + RL (task): 38.75% (big jump, to a C+).
    • InftyThink+ SFT: 29.48% (a bit better than vanilla SFT).
    • InftyThink+ + RL (task): 50.94% (another big jump; now near a solid B to B+).
    • InftyThink+ + RL (task+efficiency): 43.96% (a bit lower than task-only, but much faster and shorter).
  • Latency on AIME25:
    • Vanilla SFT: 134.34s.
    • InftyThink+ SFT: 98.10s (already faster despite slightly more tokens).
    • InftyThink+ + RL (task+efficiency): 68.39s (about 32.8% faster than vanilla baseline).
  • Out-of-distribution (GPQA_diamond) accuracy:
    • Vanilla SFT: 29.40%.
    • InftyThink+ SFT: 32.31%.
    • InftyThink+ + RL (task): 37.50%.
    • InftyThink+ + RL (task+efficiency): 35.46%.

Scoreboard with Context (Qwen3-4B-Base):

  • AIME24 accuracy:
    • Vanilla SFT: 44.06%.
    • Vanilla + RL (task): 50.31%.
    • InftyThink+ SFT: 43.65%.
    • InftyThink+ + RL (task): 52.29% (best).
    • InftyThink+ + RL (task+efficiency): 49.06% (slightly lower, but notably faster).
  • Latency averages drop notably under InftyThink+ SFT and drop further under task+efficiency.

Make the Numbers Meaningful:

  • Think of the 21-point AIME24 jump with InftyThink+ + task-RL over its SFT baseline like moving from barely passing to comfortably passing a tough contest.
  • The latency drops are like getting your results in half the time without paying extra.
  • Tokens fall dramatically under the task+efficiency setting, meaning the AI learns to be concise when it can, without throwing away correctness.

Surprising Findings:

  • Even before RL, InftyThink+ often runs faster than vanilla, despite using similar or more tokens. Why? Each round’s context is short and bounded, so compute per token is cheaper.
  • Replacing model-made summaries with external summaries helps in SFT-only mode but hurts after RL. This shows RL tailors summaries to the model’s own continuation style—an end-to-end coupling that’s stronger than generic summaries.
  • Adaptive timing beats fixed or random timing for when to summarize; RL makes that gap bigger, showing the model learns to pause at just the right moments.

Training Efficiency:

  • RL per-step time: Vanilla long-context RL ~300s/step; InftyThink+ task-RL ~225s/step (~25% speedup); InftyThink+ task+efficiency ~175s/step (~40% speedup). That means you can train on more data in the same time budget.

Bottom Line: Across datasets and models, InftyThink+ with RL reliably raises accuracy more than vanilla RL and trims latency substantially. Adding the efficiency reward trades a small bit of accuracy for large speed gains—great for settings where time and cost matter.

05Discussion & Limitations

Limitations (honest take):

  • Stage-like tasks work best: InftyThink+ assumes you can break problems into steps where each step’s key info fits a short summary. Tasks with very tangled, flowing reasoning might benefit less.
  • Natural-language summaries can be fuzzy: Text is flexible but can hide what’s most important. There’s no explicit structure for priorities or constraints, so the model must re-interpret them each round.
  • Cold-start dependence: You need a data transformation pipeline (splitting reasoning, generating summaries) to teach the format before RL. That’s extra engineering when moving to new domains.
  • Iteration cap: A fixed cap (like φ=5) balances cost and depth, but some problems may want more rounds.

Required Resources:

  • GPUs for SFT and RL (the paper used up to 32 H200s in one setup), an inference engine (e.g., SGLang), and a training stack that supports RL with token masking and group-based objectives.
  • Verifiers for rewards (e.g., math checkers) to assign correctness automatically.

When NOT to Use:

  • Very short, easy tasks where one-shot answers are already perfect—iteration adds overhead.
  • Tasks where meaning depends on the exact full long text (like creative storytelling continuity) rather than distilled facts.
  • Settings without good automatic verifiers; if you can’t measure correctness, RL has weak signals.

Open Questions:

  • Better summaries: Would semi-structured or hybrid (symbolic + vector) summaries give clearer constraints and even better continuation?
  • Adaptive caps: Can the model learn its own dynamic limit on rounds instead of a fixed φ?
  • Multi-objective shaping: How should we tune efficiency rewards across different domains (math vs. code vs. science) for optimal trade-offs?
  • Verification at scale: How do we design robust, low-noise verifiers for broader tasks beyond math?
  • Beyond text: Could we extend summaries to include tool results, graphs, or code snippets in a principled way?

06Conclusion & Future Work

Three-Sentence Summary: InftyThink+ turns very long, costly chains of thought into many short, well-connected rounds by adding summaries and then uses reinforcement learning to optimize when to summarize, what to keep, and how to continue. This trajectory-level training, powered by correctness and efficiency rewards, raises accuracy beyond standard long-CoT RL while slashing inference latency and training time. The method consistently generalizes across models and benchmarks, showing that strategy—not just length—drives better reasoning.

Main Achievement: Cleanly separating format learning (via cold-start SFT) from strategy learning (via trajectory-level RL) and pairing it with a reward design that values both correctness and concise reasoning. This leads to large accuracy gains (e.g., +21 points on AIME24) and big efficiency wins (30–70% latency cuts), all by teaching the model to make smart pauses and powerful summaries.

Future Directions: Explore richer, more structured summary types; bring the approach to long-horizon agents that use tools and memory; create adaptive iteration policies; and broaden reliable verifiers beyond math. These steps could make iterative reasoning even more robust, controllable, and scalable.

Why Remember This: InftyThink+ shows that the secret to “infinite” reasoning isn’t longer text—it’s smarter management of what to remember and when to pause, learned end-to-end. That shift—from length to strategy—can make next-generation reasoning systems both sharper and faster in classrooms, labs, and real-world applications.

Practical Applications

  • Math tutoring systems that solve multi-step problems with concise, reusable summaries.
  • Coding assistants that debug in rounds, summarizing hypotheses and test results to converge faster.
  • Scientific assistants that track assumptions and intermediate findings across long analyses.
  • Planning agents that break goals into steps and carry forward only the key constraints.
  • Tool-using agents that summarize retrieved evidence before the next query to avoid context bloat.
  • Customer support bots that condense long chat histories into short, accurate case summaries before troubleshooting.
  • Education apps that teach students to write effective study notes by mirroring the AI’s summaries.
  • Automated graders/verifiers that guide RL training with correctness checks at scale.
  • Research copilots that keep compact logs of experiments and decisions to enable week-long projects.
  • Edge/low-latency deployments where bounded-per-round context keeps responses fast and affordable.
#Iterative reasoning #Reinforcement learning for LLMs #Trajectory-level optimization #Summarization for reasoning #Efficiency reward #Task reward #InftyThink paradigm #Group Relative Policy Optimization (GRPO) #Cold-start SFT #Context window limits #Lost-in-the-middle #Information bottleneck (intuitive) #Inference latency reduction #Shared advantages #Token masking (IcePop)