Turn-PPO: Turn-Level Advantage Estimation with PPO for Improved Multi-Turn RL in Agentic LLMs
Key Summary
- Turn-PPO is a new way to train chatty AI agents that act over many steps, by judging each conversation turn as one whole action instead of judging every single token.
- It replaces GRPO's noisy, sample-only scoring with PPO's learned critic, which gives steadier and fairer feedback for long missions.
- Modeling the problem at the turn level matches how agents really work (decide, act, see result), so the value function learns better and advantage estimates become more accurate.
- On two tough multi-step tasks (WebShop and Sokoban), Turn-PPO trained more stably and often scored higher than both GRPO and token-level PPO.
- Turn-PPO prevents update explosions by clipping at the turn level, stopping risky big jumps for entire turns that would destabilize training.
- Ablations show PPO needs careful settings: the critic learns faster than the actor, batches should be diverse, and turn-level GAE works best with γ≈0.99 and λ≈0.9.
- GRPO frequently collapsed on long-reasoning runs because it gives the same advantage to all tokens and suffers high-variance estimates in multi-turn worlds.
- Turn-PPO costs about the same compute per step as token-PPO but turns that compute into more reliable learning.
- This approach offers a practical recipe for building more capable tool-using, multi-turn LLM agents.
Why This Research Matters
Reliable multi-turn agents power practical tools like web assistants, UI automation, and planning bots. Turn-PPO makes these agents learn steadily instead of crashing, which means fewer failures and better results in real workflows. By judging whole turns, the method matches how real environments change, improving credit assignment and planning. This stability lets developers train larger, smarter agents that handle long missions with fewer tweaks. In everyday life, that could mean assistants that actually complete tasks end to end (finding items, filling forms, booking services) without getting stuck. For businesses, it translates to higher success rates, lower iteration costs, and safer deployments.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook) You know how when you play a board game, you don't score points for each tiny thought you have, but for each turn you take and how it changes the game? Judging turns, not tiny thoughts, makes the score fair.
🥬 Filling (The Actual Concept)
- What it is: Training large language model (LLM) agents for real multi-step tasks (like shopping on a website or solving a puzzle game) works best when we treat each turn (one full decision and action) as the basic unit of learning, not each token.
- How it works: Historically, people used reinforcement learning (RL) with token-level updates (every word counts as an action). But in multi-turn tool use, the world changes in big chunks at each turn (you click, the website responds). So researchers reframed the problem to learn per turn using PPO with a critic that estimates how good each whole turn was, which stabilizes long-horizon training.
- Why it matters: Without turn-level learning, the training signal becomes noisy and unfair: some tokens get blamed or praised equally even if they mattered very differently, and the model can crash during training.
🍞 Bottom Bread (Anchor) Imagine a shopping helper bot: "search" is one turn, then "click a product" is another, then "choose size," then "buy now." Scoring each whole move makes more sense than scoring every character it typed.
Now let's introduce the key ideas in the right order, using the Sandwich pattern each time we first meet a new concept.
- 🍞 Reinforcement Learning (RL)
- Hook: Imagine training a puppy with treats. Good actions get rewards; bad actions donât.
- What it is: RL is a way for AI to learn by trying actions and using rewards to get better over time.
- How it works:
- The agent sees a situation.
- It picks an action.
- The world reacts and gives a reward.
- The agent updates its strategy to earn more reward next time.
- Why it matters: Without rewards and trial-and-error, the agent can't learn what actually works in the real world.
- Anchor: A web-browsing bot tries "search," sees results, and learns which searches lead to successful purchases (a toy version of this loop is sketched below).
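To make the reward-driven loop concrete, here is a tiny, self-contained sketch of trial-and-error learning; the `ToyEnv` environment, its reward rule, and the simple value-tracking strategy are invented for illustration and are not part of the paper.

```python
import random

class ToyEnv:
    """A made-up one-step environment: reward 1.0 if the agent chooses 'search'."""
    def reset(self):
        return "user wants: vintage camera"           # the situation the agent sees

    def step(self, action):
        reward = 1.0 if action == "search" else 0.0   # the world reacts with a reward
        return "results page", reward, True           # next state, reward, episode done

env = ToyEnv()
action_values = {"search": 0.0, "click": 0.0}         # the agent's current strategy

for episode in range(100):
    state = env.reset()
    # Mostly pick the best-looking action, sometimes explore a random one.
    if random.random() < 0.2:
        action = random.choice(list(action_values))
    else:
        action = max(action_values, key=action_values.get)
    _, reward, _ = env.step(action)
    # Nudge the estimate for the chosen action toward the reward it earned.
    action_values[action] += 0.1 * (reward - action_values[action])

print(action_values)   # "search" ends up valued higher, so the agent learns to prefer it
```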
- 🍞 Markov Decision Process (MDP)
- Hook: Think of a treasure hunt where your next step depends on where you are now and the move you choose.
- What it is: An MDP is a formal way to describe decision-making over steps: states, actions, transitions, and rewards.
- How it works:
- State: what the agent knows now.
- Action: what it does next.
- Transition: how the world changes.
- Reward: how good that move turned out.
- Why it matters: Without the MDP map, learning becomes guessy and tangled, especially over many steps.
- Anchor: In WebShop, the "state" is the page and history, the "action" is click/search, the "reward" is buying the right item.
- 🍞 Multi-Turn Tasks
- Hook: You don't solve a maze in one step; you take many turns.
- What it is: Multi-turn tasks need several decisions in sequence, where each choice affects the next.
- How it works:
- Take a turn (decide and act).
- See the new situation.
- Repeat, building on history.
- Why it matters: Without modeling turns, agents forget context and can't plan ahead.
- Anchor: A bot that shops must search, skim items, pick one, then buy; each is a separate turn.
- 🍞 Long-Horizon Reasoning
- Hook: Planning a road trip across many cities needs decisions that work well far into the future.
- What it is: Long-horizon reasoning means planning across many steps where early choices shape later options.
- How it works:
- Keep track of goals over time.
- Choose actions that help later, not just now.
- Adjust the plan as the world reacts.
- Why it matters: Without it, agents chase short-term wins and fail long missions.
- Anchor: In Sokoban, a single wrong push can block the goal many moves later, so early care is vital.
The World Before: LLMs used RL to learn tool use and interactive skills. Many systems adapted a method called GRPO, which scores whole trajectories by comparing multiple rollouts and then gives that same normalized score to every token in the response. This worked reasonably in single-turn question answering but struggled in true multi-turn agent settings.
The Problem: Multi-turn worlds produce uneven turns: some turns matter a lot (e.g., selecting the right product), others a little (e.g., a harmless note). GRPO's "one score for all tokens" ignores this. Also, sampling many rollouts in dynamic environments adds noise, making training unstable, especially when chains of reasoning are long.
Failed Attempts: Tweaks to GRPO (removing standard deviation in normalization, removing KL regularization, and increasing batch diversity) gave only small, temporary gains. Crashes still happened.
The Gap: We needed an advantage estimation that (a) understands turns, not tokens, and (b) reduces variance with a learned value function (critic), not only sample comparisons.
The Fix Introduced: Turn-PPO reframes the MDP so each whole turn is the action. A PPO critic learns turn values; generalized advantage estimation (at the turn level) provides stable, accurate credit assignment. Result: more robust training and better scores on WebShop and Sokoban.
Why You Should Care: Reliable multi-turn agents power web assistants, UI automation, and planning robots. Making them learn stably means fewer crashes, better choices, and more helpful everyday AI.
02 Core Idea
🍞 Top Bread (Hook) Imagine grading a soccer player by each pass they make (every tiny touch) versus grading each play they run (one cohesive turn). Which grade matches the game better? The play-level grade!
🥬 Filling (The Actual Concept)
- What it is: The key insight is to train LLM agents with PPO at the turn level (Turn-PPO), so one complete turn (full response) is the action, and a learned critic estimates how good that turn was.
- How it works (recipe):
- Define states as all history plus the current query; define actions as the entire turn's response.
- Use a value function (critic) to predict how promising each turn is.
- Compute turn-level advantages with generalized advantage estimation (GAE).
- Apply PPO's clipped update on whole turns to avoid unstable jumps.
- Why it matters: Without turn-level framing, the critic sees mismatched token-by-token transitions (some tiny, some huge), learns blurry values, and advantage estimates get noisy; training derails, especially on long tasks.
🍞 Bottom Bread (Anchor) A shopping agent deciding "click this product" should be scored on that entire decision, not on each word it used to say "click."
Three Analogies for the Same Idea
- Sports Playbook: Don't grade every footstep; grade the finished play. Turn-PPO grades per turn, not per token.
- Cooking Steps: Taste after each step (saute, simmer, bake) rather than after each grain of salt. The critic judges step-sized chunks (turns).
- School Projects: Assess the whole project checkpoint, not every keystroke you typed. Advantage is the improvement from one checkpoint to the next.
Before vs After
- Before (Token-MDP + GRPO): All tokens got the same trajectory score, even if they mattered very differently. High variance across multi-turn environments made training fragile. Critics (when used with token-level PPO) struggled because state transitions were inconsistent: some steps were a single generated token, others a whole tool output.
- After (Turn-MDP + PPO): The critic learns meaningful turn values. Advantages line up with actual decision boundaries. PPO's clipping at the turn level prevents the wild updates that often crash training.
Why It Works (Intuition, no equations)
- The world changes between turns, not tokens: tools reply in big chunks; decisions land in full responses. Making turns the action aligns learning with real transitions, so the critic's job becomes easier and its estimates sharper.
- Learned critic beats sample-only baselines: Instead of relying on noisy rollout groups, the critic smooths estimates over many experiences, lowering variance.
- Clipping at the right granularity: If a new policy would change an entire turn too much, the update is clipped. This keeps learning on a safe path.
Building Blocks (with Sandwich mini-explanations)
- 🍞 Advantage Estimation
- Hook: Choosing a snack is easier if you compare it to other snacks you could have had.
- What it is: Advantage tells how much better an action was compared to the average expected outcome from that state.
- How it works: Predict state value (critic), compare the actual outcome to that prediction, and use the difference as learning signal.
- Why it matters: Without advantage, updates are too noisy or biased.
- Anchor: If buying this product beat your usual success rate, the advantage is positive.
- 🍞 Proximal Policy Optimization (PPO)
- Hook: When trying a new skateboard trick, you don't change your style wildly; you adjust in small safe steps.
- What it is: PPO is an RL method that updates policies with a clipping rule to avoid overly large, risky changes.
- How it works: Measure how the new policy differs from the old on taken actions, clip big jumps, and learn steadily.
- Why it matters: Without clipping, training can explode and collapse.
- Anchor: If the agent suddenly favors a weird turn too strongly, PPO clips that jump.
- 🍞 Group Relative Policy Optimization (GRPO)
- Hook: Judging a performance by ranking it only among the small group you watched can be unfair and bouncy.
- What it is: GRPO scores actions by normalizing trajectory rewards within a group of sampled rollouts, removing the critic.
- How it works: Sample multiple rollouts, compute relative scores, assign the same score to all tokens in a trajectory.
- Why it matters: Without a learned critic and turn-aware credit, scores can be high-variance and misassigned.
- Anchor: In long tasks, one great decision and many filler tokens all get the same credit.
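For contrast with the critic-based approach that follows, here is a minimal sketch of GRPO-style group-normalized scoring, assuming each rollout produces one scalar trajectory reward; the function name and epsilon constant are illustrative.

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-8):
    """Score each rollout relative to its sampled group (no learned critic).

    Every token in rollout i then inherits the same advantage a[i],
    no matter which individual turns actually drove the outcome.
    """
    rewards = np.asarray(group_rewards, dtype=np.float32)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Four rollouts of the same question: one strong trajectory, three weak ones.
print(grpo_advantages([0.9, 0.2, 0.3, 0.25]))
```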
- 🍞 Token-MDP vs Turn-MDP
- Hook: Counting each letter you write as a separate action is weird; counting each sentence (a full thought) feels right.
- What it is: Token-MDP treats each token as an action; Turn-MDP treats each full turn as one action.
- How it works: Token-MDP updates at every token; Turn-MDP updates per full response, matching real environment steps.
- Why it matters: Without Turn-MDP, critics face mismatched transitions and learn fuzzy values.
- Anchor: In WebShop, the environment changes after your click (a turn), not after each word you typed.
Put together, these blocks create Turn-PPO: a turn-aware, critic-guided, safely-updated learning method that fits multi-turn agents like a glove.
03 Methodology
At a high level: Input (multi-turn environment state) → Build turn-level state (history + current query) → Actor generates full-turn response → Environment returns next state and reward → Critic estimates values → Compute turn-level advantages (GAE) → PPO clipped updates (actor and critic) → Output: a more capable, stable multi-turn agent.
Step-by-step with Sandwich explanations where new ideas appear:
- 🍞 Turn-Level State Builder
- Hook: Before making your next move in chess, you look at the whole board plus the last moves.
- What it is: The state is the entire conversation and tool results so far, plus the current query for this turn.
- How it works:
- Concatenate all past turns: (query_1, response_1, …, query_{n-1}, response_{n-1}).
- Add the current query_n.
- That combined context is the turn-n state.
- Why it matters: Without full history, the agent can't connect earlier choices to current decisions.
- Anchor: In WebShop, the state includes past searches, clicked products, and the current page's prompt (a small helper that assembles this state is sketched below).
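Here is a minimal sketch of assembling the turn-n state from history; the bracketed message format and the function name are assumptions made for illustration, not the paper's exact prompt template.

```python
def build_turn_state(history, current_query):
    """Concatenate all past (query, response) pairs plus the current query.

    history: list of (query, response) tuples for turns 1 .. n-1
    current_query: the environment's prompt for turn n
    Returns one string that the policy LLM conditions on for this turn.
    """
    parts = []
    for i, (query, response) in enumerate(history, start=1):
        parts.append(f"[Turn {i} query] {query}")
        parts.append(f"[Turn {i} response] {response}")
    parts.append(f"[Current query] {current_query}")
    return "\n".join(parts)

state = build_turn_state(
    history=[("Find a vintage camera under $50.", "search[vintage camera]")],
    current_query="Results list shown. Which item do you click?",
)
```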
- 🍞 Actor: Full-Turn Action Generation
- Hook: When you answer a question in class, you give the whole answer, not just one letter.
- What it is: The action is the entire response for the turn (e.g., search[…], click[…], or a structured plan).
- How it works:
- The policy (LLM) reads the turn state.
- It generates the full response for that turn.
- That response is treated as one atomic action for learning.
- Why it matters: Without full-turn actions, credit gets split awkwardly across tokens.
- Anchor: The bot outputs "click[b01hqtwl6s]" as one action, not scored letter-by-letter.
- 🍞 Environment Transition and Reward
- Hook: After a big move in a game, the board changes and you see if it helped.
- What it is: The environment returns the next state (e.g., new web page or new Sokoban board) and possibly a reward.
- How it works:
- Apply the action to the environment.
- Observe the updated context and any reward (often sparse and final).
- Log the transition for learning.
- Why it matters: Without seeing consequences, the agent canât learn what works.
- Anchor: Click a product → see its details page; finish Sokoban → get the terminal reward. (A rollout loop that logs these turn-level transitions is sketched below.)
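A hedged sketch of collecting one multi-turn episode and logging one transition per turn; it reuses the `build_turn_state` helper above, and the `policy.generate` / `env.reset` / `env.step` interfaces are hypothetical stand-ins for whatever agent framework is in use.

```python
def collect_episode(policy, env, max_turns=15):
    """Roll out one episode, recording (state, action, reward, done) once per turn."""
    transitions, history = [], []
    query = env.reset()                          # initial task prompt / observation
    for _ in range(max_turns):
        state = build_turn_state(history, query) # full history + current query
        action = policy.generate(state)          # the whole turn's response is one action
        next_query, reward, done = env.step(action)
        transitions.append({"state": state, "action": action,
                            "reward": reward, "done": done})
        history.append((query, action))          # the turn becomes part of future states
        query = next_query
        if done:                                 # reward is often sparse and terminal
            break
    return transitions
```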
- 🍞 Critic: Turn-Value Estimation
- Hook: It's easier to improve when a coach tells you how promising that last play looked.
- What it is: A learned value function (critic) predicts how good the current turn's state is, in terms of expected future reward.
- How it works:
- Attach a value head to the LLM (shared encoder, separate head).
- Train it to predict returns from each turn state.
- Use these predictions to compute advantages.
- Why it matters: Without a critic, advantage estimates rely on noisy rollouts and can derail training.
- Anchor: The critic learns that "having the right product page open with size options visible" is valuable. (A minimal value-head sketch follows.)
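A minimal sketch of a turn-value critic as a value head on a shared encoder; the pooling choice (last token) and the Hugging Face-style `last_hidden_state` output are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class TurnCritic(nn.Module):
    """Predicts a scalar value (expected future reward) for a turn-level state."""
    def __init__(self, encoder, hidden_size):
        super().__init__()
        self.encoder = encoder                        # shared with the actor LLM
        self.value_head = nn.Linear(hidden_size, 1)   # separate scalar value head

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        pooled = hidden[:, -1, :]                     # summarize the state via the last token
        return self.value_head(pooled).squeeze(-1)    # one value per turn state
```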
- 🍞 Generalized Advantage Estimation (GAE) at Turn Level
- Hook: Judging a step is easier if you compare how it went to how you expected it to go, and also peek a bit into the near future.
- What it is: GAE blends immediate outcomes with the critic's predictions to estimate how much better a turn was than expected.
- How it works:
- Compute a shortfall/excess for each turn: (reward + discounted next value - current value).
- Smooth across turns using parameters γ (discount) and λ (bias-variance tradeoff).
- The result is the advantage for that turn.
- Why it matters: Without GAE, advantages swing too wildly or get too biased.
- Anchor: If a turn puts you one click away from "Buy Now," the advantage is likely positive even before the final purchase. (See the sketch below.)
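A minimal sketch of turn-level GAE for a single episode; `rewards` and `values` hold one entry per turn (plus a trailing bootstrap value), and the defaults γ≈0.99, λ≈0.9 follow the settings reported later in the ablations.

```python
import numpy as np

def turn_level_gae(rewards, values, gamma=0.99, lam=0.9):
    """Compute one advantage per turn (not per token) for a single episode.

    rewards: reward observed after each turn (often all zeros until the last turn)
    values:  critic estimates for each turn state, plus one trailing value
             (0.0 for a terminal state) used for bootstrapping
    """
    num_turns = len(rewards)
    advantages = np.zeros(num_turns, dtype=np.float32)
    gae = 0.0
    for t in reversed(range(num_turns)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # per-turn TD error
        gae = delta + gamma * lam * gae                          # smooth across turns
        advantages[t] = gae
    returns = advantages + np.asarray(values[:num_turns], dtype=np.float32)
    return advantages, returns                                   # returns are the critic's targets

# Sparse terminal reward after four turns.
adv, ret = turn_level_gae(rewards=[0.0, 0.0, 0.0, 1.0],
                          values=[0.2, 0.3, 0.5, 0.7, 0.0])
```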
- 🍞 PPO Clipped Update at the Turn Level
- Hook: Learn to ride faster, but don't yank the handlebars.
- What it is: PPO updates the policy parameters while clipping big changes in probability for chosen actions to maintain stability.
- How it works:
- Compute the probability ratio of new vs old policy for the whole turn action.
- Multiply by the turn advantage; clip the ratio if it's too large.
- Optimize the clipped objective.
- Why it matters: Without clipping, a few turns could cause unsafe, destabilizing jumps.
- Anchor: If the model suddenly over-commits to an odd click pattern, clipping reins it in.
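A minimal sketch of the clipped surrogate with one probability ratio per turn, assuming each turn's log-probability is the sum of its token log-probs under the policy; the tensor shapes and the 0.2 clip range are illustrative defaults, not values reported by the paper.

```python
import torch

def turn_ppo_policy_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """Clipped PPO objective where the ratio is taken over whole turns.

    new_logprobs / old_logprobs: log-probability of each entire turn response
        (summed over its tokens) under the new / old policy, shape [num_turns]
    advantages: turn-level advantages from GAE, shape [num_turns]
    """
    ratio = torch.exp(new_logprobs - old_logprobs)                    # pi_new(turn) / pi_old(turn)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                      # maximize the clipped surrogate
```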
- 🍞 Value Loss for the Critic
- Hook: The coach also trains by comparing predictions to real outcomes to be a better judge next time.
- What it is: The critic is trained to match discounted returns from each turn onward.
- How it works:
- Compute target returns per turn (summing future rewards with discount γ).
- Minimize the squared error between predicted values and targets.
- Use a higher learning rate for the critic so it keeps up with the changing policy.
- Why it matters: Without accurate values, advantage estimates degrade and learning wobbles.
- Anchor: The critic learns to rate "right product, right size options" higher than "generic search page." (A tiny loss sketch follows.)
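A tiny sketch of the critic's loss; the commented optimizer lines show the faster-critic setup with the learning rates mentioned in the ablations, using hypothetical `actor` and `critic` modules.

```python
import torch.nn.functional as F

def critic_value_loss(predicted_values, target_returns):
    """Squared error between turn-value predictions and discounted-return targets."""
    return F.mse_loss(predicted_values, target_returns)

# Separate optimizers so the critic can learn faster than the actor
# (e.g., actor 1e-6, critic 1e-5, as reported in the ablations):
# actor_opt = torch.optim.AdamW(actor.parameters(), lr=1e-6)
# critic_opt = torch.optim.AdamW(critic.parameters(), lr=1e-5)
```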
- 🍞 Batch Construction and Diversity
- Hook: Practicing many different problems makes you a stronger all-around player.
- What it is: Each training round collects multiple rollouts; PPO favors more unique questions per batch (G=1) for diversity.
- How it works:
- Fix total rollouts; vary how many problems they cover.
- Use small minibatches before repeating many epochs to reduce overfitting.
- Maintain a modest number of epochs (often 1) and rely on fresh data.
- Why it matters: Without diversity, the critic can overfit to a few cases.
- Anchor: Seeing many kinds of web queries in one batch helps generalize value estimates.
Concrete Mini-Examples
- WebShop: State = prior actions + current page prompt; Action = "click[vintage camo]"; Reward = final purchase score at episode end. GAE gives turn-by-turn credit so the crucial "choose size then color" gets proper weight.
- Sokoban: State = current board; Action = a full textual plan turn; Reward = terminal success with step penalties. Early careful pushes get recognized by higher predicted values, steering learning.
The Secret Sauce
- Matching the learning unit (a turn) to the environment's change unit (also a turn) makes the critic's job natural and the policy update stable. That alignment, plus PPO's clipping, yields strong, steady learning without extra compute compared to token-level PPO. (A consolidated end-to-end sketch follows.)
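Putting these pieces together, here is a highly simplified single-episode sketch of one Turn-PPO update that reuses the helper functions from the sketches above; the `policy.turn_logprobs` and `critic.values` interfaces, and the single-episode batching, are assumptions made to keep the example short, not the paper's actual implementation.

```python
import torch

def turn_ppo_update(policy, critic, env, actor_opt, critic_opt):
    """One simplified Turn-PPO step: roll out an episode, then update actor and critic."""
    episode = collect_episode(policy, env)                   # one logged transition per turn
    states  = [t["state"]  for t in episode]
    actions = [t["action"] for t in episode]
    rewards = [t["reward"] for t in episode]

    with torch.no_grad():
        old_logprobs = policy.turn_logprobs(states, actions)     # log pi_old(turn | state)
        values = critic.values(states).tolist() + [0.0]          # per-turn values + terminal bootstrap

    advantages, returns = turn_level_gae(rewards, values)

    # Actor: clipped surrogate with one probability ratio per whole turn.
    actor_loss = turn_ppo_policy_loss(policy.turn_logprobs(states, actions),
                                      old_logprobs,
                                      torch.as_tensor(advantages))
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Critic: regress turn values toward the GAE returns (uses the larger learning rate).
    value_loss = critic_value_loss(critic.values(states), torch.as_tensor(returns))
    critic_opt.zero_grad(); value_loss.backward(); critic_opt.step()
```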
04 Experiments & Results
The Test: The team evaluated training stability, average reward, and training efficiency on two multi-turn environments: WebShop (web navigation and shopping with multiple tool interactions) and Sokoban (puzzle planning with irreversible moves and sparse final rewards). These tasks demand long-horizon reasoning where early turns strongly impact final success.
The Competition: They compared three approaches:
- GRPO (group-based, sample-only advantages; token-level actions).
- Token-PPO (learned critic, but still token-level actions and advantages).
- Turn-PPO (this paper's method: learned critic, turn-level actions and advantages).
Scoreboard with Context:
- WebShop
- Qwen2.5-3B: GRPO ≈ 0.72, Token-PPO ≈ 0.73, Turn-PPO ≈ 0.75. Think of this as Turn-PPO nudging from a solid B to a higher B+, due to better turn credit.
- Qwen3-1.7B (reasoning disabled): GRPO ≈ 0.78, Token-PPO ≈ 0.77, Turn-PPO ≈ 0.80. That's like clinching the top score in a tightly contested class.
- Qwen3-1.7B (reasoning enabled): GRPO often crashed; Token-PPO ≈ 0.54; Turn-PPO ≈ 0.55. Even in a hard, crash-prone setting, PPO-based methods held together, with Turn-PPO slightly ahead.
- Sokoban
- Qwen2.5-3B: GRPO crashed; Token-PPO ≈ 1.93; Turn-PPO ≈ 2.29. That's a big jump, like moving from a C+ to a solid B.
- Qwen2.5-7B: GRPO crashed; Token-PPO ≈ 2.90; Turn-PPO ≈ 3.74. Another strong gain, showing that longer-horizon planning benefits a lot from turn-level credit.
Training Stability Findings:
- GRPO frequently collapsed in multi-turn, long-reasoning runs. Removing standard deviation in normalization, removing KL regularization, or increasing batch diversity didn't fix the root problem; they merely delayed failures or made minor improvements.
- PPO-based methods were much steadier. Turn-PPO, in particular, showed smoother reward curves and fewer training hiccups.
Surprising/Illuminating Observations:
- Turn-level clipping led to a higher "clip ratio" than token-PPO. Counterintuitively, this is good here: if a full turn's probability changes too much, Turn-PPO clips the entire turn, preventing unsafe leaps and smoothing training.
- Qwen3 with default long "thinking" produced overlong chains that didn't help these tasks and made training harder. Disabling long reasoning improved results and stability for PPO methods. This suggests matching the model's reasoning style to the task is important.
Ablations: The PPO Recipe
- Learning Rates: The critic must learn faster than the actor (e.g., actor 1e-6, critic 1e-5). If not, learning stalls or diverges.
- Batch Shape: With fixed rollout budget, PPO prefers G=1 (one rollout per question, many distinct questions) for diversity; GRPO prefers more rollouts per question to stabilize its sample-based scoring.
- Minibatch vs Epochs: It's better to use smaller minibatches than to reuse the same data over many epochs, which risks overfitting.
- GAE Hyperparameters: Turn-PPO supports γ<1 and λ<1 (e.g., γ≈0.99, λ≈0.9) for a good bias-variance balance. Token-level PPO often needs γ=λ=1.0 because token sequences are so long that a smaller γ would make early tokens "disappear." This extra flexibility is a key Turn-PPO advantage. (The full recipe is summarized as a config sketch below.)
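The recipe can be condensed into one hedged configuration sketch; the numeric values are the ones reported above, while the key names and the 0.2 clip range are assumptions for illustration.

```python
# Illustrative Turn-PPO hyperparameter recipe distilled from the ablations.
turn_ppo_recipe = {
    "actor_lr": 1e-6,             # policy moves slowly and safely
    "critic_lr": 1e-5,            # critic learns faster so values track the changing policy
    "rollouts_per_question": 1,   # G = 1: spend the rollout budget on many distinct questions
    "ppo_epochs": 1,              # prefer fresh rollouts over re-reading the same batch
    "gamma": 0.99,                # turn-level discount (token-level PPO often needs 1.0)
    "lam": 0.9,                   # GAE bias-variance tradeoff at the turn level
    "clip_eps": 0.2,              # standard PPO clip range (assumed, not reported)
}
```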
Bottom Line with Meaning:
- Think of Turn-PPO as grading each play in a game rather than every micro-movement. That simple reframing, combined with a coach (critic) who learns to judge plays, yields higher, steadier scores across very different long-horizon tasks, without extra compute per step compared to token-level PPO.
05 Discussion & Limitations
Limitations:
- Tested on two benchmarks (WebShop and Sokoban). While they span web tool use and puzzle planning, broader trials (richer web tools, complex GUIs, physical robots) are still needed to confirm generality.
- Environments were mostly text-simulated. Real-world noise, delays, and multi-modal feedback could pose new challenges for value estimation and stability.
- Turn boundaries are assumed natural and well-defined. In some tasks, what counts as a "turn" may be ambiguous and require careful interface design.
Required Resources:
- A base LLM (e.g., Qwen-family) with enough capacity to serve as both actor and shared encoder for the critic head.
- Rollout infrastructure for multi-turn environments (collection, logging, and replay for PPO updates).
- Careful hyperparameter tuning: critic LR > actor LR; batch diversity for PPO; γ≈0.99 and λ≈0.9 for turn-level GAE.
When NOT to Use:
- Pure single-turn QA with immediate rewards: token-level approaches may suffice and be simpler.
- Tasks where "turns" don't align with environment changes or where actions must be at fine-grained token-level (e.g., exact string emission with immediate per-token rewards).
- Extremely sparse data or ultra-costly rollouts where learning a critic is impractical (though PPO is generally sample-efficient).
Open Questions:
- How best to define turns in mixed-tool or streaming settings (e.g., partial tool outputs, concurrent tools)?
- Can turn-level value functions be augmented with auxiliary predictions (e.g., success likelihood, remaining steps) to further stabilize long-horizon learning?
- What curricula help the critic learn faster on very long tasks without overfitting early-turn patterns?
- How does Turn-PPO interact with reward shaping, rejection sampling, or preference models (RLHF) in multi-turn agents?
- Can off-policy replay or hybrid on/off-policy variants further improve sample efficiency without hurting stability?
06 Conclusion & Future Work
Three-Sentence Summary: This paper shows that training LLM agents for multi-step tasks works better when each whole turn is treated as the action and is judged by a learned critic. By combining a turn-level MDP with PPO and turn-level GAE, Turn-PPO delivers more stable and higher rewards than GRPO and token-level PPO on WebShop and Sokoban. The method keeps compute similar to token-PPO while aligning learning with how environments actually change: turn by turn.
Main Achievement: Turn-PPO cleanly solves the mismatch between token-level learning and turn-based environments, yielding accurate advantage estimates, safer updates (turn-level clipping), and consistently steadier training across long-horizon tasks.
Future Directions: Extend to richer, real web agents with multiple tools and GUI actions, and to embodied settings with real sensors and delays. Explore hybrid methods that mix turn-level PPO with preference models, rejection sampling, or off-policy replay for more sample efficiency. Develop automatic turn segmentation and better auxiliary signals to help the critic on very long missions.
Why Remember This: Sometimes the biggest gain comes from choosing the right "unit of learning." By switching from tokens to turns and letting a critic guide credit assignment, Turn-PPO turns unstable, crash-prone training into steady progress for multi-turn AI agents.
Practical Applications
- Train web-browsing agents that consistently find and purchase correct items with fewer training crashes.
- Build GUI automation assistants that reliably navigate multi-step application workflows (fill forms, export data, verify results).
- Improve puzzle and planning bots (like Sokoban-style or logistics) where early moves affect long-term success.
- Stabilize training for tool-using LLMs that call search engines, databases, or APIs across multiple steps.
- Enhance customer-support agents that must gather info, check systems, and resolve tickets over several exchanges.
- Develop research assistants that plan multi-stage queries (search, filter, summarize, cite) as coherent turns.
- Create tutoring systems that adapt lesson turns based on student responses and long-term learning goals.
- Optimize data-labeling or QA agents that must follow multi-step guidelines consistently before submitting results.
- Prototype embodied or simulated-robot controllers that plan turn-by-turn with language and tool feedback.
- Integrate Turn-PPO into RLHF pipelines for multi-turn dialogues where turn-level credit improves alignment.