Turn-PPO: Turn-Level Advantage Estimation with PPO for Improved Multi-Turn RL in Agentic LLMs
Key Summary
- Turn-PPO is a new way to train chatty AI agents that act over many steps, by judging each conversation turn as one whole action instead of judging every single token.
- It replaces GRPO's noisy, sample-only scoring with PPO's learned critic, which gives steadier and fairer feedback for long missions.
- Modeling the problem at the turn level matches how agents really work (decide, act, see result), so the value function learns better and advantage estimates become more accurate.
- On two tough multi-step tasks (WebShop and Sokoban), Turn-PPO trained more stably and often scored higher than both GRPO and token-level PPO.
- Turn-PPO prevents update explosions by clipping at the turn level, stopping risky big jumps for entire turns that would destabilize training.
- Ablations show PPO needs careful settings: the critic learns faster than the actor, batches should be diverse, and turn-level GAE works best with γ≈0.99 and λ≈0.9.
- GRPO frequently collapsed on long-reasoning runs because it gives the same advantage to all tokens and suffers high-variance estimates in multi-turn worlds.
- Turn-PPO costs about the same compute per step as token-PPO but turns that compute into more reliable learning.
- This approach offers a practical recipe for building more capable tool-using, multi-turn LLM agents.
Why This Research Matters
Reliable multi-turn agents power practical tools like web assistants, UI automation, and planning bots. Turn-PPO makes these agents learn steadily instead of crashing, which means fewer failures and better results in real workflows. By judging whole turns, the method matches how real environments change, improving credit assignment and planning. This stability lets developers train larger, smarter agents that handle long missions with fewer tweaks. In everyday life, that could mean assistants that actually complete tasks end to end (finding items, filling forms, booking services) without getting stuck. For businesses, it translates to higher success rates, lower iteration costs, and safer deployments.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook) You know how when you play a board game, you don't score points for each tiny thought you have, but for each turn you take and how it changes the game? Judging turns, not tiny thoughts, makes the score fair.
🥬 Filling (The Actual Concept)
- What it is: Training large language model (LLM) agents for real multi-step tasks (like shopping on a website or solving a puzzle game) works best when we treat each turn (one full decision and action) as the basic unit of learning, not each token.
- How it works: Historically, people used reinforcement learning (RL) with token-level updates (every word counts as an action). But in multi-turn tool use, the world changes in big chunks at each turn (you click, the website responds). So researchers reframed the problem to learn per turn using PPO with a critic that estimates how good each whole turn was, which stabilizes long-horizon training.
- Why it matters: Without turn-level learning, the training signal becomes noisy and unfair: some tokens get blamed or praised equally even if they mattered very differently, and the model can crash during training.
🍞 Bottom Bread (Anchor) Imagine a shopping helper bot: "search" is one turn, then "click a product" is another, then "choose size," then "buy now." Scoring each whole move makes more sense than scoring every character it typed.
Now let's introduce the key ideas in the right order, using the Sandwich pattern each time we first meet a new concept.
- 🍞 Reinforcement Learning (RL)
- Hook: Imagine training a puppy with treats. Good actions get rewards; bad actions donât.
- What it is: RL is a way for AI to learn by trying actions and using rewards to get better over time.
- How it works:
- The agent sees a situation.
- It picks an action.
- The world reacts and gives a reward.
- The agent updates its strategy to earn more reward next time.
- Why it matters: Without rewards and trial-and-error, the agent can't learn what actually works in the real world.
- Anchor: A web-browsing bot tries "search," sees results, and learns which searches lead to successful purchases (a toy version of this loop is sketched below).
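To make the reward-driven loop concrete, here is a tiny, self-contained sketch of trial-and-error learning; the `ToyEnv` environment, its reward rule, and the simple value-tracking strategy are invented for illustration and are not part of the paper.

```python
import random

class ToyEnv:
    """A made-up one-step environment: reward 1.0 if the agent chooses 'search'."""
    def reset(self):
        return "user wants: vintage camera"           # the situation the agent sees

    def step(self, action):
        reward = 1.0 if action == "search" else 0.0   # the world reacts with a reward
        return "results page", reward, True           # next state, reward, episode done

env = ToyEnv()
action_values = {"search": 0.0, "click": 0.0}         # the agent's current strategy

for episode in range(100):
    state = env.reset()
    # Mostly pick the best-looking action, sometimes explore a random one.
    if random.random() < 0.2:
        action = random.choice(list(action_values))
    else:
        action = max(action_values, key=action_values.get)
    _, reward, _ = env.step(action)
    # Nudge the estimate for the chosen action toward the reward it earned.
    action_values[action] += 0.1 * (reward - action_values[action])

print(action_values)   # "search" ends up valued higher, so the agent learns to prefer it
```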
- 🍞 Markov Decision Process (MDP)
- Hook: Think of a treasure hunt where your next step depends on where you are now and the move you choose.
- What it is: An MDP is a formal way to describe decision-making over steps: states, actions, transitions, and rewards.
- How it works:
- State: what the agent knows now.
- Action: what it does next.
- Transition: how the world changes.
- Reward: how good that move turned out.
- Why it matters: Without the MDP map, learning becomes guessy and tangled, especially over many steps.
- Anchor: In WebShop, the "state" is the page and history, the "action" is click/search, the "reward" is buying the right item.
- 🍞 Multi-Turn Tasks
- Hook: You don't solve a maze in one step; you take many turns.
- What it is: Multi-turn tasks need several decisions in sequence, where each choice affects the next.
- How it works:
- Take a turn (decide and act).
- See the new situation.
- Repeat, building on history.
- Why it matters: Without modeling turns, agents forget context and can't plan ahead.
- Anchor: A bot that shops must search, skim items, pick one, then buy; each is a separate turn.
- 🍞 Long-Horizon Reasoning
- Hook: Planning a road trip across many cities needs decisions that work well far into the future.
- What it is: Long-horizon reasoning means planning across many steps where early choices shape later options.
- How it works:
- Keep track of goals over time.
- Choose actions that help later, not just now.
- Adjust the plan as the world reacts.
- Why it matters: Without it, agents chase short-term wins and fail long missions.
- Anchor: In Sokoban, a single wrong push can block the goal many moves later, so early care is vital.
The World Before: LLMs used RL to learn tool use and interactive skills. Many systems adapted a method called GRPO, which scores whole trajectories by comparing multiple rollouts and then gives that same normalized score to every token in the response. This worked reasonably in single-turn question answering but struggled in true multi-turn agent settings.
The Problem: Multi-turn worlds produce uneven turns: some turns matter a lot (e.g., selecting the right product), others a little (e.g., a harmless note). GRPO's "one score for all tokens" ignores this. Also, sampling many rollouts in dynamic environments adds noise, making training unstable, especially when chains of reasoning are long.
Failed Attempts: Tweaks to GRPO (removing standard deviation in normalization, removing KL regularization, and increasing batch diversity) gave only small, temporary gains. Crashes still happened.
The Gap: We needed an advantage estimation that (a) understands turns, not tokens, and (b) reduces variance with a learned value function (critic), not only sample comparisons.
The Fix Introduced: Turn-PPO reframes the MDP so each whole turn is the action. A PPO critic learns turn values; generalized advantage estimation (at the turn level) provides stable, accurate credit assignment. Result: more robust training and better scores on WebShop and Sokoban.
Why You Should Care: Reliable multi-turn agents power web assistants, UI automation, and planning robots. Making them learn stably means fewer crashes, better choices, and more helpful everyday AI.
02 Core Idea
🍞 Top Bread (Hook) Imagine grading a soccer player by each pass they make (every tiny touch) versus grading each play they run (one cohesive turn). Which grade matches the game better? The play-level grade!
🥬 Filling (The Actual Concept)
- What it is: The key insight is to train LLM agents with PPO at the turn level (Turn-PPO), so one complete turn (full response) is the action, and a learned critic estimates how good that turn was.
- How it works (recipe):
- Define states as all history plus the current query; define actions as the entire turn's response.
- Use a value function (critic) to predict how promising each turn is.
- Compute turn-level advantages with generalized advantage estimation (GAE).
- Apply PPO's clipped update on whole turns to avoid unstable jumps.
- Why it matters: Without turn-level framing, the critic sees mismatched token-by-token transitions (some tiny, some huge), learns blurry values, and advantage estimates get noisy; training derails, especially on long tasks.
🍞 Bottom Bread (Anchor) A shopping agent deciding "click this product" should be scored on that entire decision, not on each word it used to say "click."
Three Analogies for the Same Idea
- Sports Playbook: Don't grade every footstep; grade the finished play. Turn-PPO grades per turn, not per token.
- Cooking Steps: Taste after each step (saute, simmer, bake) rather than after each grain of salt. The critic judges step-sized chunks (turns).
- School Projects: Assess the whole project checkpoint, not every keystroke you typed. Advantage is the improvement from one checkpoint to the next.
Before vs After
- Before (Token-MDP + GRPO): All tokens got the same trajectory score, even if they mattered very differently. High variance across multi-turn environments made training fragile. Critics (when used with token-level PPO) struggled because state transitions were inconsistent: some steps were a single generated token, others a whole tool output.
- After (Turn-MDP + PPO): The critic learns meaningful turn values. Advantages line up with actual decision boundaries. PPO's clipping at the turn level prevents the wild updates that often crash training.
Why It Works (Intuition, no equations)
- The world changes between turns, not tokens: tools reply in big chunks; decisions land in full responses. Making turns the action aligns learning with real transitions, so the critic's job becomes easier and its estimates sharper.
- Learned critic beats sample-only baselines: Instead of relying on noisy rollout groups, the critic smooths estimates over many experiences, lowering variance.
- Clipping at the right granularity: If a new policy would change an entire turn too much, the update is clipped. This keeps learning on a safe path.
Building Blocks (with Sandwich mini-explanations)
- 🍞 Advantage Estimation
- Hook: Choosing a snack is easier if you compare it to other snacks you could have had.
- What it is: Advantage tells how much better an action was compared to the average expected outcome from that state.
- How it works: Predict state value (critic), compare the actual outcome to that prediction, and use the difference as learning signal.
- Why it matters: Without advantage, updates are too noisy or biased.
- Anchor: If buying this product beat your usual success rate, the advantage is positive.
- 🍞 Proximal Policy Optimization (PPO)
- Hook: When trying a new skateboard trick, you don't change your style wildly; you adjust in small safe steps.
- What it is: PPO is an RL method that updates policies with a clipping rule to avoid overly large, risky changes.
- How it works: Measure how the new policy differs from the old on taken actions, clip big jumps, and learn steadily.
- Why it matters: Without clipping, training can explode and collapse.
- Anchor: If the agent suddenly favors a weird turn too strongly, PPO clips that jump.
- 🍞 Group Relative Policy Optimization (GRPO)
- Hook: Judging a performance by ranking it only among the small group you watched can be unfair and bouncy.
- What it is: GRPO scores actions by normalizing trajectory rewards within a group of sampled rollouts, removing the critic.
- How it works: Sample multiple rollouts, compute relative scores, assign the same score to all tokens in a trajectory.
- Why it matters: Without a learned critic and turn-aware credit, scores can be high-variance and misassigned.
- Anchor: In long tasks, one great decision and many filler tokens all get the same credit.
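For contrast with the critic-based approach that follows, here is a minimal sketch of GRPO-style group-normalized scoring, assuming each rollout produces one scalar trajectory reward; the function name and epsilon constant are illustrative.

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-8):
    """Score each rollout relative to its sampled group (no learned critic).

    Every token in rollout i then inherits the same advantage a[i],
    no matter which individual turns actually drove the outcome.
    """
    rewards = np.asarray(group_rewards, dtype=np.float32)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Four rollouts of the same question: one strong trajectory, three weak ones.
print(grpo_advantages([0.9, 0.2, 0.3, 0.25]))
```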
- 🍞 Token-MDP vs Turn-MDP
- Hook: Counting each letter you write as a separate action is weird; counting each sentence (a full thought) feels right.
- What it is: Token-MDP treats each token as an action; Turn-MDP treats each full turn as one action.
- How it works: Token-MDP updates at every token; Turn-MDP updates per full response, matching real environment steps.
- Why it matters: Without Turn-MDP, critics face mismatched transitions and learn fuzzy values.
- Anchor: In WebShop, the environment changes after your click (a turn), not after each word you typed.
Put together, these blocks create Turn-PPO: a turn-aware, critic-guided, safely-updated learning method that fits multi-turn agents like a glove.
03 Methodology
At a high level: Input (multi-turn environment state) → Build turn-level state (history + current query) → Actor generates full-turn response → Environment returns next state and reward → Critic estimates values → Compute turn-level advantages (GAE) → PPO clipped updates (actor and critic) → Output: a more capable, stable multi-turn agent.
Step-by-step with Sandwich explanations where new ideas appear:
- 🍞 Turn-Level State Builder
- Hook: Before making your next move in chess, you look at the whole board plus the last moves.
- What it is: The state is the entire conversation and tool results so far, plus the current query for this turn.
- How it works:
- Concatenate all past turns: (query_1, response_1, …, query_{n-1}, response_{n-1}).
- Add the current query_n.
- That combined context is the turn-n state.
- Why it matters: Without full history, the agent can't connect earlier choices to current decisions.
- Anchor: In WebShop, the state includes past searches, clicked products, and the current page's prompt (a small helper that assembles this state is sketched below).
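Here is a minimal sketch of assembling the turn-n state from history; the bracketed message format and the function name are assumptions made for illustration, not the paper's exact prompt template.

```python
def build_turn_state(history, current_query):
    """Concatenate all past (query, response) pairs plus the current query.

    history: list of (query, response) tuples for turns 1 .. n-1
    current_query: the environment's prompt for turn n
    Returns one string that the policy LLM conditions on for this turn.
    """
    parts = []
    for i, (query, response) in enumerate(history, start=1):
        parts.append(f"[Turn {i} query] {query}")
        parts.append(f"[Turn {i} response] {response}")
    parts.append(f"[Current query] {current_query}")
    return "\n".join(parts)

state = build_turn_state(
    history=[("Find a vintage camera under $50.", "search[vintage camera]")],
    current_query="Results list shown. Which item do you click?",
)
```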
- 🍞 Actor: Full-Turn Action Generation
- Hook: When you answer a question in class, you give the whole answer, not just one letter.
- What it is: The action is the entire response for the turn (e.g., search[…], click[…], or a structured plan).
- How it works:
- The policy (LLM) reads the turn state.
- It generates the full response for that turn.
- That response is treated as one atomic action for learning.
- Why it matters: Without full-turn actions, credit gets split awkwardly across tokens.
- Anchor: The bot outputs "click[b01hqtwl6s]" as one action, not scored letter-by-letter.
- 🍞 Environment Transition and Reward
- Hook: After a big move in a game, the board changes and you see if it helped.
- What it is: The environment returns the next state (e.g., new web page or new Sokoban board) and possibly a reward.
- How it works:
- Apply the action to the environment.
- Observe the updated context and any reward (often sparse and final).
- Log the transition for learning.
- Why it matters: Without seeing consequences, the agent canât learn what works.
- Anchor: Click a product → see its details page; finish Sokoban → get the terminal reward. (A rollout loop that logs these turn-level transitions is sketched below.)
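A hedged sketch of collecting one multi-turn episode and logging one transition per turn; it reuses the `build_turn_state` helper above, and the `policy.generate` / `env.reset` / `env.step` interfaces are hypothetical stand-ins for whatever agent framework is in use.

```python
def collect_episode(policy, env, max_turns=15):
    """Roll out one episode, recording (state, action, reward, done) once per turn."""
    transitions, history = [], []
    query = env.reset()                          # initial task prompt / observation
    for _ in range(max_turns):
        state = build_turn_state(history, query) # full history + current query
        action = policy.generate(state)          # the whole turn's response is one action
        next_query, reward, done = env.step(action)
        transitions.append({"state": state, "action": action,
                            "reward": reward, "done": done})
        history.append((query, action))          # the turn becomes part of future states
        query = next_query
        if done:                                 # reward is often sparse and terminal
            break
    return transitions
```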
- 🍞 Critic: Turn-Value Estimation
- Hook: It's easier to improve when a coach tells you how promising that last play looked.
- What it is: A learned value function (critic) predicts how good the current turn's state is, in terms of expected future reward.
- How it works:
- Attach a value head to the LLM (shared encoder, separate head).
- Train it to predict returns from each turn state.
- Use these predictions to compute advantages.
- Why it matters: Without a critic, advantage estimates rely on noisy rollouts and can derail training.
- Anchor: The critic learns that "having the right product page open with size options visible" is valuable. (A minimal value-head sketch follows.)
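A minimal sketch of a turn-value critic as a value head on a shared encoder; the pooling choice (last token) and the Hugging Face-style `last_hidden_state` output are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class TurnCritic(nn.Module):
    """Predicts a scalar value (expected future reward) for a turn-level state."""
    def __init__(self, encoder, hidden_size):
        super().__init__()
        self.encoder = encoder                        # shared with the actor LLM
        self.value_head = nn.Linear(hidden_size, 1)   # separate scalar value head

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        pooled = hidden[:, -1, :]                     # summarize the state via the last token
        return self.value_head(pooled).squeeze(-1)    # one value per turn state
```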
- 🍞 Generalized Advantage Estimation (GAE) at Turn Level
- Hook: Judging a step is easier if you compare how it went to how you expected it to go, and also peek a bit into the near future.
- What it is: GAE blends immediate outcomes with the critic's predictions to estimate how much better a turn was than expected.
- How it works:
- Compute a shortfall/excess for each turn: (reward + discounted next value - current value).
- Smooth across turns using parameters γ (discount) and λ (bias-variance tradeoff).
- The result is the advantage for that turn.
- Why it matters: Without GAE, advantages swing too wildly or get too biased.
- Anchor: If a turn puts you one click away from "Buy Now," the advantage is likely positive even before the final purchase. (See the sketch below.)
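A minimal sketch of turn-level GAE for a single episode; `rewards` and `values` hold one entry per turn (plus a trailing bootstrap value), and the defaults γ≈0.99, λ≈0.9 follow the settings reported later in the ablations.

```python
import numpy as np

def turn_level_gae(rewards, values, gamma=0.99, lam=0.9):
    """Compute one advantage per turn (not per token) for a single episode.

    rewards: reward observed after each turn (often all zeros until the last turn)
    values:  critic estimates for each turn state, plus one trailing value
             (0.0 for a terminal state) used for bootstrapping
    """
    num_turns = len(rewards)
    advantages = np.zeros(num_turns, dtype=np.float32)
    gae = 0.0
    for t in reversed(range(num_turns)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # per-turn TD error
        gae = delta + gamma * lam * gae                          # smooth across turns
        advantages[t] = gae
    returns = advantages + np.asarray(values[:num_turns], dtype=np.float32)
    return advantages, returns                                   # returns are the critic's targets

# Sparse terminal reward after four turns.
adv, ret = turn_level_gae(rewards=[0.0, 0.0, 0.0, 1.0],
                          values=[0.2, 0.3, 0.5, 0.7, 0.0])
```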
- 🍞 PPO Clipped Update at the Turn Level
- Hook: Learn to ride faster, but don't yank the handlebars.
- What it is: PPO updates the policy parameters while clipping big changes in probability for chosen actions to maintain stability.
- How it works:
- Compute the probability ratio of new vs old policy for the whole turn action.
- Multiply by the turn advantage; clip the ratio if it's too large.
- Optimize the clipped objective.
- Why it matters: Without clipping, a few turns could cause unsafe, destabilizing jumps.
- Anchor: If the model suddenly over-commits to an odd click pattern, clipping reins it in.
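A minimal sketch of the clipped surrogate with one probability ratio per turn, assuming each turn's log-probability is the sum of its token log-probs under the policy; the tensor shapes and the 0.2 clip range are illustrative defaults, not values reported by the paper.

```python
import torch

def turn_ppo_policy_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """Clipped PPO objective where the ratio is taken over whole turns.

    new_logprobs / old_logprobs: log-probability of each entire turn response
        (summed over its tokens) under the new / old policy, shape [num_turns]
    advantages: turn-level advantages from GAE, shape [num_turns]
    """
    ratio = torch.exp(new_logprobs - old_logprobs)                    # pi_new(turn) / pi_old(turn)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                      # maximize the clipped surrogate
```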
- 🍞 Value Loss for the Critic
- Hook: The coach also trains by comparing predictions to real outcomes to be a better judge next time.
- What it is: The critic is trained to match discounted returns from each turn onward.
- How it works:
- Compute target returns per turn (summing future rewards with discount γ).
- Minimize the squared error between predicted values and targets.
- Use a higher learning rate for the critic so it keeps up with the changing policy.
- Why it matters: Without accurate values, advantage estimates degrade and learning wobbles.
- Anchor: The critic learns to rate "right product, right size options" higher than "generic search page." (A tiny loss sketch follows.)
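A tiny sketch of the critic's loss; the commented optimizer lines show the faster-critic setup with the learning rates mentioned in the ablations, using hypothetical `actor` and `critic` modules.

```python
import torch.nn.functional as F

def critic_value_loss(predicted_values, target_returns):
    """Squared error between turn-value predictions and discounted-return targets."""
    return F.mse_loss(predicted_values, target_returns)

# Separate optimizers so the critic can learn faster than the actor
# (e.g., actor 1e-6, critic 1e-5, as reported in the ablations):
# actor_opt = torch.optim.AdamW(actor.parameters(), lr=1e-6)
# critic_opt = torch.optim.AdamW(critic.parameters(), lr=1e-5)
```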
- 🍞 Batch Construction and Diversity
- Hook: Practicing many different problems makes you a stronger all-around player.
- What it is: Each training round collects multiple rollouts; PPO favors more unique questions per batch (G=1) for diversity.
- How it works:
- Fix total rollouts; vary how many problems they cover.
- Use small minibatches before repeating many epochs to reduce overfitting.
- Maintain a modest number of epochs (often 1) and rely on fresh data.
- Why it matters: Without diversity, the critic can overfit to a few cases.
- Anchor: Seeing many kinds of web queries in one batch helps generalize value estimates.
Concrete Mini-Examples
- WebShop: State = prior actions + current page prompt; Action = "click[vintage camo]"; Reward = final purchase score at episode end. GAE gives turn-by-turn credit so the crucial "choose size then color" gets proper weight.
- Sokoban: State = current board; Action = a full textual plan turn; Reward = terminal success with step penalties. Early careful pushes get recognized by higher predicted values, steering learning.
The Secret Sauce
- Matching the learning unit (a turn) to the environment's change unit (also a turn) makes the critic's job natural and the policy update stable. That alignment, plus PPO's clipping, yields strong, steady learning without extra compute compared to token-level PPO. (A consolidated end-to-end sketch follows.)
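Putting these pieces together, here is a highly simplified single-episode sketch of one Turn-PPO update that reuses the helper functions from the sketches above; the `policy.turn_logprobs` and `critic.values` interfaces, and the single-episode batching, are assumptions made to keep the example short, not the paper's actual implementation.

```python
import torch

def turn_ppo_update(policy, critic, env, actor_opt, critic_opt):
    """One simplified Turn-PPO step: roll out an episode, then update actor and critic."""
    episode = collect_episode(policy, env)                   # one logged transition per turn
    states  = [t["state"]  for t in episode]
    actions = [t["action"] for t in episode]
    rewards = [t["reward"] for t in episode]

    with torch.no_grad():
        old_logprobs = policy.turn_logprobs(states, actions)     # log pi_old(turn | state)
        values = critic.values(states).tolist() + [0.0]          # per-turn values + terminal bootstrap

    advantages, returns = turn_level_gae(rewards, values)

    # Actor: clipped surrogate with one probability ratio per whole turn.
    actor_loss = turn_ppo_policy_loss(policy.turn_logprobs(states, actions),
                                      old_logprobs,
                                      torch.as_tensor(advantages))
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Critic: regress turn values toward the GAE returns (uses the larger learning rate).
    value_loss = critic_value_loss(critic.values(states), torch.as_tensor(returns))
    critic_opt.zero_grad(); value_loss.backward(); critic_opt.step()
```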
04 Experiments & Results
The Test: The team evaluated training stability, average reward, and training efficiency on two multi-turn environments: WebShop (web navigation and shopping with multiple tool interactions) and Sokoban (puzzle planning with irreversible moves and sparse final rewards). These tasks demand long-horizon reasoning where early turns strongly impact final success.
The Competition: They compared three approaches:
- GRPO (group-based, sample-only advantages; token-level actions).
- Token-PPO (learned critic, but still token-level actions and advantages).
- Turn-PPO (this paper's method: learned critic, turn-level actions and advantages).
Scoreboard with Context:
- WebShop
- Qwen2.5-3B: GRPO ≈ 0.72, Token-PPO ≈ 0.73, Turn-PPO ≈ 0.75. Think of this as Turn-PPO nudging from a solid B to a higher B+, due to better turn credit.
- Qwen3-1.7B (reasoning disabled): GRPO ≈ 0.78, Token-PPO ≈ 0.77, Turn-PPO ≈ 0.80. That's like clinching the top score in a tightly contested class.
- Qwen3-1.7B (reasoning enabled): GRPO often crashed; Token-PPO ≈ 0.54; Turn-PPO ≈ 0.55. Even in a hard, crash-prone setting, PPO-based methods held together, with Turn-PPO slightly ahead.
- Sokoban
- Qwen2.5-3B: GRPO crashed; Token-PPO ≈ 1.93; Turn-PPO ≈ 2.29. That's a big jump, like moving from a C+ to a solid B.
- Qwen2.5-7B: GRPO crashed; Token-PPO ≈ 2.90; Turn-PPO ≈ 3.74. Another strong gain, showing that longer-horizon planning benefits a lot from turn-level credit.
Training Stability Findings:
- GRPO frequently collapsed in multi-turn, long-reasoning runs. Removing standard deviation in normalization, removing KL regularization, or increasing batch diversity didn't fix the root problem; they merely delayed failures or made minor improvements.
- PPO-based methods were much steadier. Turn-PPO, in particular, showed smoother reward curves and fewer training hiccups.
Surprising/Illuminating Observations:
- Turn-level clipping led to a higher "clip ratio" than token-PPO. Counterintuitively, this is good here: if a full turn's probability changes too much, Turn-PPO clips the entire turn, preventing unsafe leaps and smoothing training.
- Qwen3 with default long "thinking" produced overlong chains that didn't help these tasks and made training harder. Disabling long reasoning improved results and stability for PPO methods. This suggests matching the model's reasoning style to the task is important.
Ablations: The PPO Recipe
- Learning Rates: The critic must learn faster than the actor (e.g., actor 1e-6, critic 1e-5). If not, learning stalls or diverges.
- Batch Shape: With fixed rollout budget, PPO prefers G=1 (one rollout per question, many distinct questions) for diversity; GRPO prefers more rollouts per question to stabilize its sample-based scoring.
- Minibatch vs Epochs: It's better to use smaller minibatches than to reuse the same data over many epochs, which risks overfitting.
- GAE Hyperparameters: Turn-PPO supports γ<1 and λ<1 (e.g., γ≈0.99, λ≈0.9) for a good bias-variance balance. Token-level PPO often needs γ=λ=1.0 because token sequences are so long that a smaller γ would make early tokens "disappear." This extra flexibility is a key Turn-PPO advantage. (The full recipe is summarized as a config sketch below.)
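The recipe can be condensed into one hedged configuration sketch; the numeric values are the ones reported above, while the key names and the 0.2 clip range are assumptions for illustration.

```python
# Illustrative Turn-PPO hyperparameter recipe distilled from the ablations.
turn_ppo_recipe = {
    "actor_lr": 1e-6,             # policy moves slowly and safely
    "critic_lr": 1e-5,            # critic learns faster so values track the changing policy
    "rollouts_per_question": 1,   # G = 1: spend the rollout budget on many distinct questions
    "ppo_epochs": 1,              # prefer fresh rollouts over re-reading the same batch
    "gamma": 0.99,                # turn-level discount (token-level PPO often needs 1.0)
    "lam": 0.9,                   # GAE bias-variance tradeoff at the turn level
    "clip_eps": 0.2,              # standard PPO clip range (assumed, not reported)
}
```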
Bottom Line with Meaning:
- Think of Turn-PPO as grading each play in a game rather than every micro-movement. That simple reframing, combined with a coach (critic) who learns to judge plays, yields higher, steadier scores across very different long-horizon tasks, without extra compute per step compared to token-level PPO.
05 Discussion & Limitations
Limitations:
- Tested on two benchmarks (WebShop and Sokoban). While they span web tool use and puzzle planning, broader trials (richer web tools, complex GUIs, physical robots) are still needed to confirm generality.
- Environments were mostly text-simulated. Real-world noise, delays, and multi-modal feedback could pose new challenges for value estimation and stability.
- Turn boundaries are assumed natural and well-defined. In some tasks, what counts as a "turn" may be ambiguous and require careful interface design.
Required Resources:
- A base LLM (e.g., Qwen-family) with enough capacity to serve as both actor and shared encoder for the critic head.
- Rollout infrastructure for multi-turn environments (collection, logging, and replay for PPO updates).
- Careful hyperparameter tuning: critic LR > actor LR; batch diversity for PPO; γ≈0.99 and λ≈0.9 for turn-level GAE.
When NOT to Use:
- Pure single-turn QA with immediate rewards: token-level approaches may suffice and be simpler.
- Tasks where "turns" don't align with environment changes or where actions must be at fine-grained token-level (e.g., exact string emission with immediate per-token rewards).
- Extremely sparse data or ultra-costly rollouts where learning a critic is impractical (though PPO is generally sample-efficient).
Open Questions:
- How best to define turns in mixed-tool or streaming settings (e.g., partial tool outputs, concurrent tools)?
- Can turn-level value functions be augmented with auxiliary predictions (e.g., success likelihood, remaining steps) to further stabilize long-horizon learning?
- What curricula help the critic learn faster on very long tasks without overfitting early-turn patterns?
- How does Turn-PPO interact with reward shaping, rejection sampling, or preference models (RLHF) in multi-turn agents?
- Can off-policy replay or hybrid on/off-policy variants further improve sample efficiency without hurting stability?
06 Conclusion & Future Work
Three-Sentence Summary: This paper shows that training LLM agents for multi-step tasks works better when each whole turn is treated as the action and is judged by a learned critic. By combining a turn-level MDP with PPO and turn-level GAE, Turn-PPO delivers more stable and higher rewards than GRPO and token-level PPO on WebShop and Sokoban. The method keeps compute similar to token-PPO while aligning learning with how environments actually change: turn by turn.
Main Achievement: Turn-PPO cleanly solves the mismatch between token-level learning and turn-based environments, yielding accurate advantage estimates, safer updates (turn-level clipping), and consistently steadier training across long-horizon tasks.
Future Directions: Extend to richer, real web agents with multiple tools and GUI actions, and to embodied settings with real sensors and delays. Explore hybrid methods that mix turn-level PPO with preference models, rejection sampling, or off-policy replay for more sample efficiency. Develop automatic turn segmentation and better auxiliary signals to help the critic on very long missions.
Why Remember This: Sometimes the biggest gain comes from choosing the right "unit of learning." By switching from tokens to turns and letting a critic guide credit assignment, Turn-PPO turns unstable, crash-prone training into steady progress for multi-turn AI agents.
Practical Applications
- Train web-browsing agents that consistently find and purchase correct items with fewer training crashes.
- Build GUI automation assistants that reliably navigate multi-step application workflows (fill forms, export data, verify results).
- Improve puzzle and planning bots (like Sokoban-style or logistics) where early moves affect long-term success.
- Stabilize training for tool-using LLMs that call search engines, databases, or APIs across multiple steps.
- Enhance customer-support agents that must gather info, check systems, and resolve tickets over several exchanges.
- Develop research assistants that plan multi-stage queries (search, filter, summarize, cite) as coherent turns.
- Create tutoring systems that adapt lesson turns based on student responses and long-term learning goals.
- Optimize data-labeling or QA agents that must follow multi-step guidelines consistently before submitting results.
- Prototype embodied or simulated-robot controllers that plan turn-by-turn with language and tool feedback.
- Integrate Turn-PPO into RLHF pipelines for multi-turn dialogues where turn-level credit improves alignment.