
AT$^2$PO: Agentic Turn-based Policy Optimization via Tree Search

Intermediate
Zefang Zong, Dingwei Chen, Yang Li et al. Ā· 1/8/2026
arXiv Ā· PDF

Key Summary

  • AT2PO is a new way to train AI agents that work in several turns, like asking the web a question, reading the result, and trying again.
  • It grows a tree of possible next steps, but only branches where the AI is most unsure, so it explores smartly instead of randomly.
  • It fairly shares the final reward back to each turn, so the AI learns which specific steps helped or hurt the outcome.
  • It updates the AI policy one turn at a time (not one token or whole sequence), which makes learning steadier and better aligned with how agents actually act.
  • Across seven question‑answering benchmarks, AT2PO beats strong baselines by up to 1.84 percentage points on average.
  • The biggest gains show up on multi‑hop tasks that truly need several tool calls, where turn‑level learning shines.
  • The method is compatible with existing multi‑turn RL pipelines and can plug into others (it’s orthogonal to the search component).
  • An entropy (uncertainty) signal chooses where to expand the tree, boosting rollout diversity under a fixed compute budget.
  • Turn‑wise credit assignment works best when each turn’s value is used directly as its advantage.
  • There is extra compute for tree expansion and multiple iterations, but training becomes more stable and effective.

Why This Research Matters

Many real tasks unfold over several steps—searching, checking, and deciding again. Training at the wrong level (tokens or whole sequences) makes learning noisy, slow, and brittle. AT2PO matches how agents actually work by exploring where they’re unsure, fairly rewarding each turn, and updating the policy per turn. This alignment makes training more stable and efficient, improving accuracy under the same compute budget. In practice, that means smarter assistants that browse better, plan better, and follow instructions more reliably. It also lowers wasted tool calls and focuses compute on the most informative decisions. As agents move into workflows like research, coding, and web tasks, such turn-aware learning will be key.

Detailed Explanation

01 Background & Problem Definition

šŸž Hook: Imagine you’re playing a detective game. You ask a librarian for a clue, read the note, think a bit, ask again, and only at the end do you find out if you solved the mystery. Now, how should we train a detective to get better at that multi-step process?

🄬 Agentic Reinforcement Learning (ARL)

  • What it is: ARL teaches AI agents to act and think over several steps, using tools (like search) to reach a goal.
  • How it works:
    1. The agent observes the situation (question and history).
    2. It thinks and takes an action (e.g., call a tool or give an answer).
    3. It gets new observations (tool results) and repeats.
    4. At the end, it gets a reward (correct or not).
  • Why it matters: Without ARL, AIs act in one shot, which fails for tasks that need looking things up or planning. šŸž Anchor: A web agent that searches, reads, and refines before answering is an agentic RL setup.
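To make this loop concrete, here is a minimal sketch of a ReAct-style rollout in Python. The helper names (`llm_generate`, `search_tool`) are hypothetical stand-ins, not the paper's code:

```python
# Minimal sketch of an agentic RL rollout loop (hypothetical helpers, not the paper's code).

def rollout(question, llm_generate, search_tool, max_turns=8):
    """Run one multi-turn trajectory: think -> act -> observe, until a final answer."""
    history = [{"role": "user", "content": question}]
    turns = []
    for _ in range(max_turns):
        # 1. The agent observes the state (question + history) and produces an action.
        action = llm_generate(history)          # e.g. {"type": "search", "query": ...}
        turns.append({"state": list(history), "action": action})
        if action["type"] == "answer":
            break                               # 2. A final answer ends the episode.
        # 3. Tool call: the environment returns an observation for the next turn.
        observation = search_tool(action["query"])
        history += [{"role": "assistant", "content": str(action)},
                    {"role": "tool", "content": observation}]
    # 4. A single verifiable reward (e.g. Exact Match) arrives only at the end.
    return turns
```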

šŸž Hook: You know how some puzzles take several moves to solve, not just one?

🄬 Multi-turn Tasks

  • What: Problems that need multiple back-and-forth steps to succeed.
  • How:
    1. Turn 1: read the question; decide a search query.
    2. Turn 2: read results; decide the next query.
    3. Repeat until confident; then answer.
  • Why: If you force everything into one step, the agent misses important intermediate checks. šŸž Anchor: HotpotQA questions that need two Wikipedia pages are classic multi-turn tasks.

šŸž Hook: Think of a board game where your next move depends on where you are right now.

🄬 Markov Decision Process (MDP)

  • What: A way to describe decision-making where each step’s choice depends on the current state, and the next state depends on that choice.
  • How:
    1. State: the chat + tool outputs so far.
    2. Action: reason + tool call or final answer.
    3. Transition: new state with the tool’s reply.
    4. Reward: usually at the end (correct/incorrect).
  • Why: Without MDP framing, it’s hard to reason about good exploration and learning. šŸž Anchor: After you search once, the results become part of the new state for the next decision.
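As a rough illustration (field names are my own, not the paper's), each turn in this MDP can be captured by a small record like the one below:

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    """One decision step in the agent's MDP."""
    state: str              # chat history + tool outputs observed so far
    action: str             # reasoning + tool call, or the final answer
    observation: str = ""   # tool reply that becomes part of the next state
    reward: float = 0.0     # usually 0 until the terminal turn (sparse reward)

@dataclass
class Trajectory:
    """A complete multi-turn episode, from question to final answer."""
    turns: list = field(default_factory=list)
    final_reward: float = 0.0   # e.g. Exact Match on the final answer (1.0 or 0.0)
```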

šŸž Hook: You know how you tweak a recipe after tasting it, to make the next batch better?

🄬 Policy Gradient Methods

  • What: A family of techniques that adjust the agent’s decision rule (policy) to make good actions more likely next time.
  • How:
    1. Sample attempts from the current policy.
    2. Score them with rewards or advantages.
    3. Nudge the policy toward better actions using gradients.
  • Why: Without this, the agent can’t systematically improve its choices. šŸž Anchor: If a search query led to a correct answer, the policy becomes more likely to choose similar queries in the future.
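A bare-bones version of step 3, in the spirit of REINFORCE (purely illustrative; the paper's training uses clipped, GRPO-style objectives discussed later):

```python
import torch

def reinforce_loss(logprobs, advantages):
    """REINFORCE-style surrogate: push up the log-probability of actions with positive advantage.

    logprobs:   tensor of log pi(action | state) for each sampled step
    advantages: tensor of how much better than average each step turned out
    """
    # Minimizing this loss nudges the policy toward higher-advantage actions.
    return -(logprobs * advantages).mean()

# Toy usage with made-up numbers:
logprobs = torch.tensor([-1.2, -0.4, -2.0], requires_grad=True)
advantages = torch.tensor([1.0, -0.5, 0.8])
loss = reinforce_loss(logprobs, advantages)
loss.backward()   # gradients now point toward better actions
```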

šŸž Hook: Imagine a teacher who only tells you at the end of the semester if you passed. That’s tough feedback.

🄬 Sparse Rewards

  • What: Rewards that show up only at the end (like correct/incorrect), not at each step.
  • How:
    1. The agent takes many steps.
    2. Only at the end, it learns if the whole chain was right.
    3. It must guess which steps helped.
  • Why: Without a way to split the final reward across steps, the agent learns slowly and unfairly. šŸž Anchor: If you only tell the web agent ā€œright/wrongā€ at the end, it can’t tell which search turn mattered most.

The World Before: LLM agents could interleave thinking and tool-use (like ReAct), and RL with verifiable rewards (RLVR) could train them using programmatic checks (like Exact Match). But training treated the whole response as a flat token stream. Exploration was either linear (one path) or tree-like without smart priorities. Rewards came only at the end, making it hard to credit the right steps. And the learning objective operated at token- or sequence-level, not at the natural unit—turns.

The Problem: Three blockers stood out:

  1. Limited exploration diversity: agents wasted rollout budget exploring unimportant spots instead of uncertain, high-potential turns.
  2. Sparse credit assignment: only end rewards; no clear way to credit the right intermediate turns.
  3. Misaligned optimization: policy updates ignored the turn structure, causing unstable learning and poor alignment with how agents act.

Failed Attempts: Token-level or sequence-level objectives (GRPO/GSPO) either had high variance (token) or were too coarse (sequence). Tree rollouts without priority (random/heuristic) missed the most informative branching points. Auxiliary signals added complexity without fixing the root mismatch: decisions happen turn by turn.

The Gap: A turn-aware pipeline that explores where uncertainty is highest, splits the final reward turn by turn, and updates the policy at the same granularity as decisions.

Real Stakes: In daily life, assistants that browse, fill forms, compare products, or plan travel need multiple steps. If they explore poorly, mis-credit steps, or learn at the wrong granularity, they waste compute, make flaky plans, and learn slowly. A turn-level solution means faster, steadier improvement, more reliable answers, and better use of limited budgets.

02 Core Idea

šŸž Hook: You know how a good coach watches each quarter of a game, tries new plays where the team is unsure, and then praises or fixes the exact quarter that mattered?

🄬 The Aha! Moment

  • One sentence: Train agents at the level they actually decide—turns—by (1) growing a tree where uncertainty is highest, (2) sending the final reward back to the exact turns that helped, and (3) updating the policy one turn at a time.

Three Analogies:

  1. Treasure map: Only branch out at foggy spots (uncertainty), mark which forks led to treasure (credit), and update your map-reading skill per fork (turn-level update).
  2. Cooking class: Try variations only on the steps you’re least confident in (entropy-guided), judge which step made it tasty (credit per step), and tweak that specific step in the recipe (turn-based optimization).
  3. Team sport: Run extra practice on the plays you haven’t mastered (explore uncertain turns), award MVPs for the quarters that changed the game (credit assignment), and train position-by-position (turn optimization).

šŸž Anchor: A search agent facing a tough multi-hop question branches more on the confusing turns, credits the turns that fetched the key evidence, and then improves precisely those turns.

šŸž Hook: You know how a flashlight wastes battery if you shine it everywhere, but saves power by focusing on the darkest spots?

🄬 Entropy-Guided Tree Expansion

  • What: Expand the rollout tree at turns where the policy is most uncertain (high entropy) to get diverse, useful trajectories under the same budget.
  • How:
    1. Build a turn-level tree from initial trajectories.
    2. Measure uncertainty (entropy) per turn.
    3. Pick the top-K uncertain turns, with a small penalty to avoid over-branching one node.
    4. Branch from there and continue to the end.
  • Why: Without prioritizing uncertain turns, compute is wasted on easy or already-known paths, reducing diversity and quality. šŸž Anchor: If Turn 3’s search query is where the model hesitates most, that’s where we branch to try different queries.
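A minimal sketch of this selection rule, assuming each candidate node already carries an entropy estimate and a child count; the field names and the exact penalty form are illustrative, not the paper's:

```python
def select_expansion_nodes(nodes, k=6, sibling_penalty=0.1):
    """Pick the top-k most uncertain turns, lightly penalizing already-branched parents."""
    def score(node):
        # Higher entropy = more uncertain turn; the penalty discourages piling
        # all new branches onto the same node.
        return node["entropy"] - sibling_penalty * node["num_children"]
    return sorted(nodes, key=score, reverse=True)[:k]

# Toy example: turn 3 is uncertain and unexplored, so it gets branched first.
nodes = [
    {"id": "t1", "entropy": 0.3, "num_children": 2},
    {"id": "t3", "entropy": 1.1, "num_children": 0},
    {"id": "t2", "entropy": 0.9, "num_children": 3},
]
print(select_expansion_nodes(nodes, k=2))   # -> t3 first, then t2
```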

šŸž Hook: Imagine sharing a pizza among friends who actually helped cook it—each gets a fair slice.

🄬 Turn-wise Credit Assignment

  • What: A way to pass the final success/failure back to each turn, so each turn gets a value/advantage signal.
  • How:
    1. Score the leaves using the final outcome (e.g., Exact Match).
    2. Propagate scores backward through the tree to each turn (weighted by uncertainty or child values).
    3. Use each turn’s value as its learning signal (advantage).
  • Why: Without splitting the reward, the agent can’t tell which turns improved the outcome, slowing or misdirecting learning. šŸž Anchor: If only one search turn found the key Wikipedia sentence, that turn gets higher credit.

šŸž Hook: Picture grading each round of a tournament instead of treating the whole tournament as one giant score.

🄬 Agentic Turn-based Policy Optimization (ATPO)

  • What: A policy learning objective that computes importance weights and clipping per turn, aligning updates with turn-level credit.
  • How:
    1. For each turn, compute a turn-level importance ratio (how off-policy/on-policy it is).
    2. Clip per turn to avoid unstable jumps.
    3. Multiply by that turn’s advantage (from credit assignment).
    4. Update the policy with the sum over turns.
  • Why: Token-level is noisy; sequence-level is too coarse. Turn-level matches how agents really act and stabilizes training. šŸž Anchor: The model updates its habit for ā€œsecond search turnā€ using the second turn’s signal and ratio, not mixed with other turns.
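In PPO-like notation, the turn-level objective sketched above can be written roughly as follows (my paraphrase of the idea; the paper's exact formulation may differ in normalization and detail):

$$J(\theta) \approx \mathbb{E}\!\left[\sum_{t=1}^{T} \min\Big(r_t(\theta)\,A_t,\ \mathrm{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,A_t\Big)\right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}$$

where $r_t$ is the importance ratio for turn $t$, $A_t$ is that turn's advantage from credit assignment, and the clip keeps far-off-policy turns from producing unstable updates.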

Before vs After:

  • Before: Agents explored randomly or by rough heuristics, rewards were all-or-nothing, and token- or sequence-level updates made learning unstable.
  • After: Smart branching at uncertain turns; fair, fine-grained reward per turn; stable, aligned updates per turn.

Why It Works (intuition):

  • Focus exploration on where learning can change most (uncertain turns).
  • Turn credit bridges the gap from one final reward to many steps.
  • Turn-level optimization reduces variance (vs token) while keeping locality (vs sequence), so gradients are steadier and more relevant.

Building Blocks:

  • Turn-level tree that mirrors the agent’s real decisions.
  • Entropy as an importance hint for where to branch next.
  • Backward reward propagation to turns for advantages.
  • Turn-wise importance sampling and clipping to train robustly.

03 Methodology

At a high level: Input question → Build initial turn-level trees → Entropy-guided node sampling → Expand new branches → Turn-wise credit assignment → Turn-based policy optimization → Updated policy.

šŸž Hook: Think of exploring a maze. You start with a few paths, peek deeper where you’re most unsure, score which turns led to the exit, and then practice exactly those tricky turns.

🄬 Step A: Initialize Trees (Rollouts)

  • What happens: For each prompt, sample several complete multi-turn trajectories and store them as branches in a turn-level tree (each node = a turn: state + action).
  • Why this exists: A tree gives structure—so we can compare different choices at the same turn and share context.
  • Example: On a HotpotQA question, generate 10 ReAct-style trajectories; each becomes a path from root to leaf (final answer). šŸž Anchor: You have 10 different attempt paths for the same question; each turn node remembers what was thought, what was searched, and what came back.
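One way to picture the turn-level tree is the sketch below; the field names are invented for illustration and are reused in the credit-assignment sketch later:

```python
from dataclasses import dataclass, field

@dataclass
class TurnNode:
    """One turn in the rollout tree: the action taken here, plus bookkeeping."""
    action: str = ""                  # reasoning + tool call (or final answer) at this turn
    entropy: float = 0.0              # policy uncertainty at this turn (filled in Step B)
    value: float = 0.0                # credit propagated back to this turn (filled in Step D)
    children: list = field(default_factory=list)

def add_trajectory(root, actions):
    """Attach one sampled trajectory as a root-to-leaf branch under the shared question root."""
    node = root
    for action in actions:
        child = TurnNode(action=action)
        node.children.append(child)
        node = child
    return node   # the leaf, where the final outcome reward will be scored
```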

šŸž Hook: You know how you ask for help on the parts you’re most confused about?

🄬 Step B: Measure Turn Uncertainty (Entropy)

  • What happens: Compute policy entropy per turn node (higher = less certain), optionally normalize and apply a small penalty if a parent already has many children.
  • Why this exists: To spend branching budget where exploring new options is likely to teach us the most.
  • Example: If the model’s probability spread is flat for Turn 2 (can’t decide which query), entropy is high; if it’s peaky (very sure), entropy is low. šŸž Anchor: We mark the top 6 most uncertain turns across the tree for focused branching.
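A common way to estimate per-turn uncertainty is from the token log-probabilities the policy produced for that turn, roughly as below (a Monte-Carlo proxy; the paper's normalization may differ):

```python
def turn_entropy(token_logprobs):
    """Average negative log-probability of the generated tokens in one turn.

    A common Monte-Carlo proxy for the policy's entropy at that turn:
    confident (peaky) generations give values near 0, hesitant ones give larger values.
    """
    if not token_logprobs:
        return 0.0
    return -sum(token_logprobs) / len(token_logprobs)

# A hesitant search query vs. a confident one:
print(turn_entropy([-2.3, -1.9, -2.7]))   # ~2.3  -> good candidate for branching
print(turn_entropy([-0.1, -0.05, -0.2]))  # ~0.12 -> the model is already sure here
```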

šŸž Hook: When exploring a forest, you split paths at the most mysterious forks first.

🄬 Step C: Entropy-Guided Tree Expansion

  • What happens:
    1. Select top-K high-entropy nodes.
    2. Fork the state at each selected node.
    3. From that fork, regenerate the rest of the trajectory to a new leaf.
    4. Repeat for L rounds.
  • Why this exists: It creates diverse, high-quality rollouts under the same token budget; diversity comes from branching where it counts.
  • Example with data: Start with M=10 branches; in each of L=2 rounds, add K=6 new branches; in total you get M + LĀ·K leaves (here, 22). Because most forks happen mid-trajectory, you get more coverage per token spent than by regenerating full paths from scratch. šŸž Anchor: If Turn 3 is tricky, we try multiple different Turn‑3 queries and see which downstream results work best.
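A compact sketch of the expansion loop, with `select_nodes` standing in for the entropy-guided selection (e.g. the `select_expansion_nodes` sketch earlier) and `regenerate_from` for re-rolling the trajectory from a forked turn; both are hypothetical helpers:

```python
def expand_tree(tree, select_nodes, regenerate_from, rounds=2, k=6):
    """Entropy-guided expansion: in each round, branch from the k most uncertain turns."""
    for _ in range(rounds):
        # 1. Pick the top-k uncertain turn nodes anywhere in the current tree.
        for node in select_nodes(tree, k):
            # 2. Fork the state at that turn and regenerate the rest of the trajectory
            #    to a new leaf, adding one new branch to the tree.
            regenerate_from(tree, node)
    return tree

# Budget check from the example: M=10 initial branches, L=2 rounds, K=6 branches per round.
M, L, K = 10, 2, 6
print(M + L * K)   # -> 22 leaves in total
```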

šŸž Hook: Imagine clapping at the exact scenes in a play that made the ending great.

🄬 Step D: Turn-wise Credit Assignment

  • What happens:
    1. Score leaves with the final outcome (e.g., Exact Match = 1/0, with format checks).
    2. Normalize rewards across the group (GRPO-style) to stabilize learning.
    3. Propagate values back to ancestor turns (e.g., child-weighted by entropy or child values).
    4. Define each turn’s advantage simply as its value (works best in experiments).
  • Why this exists: With only end rewards, learning is blind; this spreads the supervision to the exact turns that mattered.
  • Example: If one branch gets EM=1 and siblings get 0, earlier nodes on the winning path receive higher value; siblings get less. šŸž Anchor: The turn that issued the query pulling the key paragraph gets the biggest learning signal.
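A simplified version of the backward pass, reusing the `TurnNode` sketch from Step A. The paper propagates values with child weights (informed by entropy or child values); a plain mean keeps this sketch readable:

```python
def assign_credit(node):
    """Propagate rewards from scored leaves back to every ancestor turn.

    Leaves already hold their (group-normalized) outcome reward in `value`;
    each internal turn takes the average of its children's values, so turns
    on winning paths end up with higher value than their siblings.
    """
    if node.children:
        node.value = sum(assign_credit(child) for child in node.children) / len(node.children)
    return node.value

# Per the paper's ablation, the advantage of each turn is then simply its propagated value.
```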

šŸž Hook: Picture weighing how much to trust each replay based on how closely it matches your current style.

🄬 Step E: Agentic Turn-based Policy Optimization (ATPO)

  • What happens:
    1. For each token in a turn, compute a turn-level importance weight (how likely this turn is under the new policy vs old).
    2. Clip that weight per turn to avoid huge updates from off-policy turns.
    3. Multiply by the turn’s advantage from Step D.
    4. Sum across turns/tokens; update the policy.
  • Why this exists: Token-level is too noisy (ratios vary wildly); sequence-level is too blunt (one weight for everything). Turn-level is the sweet spot for agents.
  • Example: If Turn 2 is far off-policy, its ratio is clipped, so it won’t destabilize training; Turn 4 may stay unclipped and update strongly if it’s on-policy and helpful. šŸž Anchor: Each round in the conversation gets its own fair, stable training push.
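Putting steps 1-4 together, a hedged sketch of a turn-level clipped loss in PyTorch (shapes, masking, and the exact definition of the turn ratio are simplified relative to a real trainer):

```python
import torch

def atpo_loss(token_logp_new, token_logp_old, turn_slices, turn_advantages, eps=0.2):
    """Turn-level clipped surrogate loss (a sketch, not the paper's exact implementation).

    token_logp_new / token_logp_old: 1-D tensors of per-token log-probs for one trajectory
    turn_slices:     list of (start, end) token index pairs, one per turn
    turn_advantages: list of per-turn advantages from credit assignment
    """
    losses = []
    for (start, end), adv in zip(turn_slices, turn_advantages):
        # 1. One length-normalized importance ratio per turn (normalization is a design choice here).
        delta = (token_logp_new[start:end] - token_logp_old[start:end]).mean()
        ratio = torch.exp(delta)
        # 2. Clip the turn ratio, 3. weight by that turn's advantage.
        unclipped = ratio * adv
        clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * adv
        losses.append(torch.min(unclipped, clipped))
    # 4. Sum over turns (negated: we minimize the loss to maximize the objective).
    return -torch.stack(losses).sum()
```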

šŸž Hook: You’ve heard ā€œadvantage,ā€ ā€œimportance sampling,ā€ and ā€œclippingā€ā€”let’s keep them simple.

🄬 Mini-Concept: Advantage

  • What: A turn’s extra goodness compared to average; here we set advantage = that turn’s value.
  • How: Compute value from propagated rewards; use it directly as the learning signal.
  • Why: Using the turn value V directly worked best; difference-based signals (like TD errors or global deltas) underperformed. šŸž Anchor: A high-value turn means ā€œdo more like this turn.ā€

🄬 Mini-Concept: Importance Sampling & Clipping (Turn-level)

  • What: A way to correct for using old-policy data when updating a new policy, while preventing unstable jumps.
  • How: Compute a per-turn ratio (new/old likelihood); clip it within a safe range; scale gradients by this.
  • Why: Keeps updates fair (correcting bias) and stable (limiting extremes). šŸž Anchor: If a replayed turn is very unlikely under the new policy, we cap its influence.

The Secret Sauce:

  • Matching the unit of learning (turns) to the unit of decision-making (also turns) reduces variance, respects structure, and stabilizes training.
  • Smart exploration (entropy-guided) plus fair supervision (turn credit) plus aligned updates (ATPO) is a coherent recipe that compounds gains.

04 Experiments & Results

The Test: The team trained search-augmented agents that answer questions by calling a lightweight Wikipedia search tool. They measured Exact Match (EM): did the final boxed answer exactly match the ground truth? This setting ensures rewards are verifiable and consistent.
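For concreteness, a toy version of such a verifiable reward is sketched below; the boxed-answer convention follows the description above, but the exact format tags and normalization in the paper may differ:

```python
import re
import string

def exact_match_reward(response, gold_answer):
    """Return 1.0 only if the response is well-formatted and the boxed answer matches exactly."""
    match = re.search(r"\\boxed\{(.+?)\}", response)   # format check: a final boxed answer must exist
    if match is None:
        return 0.0

    def normalize(text):
        text = text.lower().strip()
        return text.translate(str.maketrans("", "", string.punctuation))

    return 1.0 if normalize(match.group(1)) == normalize(gold_answer) else 0.0

print(exact_match_reward(r"... so the answer is \boxed{Paris}.", "Paris"))   # 1.0
print(exact_match_reward("I think it's Paris.", "Paris"))                    # 0.0 (format check fails)
```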

The Competition: AT2PO was compared with widely used RLVR and agentic RL baselines: GRPO, DAPO, GSPO, AEPO, and Tree-GRPO, plus a ReAct-only prompting baseline. Backbones included Qwen3-4B, Qwen3-8B, and Qwen2.5-7B. All methods ran under comparable token budgets (except DAPO’s dynamic sampling, which can use more budget).

The Benchmarks:

  • Multi-Hop QA: HotpotQA, 2WikiMultihopQA, MuSiQue, Bamboogle.
  • Single-Hop QA: Natural Questions (NQ), TriviaQA, PopQA.

Scoreboard with Context:

  • Across seven datasets and three model sizes, AT2PO achieved the best average results in most settings, improving up to +1.84 percentage points on average over the strongest baseline.
  • Think of it like class grades: when others averaged a solid B to B+, AT2PO nudged into Aāˆ’ territory, consistently, and especially in the hardest subjects (multi-hop).
  • Multi-hop gains were largest because those tasks truly need multiple tool calls and careful turn-by-turn decision-making; AT2PO’s turn-level design pays off there.
  • On some single-hop sets, DAPO slightly edged AT2PO, but it used a larger rollout budget. Under equal budgets, AT2PO comes out on top on average.

Training Dynamics and Stability:

  • Token-level methods (e.g., GRPO) showed early entropy collapse (the model becomes overconfident too soon), harming exploration.
  • Some methods with soft clipping (e.g., AEPO) avoided early collapse but saw entropy drift later, hurting stability.
  • AT2PO maintained the most stable entropy curve, balancing exploration and learning pressure over long training horizons.

Surprising Findings:

  • Advantage choice matters: Using each turn’s value directly as its advantage worked best. Popular difference-style advantages (e.g., turn minus parent or minus root) underperformed in this agentic setting.
  • Credit aggregation matters: Child-weighted backward propagation (informed by entropy/child values) edged out simple child-mean or leaf-mean averages.
  • Turn entropy (a diagnostic measuring variation of update sizes across turns) stayed well below 1.0, confirming that different turns genuinely need different update strengths—supporting the need for turn-level optimization.

Ablations (What if we remove pieces?):

  • Swapping GRPO loss for ATPO (keeping random tree search) already improved results, showing the value of aligning the learning unit with turns.
  • Adding Entropy-Guided Expansion on top of ATPO gave a clear bump, proving that smart exploration improves rollout quality under fixed budgets.
  • Finishing with Turn-wise Credit Assignment produced the overall best scores, showing that fair, fine-grained supervision per turn unlocks the full benefit.

Implementation Notes and Practicalities:

  • Exact Match rewards were used, with a format constraint (reasoning and final answer tags) to ensure consistent evaluation.
  • A token-in–token-out pipeline (keeping exact token IDs across rollout and training) avoided retokenization drift, stabilizing training.
  • Compute: Tree expansion introduces extra overhead due to sequential expansion rounds, but the added stability and performance gains made it worthwhile in experiments.

Bottom line: The numbers and curves tell a coherent story. When you explore where you’re most unsure, give each turn its fair slice of credit, and train at the turn level, the agent learns faster, explores better, and answers more accurately—especially on multi-hop problems that truly need several tool calls.

05 Discussion & Limitations

Limitations:

  • Extra compute: Tree expansion with multiple rounds (pick uncertain turns, branch, continue) adds overhead compared to a single linear rollout.
  • Latency and parallelism: Some parts are sequential (choose nodes, then expand), which can slow wall-clock time unless carefully parallelized.
  • Reward design dependence: Results rely on verifiable final rewards (e.g., EM with format checks). Tasks without such clear checks may need alternative reward shaping.
  • Tool and context costs: More branching can increase tool calls and context length, which may be expensive in practical deployments.

Required Resources:

  • A stable RL training stack capable of storing and replaying token IDs (to avoid retokenization drift), running tree rollouts, and computing per-turn signals.
  • Sufficient GPU memory/time to handle multiple branches per prompt and long responses (multi-hop QA uses several turns).
  • A dependable retrieval tool (here, a local Wikipedia search with an embedding retriever) and deterministic evaluation.

When NOT to Use:

  • Single-turn or very short tasks where a single search or none is enough—turn-level complexity may not pay off.
  • Extremely tight compute budgets where any tree branching is infeasible.
  • Tasks with unverified, noisy rewards and no way to check correctness; turn-wise credit derived from unreliable leaves can mislead training.

Open Questions:

  • Beyond QA: How well does AT2PO generalize to other agentic domains (coding tools, spreadsheets, web navigation with complex states)?
  • Better node scoring: Are there uncertainty measures beyond entropy (e.g., ensemble disagreement) that improve branching even more?
  • Off-policy reuse: Can we safely reuse turns across prompts or tasks to reduce rollout cost while preserving stability?
  • Richer rewards: How to extend turn-wise credit when partial correctness or human feedback is available, not just binary EM?
  • Parallel tree growth: Can we design parallelizable tree expansion strategies that retain the benefits without sequential bottlenecks?

06 Conclusion & Future Work

Three-Sentence Summary: AT2PO trains AI agents at the level they actually make decisions: turns. It grows rollout trees where uncertainty is highest, fairly shares the final reward back to each turn, and updates the policy per turn for stable, aligned learning. This combination reliably improves performance—especially on multi-hop tasks that truly need several tool calls.

Main Achievement: A unified, turn-aware framework—Entropy-Guided Tree Expansion + Turn-wise Credit Assignment + Agentic Turn-based Policy Optimization—that consistently beats strong baselines and stabilizes exploration and training.

Future Directions: Apply AT2PO to broader agent arenas (web navigation, tool-rich coding, multi-modal research); explore smarter node selection beyond entropy; design efficient, parallel tree growth; and extend reward models for partial credit or human-in-the-loop feedback.

Why Remember This: When tasks unfold over turns, you should learn over turns. By matching exploration, supervision, and optimization to the agent’s natural decision unit, AT2PO turns sparse, end-only rewards into clear lessons for each step—and that simple alignment unlocks steadier learning and better results.

Practical Applications

  • Web research assistants that plan multi-hop searches and cite sources more reliably.
  • Customer support agents that gather missing info over several turns before resolving a ticket.
  • Coding copilots that decide when to search docs, run tests, and apply fixes across multiple steps.
  • Data analysts that query databases, refine questions, and validate results iteratively.
  • Travel planners that compare flights, hotels, and routes over several tool calls to optimize cost and time.
  • Education tutors that ask probing questions, look up references, and adapt hints turn by turn.
  • E-commerce assistants that filter products, check specs, and confirm preferences across multiple turns.
  • Healthcare triage bots that gather symptoms, review guidelines, and suggest next steps carefully.
  • Form-filling and automation agents that navigate multi-page workflows with tool calls and validations.
  • Enterprise retrieval agents that sequence searches across multiple knowledge bases to produce verified answers.
#Agentic Reinforcement Learning Ā· #Turn-level Optimization Ā· #Tree Search Ā· #Entropy-Guided Expansion Ā· #Credit Assignment Ā· #Policy Gradient Ā· #Importance Sampling Ā· #Clipping Ā· #Sparse Rewards Ā· #Retrieval-Augmented Generation Ā· #Multi-hop Question Answering Ā· #Exact Match Ā· #LLM Agents Ā· #Exploration Ā· #Training Stability