
InfoPO: Information-Driven Policy Optimization for User-Centric Agents

Intermediate
Fanqi Kong, Jiayi Zhang, Mingyi Deng et al. · 2/28/2026
arXiv

Key Summary

  • Many real-life requests to AI helpers are vague, so agents must ask good questions before acting.
  • Old training methods gave a single end-of-task score, making it hard to tell which turn helped and which didn’t.
  • InfoPO teaches agents to value each conversational turn by measuring how much new feedback changes the agent’s very next decision.
  • It does this with a safe counterfactual: compare the policy with the feedback versus with a masked “no info” version and reward the difference.
  • A smart gate then blends this turn-level information reward with the actual task outcome, turning up information when outcomes are unhelpful and turning it down when outcomes are clear.
  • Across intent clarification, coding with a collaborator, and tool-using decisions, InfoPO beats strong prompting and RL baselines by about 14–16%.
  • InfoPO starts learning earlier, is more stable when group outcomes are tied, and learns to clarify early, then act confidently.
  • Theory shows the turn-level reward equals conditional mutual information, and enough total information gain is necessary to succeed.
  • It generalizes beyond user chats to environment-interactive tasks (like Sokoban/WebShop) and is robust to different user simulators.
  • Compute overhead is modest (about 1.6× wall clock in practice) and requires no extra environment calls.

Why This Research Matters

Real assistants must ask the right questions before acting, or they waste time and make mistakes. InfoPO gives agents a clear, per-turn way to learn which questions actually help, even when final outcomes don’t yet separate good from bad attempts. This makes training faster and steadier, producing agents that clarify early and act confidently later. It works across tasks like travel planning, coding with someone, and customer support with tools. The method is theory-backed, practical to implement, and robust when the “user” changes style. As a result, we can build AI helpers that feel more thoughtful, save users time, and improve success rates in everyday workflows.

Detailed Explanation


01Background & Problem Definition

Let’s first set the stage and introduce the key ideas using simple stories.

🍞 Hook: You know how when a friend says, “Book me a flight next week,” you can’t do it until you ask for the city, date, and budget? 🥬 The Concept (User-Centric Agents): These are AI helpers designed to talk with people, ask for missing details, and then do things. How it works: (1) Listen to the user, (2) Ask clarifying questions, (3) Use tools or take actions, (4) Check results, (5) Repeat until done. Why it matters: Without asking the right questions, the agent guesses and often fails. 🍞 Anchor: An agent can’t buy the right ticket if it doesn’t know “from where” or “how much you can spend.”

🍞 Hook: Imagine playing a turn-by-turn board game where each move depends on what just happened. 🥬 The Concept (Multi-Turn Interactions): This means solving tasks through back-and-forth steps. How it works: (1) Agent says/does something, (2) User/environment responds, (3) Agent updates its plan, (4) Repeat. Why it matters: Without multi-turn thinking, the agent can’t adjust or refine its plan. 🍞 Anchor: A travel chatbot that first asks your budget, then checks hotels, then confirms booking is doing multi-turn interaction.

🍞 Hook: Think of getting a single grade at the end of a long group project—hard to know who helped and when. 🥬 The Concept (Credit Assignment Problem): It’s hard to tell which specific step in a long process caused success or failure. How it works: (1) Many steps happen, (2) You get one final reward, (3) It’s unclear which steps deserve credit. Why it matters: If we can’t reward the right steps, the agent learns slowly or learns the wrong lessons. 🍞 Anchor: If the agent asked a great question early but messed up the final tool call, a single “fail” score hides that the question was helpful.

🍞 Hook: Imagine a teacher who only gives a final score but no comments on your answers along the way. 🥬 The Concept (Reinforcement Learning, RL): A way to train agents by giving rewards or penalties based on what they do. How it works: (1) Agent tries actions, (2) Gets rewards, (3) Updates policy to do better next time. Why it matters: Without the right rewards at the right time, RL agents struggle on long, chatty tasks. 🍞 Anchor: If a chatbot only gets a score at the very end of a 10-turn conversation, it doesn’t know which question helped.

The world before InfoPO: LLM agents could chat, use tools, and complete simple tasks, but they struggled when user requests were underspecified. Many RL methods for agents used group-relative policy optimization (GRPO), which compares several rollouts for the same prompt to reduce noise. However, they mostly computed a single “trajectory-level” score. This makes learning fragile, especially when the group’s outcomes are all similar (like everyone failing in the same way). Even worse, “honorable failures” (the agent asked great clarifying questions but made one bad final call) gave a zero reward, so good questioning wasn’t reinforced.

Failed attempts and why they didn’t fix it:

  • Trajectory-only rewards: Too coarse. They ignore which turn helped.
  • Task-specific shaping and heuristics: Hard to generalize across domains; lots of manual design.
  • Process reward models (PRMs): Can help, but require extra training data/models and may still miss general, task-agnostic information value.
  • Curiosity/novelty bonuses: Encourage exploration but don’t directly measure if the new info actually changed what the agent does next.

The gap: We needed a principled, general way to reward useful information gathering at the level of each turn, without extra environment calls, and still guide the agent toward the final goal.

The real stakes: In daily life, your assistant should ask the right questions early—about your budget, your preferences, or the exact issue with your phone—before taking actions that cost time or money. In coding collaboration, it should clarify requirements before writing code. In customer support, it should diagnose quickly and then fix. Without good turn-level learning signals, assistants waste time, frustrate users, and fail more often.

Enter InfoPO: It treats multi-turn interaction as “actively reducing uncertainty.” It asks: Did the user’s feedback at this turn actually change what the agent is likely to do next? If yes, reward that turn. Then it blends this information reward with the final task outcome in an adaptive way. This teaches agents to clarify first, then act confidently—exactly how helpful humans behave.

02Core Idea

Let’s introduce the main ideas, each in the Sandwich pattern, then connect them.

🍞 Hook: You know how detectives keep asking targeted questions to shrink the list of suspects? 🥬 The Concept (Information-Driven Policy Optimization, InfoPO): A training method that rewards an agent for gathering information that actually changes its next move, while still aiming at the final task. How it works: (1) After each turn’s feedback, compare the policy’s next-action likelihood with and without that feedback (a masked counterfactual), (2) Give a turn-level “information gain” reward based on the difference, (3) Fuse this with the task’s outcome using an adaptive gate, (4) Update the policy. Why it matters: Without valuing informative turns, agents don’t learn to clarify early and can get stuck when end rewards don’t differentiate rollouts. 🍞 Anchor: If learning you’re “under $250 budget” makes the agent pick different hotels next turn, that turn gets positive credit.

🍞 Hook: Imagine watching two versions of the same scene—one with the clue, one where the clue is blanked out—and asking, “Did that clue change what I’d do next?” 🥬 The Concept (Turn-Level Counterfactual Information Gain Reward): A per-turn reward that measures how much the latest feedback shifts the agent’s next-action distribution compared to a “no info” placeholder. How it works: (1) Take the real conversation with feedback, (2) Create a copy where that feedback is replaced by a neutral mask, (3) Compute how the probability of the real next action changes between the two, (4) Use that difference as the turn’s reward. Why it matters: Without counterfactual comparison, you can’t tell if a turn’s feedback truly mattered or if the agent would act the same anyway. 🍞 Anchor: With the answer “the device shows 2G and data off,” the agent’s next-step probabilities (like “enable data” vs. “change network mode”) shift a lot—so the turn is rewarded.

🍞 Hook: Think of a mixing board where you turn up the “information” slider when the song’s main melody is too quiet, and turn it down when the melody is loud and clear. 🥬 The Concept (Variance-Gated Fusion): A way to blend information rewards and outcome rewards by looking at how different the outcomes are within a rollout group. How it works: (1) Measure variance of end-task rewards across the group, (2) If variance is tiny (all similar), it’s hard to learn from outcomes—turn up the weight on information gain, (3) If variance is big (discriminative), trust outcomes more—turn down info weight, (4) Combine to get a single training signal. Why it matters: Without adaptive blending, training can stall when outcomes don’t distinguish good from bad conversations. 🍞 Anchor: Early in training, many conversations tie at “fail,” so the gate leans on info gain; later, as some succeed, the gate shifts focus to winning strategies.

🍞 Hook: In science class, to see what really causes a change, you change only one thing at a time. 🥬 The Concept (Causal Isolation via Teacher Forcing and Masking): Ensuring that the measured change in the next step comes only from the feedback, not random generation noise. How it works: (1) Compare log-probabilities for the same next action tokens with vs. without the real feedback, (2) Keep everything else fixed, (3) Attribute differences to that feedback. Why it matters: Without isolating causes, you might wrongly reward random fluctuations. 🍞 Anchor: You read the same future answer the agent actually produced, but test it under “with feedback” vs. “with placeholder”—a fair A/B test.

🍞 Hook: Picture water flowing through pipes: information flows from what you observe to what you decide. 🥬 The Concept (Directed Information Flow and Mutual Information Intuition): The turn reward, in expectation, equals how much the feedback tells you about your next action (conditional mutual information), and summed over turns, it equals the directed information from observations to decisions. How it works: (1) Each turn’s reward measures info from feedback to next action, (2) Sum across turns to get total info flow, (3) Theory shows a minimum total info is necessary to reliably solve tasks with hidden goals. Why it matters: Without enough information flow, success is mathematically unlikely—you can’t guess the hidden goal. 🍞 Anchor: If there are M hidden intents, you must gather enough bits of info to narrow to the right one; InfoPO’s reward tracks that progress.
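The intuition above can be written compactly. The symbols here are ours (h_t for the dialogue history, o_t for the turn-t feedback, a_{t+1} for the agent's next action, ∅ for the neutral mask), so they may not match the paper's exact notation:

```latex
% Turn-level counterfactual information-gain reward:
r_t^{\mathrm{info}} \;=\; \log \pi_\theta\!\left(a_{t+1} \mid h_t, o_t\right)
\;-\; \log \pi_\theta\!\left(a_{t+1} \mid h_t, \varnothing\right)

% In expectation (idealizing the masked policy as the marginal over a_{t+1}),
% this is the conditional mutual information between feedback and next action:
\mathbb{E}\!\left[r_t^{\mathrm{info}}\right] \;=\; I\!\left(o_t \,;\, a_{t+1} \mid h_t\right)

% Summed over turns, the rewards measure the directed information
% flowing from observations to decisions:
\sum_{t} I\!\left(o_t \,;\, a_{t+1} \mid h_t\right)
```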

Multiple analogies for the key idea:

  1. Detective analogy: Good questions that change your suspect ranking get points; wild guesses don’t.
  2. Video game fog analogy: Each useful question lifts more fog, changing where you walk next; the reward measures how the path changes when the fog lifts.
  3. Cooking analogy: Tasting the soup and adjusting seasoning changes your next cooking action; the reward is bigger if that taste caused a different choice.

Before vs. After:

  • Before: Agents were often rewarded only at the end; they didn’t know which turn helped. They overfit to short scripts, collapsed when outcomes tied, and missed early clarifications.
  • After: Agents get per-turn credit for informative feedback, learn faster when outcomes tie, and naturally adopt “clarify-then-act” strategies.

Why it works (intuition, no equations): The more a piece of feedback changes the agent’s next-step decision, the more uncertainty it removed. Measuring that change is a clean, task-agnostic signal. Blending it with outcome rewards ensures the agent still aims for real success, not just asking questions forever. The variance gate automatically balances these two over training.

Building blocks:

  • Counterfactual masking: Replace feedback with a neutral placeholder to compute a fair comparison.
  • Teacher forcing: Evaluate probabilities on the actual next action tokens to keep the test stable.
  • Group-relative normalization: Standardize rewards within a group to control scale and stabilize learning.
  • Variance-gated fusion: Automatically tune the blend between information and outcomes.
  • PPO-style update with KL to a reference: Keep updates stable and avoid drifting too far, too fast.
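The group-relative normalization building block is easy to sketch. Here is a minimal version in plain Python; the function name and the epsilon guard are our own choices, and the paper's exact normalization details may differ:

```python
import math

def group_normalize(scores, eps=1e-8):
    """Standardize rewards within a rollout group (z-score).

    Rewards from different tasks and turns live on different scales;
    centering and scaling within the group keeps advantages comparable
    and the policy updates stable.
    """
    mean = sum(scores) / len(scores)
    var = sum((s - mean) ** 2 for s in scores) / len(scores)
    std = math.sqrt(var)
    return [(s - mean) / (std + eps) for s in scores]

# Five rollouts with similar outcomes: standardized values center near
# zero, and the unusually good rollout stands out.
adv = group_normalize([0.40, 0.42, 0.41, 0.45, 0.42])
print([round(a, 2) for a in adv])
```

The same function applies to outcome scores across a group and to turn-level info gains across the group's valid turns.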

03Methodology

High-level recipe: Input (prompt + environment) → Roll out a group of conversations → Compute two signals (task outcome per trajectory, info gain per turn) → Normalize each → Blend with a variance gate → Update the policy with PPO-style clipping and KL regularization → Output: a better agent that asks smarter questions and acts decisively.

Step-by-step, with what/why/examples:

  1. Grouped rollouts (collect experiences)
  • What: For each task prompt, sample G trajectories (conversations) from the current agent until a turn limit or termination.
  • Why: Comparing similar conversations reduces noise and gives a stable baseline (group-relative learning).
  • Example: For the travel prompt “Hotel in Seattle next weekend,” we generate 5 different conversations in parallel—some ask about budget first, some don’t.
  2. Outcome signal (trajectory-level)
  • What: Score each full trajectory with the benchmark’s external metric (success, accumulated reward, pass rate, etc.).
  • Why: This anchors learning to real goals—finishing the booking, passing unit tests, or resolving the customer’s issue.
  • Example: If the agent booked correctly, success=1; if coding tests passed 7/10, score=0.7.
  3. Turn-level information gain (intrinsic signal)
  • What: For each valid turn t, compute how much the latest feedback changes the policy’s probabilities for the very next action tokens.
  • How: (a) Take the real history including feedback; compute log-prob of the actual next action tokens, (b) Take a counterfactual history where the feedback is replaced by a neutral placeholder (like “No information found.”), compute the log-prob for the same tokens, (c) Subtract to get the shift. Average over the next action segment.
  • Why: This isolates the impact of that feedback on the next decision—dense, turn-level credit without extra environment calls.
  • Example: The user says “Keep the hotel under $250.” With feedback: the policy gives high probability to “search under $250.” Without it: probabilities are spread out or focus on wrong actions. The difference is the reward for that turn.
  4. Group-relative normalization (stabilize scales)
  • What: Normalize outcome scores across the group to get outcome advantages; normalize turn-level info gains across valid turns in the group to get info advantages.
  • Why: Different tasks and turns have different scales. Normalization keeps learning stable and fair.
  • Example: If all five rollouts scored around 0.4–0.45, their standardized values center near zero; turns with unusually high info gain also stand out after standardization.
  5. Variance-gated fusion (adaptive blending)
  • What: Compute the standard deviation of outcomes in the group. If outcomes barely differ (near-zero variance), increase the weight on info gain; if they differ a lot, reduce it.
  • Why: Early on, outcomes often tie (everyone failing similarly), so info gain keeps learning moving. Later, outcomes become informative, so we prioritize them.
  • Example: At the start, most rollouts fail—gate turns up info advantage. By mid-training, some succeed—gate shifts weight back to outcomes.
  6. Token-level assignment and update (PPO + KL)
  • What: Broadcast the fused advantage to the tokens of each assistant response segment and run a PPO-style clipped update, plus a KL penalty to a reference model.
  • Why: Token-level updates align with how LLMs generate text; clipping and KL keep updates stable and prevent drifting too far from a safe distribution.
  • Example: The advantage for the “ask budget” turn is attached to those generated tokens; the policy is nudged to make such turns more likely next time, but clipping and KL avoid overcorrection.
  7. No extra environment calls for counterfactuals
  • What: Counterfactual evaluations are done offline with teacher forcing; no new actions are executed in the environment.
  • Why: Saves time and avoids altering the environment state. Overhead is typically less than 2× wall-clock (around 1.63× in runs).
  • Example: For each turn, we just run two forward passes (with feedback vs. with placeholder) on the same next-action tokens.
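The turn-level information gain step reduces to two teacher-forced scoring passes over the same next-action tokens. The sketch below fakes the language model with a lookup table just to show the arithmetic; in practice both calls would be forward passes of the policy LLM, and names like `score_tokens` are ours, not the paper's:

```python
def score_tokens(context, action_tokens, logprob_table):
    """Teacher forcing: return the policy's log-prob for each token of
    the *actual* next action under a given context. A lookup table
    stands in for an LLM forward pass here."""
    return [logprob_table[(context, tok)] for tok in action_tokens]

def turn_info_gain(history, feedback, mask, action_tokens, logprob_table):
    """Counterfactual information-gain reward for one turn: mean
    log-prob of the real next action with the feedback, minus the same
    quantity with the feedback replaced by a neutral placeholder.
    No new environment calls are needed."""
    with_fb = score_tokens(history + (feedback,), action_tokens, logprob_table)
    without_fb = score_tokens(history + (mask,), action_tokens, logprob_table)
    n = len(action_tokens)
    return sum(w - wo for w, wo in zip(with_fb, without_fb)) / n

# Toy numbers: -1.0 log-prob per token with feedback, -1.5 with the
# placeholder, over a 5-token next action.
history = ("Book a hotel in Seattle",)
feedback = "Keep it under $250."
mask = "No information found."
tokens = ["search", "hotels", "price", "<", "250"]
table = {}
for t in tokens:
    table[(history + (feedback,), t)] = -1.0
    table[(history + (mask,), t)] = -1.5

gain = turn_info_gain(history, feedback, mask, tokens, table)
print(gain)  # +0.5: the feedback made the observed next action more likely
```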

Concrete micro-example (toy numbers):

  • Next action length = 5 tokens. With feedback, average log-prob = −1.0 per token; with placeholder, average log-prob = −1.5. Info gain = (−1.0) − (−1.5) = +0.5 per token, summed/averaged → +0.5 for the turn. After normalization within the group, suppose this becomes +1.2. If outcome variance is tiny, gate weight might be 0.8, so blended advantage adds 0.96 (=0.8×1.2) to that turn’s token advantages.
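The gating step of the micro-example can be checked in code. The `exp(-std)` form below is an illustrative stand-in for the paper's variance gate (the real gating function may differ); the blending helper simply adds the weighted info advantage to the outcome advantage:

```python
import math

def gate_weight(outcome_scores, scale=1.0):
    """Weight on the info-gain advantage: high when outcomes in the
    group barely differ (near-zero variance, so outcomes are
    uninformative), low when they clearly separate rollouts.
    The exp(-std/scale) shape is our illustrative choice."""
    mean = sum(outcome_scores) / len(outcome_scores)
    var = sum((s - mean) ** 2 for s in outcome_scores) / len(outcome_scores)
    return math.exp(-math.sqrt(var) / scale)

def fused_advantage(outcome_adv, info_adv, w):
    """Blend the normalized outcome and info-gain advantages."""
    return outcome_adv + w * info_adv

# All rollouts tie at "fail": zero variance, so the gate is at its
# maximum and info gain drives learning.
w_tied = gate_weight([0.0, 0.0, 0.0, 0.0, 0.0])

# Micro-example numbers: normalized info advantage +1.2, gate weight
# 0.8, so the turn's tokens receive an extra 0.8 * 1.2 = 0.96.
print(fused_advantage(0.0, 1.2, 0.8))
```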

What breaks without each step:

  • Without grouped rollouts: Noisy baselines; unstable learning.
  • Without outcome signal: Agent might keep asking forever and not finish tasks.
  • Without info gain: Sparse rewards; slow learning; misses clarifications.
  • Without normalization: A few extreme turns dominate; unstable updates.
  • Without gating: Training stalls when outcomes tie, or drifts when outcomes dominate too soon.
  • Without PPO+KL: Policy may collapse or diverge.

Secret sauce:

  • Causal isolation with teacher forcing and masking gives a clean, cheap, per-turn signal.
  • Variance-gated fusion keeps learning shaped and purposeful across training phases.
  • The signal is task-agnostic, avoiding bespoke heuristics or expensive reward models while still aligning with final goals.

04Experiments & Results

The tests and why: The authors evaluated three kinds of interactive skills—asking for missing intent, collaborating on code, and making tool-augmented decisions—because real assistants need all three. They measured success rates, pass rates on hidden tests, and average rewards, which tell us if the agent both clarified well and completed tasks correctly.

The competition: InfoPO was compared to strong prompting strategies (ReAct, Reflexion) and multiple RL baselines (UserRL, RAGEN, Search-R1). These are respected methods that either guide reasoning without training or try to stabilize multi-turn RL with group-relative tricks.

Scoreboard with context:

  • UserGym (8 diverse gyms): On Qwen2.5-7B, InfoPO was best in 7/8 sub-environments and lifted the macro average. It especially shined in held-out generalization gyms where goals are underspecified, like IntentionGym (1.892 vs. 1.826 for the best baseline) and SearchGym (0.480 vs. 0.446). In travel subsets, it improved tough variants too, meaning it’s not just memorizing easy patterns.
  • ColBench (collaborative coding): InfoPO improved both Pass and Success. For example, Pass rose to 0.534 vs. 0.457 for the best baseline, even edging GPT-4.1’s 0.529 in that metric. This shows the agent learns to clarify data schemas and requirements before coding.
  • τ-Bench (airline/retail/telecom, long-horizon with dual control): InfoPO matched or improved the best open-source baselines in all domains (e.g., Telecom 0.181; Retail 0.188; Air 0.150), notable because tasks often exceed 30 turns and use only 178 tasks—hard mode for RL.

Making the numbers meaningful:

  • A 14–16% relative gain over GRPO-style methods is like moving from a “solid B” to an “A-” in a tough class—especially impressive when the tests are long and tricky.
  • Early training often had 30–80% of rollout groups with zero outcome variance (everyone failing in the same way). Where most methods see near-zero learning signals, InfoPO’s info gain kept gradients alive—like switching on headlights in fog.
  • Training curves showed InfoPO starts improving earlier and oscillates less, meaning it learns faster and steadier.

Surprising findings:

  • Emergent “clarify-then-act”: As training progressed, the info-gain rewards concentrated on early turns. The agent learned to ask the sharpest questions first, then execute.
  • Context-aware interaction depth: In UserGym and ColBench, the agent first explored a bit (more turns) to resolve ambiguity, then shortened responses. In τ-Bench (super long), it pruned from the start, cutting wasteful steps immediately.
  • Generalization beyond user chats: On Sokoban and WebShop (environment-interactive tasks), InfoPO avoided the “Echo Trap” collapse seen in standard GRPO and kept improving—evidence that “reduce uncertainty, then act” is a general principle.
  • Robust to user simulator shifts: Using stronger or better-prompted simulators at test time nudged scores up in τ-Bench and sometimes ColBench, with mixed effects in UserGym (reflecting stricter user behavior). This suggests InfoPO learns policies that still work when the “user” style changes.

Efficiency:

  • Counterfactual evaluation adds extra forward passes but no extra environment calls; wall-clock overhead averaged about 1.63×, far from the naive 2× worst case. Considering the stability and gains, this is a practical trade-off.

05Discussion & Limitations

Limitations:

  • Extra compute per turn: The counterfactual pass adds overhead (about 1.6× wall clock), which may matter for very long contexts or tight budgets.
  • Text-centric scope: Experiments focus on text interfaces; extending to multimodal (vision/audio) or embodied agents needs careful design for masking and teacher forcing.
  • Simulator realism: Training quality depends on user simulators. If they’re inconsistent or unrealistic, the learned interaction style might overfit. The paper mitigates this with user-shift tests, but real human-in-the-loop trials are the gold standard.
  • Extremely short tasks: If a task is one-shot or trivially short, info-gain per turn provides little advantage over a standard outcome reward.
  • When counterfactuals are ill-defined: For tasks where “removing” feedback breaks coherence (e.g., creative storytelling), masking may be less meaningful.

Required resources:

  • A capable base LLM, PPO-style training stack, and GPU memory for long rollouts.
  • A simulator or environment to provide feedback and outcomes.
  • Reference model weights for KL regularization.

When not to use:

  • Purely single-turn Q&A with immediate, reliable labels—standard supervised fine-tuning may suffice.
  • Creative tasks where measuring “changed next action” isn’t aligned with quality.
  • Settings with strict latency limits where even modest extra compute per turn is unacceptable.

Open questions:

  • Multimodal extension: How to mask and measure info gain from images, audio, or sensors?
  • Human feedback: Can we combine InfoPO’s intrinsic signal with light-weight human preferences or PRMs for even better guidance?
  • Adaptive gates: Can the gate learn richer signals (beyond variance) to switch emphasis more intelligently?
  • Theory to practice: Are there tighter bounds connecting required information gain to success in messy, real-world tasks?
  • Safety and alignment: How does rewarding “information that changes actions” interact with safety constraints and refusal policies?

06Conclusion & Future Work

Three-sentence summary: InfoPO trains chatty, tool-using agents by rewarding turns that truly reduce uncertainty—measured by how much the latest feedback changes the agent’s next action compared to a masked counterfactual—and by blending this signal with end-task rewards via a variance-aware gate. This gives dense, turn-level credit when outcomes tie and keeps optimization focused on real success when outcomes differentiate, leading to earlier, steadier, and stronger learning. Across intent clarification, collaborative coding, and decision-making with tools, InfoPO consistently beats strong baselines, generalizes to new users and environments, and comes with theory showing information gain is necessary for success.

Main achievement: A simple, principled, and scalable way to do turn-level credit assignment—task-agnostic, causally isolated, and adaptively fused with outcomes—that unlocks robust learning for multi-turn agents.

Future directions: Extend to multimodal and embodied settings, mix with lightweight human/process rewards, learn smarter gates, and validate extensively with real users. Explore safety-aware variants where only safe, on-policy information shifts are rewarded.

Why remember this: InfoPO turns “ask good questions first, act confidently next” into a measurable training signal. It’s a clean bridge between information theory and practical RL for agents, showing that uncertainty reduction per turn isn’t just nice—it’s necessary for reliable success in the real world.

Practical Applications

  • Train customer support agents to diagnose issues by asking the fewest, highest-impact questions before fixing.
  • Improve travel and shopping assistants so they elicit budget and preferences early, then finalize bookings or orders correctly.
  • Enhance coding copilots to clarify ambiguous requirements before writing code, boosting pass rates on unit tests.
  • Guide tool-using agents (search, databases, APIs) to gather key facts first, then execute accurate, efficient tool calls.
  • Speed up troubleshooting flows (e.g., telecom, IT helpdesk) by rewarding steps that quickly narrow down the root cause.
  • Strengthen planning agents (research, project setup) to identify missing constraints up front, avoiding rework later.
  • Stabilize long-horizon agents (games, simulations) by rewarding information that meaningfully changes the next move.
  • Adapt to different user styles or simulators by learning general “clarify-then-act” strategies that transfer well.
  • Reduce token waste in long conversations by pruning low-information turns and focusing on decisive actions.
  • Combine with lightweight human or process rewards to align agents with domain-specific safety and quality norms.
#Information-driven RL #Turn-level credit assignment #Counterfactual masking #Mutual information #Directed information #Variance-gated fusion #Group-relative policy optimization #Multi-turn agents #LLM tool use #User simulators #Active uncertainty reduction #Teacher forcing #PPO with KL regularization #Interactive benchmarks #Generalization