
Agent-Omit: Training Efficient LLM Agents for Adaptive Thought and Observation Omission via Agentic Reinforcement Learning

Intermediate
Yansong Ning, Jun Fang, Naiqiang Tan et al. · 2/4/2026
arXiv · PDF

Key Summary

  • Agent-Omit teaches AI agents to skip unneeded thinking and old observations, cutting tokens while keeping accuracy high.
  • Not every turn in a multi-step task needs long thoughts or the full history; the paper proves this with careful measurements.
  • A small "cold-start" dataset first shows the agent what skipping looks like in both single-turn and multi-turn cases.
  • Then an omit-aware reinforcement learning stage uses dual sampling and a special omission reward to learn when to omit.
  • The method safely balances correctness and token savings, with theory showing errors shrink as policies get closer (by KL-divergence).
  • On five benchmarks (DeepSearch, WebShop, TextCraft, BabyAI, SciWorld), Agent-Omit-8B matches or beats strong agents while using fewer tokens.
  • Most omissions happen in the middle turns, saving cost without hurting the final answer.
  • Compared to other efficiency tricks (like generic summarization or fixed pruning), Agent-Omit adapts turn-by-turn and performs better.
  • Ablations show both the cold-start format and the omit-aware RL (especially partial trajectories and omission reward) are crucial.
  • This approach makes agents faster, cheaper, and greener, which matters for real apps like web browsing, shopping, and science tasks.

Why This Research Matters

AI agents often waste time and money by over-explaining each step and carrying too much history. Agent-Omit proves that agents can safely skip mid-turn thoughts and prune old observations, making them faster and cheaper without hurting accuracy. This helps real products: smarter shopping assistants, quicker research tools, and more efficient automation. Lower token use also means less energy per task, which is better for the environment. With a clear training recipe and theory guarantees, teams can deploy leaner, more reliable agents. The approach generalizes across search, web navigation, crafting, embodied control, and science tasks. In short, it upgrades agents from “always talk” to “speak when it counts.”

Detailed Explanation


01 Background & Problem Definition

🍞 Top Bread (Hook) Imagine packing for a trip. On day one, you plan carefully. On day two and three, you just follow the plan—no need to re-pack your whole suitcase each morning. And you don’t keep every old receipt forever; most become clutter.

🥬 Filling (The Actual Concept)

  • What it is: This paper is about teaching AI agents to stop overthinking and to let go of old, unhelpful notes during multi-step tasks.
  • How it works: The authors first measure where thinking and observations actually help. Then they train agents to skip redundant thoughts and to drop outdated observations at the right moments.
  • Why it matters: Without this, AI agents waste tokens and time writing long thoughts and carrying huge histories, making them slow, costly, and sometimes even less accurate.

🍞 Bottom Bread (Anchor) A shopping agent planning which laptop to buy may plan early, then simply click through steps without re-explaining itself every turn, and it can forget last week’s search results that don’t matter now.

— New Concepts (in friendly order) —

🍞 Top Bread (Hook) You know how a librarian uses a giant catalog to answer questions? That’s like a language model.

🥬 Filling (The Actual Concept)

  • What it is: Large Language Models (LLMs) are big text-understanding and text-writing systems trained on lots of examples.
  • How it works: They read your words, predict the next words based on patterns they learned, and can use tools (like search) to get more info.
  • Why it matters: LLMs are the brains of many modern AI agents.

🍞 Bottom Bread (Anchor) When you ask, “What’s the tallest mountain in Africa?”, an LLM writes “Kilimanjaro.”

🍞 Top Bread (Hook) Think of practicing basketball: you try, get a score, and improve your moves.

🥬 Filling (The Actual Concept)

  • What it is: Reinforcement Learning (RL) helps an AI learn by doing actions and getting rewards.
  • How it works: The agent acts, the environment responds, the agent gets feedback (reward), and it updates its policy to get better results over time.
  • Why it matters: RL makes agents adapt to tasks instead of just memorizing.

🍞 Bottom Bread (Anchor) A web agent tries a button; if it gets closer to the goal, it earns reward and repeats good habits.
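
To make the loop concrete, here is a tiny toy version of act → reward → update. The action names, the reward rule, and the preference update below are made up for illustration; they are not the paper's setup.

```python
import random

# Toy sketch of the reinforcement-learning loop described above.
# The actions, reward rule, and update step are illustrative, not from the paper.
actions = ["click_button_A", "click_button_B"]
preferences = {a: 0.0 for a in actions}  # the "policy": higher score = chosen more often

def toy_reward(action):
    return 1.0 if action == "click_button_A" else 0.0  # pretend button A moves us toward the goal

for episode in range(100):
    # Act: usually pick the preferred action, sometimes explore.
    if random.random() < 0.1:
        action = random.choice(actions)
    else:
        action = max(actions, key=lambda a: preferences[a])
    # Get feedback and nudge the policy toward rewarded behavior.
    reward = toy_reward(action)
    preferences[action] += 0.1 * (reward - preferences[action])

print(preferences)  # "click_button_A" ends up clearly preferred
```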

🍞 Top Bread (Hook) When doing a puzzle, sometimes you must think deeply; other times you just place the next obvious piece.

🥬 Filling (The Actual Concept)

  • What it is: Thought Necessity means some turns truly need detailed reasoning, while others don’t.
  • How it works: The agent checks if reasoning helps now; if not, it can skip thoughts.
  • Why it matters: Skipping unnecessary thoughts saves tokens and avoids clutter.

🍞 Bottom Bread (Anchor) After planning to “search, then sort, then buy,” the next “search” turn may not need extra thinking.

🍞 Top Bread (Hook) Old notes are useful at first, but later, some are just noise.

🥬 Filling (The Actual Concept)

  • What it is: Observation Utility is about which past tool responses still help the current step.
  • How it works: The agent keeps only the observations that inform the next action; it omits stale ones.
  • Why it matters: Carrying every past observation makes later steps slower and more expensive.

🍞 Bottom Bread (Anchor) Yesterday’s search results rarely matter when you’re summarizing the final answer today.

Before this paper, many systems compressed everything equally or kept too much. That wastes tokens because early plans make some mid-turn reasoning redundant, and many early observations don’t affect the last steps. The authors show, with careful measurements, that thoughts and observations don’t help equally at all turns: thinking is heavy at the start; observations pile up linearly and overburden later turns; and mid-turn omissions often save tokens without hurting accuracy. The gap was a missing skill: adaptively omitting only what is safe to skip at this moment. The stakes are real: faster answers, lower bills, greener compute, and better user experiences on web browsing, shopping, games, robots, and science tasks.

02 Core Idea

🍞 Top Bread (Hook) You know how a chef preps a recipe: plan once, then chop, stir, and bake without re-planning every minute—and you toss the peels you don’t need.

🥬 Filling (The Actual Concept)

  • What it is: The key insight is that an AI agent can learn to skip unneeded thoughts and drop irrelevant observations at the right turns.
  • How it works: First, teach the agent the “language” of omitting (cold-start). Then, use RL with special rewards and sampling so the agent practices omitting only when it’s safe and helpful.
  • Why it matters: This keeps accuracy high while cutting token costs significantly.

🍞 Bottom Bread (Anchor) A search agent plans a path, then executes obvious tool calls with empty thoughts, and discards old tool outputs that won’t affect the final answer.

— Multiple Analogies —

  1. Backpacking: Pack smart. Don’t re-pack every stop. Drop souvenirs that weigh you down if they’re not useful.
  2. Homework: Write a plan once. For easy steps, just do them. Don’t copy old drafts into every new page.
  3. Detective: Keep the latest clue. Archive early leads that no longer matter.

— Before vs After —

  • Before: Agents often reasoned at every turn and carried all past observations, treating each turn equally.
  • After: The agent selectively omits mid-turn thoughts and prunes stale observations, cutting tokens with the same or better accuracy.

— Why It Works (no equations) —

  • Early turns usually need deep planning; middle turns often follow the plan directly; late turns depend mostly on recent evidence. By matching omission to each turn’s role, you avoid waste.
  • RL fine-tunes this behavior using rewards that only pay for savings when the answer is correct, preventing cheating.
  • A theoretical guardrail says: as the learned policy becomes closer to the best possible one (measured by KL-divergence), both accuracy and token use approach the optimal balance.

— Building Blocks (with Sandwich Explanations) —

🍞 Top Bread (Hook) When building LEGO, you follow steps (thoughts), do actions (place bricks), and look at what you built (observations).

🥬 Filling (The Actual Concept)

  • What it is: Thought = the agent’s chain-of-thought; Action = tool use or final answer; Observation = environment’s feedback.
  • How it works: Think → Act → Observe → repeat.
  • Why it matters: Managing thought and observations is where most tokens are spent.

🍞 Bottom Bread (Anchor) A web agent thinks “search price,” clicks search (action), reads results (observation), then continues.
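
A rough sketch of that Think → Act → Observe loop is below. The helpers `llm_generate` and `run_tool` and the toy shopping task are placeholders, not the paper's actual interface; notice that a turn can carry an empty <think></think> when no fresh reasoning is needed.

```python
# Rough sketch of a Think -> Act -> Observe agent loop.
# `llm_generate` and `run_tool` are hypothetical placeholders, not the paper's API.

def llm_generate(context):
    # A real agent would call an LLM here; this stub plans once, then answers.
    if "search results" in context:
        return "<think></think>", "final_answer[cheapest waterproof jacket]"
    return "<think>Plan: search, filter by price, pick the best item.</think>", "search[waterproof jacket]"

def run_tool(action):
    return "search results: jacket A $99, jacket B $150"  # stand-in for the environment

context = "Goal: find a waterproof hiking jacket under $120."
for turn in range(10):
    thought, action = llm_generate(context)      # Think (possibly empty) + Act
    if action.startswith("final_answer"):
        print(action)
        break
    observation = run_tool(action)               # Observe
    context += f"\n{thought}\n{action}\n{observation}"
```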

🍞 Top Bread (Hook) Sometimes you ask: “Do I really need to think this through again?”

🥬 Filling (The Actual Concept)

  • What it is: Thought Necessity: judge whether reasoning helps right now.
  • How it works: If a prior plan already makes the next step obvious, skip the thought (empty think).
  • Why it matters: Saves tokens and keeps the context clean.

🍞 Bottom Bread (Anchor) After deciding to “open page → add-to-cart,” the “open page” turn needs no extra explanation.

🍞 Top Bread (Hook) Old maps can mislead if you’re already on the final street.

🥬 Filling (The Actual Concept)

  • What it is: Observation Utility: keep only tool responses that still inform the next step.
  • How it works: Omit early, irrelevant tool outputs when they won’t affect the current decision.
  • Why it matters: Reduces long-context bloat.

🍞 Bottom Bread (Anchor) When summarizing the answer, you usually need only the latest, most precise results.
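
A minimal sketch of observation pruning: keep only the tool responses judged still useful for the next step. The `still_useful` rule below is a made-up stand-in; in Agent-Omit this judgment is learned, not hard-coded.

```python
# Minimal sketch of pruning stale observations from the running context.
# `still_useful` is a made-up heuristic; the paper learns this judgment instead.

history = [
    {"turn": 1, "observation": "search results: 50 laptops"},
    {"turn": 2, "observation": "filtered results: 5 laptops under $800"},
    {"turn": 3, "observation": "product page: laptop X, $749, 16GB RAM"},
]

def still_useful(entry, current_turn):
    return current_turn - entry["turn"] <= 1  # toy rule: keep only the most recent observations

current_turn = 4
pruned = [e for e in history if still_useful(e, current_turn)]
print(pruned)  # only the turn-3 observation survives into turn 4's context
```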

🍞 Top Bread (Hook) Learning a sport needs smart feedback: reward good moves and practice the exact moment they matter.

🥬 Filling (The Actual Concept)

  • What it is: Omit-Aware Agentic RL is a training recipe that encourages smart omission.
  • How it works: It uses dual sampling (practice both full tasks and the specific turns where you consider omitting) and an omission reward (pay for true savings only if the task is correct).
  • Why it matters: The agent learns not just to omit, but to omit at the right time.

🍞 Bottom Bread (Anchor) A search agent tries skipping a mid-turn thought; if it still gets the answer right with fewer tokens, it’s rewarded.

🍞 Top Bread (Hook) Comparing two paths to school shows how different your choices are.

🥬 Filling (The Actual Concept)

  • What it is: KL-Divergence measures how different two policies are (the learned one vs. the best one).
  • How it works: Smaller KL means the learned policy behaves more like the ideal; theory shows errors in accuracy and cost shrink as KL shrinks.
  • Why it matters: It’s a safety rail proving training won’t drift too far from optimal.

🍞 Bottom Bread (Anchor) As your route to school matches the fastest route more closely, your trip time approaches the best possible time.
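
For reference, the textbook definition of the KL-divergence between the learned policy π_θ and a target policy π* over actions a in a state s is (a standard formula, not copied from the paper):

```latex
D_{\mathrm{KL}}\!\left(\pi_\theta(\cdot \mid s)\,\big\|\,\pi^{*}(\cdot \mid s)\right)
= \sum_{a} \pi_\theta(a \mid s)\,\log \frac{\pi_\theta(a \mid s)}{\pi^{*}(a \mid s)}
```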

🍞 Top Bread (Hook) Finally, put it all together like a game plan you actually follow.

🥬 Filling (The Actual Concept)

  • What it is: Agent-Omit is the full framework that teaches and reinforces adaptive omission in LLM agents.
  • How it works: Cold-start data shows the format; omit-aware RL perfects the timing with dual sampling and safe rewards; a theory bound keeps learning stable.
  • Why it matters: It delivers strong accuracy with much fewer tokens across many tasks.

🍞 Bottom Bread (Anchor) On WebShop, Agent-Omit-8B-RL gets higher accuracy than many larger agents while cutting average tokens a lot.

03 Methodology

At a high level: Question + Environment → (Cold-Start Omission Training) → (Omit-Aware RL with Dual Sampling + Rewards) → Efficient Agent that Omits Wisely.

Step 1: Cold-Start Omission Behavior Synthesis

  • What happens: The agent learns the “grammar” of omission.
    1. Omission Turn Identification: For a recorded trajectory, test omitting either the thought or the observation at each turn and see if accuracy stays the same while tokens drop. Mark those turns as omittable.
    2. Single-Turn Omission Samples: Teach the agent how to show an empty thought (<think></think>) and how to issue an observation-omission command (<omittoolresponseN...>).
    3. Multi-Turn Omission Samples: Build full trajectories where several earlier thoughts/observations are already omitted, so the agent learns to continue smoothly without trying to “recover” what was safely dropped.
  • Why this step exists: Without learning the exact format and seeing safe examples, the agent won’t know how to omit or how to keep reasoning coherent afterward.
  • Example with data: Suppose a 5-turn shopping task. Trials show turns 2 and 3 thoughts are safe to skip, and observation from turn 1 is safe to drop. We train examples where <think></think> is used at turns 2–3, and <omittoolresponse1> is issued when moving to turn 4.
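
A rough sketch of the omittable-turn check from point 1 above: replay a recorded trajectory, drop one thought or observation at a time, and mark the turn only if the answer stays correct while the context gets cheaper. The helpers `rerun_with_omission` and `count_tokens` are hypothetical, not the paper's code.

```python
# Rough sketch of identifying omittable turns in a recorded trajectory (Step 1, point 1).
# `rerun_with_omission` and `count_tokens` are hypothetical helpers passed in by the caller.

def find_omittable(trajectory, rerun_with_omission, count_tokens):
    """Return (turn, kind) pairs whose thought or observation can be safely dropped."""
    baseline_tokens = count_tokens(trajectory)
    omittable = []
    for turn in range(len(trajectory)):
        for kind in ("thought", "observation"):
            correct, tokens = rerun_with_omission(trajectory, omit=(turn, kind))
            # Keep the omission only if the answer stays correct and the context shrinks.
            if correct and tokens < baseline_tokens:
                omittable.append((turn, kind))
    return omittable
```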

Step 2: Omit-Aware Agentic Reinforcement Learning

  • What happens: The agent practices omission live and gets feedback that balances correctness and savings (a small reward sketch follows this list).

    a) Dual Sampling Strategy (Secret #1)

    • Full Trajectory: Run the whole task with the agent’s omissions and measure final correctness and total tokens.
    • Partial Trajectories: For each turn where the agent omitted, also sample just that one turn’s pre-omission context and decision. This teaches the agent “given what I saw before omitting, was omission a good idea?”
    • Why it exists: If you only see the post-omission world, you never learn from the exact moment of deciding to omit. Partial trajectories solve this.
    • Example: On WebShop, at turn 4 the agent omits an old tool response. We store a short sample showing what the agent saw at turn 4 before omitting, then train the policy on that decision.

    b) Omission-Aware Reward (Secret #2)

    • Task Reward: Points for being correct (Pass@1), applied to both full and partial trajectories.
    • Omission Reward: Extra points for tokens saved in the full trajectory—but only if the answer is correct. If the answer is wrong, the omission bonus is zero, preventing reward hacking.
    • Why it exists: We want savings without breaking accuracy. Tying savings to correctness aligns incentives.
    • Example: If the agent saves 20% tokens and is correct, it gets both accuracy reward and a savings bonus. If it’s wrong, it gets no savings bonus.

    c) Multi-Objective Policy Learning

    • The agent updates its policy to improve both accuracy and efficiency, with a stability term that keeps it close to a safe reference (KL penalty).
    • Why it exists: Balancing tasks is hard; this keeps training steady and prevents wild behavior.
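
To make the reward in (b) concrete, here is a compressed sketch of a savings bonus that is gated on correctness. The weight `mu` and the exact shape of the savings term are illustrative assumptions, not the paper's formula.

```python
# Compressed sketch of an omission-aware reward (Step 2b): savings count only when correct.
# The weight `mu` and the normalization of the savings term are illustrative assumptions.

def omission_aware_reward(is_correct, baseline_tokens, used_tokens, mu=0.5):
    task_reward = 1.0 if is_correct else 0.0
    savings = max(0.0, (baseline_tokens - used_tokens) / baseline_tokens)
    # The savings bonus is paid only for correct answers, which blocks reward hacking.
    omission_reward = mu * savings if is_correct else 0.0
    return task_reward + omission_reward

# Correct answer with 20% fewer tokens than the no-omission baseline:
print(omission_aware_reward(True, baseline_tokens=10_000, used_tokens=8_000))   # 1.1
# Wrong answer: no bonus, no matter how many tokens were cut:
print(omission_aware_reward(False, baseline_tokens=10_000, used_tokens=5_000))  # 0.0
```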

The Secret Sauce

  • Dual Sampling: Lets the agent learn the “exact moment wisdom” of omission decisions.
  • Omission Reward: Pays only for safe savings, so the agent won’t skip something crucial.
  • Theory Guardrail: As training reduces KL-divergence from the optimal policy, the difference in accuracy and token cost stays bounded and shrinks.

Concrete Walkthrough (WebShop-style)

  • Input: Goal = “Find a waterproof hiking jacket under $120.”
  • Turn 1: Think deeply to plan (search brand + filter price). Action: search. Observation: many results.
  • Turn 2: The plan is clear. Thought Necessity says “skip thought.” Action: click filter by price. Observation: filtered results. (Tokens saved: empty thought.)
  • Turn 3: Old observations from turn 1 don’t help now. Observation Utility says “omit turn 1 tool response.” Action: open top result. (Tokens saved: shorter context.)
  • Turn 4: Think a bit to confirm specs. Action: add to cart. Observation: success.
  • Turn 5: Final answer. Only the latest observations matter; early ones removed. You get the right item with fewer tokens.
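
Serialized, the walkthrough above might look roughly like this. The field names and tag placement are assumptions for illustration, not the paper's exact trace format.

```python
# Illustrative serialization of the walkthrough; field names and tag placement are assumptions.
trajectory = [
    {"turn": 1, "thought": "Plan: search, filter under $120, check specs, buy.",
     "action": "search[waterproof hiking jacket]", "observation": "many results"},
    {"turn": 2, "thought": "<think></think>",                                 # redundant thought skipped
     "action": "filter[price < 120]", "observation": "filtered results"},
    {"turn": 3, "thought": "<think></think>", "omit": "<omittoolresponse1>",  # drop turn-1 results
     "action": "open[top result]", "observation": "product page: waterproof, $109"},
    {"turn": 4, "thought": "Specs confirm waterproof and under budget.",
     "action": "add_to_cart", "observation": "success"},
    {"turn": 5, "thought": "<think></think>", "action": "final_answer[jacket, $109]"},
]
```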

What breaks without each step?

  • No cold-start format: Agent doesn’t know how to write omissions, causing confusion.
  • No dual sampling: Agent can’t learn from the decision-time context, so omission timing is poor.
  • No omission reward: The agent has no reason to save tokens responsibly.
  • No KL guardrail: Training could drift, harming accuracy or savings.

Training Notes

  • Start with a few thousand synthetic examples; then run RL in the target environments.
  • Use loss masks so the agent learns to generate its own tokens but doesn’t try to learn the environment’s outputs.
  • Keep RL stable with group-relative updates and a small KL penalty.
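
A minimal sketch of the loss-mask note above: average the loss only over the agent's own tokens so the model never learns to imitate environment outputs. The token IDs, per-token losses, and 0/1 mask convention are illustrative.

```python
# Minimal sketch of loss masking: train only on agent-generated tokens.
# Token IDs, per-token losses, and the 0/1 mask are illustrative values.

tokens       = [101, 42, 7, 900, 901, 902, 55, 13]      # mixed agent + environment tokens
agent_mask   = [1,   1,  1, 0,   0,   0,   1,  1]       # 1 = agent-generated, 0 = tool/environment output
token_losses = [0.8, 0.5, 0.3, 2.1, 1.9, 2.4, 0.4, 0.6] # per-token losses from the model

masked = [loss * m for loss, m in zip(token_losses, agent_mask)]
training_loss = sum(masked) / sum(agent_mask)  # average only over the agent's own tokens
print(training_loss)  # 0.52
```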

Where Omissions Tend to Happen

  • Early thoughts: often needed for planning.
  • Middle turns: perfect for skipping redundant thoughts and dropping stale observations.
  • Late turns: keep crucial, recent observations for final summaries.

04 Experiments & Results

The Test

  • Goals: Keep Pass@1 accuracy high while reducing average tokens.
  • Environments: Five diverse sandboxes—DeepSearch (search), WebShop (web navigation/shopping), TextCraft (crafting), BabyAI (embodied navigation), SciWorld (science tasks).
  • Metrics: Pass@1 (did the agent get it right?) and average token consumption (how much text it used).
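
Both metrics are simple to compute; here is a small sketch over made-up task results.

```python
# Sketch of the two evaluation metrics: Pass@1 and average token consumption.
# The result records below are made up for illustration.

results = [
    {"correct": True,  "tokens": 6_400},
    {"correct": False, "tokens": 9_100},
    {"correct": True,  "tokens": 7_250},
]

pass_at_1  = sum(r["correct"] for r in results) / len(results)   # fraction solved on the first try
avg_tokens = sum(r["tokens"] for r in results) / len(results)    # average tokens per task
print(f"Pass@1 = {pass_at_1:.2%}, average tokens = {avg_tokens:.0f}")
```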

The Competition

  • Frontier agents: DeepSeek-R1-0528, DeepSeek-V3.2, OpenAI o3/o4-mini, Qwen3-235B-A22B, Qwen3-Next-80B-A3B, Qwen3-32B.
  • Efficient-construction methods: Thinking-Retention, DEPO, ToolLight (thought compression), Observation-Mask, DeepMiner (observation pruning), MEM-Agent, ReSum (summarization-based TOM).

The Scoreboard (with context)

  • Agent-Omit-8B-RL shows strong accuracy with lower tokens across the board.
  • DeepSearch: Pass@1 ≈ 26.56 with ~4,356 tokens. That’s like getting a solid B when others use more words to get a B−.
  • WebShop: Pass@1 ≈ 23.57 with ~8,764 tokens. Among efficient methods, this is a leading effectiveness–efficiency combo.
  • TextCraft: Pass@1 ≈ 87.00 with ~7,328 tokens—an A-level score with fewer notes.
  • BabyAI: Pass@1 ≈ 84.36 with ~6,643 tokens—very strong accuracy at low cost.
  • SciWorld: Pass@1 ≈ 18.45 with ~9,643 tokens—competitive on a hard domain.
  • Compared to larger reasoning agents (e.g., DeepSeek-R1-0528, Qwen3-32B), Agent-Omit-8B-RL often uses fewer tokens for similar or better accuracy.
  • Compared to non-reasoning modes (like DeepSeek-V3.2), it may use more tokens but gains substantial accuracy—a smart trade.

Surprising Findings

  • Mid-turn omission is the sweet spot: On average, the agent omits 3–4 turns per trajectory, mostly in the middle. This matches the paper’s analysis: early thoughts plan, late observations are crucial; the middle is where waste lives.
  • Omission boosts accuracy sometimes: Skipping unhelpful thoughts can reduce confusion and improve success.
  • Ablations matter: Removing partial trajectory sampling or omission reward hurts results. Single-turn omission examples in the cold-start are especially important for teaching the behavior format.

Why These Numbers Matter

  • “87%” on TextCraft with fewer tokens means the agent is not just fast; it’s truly mastering the task steps, cutting fluff while keeping substance.
  • On WebShop, doing better than several baselines while using fewer tokens means cheaper, faster shopping agents that don’t get lost in their own notes.
  • The consistent pattern across five tasks hints this is not a one-trick pony—it generalizes.

Takeaway

Agent-Omit reliably finds and removes the “middle muck”—unnecessary thoughts and stale observations—leading to both speed and quality gains.

05 Discussion & Limitations

Limitations

  • Domain shifts: The adaptive omission policy may need re-tuning for very different tools or environments (e.g., highly visual or very long-horizon scientific labs).
  • Synthetic cold-start: Early examples are synthetic; if they miss certain real-world patterns, the agent may need extra RL exposure to adapt.
  • Reward tuning: The omission reward weight (μ) matters; too little or too much harms the accuracy–efficiency balance.
  • Observation-risky tasks: In tasks where any old observation can suddenly become critical, omission must be very cautious.

Required Resources

  • Compute: Training used up to 8×A100 GPUs for 8B models; RL requires environment simulators and rollouts.
  • Data: A few thousand cold-start samples per domain plus RL interactions.
  • Tooling: A sandbox or APIs for search, web actions, crafting, navigation, or science operations.

When NOT to Use

  • One-shot Q&A with no tools: There’s little to omit; direct answers suffice.
  • Extremely short tasks: Overhead of omission machinery may not pay off.
  • Tasks with fragile dependencies: If any early observation can instantly change the final step, aggressive omission could backfire.

Open Questions

  • Scaling up: How well does omission training integrate into pre-training or very large models?
  • Multimodal omission: Can we omit parts of images or audio histories safely?
  • Better detection: Can the agent predict omission safety without rollouts, e.g., via confidence or uncertainty measures?
  • Team settings: How does omission work when multiple agents collaborate and share context?
  • Personalization: Can omission adapt to user preferences (speed vs. thoroughness) on the fly?

06 Conclusion & Future Work

Three-Sentence Summary

Agent-Omit teaches AI agents to skip unnecessary thoughts and drop irrelevant observations at the right times. It combines a cold-start format for omission with an omit-aware RL stage that uses dual sampling and safe rewards, plus a theory bound for stability. Across five benchmarks, it matches or beats strong baselines while using fewer tokens.

Main Achievement

A practical, unified framework—backed by analysis and theory—that learns adaptive, turn-level omission and achieves the best effectiveness–efficiency trade-off against several strong baselines.

Future Directions

Scale omission data generation into pre-training; explore larger models and more diverse domains; extend to multimodal inputs; improve safety signals for omission decisions; and adapt omission to user preferences.

Why Remember This

Because it flips the script: the smartest agent doesn’t always say more; it knows when to say less. Agent-Omit shows that selective silence—at the right moments—makes agents faster, cheaper, and often sharper.

Practical Applications

  • Build web-browsing assistants that omit mid-turn thoughts and drop stale pages to load results faster.
  • Create shopping agents that plan once, then act directly, trimming tokens and checkout time.
  • Deploy research helpers that keep only the latest evidence when summarizing answers.
  • Design game or crafting bots that skip obvious steps and keep inventories concise for faster play.
  • Develop robotics/embodied agents that prune irrelevant past observations to react quicker.
  • Offer enterprise copilots that cut meeting and document clutter, focusing on current action items.
  • Integrate omission-aware policies into API-heavy workflows to reduce cloud inference bills.
  • Enable mobile assistants to run longer on-device by saving tokens and compute cycles.
  • Accelerate scientific simulation agents by keeping only the most recent, relevant measurements.
  • Enhance multi-agent systems by teaching each agent to maintain a lean, useful context.
Tags: LLM agents · reinforcement learning · agentic RL · context management · thought omission · observation omission · dual sampling · omission reward · KL-divergence · efficiency–effectiveness trade-off · token savings · multi-turn reasoning · web navigation · search agents · long-context pruning