Agent-Omit: Training Efficient LLM Agents for Adaptive Thought and Observation Omission via Agentic Reinforcement Learning
Key Summary
- Agent-Omit teaches AI agents to skip unneeded thinking and old observations, cutting tokens while keeping accuracy high.
- Not every turn in a multi-step task needs long thoughts or the full history; the paper proves this with careful measurements.
- A small "cold-start" dataset first shows the agent what skipping looks like in both single-turn and multi-turn cases.
- Then an omit-aware reinforcement learning stage uses dual sampling and a special omission reward to learn when to omit.
- The method safely balances correctness and token savings, with theory showing errors shrink as policies get closer (by KL-divergence).
- On five benchmarks (DeepSearch, WebShop, TextCraft, BabyAI, SciWorld), Agent-Omit-8B matches or beats strong agents while using fewer tokens.
- Most omissions happen in the middle turns, saving cost without hurting the final answer.
- Compared to other efficiency tricks (like generic summarization or fixed pruning), Agent-Omit adapts turn-by-turn and performs better.
- Ablations show both the cold-start format and the omit-aware RL (especially partial trajectories and omission reward) are crucial.
- This approach makes agents faster, cheaper, and greener, which matters for real apps like web browsing, shopping, and science tasks.
Why This Research Matters
AI agents often waste time and money by over-explaining each step and carrying too much history. Agent-Omit proves that agents can safely skip mid-turn thoughts and prune old observations, making them faster and cheaper without hurting accuracy. This helps real products: smarter shopping assistants, quicker research tools, and more efficient automation. Lower token use also means less energy per task, which is better for the environment. With a clear training recipe and theory guarantees, teams can deploy leaner, more reliable agents. The approach generalizes across search, web navigation, crafting, embodied control, and science tasks. In short, it upgrades agents from "always talk" to "speak when it counts."
Detailed Explanation
01 Background & Problem Definition
Top Bread (Hook) Imagine packing for a trip. On day one, you plan carefully. On day two and three, you just follow the plan; no need to re-pack your whole suitcase each morning. And you don't keep every old receipt forever; most become clutter.
Filling (The Actual Concept)
- What it is: This paper is about teaching AI agents to stop overthinking and to let go of old, unhelpful notes during multi-step tasks.
- How it works: The authors first measure where thinking and observations actually help. Then they train agents to skip redundant thoughts and to drop outdated observations at the right moments.
- Why it matters: Without this, AI agents waste tokens and time writing long thoughts and carrying huge histories, making them slow, costly, and sometimes even less accurate.
Bottom Bread (Anchor) A shopping agent planning which laptop to buy may plan early, then simply click through steps without re-explaining itself every turn, and it can forget last week's search results that don't matter now.
New Concepts (in friendly order)
Top Bread (Hook) You know how a librarian uses a giant catalog to answer questions? That's like a language model.
Filling (The Actual Concept)
- What it is: Large Language Models (LLMs) are big text-understanding and text-writing systems trained on lots of examples.
- How it works: They read your words, predict the next words based on patterns they learned, and can use tools (like search) to get more info.
- Why it matters: LLMs are the brains of many modern AI agents.
Bottom Bread (Anchor) When you ask, "What's the tallest mountain in Africa?", an LLM writes "Kilimanjaro."
Top Bread (Hook) Think of practicing basketball: you try, get a score, and improve your moves.
Filling (The Actual Concept)
- What it is: Reinforcement Learning (RL) helps an AI learn by doing actions and getting rewards.
- How it works: The agent acts, the environment responds, the agent gets feedback (reward), and it updates its policy to get better results over time.
- Why it matters: RL makes agents adapt to tasks instead of just memorizing.
Bottom Bread (Anchor) A web agent tries a button; if it gets closer to the goal, it earns reward and repeats good habits.
Top Bread (Hook) When doing a puzzle, sometimes you must think deeply; other times you just place the next obvious piece.
Filling (The Actual Concept)
- What it is: Thought Necessity means some turns truly need detailed reasoning, while others donât.
- How it works: The agent checks if reasoning helps now; if not, it can skip thoughts.
- Why it matters: Skipping unnecessary thoughts saves tokens and avoids clutter.
Bottom Bread (Anchor) After planning to "search, then sort, then buy," the next "search" turn may not need extra thinking.
Top Bread (Hook) Old notes are useful at first, but later, some are just noise.
Filling (The Actual Concept)
- What it is: Observation Utility is about which past tool responses still help the current step.
- How it works: The agent keeps only the observations that inform the next action; it omits stale ones.
- Why it matters: Carrying every past observation makes later steps slower and more expensive.
Bottom Bread (Anchor) Yesterday's search results rarely matter when you're summarizing the final answer today.
Before this paper, many systems compressed everything equally or kept too much. That wastes tokens because early plans make some mid-turn reasoning redundant, and many early observations don't affect the last steps. The authors show, with careful measurements, that thoughts and observations don't help equally at all turns: thinking is heavy at the start; observations pile up linearly and overburden later turns; and mid-turn omissions often save tokens without hurting accuracy. The gap was a missing skill: adaptively omitting only what is safe to skip at this moment. The stakes are real: faster answers, lower bills, greener compute, and better user experiences on web browsing, shopping, games, robots, and science tasks.
02 Core Idea
Top Bread (Hook) You know how a chef preps a recipe: plan once, then chop, stir, and bake without re-planning every minute, and you toss the peels you don't need.
Filling (The Actual Concept)
- What it is: The key insight is that an AI agent can learn to skip unneeded thoughts and drop irrelevant observations at the right turns.
- How it works: First, teach the agent the "language" of omitting (cold-start). Then, use RL with special rewards and sampling so the agent practices omitting only when it's safe and helpful.
- Why it matters: This keeps accuracy high while cutting token costs significantly.
Bottom Bread (Anchor) A search agent plans a path, then executes obvious tool calls with empty thoughts, and discards old tool outputs that won't affect the final answer.
Multiple Analogies
- Backpacking: Pack smart. Don't re-pack every stop. Drop souvenirs that weigh you down if they're not useful.
- Homework: Write a plan once. For easy steps, just do them. Don't copy old drafts into every new page.
- Detective: Keep the latest clue. Archive early leads that no longer matter.
Before vs After
- Before: Agents often reasoned at every turn and carried all past observations, treating each turn equally.
- After: The agent selectively omits mid-turn thoughts and prunes stale observations, cutting tokens with the same or better accuracy.
Why It Works (no equations)
- Early turns usually need deep planning; middle turns often follow the plan directly; late turns depend mostly on recent evidence. By matching omission to each turn's role, you avoid waste.
- RL fine-tunes this behavior using rewards that only pay for savings when the answer is correct, preventing cheating.
- A theoretical guardrail says: as the learned policy becomes closer to the best possible one (measured by KL-divergence), both accuracy and token use approach the optimal balance.
Building Blocks (with Sandwich Explanations)
Top Bread (Hook) When building LEGO, you follow steps (thoughts), do actions (place bricks), and look at what you built (observations).
Filling (The Actual Concept)
- What it is: Thought = the agent's chain-of-thought; Action = tool use or final answer; Observation = environment's feedback.
- How it works: Think → Act → Observe → repeat (see the sketch below).
- Why it matters: Managing thought and observations is where most tokens are spent.
Bottom Bread (Anchor) A web agent thinks "search price," clicks search (action), reads results (observation), then continues.
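To make this loop concrete, here is a minimal Python sketch; `llm_generate` and `env` are hypothetical stand-ins for the model call and the environment, not any API from the paper.

```python
# Minimal sketch of the think-act-observe loop (illustrative only).

def run_agent(question, env, llm_generate, max_turns=10):
    context = [{"role": "user", "content": question}]
    for _ in range(max_turns):
        # One model turn bundles the (possibly empty) thought and the action.
        turn = llm_generate(context)
        context.append({"role": "assistant", "content": turn})
        if "<answer>" in turn:          # the agent chose to give its final answer
            return turn
        observation = env.step(turn)    # run the tool call, read the environment's feedback
        context.append({"role": "tool", "content": observation})
    return None                         # ran out of turns without a final answer
```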
Top Bread (Hook) Sometimes you ask: "Do I really need to think this through again?"
Filling (The Actual Concept)
- What it is: Thought Necessity: judge whether reasoning helps right now.
- How it works: If a prior plan already makes the next step obvious, skip the thought (empty think).
- Why it matters: Saves tokens and keeps the context clean.
Bottom Bread (Anchor) After deciding to "open page → add-to-cart," the "open page" turn needs no extra explanation.
Top Bread (Hook) Old maps can mislead if you're already on the final street.
Filling (The Actual Concept)
- What it is: Observation Utility: keep only tool responses that still inform the next step.
- How it works: Omit early, irrelevant tool outputs when they wonât affect the current decision.
- Why it matters: Reduces long-context bloat.
Bottom Bread (Anchor) When summarizing the answer, you usually need only the latest, most precise results.
Top Bread (Hook) Learning a sport needs smart feedback: reward good moves and practice the exact moment they matter.
Filling (The Actual Concept)
- What it is: Omit-Aware Agentic RL is a training recipe that encourages smart omission.
- How it works: It uses dual sampling (practice both full tasks and the specific turns where you consider omitting) and an omission reward (pay for true savings only if the task is correct).
- Why it matters: The agent learns not just to omit, but to omit at the right time.
Bottom Bread (Anchor) A search agent tries skipping a mid-turn thought; if it still gets the answer right with fewer tokens, it's rewarded.
Top Bread (Hook) Comparing two paths to school shows how different your choices are.
Filling (The Actual Concept)
- What it is: KL-Divergence measures how different two policies are (the learned one vs. the best one).
- How it works: Smaller KL means the learned policy behaves more like the ideal; theory shows errors in accuracy and cost shrink as KL shrinks.
- Why it matters: It's a safety rail proving training won't drift too far from optimal (a toy calculation is sketched below).
Bottom Bread (Anchor) As your route to school matches the fastest route more closely, your trip time approaches the best possible time.
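As a toy illustration (made-up numbers, not results from the paper), KL-divergence between two small discrete policies can be computed like this:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same set of actions."""
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)

learned_policy = [0.70, 0.20, 0.10]  # probabilities the trained agent assigns to 3 actions
optimal_policy = [0.75, 0.15, 0.10]  # what the ideal policy would assign

# A small value means the learned behavior is close to the ideal one.
print(kl_divergence(learned_policy, optimal_policy))  # ~0.009
```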
Top Bread (Hook) Finally, put it all together like a game plan you actually follow.
Filling (The Actual Concept)
- What it is: Agent-Omit is the full framework that teaches and reinforces adaptive omission in LLM agents.
- How it works: Cold-start data shows the format; omit-aware RL perfects the timing with dual sampling and safe rewards; a theory bound keeps learning stable.
- Why it matters: It delivers strong accuracy with much fewer tokens across many tasks.
Bottom Bread (Anchor) On WebShop, Agent-Omit-8B-RL gets higher accuracy than many larger agents while cutting average tokens a lot.
03 Methodology
At a high level: Question + Environment → (Cold-Start Omission Training) → (Omit-Aware RL with Dual Sampling + Rewards) → Efficient Agent that Omits Wisely.
Step 1: Cold-Start Omission Behavior Synthesis
- What happens: The agent learns the "grammar" of omission.
- Omission Turn Identification: For a recorded trajectory, test omitting either the thought or the observation at each turn and see if accuracy stays the same while tokens drop. Mark those turns as omittable (see the sketch after this list).
- Single-Turn Omission Samples: Teach the agent how to show an empty thought (<think></think>) and how to issue an observation-omission command (<omittoolresponseN>).
- Multi-Turn Omission Samples: Build full trajectories where several earlier thoughts/observations are already omitted, so the agent learns to continue smoothly without trying to "recover" what was safely dropped.
- Why this step exists: Without learning the exact format and seeing safe examples, the agent won't know how to omit or how to keep reasoning coherent afterward.
- Example with data: Suppose a 5-turn shopping task. Trials show turns 2 and 3 thoughts are safe to skip, and observation from turn 1 is safe to drop. We train examples where <think></think> is used at turns 2-3, and <omittoolresponse1> is issued when moving to turn 4.
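A rough Python sketch of the omission-turn identification idea; `replay_without_thought` and `replay_without_observation` are hypothetical helpers that re-run a recorded trajectory with one element removed and report correctness plus token count.

```python
# Illustrative sketch of marking omittable turns (not the paper's exact procedure).

def find_omittable_turns(trajectory, replay_without_thought, replay_without_observation):
    base_correct = trajectory.is_correct
    base_tokens = trajectory.token_count
    omittable = []
    for turn in range(len(trajectory.turns)):
        for kind, replay in (("thought", replay_without_thought),
                             ("observation", replay_without_observation)):
            correct, tokens = replay(trajectory, turn)
            # Mark the omission only if accuracy is preserved and tokens actually drop.
            if correct == base_correct and tokens < base_tokens:
                omittable.append((turn, kind))
    return omittable
```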
Step 2: Omit-Aware Agentic Reinforcement Learning
- What happens: The agent practices omission live and gets feedback that balances correctness and savings.
a) Dual Sampling Strategy (Secret #1)
- Full Trajectory: Run the whole task with the agentâs omissions and measure final correctness and total tokens.
- Partial Trajectories: For each turn where the agent omitted, also sample just that one turn's pre-omission context and decision. This teaches the agent "given what I saw before omitting, was omission a good idea?"
- Why it exists: If you only see the post-omission world, you never learn from the exact moment of deciding to omit. Partial trajectories solve this (see the sketch after this list).
- Example: On WebShop, at turn 4 the agent omits an old tool response. We store a short sample showing what the agent saw at turn 4 before omitting, then train the policy on that decision.
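A sketch of the dual-sampling idea, with illustrative field names (the paper's actual data schema may differ):

```python
# Build one full-trajectory sample plus one partial sample per omission decision.

def build_training_samples(trajectory):
    samples = [{
        "type": "full",
        "context": trajectory.full_context(),      # everything the agent saw and produced
        "correct": trajectory.is_correct,
        "tokens": trajectory.token_count,
    }]
    for turn in trajectory.omission_turns:          # turns where the agent chose to omit
        samples.append({
            "type": "partial",
            "context": trajectory.context_before(turn),  # exactly what was visible at decision time
            "decision": trajectory.omission_at(turn),    # e.g. <think></think> or <omittoolresponseN>
            "correct": trajectory.is_correct,
        })
    return samples
```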
b) Omission-Aware Reward (Secret #2)
- Task Reward: Points for being correct (Pass@1), applied to both full and partial trajectories.
- Omission Reward: Extra points for tokens saved in the full trajectory, but only if the answer is correct. If the answer is wrong, the omission bonus is zero, preventing reward hacking.
- Why it exists: We want savings without breaking accuracy. Tying savings to correctness aligns incentives.
- Example: If the agent saves 20% tokens and is correct, it gets both accuracy reward and a savings bonus. If it's wrong, it gets no savings bonus.
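That example, written as a hedged sketch (the reward's exact functional form and weight here are illustrative, not the paper's):

```python
# Accuracy always counts; token savings count only when the answer is correct.

def omission_aware_reward(is_correct, baseline_tokens, used_tokens, mu=0.5):
    task_reward = 1.0 if is_correct else 0.0
    savings = max(0.0, (baseline_tokens - used_tokens) / baseline_tokens)
    omission_bonus = mu * savings if is_correct else 0.0  # zero bonus on wrong answers
    return task_reward + omission_bonus

print(omission_aware_reward(True, 10_000, 8_000))   # correct + 20% saved -> 1.0 + 0.5*0.2 = 1.1
print(omission_aware_reward(False, 10_000, 8_000))  # wrong -> 0.0, no savings bonus
```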
c) Multi-Objective Policy Learning
- The agent updates its policy to improve both accuracy and efficiency, with a stability term that keeps it close to a safe reference (KL penalty).
- Why it exists: Balancing tasks is hard; this keeps training steady and prevents wild behavior.
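For readers who want the shape of this objective in symbols, a generic form (our paraphrase, not the paper's exact equation) is:

```latex
\max_{\pi_\theta}\; \mathbb{E}_{\tau \sim \pi_\theta}\big[\, R_{\text{task}}(\tau) + \mu\, R_{\text{omit}}(\tau) \,\big] \;-\; \beta\, D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big)
```

Here R_task rewards correctness, R_omit pays for token savings only on correct trajectories, μ is the omission-reward weight discussed later, and β (our notation) controls how strongly the policy is kept near the safe reference.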
The Secret Sauce
- Dual Sampling: Lets the agent learn the "exact moment wisdom" of omission decisions.
- Omission Reward: Pays only for safe savings, so the agent wonât skip something crucial.
- Theory Guardrail: As training reduces KL-divergence from the optimal policy, the difference in accuracy and token cost stays bounded and shrinks.
Concrete Walkthrough (WebShop-style)
- Input: Goal = "Find a waterproof hiking jacket under $120."
- Turn 1: Think deeply to plan (search brand + filter price). Action: search. Observation: many results.
- Turn 2: The plan is clear. Thought Necessity says "skip thought." Action: click filter by price. Observation: filtered results. (Tokens saved: empty thought.)
- Turn 3: Old observations from turn 1 don't help now. Observation Utility says "omit turn 1 tool response." Action: open top result. (Tokens saved: shorter context.)
- Turn 4: Think a bit to confirm specs. Action: add to cart. Observation: success.
- Turn 5: Final answer. Only the latest observations matter; early ones removed. You get the right item with fewer tokens.
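The same walkthrough rendered as a single trajectory object, with invented contents and the omission format taught at cold-start:

```python
# Illustrative trajectory; turn contents are made up for this walkthrough.

trajectory = [
    {"turn": 1, "think": "Plan: search, filter price under $120, then pick and confirm.",
     "action": "search('waterproof hiking jacket')", "observation": "[many results]"},
    {"turn": 2, "think": "",                                 # empty <think></think>: plan already covers this
     "action": "click('filter: price < $120')", "observation": "[filtered results]"},
    {"turn": 3, "think": "", "omit": "<omittoolresponse1>",  # turn-1 results no longer needed
     "action": "open('top result')", "observation": "[product page]"},
    {"turn": 4, "think": "Specs confirm waterproof and within budget.",
     "action": "add_to_cart()", "observation": "success"},
    {"turn": 5, "think": "Only the latest observations matter for the final answer.",
     "action": "answer('Added a waterproof hiking jacket under $120.')"},
]
```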
What breaks without each step?
- No cold-start format: Agent doesn't know how to write omissions, causing confusion.
- No dual sampling: Agent can't learn from the decision-time context, so omission timing is poor.
- No omission reward: The agent has no reason to save tokens responsibly.
- No KL guardrail: Training could drift, harming accuracy or savings.
Training Notes
- Start with a few thousand synthetic examples; then run RL in the target environments.
- Use loss masks so the agent learns to generate its own tokens but doesn't try to learn the environment's outputs (see the sketch below).
- Keep RL stable with group-relative updates and a small KL penalty.
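A tiny sketch of the loss-mask idea (token IDs and structure are illustrative): loss is computed only on the agent's own tokens, never on environment outputs.

```python
# 1 = train on this token (agent-generated), 0 = ignore it (environment output).

def build_loss_mask(segments):
    """segments: list of (token_ids, source) pairs, where source is 'agent' or 'environment'."""
    mask = []
    for token_ids, source in segments:
        mask.extend([1 if source == "agent" else 0] * len(token_ids))
    return mask

segments = [
    ([101, 7, 9], "agent"),             # the agent's thought + action tokens
    ([55, 56, 57, 58], "environment"),  # tool response tokens are masked out
    ([12, 13], "agent"),
]
print(build_loss_mask(segments))  # [1, 1, 1, 0, 0, 0, 0, 1, 1]
```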
Where Omissions Tend to Happen
- Early thoughts: often needed for planning.
- Middle turns: perfect for skipping redundant thoughts and dropping stale observations.
- Late turns: keep crucial, recent observations for final summaries.
04 Experiments & Results
The Test
- Goals: Keep Pass@1 accuracy high while reducing average tokens.
- Environments: Five diverse sandboxes: DeepSearch (search), WebShop (web navigation/shopping), TextCraft (crafting), BabyAI (embodied navigation), SciWorld (science tasks).
- Metrics: Pass@1 (did the agent get it right?) and average token consumption (how much text it used).
The Competition
- Frontier agents: DeepSeek-R1-0528, DeepSeek-V3.2, OpenAI o3/o4-mini, Qwen3-235B-A22B, Qwen3-Next-80B-A3B, Qwen3-32B.
- Efficient-construction methods: Thinking-Retention, DEPO, ToolLight (thought compression), Observation-Mask, DeepMiner (observation pruning), MEM-Agent, ReSum (summarization-based TOM).
The Scoreboard (with context)
- Agent-Omit-8B-RL shows strong accuracy with lower tokens across the board.
  - DeepSearch: Pass@1 ≈ 26.56 with ~4,356 tokens. That's like getting a solid B when others use more words to get a B-minus.
  - WebShop: Pass@1 ≈ 23.57 with ~8,764 tokens. Among efficient methods, this is a leading effectiveness-efficiency combo.
  - TextCraft: Pass@1 ≈ 87.00 with ~7,328 tokens, an A-level score with fewer notes.
  - BabyAI: Pass@1 ≈ 84.36 with ~6,643 tokens, very strong accuracy at low cost.
  - SciWorld: Pass@1 ≈ 18.45 with ~9,643 tokens, competitive on a hard domain.
- Compared to larger reasoning agents (e.g., DeepSeek-R1-0528, Qwen3-32B), Agent-Omit-8B-RL often uses fewer tokens for similar or better accuracy.
- Compared to non-reasoning modes (like DeepSeek-V3.2), it may use more tokens but gains substantial accuracy, a smart trade.
Surprising Findings
- Mid-turn omission is the sweet spot: On average, the agent omits 3-4 turns per trajectory, mostly in the middle. This matches the paper's analysis: early thoughts plan, late observations are crucial; the middle is where waste lives.
- Omission boosts accuracy sometimes: Skipping unhelpful thoughts can reduce confusion and improve success.
- Ablations matter: Removing partial trajectory sampling or omission reward hurts results. Single-turn omission examples in the cold-start are especially important for teaching the behavior format.
Why These Numbers Matter
- "87%" on TextCraft with fewer tokens means the agent is not just fast; it's truly mastering the task steps, cutting fluff while keeping substance.
- On WebShop, doing better than several baselines while using fewer tokens means cheaper, faster shopping agents that don't get lost in their own notes.
- The consistent pattern across five tasks hints this is not a one-trick pony; it generalizes.
Takeaway Agent-Omit reliably finds and removes the "middle muck" (unnecessary thoughts and stale observations), leading to both speed and quality gains.
05 Discussion & Limitations
Limitations
- Domain shifts: The adaptive omission policy may need re-tuning for very different tools or environments (e.g., highly visual or very long-horizon scientific labs).
- Synthetic cold-start: Early examples are synthetic; if they miss certain real-world patterns, the agent may need extra RL exposure to adapt.
- Reward tuning: The omission reward weight (μ) matters; too little or too much harms the accuracy-efficiency balance.
- Observation-risky tasks: In tasks where any old observation can suddenly become critical, omission must be very cautious.
Required Resources
- Compute: Training used up to 8×A100 GPUs for 8B models; RL requires environment simulators and rollouts.
- Data: A few thousand cold-start samples per domain plus RL interactions.
- Tooling: A sandbox or APIs for search, web actions, crafting, navigation, or science operations.
When NOT to Use
- One-shot Q&A with no tools: There's little to omit; direct answers suffice.
- Extremely short tasks: Overhead of omission machinery may not pay off.
- Tasks with fragile dependencies: If any early observation can instantly change the final step, aggressive omission could backfire.
Open Questions
- Scaling up: How well does omission training integrate into pre-training or very large models?
- Multimodal omission: Can we omit parts of images or audio histories safely?
- Better detection: Can the agent predict omission safety without rollouts, e.g., via confidence or uncertainty measures?
- Team settings: How does omission work when multiple agents collaborate and share context?
- Personalization: Can omission adapt to user preferences (speed vs. thoroughness) on the fly?
06 Conclusion & Future Work
Three-Sentence Summary Agent-Omit teaches AI agents to skip unnecessary thoughts and drop irrelevant observations at the right times. It combines a cold-start format for omission with an omit-aware RL stage that uses dual sampling and safe rewards, plus a theory bound for stability. Across five benchmarks, it matches or beats strong baselines while using fewer tokens.
Main Achievement A practical, unified framework, backed by analysis and theory, that learns adaptive, turn-level omission and achieves the best effectiveness-efficiency trade-off against several strong baselines.
Future Directions Scale omission data generation into pre-training; explore larger models and more diverse domains; extend to multimodal inputs; improve safety signals for omission decisions; and adapt omission to user preferences.
Why Remember This Because it flips the script: the smartest agent doesn't always say more; it knows when to say less. Agent-Omit shows that selective silence, at the right moments, makes agents faster, cheaper, and often sharper.
Practical Applications
- Build web-browsing assistants that omit mid-turn thoughts and drop stale pages to load results faster.
- Create shopping agents that plan once, then act directly, trimming tokens and checkout time.
- Deploy research helpers that keep only the latest evidence when summarizing answers.
- Design game or crafting bots that skip obvious steps and keep inventories concise for faster play.
- Develop robotics/embodied agents that prune irrelevant past observations to react quicker.
- Offer enterprise copilots that cut meeting and document clutter, focusing on current action items.
- Integrate omission-aware policies into API-heavy workflows to reduce cloud inference bills.
- Enable mobile assistants to run longer on-device by saving tokens and compute cycles.
- Accelerate scientific simulation agents by keeping only the most recent, relevant measurements.
- Enhance multi-agent systems by teaching each agent to maintain a lean, useful context.