ExpSeek: Self-Triggered Experience Seeking for Web Agents
Key Summary
- ExpSeek helps web-browsing AI agents ask for help exactly when they feel unsure, instead of stuffing them with tips at the very beginning.
- It measures the agent’s moment-to-moment uncertainty (called step-level entropy) and uses learned thresholds to decide when to trigger guidance.
- Guidance comes from an “experience base” built from past successes and failures, organized into triplets: what happened, what went wrong, and how to steer next time.
- A separate experience model reads the current situation, pulls the most relevant experience topics, and generates step-specific advice without giving away the answer.
- Compared to passive, global tips, ExpSeek’s proactive, step-level help boosts accuracy on four tough web-agent benchmarks by 9.3 points (8B) and 7.5 points (32B).
- Surprisingly, even a small 4B guidance model can meaningfully improve a much larger 32B agent, showing “weak-to-strong” help is possible.
- Entropy rises during exploration steps (good for trying smarter paths) and falls at the final answer step (good for confidence), matching the explore-then-converge pattern.
- Self-triggering is more efficient than always-on or reward-model triggers, avoiding over-intervention while keeping accuracy high.
- The approach transfers well across tasks and still works with very small experience pools, thanks to topic-based generalization.
- Limitations include reliance on training-set-calibrated thresholds and a current focus on web tasks; future work targets broader domains and use in RL training.
Why This Research Matters
ExpSeek reduces wrong answers and wasted clicks by giving web agents timely nudges only when they are truly uncertain. This makes AI browsing more reliable for students researching reports, journalists verifying facts, and everyday users comparing products. Organizations can deploy smaller, cheaper agents that still perform well because a small guidance model can significantly help bigger agents. The approach generalizes across tasks, so one well-built experience base can support many question types. Better exploration early and higher confidence at the end creates outputs that are both thorough and trustworthy. Over time, this can cut costs, improve user trust, and speed up workflows that depend on accurate web evidence.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine doing a scavenger hunt on the internet. Sometimes you know exactly which clue to follow, and sometimes you feel stuck and wish a coach could whisper, “Try looking at the map again!”
🥬 The Concept (Web Agents): A web agent is an AI that searches, clicks, and reads web pages to answer questions step by step. How it works:
- The agent thinks about what to do next.
- It calls tools like search or visit to collect evidence.
- It updates its plan and continues until it gives a final answer. Why it matters: Without careful guidance, the agent may chase noisy links, miss key facts, or stop too early. 🍞 Anchor: When asked, “How many Universal films passed $1B since 2020?”, a good agent searches reliably, opens authoritative pages, verifies distributors, and then counts.
The World Before: LLM-powered agents could browse and reason, but the open web is messy: snippets can be misleading, pages are incomplete, and important clues are scattered. Smaller, budget-friendly models often explored poorly or answered too soon. Researchers knew that “experience” (lessons from past tasks) helps. So they built big experience repositories and pasted them into the prompt before the task began.
🍞 Hook: You know how getting a giant handbook at the start of a field trip doesn’t help much when you’re facing a specific puzzle halfway through?
🥬 The Concept (Passive vs Active Experience Injection): Passive injection means giving advice upfront once; active seeking means asking for just-in-time advice during the task. How it works:
- Passive: Retrieve advice at the start and keep it fixed.
- Active: Check progress each step and pull help only when needed. Why it matters: Context on the web changes every step. Fixed advice can become irrelevant or distracting. 🍞 Anchor: If your map changes mid-hike, you don’t want yesterday’s directions—you want help right now for the current fork.
The Problem: Prior methods added experience globally (before starting). But the agent’s observations change every step: a new search result, a new page snippet, a new mismatch. If the agent keeps following the initial advice, it can drift off course. Also, asking for help at every step wastes time and may confuse the model.
🍞 Hook: Think of a traffic light for asking help: green (you’re fine), red (get help), yellow (maybe help).
🥬 The Concept (Entropy as Uncertainty): Step-level entropy measures how unsure the model is about its next tokens at a step. How it works:
- Compute the average token uncertainty for the model’s current response.
- Higher entropy = more confusion; lower entropy = more confidence.
- Use this score to decide if help is needed. Why it matters: Without a good uncertainty signal, the agent can’t know when to ask for guidance. 🍞 Anchor: When the agent’s words wobble (high entropy) on an answer step, that’s a red light: “Ask for help before locking in the final answer.”
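In code, the entropy signal looks roughly like this. This is a minimal Python sketch, not the paper's implementation: `step_entropy` and its inputs are illustrative names, and real systems read the per-token probability distributions from the model's logits.

```python
import math

def step_entropy(token_distributions):
    """Mean Shannon entropy (in nats) over one step's generated tokens.

    token_distributions: one probability distribution per generated token
    (each a list of probabilities over the vocabulary, summing to 1).
    """
    def token_entropy(p):
        # Standard Shannon entropy; skip zero-probability entries.
        return -sum(q * math.log(q) for q in p if q > 0.0)
    return sum(token_entropy(p) for p in token_distributions) / len(token_distributions)

# Peaked distributions (confident wording) score lower than flat ones.
confident = [[0.97, 0.01, 0.01, 0.01]] * 4
confused = [[0.25, 0.25, 0.25, 0.25]] * 4
```

Averaging over the step's tokens gives one scalar per step, which is exactly the "wobble meter" the traffic-light analogy needs.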
Failed Attempts: Always-on guidance slows everything down and can overwhelm the agent. Reward-model judges are expensive and can still over-trigger. Retrieval-only tips (just showing similar past cases) often add reading time but not accuracy.
The Gap: We needed two things: (1) a self-signal (like entropy) that tells the agent when to seek help, and (2) tailor-made, step-level guidance that fits the exact current context, not just generic advice.
Real Stakes: Better web agents can help students verify facts, journalists check sources, researchers trace citations, and everyday users avoid misinformation. Timely, targeted help means fewer wrong answers, less wasted browsing, and more trustworthy results.
02 Core Idea
Aha! Moment in one sentence: Let the agent use its own uncertainty (entropy) to self-trigger just-in-time, step-level guidance generated from a structured experience base.
🍞 Hook: You know how you raise your hand in class exactly when you feel stuck?
🥬 The Concept (Self-Triggered Experience Seeking): The agent decides when to ask for help based on its own confusion signals. How it works:
- Measure entropy at each step.
- Compare with learned thresholds.
- If needed, fetch and generate guidance specifically for this moment. Why it matters: Without self-triggering, help is either too late, too early, or too generic. 🍞 Anchor: During a tricky web search, the agent’s uncertainty spikes—so it asks for guidance that says, “Open the official source next and verify the key attribute.”
Three analogies:
- Traffic light: Green = proceed solo; Yellow = maybe ask; Red = ask for help now.
- Hiking guide: You only text the guide when the trail signs don’t match your map.
- Cooking coach: You consult the recipe notes right when your batter looks off—not at the grocery store entrance.
Before vs After:
- Before: Agents read a long advice sheet once and hoped it stayed relevant.
- After: Agents check their uncertainty every step and pull focused guidance only when confusion appears, saving effort and improving accuracy.
Why It Works (intuition):
- Entropy is the model’s inner “I’m-not-sure” meter.
- Using it as a trigger catches moments when guidance matters most (especially at the final answer step).
- Generated guidance is personalized to the current page, query, and recent actions, so it’s more actionable than generic tips.
- This creates a healthy explore-then-converge rhythm: seek broader clues when stuck, then tighten certainty at the answer.
🍞 Hook: Imagine a labeled toolbox for tricky moments.
🥬 The Concept (Experience Base): A structured library of past mistakes and fixes, grouped by topics. How it works:
- Pair successful and failed trajectories for the same question.
- For each wrong step, extract a triplet: behavior, mistake, guidance.
- Organize these triplets into topics (like “ignored high-relevance evidence”). Why it matters: Without a well-organized memory, the agent can’t find the right lesson quickly. 🍞 Anchor: Facing a snippet-only guess? The “visit-authoritative-source” topic provides guidance to open the official site and verify.
🍞 Hook: Think of a three-card recipe: what you did, what went wrong, and what to try next.
🥬 The Concept (Experience Triplets): Each experience is a mini-lesson: Behavior, Mistake, Guidance. How it works:
- Behavior: objectively state the step.
- Mistake: identify the error vs the correct trajectory.
- Guidance: give directional advice (no answers or spoilers). Why it matters: Without this tight format, guidance can be vague or too revealing. 🍞 Anchor: “You skimmed snippets (behavior), trusted them as facts (mistake), so now visit the official page and confirm the distributor before counting (guidance).”
Building Blocks:
- Step-level entropy and learned thresholds (via logistic regression + bootstrap) decide when to intervene.
- Topic-organized experience triplets supply reusable, human-like teaching nuggets.
- An experience model reads the current history, selects top relevant topics, and generates tailored guidance.
- Guidance is injected at process steps (to steer exploration) or at the final step (to re-check or refine answers), with a cool-down so we don’t over-coach.
Resulting shift: Agents intervene only when needed, explore smarter mid-task, and finalize answers with higher confidence.
03 Methodology
At a high level: Question and web context → compute step entropy → compare to thresholds → (maybe) generate and inject guidance → continue reasoning → final answer.
Step 0: Framework and Steps 🍞 Hook: Imagine solving a puzzle by thinking, then acting, then thinking again.
🥬 The Concept (ReAct Framework): The agent alternates between reasoning (thoughts) and actions (tool calls) to gather evidence. How it works:
- Think about what to search or where to click.
- Use tools: Search (get links/snippets) and Visit (open a page and summarize relevant content).
- Repeat until ready to answer. Why it matters: Without structured think-act cycles, the agent either overthinks or clicks randomly. 🍞 Anchor: For “How many Universal films passed $1B since 2020?”, the agent searches, visits authoritative pages, then counts.
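The think-act cycle can be sketched as a small loop. Here the `think`, `search`, and `visit` callables are stand-ins for the LLM and its two web tools, so this is a structural sketch rather than the paper's agent.

```python
def react_loop(question, think, search, visit, max_steps=10):
    """Alternate reasoning and tool calls until the agent answers.

    `think` maps the history to (thought, action, argument), where
    action is one of "search", "visit", or "answer".
    """
    history = [("question", question)]
    for _ in range(max_steps):
        thought, action, arg = think(history)
        history.append(("thought", thought))
        if action == "answer":
            return arg
        observation = search(arg) if action == "search" else visit(arg)
        history.append(("observation", observation))
    return None  # step budget exhausted without an answer

# Toy run with stubbed tools: search once, then answer from the observation.
def fake_think(history):
    if any(kind == "observation" for kind, _ in history):
        return ("I have the evidence", "answer", "3")
    return ("I should search", "search", "Universal $1B films since 2020")

result = react_loop("How many?", fake_think, lambda q: "three films listed", lambda u: "")
```

The history list is the key data structure: every later component of ExpSeek (entropy, triggers, guidance) reads from or writes into it.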
Step 1: Build the Experience Base 🍞 Hook: Like making a study guide from old quizzes—especially the ones you got wrong.
🥬 The Concept: Construct a topic-organized library of experience triplets from past trajectories. How it works:
- For each training question, sample multiple agent trajectories and pair a correct one with a wrong one.
- Use a tool model to label each step in the wrong trajectory as correct/incorrect by comparing to the correct one.
- For each incorrect step, extract a triplet: Behavior, Mistake, Guidance; then group triplets into topics using iterative clustering via the model. Why it matters: Without step-aligned, topic-labeled lessons, guidance would be too broad or hard to retrieve. 🍞 Anchor: Topic example: “High-relevance evidence ignored” holds triplets that nudge the agent to open the right page and verify.
Step 2: Learn When to Ask for Help (Thresholds) 🍞 Hook: Like setting a personal “ask-a-teacher” meter for homework—when confusion passes a certain level, raise your hand.
🥬 The Concept (Threshold Estimation via Logistic Regression): Learn an entropy cutoff that separates likely-correct from likely-incorrect steps. How it works:
- Gather entropy values from steps labeled correct/incorrect during experience construction.
- Fit logistic regression with entropy as the single feature to predict incorrectness.
- The decision boundary gives a threshold where help should flip from “no” to “yes.” Why it matters: Without a calibrated cutoff, the agent might ask too often or too rarely. 🍞 Anchor: If final-step entropy is above 0.24 (example for an 8B model), trigger guidance before locking in an answer.
🍞 Hook: Imagine checking your confidence rule across many shuffled practice sets to avoid overfitting.
🥬 The Concept (Bootstrap Resampling): Estimate a confidence interval for the threshold by refitting on many resampled datasets. How it works:
- Sample with replacement from the step datasets many times (e.g., 1000).
- Fit logistic regression each time; record the threshold.
- Take the 2.5% and 97.5% quantiles as the lower/upper threshold bounds. Why it matters: Without uncertainty bands, a single noisy threshold could over-trigger or under-trigger. 🍞 Anchor: Process steps get a wider band (more exploration noise), while final steps get a tighter band (answers need certainty).
Step 3: Decide to Intervene (Probabilistic Trigger) 🍞 Hook: Think of a dimmer switch—gentle in the middle, off when you’re sure, full on when you’re lost.
🥬 The Concept (Probabilistic Intervention): Map entropy to an intervention probability using the learned interval. How it works:
- If entropy is below the lower bound: don’t intervene.
- If it’s above the upper bound: always intervene.
- If it’s in-between: interpolate a probability and sample. Why it matters: Without a soft middle zone, we’d either miss borderline cases or over-coach. 🍞 Anchor: A slightly unsure step (in the yellow zone) might get help 40% of the time—enough to catch many near-misses.
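The three-zone rule above is only a few lines. A minimal sketch, with `rng` injectable so the coin flip is testable:

```python
import random

def should_intervene(entropy, lower, upper, rng=random.random):
    """Soft trigger: never below `lower`, always above `upper`,
    and a linearly interpolated coin flip in between."""
    if entropy <= lower:
        return False   # green: proceed solo
    if entropy >= upper:
        return True    # red: always ask for help
    probability = (entropy - lower) / (upper - lower)
    return rng() < probability  # yellow: sometimes ask
```

Linear interpolation is the simplest choice for the yellow zone; the important property is that borderline steps get help *sometimes* rather than never or always.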
Step 4: Generate Step-Level Guidance 🍞 Hook: Like a teacher reading your scratch work and giving just the right nudge, not the full solution.
🥬 The Concept (Guided Intervention via an Experience Model): A separate model reads the current history, selects the most relevant topics, and generates fresh guidance from triplets. How it works:
- If triggered, pick top-3 relevant topics from the process or answer-step pools.
- Adapt the triplets under those topics to the current context.
- Output guidance that is directional (no spoilers) and actionable. Why it matters: Retrieval-only text dumps are slow to read and often too generic; generation tailors the advice to now. 🍞 Anchor: “Don’t trust snippets alone; open the official page and verify the distributor before counting.”
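As a toy stand-in for topic selection: the paper relies on the experience model's own relevance judgment, so the keyword-overlap scoring below is purely illustrative, as are the topic labels.

```python
def top_topics(history_text, topic_pools, k=3):
    """Rank topics by word overlap with the current history and keep k."""
    words = set(history_text.lower().split())
    def overlap(topic):
        return len(words & set(topic.lower().split()))
    ranked = sorted(topic_pools, key=overlap, reverse=True)
    return ranked[:k]

pools = {
    "high-relevance evidence ignored": [],
    "trusted snippet without visiting source": [],
    "stopped search too early": [],
    "misread table units": [],
}
picked = top_topics(
    "the agent trusted a snippet about the distributor without visiting",
    pools,
)
```

Whatever scoring is used, the output is the same: a short list of topic pools whose triplets get adapted into fresh, context-specific guidance.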
Step 5: Inject Guidance and Continue
- Process step: append guidance to the tool’s observation, so the next thought/action can use it.
- Final step: treat guidance like a new observation to allow revising the answer or doing one more visit.
- Cool-down: skip intervention on the immediate next step to prevent over-coaching.
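Putting injection and cool-down together, a minimal sketch: the `trigger` and `guide` callables are stand-ins (real guidance comes from the experience model), and appending to the observation string mirrors the injection described above.

```python
def inject_guidance(observations, entropies, trigger, guide):
    """Append guidance to an observation when triggered, then skip the
    immediately following step (cool-down) to avoid over-coaching."""
    cooldown = False
    annotated = []
    for obs, entropy in zip(observations, entropies):
        if not cooldown and trigger(entropy):
            obs = obs + "\n[guidance] " + guide(obs)
            cooldown = True   # no intervention on the very next step
        else:
            cooldown = False
        annotated.append(obs)
    return annotated

# Toy run: every step is uncertain, but the cool-down spaces out the coaching.
out = inject_guidance(
    ["step1", "step2", "step3"],
    [0.9, 0.9, 0.9],
    trigger=lambda e: e > 0.5,
    guide=lambda obs: "verify on the official page",
)
```

The cool-down is a deliberately blunt rule: one skipped step is enough to let the agent act on the advice before being coached again.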
Concrete Mini Example:
- The agent drafts an answer with high entropy (uncertain wording). Threshold says “red.”
- Experience model guides: “Re-check the distributor on a reputable site before declaring the count.”
- Agent visits Box Office Mojo, confirms, and answers with lower-entropy, more confident text.
Secret Sauce:
- Intrinsic signal: Using the agent’s own entropy is cheap, general, and timely.
- Step-tailoring: Guidance is generated for the exact moment, not a generic preface.
- Topic structure: A small, high-quality experience set still generalizes well via topics.
- Explore-then-converge: Entropy rises on process steps (healthy exploration) and drops on final steps (confident answers).
04 Experiments & Results
The Test: Four challenging web-agent benchmarks—WebWalkerQA (in-domain train/test split), GAIA, SEAL-HARD, and xbench-DeepSearch (out-of-distribution). Agents used Qwen3-8B and Qwen3-32B. Accuracy was scored via LLM-as-a-judge; we also tracked steps/time and Pass@3.
The Competition: Two experience-based baselines that inject advice globally:
- Training-Free GRPO: Offline mining of good trajectories; insert advice upfront.
- REASONINGBANK+: Self-evolving online experience; retrieve into the system prompt before solving.
The Scoreboard (with context):
- Qwen3-8B: ExpSeek improved average accuracy by +9.3 points over vanilla ReAct, and by about +6–7 points over global-experience baselines.
- Qwen3-32B: ExpSeek improved average accuracy by +7.5 points over vanilla, similarly surpassing global-experience baselines by about +6 points.
- On some difficult subsets (e.g., xbench), gains reached up to +14.6 points.
- Pass@3 gains (+12.9 for 8B, +8.8 for 32B) were even larger than single-try gains, showing better, more diverse reasoning paths.
- Efficiency: Self-triggering avoided the 1.5–2.9× time overhead seen in always-on or reward-model triggers while keeping accuracy high.
Make the numbers meaningful: Think of grades. Vanilla agents were getting a middling C on hard, noisy web questions. Global advice sometimes nudged them to a C+ or a weak B-, but often didn’t help. ExpSeek’s just-in-time coaching lifted them to a stronger B, even on new kinds of questions they weren’t trained on.
Surprising Findings:
- Weak-to-strong help: A small 4B guidance model still boosted a big 32B agent by 5–10 points. So your tutor can be smaller than you and still helpful if the tips are well-structured.
- Entropy shift pattern: With ExpSeek, process-step entropy goes up (healthier exploration) and final-step entropy goes down (more confident answers)—exactly the learning behavior we want.
- Retrieval-only guidance underperformed: Just pasting similar old cases was slower and not more accurate than generating targeted guidance.
- Answer-step guidance alone was strong but still trailed the full method; combining process and final guidance worked best.
- Cross-task generalization: Even though the experience base came from WebWalkerQA training data, benefits transferred to GAIA, SEAL-HARD, and xbench, showing the topics and guidance patterns were broadly useful.
Takeaway: Measured, self-triggered, step-level help beats one-size-fits-all, front-loaded advice—especially in the chaotic, ever-changing web.
05 Discussion & Limitations
Limitations:
- Threshold calibration depends on labeled training data and a tool model’s step-quality judgments; miscalibration can over- or under-trigger guidance, especially in new domains.
- Current scope is web tasks with Search/Visit tools; extending to APIs, software control, or robotics needs re-validation of entropy thresholds and new experience topics.
- Entropy can be miscalibrated for some models; without calibration, the red/yellow/green signal may be noisy.
- Extra components (experience model, tool model, experience construction) add engineering complexity and moderate compute overhead.
Required Resources:
- An agent backbone (e.g., Qwen3-8B/32B), a guidance model (4B–235B works), and a tool model for building the experience base.
- Storage for triplets organized by topics; code for entropy calculation, logistic regression, and bootstrap.
When NOT to Use:
- Very simple, single-hop questions where intervention overhead outweighs gains.
- Settings with strict latency budgets that can’t afford intermittent guidance generation.
- Domains where entropy poorly reflects uncertainty (e.g., heavy decoding constraints) without recalibration.
Open Questions:
- Can we learn thresholds online, adaptively calibrating to new domains and models without labeled step correctness?
- How do multi-signal triggers (entropy + consistency + semantic uncertainty) compare to entropy alone?
- Can ExpSeek speed up Agentic RL training by improving rollout quality and exploration balance?
- How should topics and triplets evolve over time (e.g., curriculum learning or automatic topic merging/splitting)?
06 Conclusion & Future Work
Three-sentence summary: ExpSeek lets web agents ask for help exactly when they feel unsure, using their own entropy as a self-trigger. It pulls targeted, step-level guidance from a topic-organized experience base and generates context-fitted advice without revealing answers. This proactive, just-in-time coaching outperforms global, front-loaded advice across multiple hard benchmarks.
Main Achievement: Turning the agent’s intrinsic uncertainty into a reliable traffic light for step-level, generative guidance—producing a robust explore-then-converge behavior that raises accuracy and efficiency.
Future Directions: Calibrate thresholds online and across domains, enrich tools beyond the web, blend entropy with other uncertainty signals, and plug ExpSeek into Agentic RL to boost rollout quality and training convergence. Explore automated topic evolution and lighter-weight guidance models for edge deployments.
Why Remember This: Instead of overloading agents with advice upfront, ExpSeek teaches them to ask for the right help at the right time—and shows that even small tutors can guide big minds when the lessons are structured well.
Practical Applications
- Academic research assistants that verify citations and cross-check claims on authoritative sources.
- Customer support bots that escalate to guided verification when pages are ambiguous or inconsistent.
- Shopping advisors that confirm product specs and seller authenticity before recommending a purchase.
- News fact-checking tools that nudge toward primary sources and labeled metadata (e.g., publisher, date).
- Enterprise knowledge mining that detects uncertainty spikes and prompts targeted policy or doc lookups.
- Healthcare info triage assistants that push to reputable medical sources when confidence is low.
- Legal research bots that verify jurisdiction and statute applicability before drafting conclusions.
- Travel planners that confirm time-sensitive details (visa, closures) when search snippets conflict.
- Coding help agents that ask for context-specific hints when API docs are unclear (extend beyond web later).
- Educational tutors that teach research skills by modeling verify-before-answer habits.