Nested Browser-Use Learning for Agentic Information Seeking
Key Summary
- This paper teaches AI helpers to browse the web more like people do, not just by grabbing static snippets.
- It introduces a tiny but complete browser toolkit with four actions: search, visit, click, and fill.
- A nested two-loop plan keeps the AI’s thinking (outer loop) separate from careful page reading (inner loop).
- Inside pages, the AI only saves goal-relevant bits to a temporary workspace, so context stays small and focused.
- A multi-task imitation learning approach trains both the outer loop (reasoning and tools) and inner loop (evidence picking) together.
- On tough web-browsing tests in English and Chinese, the method beats many bigger systems, even with a small 4B model.
- Ablations show both toolkit simplification and goal-focused extraction matter, and together they work best.
- The AI can even use interactive web utilities (like calculators) that simple URL fetching can’t reach.
- This design makes agents more efficient, more accurate, and better at handling dynamic, multi-step web tasks.
Why This Research Matters
Many real tasks require more than reading a single page—they require clicking, filling forms, and navigating dynamic sites. By teaching AI agents to browse this way efficiently, NestBrowse helps them find accurate answers faster and with less confusion. It reduces wasted effort and context overload by keeping only goal-relevant information in memory. This means better results for research, shopping, travel planning, and fact-checking. It also shows that smart design can make smaller models perform like bigger ones, saving costs. Finally, it opens the door to safely using web-embedded tools, like calculators or filters, that static fetching can’t reach.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): Imagine you’re hunting for a specific Lego piece in a huge, messy box. If you only look at the top layer, you’ll miss the cool parts hidden underneath. That’s what many AI helpers used to do with the web—they skimmed the surface.
🥬 Filling (The Actual Concept):
- What it is: Information-Seeking (IS) agents are AI helpers that look things up online, gather clues, and answer questions.
- How it works: They usually think step-by-step and call tools—like a web search or a URL fetch—to get information, then think again using that new info.
- Why it matters: When the web page is simple, that’s enough. But much of today’s web is dynamic and interactive, and simple tools can’t reach the juicy parts hidden behind clicks, forms, and loading scripts.
🍞 Bottom Bread (Anchor): If you ask, “How many papers by an author named Yuri were accepted at NeurIPS 2022 with a ‘certain’ recommendation?”, static snippets rarely contain this exact info. You often must click into the site, filter, and read inside.
🍞 Top Bread (Hook): You know how some games only show the next level after you press a button or enter a code? The web often works like that too.
🥬 Filling (The Actual Concept):
- What it is: The problem is that most IS agents focused on two tools—search (get URLs and snippets) and visit (fetch the static page content at a URL).
- How it works: They send a query, get a list, pick a URL, and fetch its text. Then they try to answer.
- Why it matters: This misses dynamic content—content that only appears after client-side rendering, form submission, or clicking tabs, buttons, or “load more.”
🍞 Bottom Bread (Anchor): Think of a flight website—you can’t see prices until you type cities and dates and click search. Just visiting the URL without filling anything returns none of the needed info.
🍞 Top Bread (Hook): Have you ever opened a super long document and felt overwhelmed? AI feels that too when a single webpage can be longer than a book.
🥬 Filling (The Actual Concept):
- What it is: Context limits are like a backpack size for the AI’s memory during a task.
- How it works: If you shove an entire page (which can be hundreds of thousands of tokens) into the AI’s context, it overflows or becomes noisy and unhelpful.
- Why it matters: Without controlling what content gets saved, the AI may miss key details or run out of space before finishing.
🍞 Bottom Bread (Anchor): It’s like trying to stuff your whole closet into your backpack for a day trip. You’ll get stuck and won’t bring the few things you actually need.
🍞 Top Bread (Hook): Picture earlier attempts as trying to use every single button in a cockpit just to turn on the cabin light—it’s too complicated.
🥬 Filling (The Actual Concept):
- What it is: Failed attempts included making huge action sets (too many fine-grained clicks, scrolls, or mouse moves) or dumping entire pages into the context.
- How it works: Big action menus confuse the agent; big text dumps waste context.
- Why it matters: The agent either spends forever deciding what to do, or quickly clogs its memory, and both hurt accuracy and speed.
🍞 Bottom Bread (Anchor): It’s like giving a kid 50 different crayons when they only need four: red, blue, green, black. More choices can mean slower drawing, not better.
🍞 Top Bread (Hook): Imagine a tidy toolbox with just the tools you really use, and a note-taking system that only keeps relevant clues.
🥬 Filling (The Actual Concept):
- What it is: The missing piece was a simple but complete browser toolkit, plus a way to filter pages down to only the goal-relevant bits as the agent works.
- How it works: Use a small set of actions (search, visit, click, fill) and a nested plan that separates high-level reasoning from in-page exploration.
- Why it matters: This keeps the agent smart, fast, and focused, even when pages are huge and interactive.
🍞 Bottom Bread (Anchor): It’s like doing a science fair project with just the right tools and a neat lab notebook that records only what helps answer your question.
🍞 Top Bread (Hook): Why should you care? Because real-life questions often need real browsing.
🥬 Filling (The Actual Concept):
- What it is: The stakes are everyday tasks like finding accurate health info, comparing products across multiple tabs, checking sources for school, or verifying breaking news.
- How it works: The AI must navigate the web like you do—clicking, filling forms, and reading only what matters.
- Why it matters: Better browsing AI can save time, reduce mistakes, and uncover information that static snippets miss.
🍞 Bottom Bread (Anchor): For example, planning a family trip involves date pickers, filters, and lots of page sections. An agent that really browses can find the best options faster and more reliably.
02 Core Idea
🍞 Top Bread (Hook): You know how a librarian first plans which sections to visit (big-picture thinking) and then, once in the aisle, scans only the pages that match your topic (focused reading)?
🥬 Filling (The Actual Concept):
- What it is: The key insight is to use a tiny, complete set of browser actions and a nested two-loop process that separates high-level reasoning from in-page exploration, so only goal-relevant info flows back into the agent’s memory.
- How it works: Outer loop = think and choose tools (search, visit, click, fill). Inner loop (only when a new page opens) = scan the page in segments, extract only the parts tied to the current goal, and return a compact “useful info” workspace to the outer loop.
- Why it matters: This design keeps context small, focuses attention, handles dynamic pages, and makes even small models strong web researchers.
🍞 Bottom Bread (Anchor): If the task is “Find a 2022 NeurIPS entry by Yuri with a certain recommendation,” the agent searches and visits the NeurIPS venue page (outer loop), then, inside that page, clicks the right elements and extracts only the relevant entries (inner loop) to answer precisely.
Three analogies to nail it:
- Field trip: Teacher (outer loop) decides which museum halls to enter; student teams (inner loop) take notes only on exhibits that match the assignment.
- Cooking: Head chef plans the meal (outer loop); sous-chefs prep only the needed ingredients from a big pantry (inner loop).
- File folders: Top folder chooses which subfolder to open (outer loop); inside, you copy only the needed files into a small working folder (inner loop).
New Concept 1 — ReAct-style function-calling 🍞 Hook: Imagine solving a puzzle and being allowed to pick specific tools—like a magnifying glass or a hint card—whenever you choose. 🥬 The Concept: ReAct-style function-calling is when an AI alternates between thinking and calling tools.
- What: The agent thinks, decides a tool, uses it, sees the result, and thinks again.
- How: Repeat this loop until the goal is met.
- Why: Without this pattern, the agent either overthinks without new info or fetches info without a plan. 🍞 Anchor: Asking “Who won the 2018 World Cup?” The agent thinks “I should search,” searches, reads results, and then answers “France.”
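Here is a minimal sketch of that think-act-observe loop in Python. The `call_llm` stub, the tool registry, and the exact tag format are illustrative assumptions, not the paper's actual protocol.

```python
# Illustrative ReAct-style loop: the model alternates free-form "thinking"
# with structured tool calls until it emits a final answer.  The call_llm
# stub and the tag format are placeholders, not the paper's exact protocol.
import json
import re

def call_llm(messages: list[dict]) -> str:
    """Stand-in for any chat-completion call; returns the model's next message."""
    raise NotImplementedError  # plug in a model of your choice

def parse_tool_call(reply: str) -> tuple[str, dict]:
    """Pull the tool name and JSON arguments out of a <tool_call> tag (assumed format)."""
    body = re.search(r"<tool_call>(.*?)</tool_call>", reply, re.S).group(1)
    payload = json.loads(body)
    return payload["name"], payload.get("arguments", {})

def run_react(question: str, tools: dict, max_steps: int = 20):
    messages = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        reply = call_llm(messages)                  # may contain <think>, <tool_call>, or <answer>
        messages.append({"role": "assistant", "content": reply})
        if "<answer>" in reply:                     # the agent believes it has enough evidence
            return reply
        if "<tool_call>" in reply:
            name, args = parse_tool_call(reply)
            observation = str(tools[name](**args))  # e.g. search / visit / click / fill
            messages.append({"role": "tool", "content": observation})
    return None                                     # step budget exhausted
```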
New Concept 2 — Static vs. Dynamic Web Information 🍞 Hook: Some doors are open; others need a key or a button press. 🥬 The Concept: Static info loads right away; dynamic info appears only after actions (click, fill, load more).
- What: Two categories of web content: static (one load) and dynamic (needs interaction).
- How: Dynamic needs clicks/fills; static needs only visiting.
- Why: If you ignore dynamic info, you miss what’s behind the door. 🍞 Anchor: Product reviews that appear only after clicking “Show all reviews” are dynamic.
New Concept 3 — Minimal Browser Toolkit (search, visit, click, fill) 🍞 Hook: Like a Swiss Army knife with just the blades you actually need. 🥬 The Concept:
- What: Four core actions cover real browsing: search (get candidates), visit (open page), click (trigger page transitions), fill (enter text in forms).
- How: Use search for candidates, visit the best URL, click to navigate deeper, fill forms to unlock data.
- Why: Too many actions confuse the agent; too few can’t reach dynamic content. 🍞 Anchor: Booking a ticket needs search (find site), visit (open it), fill (dates), and click (search button).
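As a rough illustration, the four-action toolkit could be exposed to the model as function-calling schemas like the ones below. Only the action set (search, visit, click, fill) comes from the paper; the argument names and descriptions are assumptions.

```python
# Hypothetical schema descriptions for the four browser actions.
# Only the action set comes from the paper; the argument names are illustrative.
BROWSER_TOOLS = [
    {"name": "search",
     "description": "Run web search queries and return top results (URL, title, snippet).",
     "parameters": {"queries": "list[str]"}},
    {"name": "visit",
     "description": "Open a URL and hand the current goal to the in-page (inner-loop) reader.",
     "parameters": {"url": "str", "goal": "str"}},
    {"name": "click",
     "description": "Click an interactive element by its snapshot identifier; optionally update the goal.",
     "parameters": {"element_id": "str", "goal": "str | None"}},
    {"name": "fill",
     "description": "Type text into an input field or form identified in the snapshot.",
     "parameters": {"element_id": "str", "text": "str"}},
]
```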
New Concept 4 — Nested Browser-Use Framework (Outer vs. Inner Loop) 🍞 Hook: Like planning a trip (outer) and taking photos of just the landmarks you need (inner). 🥬 The Concept:
- What: Two loops—outer picks tools and reasons; inner explores the page and extracts goal-relevant info only when a new page is opened.
- How: Outer calls visit/click; inner segments the page, picks relevant passages, bundles them into a small workspace, returns it; outer continues.
- Why: Without nesting, you overload context with raw pages or get lost in details. 🍞 Anchor: On an academic site, the inner loop returns just the paper titles, authors, and decision tags needed for the task.
New Concept 5 — Goal-Relevant Extraction and Workspace 🍞 Hook: When doing a report, you don’t copy the whole book—only the parts that answer your question. 🥬 The Concept:
- What: A temporary workspace that stores only passages tied to the current goal.
- How: Split the page into chunks; for each chunk, extract relevant bits; merge into a compact summary-plus-evidence bundle.
- Why: Keeps memory tidy and focused so reasoning stays sharp. 🍞 Anchor: On a long forum thread, only posts mentioning the target product version and release date are saved.
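One way to picture the temporary workspace is as a small container that accumulates evidence and notes, then serializes into a compact bundle. This is a sketch with assumed field names; the paper only specifies that goal-relevant passages and a summary are returned inside <useful_info> tags.

```python
from dataclasses import dataclass, field

@dataclass
class Workspace:
    """Holds only goal-relevant evidence gathered while exploring one page."""
    goal: str
    evidence: list[str] = field(default_factory=list)  # verbatim relevant passages
    notes: list[str] = field(default_factory=list)     # short per-segment summaries

    def add(self, passages: list[str], summary: str) -> None:
        self.evidence.extend(passages)
        if summary:
            self.notes.append(summary)

    def bundle(self) -> str:
        """Compact string handed back to the outer loop instead of the raw page."""
        return "<useful_info>\n" + "\n".join(self.notes + self.evidence) + "\n</useful_info>"
```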
New Concept 6 — Multi-Task Imitation Learning 🍞 Hook: Learning to ride a bike and signal hand turns at the same time. 🥬 The Concept:
- What: Train the agent to do two things together: outer-loop reasoning with tools and inner-loop extraction.
- How: Use example trajectories; score token-by-token to imitate good behavior in both loops, weighted and combined.
- Why: Training them separately risks a mismatch—great browsing but poor reasoning, or vice versa. 🍞 Anchor: The model learns both how to choose to click the “NeurIPS 2022” tab and how to pull just the relevant author entries from that tab.
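A hedged sketch of the combined objective in PyTorch-style code: two token-level negative log-likelihood terms, one per loop, weighted and summed (the paper's default is λ_out = λ_in = 1). The batch layout and the assumption that the model returns `.logits` are illustrative, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def token_nll(logits, target_ids, loss_mask):
    """Token-level negative log-likelihood over supervised positions only."""
    logp = F.log_softmax(logits, dim=-1)
    picked = logp.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)
    return -(picked * loss_mask).sum() / loss_mask.sum().clamp(min=1)

def multitask_loss(model, outer_batch, inner_batch, lam_out=1.0, lam_in=1.0):
    # Outer loop: imitate reasoning, tool calls, and final answers.
    out_logits = model(outer_batch["input_ids"]).logits  # assumes an HF-style model output
    loss_out = token_nll(out_logits, outer_batch["labels"], outer_batch["loss_mask"])
    # Inner loop: imitate goal-relevant extraction, segment by segment.
    in_logits = model(inner_batch["input_ids"]).logits
    loss_in = token_nll(in_logits, inner_batch["labels"], inner_batch["loss_mask"])
    return lam_out * loss_out + lam_in * loss_in
```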
Before vs. After:
- Before: Agents relied on search/visit, often missed dynamic info, and overstuffed context with raw pages.
- After: With search, visit, click, fill and nested loops, agents target the right places and bring back only what matters, staying within context limits and boosting accuracy.
Why it works (intuition):
- Limit choices to reduce decision fatigue.
- Separate planning (outer) from gathering (inner) to stay organized.
- Filter pages to avoid drowning in text and keep focus on the goal.
Building blocks:
- Minimal browser toolkit.
- Outer-loop ReAct reasoning.
- Inner-loop, goal-conditioned extraction.
- Joint training (multi-task imitation) so the two loops cooperate smoothly.
03 Methodology
High-level recipe: Input (user question) → Outer loop thinks and calls tools (search/visit/click/fill) → If a new page opens, run inner loop to extract goal-relevant info → Return compact workspace → Continue outer loop until final answer.
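The skeleton below sketches how those two loops could fit together in code. Every helper (`outer_llm_step`, `run_browser_action`, `inner_extract`) is a hypothetical stand-in; the point carried over from the paper is only that raw pages are filtered by the inner loop before anything re-enters the outer loop's context.

```python
# Illustrative nesting: the outer loop owns the context; the inner loop runs
# only when a tool call opens or changes a page, and returns a compact workspace.
# All helpers below are hypothetical stand-ins, not the paper's actual API.

def outer_llm_step(context):
    """Stand-in: ask the reasoning model for the next action (search/visit/click/fill/answer)."""
    raise NotImplementedError

def run_browser_action(action):
    """Stand-in: execute the action in the headless browser; return (page_text, opened_page)."""
    raise NotImplementedError

def inner_extract(page_text, goal):
    """Stand-in: segment the page and keep only goal-relevant passages (<useful_info> bundle)."""
    raise NotImplementedError

def answer_question(question, max_calls=100):
    context, goal = [question], question
    for _ in range(max_calls):
        action = outer_llm_step(context)
        if action["name"] == "answer":
            return action["text"]
        page_text, opened_page = run_browser_action(action)
        if opened_page:                                   # visit/click revealed new page content
            goal = action.get("goal", goal)
            observation = inner_extract(page_text, goal)  # compact, goal-relevant workspace
        else:
            observation = page_text                       # e.g. search results are already short
        context.append(observation)                       # raw full pages never enter the context
    return None
```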
New Concept 7 — Headless Browser and Semantic DOM Snapshot 🍞 Hook: Imagine a robot browsing the web with the screen turned off but still seeing the page’s structure. 🥬 The Concept:
- What: A headless browser (Playwright) loads pages and turns raw HTML into a semantic DOM snapshot with IDs for interactive elements.
- How: Programmatically open pages, detect clickable items and fields, and read text content in a structured, LLM-friendly way.
- Why: Without structure, the agent sees a messy wall of text and can’t reliably click or fill the right elements. 🍞 Anchor: The system can identify a button labeled “NeurIPS (id=btn-42)” so the agent can click it by ID.
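A minimal sketch of building such a snapshot with Playwright's Python API is below. The paper states that Playwright powers the headless browser and that interactive elements receive identifiers; the specific selectors and the ID scheme here are assumptions.

```python
# Requires: pip install playwright && playwright install chromium
from playwright.sync_api import sync_playwright

def semantic_snapshot(url: str) -> str:
    """Open a page headlessly and list its interactive elements with synthetic IDs."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="domcontentloaded")
        lines = [f"TITLE: {page.title()}"]
        # Enumerate elements the agent could click or fill, giving each a simple id.
        for i, el in enumerate(page.query_selector_all("a, button, input, select, textarea")):
            tag = el.evaluate("e => e.tagName.toLowerCase()")
            label = (el.inner_text() or el.get_attribute("placeholder")
                     or el.get_attribute("aria-label") or "").strip()[:80]
            lines.append(f"[id={i}] <{tag}> {label}")
        browser.close()
    return "\n".join(lines)

if __name__ == "__main__":
    print(semantic_snapshot("https://example.com"))
```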
Step-by-step:
- Outer-loop reasoning and tool choice
- What: The agent reads the current context, thinks inside <think> tags, and selects a tool to use inside <tool_call> tags.
- Why: Tool-integrated reasoning ensures actions are purposeful.
- Example: “I’ll search for ‘OpenReview NeurIPS venue 2022.’” Then search returns top-10 results.
- search tool
- What: Batch Google queries, return top-10 results (URLs, titles, snippets).
- Why: Cast a wide net to find promising pages.
- Example: Results include openreview.net and neurips.cc links.
- visit tool
- What: Fetch a webpage by URL and pass along the current goal.
- Why: Open the promising page and tell the inner loop what to look for.
- Example: Visit openreview.net with the goal: “Find NeurIPS 2022 entry by author Yuri with ‘certain’ recommendation.”
- Inner loop: goal-conditioned intra-page exploration
New Concept 8 — Page Segmentation and Goal-Conditioned Extraction 🍞 Hook: Like slicing a giant pizza so you can examine each piece for the topping you want. 🥬 The Concept:
- What: The page is split into segments; for each, the model extracts only goal-related parts into a temporary workspace wrapped in <useful_info> tags.
- How: For each segment P_i and goal g, compute f(P_i, g), which returns the relevant passages and a concise summary; append them to the workspace W.
- Why: Without segmentation, the model risks missing key bits or copying too much. 🍞 Anchor: On a paper list page, only segments that mention “Yuri,” “NeurIPS 2022,” and “certain” are kept.
- What happens: The inner loop aggregates all relevant snippets (evidence) and a short summary tied to the goal, then returns W⋆—a compact bundle—to the outer loop (a code sketch appears at the end of this section).
- Why this step: It prevents flooding the outer loop with raw, redundant page text.
- Example: The returned workspace contains the exact entries, their recommendation tags, and a mini-summary.
- click tool
- What: Clicks a specified element (by identifier) and optionally updates the goal for the new page state.
- Why: Trigger tab switches, pagination, or deeper navigation.
- Example: Click “NeurIPS (identifier)” to switch from a conference list to the NeurIPS venue page.
- fill tool
- What: Enters text into input fields or forms.
- Why: Unlocks interactive workflows like search forms or calculators.
- Example: Fill function and starting value for an online Newton’s Method calculator, then click “Calculate.”
- Controlled information flow and serialization
New Concept 9 — Context Window and Token Budgeting 🍞 Hook: Your backpack can only carry so much; pack the essentials. 🥬 The Concept:
- What: The outer loop maintains a manageable context length by receiving only W⋆ (goal-relevant workspace) from page interactions.
- How: Raw page P is never fully injected; only curated evidence is appended as <tool_response>.
- Why: Prevents early termination due to context overflow and keeps reasoning focused. 🍞 Anchor: After 20 tool calls, raw processed text might exceed 128k tokens, but the outer loop context stays within limits thanks to filtering.
- Training: multi-task imitation learning
- Outer-loop objective L_out: The model imitates good trajectories—reasoning, tool calls, and final answers—by minimizing token-level negative log-likelihood.
- Inner-loop objective L_in: The model learns to produce high-quality goal-relevant workspaces segment by segment.
- Combined loss L_MT = λ_out L_out + λ_in L_in with default λ_out = λ_in = 1.
New Concept 10 — Rejection Sampling for Quality Filtering 🍞 Hook: Like only saving your best practice essays to study later. 🥬 The Concept:
- What: Keep only good trajectories that follow format rules, use valid tools/arguments, and end with a correct final answer.
- How: Discard rollouts with broken tags, hallucinated tools, or wrong answers; keep the rest for training.
- Why: Avoid learning bad habits and spurious patterns. 🍞 Anchor: If a rollout forgets to put the final answer inside <answer> tags, it gets rejected.
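A simple filter in this spirit might look like the sketch below; the tag names match those used in this explainer, and the specific checks are illustrative rather than the paper's exact criteria.

```python
import re

VALID_TOOLS = {"search", "visit", "click", "fill"}

def keep_trajectory(trajectory_text: str, tool_names: list[str],
                    final_answer: str, gold_answer: str) -> bool:
    """Keep a rollout only if its format, tool use, and final answer all check out."""
    well_formed = bool(re.search(r"<answer>.*?</answer>", trajectory_text, re.S))
    tools_valid = all(name in VALID_TOOLS for name in tool_names)  # no hallucinated tools
    correct = final_answer.strip().lower() == gold_answer.strip().lower()
    return well_formed and tools_valid and correct
```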
- Why omit scrolling and in-page search?
- What: Scrolling only changes which part of the page is exposed; it doesn’t by itself improve goal alignment, and it becomes largely redundant once you have segment-wise extraction.
- Why: The goal is not to read more but to read what matters; extraction solves that more directly.
- Example: Instead of scrolling forever, the inner loop pulls only the author entries matching the query.
- Secret sauce
- Minimal action space lowers decision burden.
- Nested loops perfectly match how humans browse: plan first, then skim smartly.
- Goal-conditioned, segment-wise extraction ensures only relevant bits survive, boosting accuracy and efficiency.
Throughout, inputs are user questions plus web content; outputs are final answers supported by compact, goal-aligned evidence.
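To make the inner loop concrete, here is the sketch referenced earlier: segment the page, apply goal-conditioned extraction f(P_i, g) to each segment, and aggregate the results into a compact workspace. The prompt wording, segment size, and `call_llm` stub are assumptions; only the segment-extract-aggregate pattern comes from the method description.

```python
def call_llm(prompt: str) -> str:
    """Stand-in for the inner-loop extraction model."""
    raise NotImplementedError

def segment(page_text: str, max_chars: int = 4000) -> list[str]:
    """Naive fixed-size segmentation; a real system would respect DOM boundaries."""
    return [page_text[i:i + max_chars] for i in range(0, len(page_text), max_chars)]

def extract_workspace(page_text: str, goal: str) -> str:
    """f(P_i, g) applied per segment, then aggregated into a compact workspace W*."""
    kept = []
    for seg in segment(page_text):
        prompt = (f"Goal: {goal}\n\nPage segment:\n{seg}\n\n"
                  "Copy only the passages relevant to the goal, or reply NONE.")
        piece = call_llm(prompt).strip()
        if piece and piece != "NONE":
            kept.append(piece)
    summary = call_llm(f"Goal: {goal}\nSummarize this evidence in 2-3 sentences:\n"
                       + "\n".join(kept))
    return "<useful_info>\n" + summary + "\n" + "\n".join(kept) + "\n</useful_info>"
```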
04 Experiments & Results
The Test: The authors evaluate on deep, real browsing benchmarks where answers are not easily found by a single query. Datasets include BrowseComp (English), GAIA (103 text-only tasks), BrowseComp-zh, and XBench-DeepSearch (Chinese). The metric is final answer accuracy (pass@1), verified with LLM-as-a-Judge (GPT-4.1) using official prompts.
New Concept 11 — LLM-as-a-Judge 🍞 Hook: When two players disagree on an answer, you sometimes ask a trusted referee. 🥬 The Concept:
- What: A strong language model checks whether an agent’s final answer is correct under the benchmark rules.
- How: Feed the answer and the task into the judge with a scoring prompt.
- Why: Some answers are free-form text, so exact matching doesn’t work. 🍞 Anchor: For GAIA, GPT-4.1 decides if the explanation and final number meet the task’s requirements.
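In practice, a judge call can be as simple as the sketch below, written here against the OpenAI Python SDK. The grading prompt is illustrative; the paper uses GPT-4.1 with each benchmark's official judging prompt.

```python
from openai import OpenAI  # assumes the OpenAI Python SDK (v1+) and an API key in the environment

client = OpenAI()

def judge(question: str, gold_answer: str, model_answer: str) -> bool:
    """Ask a strong model whether the agent's free-form answer matches the reference."""
    prompt = (
        "You are grading an answer.\n"
        f"Question: {question}\n"
        f"Reference answer: {gold_answer}\n"
        f"Candidate answer: {model_answer}\n"
        "Reply with exactly CORRECT or INCORRECT."
    )
    resp = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip().upper().startswith("CORRECT")
```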
The Competition: Baselines include proprietary agents (e.g., OpenAI o3/o4-mini, DeepResearch, Claude variants) and open-source agents (e.g., DeepDive-32B, WebExplorer-8B, WebSailor variants, Kimi-K2, GLM-4.5). Many use only search/visit or ad-hoc browser tools without a unified, minimal toolkit and nested inner-loop extraction.
The Scoreboard (selected):
- BrowseComp (EN): NestBrowse-4B scores 22.4%; NestBrowse-30B-A3B scores 31.6%.
- GAIA (EN, text-only subset): 68.9% (4B) and 75.7% (30B-A3B).
- BrowseComp-zh (ZH): 28.4% (4B) and 42.6% (30B-A3B).
- XBench-DeepSearch (ZH): 74.0% (4B) and 75.0% (30B-A3B); on the tougher † version, 38.0% (4B) and 45.0% (30B-A3B). Context: On several benchmarks, NestBrowse-30B-A3B beats or rivals proprietary systems and clearly outperforms many larger open-source baselines. Even the small 4B model outperforms some much bigger models, showing that smart tool design and interaction matter as much as raw parameter count.
Ablations (Why each piece matters): Using GPT-OSS-120B as the constant outer agent to isolate strategy effects:
- Naive (no toolkit simplification, no extraction): GAIA 46.6, XBench 40.0.
- Simplified toolkit only: GAIA 55.3, XBench 40.0.
- Extraction only (compress pages): GAIA 60.2, XBench 61.0.
- NestBrowse (both): GAIA 73.8, XBench 71.0. Interpretation: Both simplification and extraction help, and together they deliver the biggest jump—like going from a messy backpack and random notes to a tidy toolkit and a crisp summary sheet.
Context Efficiency: After ~20 tool calls, the total raw text processed can exceed a 128k-token limit. Normally, this would end the run early while most tasks are still unsolved. NestBrowse fixes this by feeding only goal-relevant workspaces back into the outer loop, keeping the context under limit and allowing long-horizon browsing to finish.
Inner-Loop Quality Drives Outcomes: Judged by GPT-4.1, NestBrowse improves both raw snapshot retention (keeping the necessary originals to continue interacting) and goal-relevant extraction accuracy over the base model. Swapping inner-loop quality shows a clear correlation with final task accuracy on BrowseComp: a weaker inner loop drags down results (24.0%), while stronger ones boost them (35.0% with NestBrowse-30B-A3B; 36.0% with GPT-OSS-120B as inner loop). Conclusion: Better in-page exploration leads to better answers.
Surprising Findings:
- Small but well-trained agents (4B) can compete with or beat larger models when equipped with the right browsing abstractions and nested control.
- English-only training still generalized well to Chinese benchmarks, suggesting the browsing policy and extraction logic transfer across languages.
- The agent can leverage web-embedded tools, like an online Newton’s Method calculator, to offload hard computation—showing that browsing is not just reading but also using the web’s built-in utilities.
05 Discussion & Limitations
Limitations:
- Text-only browsing: The system deliberately ignores visual content (images, charts, buttons without reliable text labels). Real websites often require visual cues; without them, some tasks remain out of reach.
- Website variability: Dynamic sites may use anti-bot measures, CAPTCHAs, or unusual front-ends that break standard automation.
- Context still finite: While much improved, very long, multi-branch investigations can still push limits if many pages are partially relevant.
- Dependence on training data quality: Imitation learning relies on strong example trajectories; poor data can teach bad habits.
Required Resources:
- Models: Qwen3-4B-Thinking-2507 and Qwen3-30B-A3B-Thinking-2507 variants.
- Compute: Roughly 1,344 GPU-hours (4B) and 4,096 GPU-hours (30B-A3B) on NVIDIA H20s.
- Runtime: Headless Playwright backend, with up to 100 tool calls and 128k token context per episode.
When NOT to Use:
- Vision-heavy tasks: If the answer hides in images, graphics, or canvas-rendered elements without text.
- Strictly static Q&A: If a single search snippet suffices, the nested browsing overhead may be unnecessary.
- Sites with aggressive anti-automation: If clicking/filling is consistently blocked, a different approach is needed.
Open Questions:
- Multimodal extension: How to integrate reliable vision (and possibly audio/video) while keeping the toolkit minimal and decisions simple?
- Learning to generalize workflows: Can the agent learn reusable click/fill schemas for common sites (e.g., flight search) and adapt them quickly?
- Safety and ethics: How to ensure respectful browsing, rate limiting, and compliance with site policies?
- Memory beyond context: Can long-term external memory help track investigations across sessions while preserving privacy and correctness?
- Training at scale: What is the best blend of imitation, rejection sampling, and reinforcement learning for even stronger, safer browsing policies?
06 Conclusion & Future Work
Three-sentence summary: NestBrowse equips AI agents with a tiny but complete browser toolkit (search, visit, click, fill) and a nested plan that separates high-level reasoning from focused in-page extraction. By returning only goal-relevant workspaces to the agent’s context, it keeps memory small and sharp, enabling reliable deep-web investigation even on dynamic, multi-step sites. Joint multi-task imitation learning bakes both outer-loop reasoning and inner-loop extraction into a single model, delivering strong results across tough English and Chinese benchmarks.
Main achievement: Showing that simple, principled browser abstractions plus nested, goal-conditioned extraction let even small models perform powerful, efficient deep web research—often rivaling or beating much larger systems.
Future directions: Add vision for multimodal pages, learn reusable site schemas, enhance safety and policy compliance, and scale training with a mix of imitation and reinforcement learning. Explore memory systems to support very long, multi-session investigations without bloating the context. Investigate adaptive weighting between outer and inner objectives for different task types.
Why remember this: NestBrowse flips the script from “bigger model = better researcher” to “better browsing design = better researcher,” proving that focused tools, clean information flow, and layered control can unlock dynamic web intelligence efficiently and robustly.
Practical Applications
- Academic research assistant that navigates conference portals, filters years, and extracts author-specific decisions.
- Product comparison bot that clicks filters (price, brand) and summarizes only relevant specs and reviews.
- News verifier that follows links, opens sources, and extracts key quotes and dates for cross-checking.
- Travel planner that fills date/location forms, applies filters, and returns a concise list of best options.
- Customer support helper that navigates FAQs and troubleshooting flows, clicking through steps to extract fixes.
- Legal or policy lookup that opens statutes and guidance pages, extracting only clauses relevant to a query.
- Data entry automation that logs into dashboards (where allowed), fills forms, and records confirmed outputs.
- Scientific calculator agent that uses online tools (e.g., Newton’s Method) to compute results and show steps.
- Developer doc navigator that clicks tabs, versions, and examples to return only goal-matching code snippets.
- Grant or scholarship finder that fills eligibility forms and extracts only matching opportunities.