WideSeek-R1: Exploring Width Scaling for Broad Information Seeking via Multi-Agent Reinforcement Learning
Key Summary
- WideSeek-R1 teaches a small 4B-parameter language model to act like a well-run team: one leader plans, many helpers work in parallel, and everyone learns together with reinforcement learning.
- Instead of only thinking longer (depth scaling), it thinks wider (width scaling) by adding more subagents that search at the same time with separate notes so they don't mix up information.
- A new training recipe (multi-agent RL with shared rewards, group normalization, and dual-level reweighting) lets the leader and helpers improve together without arguing over credit.
- On the WideSearch benchmark, WideSeek-R1-4B reaches 40.0% item F1, about as good as the giant DeepSeek-R1-671B, using roughly 170× fewer parameters.
- Performance keeps going up as you add more parallel subagents, showing true width scaling, while depth scaling alone quickly plateaus.
- A 20k-task dataset of broad, table-style queries was auto-built with self-consistency checks to train agents to gather and organize many facts reliably.
- Ablation studies show you need both a trained leader and trained helpers, and mixing "wide" and "deep" training data works best.
- The system stays strong on standard QA tasks, beating several larger multi-agent baselines, so it doesn't trade away general reasoning.
- The work highlights a shift from "how smart is one agent?" to "how well does the team organize itself?" for broad information seeking.
- This approach can make powerful research assistants more affordable and faster by coordinating many small agents instead of relying on one huge model.
Why This Research Matters
Many real tasks are about breadth (gathering lots of small facts across many items), and this system is built for exactly that. By coordinating many small agents with clean notes, it delivers strong, organized tables faster than a single agent can. It shows that you don't always need a massive model; a well-trained team of smaller ones can compete, cutting costs and making advanced AI more accessible. This helps businesses, journalists, scientists, and students compile reliable datasets with fewer mistakes. It also opens a new research path: training better teams, not just bigger solo models.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine your class needs to create a giant world atlas in one day. If only one student tries to do everything, they'll run out of time and start mixing notes. But if the class splits up (some find capitals, others gather populations) and one leader organizes the pieces, you finish faster and cleaner.
🥬 The Concept (Multi-Agent Systems): A multi-agent system is when many AI helpers each do part of a job instead of one AI doing everything. How it works:
- A big task is split into smaller, independent jobs.
- Different agents take different jobs.
- They work at the same time and share results. Why it matters: One agent's memory gets messy (context pollution) and time runs out when tasks are done one-by-one. 🍞 Anchor: Making a movie: writers draft the story, artists design scenes, sound editors add music, all in parallel, so the film ships on time.
🍞 Hook: You know how a video game gives you points for doing well? That makes you try better strategies.
🥬 The Concept (Reinforcement Learning): Reinforcement learning (RL) teaches an AI by giving it rewards when its choices lead to good outcomes. How it works:
- The AI tries actions.
- It gets a score (reward) based on results.
- It updates its strategy to earn higher rewards next time. Why it matters: Without rewards, the AI doesn't know if it's improving. 🍞 Anchor: Practicing basketball: each swish is a reward, so you repeat what worked.
🍞 Hook: Picture a relay race where each runner learns when to pass the baton and how fast to run to help the entire team win.
🥬 The Concept (Multi-Agent Reinforcement Learning): MARL is RL for a whole team of AIs that learn to cooperate. How it works:
- Many agents act in the same task.
- The team gets a shared score.
- Everyone updates their behavior to boost the shared score. Why it matters: Without learning as a team, agents can work at cross-purposes and hurt the final result. 🍞 Anchor: Soccer practice: strikers, midfielders, and defenders learn patterns that raise the team's chance of scoring.
The world before: Recent LLM advances focused on "depth scaling."
🍞 Hook: You know how you can sometimes solve a tough math problem by thinking through many careful steps?
🥬 The Concept (Depth Scaling): Depth scaling means giving one AI more turns and longer reasoning to solve a job. How it works:
- The single agent plans, searches, and thinks over many steps.
- It accumulates notes in one long conversation.
- It slowly builds to an answer. Why it matters: For very deep, single questions, more careful steps can help, but it also risks long, messy contexts. 🍞 Anchor: One detective reading every clue in a giant file: thorough, but slow and easily overwhelmed.
The problem: Broad information-seeking tasks (like "make a table with dozens of items and attributes") are about breadth, not just depth. Two big pains appear:
- Context pollution: one agent's notebook fills with unrelated bits from many subtasks.
- Sequential bottleneck: independent subtasks are done one at a time.
Failed attempts:
- Hand-crafted workflows: designers hard-code who does what, but it's inflexible as tasks or agent counts change.
- Turn-taking systems: agents wait for each other, which ruins parallel speed.
- Sample-more tricks (best-of-N): repeat the same plan many times to pick a good try; this helps reliability but not breadth, because subtasks aren't split.
The gap: We need a system that learns to organize a team and truly parallelize work, keeping each subtask's notes separate while still producing a single, clean table at the end.
🍞 Hook: Think of an orchestra conductor who knows who should play when, and keeps each section on their own sheet music.
🥬 The Concept (Orchestration): Orchestration is the leader's skill to assign clear subtasks, time them, and combine the results. How it works:
- Break the big job into independent parts.
- Send each part to a helper with instructions.
- Collect and merge the answers. Why it matters: Without orchestration, helpers duplicate work, miss pieces, or collide. 🍞 Anchor: A kitchen head chef assigns appetizers, mains, and desserts to different cooks, then plates the full meal.
Real stakes: This isn't just about benchmarks. It affects:
- Business research: compiling competitive product tables.
- Journalism: summarizing many sources into clean datasets.
- Science: aggregating related studiesā attributes.
- Education: building resource catalogs.
- Everyday life: planning trips with multiple cities, budgets, and dates.
WideSeek-R1 was built to fill this gap: a learned leader-helper team that scales "width," not just "depth," to conquer broad info-seeking without drowning in its own notes.
02 Core Idea
🍞 Hook: Imagine building a giant LEGO city: one kid plans the neighborhoods, many friends each build a block at the same time, and together they finish faster and cleaner than any single super-builder could.
🥬 The Concept (Aha!): Train one lead agent and many parallel subagents together (with MARL) so the leader learns to split the job and the helpers learn to search well, all using separate notes and specialized tools. How it works (high level):
- The lead agent breaks the big question into independent subtasks.
- Each subagent searches and summarizes in its own clean mini-notebook (context isolation).
- The lead collects the pieces and writes a structured final table.
- A shared team reward and a special RL recipe teach everyone to cooperate. Why it matters: Without learned team play, parallel workers create chaos or sit idle; without isolation, notes get tangled; without shared rewards, they don't improve as a group. 🍞 Anchor: A newsroom editor (lead) assigns reporters (subagents) different beats, each files a clean report, and the editor assembles a front-page story.
Three analogies:
- Restaurant kitchen: head chef (lead) assigns stations; line cooks (subs) prepare dishes at once; the pass (final table) is neat and on time.
- Bee colony: a queen (lead) signals tasks; foragers (subs) gather nectar in different fields; honeycomb (table) fills quickly without mixing pollen.
- School project: a team captain (lead) splits topics; teammates (subs) research in separate docs; captain merges into a tidy report.
Before vs After:
- Before: One big agent thinks longer; info gets mixed; tasks queue up; hand-crafted multi-agent flows struggle to scale.
- After: A trained leader delegates; subagents run truly in parallel; each keeps a clean context; the final table is stronger and faster.
Why it works (intuition, not equations):
- Context isolation is like separate notebooks: fewer mix-ups.
- Parallel execution is like many hands making light work.
- A shared backbone model keeps the team speaking the same "language," so prompts and summaries mesh.
- Group-normalized rewards and dual-level reweighting stabilize learning so no single chatterbox (a long response or too many subagents) dominates training.
- The dataset forces breadth: many entities, columns, and row counts train the system to cover and organize widely.
Building blocks (each is a mini concept):
🍞 Hook: Picture a smart project manager who only gives out tasks and waits for results.
🥬 The Concept (Lead-Agent–Subagent Framework): One leader assigns subtasks; many helpers solve them. How it works:
- Leader uses a single tool: call_subagent.
- Creates clear prompts for subagents.
- Waits, then consolidates findings into the answer. Why it matters: Without a leader, helpers overlap or leave gaps. 🍞 Anchor: A librarian assigns shelf sections to volunteers and later checks all books are in order.
🍞 Hook: Think of adding more checkout counters at a grocery store.
🥬 The Concept (Width Scaling): Improve performance by adding more parallel agents instead of only adding more steps to one agent. How it works:
- Keep turn count fixed.
- Increase number of subagents working at once.
- Merge their outputs. Why it matters: Without width, independent subtasks block each other in line. 🍞 Anchor: Ten cashiers serve shoppers faster than one cashier taking ten times as long.
🍞 Hook: You know how you don't bring your math notes into art class?
🥬 The Concept (Context Isolation): Each subtask has its own clean workspace so irrelevant info doesn't spill over. How it works:
- Separate contexts per subagent.
- Strip out unnecessary "thinking" when reporting back.
- Leader only sees final, tidy summaries. Why it matters: Without isolation, errors cascade across subtasks. 🍞 Anchor: Separate Google Docs per teammate avoid messy edits.
🍞 Hook: Imagine a scoreboard that only cares about the final team result, not who shouted the most.
🥬 The Concept (Group Advantage + Dual Reweighting): A training trick that shares one team reward across agents, then balances influence by tokens and by agents. How it works:
- Compute one outcome score per multi-agent rollout.
- Normalize it within a group (fair comparison).
- Reweight by token counts (longer, meaningful turns count) and by number of agents (crowds don't drown others). Why it matters: Without this, agents could game the system by being overly long or by spawning too many helpers. 🍞 Anchor: In a relay, the final time trains the whole team, and coaching attention is fairly split among runners.
Put together, these pieces let WideSeek-R1 turn a broad question into a crisp, structured table, faster and with fewer mix-ups than a lone agent trying to juggle everything.
03 Methodology
At a high level: Input query → Lead agent decomposes (call_subagent) → Subagents search in parallel (search/access) → Subagents return clean summaries → Lead agent synthesizes → Output structured table.
We'll walk through each step like a recipe, and introduce new ideas with the Sandwich pattern.
- Input and Goal
- What happens: The system receives a broad info-seeking query that asks for a table (e.g., "List Ivy League universities with name, city, and founding year").
- Why it exists: The request needs many facts across multiple entities, so breadth and organization matter.
- Example data: Ivy League universities (rows are schools; columns include name, city, founding year).
- Lead Agent: Decompose and Delegate
🍞 Hook: Think of a coach who writes a play, sends players to their spots, and waits for the pass back.
🥬 The Concept (Task Decomposition): Breaking a big job into independent subtasks that can be done at the same time. How it works:
- Parse the requested table's schema (columns) and coverage (row count or full-list requirement).
- Identify independent units (e.g., one row per entity), and split them into prompts.
- Create up to N subtasks in one turn via call_subagent. Why it matters: Without decomposition, the leader can't distribute work or ensure full coverage. 🍞 Anchor: For 8 Ivy League schools, create 8 prompts: "Find the city and founding year of Harvard," "...of Yale," etc. (see the sketch after this list).
- The leader only has one tool: call_subagent. This keeps the leader's context clean and prevents context pollution.
- The leader waits until subagents finish, then either launches another parallel wave or finalizes the answer.
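To make the decomposition step concrete, here is a minimal Python sketch of one leader turn. The `Subtask` container and the `decompose` helper are illustrative assumptions; in the real system the leader is an LLM that emits call_subagent arguments as text, not Python objects.

```python
from dataclasses import dataclass

@dataclass
class Subtask:
    prompt: str  # one self-contained instruction for one subagent

def decompose(entities: list[str], columns: list[str]) -> list[Subtask]:
    # One independent subtask per row: each prompt carries everything
    # a subagent needs, so no context is shared between them.
    return [
        Subtask(prompt=f"Find the {', '.join(columns)} of {entity}. "
                       f"Return one value per field with a source URL.")
        for entity in entities
    ]

# Example: the Ivy League query from above.
ivy = ["Harvard", "Yale", "Princeton", "Columbia",
       "Penn", "Brown", "Dartmouth", "Cornell"]
subtasks = decompose(ivy, ["city", "founding year"])
assert len(subtasks) == 8  # one clean prompt per school
```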
- Subagents: Parallel Information Seeking
🍞 Hook: Picture many reporters going to different libraries at once.
🥬 The Concept (Parallel Execution): Multiple helpers work simultaneously on separate tasks. How it works:
- Each subagent receives one subtask and has its own context.
- Subagents use tools: search (find snippets + URLs) and access (open a chosen URL and summarize details).
- They iterate: search to discover, access to confirm and extract, then return a concise, structured summary. Why it matters: Without parallel execution, total time balloons and context gets tangled. 🍞 Anchor: Ten subagents each research one university's city and founding year, finishing in one or two turns instead of ten turns.
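Below is a minimal sketch of the fan-out itself, using Python's asyncio; `run_subagent` is a hypothetical stand-in for one subagent's full search loop, not the paper's actual interface.

```python
import asyncio

async def run_subagent(prompt: str) -> str:
    # Stand-in for one subagent's isolated search/access loop.
    # Each call owns its own context; nothing is shared between tasks.
    await asyncio.sleep(0.1)  # placeholder for real tool calls
    return f"summary for: {prompt}"

async def fan_out(prompts: list[str]) -> list[str]:
    # Launch every subtask at once. Wall-clock time is roughly the
    # slowest subtask, not the sum of all of them.
    return list(await asyncio.gather(*(run_subagent(p) for p in prompts)))

summaries = asyncio.run(fan_out([f"subtask {i}" for i in range(10)]))
print(len(summaries))  # 10 results, gathered in parallel
```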
🍞 Hook: You know how you first Google a topic, then click a promising link to read more closely?
🥬 The Concept (Tool Use: search + access): Two-step evidence gathering: broad discovery, then deep reading. How it works:
- search: return candidate URLs and brief snippets.
- access: fetch and summarize content from a selected URL, tailored to the subtask.
- Summarize findings with citations back to the leader. Why it matters: Without this two-step flow, you either miss coverage (no discovery) or waste time reading random pages (no focus). 🍞 Anchor: Search "Harvard founding year," then access the official page or a reliable wiki to confirm it's 1636.
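The per-subagent loop might look like the following sketch; the signatures of the injected `search` and `access` callables are assumptions for illustration, not the paper's exact tool interface.

```python
def seek(subtask: str, search, access, max_turns: int = 20) -> dict:
    """Two-step evidence gathering for one subtask (illustrative)."""
    for _ in range(max_turns):
        # Step 1: broad discovery -- candidate URLs with short snippets.
        hits = search(subtask)  # assumed to return [(url, snippet), ...]
        if not hits:
            continue  # retry (a real agent would reformulate the query)
        # Step 2: focused reading -- open a promising page and extract
        # only what the subtask asks for.
        url, _snippet = hits[0]
        summary = access(url, focus=subtask)  # assumed signature
        if summary:  # confirmed and extracted
            return {"summary": summary, "evidence": url}
    return {"summary": None, "evidence": None}  # report failure cleanly
```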
- Returning Results to the Leader
- What happens: Subagents send compact summaries (thinking removed) back to keep the leader's context small and neat.
- Why it exists: The leader must merge many pieces and avoid drowning in long notes.
- Example: Each subagent returns: School=Harvard, City=Cambridge, MA, FoundingYear=1636, Evidence=URL.
- Leader: Synthesize and Finalize Table
- What happens: The leader checks coverage (all rows present?), consistency (any conflicts?), and formats a single Markdown table.
- Why it exists: To deliver exactly what the user asked for, one clean, complete table.
- Example: 8 rows, 3 columns; if any school is missing, the leader launches another round for just the missing bits.
- Training the Whole Team (MARL)
🍞 Hook: Think of a class graded on a group project, with fair grading so both the presenter and the researchers improve.
🥬 The Concept (Outcome Reward): One score based on the final answer's quality (e.g., table cell accuracy), plus small bonuses for correct format and using tools. How it works:
- Compare the generated table to ground truth to get an Item F1 score.
- Add rewards for proper Markdown format and at least one tool use; subtract a penalty if responses are too long.
- Share this single rollout reward with all agents in that attempt. Why it matters: Without a clear, verifiable outcome, the team can't learn what really leads to better answers. 🍞 Anchor: If the final table's cells match well, the whole team gets a higher score; if formatting breaks, the score drops to zero for that run.
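A sketch of such a reward function is below; the structure follows the description above, but the bonus and penalty magnitudes are assumed values, not the paper's numbers.

```python
def rollout_reward(item_f1: float, format_ok: bool,
                   used_tool: bool, response_tokens: int,
                   max_tokens: int = 8192) -> float:
    """One shared outcome reward per multi-agent rollout (sketch)."""
    if not format_ok:
        return 0.0                    # broken Markdown voids the run
    reward = item_f1                  # main signal: table cell accuracy
    if used_tool:
        reward += 0.1                 # small bonus for using tools
    if response_tokens > max_tokens:
        reward -= 0.1                 # penalty for overlong responses
    # The leader and every subagent in this rollout share this value.
    return reward
```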
🍞 Hook: Imagine normalizing test scores so teams in the same class are compared fairly, not across wildly different conditions.
🥬 The Concept (Group-Normalized Advantage + Dual Reweighting): Stabilizes training by comparing runs fairly and balancing influence. How it works:
- Normalize rewards across multiple rollouts in a group (fair baseline).
- Token-level reweighting: longer, contentful turns contribute proportionally (not drowned by averaging).
- Agent-level reweighting: average over agents so spawning more helpers only helps if quality improves. Why it matters: Without these, the system might game the reward by being verbose or by spawning excessive subagents. 🍞 Anchor: In practice, models that simply write more don't win; models that write better and coordinate better do.
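A compact sketch of this recipe follows. The group normalization is GRPO-style; the exact form of the dual-reweighted objective is an interpretation of the description above, not the paper's formula.

```python
import numpy as np

def group_advantages(rewards: np.ndarray) -> np.ndarray:
    # Normalize one shared reward per rollout against its group,
    # so rollouts are compared to a fair baseline.
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def reweighted_objective(advantage: float,
                         token_logprobs_per_agent: list[np.ndarray]) -> float:
    # Token level: SUM each agent's token log-probs, so long,
    # contentful turns are not drowned by averaging.
    per_agent = [lp.sum() for lp in token_logprobs_per_agent]
    # Agent level: AVERAGE across agents, so spawning more subagents
    # only helps if it raises the shared reward, not by sheer headcount.
    return -advantage * float(np.mean(per_agent))

# Toy usage: 4 rollouts in one group, each with one shared reward.
rewards = np.array([0.40, 0.25, 0.55, 0.30])
advs = group_advantages(rewards)
# Rollout 2 had a 30-token leader turn and a 120-token subagent turn.
loss = reweighted_objective(advs[2], [np.full(30, -0.5), np.full(120, -0.5)])
```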
- Data to Teach Breadth
- Auto-built 20k-task dataset:
  • Stage 1: Turn intents into strict, schema-bound queries (columns fixed; 10–50 rows targeted).
  • Stage 2: Generate two independent answers and identify unique column(s) for row matching.
  • Stage 3: Keep only pairs that agree (high cell match) and are hard enough (not tiny or trivial).
- Why it matters: Without broad, table-focused data, the team wouldn't practice wide coverage and reliable synthesis.
- Example: "Top 20 countries by population (2025) with Rank, Country, Population (millions, 1 decimal)."
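The Stage 3 self-consistency filter could be sketched as follows; the agreement threshold, minimum size, and answer format are assumed values for illustration.

```python
def keep_task(ans_a: dict, ans_b: dict, key_col: str,
              min_rows: int = 10, min_agreement: float = 0.9) -> bool:
    """Self-consistency filter over two independent answers (sketch)."""
    rows_a = {r[key_col]: r for r in ans_a["rows"]}
    rows_b = {r[key_col]: r for r in ans_b["rows"]}
    shared = rows_a.keys() & rows_b.keys()
    if len(shared) < min_rows:
        return False                        # too tiny or trivial
    cells = matches = 0
    for key in shared:                      # match rows on the unique column
        for col in ans_a["columns"]:
            cells += 1
            matches += rows_a[key][col] == rows_b[key][col]
    return matches / cells >= min_agreement  # keep only agreeing pairs
```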
- Practical Controls
- Shared 4B backbone (Qwen3-4B, thinking mode) for both leader and subagents keeps the team fluent.
- Limits: up to 10 parallel subagents/turn; leader up to 10 turns; subagent up to 20 turns.
- Tools: often backed by an offline Wiki2018 KB during training to reduce cost and keep evaluation consistent.
Secret sauce (why this recipe is clever):
- Narrow, clean leader tool (call_subagent) avoids self-mess.
- True parallel subagents with isolated contexts prevent cross-contamination and speed coverage.
- A group-outcome reward plus normalization and dual reweighting keeps training fair, stable, and team-oriented.
- The dataset forces the skills that matter for breadth: consistent schemas, many rows, careful row matching.
04 Experiments & Results
The test: WideSearch, a benchmark of 200 broad, table-style tasks (100 English, 100 Chinese). It checks whether the system can gather many attributes across many entities and produce one correct table.
🍞 Hook: Think of grading a big class science fair with three rubrics: every measured number correct (cells), each project fully right (rows), and perfect blue-ribbon projects (entire table).
🥬 The Concept (Metrics: Item F1, Row F1, Success Rate): How it works:
- Item F1: compares table cells, like checking each data cell.
- Row F1: compares whole rows, which must match exactly to count.
- Success Rate: a perfect full-table match. Why it matters: Without multi-level grading, you can miss whether errors are scattered (cells) or whole rows are wrong. 🍞 Anchor: If most cells are correct but a few rows are mismatched, Item F1 will look okay, Row F1 will drop, and Success Rate may be zero.
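A simplified version of this three-level grading, assuming rows are matched on a unique key column and skipping WideSearch's exact normalization rules:

```python
def table_metrics(pred: list[dict], gold: list[dict], key: str) -> dict:
    gold_by_key = {row[key]: row for row in gold}
    pred_cells = sum(len(r) for r in pred)
    gold_cells = sum(len(r) for r in gold)
    cell_hits = row_hits = 0
    for p in pred:
        g = gold_by_key.get(p.get(key))
        if g is None:
            continue                    # predicted row with no gold match
        hits = sum(p.get(col) == val for col, val in g.items())
        cell_hits += hits
        row_hits += hits == len(g)      # a row counts only if fully right

    def f1(tp: int, n_pred: int, n_gold: int) -> float:
        prec, rec = tp / max(n_pred, 1), tp / max(n_gold, 1)
        return 2 * prec * rec / max(prec + rec, 1e-9)

    item_f1 = f1(cell_hits, pred_cells, gold_cells)
    row_f1 = f1(row_hits, len(pred), len(gold))
    return {"item_f1": item_f1, "row_f1": row_f1,
            "success": float(item_f1 == 1.0 and len(pred) == len(gold))}
```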
Competition: Single-agent baselines (Qwen3-4B, Search-R1-7B, ASearcher-7B, DeepSeek-R1-671B) and multi-agent baselines (AgentFlow-7B, OWL-8B, MiroFlow-8B, and a Qwen3-4B multi-agent setup). WideSeek-R1-4B is the proposed system.
Scoreboard (with context):
- WideSeek-R1-4B gets 40.0% Item F1 on WideSearch (like getting an A when many others got a C+) and roughly ties the giant DeepSeek-R1-671B while using about 170× fewer parameters.
- It beats all 4B and 8B baselines on five of six metrics (Item/Row F1 Avg@4, Max@4, and Success Rate variants).
- Compared to the same backbone used as a basic multi-agent system (Qwen3-4B), MARL training adds about +8.8% Item F1, clear proof that learned orchestration and execution matter.
Surprising findings:
- Depth scaling plateaus: letting a single agent take more turns helps at first but quickly saturates, like studying longer without changing strategy.
- Width scaling shines, but only if learned: adding more untrained subagents eventually hurts (too much conflicting noise). With WideSeek-R1's training, performance keeps rising as subagents increase (up to 10 in tests).
- Joint optimization matters: ablations show that training only the leader or only the subagents helps, but training both together is best; teamwork is learned.
- Data mixing helps: models trained on hybrid (wide + deep) data beat those trained on only-wide or only-deep, showing these skills complement each other.
Standard QA (beyond tables): On seven single-hop and multi-hop QA sets (NQ, TriviaQA, PopQA, 2Wiki, HotpotQA, Bamboogle, MuSiQue), WideSeek-R1-4B maintains strong average performance, surpassing some larger multi-agent systems (like OWL-8B, MiroFlow-8B). This means the system didn't sacrifice general reasoning to get good at tables.
Behavioral patterns after training:
- More thoughtful, useful tool use: more parallel subagents and a higher fraction of deep-access calls to fetch richer evidence.
- Multi-agent answers use more total turns (expected because many helpers are working), but this turns into better coverage and cleaner final tables.
Bottom line: Width scaling with learned orchestration and parallel execution converts extra compute into real accuracy gains on broad info-seeking, something depth scaling alone couldn't sustain.
05 Discussion & Limitations
Limitations:
- Compute cost: Even with a 4B model, training took about 3,000 H100 GPU-hours. Scaling up or exploring larger backbones would cost more.
- Credit assignment: The training uses one final outcome reward shared by all agents. This is simple and stable, but it's coarse: it can't always tell whether the leader's plan or a subagent's execution caused a failure.
- Fixed hierarchy: Training uses a two-layer design (leader → subagents). Allowing subagents to spawn more subagents (recursion) could help on very complex tasks but destabilized training in early trials.
- Training latency: Synchronous rollouts made training stable but slow; most time was spent waiting for long generations.
- Tool coverage: Results depend on the search/access sources; offline KBs are cheaper but may miss some live facts.
Required resources:
- A cluster with enough GPUs for multi-agent rollouts.
- A search + access tool setup (online APIs or a local wiki index).
- The 20k wide dataset and optionally deep QA data for hybrid training.
When NOT to use:
- Very deep, single-focus reasoning tasks where one strong agent with long chain-of-thought is enough (width adds overhead).
- Ultra-low-latency settings where spinning up multiple subagents per query isn't acceptable.
- Tasks without clear subtask boundaries (hard to decompose cleanly).
Open questions:
- Can we design role-specific rewards to separate orchestration quality from execution quality?
- Can we train dynamic hierarchies (recursive delegation) without destabilizing MARL?
- Will asynchronous rollouts (no waiting for the slowest subagent) keep stability and speed training?
- How to merge conflicting subagent answers better: confidence weighting, source reliability scoring, or learned aggregation heads?
- What's the best way to balance the number of subagents vs. turns per subagent for a fixed compute budget?
06 Conclusion & Future Work
Three-sentence summary: WideSeek-R1 trains a leader and parallel helpers to tackle broad, table-style information seeking, focusing on width (more agents at once) instead of only depth (more turns for one agent). A multi-agent RL recipe, with group-normalized rewards and dual-level reweighting, teaches the team to decompose tasks, search in parallel with clean contexts, and merge results into one structured table. This delivers 40.0% item F1 on WideSearch with a 4B model, rivaling far larger single-agent systems, and performance keeps improving as more subagents are added.
Main achievement: Proving that learned orchestration plus true parallel execution unlocks real width scaling, turning more agents into more correct answers, not just more noise.
Future directions:
- Role-specific rewards to pinpoint and fix planning vs. execution errors.
- Dynamic, possibly recursive hierarchies for harder tasks.
- Asynchronous training to cut rollout latency.
- Smarter evidence merging (confidence and source reliability) and better safety guardrails.
Why remember this: It marks a shift from asking "How deep can one agent think?" to "How well can a team coordinate?", a practical path for faster, cheaper, and more reliable broad information seeking with small-but-many agents acting in concert.
Practical Applications
- Competitive product analysis: build side-by-side feature/price tables across dozens of brands.
- Market landscaping: map companies, locations, and founding years for an industry overview.
- Academic literature reviews: tabulate studies, methods, datasets, and results.
- Travel planning: compile city-by-city tables of attractions, hours, prices, and transit options.
- Public data dashboards: gather municipal metrics (schools, parks, services) into unified tables.
- E-commerce cataloging: enrich product listings with specs, certifications, and availability.
- News research: summarize multiple sources into a clean fact table for an investigative story.
- Education resources: build course topic tables with references, examples, and difficulty levels.
- Healthcare info aggregation: list clinics with services, locations, hours, and contact details.
- Policy comparison: tabulate laws or regulations across regions with dates and key clauses.