
FS-Researcher: Test-Time Scaling for Long-Horizon Research Tasks with File-System-Based Agents

Intermediate
Chiwei Zhu, Benfeng Xu, Mingxuan Du et al. · 2/2/2026
arXiv · PDF

Key Summary

  • FS-Researcher is a two-agent system that lets AI do very long research by saving everything in a computer folder so it never runs out of memory.
  • One agent (Context Builder) acts like a librarian: it searches the web, takes organized notes with citations, and files raw pages into a big, tidy knowledge base.
  • The other agent (Report Writer) acts like an author: it writes the report section by section using only the saved knowledge, not the whole internet at once.
  • Using the file system as external memory means the AI can keep working across many sessions without forgetting what it learned.
  • On the DeepResearch Bench test, FS-Researcher with Claude-Sonnet-4.5 beat strong baselines in overall quality, especially on comprehensiveness and insight.
  • On the DeepConsult test, FS-Researcher won most head-to-head comparisons, showing it helps on consulting-style research too.
  • Giving the Context Builder more computation (more rounds) made reports better, proving the method supports test-time scaling.
  • Writing the report section by section worked much better than one-shot writing, leading to deeper analysis and clearer structure.
  • The approach needs capable models and careful handling of web content to avoid errors, costs, and privacy risks.
  • This matters because it points to a practical way for AI to help with real-world, weeks-long research—like business analyses, literature reviews, and policy studies.

Why This Research Matters

Many real problems—business strategy, policy analysis, scientific reviews—don’t fit in an AI’s short-term memory. FS-Researcher shows a practical way to keep going: save everything in a shared workspace, split roles, and improve step by step. This turns extra thinking time into better reports without needing to retrain the model. It also makes results more trustworthy by grounding claims in archived sources with precise citations. Teams can pause and resume work across days, sharing a living library instead of starting from scratch. In short, it’s a blueprint for AI that can do serious, long-form research the way people do—carefully, iteratively, and with receipts.

Detailed Explanation


01Background & Problem Definition

🍞 Hook: Imagine you’re making a giant school project—hundreds of pages! If you tried to keep every fact in your head while writing, you’d run out of room fast.

🥬 The Concept: Context window (what it is, how it works, why it matters)

  • What it is: A context window is the amount of text an AI can read and reason over at once when thinking and writing.
  • How it works: 1) The AI loads the current question and notes; 2) It reads them together; 3) It produces the next words; 4) If the notes are too long, some must be left out or shortened.
  • Why it matters: Without enough space, the AI can’t see all the facts it needs, so it may miss evidence or forget earlier details. 🍞 Anchor: If you ask the AI to read 300 web pages and write a 20-page report, the context window is like a backpack that’s too small—you can’t carry everything at once.
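For the technically curious, here is a tiny Python sketch of the "too-small backpack" effect. It is a toy only: whitespace-split words stand in for real tokens, and the budget of 100 is made up; real tokenizers and context limits differ.

```python
# Toy illustration of a context window: only what fits in the budget is visible.
# Whitespace-split words stand in for real tokens (a simplifying assumption).

def fit_into_context(notes: list[str], budget_tokens: int = 100) -> list[str]:
    """Keep only the most recent notes that fit inside the token budget."""
    kept, used = [], 0
    for note in reversed(notes):           # newest notes first
        cost = len(note.split())           # crude token count
        if used + cost > budget_tokens:
            break                          # older notes fall out of the backpack
        kept.append(note)
        used += cost
    return list(reversed(kept))

notes = [f"fact {i}: " + "detail " * 20 for i in range(30)]
print(len(fit_into_context(notes)))        # only a handful of the 30 facts survive
```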

🍞 Hook: You know how a chef plans meals to use ingredients wisely so nothing is wasted?

🥬 The Concept: Token utilization

  • What it is: Token utilization is how efficiently the AI uses its limited reading-and-writing budget (tokens) for a task.
  • How it works: 1) Decide what to read; 2) Save the most important bits; 3) Reuse saved bits later; 4) Spend more tokens where they bring the most benefit.
  • Why it matters: If the AI wastes tokens on fluff, it has fewer tokens left for key facts and final writing. 🍞 Anchor: Like saving your best colored pencils for the poster title, good token use keeps the report strong where it counts.

🍞 Hook: Think about building a Lego city over many afternoons—you need a table where your buildings can stay between playtimes.

🥬 The Concept: Persistent workspace (external memory)

  • What it is: A persistent workspace is a place outside the AI’s short-term memory to store files, notes, and plans between sessions.
  • How it works: 1) Create folders and files; 2) Save notes and sources; 3) Keep to-do lists and logs; 4) Reopen next time to continue exactly where you left off.
  • Why it matters: Without it, every session is like starting over and facts get lost. 🍞 Anchor: A computer folder with your research notes, sources, and plan is a persistent workspace for the AI—just like your Lego table.

🍞 Hook: You know how we say, “Measure twice, cut once” when crafting?

🥬 The Concept: Test-time scaling

  • What it is: Test-time scaling means letting the AI spend more thinking steps—more compute—on a hard problem after you give it the question.
  • How it works: 1) Allow more rounds of searching and note-taking; 2) Keep saving to the workspace; 3) Write with richer materials; 4) Repeat until quality is good.
  • Why it matters: Without the ability to add more compute at test time, tough tasks stay shallow and incomplete. 🍞 Anchor: If a math problem is tricky, you do more steps; with test-time scaling, the AI can also “do more steps” to solve big research problems.
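As a rough sketch of that knob, the loop below treats the number of builder rounds as a parameter you can raise at test time. The helpers (`run_builder_round`, `write_report`, `deep_research`) are trivial stand-ins invented for illustration, not the paper's actual components.

```python
# Test-time scaling as an outer loop: same model, more builder rounds.
# All helpers are trivial stand-ins for illustration only.

def run_builder_round(knowledge_base: list[str], round_id: int) -> None:
    # Stand-in for one round of search -> read -> distill -> archive.
    knowledge_base.append(f"notes gathered in round {round_id}")

def write_report(knowledge_base: list[str]) -> str:
    # Stand-in for section-wise writing grounded in the saved notes.
    return "\n".join(f"- cites: {note}" for note in knowledge_base)

def deep_research(question: str, num_builder_rounds: int) -> str:
    knowledge_base = [f"question: {question}"]
    for round_id in range(num_builder_rounds):
        run_builder_round(knowledge_base, round_id)
    return write_report(knowledge_base)

for rounds in (3, 5, 10):                  # the "study longer" knob
    print(rounds, len(deep_research("top 10 insurers", rounds).splitlines()))
```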

🍞 Hook: Imagine trying to write a book while you’re still running around the library. You’ll either read too little or write too little.

🥬 The Concept: Deep research challenge before FS-Researcher

  • What it is: Long-horizon research needs hundreds of sources and long reports, but single-agent, in-context setups run out of space and mix reading with writing.
  • How it works: 1) The agent searches and summarizes; 2) Summaries replace details; 3) Details vanish due to space; 4) The final report lacks depth and citations.
  • Why it matters: Reports become list-like, with missing coverage and weak grounding. 🍞 Anchor: It’s like trying to memorize an encyclopedia, then write from memory—things get lost or muddled.

🍞 Hook: Think of sticky notes that keep falling off—by tomorrow, half your plan is gone.

🥬 The Concept: Ephemeral state vs. persistent artifacts

  • What it is: Ephemeral state is information that disappears when a session ends; persistent artifacts are files that remain.
  • How it works: 1) In many agents, thoughts vanish after the loop; 2) In a file-based approach, notes/logs live on; 3) Future sessions reuse them; 4) Progress compounds.
  • Why it matters: Without persistence, you can’t steadily refine large projects. 🍞 Anchor: Keeping a science journal (persistent) beats scribbles on a whiteboard that get erased (ephemeral).

  • The world before: Many research agents tried to stretch the context window by summarizing tool outputs, or by splitting browsing across sub-agents that still fed everything back into the main agent's context. This helped a little but hit a hard ceiling: the context limit. Internal thoughts, plans, and tool traces evaporated after each run, so iteration restarted from scratch.
  • The problem: How can we keep gathering evidence and keep writing better reports without being trapped by the context window?
  • Failed attempts: Heavier summarization (lossy), more sub-agents (still context-bound), or one-shot long writing (shallow and list-like).
  • The missing piece: A durable external workspace that holds a growing, well-organized knowledge base with citations, plus a workflow that separates evidence-gathering from writing.
  • Real stakes: Business analysts, journalists, scientists, and policy teams need reliable, traceable reports that can evolve over days or weeks. A method that scales thinking time and preserves artifacts can turn AI from a short sprinter into a marathon researcher.

02Core Idea

🍞 Hook: Picture a relay team: one runner collects the baton from many places, the next runner sprints it cleanly to the finish.

🥬 The Aha Moment

  • One sentence: Split research into two expert agents and save everything in a file-system workspace so you can scale thinking time far beyond the AI’s short-term memory.

🍞 Anchor: One teammate (librarian) gathers and files sources; the other (author) writes chapter by chapter from that library.

🍞 Hook: You know how big school plays need both backstage crew and actors on stage?

🥬 Multiple Analogies

  1. Library-and-author: The Context Builder makes the library; the Report Writer is the author who cites that library. Without a library, the author forgets facts; without an author, the library never becomes a story.
  2. Construction site: One crew lays foundations (plans, materials, checklists); the next crew builds rooms (sections). The shared site (workspace) keeps blueprints and materials safe between days.
  3. Cooking show: Prep cook chops, labels, and stores ingredients; head chef plates dishes one course at a time. The fridge and shelves (workspace) keep everything fresh and findable.

🍞 Anchor: If the prep cook doesn’t label containers, the chef wastes time and makes mistakes.

🍞 Hook: Imagine upgrading from a small backpack to a whole locker room.

🥬 Before vs. After

  • Before: Single agent shoves everything into one context; details get summarized away; citations get messy; long reports feel like lists.
  • After: Two agents share a persistent workspace; the knowledge base grows without limit; the report is written section-wise with grounded citations.
  • Change: You can add more compute at test time (more Context Builder rounds) and actually get better results. 🍞 Anchor: More prep rounds mean a fuller pantry; the final meal tastes better.

🍞 Hook: Think of a treasure map where X marks the spot, and paths guide you from clue to clue.

🥬 Why It Works (intuition, no equations)

  • Structured saving beats compress-and-forget: Raw sources + distilled notes + citations keep detail and provenance.
  • Role separation reduces cognitive collisions: Browsing and writing don't fight for space; each agent focuses.
  • Section-wise writing creates local focus: Work on one chunk at a time with the exact notes loaded on demand.
  • Persistence enables iterative refinement: Checklists, todos, and logs carry learning across sessions. 🍞 Anchor: Instead of juggling all balls at once, you toss and catch one ball per section—fewer drops.

🍞 Hook: You know how a good binder has tabs for units, homework, and tests?

🥬 Building Blocks (small pieces)

  • Knowledge base
    • What: A tree of folders/files with distilled notes and raw sources, each note with citations.
    • How: 1) Search; 2) Read; 3) Distill into markdown notes; 4) Archive pages; 5) Link every claim to a source.
    • Why: Without it, the report loses details and traceability.
    • Example: “Allianz Solvency II 209% [citation: sources/allianz_2024_ar.md]”.
  • File-system workspace
    • What: The shared home for the KB, report, todos, checklists, and logs.
    • How: 1) Create index.md as a table of contents; 2) Track todos; 3) Save logs after each session; 4) Reopen next time.
    • Why: Without it, multi-day research falls apart.
    • Example: folders like knowledge_base/, sources/, and report.md living together.
  • Dual-agent framework
    • What: Two specialists—Context Builder (librarian) and Report Writer (author).
    • How: 1) Librarian builds KB; 2) Author writes one section at a time; 3) Both use checklists.
    • Why: Without separation, the agent half-reads, half-writes, and does neither well.
    • Example: Builder finishes KB → Writer drafts “Methodology” using only KB.
  • Multi-session workflow
    • What: Work continues across multiple sessions with preserved state.
    • How: 1) Start session by inspecting workspace; 2) Update todos; 3) Save logs; 4) Repeat.
    • Why: Without sessions, you can’t scale thinking time.
    • Example: “Round 2: add UNECE sources; Round 3: validate citations.”
  • Iterative refinement
    • What: Improve by small fixes each round.
    • How: 1) Self-check vs. checklist; 2) Mark gaps; 3) Fetch more evidence; 4) Tighten writing.
    • Why: Without iteration, errors and gaps stay hidden.
    • Example: Add missing primary PDFs or better tables. 🍞 Anchor: Like polishing a science fair poster each evening until it shines.
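To make "every claim links to a saved source" concrete, here is a minimal Python sketch. The `[citation: ...]` marker syntax and the file paths are illustrative assumptions, not the paper's exact note schema.

```python
import re
from pathlib import Path

# A distilled note pairs a claim with a pointer to an archived raw source.
NOTE = "Allianz Solvency II ratio ~209% [citation: sources/allianz_2024_ar.md]"

def cited_sources(note: str) -> list[str]:
    """Extract citation paths so traceability can be checked automatically."""
    return re.findall(r"\[citation:\s*([^\]]+)\]", note)

def is_traceable(note: str, workspace: Path) -> bool:
    """A note is traceable only if every cited source file actually exists."""
    paths = cited_sources(note)
    return bool(paths) and all((workspace / p).exists() for p in paths)

workspace = Path("research_workspace")
print(cited_sources(NOTE))             # ['sources/allianz_2024_ar.md']
print(is_traceable(NOTE, workspace))   # False until the raw page is archived
```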

03Methodology

At a high level: Input → Context Builder rounds (search, read, distill, archive) → Shared workspace (KB + control files) → Report Writer rounds (outline, write one section, review) → Final report.

🍞 Hook: Think of a well-run classroom project with roles, a shared folder, and a daily log.

🥬 Step 0: Set up the shared workspace

  • What happens: Create a file-system folder with subfolders (knowledge_base/, sources/), an index.md (table of contents), and control files (todos.md, checklist.md, logs.md).
  • Why this exists: Without a home base, work scatters and can’t survive between sessions.
  • Example with data: index.md lists the research topic, planned KB hierarchy (e.g., company_profiles/, comparative_analysis/), and initial todos like “Fetch top-10 insurer rankings; create ‘metrics_definitions’ note.” 🍞 Anchor: It’s your group’s Google Drive for the whole semester.
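A minimal Python sketch of this setup step, assuming the folder and file names mentioned above; the real system may lay out its workspace differently.

```python
from pathlib import Path

def init_workspace(root: str, topic: str) -> Path:
    """Create the shared home base: KB folders plus control files."""
    ws = Path(root)
    for sub in ("knowledge_base", "sources"):
        (ws / sub).mkdir(parents=True, exist_ok=True)
    # index.md doubles as a table of contents for the growing KB.
    (ws / "index.md").write_text(f"# Research topic\n{topic}\n\n## KB hierarchy\n(tbd)\n")
    (ws / "todos.md").write_text("- [ ] Fetch top-10 insurer rankings\n"
                                 "- [ ] Create metrics_definitions note\n")
    (ws / "checklist.md").write_text("- tasks complete\n- no placeholders\n- full traceability\n")
    (ws / "logs.md").write_text("# Session logs\n")
    return ws

ws = init_workspace("research_workspace", "Compare the world's top 10 insurers")
print(sorted(p.name for p in ws.iterdir()))
```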

🍞 Hook: You know how you first search for books and only then take notes?

🥬 Step 1: Context Builder round (search-before-read)

  • What happens: The agent uses search_web to find candidate URLs and read_webpage to fetch pages. It updates index.md and creates/edits KB notes. Raw pages go into sources/; distilled notes go into knowledge_base/ with citations.
  • Why this exists: Early searching casts a wide net; reading too soon can lock into narrow sources.
  • Example data: Query “global top insurers by assets” → add sources: Allianz AR 2024, AXA Solvency report; note: “Allianz Solvency II ratio ~209% [sources/allianz_2024_ar.md].” 🍞 Anchor: Like scanning the library catalog first, then reading the most relevant books.
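A hedged sketch of one builder round: `search_web` and `read_webpage` are stubs standing in for the real browsing tools, and the specific file names are illustrative.

```python
from pathlib import Path

def search_web(query: str) -> list[dict]:
    # Stub: the real tool returns candidate URLs with titles and snippets.
    return [{"title": "Allianz Annual Report 2024",
             "url": "https://example.com/allianz_ar_2024"}]

def read_webpage(url: str) -> str:
    # Stub: the real tool fetches and cleans the page text.
    return "(full page text would appear here)"

def builder_round(ws: Path, query: str) -> None:
    """Search first, then read; archive raw pages and write distilled, cited notes."""
    for hit in search_web(query):                        # cast a wide net first
        raw = read_webpage(hit["url"])                   # then read the promising hits
        src = ws / "sources" / "allianz_2024_ar.md"      # raw page is archived
        src.parent.mkdir(parents=True, exist_ok=True)
        src.write_text(f"# {hit['title']}\n{hit['url']}\n\n{raw}\n")
        note = ws / "knowledge_base" / "company_profiles" / "allianz" / "solvency.md"
        note.parent.mkdir(parents=True, exist_ok=True)   # distilled note goes in the KB
        note.write_text("Solvency II ratio ~209% [citation: sources/allianz_2024_ar.md]\n")

builder_round(Path("research_workspace"), "global top insurers by assets")
```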

🍞 Hook: Imagine a table of contents that grows as you learn.

🥬 Step 2: Dynamic KB building

  • What happens: The KB tree is expanded as understanding improves. Folders and filenames are descriptive (e.g., company_profiles/allianz/credit_ratings.md). Every claim includes a citation to a saved source.
  • Why this exists: A static plan misses surprises; a living structure fits real evidence.
  • Example data: comparative_analysis/dividends_payouts_comparison.md includes a table across companies with yield history and payout ratios, each row citing the right source. 🍞 Anchor: Your binder gets new tabs once you realize a better way to organize.
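One way to picture the "living" KB tree, sketched in Python under the assumption that notes are plain markdown files and index.md is the table of contents; `add_kb_note` is a hypothetical helper, not the paper's API.

```python
from pathlib import Path

def add_kb_note(ws: Path, rel_path: str, lines: list[str]) -> None:
    """Grow the KB tree on demand and keep index.md in sync with each new note."""
    note = ws / "knowledge_base" / rel_path
    note.parent.mkdir(parents=True, exist_ok=True)        # tree expands as understanding grows
    note.write_text("\n".join(lines) + "\n")
    index = ws / "index.md"
    existing = index.read_text() if index.exists() else "# Table of contents\n"
    if rel_path not in existing:                          # register the note in the ToC
        index.write_text(existing + f"- knowledge_base/{rel_path}\n")

add_kb_note(
    Path("research_workspace"),
    "comparative_analysis/dividends_payouts_comparison.md",
    ["| Company | Dividend history | Payout ratio | Source |",
     "|---------|------------------|--------------|--------|",
     "| Allianz | see source       | see source   | [citation: sources/allianz_2024_ar.md] |"],
)
```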

🍞 Hook: You know how checklists keep airplanes safe?

🥬 Step 3: End-of-round self-check and logs

  • What happens: The agent reviews against the checklist: tasks complete, hierarchy matches, no placeholders, full traceability, exhaustive coverage, information density. It updates todos and writes a log entry describing gaps and next actions.
  • Why this exists: Without self-checks, gaps and errors pile up.
  • Example: “Round 2: Missing UNECE primary PDFs—add via alternate mirrors; expand reputation-and-ratings comparison.” 🍞 Anchor: The log is the captain’s logbook ensuring steady progress.
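A simplified Python sketch of such a self-check, assuming notes are markdown files and gaps are recorded in logs.md; the actual checklist is richer than the two rules shown here.

```python
from datetime import date
from pathlib import Path

def self_check(ws: Path, round_id: int) -> list[str]:
    """End-of-round audit: flag placeholder text and uncited claims, then log the gaps."""
    ws.mkdir(parents=True, exist_ok=True)        # lets the sketch run standalone
    gaps: list[str] = []
    for note in (ws / "knowledge_base").rglob("*.md"):
        text = note.read_text()
        if "TODO" in text or "placeholder" in text.lower():
            gaps.append(f"{note.name}: contains placeholder text")
        if "[citation:" not in text:
            gaps.append(f"{note.name}: no citation found")
    summary = "\n".join(f"- GAP: {g}" for g in gaps) or "- all checks passed"
    with (ws / "logs.md").open("a") as log:      # the captain's logbook
        log.write(f"\n## Round {round_id} ({date.today()})\n{summary}\n")
    return gaps

print(self_check(Path("research_workspace"), round_id=2))
```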

🍞 Hook: When the library is ready, the author takes the stage.

🥬 Step 4: Handoff to Report Writer and outline creation

  • What happens: Web tools are removed, and the Report Writer treats the KB as its only fact source. It drafts an outline (report_outline.md) that doubles as a todo list with [PENDING]/[IN-PROGRESS]/[COMPLETE] statuses.
  • Why this exists: Prevents drifting back to the open web; forces writing to stay grounded in saved evidence.
  • Example: Outline includes sections like “Methodology,” “Dimension-wise Comparisons,” “Shortlist & Rationale.” 🍞 Anchor: The playwright now writes scenes using the research binder, not random internet tabs.
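A small sketch of how an outline can double as a todo list, assuming the [PENDING]/[IN-PROGRESS]/[COMPLETE] markers mentioned above live in report_outline.md; the helper names are hypothetical.

```python
from pathlib import Path

OUTLINE = """\
# Report outline (doubles as a todo list)
- [PENDING] Methodology
- [PENDING] Dimension-wise Comparisons
- [PENDING] Shortlist & Rationale
"""

def next_pending_section(ws: Path) -> str | None:
    """The writer picks exactly one PENDING section per session."""
    outline = ws / "report_outline.md"
    if not outline.exists():
        outline.write_text(OUTLINE)            # first session creates the outline
    for line in outline.read_text().splitlines():
        if line.startswith("- [PENDING]"):
            return line.removeprefix("- [PENDING]").strip()
    return None

def mark_section(ws: Path, section: str, status: str) -> None:
    """Flip a section's marker: PENDING -> IN-PROGRESS -> COMPLETE."""
    outline = ws / "report_outline.md"
    lines = [f"- [{status}] {section}" if line.endswith(section) else line
             for line in outline.read_text().splitlines()]
    outline.write_text("\n".join(lines) + "\n")

ws = Path("research_workspace")
ws.mkdir(exist_ok=True)
section = next_pending_section(ws)             # -> "Methodology"
mark_section(ws, section, "IN-PROGRESS")       # one section at a time
```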

🍞 Hook: Have you ever written a big essay one paragraph at a time?

🥬 Step 5: Section-wise writing with reviews

  • What happens: The Report Writer picks exactly one section per session, reads relevant KB files, drafts the section, and runs a section-level checklist (content coverage, clarity, tables, citations). Only after passing does the section become COMPLETE.
  • Why this exists: One-shot writing becomes list-like and sloppy; section-wise writing enables local focus and quality gates.
  • Example data: In “Financing & Solvency,” the writer cites specific ratios (e.g., Allianz 209%, AXA 216%) from KB notes and aligns each claim to its source file. 🍞 Anchor: Finish one chapter perfectly before starting the next.
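A hedged sketch of the section loop with its quality gate; the drafting step is a stand-in for the model call, and the checklist below is deliberately simpler than the real one.

```python
from pathlib import Path

def passes_section_checklist(draft: str) -> bool:
    """Local quality gate: substantial, cited, and free of leftover placeholders."""
    return len(draft.split()) > 40 and "[citation:" in draft and "TODO" not in draft

def write_section(ws: Path, section: str) -> bool:
    """Draft one section from KB notes only; mark it done only if the gate passes."""
    kb_notes = [p.read_text() for p in (ws / "knowledge_base").rglob("*.md")]
    # Stand-in for the model call: the real writer drafts prose grounded in
    # exactly these notes, never the open web.
    draft = f"## {section}\n" + "\n".join(kb_notes) + "\nAnalysis text goes here. " * 10
    if not passes_section_checklist(draft):
        return False                           # section stays IN-PROGRESS; gaps get logged
    out = ws / f"report_{section.lower().replace(' ', '_')}.md"
    out.write_text(draft)
    return True

print(write_section(Path("research_workspace"), "Financing and Solvency"))
```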

🍞 Hook: You know how detectives think-then-act, then look at what they found?

🥬 Step 6: ReAct-style loop under the hood

  • What it is: ReAct is a pattern where the AI alternates between thoughts (T), actions (A) with tools, and observations (O), saving artifacts each step.
  • How it works: 1) Think about the next move; 2) Call a tool (search, read, ls, grep, edit file); 3) Observe the result; 4) Update files and plan; 5) Repeat.
  • Why it matters: Without a disciplined loop, the agent meanders or forgets what changed. 🍞 Anchor: It’s like “plan → do → check → act” in mini-cycles.
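A toy version of that loop in Python, with a hypothetical two-tool registry and a fixed "thought" in place of the model's reasoning; it only illustrates the Thought → Action → Observation rhythm and the persisted trace.

```python
from pathlib import Path

def react_loop(ws: Path, max_steps: int = 3) -> None:
    """Alternate Thought -> Action -> Observation, saving a trace artifact each step."""
    tools = {                                    # hypothetical tool registry
        "ls": lambda _: "\n".join(p.name for p in ws.iterdir()),
        "read": lambda arg: (ws / arg).read_text()[:200],
    }
    trace = ws / "logs.md"
    for step in range(max_steps):
        thought = f"step {step}: inspect the workspace before deciding what to fetch next"
        action, arg = "ls", ""                   # a real agent would choose tool + argument here
        observation = tools[action](arg)         # the tool call
        with trace.open("a") as log:             # artifacts persist across sessions
            log.write(f"\nT: {thought}\nA: {action}({arg})\nO: {observation[:80]}\n")

ws = Path("research_workspace")
ws.mkdir(exist_ok=True)
react_loop(ws)
```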

🍞 Hook: What’s the secret ingredient that makes the cookies chewy?

🥬 The Secret Sauce

  • Persistent workspace: Keeps raw sources, distilled notes, and control files together, so progress compounds.
  • Role separation: Context Builder maximizes evidence quality; Report Writer maximizes narrative clarity.
  • Section-wise gates: Local checklists catch issues early; global checklist ensures whole-report integrity.
  • Test-time scaling: Simply allow more Context Builder rounds to improve coverage and depth—no retraining needed. 🍞 Anchor: A well-labeled pantry, a great prep cook, and tasting each course before serving make a far better meal.

Example mini-walkthrough (insurance query):

  • Input: “Compare the world’s top 10 insurers across financing, reputation, 5y growth, dividends, and China potential; pick 2–3 likely future asset leaders.”
  • Builder rounds: Search rankings; archive annual reports; distill solvency ratios, ratings, dividend histories; build comparative tables, all with citations.
  • Writer: Outline → “Methodology” section → “Dimension-wise Comparisons” → “Shortlist & Rationale,” citing KB modules. Final report passes section and report checklists.

04Experiments & Results

🍞 Hook: Picture a science fair where judges grade projects on depth, accuracy, and writing quality.

🥬 The Test: What they measured and why

  • DeepResearch Bench (RACE + FACT):
    • RACE focuses on report quality—comprehensiveness, insight, instruction-following, readability—scored relative to a strong reference.
    • FACT checks citation reliability—how many claims are supported (Effective Citations) and how precise the citations are (Citation Accuracy).
  • DeepConsult: A head-to-head judge compares two reports for instruction-following, comprehensiveness, completeness, and writing quality.
  • Why it matters: These tests simulate real research demands: cover the space, think deeply, and show your receipts (citations). 🍞 Anchor: It’s like being graded on content, thinking, neatness, and whether you showed your work.

🍞 Hook: You know how races are more exciting with strong competitors?

🥬 The Competition: Baselines

  • Proprietary: OpenAI Deep Research, Claude-Research, Gemini-2.5-Pro DeepResearch.
  • Open-source/recent systems: LangChain-Open-Deep-Research, WebWeaver, RhinoInsight, EnterpriseDeepResearch.
  • Why compare: To show the framework helps across different base models and against strong systems. 🍞 Anchor: Running against the track team’s best shows if your new shoes really help.

🍞 Hook: Report cards make numbers meaningful.

🥬 The Scoreboard: Results with context

  • DeepResearch Bench (overall quality RACE):
    • FS-Researcher (Claude-Sonnet-4.5) scored about 53.94 RACE, topping the best baseline (~50.92) by +3.02—like moving from a strong B to an A-.
    • Big gains in Comprehensiveness (+3.74) and Insight (+4.40) show broader coverage and deeper thinking.
    • Citation Accuracy was competitive (only behind Gemini-2.5-Pro DeepResearch).
  • Same backbone comparison (framework advantage): With GPT-5, FS-Researcher beat LangChain Open Deep Research by +2.16 RACE, proving the file-system + dual-agent design adds value beyond model choice.
  • DeepConsult (pairwise wins): FS-Researcher (GPT-5) won ~73% of matchups; with Claude-Sonnet-4.5 (on a subset) it won ~80% and had the highest average score, especially on instruction following, comprehensiveness, and completeness. 🍞 Anchor: It’s like getting more As across subjects, not just one.

🍞 Hook: If you study a bit longer, do your grades improve?

🥬 Surprising and insightful findings

  • Test-time scaling works: More Context Builder rounds (3 → 5 → 10) steadily improved most quality scores; reports got longer with more citations—proof that extra compute turns into better evidence and synthesis.
  • Diminishing returns: The biggest KB growth was from 3 to 5 rounds; 5 to 10 added less, suggesting the knowledge base becomes more complete over time.
  • Readability trade-off: Readability peaked around 5 rounds and slightly dipped at 10; denser facts and more domain terms can make text heavier.
  • Model behavior nuance: GPT-5 sometimes stacked multiple citations at the end of paragraphs, reducing citation-to-claim alignment versus Claude.
  • Tool-usage patterns: Clear search-before-read behavior; early ls (folder inspection); late edits (replace/delete) cluster near the end for self-check fixes. 🍞 Anchor: Studying more helps, but after a point you get smaller gains—and if you cram too many facts, your essay can feel dense.

Bottom line: Across two tough benchmarks and several ablations, FS-Researcher’s persistent workspace and two-stage design translate extra test-time compute into higher-quality, better-grounded reports.

05Discussion & Limitations

🍞 Hook: Even a great backpack has limits—you still have to pack wisely and carry the weight.

🥬 Limitations

  • Needs capable base models: File operations, planning, and long-form writing benefit from stronger LLMs; smaller models may stop early or make file-edit mistakes.
  • Cost and latency: More Context Builder rounds mean more API calls and time.
  • Web content risks: Sources can be outdated, biased, or wrong; even with citations, errors can sneak in.
  • Security/privacy: Archiving web pages and notes can store sensitive or copyrighted material; workspace could be a target for prompt-injection attempts. 🍞 Anchor: It’s like needing a careful librarian and good locks for a valuable library.

🍞 Hook: Not every tool is right for every job.

🥬 When not to use

  • Short, closed-book tasks (e.g., “What is 7×8?”) don’t need a file-system workspace.
  • Highly sensitive data that can’t be stored on disk.
  • Environments without reliable web access.
  • When quick, one-shot answers are preferred over thorough, sourced reports. 🍞 Anchor: Don’t bring a moving truck when a backpack will do.

🍞 Hook: If we keep improving, what could get even better?

🥬 Open questions

  • Can we adapt the framework for smaller, cheaper models (e.g., stronger guardrails, more automation in file ops)?
  • Better citation alignment (sentence-level tagging) and de-duplication.
  • Smarter search expansion: auto-detect gaps and query reformulations.
  • Readability boosters: style passes that preserve depth but improve clarity.
  • Safety: robust defenses against malicious pages and prompt injections during tool use. 🍞 Anchor: It’s like adding better labels, clearer maps, and stronger doors to an already good library.

06Conclusion & Future Work

🍞 Hook: Think of turning a messy pile of research into a neat library and a well-written book.

🥬 3-Sentence Summary

  • FS-Researcher splits deep research into two roles—Context Builder and Report Writer—and saves everything in a file-system workspace that outlives any single session.
  • This persistent, citation-grounded knowledge base lets the system scale its thinking time at test time: more Context Builder rounds lead to better, deeper reports.
  • Across two benchmarks, the framework achieves state-of-the-art report quality and proves that external memory plus role separation beats context-bound, one-shot approaches.

Main Achievement

  • Showing that a file-system-based, dual-agent workflow enables practical test-time scaling for long-horizon research: evidence grows, citations stay precise, and writing improves section by section.

Future Directions

  • Make it friendlier to smaller models, strengthen citation alignment, add safety against malicious content, and build readability-enhancing passes.

Why Remember This

  • Because it’s a recipe for turning AI into a marathon researcher, not just a sprinter: save everything, split roles, iterate with checklists, and let extra effort at test time pay off in real report quality.

Practical Applications

  • Business and consulting reports: Build comparative market analyses with traceable sources and section-wise write-ups.
  • Journalism research: Gather multi-source evidence, organize quotes and data, and draft investigative pieces with citations.
  • Academic literature reviews: Archive PDFs, extract key results, and produce synthesis sections with source-linked claims.
  • Competitive intelligence: Track rival product specs, pricing, and news in a persistent KB and generate monthly briefs.
  • Policy analysis: Compile regulations, case law, and stakeholder views; write balanced, evidence-grounded recommendations.
  • Financial research: Compare solvency ratios, dividends, ratings, and growth across firms with reproducible tables.
  • Medical guideline scans: Catalog clinical studies and summaries with provenance before writing recommendations.
  • Legal memo preparation: Organize statutes, cases, and commentary in a KB; draft sections with precise citations.
  • Grant and RFP responses: Reuse a structured KB of requirements, capabilities, and evidence to craft strong proposals.
  • Enterprise knowledge curation: Create durable, navigable research spaces teams can refine over time.
#FS-Researcher #file-system agents #external memory #dual-agent framework #deep research #context window #test-time scaling #knowledge base #ReAct architecture #multi-session workflow #iterative refinement #citation accuracy #DeepResearch Bench #DeepConsult #web browsing tools
Version: 1