AgentCPM-Report: Interleaving Drafting and Deepening for Open-Ended Deep Research
Key Summary
- This paper teaches small, local AI models to write deep, insightful research reports by letting writing and planning work together instead of staying separate.
- The key idea, called WARP (Writing As Reasoning Policy), lets the AI switch between drafting with evidence and digging deeper when it spots gaps.
- Instead of freezing an outline up front, the outline stays dynamic and evolves as the AI writes and learns new things.
- A special training path (cold start → atomic skill RL → holistic pipeline RL) helps an 8B-parameter model learn long, multi-step decisions stably.
- Trajectory pruning shows the AI the best stopping point by trimming overlong teacher examples to the highest-scoring draft.
- Across three benchmarks, the system beats or matches bigger, closed-source systems, especially on Insight and Comprehensiveness.
- Because it runs locally, it protects private data and avoids sending documents to external servers.
- The method balances depth and cost by only deepening when it truly improves the report.
- Even without special training, the WARP policy helps larger models outperform the classic plan-then-write approach.
- This work shows that smart policies and training can matter more than just making models bigger.
Why This Research Matters
Being able to run deep research locally means you can keep sensitive documents—like medical notes or company files—private while still getting strong, insightful reports. A small model that writes as it thinks can match or beat bigger, closed systems, making powerful research tools more accessible. This also reduces dependence on remote services and lowers cost, since less data needs uploading and fewer cloud fees accrue. It improves efficiency, too: the agent only deepens where deepening truly helps, saving time while increasing depth. Most importantly, it proves that smart strategies can unlock big gains without just making models larger. That opens the door to safer, cheaper, and smarter AI research tools for schools, businesses, and individuals.
Detailed Explanation
01 Background & Problem Definition
You know how when you start a school report, you don’t know everything yet, and as you write you realize what you still need to look up? Writing helps you think. Computers that write long research reports have had trouble doing that kind of thinking while writing.
🍞 Top Bread (Hook): Imagine trying to bake a cake with every step fixed before you even open the fridge. Once you start mixing, you’ll discover you’re out of eggs and need to adjust—but a rigid plan won’t let you. 🥬 The Concept (Retrieval-then-Write): It’s a simple approach where the AI first fetches information and then writes it in order. How it works: (1) search; (2) read; (3) write what was found. Why it matters: It’s flexible but can drift and become messy over long reports. 🍞 Bottom Bread (Anchor): Like copying notes from several websites into paragraphs—fine for short answers, but it gets jumbled on a 10-page report.
🍞 Top Bread (Hook): Think of making a Lego instruction booklet first and then building exactly that, step by step. 🥬 The Concept (Plan-then-Write): The AI creates a detailed outline first, then follows it strictly to write. How it works: (1) build a full outline; (2) lock it; (3) write each section by the plan. Why it matters: Structure is strong, but the plan assumes you already know what you’ll need. If you discover new insights mid-writing, you can’t easily update the plan. 🍞 Bottom Bread (Anchor): If your outline forgot a key chapter, the finished essay looks neat but shallow.
The world before: Many deep-research agents either (a) wrote as they retrieved, which grew inconsistent, or (b) froze an outline first, which missed new ideas discovered while writing. Both approaches often relied on very large, online models to build strong outlines or recover when things went off track. That created privacy problems (uploading sensitive data) and made on-device use hard.
The problem: Real research is full of surprises—gaps, contradictions, and new directions pop up when you try to explain things on paper. Systems that separate planning and writing hit an “insight ceiling”: they look organized but don’t dig deep.
Failed attempts: People tried adding better retrieval, longer contexts, and stricter outline rules. These helped with order but not with discovering and shaping new ideas mid-writing. Small models especially struggled to make great outlines up front, so systems leaned on giant closed models.
The gap: We needed a way for the plan to grow and improve while writing—like a thinking loop that checks the draft, finds what’s missing, looks up new evidence, and updates the outline, over and over.
🍞 Top Bread (Hook): You know how you sketch a rough map first, then redraw it cleaner as you explore the neighborhood? 🥬 The Concept (Dynamic Outline): A flexible outline that changes as the AI writes and learns. How it works: (1) start with high-level sections; (2) write a bit; (3) spot gaps; (4) expand the outline where needed; (5) repeat. Why it matters: Without a dynamic outline, new insights discovered during writing can’t reshape the structure, so depth stays limited. 🍞 Bottom Bread (Anchor): While writing “Cognitive Foundations,” you realize social-emotion topics are thin, so you split a new subsection and add evidence.
Real stakes: In school, at work, or in research, we need reports that are not just organized but also insightful. People also want to keep their data private—medical notes, internal company docs, or personal files shouldn’t leave the device. If small, local models can reach deep-research quality, more people can safely use them.
This paper answers that need by showing that a small, 8B-parameter model, guided by a smarter policy that treats writing as a form of reasoning, can achieve top-tier insight on tough benchmarks while running locally.
02 Core Idea
The “Aha!” in one sentence: Let writing and planning interleave, so the outline evolves as the AI writes—switching between drafting with evidence and deepening reasoning where the draft shows gaps.
🍞 Top Bread (Hook): Imagine a detective building a case board. As new clues appear, they rearrange threads and add new pins—planning and investigating happen together. 🥬 The Concept (WARP: Writing As Reasoning Policy): A policy that treats writing as thinking, alternating between writing supported by evidence and reasoning that updates the outline. How it works: (1) start with a light outline; (2) write a section grounded in retrieved facts; (3) reread the draft to find gaps; (4) expand the outline where needed; (5) repeat until the argument feels complete. Why it matters: Without this loop, insights remain hidden and reports stay shallow. 🍞 Bottom Bread (Anchor): While drafting a section on “AI and empathy,” the model spots a missing contrast with “anthropomorphism,” updates the outline to add it, searches for supporting studies, and writes a better, deeper subsection.
Three analogies for the same idea:
- Chef analogy: Taste-as-you-cook. You don’t finalize the recipe before cooking; you taste, adjust seasoning, add a new step if needed, and keep iterating.
- Drawing analogy: Sketch-then-refine. Start with loose lines, then add shading where the picture feels flat, and redraw parts that don’t fit.
- Hiking analogy: Trail-and-map co-evolve. You begin with a rough map, explore, then redraw the map to include a better path you actually found.
Before vs After:
- Before: Outline is frozen; writing just fills boxes; missed connections stay missed.
- After: Outline is alive; writing reveals gaps; the plan adapts; depth and coverage rise together.
Why it works (intuition):
- Drafts concentrate knowledge. When you try to explain something, weak spots become obvious—like squeaky steps on a staircase.
- Targeted deepening increases return-on-effort. Instead of expanding everything, the agent pinpoints the thin parts and adds just the right subsections and evidence.
- Feedback loop stabilizes quality. Each new pass tightens logic, increases faithfulness with citations, and balances breadth with depth.
Building blocks (in the paper’s order):
- 🍞/🥬/🍞 Evidence-Based Drafting (see sandwich below)
- 🍞/🥬/🍞 Reasoning-Driven Deepening (see sandwich below)
- A small, clear action set (initialize, search, write, expand, terminate) keeps the loop simple but expressive.
- Multi-Stage Agentic Training and trajectory pruning teach the agent when to deepen and when to stop.
🍞 Top Bread (Hook): You know how teachers ask, “Where did you find that?” when you make a claim. 🥬 The Concept (Evidence-Based Drafting): Write sections grounded in retrieved sources, with citations. How it works: (1) form a search query from the current draft and section goal; (2) retrieve documents; (3) write a paragraph that synthesizes ideas and cites sources. Why it matters: Without evidence-based writing, claims can drift or be wrong. 🍞 Bottom Bread (Anchor): Writing “Cognitive Foundations,” the agent retrieves “Computers Are Social Actors” and weaves it into a supported paragraph.
🍞 Top Bread (Hook): When you read what you wrote out loud, you often notice what’s missing. 🥬 The Concept (Reasoning-Driven Deepening): After drafting, the AI inspects the text for thin logic or missing angles and expands the outline right there. How it works: (1) analyze semantic density and coherence; (2) pick a section to deepen; (3) add targeted subsections; (4) return to drafting. Why it matters: Without deepening, reports stay surface-level and miss key contrasts or mechanisms. 🍞 Bottom Bread (Anchor): Seeing a gap in “Socioaffective Alignment,” the agent adds “Misalignment Risks,” retrieves studies, and strengthens the argument.
By combining these pieces under one policy, WARP turns writing into a guided thinking loop that raises insight without needing a giant model.
03 Methodology
At a high level: User Query → Initialize a light outline → Evidence-Based Drafting (search → write) → Reasoning-Driven Deepening (analyze → expand) → Repeat until Terminate → Final report.
State and actions like a recipe:
- State = (User query, Dynamic outline, Current draft, Retrieved context).
- Actions = {INITIALIZE, SEARCH, WRITE, EXPAND, TERMINATE}.
🍞 Top Bread (Hook): Imagine a tiny set of powerful buttons on a research robot’s dashboard. 🥬 The Concept (Action Space): Five actions that cover planning, retrieving, writing, deepening, and stopping. How it works: (1) INITIALIZE makes a high-level outline; (2) SEARCH fetches sources; (3) WRITE produces cited paragraphs; (4) EXPAND adds subsections where needed; (5) TERMINATE stops when quality saturates. Why it matters: Without clear actions, the agent can’t control the research journey or learn what to do next. 🍞 Bottom Bread (Anchor): For “Theoretical Foundations,” the agent initializes two subsections, searches for key papers, writes a cited paragraph, expands to add “Cognitive Offloading,” then terminates when the chain of ideas feels complete.
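The state tuple and five-action dashboard above can be sketched in code. This is a minimal illustration, not the paper's implementation; the type names (`Action`, `AgentState`) and field layout are assumptions for clarity.

```python
from dataclasses import dataclass, field
from enum import Enum, auto

class Action(Enum):
    """The five actions in the agent's action space."""
    INITIALIZE = auto()  # build a high-level outline
    SEARCH = auto()      # fetch sources
    WRITE = auto()       # produce cited paragraphs
    EXPAND = auto()      # add subsections where needed
    TERMINATE = auto()   # stop when quality saturates

@dataclass
class AgentState:
    """State = (user query, dynamic outline, current draft, retrieved context)."""
    query: str
    outline: list = field(default_factory=list)  # section titles, possibly nested
    draft: dict = field(default_factory=dict)    # section title -> written text
    context: list = field(default_factory=list)  # retrieved passages

# Example: the state right after INITIALIZE on a fresh query.
state = AgentState(
    query="How might AI interaction change human relationships?",
    outline=["Cognitive Foundations", "Socioaffective Alignment"],
)
```

Keeping the action set this small is what makes the loop both learnable and expressive: every step of the research journey is one of these five moves applied to one shared state.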
Step-by-step with mini-examples:
- INITIALIZE (coarse-to-fine): Start with a sparse Level-1 outline—titles plus brief intents—based on the query and quick background retrieval. Why needed: A too-detailed outline may be wrong; a light one avoids early mistakes and anchors the scope. Example: “Theoretical Foundations” with mini-plans like “cover cognitive frameworks and alignment.”
- Evidence-Based Drafting:
- Form a query using the user need, the section intent, and the current draft.
- Retrieve new context and write a paragraph that synthesizes sources with citations.
- Why needed: Ensures each claim is grounded and fits the flow; otherwise sections drift or repeat.
- Example: Query “AI as social actors Nass Reeves summary,” retrieve, then write a supported paragraph with citations.
- Reasoning-Driven Deepening:
- Read the updated draft to measure clarity and depth; detect missing contrasts or weak chains.
- Choose exactly one spot to deepen; add subsections to decompose the idea.
- Why needed: Prevents shallow coverage and local blind spots.
- Example: Add “Long-term trust calibration” under “Socioaffective Alignment,” then return to drafting.
- TERMINATE:
- Stop when the draft is coherent, sufficiently deep, and matches the query’s complexity.
- Why needed: Avoids over-expanding and wasting compute; without it, the agent can loop forever.
- Example: After nine deepening rounds, new sections add little; the agent stops.
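The whole step-by-step cycle above can be summarized as one control loop. This is a heavily stubbed sketch: in the real system a trained LLM policy chooses actions and writes text, whereas here `retrieve`, `write`, and `find_gap` are toy callables introduced only to show the interleaving.

```python
def run_warp(query, retrieve, write, find_gap, max_deepen=9):
    """Alternate evidence-based drafting with reasoning-driven deepening."""
    outline = ["Overview"]  # INITIALIZE (stub): sparse Level-1 outline
    draft = {}
    for _ in range(max_deepen):
        # Evidence-Based Drafting: SEARCH then WRITE each unwritten section.
        for section in [s for s in outline if s not in draft]:
            docs = retrieve(f"{query}: {section}")
            draft[section] = write(section, docs)
        # Reasoning-Driven Deepening: reread the draft, pick one thin spot.
        gap = find_gap(draft)
        if gap is None:          # TERMINATE: quality has saturated
            break
        outline.append(gap)      # EXPAND: add exactly one targeted subsection
    return outline, draft

# Toy stand-ins for the retriever, writer, and gap detector:
retrieve = lambda q: [f"doc about {q}"]
write = lambda section, docs: f"Cited paragraph on {section}. [1]"
gaps = iter(["Misalignment Risks"])          # one gap found, then saturation
find_gap = lambda draft: next(gaps, None)

outline, draft = run_warp("AI and human relationships", retrieve, write, find_gap)
```

Note how termination falls out naturally: the loop stops as soon as rereading the draft surfaces no further gap, rather than after a fixed number of sections.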
Training the small model to do this well:
🍞 Top Bread (Hook): Learning to play a sport starts with drills, then scrimmages, then full games. 🥬 The Concept (Multi-Stage Agentic Training): A staged path that first teaches basics, then skills, then whole-game strategy. How it works: (1) Cold Start (SFT) to learn formats and follow instructions; (2) Atomic Skill RL to reward good planning, searching, writing, and stopping; (3) Holistic Pipeline RL to optimize the full report’s quality (comprehensiveness, insight, etc.). Why it matters: Jumping straight to the full game confuses small models; stepwise training stabilizes learning. 🍞 Bottom Bread (Anchor): The agent first learns to write a good paragraph, then to choose precise keywords, then to decide when to deepen and when to stop.
🍞 Top Bread (Hook): Gardeners prune branches so the tree grows in the best shape, not just the biggest one. 🥬 The Concept (Trajectory Pruning): From an overlong teacher run, keep only the path up to the best-quality draft and relabel that point as “stop.” How it works: (1) force the teacher to over-expand; (2) score each intermediate draft; (3) cut at the highest score; (4) use that as supervision for the student. Why it matters: Teachers often stop too early or too late; pruning teaches the student when the report is actually “done.” 🍞 Bottom Bread (Anchor): If step 8 beats step 12 in quality, keep up to step 8 and teach the agent to terminate there.
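The pruning recipe (over-expand, score each intermediate draft, cut at the peak, relabel as stop) is simple enough to sketch directly. The function below is an illustration; the real scorer is an LLM judge, stubbed here with `len` purely so the example runs.

```python
def prune_trajectory(trajectory, score):
    """Keep an over-expanded teacher run only up to its best-scoring draft,
    and relabel that point as the TERMINATE step.

    trajectory: list of (action, draft) pairs from a forced over-expansion run.
    score: callable scoring a draft (an LLM judge in practice; stubbed here).
    """
    # Score every intermediate draft along the run.
    scores = [score(draft) for _, draft in trajectory]
    best = max(range(len(scores)), key=scores.__getitem__)
    # Cut at the peak and teach the student to stop exactly there.
    pruned = trajectory[: best + 1]
    pruned.append(("TERMINATE", trajectory[best][1]))
    return pruned

# Teacher run whose quality peaks at step 1, then degrades with over-expansion.
teacher_run = [("WRITE", "a"), ("EXPAND", "abc"), ("EXPAND", "ab")]
pruned = prune_trajectory(teacher_run, score=len)  # len() as a stand-in judge
```

The key design choice is that the supervision signal comes from the *best* intermediate draft, not the final one, so the student never imitates the teacher's over-long tail.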
Example with actual flow:
- Input: “Write a paper on how AI interaction may change human relationships.”
- Initialize: Sections for “Cognitive Foundations” and “Socioaffective Alignment.”
- Draft: Search and write “Computers Are Social Actors” with citations.
- Deepen: Split “Cognitive Foundations” into “Theory of Mind in HCI” and “Cognitive Offloading.”
- Draft more: Retrieve relevant studies and write those sections.
- Repeat: Keep deepening targeted areas until the argument feels balanced, then stop.
The secret sauce:
- Draft-aware planning: The outline listens to the text it guides.
- Targeted deepening: Expand only where the return on information is high.
- Curriculum RL + pruning: Teaches a small model big-model decisions without copying big-model mistakes.
- Unified policy: Same brain decides when to write, when to plan, and when to stop, so the whole report stays coherent.
04 Experiments & Results
The test: Can a small, local 8B model, guided by WARP and trained with the staged curriculum, produce reports that are not only correct but also insightful and comprehensive? The paper measures Comprehensiveness (breadth and coverage), Insight (depth and synthesis), Instruction Following, and Readability across tough benchmarks.
The competition: The system is compared with proprietary deep-research tools (OpenAI, Gemini, Claude, Doubao), prompt frameworks (WebWeaver, Enterprise DR, RhinoInsight), and trained open models (WebShaper, WebThinker, DR Tulu). Some baselines use much larger or closed models.
Scoreboard with context:
- DeepResearch Bench (100 PhD-level tasks): AgentCPM-Report (Pipeline RL) gets Overall 50.11, Comprehensiveness 50.54, and Insight 52.64. That’s like earning an A when Gemini-2.5-Pro-deepresearch gets just shy of that (Overall 49.71, Insight 49.45). The standout is Insight: +3.19 points over Gemini-2.5-Pro-deepresearch.
- DeepResearch Gym (100 info-seeking tasks): State-of-the-art average 98.48 with perfect or near-perfect scores on depth, breadth, and insightfulness—like getting straight A+ in a six-category report card.
- DeepConsult (business/finance): Average 6.60 with a win rate of 57.60% against strong baselines. That’s winning more than half the head-to-head matchups.
Why these numbers matter: “52.64 Insight” is not just a percentage; it means the system is producing connections, comparisons, and diagnoses that judges find more thoughtful than many competitors, even the large proprietary ones.
Surprising findings:
- WARP helps even without special training: With a larger model (Qwen3-235B), WARP beats plan-then-write across all metrics (+1.19 Insight), proving the policy itself adds value, not just the training.
- RL shifts behavior toward depth: After RL, the agent doubles its Expand actions and creates far more fine-grained Level-3 subsections. The structure literally gets richer where it matters.
- Right amount of deepening: Forcing extra expansions shows a curve that rises then plateaus around nine deepening steps. The trained agent’s natural stopping behavior hovers around the sweet spot, indicating it learned a good “when to stop” policy.
- Trajectory pruning helps even SFT: Using pruned trajectories in SFT boosts all metrics versus raw, unpruned ones—evidence that smart data curation teaches better termination.
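The rise-then-plateau finding suggests a simple way to think about the learned stopping behavior: keep deepening while marginal quality gains are large, stop once they flatten. The heuristic below is illustrative only (the trained agent learns this implicitly); the threshold `eps` and `window` are invented parameters.

```python
def should_stop(quality_history, eps=0.5, window=2):
    """Illustrative saturation check: stop deepening once the average quality
    gain over the last `window` expansions drops below `eps` points."""
    if len(quality_history) <= window:
        return False
    recent_gain = quality_history[-1] - quality_history[-1 - window]
    return recent_gain / window < eps

# A rise-then-plateau curve like the one the paper reports across
# deepening steps (made-up numbers for illustration):
curve = [40.0, 44.0, 47.0, 49.0, 50.0, 50.2, 50.3]
stop_at = next(i for i in range(len(curve)) if should_stop(curve[:i + 1]))
```

On this toy curve the rule fires near the end of the plateau, mirroring the observation that the trained agent's natural stopping point hovers around the quality sweet spot.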
Big picture: On average, this small, on-device agent reaches or outperforms closed giants, especially in insight and coverage. It’s like a well-coached small team beating a taller team by playing smarter and adjusting plays mid-game.
05 Discussion & Limitations
Limitations (specific and honest):
- Presentation/layout: Making beautiful tables and figures is different from writing paragraphs. The 8B model’s presentation can lag behind very large models. The paper suggests splitting rendering into a separate agent.
- Data coverage and freshness: The local knowledge base (arXiv abstracts + web summaries) is stable and private but may miss the newest or niche sources. No images or videos yet.
- Evaluation dependency: Some training/evaluation signals use LLM-as-judge, which can carry biases, although multiple metrics and references are used to reduce this.
- Compute budget: Although the model is small, iterative deepening costs time; the system caps deepening steps and hierarchy levels to keep it efficient.
Required resources:
- A capable local LLM (MiniCPM4.1-8B) with sufficient VRAM; vector DB (e.g., FAISS) for retrieval; a prepared local corpus (~2.86M docs); and the UltraRAG pipeline.
- For training: multiple GPUs (the authors use 8×A100) and staged RL infrastructure; for inference: a strong single GPU or edge device with optimizations.
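The retrieval component mentioned above (a vector DB such as FAISS over a local corpus) boils down to nearest-neighbor search on embeddings. Here is a NumPy stand-in for that step, assuming embeddings already exist; a real deployment would swap in a FAISS index and a sentence encoder, and the 2-D toy vectors are purely illustrative.

```python
import numpy as np

def build_index(doc_vectors):
    """Normalize document embeddings so a dot product equals cosine
    similarity (a NumPy stand-in for an inner-product vector index)."""
    vecs = np.asarray(doc_vectors, dtype=np.float32)
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def search(index, query_vector, k=3):
    """Return indices of the k documents most similar to the query."""
    q = np.asarray(query_vector, dtype=np.float32)
    q = q / np.linalg.norm(q)
    sims = index @ q                 # cosine similarity per document
    return np.argsort(-sims)[:k]     # best-first

# Toy corpus of 2-D "embeddings"; doc 0 points along the query direction.
index = build_index([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
top = search(index, [1.0, 0.1], k=2)
```

Because the whole corpus stays on-device, every SEARCH action in the agent's loop resolves against this local index rather than an external service, which is what makes the privacy guarantee hold.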
When not to use:
- Purely creative fiction (no evidence required) or extremely short answers where dynamic deepening overhead isn’t worth it.
- Tasks needing up-to-the-minute breaking news or heavy multimodal reasoning (images/videos) unless the local corpus is updated and multimodal tools are added.
- Reports where polished visual layout (tables/figures) is the main evaluation criterion without a dedicated rendering agent.
Open questions:
- Better stopping criteria: Can we predict saturation more directly from semantic signals without external judges?
- Multimodal deepening: How to extend the loop to images, charts, and videos with faithfulness and citation discipline?
- Trust and safety: How to measure and bound hallucination at scale in dynamic, evolving drafts?
- Teamwork policies: Can multiple small agents specialize (planner, writer, renderer) and coordinate under a shared WARP-like policy?
- Continual learning: How to update the local knowledge base and adapt the policy over time without forgetting past skills?
06 Conclusion & Future Work
Three-sentence summary: This paper introduces WARP, a policy that treats writing as thinking so the outline evolves during drafting. Combined with a multi-stage training pipeline and trajectory pruning, even an 8B local model learns when to deepen and when to stop, producing more insightful, comprehensive reports. Experiments show the system matching or beating larger closed systems, all while running privately on-device.
Main achievement: Proving that policy design—interleaving evidence-based drafting with reasoning-driven deepening—can lift insight and coverage enough for a small, local model to rival big proprietary tools.
Future directions: Add a separate rendering agent for tables/figures, expand the local corpus with multimodal and personalized data, improve automatic stopping signals, and explore multi-agent cooperation under WARP.
Why remember this: It shows that smarter thinking loops can matter more than bigger models, opening the door to private, powerful deep research that many people can run locally.
Practical Applications
- On-device research assistant for students writing literature reviews with private notes.
- Enterprise internal analysis reports (market scans, risk memos) without sending data outside the firewall.
- Healthcare teams drafting evidence-backed summaries from local guidelines and studies.
- Legal teams assembling case overviews from in-house document repositories with proper citations.
- Product teams generating competitive analyses from local knowledge bases and past project docs.
- Researchers drafting surveys from a curated local arXiv corpus with targeted deepening.
- Consultants producing client-ready briefs while keeping proprietary data on secure devices.
- Journalists assembling backgrounders from offline archives with traceable sources.
- Educators creating course notes that evolve as new references are added to a local library.
- Policy analysts generating white papers that dynamically deepen on high-impact sections.