
Step-DeepResearch Technical Report

Intermediate
Chen Hu, Haikuo Du, Heng Wang et al. | 12/23/2025
arXiv | PDF

Key Summary

  • Search is not the same as research; real research needs planning, checking many sources, fixing mistakes, and writing a clear report.
  • Step-DeepResearch trains a medium-size 32B model to act like a careful researcher by teaching small building-block skills called atomic capabilities.
  • A special data-making recipe turns expert reports, knowledge graphs, and multi-document walks into training stories the model can learn from.
  • Training goes in three waves: mid-training to install skills, supervised fine-tuning to connect them end to end, and reinforcement learning to improve decisions with checklist rewards.
  • A checklist-style judge turns report rubrics into clear rewards so the model learns to meet professional standards without noisy signals.
  • On the ResearchRubrics benchmark, Step-DeepResearch scores 61.42—near top performance—while costing less than 0.50 RMB per run.
  • On the new Chinese ADR-Bench with expert reviewers, it rivals or beats popular closed systems in many head-to-head matchups.
  • A streamlined single-agent ReAct loop plus smart tools (like an authority-aware searcher and token-saving file edits) keeps costs low and results stable.
  • The paper shows that careful training and data design can make medium models behave like expert researchers.
  • Limits remain: fragile tool use, hard truth-checking in noisy web pages, and making every report perfectly readable and auditable.

Why This Research Matters

Accurate, affordable research agents can help small teams make big decisions—in business, law, education, and policy—without paying premium costs. By teaching medium models to plan, verify, and write like experts, more people can access high-quality, cited analysis in their own language. This reduces the risk of confident-but-wrong answers that come from shallow search. It also shows a path forward for building trustworthy AI: focus on skills, structure, and clear rewards instead of just bigger models. In the long run, this approach can make digital research assistants reliable partners rather than fancy autocomplete tools.

Detailed Explanation


01 Background & Problem Definition

🍞 Hook: You know how googling a quick fact is easy, but writing a school report that’s true, complete, and well organized is hard? That’s because report writing needs more than just finding one answer—it needs planning, checking, and explaining.

🥬 The Concept: Deep Research is what an AI must do to handle open-ended, long, messy tasks—like studying a new policy or comparing technologies—by planning, gathering from many sources, cross-checking, and writing a reasoned report. How it works (why people care):

  1. Break the big question into smaller steps.
  2. Search across the web and papers.
  3. Compare sources to avoid believing the first thing you read.
  4. Fix your own mistakes when you notice problems.
  5. Organize findings into a clear, cited report.

Why it matters: Without this, AIs act like fast web crawlers: many snippets, weak logic, and risky errors.

🍞 Anchor: Imagine you ask, “Should our town build more bike lanes?” Deep Research means the AI finds data on safety, costs, laws, and examples from other cities, double-checks sources, and then writes a balanced, evidence-backed plan.

The world before: Most AI agents were tuned for search or short multi-hop Q&A with a single correct answer. They looked good on benchmarks that ask, “What is the capital of X?” or “Which scientist did Y before Z?” But everyday research asks, “What should we do and why?” It expects structure, tradeoffs, and citations.

The problem: When agents chase multi-hop QA scores, they learn to retrieve quickly, not to think carefully. In the wild, that leads to information fragments, broken reasoning, and hallucinations—especially when evidence is noisy or contradictory.

Failed attempts: Many systems tried to bolt on heavy workflows: multi-agent teams, complex planners, giant context windows, or hand-coded pipelines. These can work but are costly, brittle, and hard to maintain. Others trained end-to-end but mostly optimized “better search” rather than the full researcher’s loop (plan → seek → verify → reflect → synthesize → report).

🍞 Hook: Imagine building a LEGO castle with only random bricks and no instructions—you can make a wall, but a full castle is tough.

🥬 The Concept: Atomic Capabilities are the small, reusable skills—planning, deep information seeking, reflection and verification, and report writing—that snap together to make full research. How it works:

  1. Name the key skills.
  2. Create training data that targets each one.
  3. Practice combining them end to end.

Why it matters: If the model lacks even one—say, verification—the final report can be confidently wrong.

🍞 Anchor: Like teaching a team sport: passing, dribbling, shooting, and defense before full 5-on-5 games.

The gap this paper fills: Step-DeepResearch reshapes training around those atomic skills, then connects them with a progressive pipeline (mid-training → supervised fine-tuning → reinforcement learning). It also fixes evaluation by creating ADR-Bench, a Chinese, real-world benchmark using expert rubrics and human comparisons.

Real stakes: In business (market analyses), law (risk memos), health (policy briefs), and education (study guides), bad research wastes money and can harm people. This paper shows a medium 32B model, trained the right way, can deliver expert-like research at a fraction of the cost—putting trustworthy reports within reach for more users and languages.

02 Core Idea

🍞 Hook: Imagine teaching a student not just to look up answers, but to think like a scientist—plan experiments, check results, and write a paper.

🥬 The Concept: The key insight is to train a medium-size model to internalize the full researcher’s loop by practicing four atomic capabilities with purpose-built data, then connecting them via a staged training path and grading with checklist-style rubrics. How it works:

  1. Decompose research into atomic capabilities (planning, deep seeking, reflection/verification, reporting).
  2. Synthesize targeted data: reverse-engineered plans from expert reports, graph-based and multi-document quests, and reflection trails.
  3. Train progressively: mid-training installs skills, SFT composes them end to end, RL polishes behavior with rubric rewards in real tool environments.

Why it matters: This turns the objective from “predict the next token” into “decide the next atomic action,” making the agent robust and cost-efficient.

🍞 Anchor: It’s like coaching: drills for footwork and passing, scrimmages to combine skills, then real games with a referee’s checklist.

Three analogies for the same idea:

  1. Kitchen: Teach chopping, simmering, seasoning (atomic skills); follow a recipe (plan); taste-and-adjust (reflection); plate beautifully (report). The judge uses a rubric: doneness, balance, cleanliness.
  2. Detective: Map suspects (plan), gather clues from multiple places (deep seeking), cross-check alibis (verification), write a case report (report), and a supervisor rates it on thoroughness and evidence links.
  3. LEGO City: Bricks (atomic skills), instruction booklet (plan), test for sturdiness (verification), then showcase (report) graded by criteria.

Before vs after:

  • Before: Agents excelled at fast retrieval, struggled to integrate conflicting sources, and wrote uneven reports.
  • After: The agent plans first, seeks deeper, cross-checks claims, self-corrects, and writes structured, cited reports that align with rubrics.

Why it works (intuition):

  • Action subspace: By teaching “atomic actions,” the model operates in a smaller, clearer decision space than raw token-by-token guessing. Fewer bad branches, more stable learning.
  • Better supervision: Reverse-engineered plans and verified subgraphs provide clean, logically guided examples.
  • Reliable rewards: Binary rubric mapping avoids noisy halfway scores, giving crisp feedback.

Building blocks (with mini sandwich intros):

🍞 Hook: You know how a good to-do list keeps you on track? 🥬 The Concept: Planning & Task Decomposition means turning a fuzzy goal into ordered steps and revising them as you learn. How: Reverse engineer plans from expert reports; filter for plan–execution consistency. Why: Without it, the agent wanders. 🍞 Anchor: “Compare 3 EV battery chemistries” becomes: define metrics → gather lab data → cross-check safety recalls → write summary.
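To make the reverse-engineering idea concrete, here is a minimal Python sketch. The helper names, the (heading, body) section format, and the 0.8 coverage threshold are assumptions for illustration, not the paper's actual pipeline.

```python
from dataclasses import dataclass

@dataclass
class PlanStep:
    goal: str              # e.g. "Define comparison metrics"
    evidence: list[str]    # sources the expert report cites under this section

def reverse_engineer_plan(report_sections: list[tuple[str, str]]) -> list[PlanStep]:
    """Turn an expert report's section structure back into the plan that could have produced it."""
    plan = []
    for heading, body in report_sections:
        cited = [token for token in body.split() if token.startswith("http")]
        plan.append(PlanStep(goal=heading, evidence=cited))
    return plan

def plan_execution_consistent(plan: list[PlanStep], trajectory: list[str],
                              min_coverage: float = 0.8) -> bool:
    """Keep a synthetic training example only if the executed trajectory covers most plan steps."""
    covered = sum(
        any(step.goal.lower() in action.lower() for action in trajectory)
        for step in plan
    )
    return covered / max(len(plan), 1) >= min_coverage
```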

🍞 Hook: Imagine a treasure hunt where each clue points to the next. 🥬 The Concept: Deep Information Seeking is multi-hop, multi-document discovery that connects hidden entities. How: Use knowledge-graph subgraphs (verified via web) and hyperlink walks to craft hard quests; filter out easy ones. Why: Without it, the agent misses key evidence. 🍞 Anchor: From a company to its suppliers to policy changes to price swings.
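A rough sketch of how multi-hop quests could be built from hyperlink walks, with a difficulty filter at the end. The graph representation, question template, and one-shot baseline check are illustrative assumptions, not the paper's exact recipe.

```python
import random

def link_walk(graph: dict[str, list[str]], start: str, hops: int) -> list[str]:
    """Random walk over a hyperlink / knowledge-graph neighborhood to collect an entity chain."""
    path, node = [start], start
    for _ in range(hops):
        neighbours = [n for n in graph.get(node, []) if n not in path]
        if not neighbours:
            break
        node = random.choice(neighbours)
        path.append(node)
    return path

def make_quest(path: list[str]) -> dict:
    """Hide the intermediate entities so answering requires reconstructing the whole chain."""
    return {
        "question": f"Starting from {path[0]}, trace the documented relations to identify {path[-1]}.",
        "gold_chain": path,
    }

def hard_enough(quest: dict, one_shot_answer: str) -> bool:
    """Difficulty filter: discard quests that a single search call already answers."""
    return one_shot_answer.strip().lower() != quest["gold_chain"][-1].lower()
```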

🍞 Hook: Everyone makes mistakes; great students fix them. 🥬 The Concept: Reflection & Verification teaches the agent to find its own errors and check facts across sources. How: Closed loops of attempt → verify → reflect → retry; multi-agent teacher traces for fact checking. Why: Without it, confident errors slip through. 🍞 Anchor: “My date is wrong—source A says 2022, but the official site says 2023—update and cite.”
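The closed loop can be pictured as below; `attempt_fn`, `verify_fn`, and `reflect_fn` are placeholders for a model call, a cross-source fact check, and a critique step, none of which are specified in the paper.

```python
def solve_with_reflection(task, attempt_fn, verify_fn, reflect_fn, max_rounds: int = 3):
    """Closed loop: attempt -> verify against sources -> reflect on failures -> retry."""
    notes = []                                   # accumulated reflections, fed back into each retry
    answer = attempt_fn(task, notes)
    for _ in range(max_rounds):
        ok, issues = verify_fn(task, answer)     # e.g. cross-check a date against the official site
        if ok:
            break
        notes.append(reflect_fn(answer, issues)) # e.g. "source A says 2022, official site says 2023"
        answer = attempt_fn(task, notes)
    return answer, notes
```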

🍞 Hook: A pile of notes isn’t a report. 🥬 The Concept: Report Generation turns fragments into a clear, cited narrative. How: Mid-training for style and depth; SFT to obey formats and plans. Why: Without it, readers can’t use the findings. 🍞 Anchor: Executive summary → sections → citations matching claims.

Together these blocks, plus staged training and rubric rewards, deliver an affordable single-agent ReAct researcher that rivals larger, pricier systems.

03 Methodology

At a high level: User query → Plan (atomic actions) → Search and gather (tools) → Cross-verify and reflect → Draft and edit report → Judge with rubrics → Output.
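That flow can be read as a single-agent ReAct loop. The sketch below only captures its shape: `model.next_action`, the tool names, and `judge.score` are invented interfaces, not the paper's actual API.

```python
def deep_research(query: str, model, tools: dict, judge, max_steps: int = 40) -> dict:
    """Minimal single-agent ReAct loop: reason, call a tool, observe, repeat, then report and get judged."""
    context = [{"role": "user", "content": query}]
    step = {"action": "report", "content": ""}
    for _ in range(max_steps):
        step = model.next_action(context)                       # e.g. {"action": "search", "args": {...}}
        if step["action"] == "report":                          # agent decides the evidence is sufficient
            break
        observation = tools[step["action"]](**step["args"])     # search / browse / edit_file / todo ...
        context.append({"role": "tool", "content": observation})
    report = step.get("content", "")
    reward = judge.score(query, report)                         # checklist-style rubric reward at episode end
    return {"report": report, "reward": reward}
```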

🍞 Hook: Think of a school project where you first learn core skills, then practice with guided worksheets, and finally get graded with a clear rubric. 🥬 The Concept: Progressive Training Pipeline is a step-by-step path—mid-training, supervised fine-tuning (SFT), and reinforcement learning (RL)—that installs skills, connects them, and polishes decisions. How it works:

  1. Mid-training (32K→128K context): teach atomic skills and long-context handling; later, add tool use.
  2. SFT: stitch skills into full trajectories, align with instructions and formats.
  3. RL: practice in the real tool environment; earn rewards from a checklist judge.

Why it matters: Skipping stages leads to brittle behavior and messy reports.

🍞 Anchor: Drills → scrimmage → real game with a referee.
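A configuration-style sketch of that curriculum, assuming each stage can be described by context length, tool availability, and data mix; the field names and the `trainer.run` call are illustrative, not the released training code.

```python
# Illustrative stage descriptions for the mid-training -> SFT -> RL curriculum.
STAGES = [
    {   # Stage 1a: mid-training without tools at 32K context; install atomic skills
        "name": "mid_training_32k", "context_len": 32_768, "tools": False,
        "data": ["planning", "deep_reading_qa", "reflection_trails", "report_summaries"],
    },
    {   # Stage 1b: mid-training with tools at 128K context; long-horizon tool use
        "name": "mid_training_128k", "context_len": 131_072, "tools": True,
        "data": ["url_qa", "deep_search_steps", "web_browsing", "plans_with_tool_calls"],
    },
    {   # Stage 2: SFT on full trajectories; compose skills end to end
        "name": "sft", "context_len": 131_072, "tools": True,
        "data": ["deep_search_trajectories", "deep_research_trajectories"],
    },
    {   # Stage 3: RL in the live tool environment, rewarded by the rubrics judge
        "name": "rl", "context_len": 131_072, "tools": True,
        "reward": "binary_rubric_judge", "algorithm": "PPO with GAE (gamma=1, lambda=1)",
    },
]

def run_pipeline(model, trainer):
    """Run the stages in order; `trainer.run` stands in for each stage's actual training loop."""
    for stage in STAGES:
        model = trainer.run(model, **stage)
    return model
```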

Inputs and tools:

  • Base model: Qwen2.5-32B-Base (128K context, strong reasoning/value for cost).
  • Tool suite: authority-aware search, knowledge-dense retrieval, token-efficient file edits, to-do memory, sandboxed shell, resilient browser, multimodal parsers (two of these are sketched after this list).
  • Judge: checklist-style Rubrics Judge with binary mapping for crisp signals.
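Two of those tools are easy to picture in code. The sketch below assumes a generic `engine.search` backend and a plain-string document; the domain set, boost factor, and patch format are made up for illustration and do not describe the paper's actual tool APIs.

```python
class AuthorityAwareSearch:
    """Re-rank results so pages from a curated authority index (government, standards bodies) float up."""
    def __init__(self, engine, authority_domains: set[str], boost: float = 2.0):
        self.engine = engine
        self.authority_domains = authority_domains
        self.boost = boost

    def __call__(self, query: str, k: int = 10) -> list[dict]:
        hits = self.engine.search(query, k=5 * k)                # over-fetch, then re-rank
        for hit in hits:
            if hit["domain"] in self.authority_domains:
                hit["score"] *= self.boost
        return sorted(hits, key=lambda h: h["score"], reverse=True)[:k]

def apply_patch(document: str, old_span: str, new_span: str) -> str:
    """Token-efficient file edit: the agent sends only the changed span, not the whole rewritten file."""
    if old_span not in document:
        raise ValueError("patch anchor not found")
    return document.replace(old_span, new_span, 1)
```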

Step A: Agentic Mid-training (skills and long context)

  • 32K phase (no tools): Teach planning, deep reading (active QA and rewrites), reflection trails, and report summarization using expert content and synthetic reasoning traces. The goal is robust comprehension and internal “thinking steps,” not tool juggling yet.
  • 128K phase (with tools): Add URL QA, deep search steps, web browsing, and planning with tool calls. The agent learns when to call which tool, how to integrate returns, and how to keep context efficient.

Why this step exists: It creates stable internal representations for planning, evidence integration, and self-correction, so later stages can assemble these into reliable behavior.

Example: Given a long policy PDF, the model practices extracting key clauses, paraphrasing, and organizing them into a plan.

Step B: Supervised Fine-Tuning (SFT) (compose end to end)

  • Data: high-quality, full-chain Deep Search (ground truth) and Deep Research (open-ended) trajectories.
  • Cleaning (a minimal filter sketch follows this step):
    • Keep “correct and shortest” successful paths to reward efficient reasoning and minimal tool calls.
    • Inject realistic noise (tool errors) plus correct reflections to build robustness.
    • Deduplicate repetitive cognitive loops.
    • Enforce strict citations (e.g., \cite{}) and plan alignment.

Why this step exists: It teaches the model to deliver complete, formatted reports by combining skills under constraints.

Example: For “Compare hydrogen policies in three countries,” the best trajectories minimize redundant searches and place citations at every key claim.
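Here is the promised filter sketch. The trajectory schema (`task_id`, `success`, `num_tool_calls`, `steps`, `report`) is hypothetical; only the cleaning rules themselves come from the description above, and the noise-injection rule is left out because it adds data rather than filtering it.

```python
def clean_sft_trajectories(trajectories: list[dict]) -> list[dict]:
    """Apply the cleaning rules: shortest correct path, no repetitive loops, strict citations."""
    # Rule 1: among successful runs of the same task, keep only the one with the fewest tool calls.
    best: dict[str, dict] = {}
    for traj in (t for t in trajectories if t["success"]):
        current = best.get(traj["task_id"])
        if current is None or traj["num_tool_calls"] < current["num_tool_calls"]:
            best[traj["task_id"]] = traj

    kept = []
    for traj in best.values():
        thoughts = [step["thought"] for step in traj["steps"]]
        # Rule 3: drop trajectories dominated by repetitive cognitive loops.
        if len(set(thoughts)) < 0.8 * len(thoughts):
            continue
        # Rule 4: enforce strict citation formatting in the final report.
        if "\\cite{" not in traj["report"]:
            continue
        kept.append(traj)
    return kept
```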

Step C: Reinforcement Learning (RL) (polish decisions with feedback)

  • Real environment: The agent alternates reasoning with tool calls under budgets. Each episode ends with a report scored by the Rubrics Judge.
  • Two-step reverse synthesis: Generate hidden task summaries and atomic rubrics first; then create user-facing tasks that fit those rubrics; verify consistency with an independent judge.
  • Strict reward mapping: Convert ternary judgments into binary signals per rubric (positive/negative), weighted by importance, to avoid noisy “partial credit” (see the sketch after this list).
  • Optimization: PPO actor–critic with GAE (γ=1, λ=1) for stability over long sequences.

Why this step exists: Only real, tool-rich practice with consistent rewards teaches smart action timing, evidence selection, and structured writing.

Example: The agent plans, searches authoritative sources, cross-validates stats, and writes a report; the judge rewards full satisfaction of key rubrics like “explicit assumptions,” “conflict resolution,” and “traceable citations.”
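The sketch referenced above: one possible reading of the binary mapping, where a positive rubric pays off only when fully satisfied and a negative rubric (a fatal-error check) is penalized on any hint of violation. The verdict strings and weighting scheme are my assumptions, not the judge's actual schema.

```python
def rubric_reward(judgments: dict[str, str], rubrics: dict[str, dict]) -> float:
    """Collapse ternary judge verdicts into binary per-rubric signals, weighted by importance.

    judgments: rubric_id -> "satisfied" | "partial" | "unsatisfied"          (positive rubrics)
                            "clean" | "partially_violated" | "violated"      (negative rubrics)
    rubrics:   rubric_id -> {"weight": float, "polarity": "positive" | "negative"}
    """
    total = sum(r["weight"] for r in rubrics.values()) or 1.0
    reward = 0.0
    for rubric_id, verdict in judgments.items():
        weight = rubrics[rubric_id]["weight"]
        if rubrics[rubric_id]["polarity"] == "positive":
            # "partial" collapses into fail: only a full pass earns the weight
            reward += weight if verdict == "satisfied" else 0.0
        else:
            # fatal-error check: any partial or full violation forfeits the weight
            reward -= weight if verdict in ("violated", "partially_violated") else 0.0
    return reward / total
```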

🍞 Hook: You know how we often pick a smaller set of actions to keep choices manageable (like preset camera modes: portrait, night, sport)? 🥬 The Concept: Action Subspace (Atomic Actions) limits the agent to high-level moves—plan, search, verify, reflect, write—rather than arbitrary token-by-token flailing. How it works: Data and prompts steer behavior toward these moves; mid-training and SFT reinforce them; RL rewards them. Why it matters: Fewer wrong branches and clearer credit assignment. 🍞 Anchor: Instead of guessing letters to type a paragraph, choose “outline → draft → revise.”
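One way to picture the restricted decision space, assuming a hypothetical `model.choose` call that supports constrained decoding over a fixed option list:

```python
from enum import Enum

class AtomicAction(Enum):
    PLAN = "plan"
    SEARCH = "search"
    VERIFY = "verify"
    REFLECT = "reflect"
    WRITE = "write"

def next_atomic_action(model, context) -> AtomicAction:
    """Pick among five high-level moves instead of free-form token-by-token flailing."""
    choice = model.choose(context, options=[a.value for a in AtomicAction])  # hypothetical constrained decode
    return AtomicAction(choice)
```

Keeping the policy's choices at this level makes credit assignment during RL much cleaner: the judge's reward can be traced back to a handful of high-level decisions rather than thousands of individual tokens.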

Secret sauce highlights:

  • Reverse-engineered plans from expert reports: teaches feasible, logically tight planning.
  • Graph- and link-walk synthesis: builds real multi-hop challenges beyond trivia.
  • Error-reflection loops and multi-agent verification teachers: normalize self-correction and factual rigor.
  • Authority-aware retrieval and token-efficient editing: keep costs low without losing signal.
  • Binary rubric rewards: clear, stable guidance that matches human standards.

End-to-end flow with example data:

  • Input: “Explain if city X should subsidize heat pumps in 2026; compare costs, grid impact, and emissions; cite sources.”
  • Steps: Plan sub-questions → search authority index (government, IEA) → cross-verify conflicting cost ranges → reflect on outdated 2022 numbers → update with 2025 datasets → draft with sections and in-text citations → judge applies rubrics (coverage, correctness, citations, clarity) → output polished report.

04 Experiments & Results

The test: Two tracks measure both usability and rigor.

  1. ResearchRubrics: 101 tasks with 20–43 fine-grained criteria each. A deterministic LLM judge evaluates rubric compliance (ternary per criterion, ensemble of three runs), emphasizing explicit and implicit requirements, citation quality, synthesis, clarity, and instruction following.
  2. ADR-Bench: A Chinese, application-driven suite. General (70 tasks) uses human side-by-side comparisons across four dimensions (completeness, depth, requirement fitness, readability). Professional (40 tasks in Finance and Law) applies expert-built rubrics with strict negative scoring for fatal errors.

The competition: Step-DeepResearch (single-agent ReAct, 32B) is compared to ReAct baselines (MiniMax-M2, DeepSeek-V3.2, GLM-4.6, Kimi-k2-thinking) and closed commercial systems (OpenAI DeepResearch, Gemini DeepResearch, Kimi-Researcher, MiniMax Agent Pro, Qwen DeepResearch).

The scoreboard with context:

  • ResearchRubrics score: 61.42. That’s like getting an A when most are getting B’s. It beats the best open ReAct baseline Kimi-k2-thinking (56.17) by +5.25 and is close to Gemini DeepResearch (63.69), while surpassing several complex agent systems.
  • Cost efficiency: Under 0.50 RMB per run—less than one-tenth the cost of premium closed systems (≈5–7 RMB). That’s like riding the bus and arriving as fast as a taxi.
  • ADR-Bench human head-to-head: Consistent win/tie advantages against many systems; even versus top contenders, Step-DeepResearch reaches strong non-inferiority (wins+ties ≈ two-thirds). An ablation shows mid-training clearly boosts human preference (30 wins vs 21 losses vs the version without mid-training).
  • ADR-Bench Finance & Law: A three-tier spread emerges. Gemini leads, while Step-DeepResearch sits at the top of the second tier alongside Kimi and OpenAI—suggesting domain knowledge coverage drives the ceiling, and strong agent process can’t fully mask knowledge gaps. Still, Step-DeepResearch performs competitively due to targeted domain training and strict negative-error control.

Category insights (ResearchRubrics):

  • Implicit and explicit criteria: High scores (≈54.5 and 72.0) show strong alignment with both spelled-out and “understood” expectations.
  • Citation quality: ≈57.0, tying for the top—evidence is carefully linked back to sources.
  • Communication quality: ≈58.2, best among compared systems—clear structure and professional readability.
  • Slightly behind the leaders in instruction following on some complex, multi-constraint prompts, likely tied to the diversity of SFT instructions; the team plans to refine this.

Surprising findings:

  • Medium, well-trained beats bigger, orchestrated: The single-agent 32B model, with internalized atomic skills, challenges and sometimes surpasses heavyweight multi-agent or deeply orchestrated solutions.
  • Binary rubric mapping matters: Collapsing “partial” into fail for positive criteria (and vice versa for negative ones) made rewards cleaner and training more stable, improving user-perceived quality.
  • Style vs substance tension: Pure checklist chasing risks listy, shallow writing. The team added a synthesis-driven drafting filter to boost insight without sacrificing coverage.

Ablations and dynamics:

  • Mid-training tokens (≈150B at 32K phase): Accuracy on structured reasoning benchmarks improved steadily and had not fully plateaued—indicating more headroom.
  • RL training rewards rose smoothly over steps, suggesting stable policy improvement in the real tool environment.

Bottom line: Measured by both automated rubrics and human preferences, Step-DeepResearch delivers near top-tier quality at dramatically lower cost, proving that training design can trump sheer parameter count.

05 Discussion & Limitations

Limitations:

  • Tool fragility: Changes in APIs, flaky search returns, or complex multi-tool chains can still derail runs. The model is better than before but not invincible.
  • Factuality under noise: When the web is messy or sources conflict subtly, the agent can produce “plausible but not provable” claims.
  • Readability and auditability: While structure is strong, some reports still mix depths unevenly or could map claims to citations more explicitly.
  • STEM and philosophical depth: Extra-hard logical chains and abstract reasoning still trail the very best closed models.

Required resources:

  • Base compute for 32B training over staged curricula (32K then 128K context), plus tooling for web search, browser control, sandboxed shell, and multimodal parsing.
  • A curated authority index and a knowledge-dense retrieval library to keep signal strong and token costs low.
  • A trained Rubrics Judge and a rubric synthesis pipeline to scale RL affordably.

When not to use:

  • Purely factual single-answer lookups (a standard search engine or small QA model is cheaper and faster).
  • Domains with sparse or rapidly changing ground truth where you cannot verify claims (risk of outdated or uncheckable outputs).
  • High-stakes legal or medical decisions without expert review (use as a drafting assistant, not a final arbiter).

Open questions:

  • Can multi-agent consensus further cut hallucinations while keeping costs low?
  • How to provide finer-grained, mid-trajectory rewards (not just at the end) without exploding judging costs?
  • What’s the best way to grow domain knowledge coverage for professional tiers without overfitting formats?
  • How to make temporal reasoning airtight so the agent never drifts to stale years or misreads date contexts?
  • Can we standardize tool APIs and feedback structures to reduce brittleness and improve transfer across environments?

06 Conclusion & Future Work

Three-sentence summary: Step-DeepResearch shows that teaching a 32B model small, vital research skills—planning, deep seeking, reflection, and reporting—then stitching them together with staged training and crisp checklist rewards creates an affordable, expert-like research agent. It reaches near top-tier scores on ResearchRubrics (61.42) and wins strong human preferences on ADR-Bench, all at under 0.50 RMB per run. Careful data synthesis and a progressive pipeline matter more than giant parameter counts.

Main achievement: Turning “predict the next token” into “decide the next atomic action,” backed by reverse-engineered plans, graph/link-walk challenges, reflection loops, and binary rubric rewards—delivering reliable, cited, and readable reports in a streamlined single-agent ReAct loop.

Future directions: Add lightweight multi-agent consensus roles (planner, verifier, writer), design mid-trajectory evaluators for cheaper, finer rewards, expand domain knowledge for professional tiers, and harden temporal reasoning and audit trails. Tool robustness can improve via standardized interfaces and richer structured feedback.

Why remember this: It’s a blueprint for making research-grade agents practical—showing that with the right skills, data, and rewards, medium models can think and write like experts, opening high-quality analysis to more people, languages, and budgets.

Practical Applications

  • Market and competitor analysis reports with source-backed metrics and risk assessments.
  • Legal memo drafts that cite relevant statutes and flag compliance risks with clear evidence links.
  • Policy briefs comparing options, costs, timelines, and tradeoffs across authoritative sources.
  • Technical surveys that synthesize academic papers, benchmarks, and engineering docs with citations.
  • Procurement evaluations that verify vendor claims and consolidate reviews from trusted databases.
  • Educational study guides that plan topics, gather multiple perspectives, and check facts before summarizing.
  • Healthcare landscape scans (non-clinical) that compare guidelines, costs, and access data with careful sourcing.
  • Enterprise knowledge ops: keep internal reports updated by patch-editing sections with fresh citations.
  • R&D scouting: map emerging technologies via graph-like multi-hop exploration and cross-validation.
  • Crisis updates: assemble fast, cited situation reports from verified sources while suppressing rumor.
#Deep Research #Atomic Capabilities #ReAct Agent #Agentic Mid-training #Supervised Fine-Tuning #Reinforcement Learning #Rubrics Judge #Binary Reward Mapping #Knowledge Graph Synthesis #Cross-source Verification #Authority-aware Retrieval #Long-horizon Decision-making #ADR-Bench #Cost-efficient AI #Report Generation