Learning Query-Specific Rubrics from Human Preferences for DeepResearch Report Generation
Key Summary
- DeepResearch agents write long, evidence-based reports, but teaching and grading them is hard because there is no single 'right answer' to score against.
- This paper trains a tool that writes a custom grading rubric for each question, learned directly from what humans prefer when comparing two reports.
- The rubric generator is optimized with a hybrid reward that mixes human preference consistency, an LLM judge’s quality score, and a format check.
- A new Multi-agent Markov-state (MaMs) workflow splits work into search, memory-updating, and report-writing steps so the agent can handle long, complex research.
- On a 5,000+ query preference dataset, learned rubrics rank human-preferred reports correctly about 65% of the time, beating generic and naïve LLM-made rubrics.
- When these learned rubrics train research agents, the agents score higher on DeepResearch Bench than all open-source baselines and approach closed-source systems.
- Reinforcement learning with the hybrid reward works best; using only generic rubrics or only LLM judges is weaker and can misalign with people.
- The method is scalable: once trained, the rubric generator creates question-specific checklists automatically, saving expert time.
- The MaMs workflow reduces confusion from long contexts and improves coherence, readability, and instruction following in final reports.
Why This Research Matters
Long reports shape real decisions—health, finance, policy—so training AI to write them well needs reliable grading signals. This work turns human choices into per-question checklists, making the feedback sharper and more aligned with what people value. It scales: once trained, the rubric generator writes useful grading sheets automatically, saving expert time. The MaMs workflow helps the agent handle long, messy information more safely and coherently. Together, they push open-source research agents closer to top-tier closed systems, opening broader access to trustworthy, explainable research tools.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine your class has to write long science reports using books and websites, but there’s no answer key. How would the teacher grade fairly every time?
🥬 The Concept: DeepResearch report generation is when AI writes long, evidence-based reports by searching, reading, and reasoning across many sources.
- How it works: (1) Plan what to look for, (2) search and collect sources, (3) summarize and connect evidence, (4) write a structured report with references.
- Why it matters: Without reliable grading and training signals, the AI can wander, over-trust weak sources, or write pretty but shallow reports.
🍞 Anchor: When you ask, “How did electric cars become popular?”, a DeepResearch agent must gather timelines, policies, battery tech, and sales data, then explain the story clearly.
The World Before: LLMs were great at short answers with clear right/wrong (like a quiz). For long reports, quality is fuzzy: there is no single perfect answer, only better or worse reports. People started using rubrics—checklists with weighted criteria like accuracy, depth, structure, citations—to grade outputs. But two big issues appeared.
The Problem: 1) Generic, one-size-fits-all rubrics miss important, query-specific details (e.g., for a medical policy question, 'risk trade-offs' might matter a lot; for a timeline, 'chronological accuracy' matters more). 2) Handcrafted, query-specific rubrics require lots of expert time and don’t scale.
Failed Attempts:
- Predefined rubrics: Simple to apply but too coarse. They struggle to spot subtle differences across diverse topics.
- LLM-generated rubrics (without human data): Quick to produce but often misaligned with what people truly prefer. They can reward style over substance or be inconsistent.
- Direct LLM judging without rubrics: Opaque and sometimes biased; hard to debug or improve when scores feel off.
The Gap: We needed query-specific rubrics that are (a) aligned with human preferences, (b) discriminative enough to separate good from great, (c) scalable across thousands of varied questions, and (d) structured so they can supervise reinforcement learning.
Real Stakes:
- Everyday users want trustworthy research: health summaries, policy comparisons, product evaluations.
- Teams need consistent, fair evaluation to train better agents without hiring experts for every question.
- Without strong, human-aligned feedback, agents can ‘reward-hack’: optimize for quirks the grader happens to reward rather than for what people actually value.
New Ingredients Introduced (Sandwich style):
🍞 Hook: You know how teachers use a grading sheet that changes depending on whether it’s a lab report or a book review? 🥬 The Concept: A rubric is a weighted checklist used to grade work.
- How it works: List criteria (accuracy, structure, depth, reasoning), give each a short description, assign weights, then score and combine.
- Why it matters: Without rubrics, grades become vague and inconsistent; with bad rubrics, students optimize for the wrong things. 🍞 Anchor: For a history report, criteria might include 'timeline accuracy' (weight 5), 'source diversity' (4), and 'clear thesis' (4).
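To make the "score and combine" step concrete, here is a minimal Python sketch of weighted rubric scoring (illustrative only, not the paper's code). The criterion names and weights follow the history-report anchor above; the 0–10 scale and the per-criterion scores are assumptions.

```python
# Minimal sketch of weighted rubric scoring (illustrative; not the paper's implementation).
history_rubric = [
    {"title": "Timeline accuracy", "weight": 5},
    {"title": "Source diversity",  "weight": 4},
    {"title": "Clear thesis",      "weight": 4},
]

# Hypothetical per-criterion scores on a 0-10 scale (e.g., assigned by a grader).
item_scores = {"Timeline accuracy": 8, "Source diversity": 6, "Clear thesis": 9}

def weighted_score(rubric, scores):
    """Combine per-criterion scores into one grade using the rubric weights."""
    total_weight = sum(item["weight"] for item in rubric)
    weighted_sum = sum(item["weight"] * scores[item["title"]] for item in rubric)
    return weighted_sum / total_weight  # stays on the 0-10 scale

print(round(weighted_score(history_rubric, item_scores), 2))  # -> 7.69
```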
🍞 Hook: Imagine two desserts, and students vote which tastes better. 🥬 The Concept: Human preference data records which of two reports people prefer for the same question.
- How it works: Show experts two candidate reports and ask which is better and why; store the preferred vs. rejected pair.
- Why it matters: Preferences give direct signals of what people value, capturing subtle differences general rubrics miss. 🍞 Anchor: Given two climate-policy reports, experts pick the one with stronger evidence and clearer trade-offs.
Together, these ideas set the stage: learn a query-specific rubric generator from human preferences, then use those rubrics to train better research agents.
02 Core Idea
Aha! Moment in one sentence: Teach a model to write a custom grading checklist for each question by learning directly from human choices—and then use those checklists to train research agents.
Three Analogies:
- Sports Referee: 🍞 You know how good refs learn what fouls really matter by watching lots of games and fan reactions? 🥬 Here, the rubric generator is the ref, learning which plays (criteria) matter most from human preferences, then making fair calls during training. 🍞 In a tech-trend question, it emphasizes 'evidence strength' over 'flowery language.'
- Recipe Taster: 🍞 Imagine tasting two soups and saying which you prefer. 🥬 The model learns what makes the better soup (balance, heat, salt) and turns that into a tasting checklist. 🍞 For policy reports, it learns to weigh 'stakeholder impact' and 'evidence consistency' heavily.
- Travel Checklist: 🍞 Different trips need different checklists (camping vs. city tour). 🥬 The rubric generator writes the right checklist per question, then the agent packs accordingly. 🍞 For a medical-safety question, 'risk disclosure' and 'source reliability' get top weight.
Before vs. After:
- Before: One-size-fits-all rubrics or ad-hoc LLM judges; weak alignment; easy to game.
- After: Query-specific, human-preference-aligned rubrics; clearer, more discriminative feedback; stronger training signals.
Why It Works (intuition):
- Human preference pairs reveal not just what is acceptable, but what is better. That margin of 'better' trains the rubric generator to be picky in the right ways.
- A hybrid reward stabilizes training: preference consistency ensures alignment with people; an LLM-as-a-Judge checks logical coverage and coherence; a format reward keeps outputs machine-usable.
- The MaMs workflow breaks a long, messy task into clean steps and states (memory, plan, report), making it easier to apply rubrics reliably.
Building Blocks (with Sandwich explanations):
- 🍞 Hook: You know how a teacher writes a special grading sheet for each project? 🥬 The Concept: Query-Specific Rubric Generator is a model that writes a custom, weighted checklist for the given question.
- How it works: Read the question; draft criteria (titles, one-sentence descriptions, weights); include penalties for errors; output JSON.
- Why it matters: Tailored criteria catch what’s important for this exact question and avoid generic noise. 🍞 Anchor: For 'compare two vaccines,' it adds 'evidence quality,' 'side effect clarity,' and 'trial population coverage.'
- 🍞 Hook: When two classmates turn in different essays, you can say which one you prefer. 🥬 The Concept: Human Preference Data are pairs where humans choose the better report for the same question.
- How it works: Experts compare two candidate reports considering usefulness, coherence, completeness, and alignment.
- Why it matters: Captures subtle judgments like 'this one reconciles conflicting sources better.' 🍞 Anchor: Between two energy-policy reports, experts pick the one that weighs costs and benefits with numbers.
- 🍞 Hook: Think of a careful judge who reads the rules and scores fairly. 🥬 The Concept: LLM-as-a-Judge scores how coherent, relevant, and complete a rubric is and later scores reports per rubric items.
- How it works: Given the question and a rubric (or a report + single rubric item), it outputs a calibrated score.
- Why it matters: Adds an extra lens to ensure rubrics are usable and reports are graded dimension-by-dimension. 🍞 Anchor: It might rate 'Clear Structure' as 9/10 and 'Citations' as 7/10, then combine with weights.
- 🍞 Hook: Choosing the best of a few options gets easier when you compare them together. 🥬 The Concept: Group Relative Policy Optimization (GRPO) improves the rubric generator by sampling several rubrics at once and rewarding the better ones.
- How it works: Generate a group; score each with the hybrid reward; raise probability of higher-scoring rubrics.
- Why it matters: Stabilizes learning and avoids chasing noisy, single-instance feedback. 🍞 Anchor: If 5 rubrics are proposed, the one that best separates preferred vs. rejected reports gets boosted.
- 🍞 Hook: Big projects work better when teammates each do a clear job. 🥬 The Concept: Multi-agent Markov-state (MaMs) Workflow splits the agent into Search Agent, State Agent, and Report Agent, all sharing one model but playing different roles.
- How it works: Keep a compact state (memory, plan, report); Search fetches info; State fuses chunks into memory; Report writes and revises.
- Why it matters: Prevents context overload, reduces error accumulation, and supports rubric-based training at each rollout. 🍞 Anchor: Reading a 30-page PDF gets chunked: facts go into memory; relevant bits update the report step by step.
Key Enabler: Hybrid Reward (preference consistency + LLM judge + format). It keeps rubrics aligned with people, logically sound, and machine-parseable—so they can supervise training at scale.
03 Methodology
High-level recipe: Input (question + human preference data) → Train Rubric Generator with Hybrid Reward via GRPO → Use Rubrics to Train a MaMs Research Agent that writes better reports → Evaluate on benchmarks.
Step 1. Build the Human Preference Dataset
- What happens: Create thousands of research-style questions across topics (law, health, tech, business). For each question, generate multiple candidate reports using strong agentic LLMs and the MaMs workflow, then have experts compare pairs and choose the better one.
- Why this step exists: We need ground truth of what humans prefer to anchor rubric learning; generic checklists don’t capture subtle distinctions.
- Example: For “Analyze causes of airline delays in 2024,” two reports are compared. Experts pick the one with clearer data sources, causal reasoning, and actionable insights.
How queries are created: Start from a knowledge graph, sample multi-hop entity paths (ensuring multi-step reasoning), and prompt an LLM to write natural questions. Rewriting with a stronger LLM diversifies style while keeping substance.
How candidate reports are generated: Use the MaMs workflow (explained below) with different LLMs and settings to produce varied yet reasonable drafts. Filter out clearly flawed ones, keep the top two for each question, and collect human preference labels (accepted vs. rejected).
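A hedged sketch of what one record in such a preference dataset could look like; the class and field names are illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    """One human-labeled comparison: two candidate reports for the same query."""
    query: str            # the research question
    accepted_report: str  # the report the experts preferred
    rejected_report: str  # the report the experts did not prefer

example = PreferencePair(
    query="Analyze causes of airline delays in 2024",
    accepted_report="...report with clearer data sources and causal reasoning...",
    rejected_report="...report with vaguer sourcing and weaker analysis...",
)
```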
Step 2. Train the Query-Specific Rubric Generator with a Hybrid Reward via GRPO
- What happens: The generator reads a question and proposes a set of rubric items, each with a title, a one-sentence description starting with a category (Key/Important/Optional/Error), and a weight. We sample a group of such rubrics and compute a combined reward.
- Why this step exists: Learning from human preferences makes rubrics discriminative and aligned with real judgments; the LLM-as-a-Judge and format checks stabilize quality and usability.
- Example: For a network-failures report, the generator includes 'Analysis of Causes' (weight 5), 'Troubleshooting Methods' (4), and 'Technical Errors' as a penalty item (-2).
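For concreteness, here is a hypothetical rubric in the JSON style the generator is described as emitting: a title, a one-sentence description prefixed with a category (Key/Important/Optional/Error), and a weight, with a negative weight acting as a penalty. The exact field names are assumptions.

```python
import json

# Hypothetical rubric for the network-failures question above; the category-prefix
# convention follows the paper's description, but the exact schema is assumed.
rubric = [
    {"title": "Analysis of Causes",
     "description": "Key: identifies and explains the main root causes of the failures.",
     "weight": 5},
    {"title": "Troubleshooting Methods",
     "description": "Important: covers concrete diagnostic and remediation steps.",
     "weight": 4},
    {"title": "Technical Errors",
     "description": "Error: penalize factual or technical mistakes in the analysis.",
     "weight": -2},
]

print(json.dumps(rubric, indent=2))  # the generator is trained to emit valid JSON like this
```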
The Secret Sauce: The Hybrid Reward
🍞 Hook: When grading a test, you (1) check answers match the key, (2) make sure your grading sheet makes sense, and (3) use a neat format so others can use it. 🥬 The Concept: Hybrid Reward combines three parts to train the rubric generator.
- How it works:
- Preference Consistency: Apply the rubric to both the human-accepted and rejected reports; the rubric is rewarded if it scores the accepted one higher.
- LLM-as-a-Judge Score: An LLM evaluates the rubric itself—Is it coherent, comprehensive, and relevant to the question?
- Format Reward: Check the rubric is valid JSON with only the allowed fields; penalize malformed outputs.
- Why it matters: Each part covers a failure mode—alignment with people, rubric quality, and machine-parseability. 🍞 Anchor: If a rubric can’t correctly rank the better climate report or outputs messy structure, it gets a low hybrid reward.
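The sketch below shows one way the three parts could be combined into a single training reward. The helper functions, the 0–1 scales, and the mixing weights are assumptions for illustration; the paper's exact formulation may differ.

```python
import json

def preference_consistency(rubric, accepted_report, rejected_report, score_fn):
    """1.0 if the rubric ranks the human-accepted report above the rejected one, else 0.0."""
    return float(score_fn(rubric, accepted_report) > score_fn(rubric, rejected_report))

def format_reward(rubric_text, allowed_fields=frozenset({"title", "description", "weight"})):
    """1.0 if the rubric is valid JSON using only the allowed fields, else 0.0."""
    try:
        items = json.loads(rubric_text)
        return float(all(set(item) <= allowed_fields for item in items))
    except (json.JSONDecodeError, TypeError):
        return 0.0

def hybrid_reward(rubric_text, accepted, rejected, score_fn, judge_fn,
                  w_pref=0.5, w_judge=0.3, w_fmt=0.2):  # mixing weights are assumed
    fmt = format_reward(rubric_text)
    if fmt == 0.0:
        return 0.0  # unparseable rubrics get no credit
    rubric = json.loads(rubric_text)
    pref = preference_consistency(rubric, accepted, rejected, score_fn)
    judge = judge_fn(rubric)  # LLM-as-a-Judge: coherence/coverage/relevance, assumed in [0, 1]
    return w_pref * pref + w_judge * judge + w_fmt * fmt
```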
GRPO in practice
🍞 Hook: Picking the best from a small batch is easier than scoring one at a time. 🥬 The Concept: GRPO samples a group of rubric candidates per question and boosts the ones with higher hybrid reward.
- How it works: Generate 8 rubrics; compute hybrid rewards; increase the probability of the higher-scoring ones relative to the group; repeat.
- Why it matters: Reduces noise and helps the model converge to stable, high-quality rubric writing. 🍞 Anchor: Among 8 variants for a legal analysis question, the one that best separates accepted vs. rejected reports and looks coherent wins.
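A simplified sketch of the group-relative step: score every rubric in the group with the hybrid reward, then normalize within the group so above-average rubrics get positive advantages. The clipping and KL terms of full GRPO are omitted here.

```python
import statistics

def group_relative_advantages(rewards):
    """Normalize rewards within one group so above-average rubrics get positive advantages."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero when all rewards match
    return [(r - mean) / std for r in rewards]

# e.g., hybrid rewards for 8 sampled rubrics on the same legal-analysis question
rewards = [0.9, 0.4, 0.7, 0.2, 0.9, 0.5, 0.6, 0.3]
advantages = group_relative_advantages(rewards)
# The policy-gradient update then raises the probability of rubrics with positive advantage.
```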
Step 3. Train the DeepResearch Agent with MaMs Using Learned Rubrics
- What happens: For each new question, the trained rubric generator writes a tailored rubric. The MaMs agent then executes its workflow (search → state update → report), producing rollouts scored by the rubric via an LLM-as-a-Judge on each item, aggregated with weights to a single reward used by RL (GRPO).
- Why this step exists: Using tailored, human-aligned rubrics as the reward makes reinforcement learning focus on what people value per question type.
- Example with data: Suppose the rubric for a health-policy question heavily weights 'evidence strength' and 'risk-benefit analysis.' A rollout that cites systematic reviews and compares risk trade-offs scores higher and is reinforced.
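A hedged sketch of how per-item judge scores might be folded into the single scalar reward for a rollout; the judge interface and the normalization are assumptions.

```python
def rollout_reward(report, rubric, judge_item_fn):
    """Score a report item-by-item with an LLM judge, then combine with rubric weights.

    judge_item_fn(report, item) is assumed to return a score in [0, 1] for one rubric item;
    negatively weighted items (e.g., 'Technical Errors') act as penalties.
    """
    num, denom = 0.0, 0.0
    for item in rubric:
        score = judge_item_fn(report, item)
        num += item["weight"] * score
        denom += abs(item["weight"])
    return num / denom if denom else 0.0  # scalar reward, roughly in [-1, 1]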
Inside the MaMs Workflow (the agent’s recipe):
🍞 Hook: Think of a three-person pit crew: one scouts, one updates the notes, one drives. 🥬 The Concept: MaMs splits the job into three roles, all using the same LLM but with different prompts.
- How it works:
- Search Agent: Decides next tool call (e.g., web search), refines the plan, and stops when enough info is gathered.
- State Agent: Receives long raw text, splits it into chunks, and incrementally fuses new facts into a compact 'memory' without losing earlier info.
- Report Agent: Updates the Markdown report step by step, integrating new evidence, fixing contradictions, and maintaining citations.
- Why it matters: Prevents long-context overload, keeps information organized, and reduces hallucinations. 🍞 Anchor: A 20-page PDF is chunked; facts like sample sizes and dates go into memory; the report adds a 'Methods' paragraph with proper citations.
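A high-level sketch of the control loop this description implies (search, state update, report). The function names stand in for the three prompted roles of the shared LLM; the chunk size, stopping logic, and action format are assumptions.

```python
def run_mams(question, search_agent, state_agent, report_agent, max_steps=20):
    """Minimal MaMs-style loop: a compact state (memory, plan, report) instead of one huge context."""
    state = {"memory": "", "plan": f"Research: {question}", "report": ""}
    for _ in range(max_steps):
        action = search_agent(question, state)            # decide the next tool call, or stop
        if action["type"] == "stop":
            break
        raw_text = action["result"]                       # e.g., web page or PDF content
        for chunk in chunked(raw_text, size=4000):        # avoid long-context overload
            state["memory"] = state_agent(state["memory"], chunk)   # fuse facts into memory
        state["report"] = report_agent(question, state)   # revise the Markdown report
    return state["report"]

def chunked(text, size):
    """Split long raw text into fixed-size pieces for incremental fusion."""
    return [text[i:i + size] for i in range(0, len(text), size)]
```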
What breaks without each step:
- No human preference pairs → rubrics drift toward style over substance.
- No LLM-as-a-Judge on rubrics → criteria may be incoherent or irrelevant.
- No format check → rubrics can’t be parsed by the training pipeline.
- No MaMs → long documents swamp context; reports become inconsistent.
- No itemized scoring → the reward is vague; RL may learn unstable shortcuts.
Putting it all together: The rubric generator becomes a reliable 'teacher’s checklist writer' that guides the MaMs agent to produce deeper, clearer, and more aligned research reports.
04 Experiments & Results
The Tests (what they measured and why):
- Preference Modeling: Can the learned rubrics consistently score the human-preferred report higher than the rejected one? They report Preference Accuracy (AUC) and Paired Cohen’s d (the size and stability of the score gap).
- DeepResearch Bench: Do agents trained with learned rubrics produce better final reports on real questions? Judges score dimensions like comprehensiveness, depth, instruction following, and readability.
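For the two preference-modeling metrics above, here is one standard way to compute them from per-pair rubric scores: accuracy as the fraction of pairs ranked correctly, and paired Cohen's d as the mean score gap divided by the standard deviation of the gaps. This is a sketch with made-up numbers; the paper's exact computation may differ.

```python
import statistics

def preference_metrics(accepted_scores, rejected_scores):
    """Return (preference accuracy, paired Cohen's d) from per-pair rubric scores."""
    diffs = [a - r for a, r in zip(accepted_scores, rejected_scores)]
    accuracy = sum(d > 0 for d in diffs) / len(diffs)          # fraction ranked correctly
    cohens_d = statistics.mean(diffs) / statistics.stdev(diffs)  # size/stability of the gap
    return accuracy, cohens_d

# e.g., rubric scores for five (accepted, rejected) report pairs
acc, d = preference_metrics([8.1, 7.4, 9.0, 6.8, 7.7], [7.5, 7.6, 8.2, 6.1, 7.0])
```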
The Competition (baselines):
- Human-defined General Rubrics (fixed, generic checklists).
- LLM-generated rubrics (no human alignment).
- Supervised fine-tuning on LLM-made rubrics (SFT).
- RL with only LLM-as-a-Judge reward, only preference reward, or the hybrid reward.
- Workflows: classic ReAct vs. the proposed MaMs.
- Closed-source reference systems (leaderboard numbers for context).
The Scoreboard (with context):
- On the 5,000+ query preference dataset, using Qwen backbones, RL with the Hybrid Reward gets about 65% preference accuracy and the largest paired Cohen’s d (~0.37). That’s like getting an A when generic rubrics hover near a coin flip (about 49–60%). A higher Cohen’s d means the rubric pushes the preferred report clearly ahead, not just barely.
- For DeepResearch Bench, training agents with the learned rubric generator improved all four judged dimensions compared to generic or naïve LLM-generated rubrics. With Tongyi-DeepResearch + MaMs + RL-trained rubrics, the system beats all open-source baselines and approaches closed-source results.
- MaMs vs. ReAct: Under the same rubric strategy, MaMs consistently performs better, especially on readability and instruction following—evidence that managing long context with a stateful, chunked workflow pays off.
Surprising Findings:
- RL with human preference signals alone helps, but combining it with an LLM judge and format checks (the hybrid reward) works best—suggesting each reward component covers different failure modes.
- A related algorithm, GSPO, yields similar reward scores but noticeably higher output entropy (more varied rubrics). For rubric generation, stability beats diversity, so GRPO is preferred.
- The trained rubric generator not only ranks pairs better; when used as the training signal, it lifts downstream agent performance across multiple backbones.
Takeaway Numbers (memorable):
- Preference AUC ~65% with hybrid-RL rubrics vs. ~49–61% for generic or plain LLM strategies.
- Clear gains on DeepResearch Bench: the learned rubrics + MaMs move open-source systems closer to closed-source leaders.
- Improvements are consistent across comprehensiveness, depth, instruction following, and readability.
In plain terms: Teaching the model to write checklists from human choices makes the grader sharper, and a sharper grader trains a stronger researcher.
05 Discussion & Limitations
Limitations (be specific):
- Pairwise preferences only compare two reports; real-world choices can involve multiple candidates and nuanced trade-offs (e.g., ‘A is deeper, B is clearer’). Extending to ranked or graded preferences could capture more detail.
- Some qualities (novelty, creativity, ethical framing) are subjective and still hard to judge, even with an LLM-as-a-Judge. Calibration and bias remain concerns.
- Generalization to very different domains or report formats (e.g., lab protocols, legal briefs, clinical notes) needs more testing.
- The approach depends on a capable LLM judge during training; if the judge has blind spots, the rubric learning might inherit them.
Required Resources:
- You need a preference dataset (pairs of reports and human choices) across diverse topics.
- Access to strong LLMs for rubric generation, item-by-item scoring, and judging; plus compute for RL (the paper used substantial GPU resources).
- A retrieval/tooling stack and the MaMs prompts to support long-horizon research.
When NOT to Use:
- Tasks with clear, verifiable answers (simple QA): use exact-match or programmatic rewards instead.
- Extremely short outputs: per-question rubrics are overkill.
- Highly subjective creative writing where human tastes vary widely and stability (not diversity) is the main goal.
Open Questions:
- Can we reduce reliance on LLM judges by using self-consistency checks, source-grounded verification, or lightweight human spot-checks?
- How to incorporate multi-way preferences or continuous ratings to better shape rubric weights?
- Can the system detect and discourage reward hacking better (e.g., by penalizing redundancy or shallow ‘keyword stuffing’)?
- How well do learned rubrics transfer to unseen domains with different structures (e.g., code audits, financial modeling)?
Bottom line: The method is a strong step toward scalable, human-aligned supervision for long-form research, but future work should make judging more robust, less LLM-dependent, and more expressive of complex human values.
06 Conclusion & Future Work
Three-Sentence Summary: This paper learns a model that writes a custom, per-question grading checklist (rubric) from human preference data and uses it to train DeepResearch agents. A hybrid reward—preference consistency, LLM judge, and format validity—produces rubrics that are both human-aligned and machine-usable. Combined with the MaMs workflow, these rubrics boost report quality and outperform open-source baselines, approaching closed-source systems on DeepResearch Bench.
Main Achievement: Showing that learning query-specific rubrics from human preferences creates stronger, more discriminative training signals than generic rubrics or ungrounded LLM judgments—leading to measurably better research reports.
Future Directions: Expand beyond pairwise comparisons to rankings, cover subjective qualities (novelty, ethics) more reliably, reduce dependence on LLM judges via self-check mechanisms, and test transfer to new domains and formats.
Why Remember This: Instead of only training the writer, they trained the grader first—tailored to each question and grounded in what people actually prefer—and that smarter grader taught the writer to produce clearer, deeper, and more trustworthy research.
Practical Applications
- Automate grading of long-form AI reports in internal evaluations with transparent, question-tailored rubrics.
- Train enterprise research assistants (legal, medical policy, market analysis) using human-preference-aligned rewards.
- Build safer search-and-report pipelines by penalizing technical errors and rewarding evidence strength.
- Create classroom rubrics for AI-written essays that adapt to each assignment’s goals (e.g., lab vs. history report).
- Benchmark vendor LLMs fairly by applying the same learned rubric to each model’s report.
- Use MaMs to process lengthy PDFs and websites in chunks, keeping memory structured and reducing hallucinations.
- Run A/B tests on report styles and update rubric weights to reflect stakeholder preferences over time.
- Deploy lightweight spot-checks where humans review only disagreements between rubric scores and preferences.
- Prioritize model improvements by analyzing which rubric items most often lower scores (e.g., weak citations).
- Customize rubrics for regulated domains to emphasize compliance, risk disclosure, and source reliability.