DRPG (Decompose, Retrieve, Plan, Generate): An Agentic Framework for Academic Rebuttal
Key Summary
- DRPG is a four-step AI helper that writes strong academic rebuttals by first breaking a review into parts, then fetching evidence, planning a strategy, and finally writing the response.
- It solves two big problems: long papers that overwhelm AI and generic replies that don’t directly address a reviewer’s real concern.
- The Planner is the star: it proposes several ways to respond (clarification vs justification) and picks the one best supported by the paper with over 98% accuracy.
- Across many models, DRPG beats prior pipelines by large margins in pairwise comparisons (higher Elo scores) and even outperforms average human rebuttals with only an 8B model.
- A Retriever trims the paper down to the most relevant paragraphs for each review point, cutting input length by about 75% without losing key evidence.
- Using a confidence threshold, DRPG only follows a planned perspective when it’s very sure; otherwise it safely falls back to generation without planning.
- DRPG keeps doing well over multiple back-and-forth rounds, showing it can handle follow-up reviewer questions better than baselines.
- A human study showed strong agreement between people and the LLM judge, suggesting the evaluation is reliable.
- Template planners (like Jiu-Jitsu) are too rigid; DRPG’s content-aware planning makes rebuttals more specific and persuasive.
- While DRPG is great at clarifying and defending, it cannot run new experiments; authors should always verify its claims before submitting.
Why This Research Matters
In growing research communities, thousands of papers compete for limited attention, making clear rebuttals vital for fairness. DRPG helps authors respond precisely, reducing confusion and making strong papers easier to recognize. Its structure shows that thinking before writing—and tying claims to evidence—beats bigger-but-blunter approaches. Because it explains its choices (scores per perspective), it’s not a black box and can guide authors to improve their own arguments. Beyond academia, the same pattern—break down, fetch proof, plan, then write—can upgrade many high-stakes communications. With careful human verification to avoid hallucinations, DRPG can save time, raise quality, and make decisions more accurate at scale.
Detailed Explanation
01Background & Problem Definition
🍞 Hook: Imagine you’re explaining your science fair project to three judges. Each judge asks different questions. If you try to answer everything at once, you’ll ramble, miss key points, and the judges won’t be convinced.
🥬 The Concept (Peer-review rebuttal, the world before): A rebuttal is the short letter authors write to respond to reviewers’ comments on a paper. Before tools like DRPG, even smart language models often gave long, general answers that didn’t hit the exact concern. Papers are long, reviews are nuanced, and time is tight as top conferences receive tens of thousands of submissions. How it worked (and struggled):
- 1) Authors or simple AI prompts replied to the whole review at once. 2) The AI tried to read the entire paper and review, often missing the most important bits. 3) Responses sounded polite but generic, not persuasive. 4) Reviewers stayed unconvinced, and authors lost chances to clear up misunderstandings. Why it matters: Without precise, well-supported rebuttals, good research can be misunderstood and rejected. That’s frustrating, wastes time, and slows science.
🍞 Anchor: Think of a student who answers every homework problem with a friendly paragraph but no math. Nice tone, wrong tool—the teacher isn’t convinced.
🍞 Hook: You know how a huge textbook can make it hard to find the one paragraph that answers your homework question? Even a great reader can get lost in the middle.
🥬 The Concept (Lost-in-the-middle problem): Language models struggle to pick out the right evidence from very long text. How it hurts rebuttals:
- 1) Papers are long and dense. 2) Review questions are specific. 3) The model may focus on the wrong parts. 4) The reply becomes vague instead of targeted. Why it matters: If you don’t pull the right quote or figure, your answer feels hand-wavy.
🍞 Anchor: It’s like trying to prove a point about Chapter 10 but quoting Chapter 3—you won’t persuade your teacher.
🍞 Hook: Picture a debate team. The winning team doesn’t just talk; they pick a clear angle (strategy) and back it up with evidence.
🥬 The Concept (Persuasion planning): Effective rebuttals need a planned strategy—either correct a misunderstanding (clarification) or defend the choice (justification). How it works in spirit:
- 1) Spot the exact concern. 2) Decide the best response angle. 3) Gather proof. 4) Speak directly to the concern with that angle. Why it matters: Without strategy, even good facts don’t land well.
🍞 Anchor: If a reviewer says “Your dataset is small,” a clear strategy could be: Clarify that the task is few-shot by design, or Justify that the goal is proof-of-concept—both need specific evidence.
🍞 Hook: Teams tried making fixed templates for replies—like mad-libs for rebuttals.
🥬 The Concept (Past attempts and their gaps): Direct prompting and template-based systems were fast but not flexible. How they worked:
- 1) Prompt a model to answer everything at once (direct). 2) Or pick a standard template (template planner). 3) Hope it matches the reviewer’s real concern. Why it failed: Templates were too generic; direct prompts got overwhelmed by long papers.
🍞 Anchor: It’s like answering every question on a test with the same paragraph. Sometimes it fits, often it doesn’t.
🍞 Hook: Imagine a smart assistant that first organizes the questions, then fetches the right pages, chooses the best strategy, and only then writes the answer.
🥬 The Concept (The gap DRPG fills): DRPG adds structure—Decompose → Retrieve → Plan → Generate—so every reply is focused, evidence-backed, and persuasive. How it works at a high level:
- 1) Break the review into atomic points. 2) Fetch the most relevant paragraphs for each point. 3) Plan the best perspective (clarify vs justify) that the paper supports. 4) Generate a precise, polite answer. Why it matters: Structure makes the AI both smarter and more trustworthy.
🍞 Anchor: Like a good lawyer, DRPG asks, “What’s the exact claim? What evidence proves our case? What’s the best angle?” and then delivers the argument.
02Core Idea
🍞 Hook: You know how a great chef first preps ingredients, then picks a recipe, and only then starts cooking? That order turns chaos into a delicious dish.
🥬 The Concept (Aha! in one sentence): Plan the argument before you write it, and pick the plan that your paper can actually prove—then generate the rebuttal. How it works:
- 1) Decompose the review into small, clear questions (atomic points). 2) Retrieve only the most relevant paragraphs per point (no drowning in text). 3) Plan several perspectives (clarification vs justification), then score and select the one most supported by the paper. 4) Generate a concise, evidence-backed reply. Why it matters: Without planning and evidence selection, LLMs write generic replies that fail to persuade.
🍞 Anchor: Instead of saying, “We respectfully disagree,” DRPG says, “Here’s the best angle and the exact paragraph that proves it.”
Multiple analogies (three ways to see the same idea):
- Chef analogy: Prep (decompose), shop your pantry (retrieve), choose the recipe (plan), cook (generate).
- Courtroom analogy: Identify charges (decompose), pull exhibits (retrieve), pick your legal argument (plan), deliver closing statement (generate).
- Treasure map analogy: Mark X’s on the map (decompose), bring only the needed tools (retrieve), choose the safest route (plan), dig at the spot (generate).
Before vs After:
- Before: One big prompt, long context, unfocused answers.
- After: Modular steps: smaller questions, sharper evidence, explicit strategy, and targeted writing. Results are more persuasive and consistent.
Why it works (intuition, not equations):
- Narrowing the question removes noise (less confusion, more focus).
- Fetching only relevant paragraphs makes the model see the right facts.
- Planning forces a choice of angle, aligning what you say with what you can prove.
- Confidence gating (only using the plan when it’s sure) prevents forcing a bad strategy.
Building blocks (each with a Sandwich):
- 🍞 Hook: You know how big problems get easier when you split them into steps? 🥬 Decomposer: It turns a long review into a list of atomic points, each a single concern to address. How it works: 1) Scan the review. 2) Extract each distinct complaint or confusion. 3) List them clearly. Why it matters: If you reply to all at once, you miss details. 🍞 Anchor: From “Motivation not clear; Missing reference; Data size too small,” you get three separate questions to answer.
- 🍞 Hook: When you lose your keys, you don’t search the whole city—you check the most likely spots. 🥬 Retriever: It picks the most relevant paragraphs in the paper for each atomic point. How it works: 1) Embed the point and all paragraphs. 2) Compare similarity. 3) Keep the top matches. Why it matters: Cuts input by ~75% while keeping key evidence. 🍞 Anchor: For “data size too small,” it grabs paragraphs that explain few-shot focus or dataset constraints.
- 🍞 Hook: In a debate, you choose your angle before speaking. 🥬 Planner: It proposes multiple perspectives (clarification vs justification) and scores which the paper best supports. How it works: 1) Idea proposer drafts several angles (without seeing the paper to stay creative). 2) A scorer (encoder + small neural net) rates each angle by how well the retrieved paragraphs support it. 3) Choose the top-scoring angle if confidence is high. Why it matters: Strategy turns facts into persuasion. 🍞 Anchor: Faced with “data size too small,” it may pick “This is a few-shot task by design” if those paragraphs strongly support it.
- 🍞 Hook: Once your plan is solid, the speech writes itself. 🥬 Executor: It writes a concise, polite rebuttal for each point, grounded in the chosen evidence and perspective. How it works: 1) Read the point, relevant paragraphs, and selected perspective (if confident). 2) Produce a short, direct reply. Why it matters: Well-aimed writing convinces reviewers. 🍞 Anchor: “Question: Data size is small. Response: Our goal is proof-of-concept in few-shot settings; Section X reports strong accuracy under low-data regimes.”
Bonus concept—Clarification vs Justification:
- 🍞 Hook: If a friend says you were late at 8:10 but the event started at 8:30, you’d correct the fact; if it really started at 8:00, you’d explain why you were delayed. 🥬 The concept: Clarification fixes misunderstandings; Justification defends choices even if the reviewer is factually right. How it works: Clarify when evidence shows the reviewer missed something; justify when a different goal or standard is reasonable. Why it matters: Using the wrong type weakens your case. 🍞 Anchor: “We did include that ablation” (clarification) vs “This ablation isn’t needed for a theoretical paper” (justification).
03Methodology
High-level recipe: Review + Paper → Decompose → Retrieve → Plan → Generate → Final Rebuttal.
Step A: Decompose (turn one big review into atomic points)
- What happens: The Decomposer (an LLM) reads the review and lists each distinct concern as a bite-sized point.
- Why this exists: If you answer everything at once, you write vague replies. Atomic points help target each issue.
- Example: From “Motivation unclear; Missing reference; Data size too small,” we get three clean items to handle.
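To make the Decompose step concrete, here is a minimal Python sketch. The prompt wording and the `call_llm` wrapper are illustrative stand-ins for whatever LLM API you use, not the authors' exact prompt or implementation.

```python
# Minimal sketch of the Decompose step (prompt text and call_llm are illustrative).

def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around whatever chat/completion API you use."""
    raise NotImplementedError

def decompose_review(review_text: str) -> list[str]:
    """Ask an LLM to split a free-form review into atomic points, one per line."""
    prompt = (
        "Split the following peer review into a numbered list of atomic points. "
        "Each point must contain exactly one concern, question, or claim.\n\n"
        f"Review:\n{review_text}"
    )
    raw = call_llm(prompt)
    points = []
    for line in raw.splitlines():
        line = line.strip()
        if line:
            # Drop any leading numbering such as "1." or "2)".
            points.append(line.lstrip("0123456789.) ").strip())
    return points
```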
Step B: Retrieve (fetch only the most relevant evidence)
- What happens: The paper is split into paragraphs. For each atomic point, a dense retriever embeds the point and all paragraphs, compares them, and selects the top-K (e.g., K=15) most relevant ones by similarity.
- Why this exists: Long papers overload models (“lost in the middle”). Retrieval cuts ~75% of input while keeping the proof you need.
- Example: For “data size too small,” it might select: (1) a paragraph stating the work targets few-shot settings, (2) one explaining annotation scarcity, (3) an ablation showing robustness with low data.
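A minimal sketch of this retrieval step, assuming some dense encoder behind a placeholder `embed` function (any modern embedding model, such as BGE-M3, could fill that role) and K=15 as in the example above:

```python
# Minimal sketch of the Retrieve step: embed, compare, keep the top-K paragraphs.
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    """Hypothetical: return one L2-normalized vector per input text."""
    raise NotImplementedError

def retrieve_paragraphs(point: str, paragraphs: list[str], k: int = 15) -> list[str]:
    """Return the k paragraphs most similar to the atomic review point."""
    point_vec = embed([point])[0]          # shape: (d,)
    para_vecs = embed(paragraphs)          # shape: (n_paragraphs, d)
    # With normalized vectors, the dot product equals cosine similarity.
    sims = para_vecs @ point_vec
    top_idx = np.argsort(-sims)[:k]
    return [paragraphs[i] for i in top_idx]
```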
Step C: Plan (propose and pick the best-supported perspective)
- Part C1 – Idea proposer (creative brainstorming):
- What happens: Given a review point, it proposes up to five perspectives spanning both Clarification and Justification. Importantly, it does this without looking at the paper to keep ideas diverse.
- Why this exists: Creativity first, filtering later; starting broad avoids tunnel vision.
- Example candidates for “data size too small”: (a) Clarify that the task is few-shot by design. (b) Justify that annotation is expensive in this domain. (c) Justify that scaling is trivial given the design.
- Part C2 – Perspective selector (evidence-based picking):
- What happens: A scorer reads each perspective and the retrieved paragraphs. Using a text encoder plus a small MLP, it assigns a supportive score to each perspective by how well the paragraphs back it up, then averages across paragraphs. The highest-scoring perspective wins if confidence is high (e.g., ≥ 0.8).
- Why this exists: Not all ideas are equally provable by the paper. We prefer the one the paper clearly supports.
- Example: If the retrieved paragraphs strongly emphasize few-shot design, the selector picks the “few-shot by design” perspective with high confidence.
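The PyTorch sketch below shows one way such a supportive scorer could look: score each (perspective, paragraph) pair with an encoder plus a small MLP, then average over paragraphs. The encoder hook, hidden size, and pooling are assumptions for illustration, not the paper's exact architecture.

```python
# Sketch of an encoder + MLP supportive scorer (architecture details are assumptions).
import torch
import torch.nn as nn

class SupportScorer(nn.Module):
    def __init__(self, encode_fn, hidden_dim: int = 1024):
        super().__init__()
        self.encode = encode_fn  # maps a string to a (hidden_dim,) tensor
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden_dim, 256), nn.ReLU(), nn.Linear(256, 1)
        )

    def score(self, perspective: str, paragraphs: list[str]) -> torch.Tensor:
        p_vec = self.encode(perspective)
        pair_feats = torch.stack(
            [torch.cat([p_vec, self.encode(par)]) for par in paragraphs]
        )
        per_paragraph = self.mlp(pair_feats).squeeze(-1)  # one score per paragraph
        return per_paragraph.mean()                       # supportive score for this angle
```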
- Training the selector (intuitive view):
- Data: For each review point, combine five synthetic perspectives (from the proposer) with one ground-truth perspective extracted from a human rebuttal that led to improved scores.
- Goal: Learn to assign the highest score to the ground-truth style perspective when the evidence supports it.
- Result: The trained Planner identifies the right perspective with over 98% accuracy, far better than naïve similarity or using all paragraphs without filtering.
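A hedged sketch of one training step under these assumptions: each example pairs five synthetic perspectives with one ground-truth perspective, and a softmax cross-entropy over the six candidates pushes the ground-truth one to the top. The exact loss the authors use may differ.

```python
# Sketch of one training step for the selector (loss choice is an assumption).
import torch
import torch.nn.functional as F

def training_step(scorer, paragraphs, candidates, gt_index, optimizer):
    """candidates: 5 synthetic perspectives + 1 ground-truth one; gt_index marks the latter."""
    scores = torch.stack([scorer.score(c, paragraphs) for c in candidates])  # shape (6,)
    loss = F.cross_entropy(scores.unsqueeze(0), torch.tensor([gt_index]))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```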
- Safety (confidence gating):
- What happens: If the selector’s confidence is below the threshold, DRPG skips using a perspective and lets the generator answer with just the retrieved evidence.
- Why this exists: Prevents forcing a weak or mismatched plan.
- Example: If the paper doesn’t clearly support any candidate perspective, DRPG safely falls back.
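In code, the gating logic can be as small as this sketch (threshold 0.8 follows the example above; how the scores are calibrated is an implementation detail):

```python
# Sketch of confidence gating: commit to a perspective only when the winner is clear.
import torch
import torch.nn.functional as F

def pick_perspective(scores: torch.Tensor, candidates: list[str], threshold: float = 0.8):
    probs = F.softmax(scores, dim=0)
    best = int(torch.argmax(probs))
    if probs[best] >= threshold:
        return candidates[best]   # confident: hand this perspective to the generator
    return None                   # not confident: generate from evidence alone
```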
Step D: Generate (write the rebuttal point-by-point)
- What happens: The Executor (an LLM) writes a concise, respectful, evidence-grounded paragraph for each atomic point, optionally guided by the chosen perspective.
- Why this exists: This is where the focused plan and evidence become a persuasive answer.
- Example output: “Question: Data size is small. Response: Our method targets few-shot learning and demonstrates competitive results under limited labels; see paragraphs showing annotation scarcity and stable performance trends.”
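A minimal sketch of the Executor call, reusing the hypothetical `call_llm` wrapper from the Decompose sketch; the prompt wording is illustrative rather than the authors' exact template.

```python
# Sketch of the Generate step: one short, evidence-grounded reply per atomic point.
# `call_llm` is the same hypothetical wrapper defined in the Decompose sketch.

def generate_reply(point: str, evidence: list[str], perspective: str | None) -> str:
    strategy = f"Adopt this perspective: {perspective}\n" if perspective else ""
    prompt = (
        "Write a concise, polite rebuttal paragraph that directly addresses the "
        "reviewer's point, citing only the evidence provided.\n"
        f"{strategy}"
        f"Reviewer point: {point}\n"
        "Evidence paragraphs:\n" + "\n".join(f"- {p}" for p in evidence)
    )
    return call_llm(prompt)
```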
The secret sauce (what makes DRPG clever):
- It turns persuasion into a two-step process: be creative about possible angles, then be rigorous about which angle you can actually prove with your paper.
- Retrieval and planning reinforce each other: better evidence selection leads to better plan scoring; better plans spotlight the right evidence during generation.
- Confidence gating keeps the system robust when evidence is thin or ambiguous.
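Tying the four stages together, here is a sketch of the whole loop. It assumes the helper functions from the earlier sketches (`decompose_review`, `retrieve_paragraphs`, `pick_perspective`, `generate_reply`, a `SupportScorer` instance) plus a `proposer` callable that drafts candidate perspectives for a point.

```python
# End-to-end sketch of the DRPG loop, assuming the helpers sketched above.
import torch

def drpg_rebuttal(review_text: str, paper_paragraphs: list[str], scorer, proposer) -> str:
    replies = []
    for point in decompose_review(review_text):                      # Decompose
        evidence = retrieve_paragraphs(point, paper_paragraphs)       # Retrieve
        candidates = proposer(point)                                  # Plan: propose angles
        scores = torch.stack([scorer.score(c, evidence) for c in candidates])
        perspective = pick_perspective(scores, candidates)            # Plan: gate by confidence
        replies.append(generate_reply(point, evidence, perspective))  # Generate
    return "\n\n".join(replies)
```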
Extra mini-sandwiches for key sub-concepts:
- 🍞 Hook: When giving directions, you first pick possible routes, then choose the one with clear roads. 🥬 Supportive score: A number that says how well a perspective matches the retrieved evidence. How it works: Compare perspective text with each selected paragraph using an encoder+MLP, then average. Why it matters: It steers the plan toward provable claims. 🍞 Anchor: If three paragraphs clearly discuss few-shot design, that perspective gets the highest score.
- 🍞 Hook: You don’t cross a wobbly bridge unless you’re sure it’s safe. 🥬 Confidence threshold: Only use the plan if the selector is sure enough. How it works: Turn all scores into probabilities; if the winner’s probability is high, use it. Why it matters: Avoids overconfident bad choices. 🍞 Anchor: Unsure? The system replies using evidence without forcing a perspective.
- 🍞 Hook: If a friend misheard you, you correct them; if they heard you right but judge you unfairly, you explain your reasons. 🥬 Clarification vs Justification: Two distinct types of perspectives. How it works: Clarify factual mistakes vs justify reasonable choices under the paper’s goals. Why it matters: Picking the wrong type weakens persuasion. 🍞 Anchor: “The ablation is actually included” (clarify) vs “This ablation isn’t required for theory” (justify).
04Experiments & Results
The test (what they measured and why):
- Goal: Check if DRPG’s structured approach produces rebuttals that readers (human or LLM judges) prefer over baselines.
- Measures:
- 1) Pairwise comparisons to compute Elo scores (like a skill rating—higher means you usually win head-to-head). 2) A learned judge score simulating how a reviewer might adjust their rating after reading the rebuttal. 3) Planner accuracy in picking the right perspective.
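For readers unfamiliar with Elo, here is a standard pairwise rating update in Python; the K-factor and the details below are generic defaults for illustration, not values reported by the paper.

```python
# Standard Elo update from one pairwise comparison (generic defaults, not the paper's).

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    score_a = 1.0 if a_won else 0.0
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new
```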
The competition (who it beat):
- Direct: One-shot prompting over the whole review.
- Decomp: Break the review into points but no retrieval/planning.
- DRG: Decompose + Retrieve + Generate (no Planner).
- Jiu-Jitsu: Uses fixed templates to pick a perspective (rigid planning).
- Human-written rebuttals (REAL) as a sanity check.
The scoreboard (with context):
- Across four base LLMs (Qwen3-8B, GPT-oss-20B, Mixtral-8×7B, LLaMA3.3-70B), DRPG consistently won pairwise matchups, achieving the top Elo scores within each model family.
- DRPG’s advantage over other pipelines was often large (think of getting an A when others get B’s), and even with a compact 8B model, it surpassed average human-level performance on the dataset.
- The Planner’s selector achieved over 98% accuracy in identifying the human-like, evidence-supported perspective—far better than naïve similarity, scoring without paper content, or scoring over the whole paper without retrieval.
- Restricting DRPG to use only Clarification or only Justification weakened performance; the balanced, content-aware choice is key.
- In multi-round simulated discussions, DRPG continued improving over 2–3 turns, while baselines plateaued—showing it handles follow-up questions better.
Evaluation reliability (are the judges fair?):
- LLM judge training: A small model (Qwen3-4B) was trained with reinforcement learning to produce reviewer-like scores; on 71% of test cases, its score exactly matched human reviewers.
- Human study: On 20 sampled cases comparing DRG vs DRPG, three human experts showed substantial agreement, and their preferences aligned well with GPT-4o’s decisions. This suggests the automated comparisons are a good proxy for human judgment.
Surprising findings:
- Bigger isn’t always necessary—structure beats size: A well-planned 8B-based DRPG exceeded average human rebuttals and beat larger unstructured baselines.
- Templates aren’t enough: Jiu-Jitsu’s fixed patterns couldn’t adapt to specific paper evidence, making replies feel generic.
- Planning must be evidence-aware: Simply proposing angles without checking support leads to weaker persuasion.
Plain-English takeaway: When you think before you speak—and only say what you can prove—you win more debates. DRPG puts that into code.
05Discussion & Limitations
Limitations (what it can’t do):
- It can’t run new experiments or collect new data during rebuttal, so it can’t satisfy requests that require fresh results.
- If a paper is missing key evidence, no planning can conjure it; DRPG will be limited to what’s written.
- Risk of hallucination remains in generation; authors must verify claims and citations.
- Domain transfer: Trained mostly on CS conference data; performance in very different fields may drop without adaptation.
- Computational cost: Encoding long papers and scoring multiple perspectives per point isn’t free.
Required resources (what you need):
- A solid retriever (e.g., BGE-M3) to embed points and paragraphs.
- A planner selector (encoder + small MLP) trained on tens of thousands of labeled examples (e.g., ~50k points).
- A capable LLM executor (8B+ works well here) and basic infrastructure to run the pipeline.
When NOT to use it:
- If the reviewer asks for new experiments, user studies, or large-scale re-runs—DRPG can’t do those.
- Ultra-short reviews with one-liners may not benefit much from decomposition.
- Fields with highly specialized jargon and no similar training data may need domain-specific tuning first.
Open questions:
- How to integrate with “AI Scientist” tools to propose and (safely) run small additional analyses during rebuttal time?
- Can we better detect and respond to a reviewer’s underlying value judgments (e.g., novelty vs practicality preferences)?
- How well does this generalize beyond CS to medicine, physics, or humanities with different writing norms?
- Can we add automatic fact-checking to further reduce hallucination risk, and source-citation highlighting for transparency?
06Conclusion & Future Work
Three-sentence summary: DRPG is a structured, four-step agent that turns long, messy reviews into persuasive, evidence-backed rebuttals by decomposing concerns, retrieving key paragraphs, planning the best-supported perspective, and then generating targeted replies. Across diverse models and settings, it beats prior pipelines (and even average human performance with an 8B model) thanks to a content-aware Planner that picks provable strategies with over 98% accuracy. It also shines in multi-round discussions and offers interpretable scores explaining why a given perspective was chosen.
Main achievement: Showing that planning-plus-evidence—done before writing—dramatically improves rebuttal quality, outperforming template systems and unstructured prompting.
Future directions: Combine DRPG with automated analysis tools to run small, safe add-on experiments during rebuttal; expand to new research domains; strengthen fact-checking and citation grounding; and refine the Planner to model reviewer preferences.
Why remember this: DRPG proves that structure beats size—by thinking first (and only saying what the paper can prove), even a modest model can out-argue larger ones. It’s a blueprint for building agentic systems that persuade responsibly, with transparency and focus.
Practical Applications
- Draft point-by-point rebuttals to peer reviews with targeted evidence from the paper.
- Prepare responses to journal revise-and-resubmit letters by selecting the best-supported strategy per comment.
- Assist grant applicants in addressing reviewer critiques with clear, evidence-grounded arguments.
- Help research leads or students practice rebuttals by showing multiple perspectives and why one is best.
- Generate structured replies to internal research critiques (e.g., lab meetings) with traceable citations.
- Support code review responses by retrieving relevant design docs and planning persuasive explanations.
- Draft customer support or compliance responses that require evidence-backed justifications.
- Assist legal or policy teams with structured, evidence-linked replies to public comments (with expert review).
- Create educational tools that teach students how to argue from evidence using clarification vs justification.
- Enable multi-round Q&A coaching where each follow-up is addressed with a fresh plan and evidence.