Paper2Rebuttal: A Multi-Agent Framework for Transparent Author Response Assistance
Key Summary
- This paper turns rebuttal writing from "just write some text" into "make a plan with proof, then write."
- It uses a team of AI helpers (a multi-agent system) where each helper has a job, like finding evidence or checking the plan.
- Before writing, the system breaks big reviewer comments into small, clear questions so nothing gets missed.
- For each question, it builds a hybrid context that mixes short summaries with exact quotes from the paper.
- If the answer needs outside knowledge, it searches the literature and makes short, safe-to-cite briefs.
- It creates an inspectable response plan that links every claim to internal or external evidence.
- A human can review and adjust the plan (human-in-the-loop) before any final text is drafted.
- On a new benchmark (RebuttalBench), the method beats strong chat models in coverage, faithfulness, and coherence.
- Ablation tests show external evidence briefs matter most, while structuring and checkers also help.
- The biggest gains appear for weaker base models, meaning good planning reduces the need for super-strong generators.
Why This Research Matters
Better rebuttals can tip a paper from rejection to acceptance, so reducing hallucinations and missed points has real career impact. This framework makes the process transparent, letting authors see and approve the plan before any text is written. By anchoring every claim to quotes and citations, it builds trust with reviewers and avoids over-promising. It also saves time under deadline by organizing work into clear, prioritized to-dos. Because planning and evidence matter more than pure fluency, even smaller LLMs can produce strong, reliable rebuttals. The same idea (verify, then write) can improve many other high-stakes communications beyond research.
Detailed Explanation
01 Background & Problem Definition
You know how when you study for a test, it's not enough to write a pretty essay; you have to answer the exact questions and show your work with proof? That's what authors face during paper rebuttals: short deadlines, tough questions, and the need to back up every statement with clear evidence.
Top Bread (Hook): Imagine you and your friends build a LEGO city. When someone asks, "How did you make the bridge so strong?", you can't just say, "We tried hard." You need to point to the exact LEGO pieces and steps you used.
Filling (The Actual Concept: multi-agent system)
- What it is: A multi-agent system is a team of smart helpers, each with a special role.
- How it works: 1) One helper reads and organizes the paper and reviews; 2) Another finds evidence in the paper; 3) Another searches outside papers; 4) Another makes a response plan; 5) A checker reviews it; 6) A writer turns the plan into text.
- Why it matters: Without teamwork, one big helper tries to do everything at once, gets confused, and makes mistakes like hallucinations.
Bottom Bread (Anchor): Like a soccer team where the goalie, defender, and striker each do their part so the team doesn't lose the ball.
Before this work, two common approaches struggled:
- Direct-to-text systems (fine-tuned on rebuttals) tried to write answers in one shot. They often hallucinated numbers, over-promised fixes, or forgot some reviewer points because they weren't built to track evidence.
- Chat-based prompting could reason better but forced long, messy conversations. The middle steps (what questions were extracted, which evidence was found, and why) stayed hidden, so results were hard to trust or audit.
Top Bread (Hook): Think of a school project where everyone's talking but no one is writing down the plan. You might do lots of work but still miss what the teacher asked.
Filling (The Actual Concept: human-in-the-loop checkpoints)
- What it is: Human-in-the-loop checkpoints are moments when a person reviews and approves the plan.
- How it works: 1) The system drafts a plan; 2) You check it; 3) You adjust goals or promises; 4) Only then does the system write the final text.
- Why it matters: Without these checkpoints, the AI might promise new experiments you can't run or say things that don't match your paper.
Bottom Bread (Anchor): Like a teacher checking your outline before you write your essay so you don't head in the wrong direction.
The real problem is that rebuttals need four things the old methods didn't guarantee: (i) complete coverage of every concern, (ii) strict faithfulness to the manuscript, (iii) verifiable grounding that links claims to quotes or citations, and (iv) global consistency so answers don't contradict each other.
Top Bread (Hook): In sports, a good coach changes the plan depending on the other team.
Filling (The Actual Concept: dynamic contextualization)
- What it is: Dynamic contextualization means building the right context for each question.
- How it works: 1) Start from a compact summary of the paper; 2) Expand only the needed parts into full, exact quotes; 3) Add outside references only if required.
- Why it matters: Without it, the AI wastes space on unused text or misses the precise lines that matter.
Bottom Bread (Anchor): Like zooming in on a map only where you need street-level detail.
Top Bread (Hook): When you don't know something, you ask a librarian.
Filling (The Actual Concept: external search module)
- What it is: A tool that searches outside literature when your paper alone can't answer a point.
- How it works: 1) Decide if outside evidence is needed; 2) Create smart search queries; 3) Screen papers; 4) Produce short, safe-to-cite briefs.
- Why it matters: Without it, you either ignore requests for comparisons or make risky, unsupported claims.
Bottom Bread (Anchor): Like borrowing the right book to back up your science fair claim.
What was missing before? A "verify-then-write" workflow that turns rebuttal writing into an evidence-first, plan-driven process. This paper fills that gap with a transparent, multi-agent system that authors can audit step by step, cutting hallucinations and keeping claims honest and consistent. The stakes are real: better rebuttals can change paper decisions, reduce stress under deadlines, and improve the fairness and clarity of scientific discussions.
02 Core Idea
The Aha! Moment in one sentence: Don't write the rebuttal first; plan it with linked evidence, check it, then write.
Top Bread (Hook): You know how detectives pin strings from clues to a suspect board before they explain the case?
Filling (The Actual Concept: evidence-centric planning)
- What it is: Building your argument around evidence before writing sentences.
- How it works: 1) Split reviews into small concerns; 2) Gather internal quotes and external citations; 3) Draft a plan that links each claim to its source; 4) Check for gaps or contradictions; 5) Only then draft text.
- Why it matters: Without planning, you might sound fluent but still miss a key point or make claims you can't prove.
Bottom Bread (Anchor): Like solving a math problem by writing the steps and justifications first.
Three analogies for the same idea:
- Cooking: Mise en place. Measure and set out all ingredients (evidence), then follow the recipe (plan) so the dish (rebuttal) turns out right.
- Courtroom: You collect exhibits and witness statements (evidence), build your case outline (plan), then give the closing argument (final text).
- Building: You draw blueprints (plan with evidence anchors) before pouring concrete (writing), so the house stands straight (consistency).
Before vs After:
- Before: One-shot answers that might be eloquent but miss points, hallucinate, or contradict themselves.
- After: Structured, inspectable plans that trace every claim to an internal quote or an outside citation, with clear action items when new experiments are needed.
Top Bread (Hook): A messy to-do list makes you forget chores.
Filling (The Actual Concept: concern breakdown)
- What it is: Turning long reviews into small, actionable questions.
- How it works: 1) Split compound complaints; 2) Merge duplicates; 3) Tag priorities (critical/important/minor); 4) Link to paper sections.
- Why it matters: Without this, you'll skip something or answer the wrong thing.
Bottom Bread (Anchor): Like turning "clean the house" into "vacuum living room, wash dishes, take out trash." (A code sketch of one possible concern record follows below.)
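The sketch below shows what one atomic-concern record could look like after this breakdown. The class name, fields, and priority labels are illustrative assumptions rather than the paper's actual schema, and the hand-written records stand in for what the LLM-backed extractor would produce.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Concern:
    """One atomic, answerable reviewer concern (illustrative schema)."""
    concern_id: str
    text: str
    priority: str                      # e.g. "critical", "important", "minor"
    linked_sections: List[str] = field(default_factory=list)

# The real extractor is LLM-backed; these hand-written records only show the
# shape of its output for a compound reviewer comment.
review_comment = ("No comparison to LoRA is provided, "
                  "and the motivation for Eq. 3 is unclear.")
print("Raw comment:", review_comment)

concerns = [
    Concern("q1", "Missing comparison to LoRA", "critical",
            ["Sec. 4.2", "Tab. 2"]),
    Concern("q2", "Unclear motivation for the MI choice in Eq. 3", "important",
            ["Sec. 3.2", "Eq. 3"]),
]

for c in concerns:
    print(f"[{c.priority}] {c.concern_id}: {c.text} -> {c.linked_sections}")
```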
Top Bread (Hook): A recipe shows every step so anyone can follow it.
Filling (The Actual Concept: inspectable response plan)
- What it is: A step-by-step outline that shows what you'll say and which evidence backs it.
- How it works: 1) For each concern, write claim + linked quote/citation; 2) Note if new experiments are needed as action items; 3) Run checks for conflicts.
- Why it matters: Without an inspectable plan, mistakes hide inside pretty wording.
Bottom Bread (Anchor): Like a LEGO instruction booklet with numbered steps and exact bricks.
Top Bread (Hook): A team beats a solo act when the job is complex.
Filling (The Actual Concept: RebuttalAgent)
- What it is: A multi-agent system that does "verify-then-write" for rebuttals.
- How it works: 1) Parse and compress the paper; 2) Extract atomic concerns; 3) Build hybrid contexts; 4) Retrieve outside papers if needed; 5) Draft a plan with evidence links and action items; 6) Checker ensures coverage and consistency; 7) Human approves; 8) Writer drafts final text.
- Why it matters: Without this pipeline, the model is more likely to hallucinate or contradict itself.
Bottom Bread (Anchor): Like a newsroom where researchers, fact-checkers, editors, and writers each ensure the article is accurate and clear. (A code sketch of the pipeline follows below.)
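The sketch below chains the roles into a verify-then-write pipeline using plain Python functions as stand-ins for LLM-backed agents; every name, signature, and the human_approve callback are assumptions for illustration, not the paper's actual interfaces.

```python
from typing import Callable, Dict, List

# Each "agent" is a plain function here; in the real system these would be
# LLM-backed roles. Names and signatures are illustrative.
def parse_and_compress(manuscript: str) -> str:
    return manuscript[:200]                         # stand-in compact summary

def extract_concerns(reviews: str) -> List[str]:
    return [s.strip() for s in reviews.split(";") if s.strip()]

def build_context(concern: str, compact: str) -> str:
    return f"Concern: {concern}\nRelevant evidence: {compact}"

def draft_plan(concern: str, context: str) -> Dict:
    return {"concern": concern, "claim": "...", "evidence": context,
            "action_items": []}

def check_plan(plan: List[Dict]) -> List[Dict]:
    # Toy coverage check: keep only entries that carry some evidence.
    return [p for p in plan if p["evidence"]]

def write_rebuttal(plan: List[Dict]) -> str:
    return "\n\n".join(f"Re: {p['concern']}\n{p['claim']}" for p in plan)

def run_pipeline(manuscript: str, reviews: str,
                 human_approve: Callable[[List[Dict]], List[Dict]]) -> str:
    compact = parse_and_compress(manuscript)            # 1) structure input
    concerns = extract_concerns(reviews)                # 2) atomic concerns
    plan = [draft_plan(c, build_context(c, compact))    # 3-5) plan w/ evidence
            for c in concerns]
    plan = check_plan(plan)                             # 6) coverage check
    plan = human_approve(plan)                          # 7) human checkpoint
    return write_rebuttal(plan)                         # 8) final draft

# Trivial usage: auto-approve everything.
print(run_pipeline("Our method ...", "Missing LoRA baseline; Eq. 3 unclear",
                   human_approve=lambda p: p))
```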
Top Bread (Hook): Practice fields help athletes improve before the big game.
Filling (The Actual Concept: RebuttalBench)
- What it is: A benchmark built from real OpenReview threads to test rebuttal helpers.
- How it works: 1) Pair reviewer critiques, author replies, and follow-ups; 2) Score responses on relevance, argument quality, and communication; 3) Use LLM-as-judge with a detailed rubric.
- Why it matters: Without a realistic test, you can't tell if the system truly improves rebuttals.
Bottom Bread (Anchor): Like a driving simulator that checks braking, turning, and attention all at once. (A scoring sketch follows below.)
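The sketch below shows one way an LLM-as-judge rubric could be applied to a critique/rebuttal pair. The rubric wording and the judge_llm stub are assumptions; the benchmark's real rubric and judge model are more elaborate.

```python
import json

RUBRIC = """Score the rebuttal from 1-10 on each dimension:
- relevance: does it address the reviewer's exact points?
- argumentation: are claims supported by real evidence?
- communication: is it clear and professional?
Return JSON: {"relevance": x, "argumentation": y, "communication": z}"""

def judge_llm(prompt: str) -> str:
    """Stand-in for a real LLM judge call; returns a fixed, fake score."""
    return json.dumps({"relevance": 8, "argumentation": 7, "communication": 9})

def score_rebuttal(critique: str, rebuttal: str) -> dict:
    prompt = f"{RUBRIC}\n\nReviewer critique:\n{critique}\n\nRebuttal:\n{rebuttal}"
    return json.loads(judge_llm(prompt))

print(score_rebuttal("No LoRA comparison is provided.",
                     "Table 2 (Sec. 4.2) now compares against LoRA ..."))
```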
Why it works (intuition): Planning with evidence reduces hallucinations; hybrid contexts keep the right details in view; external briefs fill genuine gaps; checkers catch contradictions; and human checkpoints align the plan with what authors can actually do. Together, these pieces shift success from "how fancy the sentences are" to "how complete, true, and consistent the answers are."
03 Methodology
At a high level: Manuscript + Reviews → [Input Structuring] → [Evidence Construction] → [Plan + Checks + Human Approval] → [Final Draft].
Step 1: Input Structuring (Parser, Compressor, Extractor, Checkers)
- What happens: The system turns the PDF into paragraph-indexed text (parser), distills it into a compact summary that keeps key claims and results (compressor), and splits reviewer text into atomic concerns with priorities (extractor). Checkers verify the compact summary hasn't lost facts and that concerns are neither over-merged nor over-split.
- Why this step exists: It's hard to reason over huge PDFs and long reviews. Compact-yet-faithful views keep tokens low and retrieval precise, while atomic concerns ensure nothing gets missed.
- Example: Reviewer says, "No comparison to LoRA and unclear Eq. 3 motivation." The system makes two concerns: (q1) missing LoRA comparison (P1), (q2) unclear MI choice in Eq. 3 (P2), and links q1 to Sec. 4.2, Tab. 2; q2 to Sec. 3.2, Eq. 3. (The structured input is sketched below.)
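The sketch below shows one possible shape for the structured input after parsing and compression, keyed by paragraph IDs; the class and field names are hypothetical, not the paper's format, and the final assert is a toy stand-in for the faithfulness checker.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class StructuredPaper:
    """Illustrative output of the parser + compressor (hypothetical schema)."""
    paragraphs: Dict[str, str]       # stable IDs -> exact original text
    compact_summary: Dict[str, str]  # same IDs -> short, faithful summaries

paper = StructuredPaper(
    paragraphs={
        "sec4.2-p1": "Table 2 reports accuracy for our method and adapter baselines ...",
        "sec3.2-p3": "We choose mutual information in Eq. 3 because ...",
    },
    compact_summary={
        "sec4.2-p1": "Main results table; compares adapter baselines.",
        "sec3.2-p3": "Motivates the MI objective in Eq. 3.",
    },
)

# A toy faithfulness check: at minimum, no paragraph may be dropped entirely.
missing = set(paper.paragraphs) - set(paper.compact_summary)
assert not missing, f"Compressor dropped paragraphs: {missing}"
print("Compact view covers", len(paper.compact_summary), "paragraphs")
```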
Step 2: Evidence Construction (Hybrid Context + External Briefs)
- Atomic Concern Conditioned Hybrid Context: For each concern, the system searches the compact summary to find relevant sections, then expands only those into full, exact quotes while keeping the rest summarized. This blends efficiency with precision.
- On-Demand External Evidence: If a concern needs outside support (novelty disputes, missing baselines), a search planner crafts queries; a retriever finds candidate papers; a screener filters them; and a summarizer produces short, safe-to-cite briefs highlighting key claims, comparisons, and limitations.
- Why this step exists: Direct generation often invents facts. Building evidence first anchors arguments to real text and citations.
- Example: For q1 (LoRA comparison), the hybrid context pulls your result tables and exact lines about parameter counts. If LoRA isn't in the paper, the external module finds top LoRA references and produces brief summaries you can cite. (The external-brief flow is sketched below.)
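The sketch below walks the decide, query, screen, and summarize stages for external evidence; search_literature is a stub for a scholarly search call, the keyword heuristics stand in for LLM-backed decisions, and all names are illustrative.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class EvidenceBrief:
    """Short, citation-ready summary of an external paper (illustrative)."""
    title: str
    citation: str
    key_claims: List[str]

def needs_external_evidence(concern: str) -> bool:
    # Crude keyword heuristic standing in for an LLM decision.
    return any(k in concern.lower() for k in ("comparison", "baseline", "novelty"))

def search_literature(query: str) -> List[dict]:
    """Stub for a scholarly search call; returns a canned candidate."""
    return [{"title": "LoRA: Low-Rank Adaptation of Large Language Models",
             "citation": "Hu et al., 2021",
             "abstract": "Low-rank adapters reduce trainable parameters ..."}]

def make_briefs(concern: str) -> List[EvidenceBrief]:
    if not needs_external_evidence(concern):        # 1) decide
        return []
    hits = search_literature(concern)               # 2) query + retrieve
    hits = [h for h in hits if "adapt" in h["abstract"].lower()]   # 3) screen
    return [EvidenceBrief(h["title"], h["citation"],               # 4) summarize
                          key_claims=[h["abstract"][:80]]) for h in hits]

print(make_briefs("Missing comparison to LoRA"))
```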
Top Bread (Hook): When you pack a backpack, you put the important items on top.
Filling (The Actual Concept: dynamic contextualization via hybrid context)
- What it is: Expanding only the needed parts of the paper into high-fidelity quotes while leaving the rest compressed.
- How it works: 1) Locate likely spots in the compact view; 2) Swap in the original paragraphs there; 3) Keep links back to exact sections.
- Why it matters: Without it, the model either drowns in text or misses exact wording.
Bottom Bread (Anchor): Like opening only the chapter you need instead of carrying an open encyclopedia. (A sketch of the expansion step follows below.)
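The sketch below is a toy version of this expansion, reusing the paragraph-indexed structure from the earlier sketch: paragraphs that look relevant to the concern are swapped in as exact quotes, everything else stays compressed. The keyword-overlap test is a crude stand-in for the system's actual relevance judgment.

```python
def hybrid_context(concern: str, paragraphs: dict, summaries: dict) -> str:
    """Expand only concern-relevant paragraphs into exact quotes; keep the
    rest as compact summaries (toy keyword-overlap relevance test)."""
    keywords = set(concern.lower().split())
    lines = []
    for pid, summary in summaries.items():
        full_text = paragraphs[pid]
        if keywords & set(full_text.lower().split()):
            lines.append(f"[{pid}, exact quote] {full_text}")
        else:
            lines.append(f"[{pid}, summary] {summary}")
    return "\n".join(lines)

paragraphs = {
    "sec4.2-p1": "Table 2 reports accuracy for our method and adapter baselines ...",
    "sec5-p2": "We discuss limitations of the evaluation protocol ...",
}
summaries = {
    "sec4.2-p1": "Main results table.",
    "sec5-p2": "Limitations discussion.",
}
print(hybrid_context("missing adapter baselines in Table 2", paragraphs, summaries))
```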
Step 3: Planning and Drafting (Strategist, Plan Checker, Human-in-the-Loop, Drafter)
- Strategist: For each concern, proposes a response strategy anchored in evidence. If the concern needs new experiments, it generates action items instead of fake results.
- Plan Checker: Verifies coverage, evidence links, and cross-point consistency (no contradictions between answers).
- Human-in-the-Loop: Authors review and adjust the plan, accepting clarifications, resizing promises, or re-scoping action items to match real constraints.
- Drafter: Converts the approved plan into a formal rebuttal letter, preserving evidence links and placeholders (e.g., [TBD]) where work is pending.
- Why this step exists: It prevents over-commitment and keeps the final text faithful to verified facts.
- Example: If a reviewer asks for a new ablation, the plan includes "Run ablation X on dataset Y; report metric Z." The draft uses placeholders until results are ready. (See the plan-entry sketch below.)
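The sketch below ties the pieces together: a plan entry carrying claim, evidence, and action items; a plan-level check that flags ungrounded claims; and a drafter that emits [TBD] placeholders for pending work. Field names and the check are illustrative assumptions, not the paper's exact plan format.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class PlanEntry:
    """One concern's planned response (hypothetical fields)."""
    concern: str
    claim: str
    evidence: Optional[str] = None        # exact quote or external citation
    action_items: List[str] = field(default_factory=list)

def check_plan(entries: List[PlanEntry]) -> List[str]:
    """Flag entries that make a claim with neither evidence nor an action item."""
    return [e.concern for e in entries if e.evidence is None and not e.action_items]

def draft(entry: PlanEntry) -> str:
    body = entry.claim
    if entry.evidence:
        body += f' (see: "{entry.evidence}")'
    if entry.action_items:
        body += " Results: [TBD] " + "; ".join(entry.action_items)
    return f"Re: {entry.concern}\n{body}"

plan = [
    PlanEntry("Missing LoRA comparison",
              "We will add a LoRA baseline under matched parameter budgets.",
              action_items=["Run ablation X on dataset Y; report metric Z"]),
    PlanEntry("Unclear Eq. 3 motivation",
              "Eq. 3 uses mutual information to keep representations predictive.",
              evidence="Sec. 3.2, para 3: 'We choose MI because ...'"),
]
assert not check_plan(plan)   # every entry is grounded or carries an action item
print("\n\n".join(draft(e) for e in plan))
```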
The Secret Sauce:
- Evidence-first: Gather and link proof before fluency.
- Hybrid context: The right detail at the right time.
- Action items instead of guesses: No fabricated numbers; just tasks to complete.
- Lightweight verification: Small checks catch big problems (missed coverage, contradictions) early.
Mini Data Walkthrough:
- Input: Manuscript PDF + reviews.
- Output: For each concern, a plan entry: Claim → Evidence (quote or citation) → If needed: Action Item → Confidence/Notes → Final drafted paragraph.
What breaks without each step:
- No structuring: Missed concerns and wrong sections.
- No evidence construction: Hallucinated claims.
- No planning/checking: Contradictions and unsafe commitments.
- No human checkpoint: Promises that the team canāt keep.
04 Experiments & Results
The Test: The authors build RebuttalBench from real ICLR OpenReview threads; each sample has a reviewer critique, the author's reply, and the reviewer's follow-up. They score three things using a detailed LLM-as-judge rubric: Relevance (did you answer the exact points?), Argumentation Quality (is it logically supported by real evidence?), and Communication Quality (is it clear and professional?).
The Competition: They compare strong closed-source LLMs writing rebuttals directly (GPT-5-mini, Grok-4.1-fast, Gemini-3-Flash, DeepSeekV3.2) versus the same LLMs run inside RebuttalAgent's structured pipeline. This makes it a fair fight: same model, but different workflow.
The Scoreboard (with context): Across all backbones, RebuttalAgent wins on all dimensions. For example, it boosts coverage by up to +0.78 (DeepSeekV3.2) and specificity by up to +1.33 (GPT-5-mini). Think of it like turning B-grade answers into solid A/A- answers by making sure every question is handled with concrete proof and no contradictions. Communication Quality also improves, but the biggest jumps come from better concern tracking and stronger evidence use, not just fancier wording.
Surprising Findings:
- Bigger help for weaker models: The structured pipeline lifts smaller models more than larger ones. This suggests planning and evidence can substitute for raw generative power. In other words, a well-organized plan lets a modest writer perform like a pro.
- Balanced gains: Improvements are not just in one area (like adding citations). RebuttalAgent improves coverage, logic consistency, evidence support, tone, and clarity together, indicating the whole pipeline works as a system.
Ablation Study Insights:
- Removing external evidence briefs hurts the most, dropping coverage and constructiveness; the system becomes vaguer without citation-ready support.
- Skipping input structuring reduces semantic alignment and evidence support; messy concerns and unstable manuscript views make answers drift.
- Turning off plan-level checkers slightly lowers evidence support and rebuttal quality; small checks prevent subtle errors.
Bottom line: Verify-then-write beats write-then-hope, especially when time is tight and accuracy matters.
05 Discussion & Limitations
Limitations:
- Base-model dependence: The pipeline improves reliability, but the ceiling still depends on the underlying LLM's reasoning and reading skills.
- LLM-as-judge bias: The evaluator is itself an LLM, which may share weaknesses with systems being judged; human studies would add confidence.
- Retrieval fragility: External search can miss key papers or pull noisy sources; careful filtering and human review remain important.
- PDF parsing/compression drift: Even with checks, some subtle details might be lost in compression and need re-expansion.
- Domain transfer: The system is tuned for ML conference rebuttals; other fields (e.g., biology, HCI) may need tailored prompts and tools.
Required Resources:
- Access to strong LLM backbones, a search API for scholarly content, and GPU/CPU budget for multi-step runs.
- Author time for reviewing the plan and executing action items (e.g., running small ablations), though less time than unguided chat iterations.
When NOT to Use:
- Very short, trivial rebuttals where a single, obvious clarification suffices.
- Cases where no external search is allowed or evidence cannot be shared (policy constraints).
- Extremely limited time without any chance for human review (though the automated mode still helps, the biggest wins come with the checkpoint).
Open Questions:
- How to add stronger, reference-checked citation graphs that verify claims against PDFs automatically?
- Can we create cross-reviewer consistency maps to detect subtle stance conflicts earlier?
- What's the best mix of human time vs. agent time for maximum benefit?
- How to evaluate rebuttals with human panels efficiently to complement LLM-as-judge scores?
06 Conclusion & Future Work
Three-sentence summary: The paper reframes rebuttal writing as an evidence-first planning problem: break concerns into atoms, build hybrid contexts and external briefs, check for consistency, then draft. This multi-agent, verify-then-write workflow increases coverage, faithfulness, and coherence while letting humans steer final commitments. On RebuttalBench, it consistently outperforms direct-to-text baselines across strong and weak LLMs.
Main Achievement: A transparent, controllable pipeline (RebuttalAgent) that anchors every rebuttal claim in verifiable internal or external evidence before any prose is written.
Future Directions: Add stronger PDF-grounding and citation verification, richer cross-point consistency checks, and human-user studies across domains beyond ML. Explore tighter integrations with code repos for automatic artifact checks and with conference templates for one-click formatting.
Why Remember This: It shows that great rebuttals aren't just about fluent writing; they're about planning with proof. When AI organizes decisions and evidence first, even smaller models can produce reliable, high-impact responses.
Practical Applications
- Draft author rebuttals for conference and journal submissions with evidence-linked plans.
- Create point-by-point responses for grant review panels with citations and action items.
- Prepare structured replies to code reviews, linking claims to exact lines and test results.
- Write product RFP responses with traceable references to specs, benchmarks, and policies.
- Assemble legal-style position letters that cite internal documents and external precedents.
- Produce customer escalation replies grounded in logs, tickets, and knowledge-base articles.
- Help students answer rubric-based feedback with quotes from their own essays and sources.
- Compile compliance audit responses with document anchors and external standards.
- Generate medical prior-authorization letters referencing chart notes and guidelines.
- Plan research revision work (to-do lists) from reviews without fabricating outcomes.