OpenNovelty: An LLM-powered Agentic System for Verifiable Scholarly Novelty Assessment
Key Summary
- •OpenNovelty is a four-phase, AI-powered helper that checks how new a research paper’s ideas are by comparing them to real, retrieved papers.
- •It first extracts the paper’s main task and claimed contributions, then turns them into smart search queries with several paraphrased variants.
- •Using an academic semantic search engine, it gathers hundreds of candidates and filters them down to 60–80 strong matches per paper.
- •It builds a clear, labeled family tree (taxonomy) of related work and runs claim-by-claim, full-text comparisons with verifiable evidence quotes.
- •Any refutation must include exact quotes from both papers that pass a token-level verification algorithm; unverified refutations are automatically downgraded.
- •The system outputs a structured, human-readable report with citations, snippets, a related-work taxonomy, and textual similarity findings.
- •Unlike naïve LLM approaches, OpenNovelty does not trust the model’s memory; it grounds every decision in real, citable papers.
- •It was deployed on 500+ ICLR 2026 submissions and often found closely related works that authors missed.
- •The design favors fairness, traceability, and transparency, while stating limits like visual/math parsing gaps and search index coverage.
- •OpenNovelty aims to support, not replace, reviewers by making novelty judgments clearer, faster, and backed by checkable evidence.
Why This Research Matters
Peer review shapes which ideas get published, funded, and followed, so novelty must be judged fairly and transparently. OpenNovelty reduces guesswork by grounding every decision in real, retrieved papers with verified quotes. This helps reviewers work faster and with more confidence, especially when literature volume is overwhelming. Authors benefit from clearer, evidence-based feedback that can surface missed citations without unfair accusations. Conference organizers gain a scalable tool that promotes consistency and reduces hallucinated references. Over time, this strengthens trust in scientific decisions and encourages genuinely new contributions.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: You know how a school library keeps getting more and more books every year, and it gets harder to know if a “new” story is truly new? Reviewers of research papers have the same problem—there are too many papers to read and not enough time.
🥬 The Concept (Novelty in peer review): Novelty is how new and original a paper’s ideas are compared to what’s already known.
- How it works: reviewers read the paper, hunt for similar past work, and judge whether the new ideas go beyond what’s been done.
- Why it matters: without good novelty checks, old ideas can be presented as new, and truly fresh ideas might be missed. 🍞 Anchor: Imagine two students turn in “new” science fair projects on volcanoes; to judge novelty, you’d compare them to past projects and see what’s truly different.
🍞 Hook: Imagine trying to spot a unique snowflake in a blizzard. That’s today’s AI literature.
🥬 The Concept (Publication explosion): The number of AI papers has skyrocketed.
- How it works: top venues and arXiv add thousands of papers every year, faster than humans can digest.
- Why it matters: reviewers struggle to read broadly, so novelty judgments can become rushed and subjective. 🍞 Anchor: If you had to grade 100 essays in a weekend, you might miss who actually wrote the most original one.
🍞 Hook: You know how a friend might “remember” a fact that never really happened? Some AIs do that too.
🥬 The Concept (LLM hallucination): LLMs can make up references or details when asked to judge novelty from memory.
- How it works: they predict likely text but don’t always have the real source.
- Why it matters: a novelty check based on made-up sources isn’t trustworthy. 🍞 Anchor: It’s like citing a book that doesn’t exist; you can’t verify it.
🍞 Hook: Think of using a map when you’re unsure of the route.
🥬 The Concept (Retrieval-augmented analysis): Instead of trusting memory, you look up real documents and quote them.
- How it works: search broadly, pull back relevant papers, and compare claims line by line.
- Why it matters: decisions become checkable and fair. 🍞 Anchor: When you argue a point in class, you bring the textbook page as proof.
🍞 Hook: Imagine sorting a huge sticker collection into neat albums before trading.
🥬 The Concept (Structured organization): Reviewers need organized context, not just a pile of links.
- How it works: build a labeled, hierarchical map (a taxonomy) of related work.
- Why it matters: without structure, you can’t see where a paper truly sits. 🍞 Anchor: It’s easier to compare soccer teams when they’re grouped by leagues and positions.
The World Before: Reviewers faced an overwhelming flood of papers, limited time, and tools that either relied on LLM memory (risking hallucinations) or just skimmed titles/abstracts, missing technical details. RAG systems mostly compared surface summaries, and many methods couldn’t show fine-grained, verifiable evidence.
The Problem: How can we judge a paper’s novelty fairly and consistently when the literature is massive, and verification must be precise and traceable?
Failed Attempts:
- Naïve LLMs hallucinated references.
- Abstract-only comparisons missed core technical overlaps in methods and training details.
- Context windows were too small to organize large sets of prior work.
The Gap: A system that (1) grounds every claim in retrieved, real papers, (2) compares at the contribution level using full text, (3) organizes the field with an interpretable taxonomy, and (4) outputs a report with verifiable quotes.
Real Stakes: Fair reviews affect careers, research directions, and public trust. If novelty checks are weak, we reward repetition and slow down true discovery; if they are strong and transparent, we encourage honest progress and save reviewers’ time while improving decisions.
02 Core Idea
🍞 Hook: Imagine you’re a detective who only trusts clues you can hold in your hand—no guesses allowed.
🥬 The Concept (Key insight): Always ground novelty judgments in real, retrievable papers with verifiable quotes, never in the AI’s memory.
- How it works: extract claims → search widely → organize related work → compare contributions with full text → accept a refutation only if quotes from both papers pass a verification check.
- Why it matters: without grounding and verification, judgments can be wrong or uncheckable. 🍞 Anchor: A referee uses instant replay with frame-by-frame proof before making a final call.
Three Analogies:
- Librarian: 🍞 Hook: You know how a librarian doesn’t guess where a book is—they look it up in the catalog. 🥬 The Concept: OpenNovelty “catalogs” the field around your paper and checks each claim with exact page snippets.
- How it works: index, retrieve, sort, verify.
- Why it matters: no guesswork. 🍞 Anchor: Finding a quote on page 42 beats “I think I saw it somewhere.”
- Science Fair Judge: 🍞 Hook: Imagine judging projects by comparing designs and instructions, not just titles. 🥬 The Concept: Claim-level full-text comparisons catch real overlaps that abstract-only checks miss.
- How it works: read methods, training steps, and problem setups.
- Why it matters: tiny wording changes in abstracts can hide big method similarities. 🍞 Anchor: Two volcano projects both use the same pressure-release trick—that’s a meaningful overlap.
- Family Tree: 🍞 Hook: Think of building a family tree to see who’s closely related. 🥬 The Concept: A hierarchical taxonomy groups related papers so you can compare true siblings.
- How it works: cluster by approach/problem/context with MECE (mutually exclusive, collectively exhaustive) rules.
- Why it matters: comparisons are fair only when you compare near neighbors. 🍞 Anchor: You don’t compare a goalie to a violinist when judging goalie skills.
Before vs After:
- Before: LLMs guessed from memory or skimmed abstracts; structure was shallow; evidence was not verifiable.
- After: The system retrieves real papers, organizes them meaningfully, and verifies quotes token by token; reports are auditable and consistent.
Why it Works (intuition):
- Retrieval widens coverage so relevant work isn’t missed.
- Taxonomy narrows focus to the most comparable neighbors.
- Claim-level, full-text checking increases precision on what actually matters.
- Evidence verification forces honesty: only show claims backed by text that truly appears in both sources.
Building Blocks (explained with sandwiches):
🍞 Hook: Imagine comparing two recipes, not just the titles. 🥬 Full-text comparisons: Look at all the words to find real similarities and differences.
- How: align passages, check steps and ingredients.
- Why: abstracts can hide method overlaps. 🍞 Anchor: “Chocolate cake” titles can mask totally different baking steps.
🍞 Hook: You know how you list your project’s main parts before presenting? 🥬 Contribution extraction: Identify the paper’s key claimed advances.
- How: scan for “We propose/introduce/design…” and write structured summaries.
- Why: without clear claims, you can’t test novelty. 🍞 Anchor: A robot kit’s parts list helps you check what’s truly new.
🍞 Hook: Finding a book by meaning, not exact wording. 🥬 Semantic search engine: Finds papers that match the idea, even if words differ.
- How: natural-language queries with semantic variants.
- Why: researchers phrase the same idea differently. 🍞 Anchor: “Soccer” vs “football” still lands you in the right sports shelf.
🍞 Hook: Sorting trading cards by team and position. 🥬 Hierarchical taxonomy: Organizes related work into labeled branches and leaves.
- How: LLM groups by approach/problem/context with MECE rules.
- Why: comparisons only make sense among close neighbors. 🍞 Anchor: Compare strikers with strikers, not with coaches.
🍞 Hook: Like highlighting the exact sentence in the textbook. 🥬 Evidence snippets: Short quotes used as proof.
- How: extract matching text from both papers.
- Why: proof must be checkable. 🍞 Anchor: “See line 3, paragraph 2.”
🍞 Hook: A school helper that follows a to-do list. 🥬 LLM-powered agentic system: An AI that executes a multi-step plan with tools.
- How: extract → search → filter → compare → verify → report.
- Why: complex tasks need coordinated steps. 🍞 Anchor: A student uses notes, the library, and a checklist to finish a project.
🍞 Hook: A robot judge that only rules with video replay. 🥬 Automated novelty analysis: The whole pipeline that checks how new claims are.
- How: ground in real papers, verify quotes, classify judgments.
- Why: fairness and speed together. 🍞 Anchor: Quick, fair calls in a sports match with instant replay.
🍞 Hook: The final report card for an idea. 🥬 Novelty report: A structured summary with taxonomy, comparisons, and citations.
- How: assemble verified outputs into clear modules.
- Why: reviewers need readable, trusted evidence. 🍞 Anchor: A project grade sheet with comments and exact references.
03 Methodology
At a high level: Input (PDF/URL) → Phase I (Extract core task & claims; generate queries) → Phase II (Semantic search + multi-layer filtering) → Phase III (Taxonomy + full-text comparisons + evidence verification) → Phase IV (Render verifiable report).
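To make these hand-offs concrete, here is a minimal Python sketch of the intermediate data structures such a pipeline might pass between phases. The class and field names (`NoveltyJob`, `Contribution`, etc.) are illustrative assumptions, not the system's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Contribution:
    name: str              # short label for a claimed advance
    author_claim: str      # verbatim claim sentence from the paper
    normalized: str        # normalized description used to build queries
    source_hint: str       # where in the paper the claim appears

@dataclass
class NoveltyJob:
    core_task: str                                          # Phase I: 5-15 word task phrase
    contributions: list[Contribution] = field(default_factory=list)  # Phase I: up to three claims
    candidates: list[dict] = field(default_factory=list)    # Phase II: filtered related papers
    taxonomy: dict = field(default_factory=dict)            # Phase III: labeled tree over candidates
    judgments: list[dict] = field(default_factory=list)     # Phase III: per-claim verdicts + evidence
    report: str = ""                                        # Phase IV: rendered report text
```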
Phase I: Information Extraction and Query Expansion
🍞 Hook: You know how you write a summary and a shopping list before cooking?
🥬 Core idea: Extract what the paper is really about (core task) and what it claims as new (contributions), then turn them into search queries with paraphrased variants.
- How it works (step-by-step):
- Core task extraction: produce one 5–15 word phrase in field terms (e.g., “accelerating diffusion model inference”).
- Claimed contribution extraction: up to three items with name, verbatim author claim, normalized description, and source hint.
- Query generation: for the core task and each contribution, create one primary query plus two semantic variants; contribution queries are prefixed with "Find papers about …" (see the sketch after the example below).
- Why it matters: If you misread the paper’s main claims or search too narrowly, you’ll miss key prior work. 🍞 Anchor: Like listing “make tomato pasta” (core task) and “new sauce method” (contribution), then searching recipes using a few different phrasings.
Concrete example data: For a paper on training LLM agents with multi-turn RL, the core task phrase becomes “training LLM agents for long-horizon decision making via multi-turn reinforcement learning,” and a contribution query like “Find papers about reinforcement learning frameworks for training agents in multi-turn decision-making tasks,” plus variants.
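As a rough illustration of this query-expansion step, the sketch below builds one primary query and two variants per scope, assuming the paraphrases come from an LLM call passed in as a helper; `SearchQuery`, `build_queries`, and the helper are hypothetical names, not the paper's code.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class SearchQuery:
    scope: str        # "core_task" or the name of the contribution it belongs to
    text: str         # natural-language query sent to the search engine
    is_variant: bool  # False for the primary query, True for a paraphrase

def build_queries(core_task: str,
                  contributions: dict[str, str],
                  paraphrase: Callable[[str, int], list[str]]) -> list[SearchQuery]:
    """Build one primary query plus two semantic variants per scope.

    `paraphrase(text, n)` stands in for an LLM call that returns n reworded
    versions of `text`; it is an assumed helper, not part of the paper.
    """
    queries = [SearchQuery("core_task", core_task, False)]
    queries += [SearchQuery("core_task", v, True) for v in paraphrase(core_task, 2)]

    for name, description in contributions.items():
        primary = f"Find papers about {description}"  # prefix used for contribution queries
        queries.append(SearchQuery(name, primary, False))
        queries += [SearchQuery(name, v, True) for v in paraphrase(primary, 2)]
    return queries

# Example with a trivial stand-in paraphraser (a real run would call an LLM):
dummy = lambda text, n: [f"{text} (variant {i + 1})" for i in range(n)]
qs = build_queries(
    "training LLM agents for long-horizon decision making via multi-turn reinforcement learning",
    {"multi-turn RL framework": "reinforcement learning frameworks for training "
                                "agents in multi-turn decision-making tasks"},
    dummy,
)
print(len(qs))  # 6 queries: (1 primary + 2 variants) for the core task and for one contribution
```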
Phase II: Semantic Search and Multi-layer Filtering
🍞 Hook: Imagine scooping up lots of seashells first, then keeping only the best ones.
🥬 Core idea: Use a semantic search engine (Wispaper) to run all queries verbatim, then filter the results with quality flags, deduplication, a temporal-fairness cutoff, and top-K selection.
- How it works:
- Execute 6–12 natural-language queries concurrently.
- Assign quality flags (perfect/partial/no) per paper using verification verdicts; keep only perfect.
- Per scope (core task vs. each contribution), deduplicate within scope and select Top-K (up to 50 for core task, up to 10 per contribution).
- Remove self-references and any paper published after the target (temporal filter).
- Cross-scope deduplicate and merge to a final 60–80 unique candidates.
- Why it matters: Broad recall ensures you don’t miss key work; careful filtering keeps the set focused and fair. 🍞 Anchor: It’s like searching “best pasta,” “top spaghetti,” and “great noodles,” then keeping only verified, on-topic recipes published before yours.
Concrete example numbers: A target paper might pull 2,328 raw results, which reduce to about 73 unique candidates after layered filtering (∼97% filtered).
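The layered filtering can be pictured as a few simple passes over the raw results. The sketch below assumes each result is a dict with 'title', 'date', 'quality', and 'score' fields; those names, and the function itself, are illustrative assumptions rather than the system's real interface.

```python
from datetime import date

def filter_candidates(results_by_scope: dict[str, list[dict]],
                      target_date: date,
                      target_title: str,
                      k_core: int = 50,
                      k_contrib: int = 10) -> list[dict]:
    """Sketch of the multi-layer filter: quality flags, within-scope dedup,
    temporal fairness, per-scope Top-K, then cross-scope merge.

    Assumes each result dict carries 'title', 'date' (datetime.date),
    'quality' ('perfect' / 'partial' / 'no'), and a relevance 'score';
    these field names are illustrative, not the actual schema."""
    merged: dict[str, dict] = {}
    for scope, results in results_by_scope.items():
        k = k_core if scope == "core_task" else k_contrib
        kept, seen = [], set()
        for r in sorted(results, key=lambda x: x["score"], reverse=True):
            title = r["title"].strip().lower()
            if r["quality"] != "perfect":               # keep only 'perfect' quality flags
                continue
            if r["date"] >= target_date:                # temporal filter: no later papers
                continue
            if title == target_title.strip().lower():   # drop self-references
                continue
            if title in seen:                           # within-scope deduplication
                continue
            seen.add(title)
            kept.append(r)
            if len(kept) == k:                          # per-scope Top-K cut-off
                break
        for r in kept:                                  # cross-scope dedup and merge
            merged.setdefault(r["title"].strip().lower(), r)
    return list(merged.values())
```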
Phase III: Analysis & Synthesis
Subpart A: Hierarchical Taxonomy
🍞 Hook: Think of arranging books by series, then volumes, then chapters.
🥬 Core idea: Build an LLM-generated, MECE-checked taxonomy over the Top-50 core-task papers.
- How it works:
- The LLM reads titles/abstracts and proposes a labeled tree (typically 3–5 levels deep).
- Each node has a scope_note and exclude_note to keep boundaries crisp.
- Automated validation ensures every candidate appears exactly once; a repair step fixes gaps without inventing IDs.
- Why it matters: Without a clear map, you can’t fairly compare neighbors or see the field’s structure. 🍞 Anchor: A family tree makes it obvious who are siblings; you compare siblings first.
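The "appears exactly once" check can be pictured as a simple tree walk that counts where each candidate lands. The sketch below assumes nodes are dicts with optional 'children' and 'papers' keys; that schema is an assumption for illustration, not the actual taxonomy format.

```python
from collections import Counter

def validate_mece(taxonomy: dict, candidate_ids: set[str]) -> tuple[list[str], list[str]]:
    """Check that every candidate ID appears exactly once across the taxonomy.

    Assumes each node is a dict with optional 'children' (list of child nodes)
    and optional 'papers' (candidate IDs attached at that node). Returns
    (missing_ids, duplicated_ids) so a repair step can fix gaps without
    inventing new IDs."""
    counts = Counter()

    def walk(node: dict) -> None:
        for pid in node.get("papers", []):
            counts[pid] += 1
        for child in node.get("children", []):
            walk(child)

    walk(taxonomy)
    missing = sorted(candidate_ids - set(counts))
    duplicated = sorted(pid for pid, c in counts.items() if c > 1)
    return missing, duplicated
```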
Subpart B: Textual Similarity Detection
🍞 Hook: Like scanning two essays for copy-pasted paragraphs.
🥬 Core idea: Find 30+ word overlapping segments between the target and candidates, then verify both sides against source texts.
- How it works:
- LLM proposes candidate segments labeled Direct or Paraphrase.
- A token-level anchor alignment algorithm verifies each quote in both documents (confidence ≥ 0.6, 30+ words).
- Only verified segments appear in the report.
- Why it matters: Similarity can indicate versions or undisclosed reuse; verification prevents false alarms. 🍞 Anchor: Only the passages that match exactly in both books make it to the evidence list.
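The paper's verifier is a token-level anchor-alignment algorithm; the sketch below stands in for it with a much simpler sliding-window token-overlap score, just to show the gating idea (30+ words, confidence ≥ 0.6). It should be read as a rough stand-in, not the actual algorithm.

```python
import re

def _tokens(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

def verify_quote(quote: str, source_text: str,
                 min_words: int = 30, min_confidence: float = 0.6) -> bool:
    """Rough stand-in for quote verification: slide a window of the quote's
    length over the source and keep the best token-overlap ratio as a crude
    'confidence'. Thresholds mirror the paper (30+ words, >= 0.6)."""
    q = _tokens(quote)
    if len(q) < min_words:
        return False
    src = _tokens(source_text)
    window, q_set, best = len(q), set(q), 0.0
    for start in range(max(1, len(src) - len(q) + 1)):
        overlap = sum(1 for t in src[start:start + window] if t in q_set)
        best = max(best, overlap / window)
    return best >= min_confidence

def verify_segment(segment: dict, target_text: str, candidate_text: str) -> bool:
    """A similarity segment only survives if its quotes verify in BOTH papers.
    The 'target_quote'/'candidate_quote' keys are illustrative field names."""
    return (verify_quote(segment["target_quote"], target_text)
            and verify_quote(segment["candidate_quote"], candidate_text))
```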
Subpart C: Comparative Analysis
🍞 Hook: When judging two science projects, you compare the same parts (like methods) to be fair.
🥬 Core idea: Run two comparison modes—core-task sibling distinctions and contribution-level refutability.
- How it works:
- Core-task comparisons: explain how the target differs from sibling papers in the same taxonomy leaf, falling back to sibling subtopics if needed.
- Contribution-level comparisons: for each claimed contribution, compare against each candidate independently using full text.
- Three-way judgment: can_refute (requires verified quote pairs), cannot_refute, or unclear.
- Auto-downgrade can_refute to cannot_refute if quotes fail verification.
- Why it matters: Claim-level fairness plus proof keeps the process precise and trustworthy. 🍞 Anchor: A referee only calls a foul if the replay clearly shows contact.
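The evidence gate itself is easy to picture as a thin wrapper around the quote verifier: a proposed can_refute only survives if at least one quote pair verifies in both papers. The function below sketches that rule under the assumption that a verifier is supplied as a callable; it is not the system's actual code.

```python
from typing import Callable

def finalize_judgment(verdict: str,
                      evidence_pairs: list[dict],
                      quotes_verified: Callable[[dict], bool]) -> str:
    """Apply the evidence gate to an LLM-proposed contribution-level verdict.

    `verdict` is one of 'can_refute', 'cannot_refute', 'unclear';
    `quotes_verified(pair)` is an assumed checker that returns True only if
    the pair's quotes verify in both the target and the candidate paper."""
    if verdict != "can_refute":
        return verdict                      # cannot_refute / unclear pass through
    if any(quotes_verified(pair) for pair in evidence_pairs):
        return "can_refute"                 # at least one verified quote pair
    return "cannot_refute"                  # auto-downgrade: no verifiable evidence
```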
Phase IV: Report Generation
🍞 Hook: After you finish an experiment, you write a neat lab report with labeled sections.
🥬 Core idea: Render the structured JSON from Phase III into a readable report with seven modules.
- How it works:
- Modules: original paper info, core-task survey (taxonomy + 2-paragraph narrative), contribution analysis (per-claim judgments), core-task comparisons, textual similarity, references, and metadata.
- Deterministic templates ensure consistent citations, quote truncation, and indentation.
- No new LLM calls; it’s pure formatting of verified content.
- Why it matters: Final outputs must be consistent, auditable, and easy to read. 🍞 Anchor: It’s the polished science fair poster that clearly shows methods, results, and sources.
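Because Phase IV makes no LLM calls, it can be pictured as deterministic string templating over the verified JSON. The sketch below renders one module (contribution analysis) under an assumed field layout ('contributions', 'verdict', 'evidence'); the names are illustrative.

```python
def render_contribution_section(analysis: dict, quote_limit: int = 200) -> str:
    """Deterministic formatting of per-claim judgments into report text.

    Assumes `analysis` holds a 'contributions' list whose items carry 'name',
    'verdict', and 'evidence' entries with 'quote' and 'source'. Pure string
    templating over already-verified content: no model calls here."""
    lines = ["## Contribution Analysis"]
    for contrib in analysis.get("contributions", []):
        lines.append(f"### {contrib['name']} (verdict: {contrib['verdict']})")
        for ev in contrib.get("evidence", []):
            quote = ev["quote"]
            if len(quote) > quote_limit:                  # consistent quote truncation
                quote = quote[:quote_limit].rstrip() + "..."
            lines.append(f'  - "{quote}" ({ev["source"]})')
    return "\n".join(lines)
```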
Secret Sauce (what’s clever):
- Ground-first, verify-always: no parametric memory, only real papers with token-verified quotes.
- Broad recall + sharp filtering: multiple query variants plus strict quality flags.
- MECE taxonomy: interpretable structure to compare true neighbors.
- Three-way judgments with hard evidence constraints: precision over drama.
- Modular, agentic pipeline: robust, parallelizable steps that degrade gracefully if some calls fail.
04 Experiments & Results
🍞 Hook: Imagine testing a metal detector by burying different coins and seeing which ones it finds.
🥬 The Concept (Evaluation setup): The team deployed OpenNovelty on 500+ ICLR 2026 submissions to see if it could find strong prior work and provide verifiable analyses at scale.
- How it works: Each submission ran through all four phases; outputs were published online for transparency.
- Why it matters: Real-world volume and visibility test reliability better than tiny lab demos. 🍞 Anchor: It’s like running a school-wide experiment and posting all results on the bulletin board.
The Test: What did they measure and why?
- Retrieval effectiveness: Does broad semantic search plus filtering surface the truly relevant neighbors?
- Analysis quality: Do taxonomy and comparisons make sense, and are refutations backed by verifiable quotes?
- Practical usefulness: Do reports reveal related work authors may have overlooked?
🍞 Hook: Picture a race where new runners must beat experienced athletes to prove themselves.
🥬 The Concept (Baselines/competition): They compared against naïve LLM judging (memory-based) and RAG systems that mostly match titles/abstracts.
- How it works: Naïve LLMs risk hallucinated citations; abstract-only RAG misses method-level overlap.
- Why it matters: If OpenNovelty wins where others stumble—evidence and full-text comparison—it shows genuine progress. 🍞 Anchor: It’s not enough to know two essays have similar titles; you have to read and compare the paragraphs.
Scoreboard with context:
- Candidate funnel: Per submission, hundreds to thousands of raw hits are reduced to roughly 60–80 unique, high-quality candidates.
- Context: That’s like sifting sand through finer sieves until only the most relevant grains remain.
- Closely-related finds: Preliminary analysis shows the system often surfaces closely related prior work that authors overlooked or only narrowly missed.
- Context: Like getting an A+ for finding hidden references when many others get a B- by missing them.
- Evidence-backed refutations: can_refute labels only survive with verified, token-matched quote pairs.
- Context: This is stricter than typical practice and reduces false accusations.
🍞 Hook: Sometimes an experiment surprises you, like a plant that grows faster in shade than sun.
🥬 The Concept (Surprising findings): Full-text, claim-level checks changed some narratives.
- How it works: Abstracts suggested big differences, but methods revealed meaningful overlaps—or vice versa.
- Why it matters: Real novelty often lives in technical details; surface-level scans can mislead. 🍞 Anchor: Two cakes labeled “vanilla” can taste different if one hides lemon zest in the batter.
Reliability and transparency:
- All reports are public with citations and snippets, allowing community scrutiny.
- Auto-downgrading unverified refutations prioritizes fairness to authors.
- Textual similarity segments are presented for human interpretation, recognizing legitimate reasons for overlap (e.g., versions/shared authors).
Practical takeaways:
- Retrieval breadth reduces the risk of missing key neighbors.
- MECE taxonomy helps reviewers orient quickly.
- Verified-evidence gating turns novelty debates into inspectable claims rather than opinions.
- The pipeline scales to conference-level workloads while keeping judgments auditable.
05 Discussion & Limitations
🍞 Hook: Even the best microscope can’t see through walls; every tool has limits.
🥬 The Concept (Limitations): OpenNovelty is strong but not magic.
- How it works: It struggles with equations/figures, depends on the search index’s coverage, and taxonomies can vary across runs.
- Why it matters: Some novelty hinges on math or visuals; missing indexes create blind spots; variable trees can shift neighbor sets. 🍞 Anchor: If a recipe’s key is a drawing, text-only readers might miss it.
🍞 Hook: Building a treehouse needs wood, tools, and time.
🥬 The Concept (Required resources): You need access to a capable LLM, an academic semantic search engine (e.g., Wispaper), and compute for many comparisons.
- How it works: Parallel querying, verification passes, and rendering pipelines run at scale.
- Why it matters: Without these, throughput and reliability drop. 🍞 Anchor: No hammer, no nails—no treehouse.
🍞 Hook: You don’t use a telescope to read a street sign.
🥬 The Concept (When not to use): Don’t treat cannot_refute as proof of novelty, or use the report for punitive actions.
- How it works: Results reflect what was retrieved, not the entire universe of papers.
- Why it matters: Overreliance or adversarial use can harm fair review practices. 🍞 Anchor: “No sightings” on your telescope doesn’t mean the bird isn’t there.
🍞 Hook: Every mystery leaves clues for the next detective.
🥬 The Concept (Open questions): How to evaluate retrieval recall with ground truth? Can math/figures be analyzed better? How to calibrate can_refute confidence? Can hybrid taxonomy methods boost stability?
- How it works: Planned benchmarks (NoveltyBench), multi-engine search comparisons, similarity detection studies, and end-to-end user evaluations.
- Why it matters: Better measurement leads to better systems and fairer science. 🍞 Anchor: A new map helps everyone hike safer and faster next time.
06 Conclusion & Future Work
🍞 Hook: Think of a fair judge who only rules after checking the video replay from multiple cameras.
🥬 Three-sentence summary: OpenNovelty is a four-phase, agentic system that evaluates scholarly novelty by grounding every judgment in retrieved, real papers and verified evidence quotes. It extracts claims, retrieves broadly with semantic variants, organizes the field via an interpretable taxonomy, and runs claim-level full-text comparisons that only count if token-verified. The result is a transparent, checkable report that helps reviewers make fairer, more consistent decisions at scale. 🍞 Anchor: It’s the difference between “I think I saw a foul” and “Here’s the slow-motion replay with timestamps.”
Main achievement: Turning novelty assessment from memory-based opinion into evidence-verified analysis, complete with structured organization and audit-friendly outputs.
Future directions: Build NoveltyBench for ground-truth evaluation; compare multiple search engines; improve math/visual understanding; stabilize taxonomies; calibrate refutation confidence; and run user studies to measure review quality and efficiency gains.
Why remember this: Because trustworthy science needs trustworthy tools. By insisting on real citations, full-text comparisons, and quote verification, OpenNovelty raises the bar for fairness and transparency in peer review—helping fresh ideas shine while giving reviewers the proof they need.
Practical Applications
- •Assist reviewers in locating overlooked related work for specific claims before making novelty judgments.
- •Support authors in pre-submission checks to find and cite near-duplicate or closely related methods.
- •Help area chairs spot clusters of overlapping submissions within a taxonomy to route experts effectively.
- •Provide research mentors with structured maps of a subfield for onboarding students quickly.
- •Enable ethics and compliance checks by flagging high textual overlap for human review.
- •Power literature reviews with claim-level, evidence-backed comparisons instead of surface summaries.
- •Benchmark search engines on recall of reviewer-cited prior work using the proposed NoveltyBench.
- •Guide meta-analyses by grouping methods and tasks through interpretable taxonomies.
- •Improve rebuttal quality by linking disagreements to specific, verified quotes from both sides.