PaperSearchQA: Learning to Search and Reason over Scientific Papers with RLVR

Intermediate
James Burgess, Jan N. Hansen, Duo Peng et al. · 1/26/2026
arXiv · PDF

Key Summary

  • This paper teaches a language-model agent to look up facts in millions of scientific paper summaries and answer clear, single-answer questions.
  • It builds a huge biomedical search library (16 million PubMed abstracts) and a new question set called PaperSearchQA with 60,000 fact-based Q&A pairs.
  • The agent learns with reinforcement learning with verifiable rewards (RLVR), which only gives points when the final answer exactly matches the truth or a synonym.
  • Compared to regular prompting and standard retrieval-augmented generation (RAG), the RL-trained agent answers more questions correctly.
  • On their main test, the RL agent raised accuracy to about 51% with a 7B model, much better than strong baselines around 30–37%.
  • The team also provides fair benchmarks, including a cleaned-up BioASQ factoid set with synonym lists to check exact matches.
  • Interesting behaviors emerge: the agent plans simple searches, reasons before searching, and self-verifies when it already knows the answer.
  • Semantic retrieval (dense embeddings) helped only a little compared to classic keyword search (BM25) in this technical domain.
  • Paraphrasing questions (changing wording) made the task harder and prevented easy keyword matching.
  • The resources are open (on Hugging Face) and the method can expand to other sciences like chemistry and materials science.

Why This Research Matters

Scientists and doctors are flooded with new papers every day, and finding one exact fact can take a long time. This work shows how to train an AI helper that can quickly look up precise, checkable answers in a large scientific library. Because the reward is verifiable and automated, the system can scale to many questions without costly human grading. The benchmark and datasets are open, so others can improve the agent and apply it to new fields like chemistry or materials science. Over time, smarter search agents can help researchers design better experiments, compare findings, and spot reliable evidence. Ultimately, this reduces time-to-discovery and makes scientific knowledge more accessible.

Detailed Explanation


01Background & Problem Definition

🍞 Hook: Imagine you’re in a giant library looking for one exact fact, like the name of a gene. There are millions of books, and you have only a few minutes. You’d want a smart helper who can search well and double-check answers.

🥬 The Concept: Large Language Models (LLMs)

  • What it is: LLMs are computer programs that read and write human-like text.
  • How it works: 1) They learn patterns from huge amounts of text. 2) They predict the next word in a sentence. 3) They use that skill to answer questions and explain things. 4) They can follow instructions and call tools, like search engines.
  • Why it matters: Without LLMs, computers can’t understand questions or summarize the key facts we need. 🍞 Anchor: When you ask, “What gene is mutated in childhood retinoblastoma?”, an LLM understands you want a specific gene name (RB1) rather than a long story.

🍞 Hook: You know how detectives look up records and connect clues before naming the culprit? A search agent does that with text.

🥬 The Concept: Search Agents

  • What it is: A search agent is an LLM that can plan, search a database, read results, and then answer.
  • How it works: 1) Read the question. 2) Plan a search query (keywords). 3) Retrieve matching documents. 4) Read them. 5) Give the final answer. 6) Optionally repeat until confident (a code sketch of this loop follows after this list).
  • Why it matters: Without search, the agent might guess or rely on memory that could be wrong or incomplete. 🍞 Anchor: For “Which algae causes mastitis in cattle?”, the agent searches papers, finds “Prototheca,” and returns that as the answer.
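
To make the loop concrete, here is a minimal Python sketch of a search agent. The `llm` and `retrieve` callables are hypothetical stand-ins for the model call and the BM25/e5 retriever described later; the paper's agent drives this loop through special tags learned during RL training rather than explicit Python control flow.

```python
# Minimal search-agent loop sketch: plan a query, retrieve, read, answer.
# `llm` and `retrieve` are hypothetical stand-ins, not the paper's code.

def answer_question(question, llm, retrieve, max_rounds=3):
    evidence = []
    draft = ""
    for _ in range(max_rounds):
        # 1-2) Read the question and plan a keyword query.
        query = llm(f"Write a short keyword search query for: {question}")
        # 3) Retrieve matching abstracts from the corpus.
        evidence.extend(retrieve(query, top_k=3))
        # 4-5) Read the evidence and attempt a final, single-entity answer.
        draft = llm(
            f"Question: {question}\nEvidence:\n" + "\n".join(evidence)
            + "\nAnswer with one exact entity, or reply SEARCH MORE."
        )
        # 6) Repeat only if the model is not yet confident.
        if "SEARCH MORE" not in draft:
            return draft
    return draft
```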

🍞 Hook: Think of “factoid questions” like quiz-bowl buzzers—short, exact answers.

🥬 The Concept: Factoid Question Answering (QA)

  • What it is: Factoid QA asks for one clear item, like a gene name, a disease, or a location in the body.
  • How it works: 1) Ask a precise question. 2) Search relevant text. 3) Pull out the one correct entity. 4) Check exact-match or synonyms.
  • Why it matters: Without single, checkable answers, training an agent with simple rewards becomes messy and ambiguous. 🍞 Anchor: “What gene is mutated in Sickle Cell Anemia?” Answer: HBB.

🍞 Hook: When you Google something, you type keywords; the search engine finds pages with those words. That’s like one of our tools.

🥬 The Concept: BM25 and e5 Retrieval Indexes

  • What it is: BM25 is a classic keyword search; e5 is a semantic (meaning-based) search using embeddings.
  • How it works: 1) BM25 scores documents by shared words and their importance. 2) e5 turns text into vectors and finds similar meanings even if the words differ. 3) Both return top-k likely matches for your query (see the toy comparison sketched after this list).
  • Why it matters: Without good retrieval, the agent can’t find the right paper, so it can’t answer correctly. 🍞 Anchor: If the question is paraphrased, e5 may still find the right abstract even when BM25’s exact keywords don’t match.
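
A toy contrast between the two retrievers, using the rank_bm25 and sentence-transformers libraries over a two-abstract corpus. The exact e5 checkpoint below is an assumption for illustration; the paper's real indexes cover 16 million abstracts.

```python
# Keyword (BM25) vs. semantic (e5) retrieval over a toy two-abstract corpus.
# Illustration only; the e5 checkpoint name is an assumption.
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

abstracts = [
    "Prototheca zopfii is an algal cause of bovine mastitis.",
    "RB1 mutations underlie hereditary retinoblastoma in children.",
]

# BM25: scores documents by shared words and how informative those words are.
bm25 = BM25Okapi([a.lower().split() for a in abstracts])
print(bm25.get_top_n("algae mastitis cattle".split(), abstracts, n=1))

# e5: embeds text so paraphrases land close together even without word overlap
# ("cow udders" vs. "bovine mastitis"). Note e5's query:/passage: prefixes.
model = SentenceTransformer("intfloat/e5-base-v2")
doc_vecs = model.encode([f"passage: {a}" for a in abstracts],
                        normalize_embeddings=True)
q_vec = model.encode("query: which microorganism infects cow udders?",
                     normalize_embeddings=True)
print(abstracts[int(np.argmax(doc_vecs @ q_vec))])
```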

🍞 Hook: Imagine getting a gold star only if your final answer is exactly correct—no points for showing work.

🥬 The Concept: Reinforcement Learning with Verifiable Rewards (RLVR)

  • What it is: RLVR trains an agent by giving a reward only if its final answer is verifiably correct.
  • How it works: 1) Agent reads the question and can search. 2) Agent outputs a final answer. 3) A simple checker compares the answer to the truth (or synonyms). 4) Reward 1 for a match, 0 otherwise. 5) The agent updates its behavior to get more matches next time (a sketch of such a checker appears after this list).
  • Why it matters: Without a crisp reward (right/wrong), learning can become fuzzy or depend on expensive human labels. 🍞 Anchor: If the gold answer is “apolipoprotein C-III,” the reward triggers for “APOC3,” “apoC-III,” or “apolipoprotein C3.”
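
A minimal sketch of such a checker in Python. The normalization rules here (lowercasing, dropping articles and punctuation) are assumptions chosen to match the description above; the paper's exact matcher may differ in detail.

```python
import re

def normalize(text: str) -> str:
    """Lowercase, drop articles and punctuation, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    text = re.sub(r"[^a-z0-9 ]", " ", text)
    return " ".join(text.split())

def verifiable_reward(prediction: str, golden_answers: list[str]) -> int:
    """Reward 1 only if the prediction matches any gold answer or synonym."""
    pred = normalize(prediction)
    return int(any(pred == normalize(gold) for gold in golden_answers))

# The gold list already carries accepted synonyms, so wording differences
# like "APOC3" vs. "apolipoprotein C-III" do not cost the agent its reward.
gold = ["apolipoprotein C-III", "APOC3", "apoC-III", "apolipoprotein C3"]
print(verifiable_reward("ApoC-III", gold))  # 1
print(verifiable_reward("APOC1", gold))     # 0
```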

🍞 Hook: Picture having a super-sized science encyclopedia focused on health and biology.

🥬 The Concept: PubMed Abstract Corpus

  • What it is: A collection of over 16 million short summaries (abstracts) of biomedical papers.
  • How it works: 1) Titles + abstracts are indexed. 2) A retriever quickly finds likely matches. 3) The agent reads these to answer.
  • Why it matters: Without a large, high-quality science library, the agent can’t learn real research search skills. 🍞 Anchor: To answer “Which kinase complex is essential for cytokine signaling in cutaneous T-cell lymphoma?”, the agent searches PubMed abstracts and finds “Jak1/Jak3.”

🍞 Hook: When you play a game with a score, you get better by seeing which moves help you win. RL training does that for the agent.

🥬 The Concept: Search-R1 Training

  • What it is: A recent RLVR recipe for training LLMs to interleave reasoning and search.
  • How it works: 1) Minimal instructions tell the agent how to think, search, and answer. 2) The agent tries answers; correct ones get reward. 3) A policy update (GRPO) nudges the model toward successful patterns.
  • Why it matters: Without this training, the model might not learn robust, flexible search habits. 🍞 Anchor: After training, the 7B model’s accuracy jumps from mid-30s with RAG to about 51% with RLVR on the new benchmark.

The World Before: Most search-agent research focused on general trivia—great for “capital of France,” not so great for complex biomedical facts. Systems often used heavy hand-crafted scaffolds or supervision, which can limit generalization.

The Problem: Scientists need agents that can understand technical terms, form smart queries, and judge evidence from scientific papers—skills not fully covered by general trivia datasets.

Failed Attempts: Pure prompting or supervised fine-tuning helped, but often struggled to generalize well. Many datasets had fuzzy or long-form answers, hard to auto-check for RL.

The Gap: We lacked a large, verifiable, science-focused environment where an agent could learn to search and reason with clean, automatic right/wrong signals.

Real Stakes: In daily life, better literature search saves scientists time, helps doctors and researchers find reliable facts, and lays groundwork for future “AI Scientist” assistants that can plan studies and analyze evidence.

02Core Idea

🍞 Hook: You know how it’s easier to practice piano when the app lights up the right notes and instantly tells you if you played them correctly? That tight feedback helps you improve fast.

🥬 The Concept: The “Aha!” Moment

  • What it is (one sentence): If we train a search-capable LLM in a scientific library and only reward it when its final answers exactly match verified truths, it will learn to plan queries, retrieve the right papers, and extract precise facts better than standard methods.
  • How it works: 1) Build a big, realistic science library (PubMed abstracts). 2) Create many clear, single-answer questions with synonym lists. 3) Let the agent think, search, and answer. 4) Reward only exact matches (or synonyms). 5) Update the model so it repeats the good patterns.
  • Why it matters: Without this environment and reward, the agent won’t reliably learn technical search and precise answering. 🍞 Anchor: After RLVR training in this setup, the 7B agent scores about 51% on the new test vs ~36% with a strong RAG baseline.

Multiple Analogies:

  1. Treasure Hunt: The map (retriever) helps you find the spot; the X (exact-match reward) confirms if you truly found the treasure. If you get the treasure, you keep your strategy; if not, you adjust.
  2. Cooking Class: The teacher only grades the finished dish by tasting it (verifiable reward). You learn which recipe steps matter (planning and search) to get a pass.
  3. Basketball Practice: You only get points when the ball goes through the hoop (exact match). Over time, you learn better plays (queries) and shots (answers).

Before vs After:

  • Before: Agents often answered based on memory or brittle prompts; training data was general trivia; rewards were unclear or required human labels.
  • After: Agents learn in a real science library, get simple pass/fail signals, and develop routines for planning searches, reading abstracts, and answering exactly.

Why It Works (intuition):

  • Verifiable rewards keep learning clean: no fuzzy grading. The model gets clear signals about success.
  • A realistic environment (millions of abstracts) forces good habits: query rewriting, keyword focus, and quick evidence checks.
  • Synonym lists remove trickiness about wording, so the agent learns the concept, not just the spelling.

Building Blocks (broken down):

🍞 Hook: Imagine writing quiz questions from short encyclopedia entries, then checking if players can find and answer them.

🥬 The Concept: Dataset Construction Pipeline

  • What it is: A step-by-step way to turn abstracts into many high-quality, single-answer questions.
  • How it works: 1) Sample abstracts. 2) Use an LLM to write three factoid QAs per abstract, guided by expert-made categories. 3) Expand answers with synonyms. 4) Paraphrase some questions to make keyword matching harder. 5) Expert review and filtering. 6) Split into train/test (a code-shaped sketch of this pipeline follows after this list).
  • Why it matters: Without clear, scalable questions and synonyms, RLVR can’t train reliably in science domains. 🍞 Anchor: From an abstract about hydrocephalus, generate: “Which birth defect can cause unilateral hydrocephalus?” Answer: “Foramen of Monro atresia.”
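
A sketch of the pipeline's shape in Python. The `call_llm` helper and the prompt wording are illustrative assumptions; the paper uses GPT-4.1 with expert-refined prompts and categories, plus a review step not shown here.

```python
# Sketch of the QA-generation pipeline over sampled abstracts.
# `call_llm` and the prompts are illustrative assumptions, not the paper's code.
import json
import random

def build_qa_dataset(abstracts, categories, call_llm, paraphrase_frac=0.5):
    records = []
    for abstract in abstracts:
        # Step 2: three single-answer factoid QAs per abstract, steered by categories.
        qas = json.loads(call_llm(
            f"Write 3 single-answer factoid questions (categories: {categories}) "
            f"answerable from this abstract, with their answers, as JSON:\n{abstract}"
        ))
        for qa in qas:
            # Step 3: expand the gold answer with accepted synonyms and variants.
            qa["golden_answers"] = json.loads(call_llm(
                f"List accepted synonyms for the entity '{qa['answer']}' as a JSON array."
            ))
            # Step 4: paraphrase roughly half of the questions to break
            # trivial keyword overlap with the source abstract.
            if random.random() < paraphrase_frac:
                qa["question"] = call_llm(
                    f"Reword this question without reusing its key phrases: {qa['question']}"
                )
            records.append(qa)
    # Steps 5-6 (expert review, filtering, train/test split) happen downstream.
    return records
```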

🍞 Hook: You know how a phone’s search can find photos of “dogs” even if the picture file is named something else? That’s like semantic search.

🥬 The Concept: Dual Indexing (BM25 and e5)

  • What it is: Two search tools—one keyword-based (BM25), one meaning-based (e5)—to fetch candidate abstracts.
  • How it works: 1) BM25 finds exact-word matches efficiently. 2) e5 finds related meanings via embeddings. 3) The agent queries; the system returns top matches for it to read.
  • Why it matters: Without fast, accurate retrieval, even a smart model can’t find the right evidence. 🍞 Anchor: A paraphrased question still reaches the right abstract through e5 even when BM25’s exact keywords differ.

🍞 Hook: Think of a game that gives you a point only when your final answer card matches the answer key.

🥬 The Concept: RLVR Training inside Search-R1

  • What it is: A training loop where the agent thinks, searches, and answers; a reward fires only on exact matches (or synonyms), and the model updates.
  • How it works: 1) Minimal system prompt teaches ‘think’, ‘search’, and ‘answer’ tags. 2) Roll out multiple attempts per question. 3) Compare to gold answers. 4) Use GRPO to adjust the policy toward successes.
  • Why it matters: Without this loop, the agent might not consistently learn planning, retrieval, and precise answering. 🍞 Anchor: Over training, traces show the agent extracting keywords, forming cleaner queries, and verifying before finalizing.

🍞 Hook: If you can grade the final product automatically, you can scale learning quickly.

🥬 The Concept: Verifiable Benchmarks (PaperSearchQA and BioASQ-factoid)

  • What it is: Clean test sets with synonym lists so exact-match grading is fair and robust.
  • How it works: 1) PaperSearchQA: 60k samples (train/test split built-in). 2) BioASQ-factoid: human-written, smaller but high quality. 3) Both use exact-match with synonyms.
  • Why it matters: Without solid tests, you don’t know if the agent really improved. 🍞 Anchor: On BioASQ-factoid, the RL-trained agent beats non-RL baselines while using the same retriever and base model.

03Methodology

High-level Overview: Input (Question) → Think (plan) → Search (BM25/e5) → Read (top documents) → Answer (exact entity) → Reward (1/0) → Update model.

Step 1: Build the Scientific Playground (Corpus + Indexes)

  • What happens: Gather 16 million PubMed abstracts (title + abstract) and index them with BM25 (keyword) and e5 (semantic) retrievers. Keep everything memory- and GPU-ready (e5 index ~93GB; BM25 ~2.6GB; corpus ~23GB).
  • Why this step exists: Without a large, real scientific library, the agent can’t practice real search. Without indexes, search would be too slow.
  • Example: A query about “Jak1/Jak3 kinase complex” should retrieve abstracts mentioning cytokine signaling and T-cell lymphoma.

Step 2: Design Useful Question Types (Categories)

  • What happens: Experts and LLMs propose candidate categories, then experts merge them into 10 practical groups (e.g., genetic mutations, therapeutics, methods, biomarkers, anatomy).
  • Why this step exists: Categories steer question generation toward real-world, unambiguous factoids scientists care about.
  • Example: In “Methods & resources,” a question like “What in vitro assay measures cancer cell invasion through basement membrane?” Answer: “Matrigel-coated filter invasion assay.”

Step 3: Generate Factoid QAs from Abstracts

  • What happens: For each sampled abstract, an LLM (GPT-4.1) creates three clear, single-answer questions that do not rely on reading the paper itself (no “this study” phrasing). Experts iteratively refine prompts.
  • Why this step exists: RLVR needs lots of precise, verifiable questions. Human-only authoring doesn’t scale to tens of thousands.
  • Example: From a RYMV abstract, generate: “In vitro, which rice cell type is most used to assess viral replication?” Answer: “Rice protoplasts.”

Step 4: Add Synonym Lists (Golden Answers)

  • What happens: Another LLM call expands each gold answer with synonyms and variants (e.g., capitalization, hyphens, common aliases).
  • Why this step exists: Many entities have multiple accepted names. Exact-match grading would be unfair without synonyms.
  • Example: “APOC3” includes “apolipoprotein C-III,” “apoC-III,” “apoCIII,” “apolipoprotein C3,” etc.

Step 5: Paraphrase Half the Questions

  • What happens: About 50% of questions are reworded to avoid easy keyword overlap with the source abstract while keeping the same meaning and answer.
  • Why this step exists: Prevents the task from being too easy for purely lexical matching (BM25-only), encouraging genuine understanding and search skill (the overlap sketch below shows how little wording survives a paraphrase).
  • Example: Original: “What congenital abnormality can cause unilateral hydrocephalus?” Paraphrase: “Which birth defect may lead to hydrocephalus on one brain side?”
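
A tiny illustration of why paraphrasing matters: the word overlap a keyword matcher like BM25 relies on nearly vanishes. `token_overlap` is an illustrative helper, not a metric from the paper.

```python
def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of word sets: high for lexical copies, low for paraphrases."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    return len(words_a & words_b) / len(words_a | words_b)

original = "what congenital abnormality can cause unilateral hydrocephalus"
paraphrase = "which birth defect may lead to hydrocephalus on one brain side"
print(token_overlap(original, original))    # 1.0  (trivially findable by keywords)
print(token_overlap(original, paraphrase))  # ~0.06 (only "hydrocephalus" is shared)
```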

Step 6: Split, Publish, and Prepare Evaluation Sets

  • What happens: Create a 54,907 train / 5,000 test split for PaperSearchQA. Also collect BioASQ-factoid (1,609 samples), add synonyms, and release both on Hugging Face.
  • Why this step exists: Clear train/test boundaries avoid leakage. BioASQ adds a human-authored, respected benchmark.
  • Example: A test-time question about “Wells criteria” must exactly match “pulmonary embolism” or agreed synonyms.

Step 7: Minimal Agent Interface (System Prompt)

  • What happens: Provide the LLM a simple instruction format: reason inside <think>…</think>, search via <search>…</search>, and output final <answer>…</answer>. When the agent emits a <search> block, the system retrieves top-k docs and injects them inside <information>…</information> tags (a sketch of this interface appears after this list).
  • Why this step exists: Keeps the scaffold small so the agent can learn its own strategies during RL, improving generalization.
  • Example: The agent writes: <think>Plan keywords: mastitis, algae, cattle.</think><search>algae mastitis cattle</search> → gets abstracts → <answer>Prototheca</answer>.
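
A sketch of how the environment side of this interface might work: parse the agent's tags, inject retrieved abstracts, and stop when an answer appears. The tag names come from the paper; the parsing code and the `llm_generate`/`retrieve` helpers are illustrative assumptions.

```python
import re

def run_rollout(question, llm_generate, retrieve, max_turns=4):
    """Drive one think/search/answer episode by parsing the agent's tags."""
    transcript = f"Question: {question}\n"
    for _ in range(max_turns):
        step = llm_generate(transcript)  # model emits <think>/<search>/<answer> text
        transcript += step
        answer = re.search(r"<answer>(.*?)</answer>", step, re.S)
        if answer:
            return answer.group(1).strip(), transcript
        query = re.search(r"<search>(.*?)</search>", step, re.S)
        if query:
            # Inject top-k retrieved abstracts so the next turn can read them.
            docs = retrieve(query.group(1).strip(), top_k=3)
            transcript += "<information>" + "\n".join(docs) + "</information>\n"
    return None, transcript  # no final answer emitted -> reward will be 0
```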

Step 8: RLVR Training (Search-R1 with GRPO)

  • What happens: The agent runs rollouts per question, interleaving thinking, searching, reading, and answering. The verifier gives reward 1 if the final answer matches any synonym after normalization (lowercasing, trimming, removing articles), else 0. Group Relative Policy Optimization (GRPO) updates the policy versus a reference model with a KL penalty (a minimal sketch of the group-relative advantage appears after this list).
  • Why this step exists: Outcome-only rewards encourage the agent to discover robust, general tactics without micromanaged labels.
  • Example: Across attempts, the model learns to extract keywords, adjust queries (e.g., add specific protein names), and delay answering until it has sufficient evidence.
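
Roughly, GRPO scores each rollout against the other rollouts for the same question, so no learned value function is needed. Below is a minimal sketch of the group-relative advantage computation; the full update also applies token-level policy ratios and the KL penalty to the reference model, which are not shown.

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-6):
    """Group-relative advantages: each reward minus the group mean, scaled by std."""
    rewards = np.asarray(group_rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Eight rollouts for one question; only three found the exact answer (reward 1).
print(grpo_advantages([1, 0, 0, 1, 0, 0, 1, 0]))
# Correct rollouts get a positive advantage and are reinforced; incorrect ones are
# pushed down. The clipped-ratio policy update and KL term are omitted here.
```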

Step 9: Baselines for Fair Comparison

  • What happens: Compare to direct LLM inference (no retrieval), chain-of-thought prompting, RAG (retrieve then generate), Search-o1, and PaperQA2. Keep the same base models (Qwen2.5 3B/7B) and same retrievers for fairness.
  • Why this step exists: To show whether RLVR actually improves performance beyond strong non-RL methods.
  • Example: The 7B RAG baseline scores ~36.5% on PaperSearchQA; the RLVR-trained agent reaches ~51.0%.

The Secret Sauce:

  • Verifiable rewards + synonym-expanded gold answers transform scientific QA into a clean, scalable learning signal.
  • Minimal scaffolding lets the agent learn flexible behaviors (planning, verification) rather than memorize a fixed script.
  • Paraphrased questions ensure the agent practices real search and reasoning rather than brittle keyword tricks.

Concrete Data Flow Example:

  • Input: “Which kinase complex is essential for cytokine-induced signaling in cutaneous T-cell lymphoma?”
  • Think: “Keywords: kinase complex, cytokine signaling, cutaneous T-cell lymphoma; likely Jak kinases.”
  • Search: “Jak kinase complex cytokine signaling CTCL” → top abstracts mention “Jak1/Jak3.”
  • Read: Confirm role in cytokine signaling pathways.
  • Answer: “Jak1/Jak3 kinase complex.”
  • Reward: 1 (matches gold or synonyms).

04Experiments & Results

The Test: They measured exact-match accuracy—does the agent’s final answer exactly match any gold answer or its synonyms after normalization? This clean metric fits RLVR perfectly.

The Competition (Baselines):

  • Direct LLM (no retrieval): The model guesses from memory.
  • Chain-of-Thought prompting (CoT): Encourages step-by-step text, but no special training.
  • RAG: Retrieve top documents, then generate an answer.
  • Search-o1: A search-enhanced reasoning baseline.
  • PaperQA2: A strong literature agent, here run with the same retrievers and base models for fairness.

Scoreboard (with context):

  • PaperSearchQA (test set):

    • Qwen2.5-7B-Instruct: Direct 27.5%; CoT 29.7%; RAG 36.5%; Search-o1 36.5%; PaperQA2 37.1%; RLVR (Search-R1) 51.0%.
    • Qwen2.5-3B-Instruct: Direct 16.7%; CoT 20.3%; RAG 32.0%; Search-o1 30.8%; PaperQA2 32.4%; RLVR (Search-R1) 41.6%.
    • Interpretation: 51.0% is like jumping from a solid B- (~36%) to a strong A- or A (~51%) when everyone else stayed around the mid-30s.
  • BioASQ-factoid (human-authored):

    • Qwen2.5-7B-Instruct: Direct 24.9%; CoT 23.4%; RAG 29.7%; Search-o1 31.5%; PaperQA2 32.8%; RLVR (Search-R1) 44.8%.
    • Qwen2.5-3B-Instruct: Direct 15.8%; CoT 16.5%; RAG 30.0%; Search-o1 29.4%; PaperQA2 33.1%; RLVR (Search-R1) 35.5%.
    • Interpretation: Even on a respected human-written benchmark, the RL-trained agent pulls notably ahead.

Per-Category Patterns (high level):

  • Easiest: “Biomarkers & diagnostics,” “Protein function & signaling” often scored higher.
  • Hardest: “Genetic mutations” tended to be more challenging.

Surprising Findings:

  1. Semantic vs Syntactic Retrieval: Using e5 (semantic) helped only a little—within ~2 points—over BM25 in this domain. Likely reasons: scientific terms are specific (keywords matter), and current semantic models may not fully capture biomedical nuance.
  2. Strong Parametric Knowledge: Even without retrieval, larger models did reasonably well (e.g., the 7B model scores 27.5% with direct prompting and 29.7% with CoT on PaperSearchQA), suggesting they memorized some PubMed facts during pretraining. Still, retrieval clearly boosts accuracy.
  3. Model Size Matters: Gains from 3B to 7B were consistent across methods, but RLVR’s margin over CoT was large (~20 points), indicating gains come from better knowledge use rather than just prettier reasoning text.
  4. Paraphrasing Raises Difficulty: Non-paraphrased questions scored 57.2% vs paraphrased 44.9% under RLVR—proof that paraphrasing prevents trivial keyword matches and makes the benchmark more realistic.
  5. Training Dynamics: Base vs instruct models converged similarly, though base needed more steps. GRPO could be unstable, with occasional reward collapse in some runs—training stability is an open engineering challenge.

Behavioral Insights (from traces):

  • Planning and Keyword Extraction: The agent increasingly writes down key terms first, then forms a targeted query.
  • Reasoning Before Search: Sometimes it sketches a likely answer from memory, then searches to confirm.
  • Self-Verification: Even when it “knows,” it still searches to double-check, a healthy habit for science.
  • Less Varied Behavior Over Time: Because the task is factoid QA, the learned routine becomes streamlined: plan → search → answer.

Bottom Line: RLVR in a science-focused environment beats strong non-RL baselines by a clear margin and encourages helpful behaviors like planning and verification.

05Discussion & Limitations

Limitations (be specific):

  • Scope: Only single-hop factoid QA over abstracts. No multi-hop reasoning, no lists of answers, and no long-form synthesis—yet these are common in real research tasks.
  • Domain Coverage: Biomedical only; chemistry, materials science, and computer science are not included (though the pipeline can extend).
  • Synthetic QA Risks: LLM-generated questions can occasionally reflect hallucinations or narrow claims from single abstracts; expert review reduced this but did not eliminate it.
  • Tooling and Stability: Dense retrieval requires large GPU memory; GRPO training can be unstable in some runs.
  • Prototype Status: Not production-ready; more evaluation and safety checks are needed for real deployments.

Required Resources:

  • Hardware: At least two 80GB GPUs for e5 retrieval inference; multiple A100s (80GB) for RL training at the reported scale.
  • Software: Search-R1-compatible setup (VERL framework), BM25/e5 indexes, access to the released datasets.
  • Expertise: Familiarity with RL training loops, retrieval tuning, and biomedical terminology.

When NOT to Use:

  • If you need long evidence summaries, comparisons across conflicting studies, or systematic reviews—the system is trained for short, single-entity answers.
  • If your questions span multiple documents in complex chains (multi-hop), or require images/figures.
  • If you lack sufficient compute to run dense retrieval or RL training.

Open Questions:

  • How to scale to multi-hop, list answers, and long-form reasoning while keeping verification reliable?
  • Can we design better biomedical semantic retrievers that clearly beat BM25?
  • How to measure source reliability, handle retractions, and surface uncertainty in answers?
  • Can confidence-aware agents skip search when they truly “know,” and search only when needed?
  • How to extend beyond text to figures and tables for deeper evidence checks?

Honest Assessment: This work is a strong step toward practical scientific search agents. It shows that simple, verifiable rewards inside a realistic scientific library lead to better skills than standard prompting or RAG alone. But real lab workflows need multi-document synthesis, quality assessment, and multimodal evidence—all exciting next steps.

06Conclusion & Future Work

Three-Sentence Summary: The authors built a realistic scientific search playground—16 million PubMed abstracts plus 60k clear, single-answer questions—so an LLM agent can practice searching and answering with verifiable rewards. Training with RLVR (Search-R1) significantly beats strong non-RL baselines on both the new PaperSearchQA test and the BioASQ-factoid benchmark. The agent learns helpful habits like planning queries and self-verifying, laying groundwork for future AI Scientist systems.

Main Achievement: Showing that outcome-only RL in a science-focused environment measurably improves an LLM’s ability to search papers and deliver precise, verifiable facts.

Future Directions:

  • Expand beyond single-hop factoids to multi-hop, list answers, and long-form synthesis with graded or rubric-based rewards.
  • Improve biomedical semantic retrieval and add confidence/uncertainty and source-reliability signals.
  • Incorporate multimodal evidence (figures/tables) and domain-specific tools like citation traversal.
  • Extend the pipeline to other scientific fields (chemistry, materials science, CS) at similar scale.

Why Remember This: It demonstrates a clean recipe—big real corpus + verifiable fact questions + RLVR—that reliably teaches agents to plan searches and extract exact scientific facts. This is a practical building block for future AI assistants that help scientists find, verify, and use knowledge quickly and carefully.

Practical Applications

  • Rapid fact lookup for clinicians (e.g., biomarkers, diagnostic scales) during decision support.
  • Lab assistants that fetch exact protocol details or method names from literature (e.g., in vitro assays).
  • Drug and indication cross-checks for pharmacology teams using synonym-aware exact matching.
  • Grant and paper writing aids that verify single-entity facts before inclusion.
  • Educational tools that quiz students with paraphrased, factoid biomedical questions.
  • Curation pipelines that auto-tag papers with entities (genes, diseases, methods) using verified answers.
  • Knowledge base construction that ingests exact matches from abstracts to populate entity links.
  • Research onboarding tools that answer common, precise lab questions (e.g., cell lines, assays).
  • Pharmaceutical safety teams verifying specific adverse event terms tied to therapies.
  • Semantic search tuning benchmarks for biomedical retrieval systems using paraphrased questions.
#RLVR · #search agents · #PaperSearchQA · #PubMed abstracts · #BM25 · #e5 embeddings · #retrieval-augmented generation · #exact-match accuracy · #BioASQ · #GRPO · #synonym expansion · #paraphrasing · #Qwen2.5 · #Search-R1 · #biomedical QA