Papers (4)

#Exact Match

SAGE: Benchmarking and Improving Retrieval for Deep Research Agents

Intermediate

Tiansheng Hu, Yilun Zhao et al. · Feb 5 · arXiv

SAGE is a benchmark that measures how well AI research agents retrieve scientific papers when questions require multi-step reasoning.

#SAGE benchmark #scientific literature retrieval #deep research agents

AT$^2$PO: Agentic Turn-based Policy Optimization via Tree Search

Intermediate

Zefang Zong, Dingwei Chen et al. · Jan 8 · arXiv

AT$^2$PO is a method for training AI agents that operate over multiple turns, such as issuing a web query, reading the result, and trying again.

#Agentic Reinforcement Learning #Turn-level Optimization #Tree Search

EpiQAL: Benchmarking Large Language Models in Epidemiological Question Answering for Enhanced Alignment and Reasoning

Intermediate

Mingyang Wei, Dehai Min et al. · Jan 6 · arXiv

EpiQAL is a benchmark that tests how well AI models answer population-level disease questions using real research papers.

#Epidemiological reasoning #Question answering #Benchmarking LLMs

Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows

Intermediate

Wanghan Xu, Yuhao Zhou et al. · Dec 18 · arXiv

The paper defines Scientific General Intelligence (SGI) as the ability of an AI to do science like a human scientist across the full loop: study, imagine, test, and understand.

#Scientific General Intelligence #Practical Inquiry Model #Scientist-aligned benchmark
