Papers (2)

#preference accuracy

RubricBench: Aligning Model-Generated Rubrics with Human Standards

Qiyuan Zhang, Junyi Zhou et al. · Mar 2 · arXiv

RubricBench is a new benchmark that checks whether AI judges can use clear, checklist-style rules (rubrics) the way humans do.

#RubricBench #rubric-guided evaluation #reward models

Learning Query-Specific Rubrics from Human Preferences for DeepResearch Report Generation

Intermediate

Changze Lv, Jie Zhou et al. · Feb 3 · arXiv

DeepResearch agents write long, evidence-based reports, but teaching and grading them is hard because there is no single 'right answer' to score against.

#DeepResearch #query-specific rubrics #human preference learning