RubricBench is a new benchmark that checks whether AI judges can use clear, checklist-style rules (rubrics) the way humans do.
DeepResearch agents write long, evidence-based reports, but teaching and grading them are hard because there is no single 'right answer' to score against.