JudgeRLVR: Judge First, Generate Second for Efficient Reasoning
Intermediate · Jiangshan Duo, Hanyu Li et al. · Jan 13 · arXiv
JudgeRLVR teaches a model to be a strict judge of answers before it learns to generate them, which trims bad ideas early.
#RLVR #judge-then-generate #discriminative supervision