Alternating Reinforcement Learning for Rubric-Based Reward Modeling in Non-Verifiable LLM Post-Training
Intermediate · Ran Xu, Tianci Liu et al. · Feb 2 · arXiv
The paper introduces Rubric-ARM, a system that trains two models together, a rubric generator and a judge, using alternating reinforcement learning so that they can better decide which answers people would prefer.
#Rubric-based reward modeling #LLM-as-a-judge #Alternating reinforcement learning