Long chains of thought make AI smarter but also slower, pricier, and limited by context windows.
The paper introduces Rubric-ARM, a system that trains two AI helpers (a rubric maker and a judge) together using reinforcement learning so they can better predict which answers people would prefer.
Large language models usually get judged one message at a time, but many real tasks need smart planning across a whole conversation.