This paper introduces P-GenRM, a personalized generative reward model that judges AI answers using a custom scorecard tailored to each user and situation.
This paper teaches AI to look things up on the web and fix its own mistakes mid-thought instead of starting over.
Large language models sometimes reach the right answer for the wrong reasons, which makes even their correct answers risky to rely on and hard to interpret.