This paper asks a simple question: does reinforcement learning (RL) truly give medical vision-language models (VLMs) new capabilities, or does it just help them choose better among answers they could already produce?
The paper shows that making a model write a number as a sequence of digits and then grading the whole number at the end works better than grading each digit separately.
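
To make the contrast between the two grading schemes concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's actual reward: the function names, the relative-error scoring, and the zero-padding used for alignment are all hypothetical choices made for this example, and targets are assumed to be non-negative integers.

```python
from typing import Sequence


def whole_number_reward(digits: Sequence[int], target: int) -> float:
    """Grade once, after every digit has been emitted.

    Assumption for this sketch: the score is 1 minus the relative error
    of the decoded value, clipped to the range [0, 1].
    """
    if not digits:
        return 0.0
    value = int("".join(str(d) for d in digits))  # decode the digit sequence
    scale = max(abs(target), 1)                   # avoid dividing by zero
    return max(0.0, 1.0 - abs(value - target) / scale)


def per_digit_reward(digits: Sequence[int], target: int) -> float:
    """Grade each digit position independently against the target's digits.

    Assumption for this sketch: shorter sequences are left-padded with
    zeros so that place values line up; target is non-negative.
    """
    target_digits = [int(c) for c in str(target)]
    width = max(len(digits), len(target_digits))
    pred = [0] * (width - len(digits)) + list(digits)
    tgt = [0] * (width - len(target_digits)) + target_digits
    matches = sum(p == t for p, t in zip(pred, tgt))
    return matches / width


if __name__ == "__main__":
    target = 100
    for guess in ([9, 9], [9, 0, 0], [1, 0, 0]):
        print(guess,
              round(whole_number_reward(guess, target), 3),
              round(per_digit_reward(guess, target), 3))
```

In this toy setup with a target of 100, the output 99 scores 0.99 under whole-number grading but 0.0 under per-digit grading, while 900 scores 0.0 under whole-number grading but about 0.67 under per-digit grading. This is one plausible intuition for why grading the finished number could give a more useful signal: digit-by-digit agreement does not track how close the number actually is.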