Tool Verification for Test-Time Reinforcement Learning
Intermediate · Ruotong Liao, Nikolai Röhrich et al. · Mar 2 · arXiv
The paper targets a major flaw in test-time reinforcement learning (TTRL): when many wrong answers agree, the consensus is treated as correct, so the model rewards its own mistake and gets stuck.
#test-time reinforcement learning  #verification-weighted voting  #tool verification
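
The failure mode in the summary can be illustrated with a small sketch. It assumes that rewards come from a vote over sampled answers, and it contrasts a plain majority vote with a verification-weighted vote. The function names, the `verified` flags, and the weight of 3.0 are hypothetical choices for illustration only, not the paper's actual method or settings.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Use the most common answer as the pseudo-label; reward rollouts that match it."""
    pseudo_label, _ = Counter(answers).most_common(1)[0]
    return pseudo_label, [1.0 if a == pseudo_label else 0.0 for a in answers]


def weighted_vote_rewards(answers, verified, verified_weight=3.0):
    """Hypothetical variant: a rollout that passed a tool check casts a heavier vote."""
    scores = {}
    for answer, ok in zip(answers, verified):
        scores[answer] = scores.get(answer, 0.0) + (verified_weight if ok else 1.0)
    pseudo_label = max(scores, key=scores.get)
    return pseudo_label, [1.0 if a == pseudo_label else 0.0 for a in answers]


# Toy rollouts for a question whose true answer is "42" (made-up data).
answers = ["41", "41", "41", "42", "42"]
# Assumed output of an external tool check on each rollout.
verified = [False, False, False, True, True]

plain_label, plain_rewards = majority_vote_rewards(answers)
weighted_label, weighted_rewards = weighted_vote_rewards(answers, verified)

print(plain_label, plain_rewards)        # wrong consensus is rewarded
print(weighted_label, weighted_rewards)  # verified minority wins the vote
```

In this toy case the plain vote selects "41" and rewards the three wrong rollouts, while the weighted vote selects "42" because the two verified rollouts outweigh the unverified majority.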