Robots tend to learn better when they get small hints at every step instead of only a final thumbs-up or thumbs-down.
This paper asks a new question for vision-language models: not just 'What do you see?' but 'How far along is the task right now?'