VG-Refiner is a new method that lets an AI model find the right object in a picture from a text description, even when its helper tools make mistakes.
The paper shows how a vision-language model (VLM) can train itself to be a fair judge of answers about images without using any human preference labels.