Big picture: Vision-language models process each image as hundreds of small pieces (tokens), which makes them slow and prone to confidently stating things that aren't there, mistakes called hallucinations.
The paper shows that changing the language a model 'thinks in' (its language of thought) can make its English answers more varied with little loss in quality.