The paper asks a simple question: what must a vision modelβs internal pictures (embeddings) look like if it can recognize new mixes of things it already knows?
The paper discovers a tiny, special group of neurons inside large language models (LLMs) that act like a reward system in the human brain.
The paper shows that when vision-language models write captions, only a small set of uncertain words (about 20%) act like forks that steer the whole sentence.