The paper asks a simple question: when a vision-language model receives an image and a piece of text that disagree, and the instructions say to trust only the image, which source does it actually follow?
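To make the setup concrete, here is a minimal sketch of the kind of conflict trial this question implies. Everything below is a hypothetical illustration, not the paper's code: `query_model` is a stub standing in for a real vision-language model call, and the labels are placeholders.

```python
# Hypothetical sketch of a single image/text conflict trial.
# query_model is a stub; a real experiment would send the image,
# caption, and instruction to an actual vision-language model.

def query_model(image_label: str, caption: str, instruction: str) -> str:
    # Stub behavior: simulate a text-biased model that echoes the caption.
    # Replace with a real model call to run the experiment for real.
    return caption

def run_conflict_trial(image_label: str, caption_label: str) -> str:
    # The image and the caption deliberately disagree; the instruction
    # tells the model to rely on the image alone.
    instruction = "Answer using only the picture, not the caption."
    caption = f"This is a photo of a {caption_label}."
    answer = query_model(image_label, caption, instruction).lower()
    if image_label in answer:
        return "followed_image"
    if caption_label in answer:
        return "followed_text"
    return "other"

# Example: the image shows a cat, but the caption claims a dog.
print(run_conflict_trial("cat", "dog"))  # -> "followed_text" for this stub
```

Counting how often the answer matches the image label versus the caption label over many such trials measures which modality the model actually trusts when they conflict.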
The paper shows that large sequence models such as transformers come to represent multi-step goals in their hidden activations, even though their training objective only ever supervises one step at a time.
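One standard way to test a claim like this is a linear probe: if a simple classifier can read a goal several steps ahead out of the hidden state, that goal is linearly encoded there. The sketch below uses synthetic data as a stand-in; in a real experiment, `hidden` would be activations recorded from the model and `goal` the outcome k steps later. This is an illustration of the probing technique in general, not the paper's actual method.

```python
# Linear-probe sketch: can a classifier decode a future "goal" from
# hidden activations? Synthetic data stands in for real activations.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

n_samples, d_model, n_goals = 2000, 64, 4
# Synthetic stand-in: activations carrying a weak, linearly decodable
# goal signal (one direction in activation space per goal).
goal = rng.integers(0, n_goals, size=n_samples)
directions = rng.normal(size=(n_goals, d_model))
hidden = rng.normal(size=(n_samples, d_model)) + 0.5 * directions[goal]

X_train, X_test, y_train, y_test = train_test_split(
    hidden, goal, test_size=0.25, random_state=0
)

# Fit the probe on one split and evaluate on held-out activations.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
acc = probe.score(X_test, y_test)

# Accuracy well above chance (1/n_goals) suggests the hidden state
# carries information about the goal, not just the next step.
print(f"probe accuracy: {acc:.2f} (chance = {1 / n_goals:.2f})")
```

The probe is kept deliberately simple (logistic regression) so that high accuracy reflects information already present in the activations rather than computation done by the probe itself.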