Visual spatial reasoning often fails when a model only looks at one picture and must imagine new viewpoints.
FantasyVLN teaches a robot to follow language instructions while looking around, using step-by-step reasoning during training but skipping it at test time.