Robots learn best from the first-person (egocentric) view they would actually see, but most AI models are trained on third-person videos and get confused by the viewpoint mismatch.
RoboTracer is a vision-language model that turns tricky, word-only instructions into safe, step-by-step 3D paths (spatial traces) robots can follow.
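To make "spatial trace" concrete, here is a minimal sketch of what such an output could look like as data: an ordered list of 3D waypoints plus a simple safety check. The `Waypoint` class, the workspace bounds, and the `is_safe` helper are illustrative assumptions, not RoboTracer's actual interface.

```python
from dataclasses import dataclass

@dataclass
class Waypoint:
    x: float
    y: float
    z: float  # meters, in the robot's base frame (an assumption for illustration)

# Hypothetical workspace bounds; a real system would use the robot's actual limits.
WORKSPACE = {"x": (-0.5, 0.5), "y": (-0.5, 0.5), "z": (0.0, 1.0)}

def is_safe(trace: list[Waypoint]) -> bool:
    """Reject any trace that leaves the allowed workspace."""
    return all(
        WORKSPACE["x"][0] <= p.x <= WORKSPACE["x"][1]
        and WORKSPACE["y"][0] <= p.y <= WORKSPACE["y"][1]
        and WORKSPACE["z"][0] <= p.z <= WORKSPACE["z"][1]
        for p in trace
    )

# A toy trace for "lift the cup straight up"; the model's real output
# would be grounded in perception, not hard-coded like this.
trace = [Waypoint(0.2, 0.1, 0.05), Waypoint(0.2, 0.1, 0.25), Waypoint(0.2, 0.1, 0.45)]
print(is_safe(trace))  # True
```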
InfiniteVL is a vision-language model that combines two ideas: local focus from Sliding Window Attention and long-term memory from a linear module called Gated DeltaNet.
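A minimal PyTorch sketch of how those two ideas compose, assuming a single head and toy dimensions: causal sliding-window attention for local focus, plus a gated delta-rule recurrence (a simplified stand-in for Gated DeltaNet) whose fixed-size state carries long-range information. The single-head shapes, the gate parameterization, and the way the two outputs are summed are all assumptions; InfiniteVL's actual layers may interleave them differently.

```python
import torch
import torch.nn.functional as F

def sliding_window_attention(q, k, v, window: int):
    """Causal attention where position t sees only positions t-window+1 .. t."""
    T, d = q.shape
    scores = (q @ k.T) / d**0.5
    i = torch.arange(T)
    mask = (i[None, :] > i[:, None]) | (i[:, None] - i[None, :] >= window)
    scores = scores.masked_fill(mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v          # (T, d)

def gated_delta_memory(q, k, v, alpha, beta):
    """Gated delta-rule recurrence: a (d, d) state updated once per step,
    so memory cost stays constant no matter how long the sequence is."""
    T, d = q.shape
    S = torch.zeros(d, d)                         # key -> value associative memory
    out = []
    for t in range(T):
        # Decay old memory (alpha), erase the value bound to k_t, write the new one (beta).
        S = alpha[t] * (S - beta[t] * torch.outer(k[t], k[t] @ S)) \
            + beta[t] * torch.outer(k[t], v[t])
        out.append(S.T @ q[t])                    # read the memory with the query
    return torch.stack(out)                       # (T, d)

T, d, window = 16, 8, 4
q, k, v = (torch.randn(T, d) for _ in range(3))
alpha = torch.sigmoid(torch.randn(T))             # forget gates in (0, 1)
beta = torch.sigmoid(torch.randn(T))              # write strengths in (0, 1)

# One illustrative way to combine them: local detail plus long-range memory.
y = sliding_window_attention(q, k, v, window) + gated_delta_memory(q, k, v, alpha, beta)
print(y.shape)  # torch.Size([16, 8])
```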
Robots need lots of realistic, long videos to learn, but collecting them is slow and expensive.
The paper shows how a vision-language model (VLM) can train itself to be a fair judge of answers about images without using any human preference labels.
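Without the paper's exact recipe, here is one common shape such label-free judge training takes: sample several candidate answers per image-question pair, let the model score them itself, and turn confident score gaps into (chosen, rejected) preference pairs for a DPO-style update. `generate_answers` and `self_score` below are stubs standing in for real VLM calls; the `margin` threshold is an assumed hyperparameter.

```python
import random

def generate_answers(image, question, n=4):
    """Stub: sample n candidate answers from the VLM (hypothetical)."""
    return [f"answer_{i}" for i in range(n)]

def self_score(image, question, answer):
    """Stub: the VLM judging its own answer, e.g. the probability it
    assigns to 'yes' when asked whether the answer is correct."""
    return random.random()

def build_preference_pairs(image, question, margin=0.2):
    """Keep only pairs where the model's own scores disagree clearly;
    these become (chosen, rejected) training pairs, no humans involved."""
    answers = generate_answers(image, question)
    scored = sorted(((self_score(image, question, a), a) for a in answers), reverse=True)
    pairs = []
    for hi_score, hi_ans in scored:
        for lo_score, lo_ans in scored:
            if hi_score - lo_score >= margin:
                pairs.append((hi_ans, lo_ans))
    return pairs  # fed to a DPO-style objective to sharpen the judge

print(build_preference_pairs("img.png", "What color is the car?"))
```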