GTR-Turbo teaches a vision-language agent using a 'free teacher' made by merging its own past checkpoints, so no costly external model is needed.
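To make the idea of merging past checkpoints concrete, here is a minimal sketch. The summary above does not say how the checkpoints are merged, so uniform parameter averaging is an assumption, and `merge_checkpoints`, the `Checkpoint` type, and the toy parameter names are hypothetical illustrations rather than GTR-Turbo's actual code.

```python
from typing import Dict, List

# Hypothetical stand-in for a model checkpoint: parameter name -> flat list of weights.
Checkpoint = Dict[str, List[float]]


def merge_checkpoints(checkpoints: List[Checkpoint]) -> Checkpoint:
    """Average each parameter across checkpoints (assumed merge rule, not confirmed by the source)."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint to merge")

    merged: Checkpoint = {}
    for name in checkpoints[0]:
        # Collect the same parameter from every checkpoint.
        tensors = [ckpt[name] for ckpt in checkpoints]
        if any(len(t) != len(tensors[0]) for t in tensors):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise mean over the checkpoints.
        merged[name] = [sum(vals) / len(tensors) for vals in zip(*tensors)]
    return merged


# Two toy "past checkpoints" of the same agent.
past = [
    {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]},
    {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]},
]
teacher = merge_checkpoints(past)
print(teacher)  # {'layer.weight': [2.0, 3.0], 'layer.bias': [0.5]}
```

The merged weights would then be loaded into a copy of the agent and used as the teacher. Because every input is a checkpoint the training run already produced, this step needs no calls to an external model.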
Before this work, most large language models generated text autoregressively, one token at a time, which made decoding slow and hard to parallelize.