SAMTok turns any object’s mask in an image into just two special “words” so language models can handle pixels like they handle text.
Cities are full of places defined by people, like schools and parks, which are hard to see clearly from space without extra clues.