SAMTok turns any objectโs mask in an image into just two special โwordsโ so language models can handle pixels like they handle text.
Cities are full of places defined by people, like schools and parks, which are hard to see clearly from space without extra clues.