SARAH is a real-time system that makes virtual characters move their whole bodies naturally during a conversation while knowing where the user is.
This paper introduces MOSS-Audio-Tokenizer, a new audio tokenizer that turns sound into compact tokens, much as text tokenizers break sentences into word-like pieces.