VIBEVOICE-ASR is a single-pass system that listens to up to 60 minutes of audio at once and outputs who spoke, when they spoke, and what they said in one stream.
This paper builds one smart system that listens to child–adult conversations and writes what was said, who said it, and exactly when each person spoke.
This paper introduces MOSS Transcribe Diarize, a single model that writes down what people say in a conversation, tells who said each part, and marks the exact times—all in one go.