VIBEVOICE-ASR is a single-pass system that listens to up to 60 minutes of audio at once and outputs who spoke, when they spoke, and what they said in one stream.
Most people on Earth speak more than one language and often switch languages in the same chat, but AI tools aren't tested well on this real behavior.
This paper builds a big, fair test called Hearing to Translate to check how well different speech translation systems work in the real world.