VIBEVOICE-ASR is a single-pass system that listens to up to 60 minutes of audio at once and outputs who spoke, when they spoke, and what they said in one stream.
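To make the "who, when, what in one stream" idea concrete, here is a minimal sketch of how such a stream could be represented and consumed. The `speaker|start|end|text` line format, the `Segment` class, and the helpers `parse_stream` and `talk_time` are all hypothetical names invented for illustration; they are not VIBEVOICE-ASR's actual output format or API.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    speaker: str   # who spoke
    start: float   # when the turn starts, in seconds
    end: float     # when the turn ends, in seconds
    text: str      # what was said


def parse_stream(stream: str) -> list[Segment]:
    """Parse lines of the hypothetical form 'speaker|start|end|text'."""
    segments = []
    for line in stream.strip().splitlines():
        # Split at most 3 times so the text itself may contain '|'.
        speaker, start, end, text = line.split("|", 3)
        segments.append(
            Segment(speaker.strip(), float(start), float(end), text.strip())
        )
    return segments


def talk_time(segments: list[Segment]) -> dict[str, float]:
    """Total speaking time per speaker, in seconds."""
    totals: dict[str, float] = {}
    for seg in segments:
        totals[seg.speaker] = totals.get(seg.speaker, 0.0) + (seg.end - seg.start)
    return totals


# Made-up example stream (assumed format, not real model output).
example = """
S1|0.0|4.5|Welcome to the meeting.
S2|4.5|9.0|Thanks, glad to be here.
S1|9.0|12.0|Let's get started.
"""
segments = parse_stream(example)
```

Because speaker, timing, and text arrive together in each record, downstream steps such as per-speaker talk time need no separate alignment between a diarization output and a transcript.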
This paper builds a big, fair test called Hearing to Translate to check how well different speech translation systems work in the real world.