The paper fixes a big problem in long video generation: models either forget what happened or slowly drift off-topic over time.
This paper introduces MOSS Transcribe Diarize, a single model that writes down what people say in a conversation, tells who said each part, and marks the exact times—all in one go.
Coding agents used to fix software rely on feedback; unit tests give only pass/fail signals that are often noisy or missing.
Standard attention is slow for long texts because it compares every word with every other word, which takes quadratic time.