This paper fixes a common problem in multimodal AI: models can understand pictures and words well but stumble when asked to create matching images.
HERBench is a new test that checks if video AI models can combine several clues spread across time, not just guess from one frame or language priors.