Large Vision-Language Models (LVLMs) are great with one picture but get confused when you give them several, often mixing details from different images.
HERMES is a training-free way to make video-language models understand live, streaming video quickly and accurately.