The paper solves a big problem: when you merge several reinforcement-learned models, their special skills get watered down by simple averaging.
This paper builds a new test, LongShOTBench, to check if AI can truly understand long videos by using sight, speech, and sounds together.
InfiniteVL is a vision-language model that mixes two ideas: local focus with Sliding Window Attention and long-term memory with a linear module called Gated DeltaNet.