A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos
Intermediate · Mohammed Irfan Kurpath, Jaseel Muhammad Kaithakkodan et al. · Dec 18 · arXiv
This paper builds a new test, LongShOTBench, to check if AI can truly understand long videos by using sight, speech, and sounds together.
#long-form video understanding  #multimodal reasoning  #audio-visual-speech alignment