Papers (2)

#video question answering

Synthetic Visual Genome 2: Extracting Large-scale Spatio-Temporal Scene Graphs from Videos

This paper builds SVG2, a large, automatically generated collection of video scene graphs that records which objects appear in a video, what they look like, and how they interact over time.

#video scene graph #spatio-temporal reasoning #panoptic segmentation


Multimodal Fact-Level Attribution for Verifiable Reasoning

Beginner

David Wan, Han Wang et al. · Feb 12 · arXiv

This paper introduces MURGAT, a new test that checks whether AI models can back up each individual fact they state with the right part of a video, audio clip, or figure.

#multimodal grounding #fact-level attribution #atomic fact decomposition
