VideoLoom: A Video Large Language Model for Joint Spatial-Temporal Understanding
Intermediate · Jiapeng Shi, Junke Wang et al. · Jan 12 · arXiv
VideoLoom is a single AI model that can tell both when something happens in a video and where it happens, at the pixel level.
#Video Large Language Model #Temporal Grounding #Referring Video Object Segmentation