VidVec: Unlocking Video MLLM Embeddings for Video-Text Retrieval
Intermediate · Issar Tzachor, Dvir Samuel et al. · Feb 8 · arXiv
VidVec shows that video-capable multimodal large language models (MLLMs) already encode strong video-text matching signals in their intermediate layers.
#video–text retrieval #multimodal large language models #intermediate layer embeddings