Papers (15)

#visual grounding

Unified Video Editing with Temporal Reasoner

VideoCoF is a new way to edit videos that first figures out where to edit and then performs the edit, like thinking before acting.

#video editing #diffusion transformer #chain-of-frames

VG-Refiner: Towards Tool-Refined Referring Grounded Reasoning via Agentic Reinforcement Learning

Intermediate

Yuji Wang, Wenlong Liu et al. · Dec 6 · arXiv

VG-Refiner is a new way for AI to find the right object in a picture when given a description, even if helper tools make mistakes.

#visual grounding #referring expression comprehension #tool-integrated visual reasoning

Self-Improving VLM Judges Without Human Annotations

Intermediate

Inna Wanyin Lin, Yushi Hu et al. · Dec 2 · arXiv

The paper shows how a vision-language model (VLM) can train itself to be a fair judge of answers about images without using any human preference labels.

#vision-language model #VLM judge #reward model
