This paper teaches a computer to find the same object when seen from two very different cameras, like a body camera (first-person) and a room camera (third-person).
AI models that make CAD designs used to learn mostly from simple “draw-then-extrude” examples, so they struggled with real, complex parts.
This paper teaches AI to write movie-like scripts for videos by adding exact timestamps and rich details about what you see and hear.
This paper teaches multimodal large language models (MLLMs) to stop guessing from just text or just images and instead check both together before answering.
This paper teaches video-making AIs to follow real-world physics, so rolling balls roll right and collisions look believable.
Medical SAM3 is a text-prompted medical image segmentation model that was fully fine-tuned on 33 diverse datasets to work across many imaging types like ultrasound, X-ray, endoscopy, and pathology.
Visual grounding is an AI's ability to find the exact thing in a picture that a sentence is talking about, and this paper shows today's big vision-language AIs are not as good at it as we thought.
ReVSeg teaches an AI to segment objects in videos by reasoning step by step instead of guessing everything at once.