Papers: 8

#IoU

Learning Cross-View Object Correspondence via Cycle-Consistent Mask Prediction

Shannan Yan, Leqi Zheng et al. · Feb 22 · arXiv

This paper teaches a computer to find the same object when seen from two very different cameras, like a body camera (first-person) and a room camera (third-person).

#cross-view correspondence #egocentric to exocentric #binary segmentation


CADEvolve: Creating Realistic CAD via Program Evolution

Intermediate

Maksim Elistratov, Marina Barannikov et al. · Feb 18 · arXiv

AI models that make CAD designs used to learn mostly from simple “draw-then-extrude” examples, so they struggled with real, complex parts.

#CAD #CadQuery #Image2CAD


TimeChat-Captioner: Scripting Multi-Scene Videos with Time-Aware and Structural Audio-Visual Captions

Intermediate

Linli Yao, Yuancheng Wei et al. · Feb 9 · arXiv

This paper teaches AI to write movie-like scripts for videos by adding exact timestamps and rich details about what you see and hear.

#Omni Dense Captioning #time-aware video captioning #audio-visual understanding


Beyond Unimodal Shortcuts: MLLMs as Cross-Modal Reasoners for Grounded Named Entity Recognition

Intermediate

Jinlong Ma, Yu Zhang et al. · Feb 4 · arXiv

The paper teaches multimodal large language models (MLLMs) to stop guessing from just text or just images and instead check both together before answering.

#GMNER #Multimodal Large Language Models #Modality Bias


PhysRVG: Physics-Aware Unified Reinforcement Learning for Video Generative Models

Intermediate

Qiyuan Zhang, Biao Gong et al. · Jan 16 · arXiv

This paper teaches video-generating AIs to follow real-world physics, so that rolling balls move realistically and collisions look believable.

#physics-aware video generation #rigid body motion #reinforcement learning


Medical SAM3: A Foundation Model for Universal Prompt-Driven Medical Image Segmentation

Intermediate

Chongcong Jiang, Tianxingjian Ding et al. · Jan 15 · arXiv

Medical SAM3 is a text-prompted medical image segmentation model that was fully fine-tuned on 33 diverse datasets to work across many imaging types like ultrasound, X-ray, endoscopy, and pathology.

#Medical image segmentation #Prompt-based segmentation #Foundation models


GroundingME: Exposing the Visual Grounding Gap in MLLMs through Multi-Dimensional Evaluation

Intermediate

Rang Li, Lei Li et al. · Dec 19 · arXiv

Visual grounding is when an AI finds the exact thing in a picture that a sentence is talking about, and this paper shows today’s big vision-language AIs are not as good at it as we thought.

#visual grounding #multimodal large language models #benchmark


ReVSeg: Incentivizing the Reasoning Chain for Video Segmentation with Reinforcement Learning

Intermediate

Yifan Li, Yingda Yin et al. · Dec 2 · arXiv

ReVSeg teaches an AI to segment objects in videos by thinking step-by-step instead of guessing everything at once.

#Reasoning Video Object Segmentation #Vision-Language Models #Temporal Grounding
