The paper teaches multimodal large language models (MLLMs) to stop guessing from text alone or images alone and instead cross-check both before answering.
SpatialTree is a new four-level "ability tree" that tests how multimodal AI models (ones that both see and read) handle space, from basic perception up to acting in the world.
Long Video Understanding (LVU) is hard because the important clues are tiny, far apart in time, and buried in hours of mostly unimportant footage.