🎓How I Study AIHISA
📖Read
📄Papers📰Blogs🎬Courses
💡Learn
🛤️Paths📚Topics💡Concepts🎴Shorts
🎯Practice
📝Daily Log🎯Prompts🧠Review
SearchSettings
How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers6

AllBeginnerIntermediateAdvanced
All SourcesarXiv
#Multimodal Large Language Models

Ref-Adv: Exploring MLLM Visual Reasoning in Referring Expression Tasks

Intermediate
Qihua Dong, Kuo Yang et al.Feb 27arXiv

This paper builds a new test called Ref-Adv to check if AI can truly match tricky sentences to the right thing in a picture.

#Referring Expression Comprehension#Visual Grounding#Multimodal Large Language Models

When and How Much to Imagine: Adaptive Test-Time Scaling with World Models for Visual Spatial Reasoning

Intermediate
Shoubin Yu, Yue Zhang et al.Feb 9arXiv

Visual spatial reasoning often fails when a model only looks at one picture and must imagine new viewpoints.

#Adaptive Test-Time Scaling#World Models#Visual Spatial Reasoning

Beyond Unimodal Shortcuts: MLLMs as Cross-Modal Reasoners for Grounded Named Entity Recognition

Intermediate
Jinlong Ma, Yu Zhang et al.Feb 4arXiv

The paper teaches multimodal large language models (MLLMs) to stop guessing from just text or just images and instead check both together before answering.

#GMNER#Multimodal Large Language Models#Modality Bias

Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models

Intermediate
Xiaomin Yu, Yi Xin et al.Feb 2arXiv

This paper finds a precise way to describe and fix the Modality Gap, which is when image and text features that mean the same thing still sit in different places in the AI’s memory space.

#Modality Gap#Multimodal Large Language Models#Contrastive Learning

SpatialTree: How Spatial Abilities Branch Out in MLLMs

Intermediate
Yuxi Xiao, Longfei Li et al.Dec 23arXiv

SpatialTree is a new, four-level "ability tree" that tests how multimodal AI models (that see and read) handle space: from basic seeing to acting in the world.

#Spatial Intelligence#Multimodal Large Language Models#Hierarchical Benchmark

Active Video Perception: Iterative Evidence Seeking for Agentic Long Video Understanding

Beginner
Ziyang Wang, Honglu Zhou et al.Dec 5arXiv

Long Video Understanding (LVU) is hard because the important clues are tiny, far apart in time, and buried in hours of mostly unimportant footage.

#Active Video Perception#Long Video Understanding#Plan-Observe-Reflect