🎓How I Study AIHISA
📖Read
📄Papers📰Blogs🎬Courses
💡Learn
🛤️Paths📚Topics💡Concepts🎴Shorts
🎯Practice
📝Daily Log🎯Prompts🧠Review
SearchSettings
How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers20

AllBeginnerIntermediateAdvanced
All SourcesarXiv
#multimodal large language models

GroundingME: Exposing the Visual Grounding Gap in MLLMs through Multi-Dimensional Evaluation

Intermediate
Rang Li, Lei Li et al.Dec 19arXiv

Visual grounding is when an AI finds the exact thing in a picture that a sentence is talking about, and this paper shows today’s big vision-language AIs are not as good at it as we thought.

#visual grounding#multimodal large language models#benchmark

4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation

Intermediate
Chiao-An Yang, Ryo Hachiuma et al.Dec 18arXiv

This paper teaches a video-understanding AI to think in 3D plus time (4D) so it can answer questions about specific objects moving in videos.

#4D perception#multimodal large language models#perceptual distillation

Differences That Matter: Auditing Models for Capability Gap Discovery and Rectification

Intermediate
Qihao Liu, Chengzhi Mao et al.Dec 18arXiv

AuditDM is a friendly 'auditor' model that hunts for where vision-language models get things wrong and then creates the right practice to fix them.

#AuditDM#model auditing#cross-model divergence

Step-GUI Technical Report

Intermediate
Haolong Yan, Jia Wang et al.Dec 17arXiv

This paper builds Step-GUI, a pair of small-but-strong GUI agent models (4B/8B) that can use phones and computers by looking at screenshots and following instructions.

#GUI automation#multimodal large language models#trajectory-level calibration

TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs

Intermediate
Jun Zhang, Teng Wang et al.Dec 16arXiv

TimeLens studies how to teach AI not just what happens in a video, but exactly when it happens, which is called video temporal grounding (VTG).

#video temporal grounding#multimodal large language models#benchmark re-annotation

MMSI-Video-Bench: A Holistic Benchmark for Video-Based Spatial Intelligence

Intermediate
Jingli Lin, Runsen Xu et al.Dec 11arXiv

This paper introduces MMSI-Video-Bench, a big, carefully hand-made test to check how well AI understands space and motion in videos.

#video-based spatial intelligence#multimodal large language models#spatial construction

Rethinking Chain-of-Thought Reasoning for Videos

Intermediate
Yiwu Zhong, Zi-Yuan Hu et al.Dec 10arXiv

The paper shows that video AIs do not need long, human-like chains of thought to reason well.

#video reasoning#chain-of-thought#concise reasoning

OmniSafeBench-MM: A Unified Benchmark and Toolbox for Multimodal Jailbreak Attack-Defense Evaluation

Intermediate
Xiaojun Jia, Jie Liao et al.Dec 6arXiv

OmniSafeBench-MM is a one-stop, open-source test bench that fairly compares how multimodal AI models get tricked (jailbroken) and how well different defenses stop that.

#multimodal large language models#jailbreak attacks#safety alignment
12