How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (23)

Showing papers tagged #vision-language models

SpatiaLab: Can Vision-Language Models Perform Spatial Reasoning in the Wild?

Intermediate
Azmine Toushik Wasi, Wahid Faisal et al. · Feb 3 · arXiv

SpatiaLab is a new benchmark that checks whether vision-language models (VLMs) can handle real-world spatial puzzles, like what’s in front, behind, bigger, or reachable; a toy scoring loop is sketched after this entry.

#SpatiaLab · #spatial reasoning · #vision-language models
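
To make the benchmark idea concrete, here is a minimal scoring loop of the kind a spatial-reasoning benchmark implies. The item fields and the `ask_vlm` helper are illustrative assumptions, not SpatiaLab's actual schema or API.

```python
# Toy scoring loop for spatial-reasoning items; ask_vlm is a hypothetical
# stand-in for whichever vision-language model you want to evaluate.
items = [
    {"image": "kitchen.jpg", "question": "Is the mug in front of the laptop?", "answer": "yes"},
    {"image": "yard.jpg", "question": "Can the child reach the top shelf?", "answer": "no"},
]

def ask_vlm(image_path: str, question: str) -> str:
    """Placeholder for a real VLM call (API request or local inference)."""
    return "yes"

correct = sum(
    ask_vlm(item["image"], item["question"]).strip().lower() == item["answer"]
    for item in items
)
print(f"accuracy: {correct / len(items):.2f}")
```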

Instruction Anchors: Dissecting the Causal Dynamics of Modality Arbitration

Intermediate
Yu Zhang, Mufan Xu et al. · Feb 3 · arXiv

The paper asks a simple question: when an AI sees a picture and some text but the instructions say 'only trust the picture,' how does it decide which one to follow?

#multimodal instruction following · #modality arbitration · #instruction tokens

Thinking with Comics: Enhancing Multimodal Reasoning through Structured Visual Storytelling

Intermediate
Andong Chen, Wenxin Zhu et al. · Feb 2 · arXiv

This paper shows that comics (multi-panel pictures with words) can help AI think through problems step by step, just like a student explains their work.

#multimodal reasoning · #visual storytelling · #comics

VTC-R1: Vision-Text Compression for Efficient Long-Context Reasoning

Intermediate
Yibo Wang, Yongcheng Jing et al. · Jan 29 · arXiv

This paper shows a new way to help AI think through long problems faster by turning earlier text steps into small pictures that the AI can reread; a minimal rendering sketch follows this entry.

#vision-text compression · #optical memory · #iterative reasoning
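
As a rough illustration of "turn earlier text steps into small pictures," the sketch below renders reasoning steps onto an image with Pillow. The page width, line height, and wrapping are arbitrary choices for the example; this is not the paper's actual compression pipeline.

```python
# Render earlier reasoning steps as plain text on a small image, so a VLM
# could re-read them later as vision tokens instead of keeping them as text.
# Requires Pillow (pip install pillow).
import textwrap
from PIL import Image, ImageDraw

def render_steps_to_image(steps: list[str], width: int = 512) -> Image.Image:
    lines = []
    for i, step in enumerate(steps, 1):
        lines.extend(textwrap.wrap(f"{i}. {step}", width=70))
    height = 16 * len(lines) + 16
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)
    for row, line in enumerate(lines):
        draw.text((8, 8 + 16 * row), line, fill="black")
    return img

img = render_steps_to_image([
    "Counted 3 red blocks and 2 blue blocks in the left image.",
    "The question asks for the total, so the answer should be 5.",
])
img.save("earlier_steps.png")  # this image is what gets fed back to the model
```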

MMFineReason: Closing the Multimodal Reasoning Gap via Open Data-Centric Methods

Intermediate
Honglin Lin, Zheng Liu et al. · Jan 29 · arXiv

MMFineReason is a huge, open dataset (1.8 million examples, 5.1 billion solution tokens) that teaches AIs to think step by step about pictures and text together.

#multimodal reasoning · #vision-language models · #chain-of-thought

PROGRESSLM: Towards Progress Reasoning in Vision-Language Models

Intermediate
Jianshu Zhang, Chengxuan Qian et al. · Jan 21 · arXiv

This paper asks a new question for vision-language models: not just 'What do you see?' but 'How far along is the task right now?' A toy progress-query sketch follows this entry.

#progress reasoning · #vision-language models · #episodic retrieval
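
To show what a "how far along?" query might look like in practice, here is a toy prompt builder and reply parser. The wording and the 0–100 scale are assumptions for illustration, not PROGRESSLM's actual protocol.

```python
# Toy progress-estimation prompt and parser; not the paper's actual format.
def progress_prompt(goal: str, num_frames: int) -> str:
    return (
        f"Task goal: {goal}\n"
        f"You are shown {num_frames} frames of the attempt so far, in order.\n"
        "Estimate how much of the task is complete, as a number from 0 to 100. "
        "Answer with the number only."
    )

def parse_progress(reply: str) -> float:
    """Clamp the model's numeric reply to [0, 100]; fall back to 0.0 if unparseable."""
    try:
        return min(100.0, max(0.0, float(reply.strip().rstrip("%"))))
    except ValueError:
        return 0.0

print(progress_prompt("make a cup of tea", num_frames=4))
print(parse_progress("65%"))  # -> 65.0
```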

OS-Symphony: A Holistic Framework for Robust and Generalist Computer-Using Agent

Intermediate
Bowen Yang, Kaiming Jin et al. · Jan 12 · arXiv

OS-Symphony tackles two weak spots of computer-using agents: they forget important visual details over long tasks, and they cannot reliably find up-to-date, step-by-step help for unfamiliar apps; a toy milestone-memory sketch follows this entry.

#computer-using agents · #vision-language models · #milestone memory
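
The #milestone memory tag suggests keeping short records of key screens so a long-running agent does not lose important visual details. The toy store below illustrates that general idea only; it is not OS-Symphony's design, and the fields and keyword lookup are invented for the example.

```python
# A toy "milestone memory" for a computer-using agent: remember key screens
# by step number, a short note, and the saved screenshot path.
from dataclasses import dataclass, field

@dataclass
class Milestone:
    step: int
    note: str         # e.g. "logged in; settings page is open"
    screenshot: str   # path to the saved screenshot

@dataclass
class MilestoneMemory:
    items: list[Milestone] = field(default_factory=list)

    def add(self, step: int, note: str, screenshot: str) -> None:
        self.items.append(Milestone(step, note, screenshot))

    def recall(self, keyword: str) -> list[Milestone]:
        """Very crude retrieval: milestones whose note mentions the keyword."""
        return [m for m in self.items if keyword.lower() in m.note.lower()]

memory = MilestoneMemory()
memory.add(3, "logged in; settings page is open", "shots/step3.png")
memory.add(9, "export dialog shows CSV and JSON options", "shots/step9.png")
print([m.step for m in memory.recall("export")])  # -> [9]
```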

AgentOCR: Reimagining Agent History via Optical Self-Compression

Intermediate
Lang Feng, Fuchao Yang et al. · Jan 8 · arXiv

AgentOCR turns an agent’s long text history into pictures so it can remember more using fewer tokens; the rough token arithmetic is sketched after this entry.

#AgentOCR · #optical self-compression · #visual tokens
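
Why pictures can be cheaper than text is easiest to see with back-of-the-envelope arithmetic. The chars-per-token and vision-tokens-per-image figures below are rough, illustrative assumptions, not numbers reported by AgentOCR.

```python
# Back-of-the-envelope arithmetic for "history as pictures" token savings.
history_chars = 40_000                     # a long tool-use / browsing transcript
text_tokens = history_chars // 4           # assume ~4 characters per text token

chars_per_rendered_page = 4_000            # assumed text capacity of one rendered image
vision_tokens_per_image = 256              # assumed fixed vision-token budget per image
pages = -(-history_chars // chars_per_rendered_page)   # ceiling division
image_tokens = pages * vision_tokens_per_image

print(f"text tokens:  {text_tokens}")      # 10000
print(f"image tokens: {image_tokens}")     # 10 pages * 256 = 2560
```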

FocusUI: Efficient UI Grounding via Position-Preserving Visual Token Selection

Intermediate
Mingyu Ouyang, Kevin Qinghong Lin et al. · Jan 7 · arXiv

FocusUI makes computer-using AI faster while staying accurate by looking only at the important parts of a screen; a toy patch-selection sketch follows this entry.

#UI grounding · #vision-language models · #visual token pruning
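
A minimal sketch of position-preserving token selection: score the screen's patch tokens, keep only the top-k, but remember each kept patch's (row, column) so layout information survives pruning. The random scores and the 10% keep ratio are placeholders; FocusUI's actual scoring and selection rule may differ.

```python
# Keep the highest-scoring screenshot patches while preserving their grid positions.
import numpy as np

rng = np.random.default_rng(0)
grid_h, grid_w, dim = 24, 40, 64             # 960 patch tokens for one screenshot
patches = rng.normal(size=(grid_h * grid_w, dim))
scores = rng.random(grid_h * grid_w)         # stand-in for learned importance scores

k = 96                                       # keep only ~10% of the visual tokens
keep = np.argsort(scores)[-k:]               # indices of the k highest-scoring patches

kept_patches = patches[keep]                                         # pruned token sequence
kept_positions = np.stack([keep // grid_w, keep % grid_w], axis=1)   # (row, col) per kept token

print(kept_patches.shape, kept_positions.shape)   # (96, 64) (96, 2)
```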

WebGym: Scaling Training Environments for Visual Web Agents with Realistic Tasks

Intermediate
Hao Bai, Alexey Taymanov et al. · Jan 5 · arXiv

WebGym is a giant practice world (almost 300,000 tasks) that lets AI web agents learn on real, ever-changing websites instead of tiny, fake ones; a gym-style toy loop is sketched after this entry.

#WebGym · #visual web agents · #vision-language models
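
The agent-environment loop below is a gym-style toy, included only to make "practice world" concrete. The observation fields, actions, and reward are invented and do not reflect WebGym's real interface.

```python
# A toy reset/step loop in the shape most agent training environments share.
class ToyWebTask:
    def __init__(self, goal: str):
        self.goal, self.steps = goal, 0

    def reset(self):
        self.steps = 0
        return {"goal": self.goal, "screenshot": "step0.png"}   # observation

    def step(self, action: str):
        self.steps += 1
        done = action == "submit" or self.steps >= 5
        reward = 1.0 if action == "submit" else 0.0
        return {"goal": self.goal, "screenshot": f"step{self.steps}.png"}, reward, done

env = ToyWebTask("find the cheapest flight to Tokyo")
obs, done = env.reset(), False
while not done:
    action = "click" if env.steps < 2 else "submit"   # a real agent would decide from obs
    obs, reward, done = env.step(action)
print("episode finished, reward:", reward)
```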

ProGuard: Towards Proactive Multimodal Safeguard

Intermediate
Shaohan Yu, Lijun Li et al. · Dec 29 · arXiv

ProGuard is a safety guard for text and images that doesn’t just spot known problems: it can also recognize and name new, never-seen-before risks. A toy novel-risk check is sketched after this entry.

#proactive safety · #multimodal moderation · #out-of-distribution detection
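
One simple way to notice "risky, but unlike anything I know" is an embedding-similarity check against prototypes of known risk categories, sketched below. This is a generic out-of-distribution illustration, not ProGuard's actual mechanism; the prototypes, embeddings, and 0.3 threshold are all placeholders.

```python
# Flag inputs that match no known risk category as candidates for a new label.
import numpy as np

rng = np.random.default_rng(1)
known_risk_prototypes = {                  # category name -> prototype embedding
    "violence": rng.normal(size=128),
    "self-harm": rng.normal(size=128),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def classify(embedding, threshold: float = 0.3):
    """Return the best-matching known category, or flag a novel risk."""
    name, score = max(
        ((n, cosine(embedding, p)) for n, p in known_risk_prototypes.items()),
        key=lambda t: t[1],
    )
    return name if score >= threshold else "novel-risk: needs a new label"

print(classify(rng.normal(size=128)))      # random input -> likely "novel-risk"
```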

See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning

Intermediate
Shuoshuo Zhang, Yizhen Zhang et al. · Dec 26 · arXiv

The paper teaches vision-language models (AIs that look and read) to pay attention to the right picture parts without needing extra tools when they answer.

#BiPS · #perceptual shaping · #vision-language models