How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (28)

#vision-language models

SpatiaLab: Can Vision-Language Models Perform Spatial Reasoning in the Wild?

Intermediate
Azmine Toushik Wasi, Wahid Faisal et al. · Feb 3 · arXiv

SpatiaLab is a new test that checks if vision-language models (VLMs) can understand real-world spatial puzzles, like what’s in front, behind, bigger, or reachable.

#SpatiaLab #spatial reasoning #vision-language models

Instruction Anchors: Dissecting the Causal Dynamics of Modality Arbitration

Intermediate
Yu Zhang, Mufan Xu et al. · Feb 3 · arXiv

The paper asks a simple question: when an AI sees a picture and some text but the instructions say 'only trust the picture,' how does it decide which one to follow?

#multimodal instruction following #modality arbitration #instruction tokens

Thinking with Comics: Enhancing Multimodal Reasoning through Structured Visual Storytelling

Intermediate
Andong Chen, Wenxin Zhu et al. · Feb 2 · arXiv

This paper shows that comics (multi-panel pictures with words) can help AI think through problems step by step, just like a student explains their work.

#multimodal reasoning #visual storytelling #comics

Kimi K2.5: Visual Agentic Intelligence

Beginner
Kimi Team, Tongtong Bai et al. · Feb 2 · arXiv

Kimi K2.5 is a new open-source AI that can read both text and visuals (images and videos) and act like a team of helpers to finish big tasks faster.

#multimodal learning #vision-language models #joint optimization

VTC-R1: Vision-Text Compression for Efficient Long-Context Reasoning

Intermediate
Yibo Wang, Yongcheng Jing et al. · Jan 29 · arXiv

This paper shows a new way to help AI think through long problems faster by turning earlier text steps into small pictures the AI can reread (a toy sketch of this rendering idea appears below).

#vision-text compression #optical memory #iterative reasoning
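To make the text-to-image compression idea concrete, here is a minimal sketch, assuming Pillow for rendering; the font, image size, and line spacing are illustrative choices, not the paper's settings. It renders a long chain of earlier reasoning steps into a single image that a vision encoder could reread as a roughly fixed budget of visual tokens.

```python
# Toy illustration of "render earlier text steps as an image" (not VTC-R1's actual pipeline).
# Assumes Pillow is installed; layout parameters are arbitrary.
from PIL import Image, ImageDraw

# Pretend these are earlier reasoning steps that would otherwise stay in the text context.
steps = [f"Step {i}: intermediate reasoning about the problem..." for i in range(1, 101)]

line_height, margin = 12, 10
img = Image.new("RGB", (900, margin * 2 + line_height * len(steps)), "white")
draw = ImageDraw.Draw(img)
for i, line in enumerate(steps):
    draw.text((margin, margin + i * line_height), line, fill="black")
img.save("reasoning_history.png")

# The raw text grows with every step, while the rendered image can be fed back
# to a vision encoder as a roughly constant number of visual tokens.
print(f"{sum(len(s.split()) for s in steps)} words rendered into one image")
```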

MMFineReason: Closing the Multimodal Reasoning Gap via Open Data-Centric Methods

Intermediate
Honglin Lin, Zheng Liu et al. · Jan 29 · arXiv

MMFineReason is a huge, open dataset (1.8 million examples, 5.1 billion solution tokens) that teaches AIs to think step by step about pictures and text together.

#multimodal reasoning #vision-language models #chain-of-thought

PROGRESSLM: Towards Progress Reasoning in Vision-Language Models

Intermediate
Jianshu Zhang, Chengxuan Qian et al. · Jan 21 · arXiv

This paper asks a new question for vision-language models: not just 'What do you see?' but 'How far along is the task right now?'

#progress reasoning #vision-language models #episodic retrieval

XR: Cross-Modal Agents for Composed Image Retrieval

Beginner
Zhongyu Yang, Wei Pang et al. · Jan 20 · arXiv

XR is a new, training-free team of AI helpers that finds images using both a reference picture and a short text edit (like “same jacket but red”); a toy ranking sketch of this composed-query idea appears below.

#Composed Image Retrieval #cross-modal reasoning #multi-agent system
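As a rough illustration of composed image retrieval, the sketch below combines a reference-image embedding with a text-edit embedding and ranks a tiny gallery by cosine similarity. The encoders are deterministic placeholder embedders, not XR's agents or any real vision-language model.

```python
# Toy composed-image-retrieval ranking (illustrative only; not XR's multi-agent system).
import hashlib
import numpy as np

DIM = 512

def _pseudo_embed(key: str) -> np.ndarray:
    # Deterministic placeholder embedding seeded from the input string.
    seed = int(hashlib.sha256(key.encode()).hexdigest()[:8], 16)
    return np.random.default_rng(seed).standard_normal(DIM)

def encode_image(image_id: str) -> np.ndarray:
    return _pseudo_embed("img:" + image_id)

def encode_text(text: str) -> np.ndarray:
    return _pseudo_embed("txt:" + text)

def normalize(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v)

# Composed query: reference picture plus a short textual edit.
query = normalize(encode_image("ref_jacket.jpg") + encode_text("same jacket but red"))

# Rank gallery images by cosine similarity to the composed query.
gallery = ["red_jacket.jpg", "blue_coat.jpg", "red_dress.jpg"]
scores = {name: float(query @ normalize(encode_image(name))) for name in gallery}
print(sorted(scores, key=scores.get, reverse=True))
```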

OS-Symphony: A Holistic Framework for Robust and Generalist Computer-Using Agent

Intermediate
Bowen Yang, Kaiming Jin et al. · Jan 12 · arXiv

OS-Symphony addresses two weaknesses of computer-using agents: they forget important visual details over long tasks, and they cannot reliably find up-to-date, step-by-step guidance for unfamiliar apps.

#computer-using agents #vision-language models #milestone memory

AgentOCR: Reimagining Agent History via Optical Self-Compression

Intermediate
Lang Feng, Fuchao Yang et al. · Jan 8 · arXiv

AgentOCR turns an agent’s long text history into pictures so it can remember more using fewer tokens.

#AgentOCR #optical self-compression #visual tokens

FocusUI: Efficient UI Grounding via Position-Preserving Visual Token Selection

Intermediate
Mingyu Ouyang, Kevin Qinghong Lin et al. · Jan 7 · arXiv

FocusUI makes computer-using AI faster while staying accurate by looking only at the important parts of the screen; a toy token-selection sketch appears below.

#UI grounding #vision-language models #visual token pruning
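Below is a minimal sketch of position-preserving visual token selection, the general idea behind pruning screen tokens: score each patch, keep the top-k, and keep them in their original grid order so position information survives. The scores here are random placeholders, not FocusUI's scorer.

```python
# Toy position-preserving token selection (illustrative; not FocusUI's actual method).
import numpy as np

rng = np.random.default_rng(0)
num_patches, dim, keep = 196, 64, 32               # e.g. a 14x14 patch grid, keep 32 tokens

tokens = rng.standard_normal((num_patches, dim))   # visual tokens for one screenshot
scores = rng.random(num_patches)                   # placeholder importance scores

# Keep the k highest-scoring patches, then sort the kept indices so the
# surviving tokens stay in their original screen order (positions preserved).
keep_idx = np.sort(np.argsort(scores)[-keep:])
pruned_tokens = tokens[keep_idx]

print(pruned_tokens.shape)   # (32, 64): far fewer tokens for the language model to read
print(keep_idx[:5])          # original positions of the first few kept patches
```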

What Users Leave Unsaid: Under-Specified Queries Limit Vision-Language Models

Beginner
Dasol Choi, Guijin Son et al. · Jan 7 · arXiv

Real people often ask vague questions with pictures, and today’s vision-language models (VLMs) struggle with them.

#vision-language models #under-specified queries #query explicitation