How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (11)

Tag: #visual grounding

Reinforced Attention Learning

Intermediate
Bangzheng Li, Jianmo Ni et al. · Feb 4 · arXiv

This paper teaches an AI to pay attention better by training its visual focus, not just its words.

#Reinforced Attention Learning · #attention policy · #multimodal LLM
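
As a loose illustration of the general idea of treating attention as a trainable policy (a generic REINFORCE sketch, not the paper's actual method), the toy NumPy example below samples which image region to focus on and reinforces regions that earn a reward. The region count, target region, and reward function are made-up placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 6 image regions, one learnable attention logit per region.
# Region 2 is assumed to be the one that actually answers the question.
logits = np.zeros(6)
target_region = 2
lr = 0.5

def reward(region: int) -> float:
    # Placeholder reward: 1 if the model focused on the useful region, else 0.
    return 1.0 if region == target_region else 0.0

for step in range(200):
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    region = rng.choice(len(logits), p=probs)  # sample where to "look"
    r = reward(region)
    # REINFORCE gradient for a softmax policy: (one-hot(action) - probs) * reward.
    grad = -probs.copy()
    grad[region] += 1.0
    logits += lr * r * grad

probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(np.round(probs, 2))  # attention mass should concentrate on region 2
```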

ObjEmbed: Towards Universal Multimodal Object Embeddings

Intermediate
Shenghao Fu, Yukun Su et al. · Feb 2 · arXiv

ObjEmbed teaches an AI to understand not just whole pictures, but each object inside them, and to link those objects to the right words.

#object embeddings · #IoU embedding · #visual grounding
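
As a generic sketch of linking object embeddings to text (not ObjEmbed's actual model), the snippet below matches each phrase to its nearest object by cosine similarity; the embeddings are random stand-ins for real vision and text encoder outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in embeddings: 4 detected objects and 3 text phrases, 64-d each.
object_embs = rng.normal(size=(4, 64))
phrase_embs = rng.normal(size=(3, 64))

def l2_normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Cosine similarity between every phrase and every object, shape (3, 4).
sims = l2_normalize(phrase_embs) @ l2_normalize(object_embs).T

# Each phrase is linked to the object whose embedding it is closest to.
best_object_per_phrase = sims.argmax(axis=1)
print(best_object_per_phrase)
```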

WorldVQA: Measuring Atomic World Knowledge in Multimodal Large Language Models

Intermediate
Runjie Zhou, Youbo Shao et al. · Jan 28 · arXiv

WorldVQA is a new test that checks if multimodal AI models can correctly name what they see in pictures without doing extra reasoning.

#WorldVQA · #atomic visual knowledge · #multimodal large language models

LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning

Intermediate
Linquan Wu, Tianxiang Jiang et al. · Jan 15 · arXiv

LaViT is a new way to teach smaller vision-language models to look at the right parts of an image before they speak.

#multimodal reasoning · #visual attention · #knowledge distillation

VL-LN Bench: Towards Long-horizon Goal-oriented Navigation with Active Dialogs

Intermediate
Wensi Huang, Shaohao Zhu et al. · Dec 26 · arXiv

Real-life directions are often vague, so the paper creates a task where a robot can ask questions while it searches for a very specific object in a big house.

#embodied AI · #interactive navigation · #instance goal navigation

See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning

Intermediate
Shuoshuo Zhang, Yizhen Zhang et al. · Dec 26 · arXiv

The paper teaches vision-language models (AIs that look and read) to pay attention to the right parts of a picture without needing extra tools when answering.

#BiPS · #perceptual shaping · #vision-language models

GroundingME: Exposing the Visual Grounding Gap in MLLMs through Multi-Dimensional Evaluation

Intermediate
Rang Li, Lei Li et al. · Dec 19 · arXiv

Visual grounding is when an AI finds the exact thing in a picture that a sentence is talking about, and this paper shows today’s big vision-language AIs are not as good at it as we thought.

#visual grounding · #multimodal large language models · #benchmark
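
For context, visual grounding predictions are commonly scored by box overlap (intersection-over-union). The sketch below computes a toy accuracy at an IoU threshold of 0.5, a standard style of metric and not necessarily GroundingME's exact protocol; the boxes are invented examples.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Toy predictions vs. ground truth; a prediction counts as correct if IoU >= 0.5.
predictions  = [(10, 10, 50, 50), (0, 0, 20, 20)]
ground_truth = [(12, 12, 48, 52), (30, 30, 60, 60)]
hits = sum(iou(p, g) >= 0.5 for p, g in zip(predictions, ground_truth))
print(f"grounding accuracy@0.5: {hits / len(predictions):.2f}")
```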

Thinking with Images via Self-Calling Agent

Intermediate
Wenxi Yang, Yuzhong Zhao et al. · Dec 9 · arXiv

This paper teaches a vision-language model to think about images by talking to copies of itself, using only words to plan and decide.

#Self-Calling Chain-of-Thought · #sCoT · #interleaved multimodal chain-of-thought

Unified Video Editing with Temporal Reasoner

Intermediate
Xiangpeng Yang, Ji Xie et al. · Dec 8 · arXiv

VideoCoF is a new way to edit videos that first figures out WHERE to edit and then does the edit, like thinking before acting.

#video editing · #diffusion transformer · #chain-of-frames

VG-Refiner: Towards Tool-Refined Referring Grounded Reasoning via Agentic Reinforcement Learning

Intermediate
Yuji Wang, Wenlong Liu et al. · Dec 6 · arXiv

VG-Refiner is a new way for AI to find the right object in a picture when given a description, even if helper tools make mistakes.

#visual grounding · #referring expression comprehension · #tool-integrated visual reasoning

Self-Improving VLM Judges Without Human Annotations

Intermediate
Inna Wanyin Lin, Yushi Hu et al. · Dec 2 · arXiv

The paper shows how a vision-language model (VLM) can train itself to be a fair judge of answers about images without using any human preference labels.

#vision-language model · #VLM judge · #reward model