🎓How I Study AIHISA
📖Read
📄Papers📰Blogs🎬Courses
💡Learn
🛤️Paths📚Topics💡Concepts🎴Shorts
🎯Practice
🧩Problems🎯Prompts🧠Review
Search
How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers13

AllBeginnerIntermediateAdvanced
All SourcesarXiv
#Vision-Language Models

AdaptMMBench: Benchmarking Adaptive Multimodal Reasoning for Mode Selection and Reasoning Process

Intermediate
Xintong Zhang, Xiaowen Zhang et al.Feb 2arXiv

AdaptMMBench is a new test that checks if AI models know when to just look and think, and when to use extra visual tools like zooming or brightening an image.

#Adaptive Multimodal Reasoning#Vision-Language Models#Tool Invocation

A2Eval: Agentic and Automated Evaluation for Embodied Brain

Intermediate
Shuai Zhang, Jiayu Hu et al.Feb 2arXiv

A2Eval is a two-agent system that automatically builds and runs fair tests for robot-style vision-language models, cutting wasted work while keeping results trustworthy.

#Embodied AI#Vision-Language Models#Agentic Evaluation

Youtu-VL: Unleashing Visual Potential via Unified Vision-Language Supervision

Intermediate
Zhixiang Wei, Yi Li et al.Jan 27arXiv

Youtu-VL is a new kind of vision-language model that learns to predict both words and tiny image pieces, not just words.

#Vision-Language Models#Unified Autoregressive Supervision#Visual Tokenization

Render-of-Thought: Rendering Textual Chain-of-Thought as Images for Visual Latent Reasoning

Intermediate
Yifan Wang, Shiyu Li et al.Jan 21arXiv

Render-of-Thought (RoT) turns the model’s step-by-step thinking from long text into slim images so the model can think faster with fewer tokens.

#Render-of-Thought#Chain-of-Thought#Latent Reasoning

ChartVerse: Scaling Chart Reasoning via Reliable Programmatic Synthesis from Scratch

Intermediate
Zheng Liu, Honglin Lin et al.Jan 20arXiv

ChartVerse is a new way to make lots of tricky, realistic charts and perfectly checked questions so AI can learn to read charts better.

#Chart reasoning#Vision-Language Models#Rollout Posterior Entropy

CoV: Chain-of-View Prompting for Spatial Reasoning

Intermediate
Haoyu Zhao, Akide Liu et al.Jan 8arXiv

This paper teaches AI to look around a 3D place step by step, instead of staring at a fixed set of pictures, so it can answer tricky spatial questions better.

#Chain-of-View Prompting#Embodied Question Answering#Active Viewpoint Reasoning

Aligning Text, Code, and Vision: A Multi-Objective Reinforcement Learning Framework for Text-to-Visualization

Intermediate
Mizanur Rahman, Mohammed Saidul Islam et al.Jan 8arXiv

This paper teaches a model to turn a question about a table into both a short answer and a clear, correct chart.

#Text-to-Visualization#Reinforcement Learning#GRPO

CPPO: Contrastive Perception for Vision Language Policy Optimization

Intermediate
Ahmad Rezaei, Mohsen Gholami et al.Jan 1arXiv

CPPO is a new way to fine‑tune vision‑language models so they see pictures more accurately before they start to reason.

#CPPO#Contrastive Perception Loss#Vision-Language Models

Forging Spatial Intelligence: A Roadmap of Multi-Modal Data Pre-Training for Autonomous Systems

Beginner
Song Wang, Lingdong Kong et al.Dec 30arXiv

Robots like cars and drones see the world with many different sensors (cameras, LiDAR, radar, and even event cameras), and this paper shows a clear roadmap for teaching them to understand space by learning from all of these together.

#Spatial Intelligence#Multi-Modal Pre-Training#Self-Supervised Learning

QuantiPhy: A Quantitative Benchmark Evaluating Physical Reasoning Abilities of Vision-Language Models

Intermediate
Li Puyin, Tiange Xiang et al.Dec 22arXiv

QuantiPhy is a new test that checks if AI models can measure real-world physics from videos using numbers, not guesses.

#QuantiPhy#Vision-Language Models#Physical reasoning

Toward Ambulatory Vision: Learning Visually-Grounded Active View Selection

Intermediate
Juil Koo, Daehyeon Choi et al.Dec 15arXiv

This paper teaches robots to move their camera to a better spot before answering a question about what they see.

#Active Perception#Embodied AI#Vision-Language Models

ReVSeg: Incentivizing the Reasoning Chain for Video Segmentation with Reinforcement Learning

Intermediate
Yifan Li, Yingda Yin et al.Dec 2arXiv

ReVSeg teaches an AI to segment objects in videos by thinking step-by-step instead of guessing everything at once.

#Reasoning Video Object Segmentation#Vision-Language Models#Temporal Grounding
12