How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (15)

Filtered by tag: #multimodal reasoning

Length-Unbiased Sequence Policy Optimization: Revealing and Controlling Response Length Variation in RLVR

Intermediate
Fanfan Liu, Youyang Yin et al. · Feb 5 · arXiv

The paper discovers that popular RLVR methods for training language and vision-language models secretly prefer certain answer lengths, which can hurt learning.

#LUSPO #RLVR #GRPO
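The length bias this paper studies can be sketched with a toy example. This is not the paper's LUSPO algorithm; the function names and the constant-normalizer variant are illustrative assumptions. It shows a known side effect of GRPO-style losses: averaging over each response's own length gives every token in a short response proportionally more gradient weight than a token in a long one.

```python
# Toy sketch (assumed, not the paper's method): how per-response length
# normalization biases RLVR-style policy-gradient updates by length.
# For simplicity, every token shares one scalar advantage.

def grpo_style_loss(token_logps, advantage):
    """Average advantage-weighted log-probs over the response's OWN
    length (GRPO-style): each token in a short response gets a
    larger per-token weight."""
    n = len(token_logps)
    return sum(-advantage * lp / n for lp in token_logps)

def length_unbiased_loss(token_logps, advantage, norm_len):
    """Normalize by a shared constant instead, so every token
    contributes the same weight regardless of response length."""
    return sum(-advantage * lp / norm_len for lp in token_logps)

short = [-0.5] * 10    # 10-token response, average log-prob -0.5
long  = [-0.5] * 100   # 100-token response, same average log-prob

# Per-response normalization: both responses produce the SAME total
# loss, so each short-response token carries 10x the gradient.
assert abs(grpo_style_loss(short, 1.0) - grpo_style_loss(long, 1.0)) < 1e-9

# Shared normalizer: total loss scales with length, so every token
# carries equal weight.
assert abs(length_unbiased_loss(long, 1.0, 100)
           - 10 * length_unbiased_loss(short, 1.0, 100)) < 1e-9
```

Under the first scheme, whether short or long answers are favored depends on the sign of the advantage, which is the kind of hidden length preference the paper sets out to reveal and control.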

Thinking with Comics: Enhancing Multimodal Reasoning through Structured Visual Storytelling

Intermediate
Andong Chen, Wenxin Zhu et al. · Feb 2 · arXiv

This paper shows that comics (multi-panel pictures with words) can help AI think through problems step by step, just like a student explains their work.

#multimodal reasoning #visual storytelling #comics

RANKVIDEO: Reasoning Reranking for Text-to-Video Retrieval

Intermediate
Tyler Skow, Alexander Martin et al. · Feb 2 · arXiv

RANKVIDEO is a video-native reasoning reranker that helps search engines find the right videos for a text query by directly looking at the video’s visuals and audio, not just text captions.

#text-to-video retrieval #video-native reranking #multimodal reasoning

Mind-Brush: Integrating Agentic Cognitive Search and Reasoning into Image Generation

Intermediate
Jun He, Junyan Ye et al. · Feb 2 · arXiv

Mind-Brush turns image generation from a one-step 'read the prompt and draw' into a multi-step 'think, research, and create' process.

#agentic image generation #multimodal reasoning #retrieval-augmented generation

Research on World Models Is Not Merely Injecting World Knowledge into Specific Tasks

Intermediate
Bohan Zeng, Kaixin Zhu et al. · Feb 2 · arXiv

This paper argues that true world models are not just sprinkling facts into single tasks, but building a unified system that can see, think, remember, act, and generate across many situations.

#world models #unified framework #multimodal reasoning

MMFineReason: Closing the Multimodal Reasoning Gap via Open Data-Centric Methods

Intermediate
Honglin Lin, Zheng Liu et al. · Jan 29 · arXiv

MMFineReason is a huge, open dataset (1.8 million examples, 5.1 billion solution tokens) that teaches AIs to think step by step about pictures and text together.

#multimodal reasoning #vision-language models #chain-of-thought

LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning

Intermediate
Linquan Wu, Tianxiang Jiang et al. · Jan 15 · arXiv

LaViT is a new way to teach smaller vision-language models to look at the right parts of an image before they speak.

#multimodal reasoning #visual attention #knowledge distillation

Omni-R1: Towards the Unified Generative Paradigm for Multimodal Reasoning

Intermediate
Dongjie Cheng, Yongqi Li et al. · Jan 14 · arXiv

Omni-R1 teaches AI to think with pictures and words at the same time by drawing helpful mini-images while reasoning.

#multimodal reasoning #interleaved generation #functional image generation

Watching, Reasoning, and Searching: A Video Deep Research Benchmark on Open Web for Agentic Video Reasoning

Intermediate
Chengwen Liu, Xiaomin Yu et al. · Jan 11 · arXiv

VideoDR is a new benchmark that tests if AI can watch a video, pull out key visual clues, search the open web, and chain the clues together to find one verifiable answer.

#video deep research #multimodal reasoning #open-domain question answering

See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning

Intermediate
Shuoshuo Zhang, Yizhen Zhang et al. · Dec 26 · arXiv

The paper teaches vision-language models (AIs that look and read) to pay attention to the right parts of a picture without needing extra tools when answering.

#BiPS #perceptual shaping #vision-language models

LongVideoAgent: Multi-Agent Reasoning with Long Videos

Intermediate
Runtao Liu, Ziyi Liu et al. · Dec 23 · arXiv

LongVideoAgent is a team of three AIs that work together to answer questions about hour‑long TV episodes without missing small details.

#long video question answering #multi-agent reasoning #temporal grounding

InSight-o3: Empowering Multimodal Foundation Models with Generalized Visual Search

Intermediate
Kaican Li, Lewei Yao et al. · Dec 21 · arXiv

This paper builds a tough new test called O3-BENCH to check if AI can truly think with images, not just spot objects.

#multimodal reasoning #generalized visual search #reinforcement learning