Papers20

#multimodal reasoning

Phi-4-reasoning-vision-15B Technical Report

Jyoti Aneja, Michael Harrison et al.Mar 4arXiv

Phi-4-reasoning-vision-15B is a small, open-weight AI that understands pictures and text together and is especially good at math, science, and using computer screens.

#multimodal reasoning#vision-language model#mid-fusion

Not triaged yet

AgilePruner: An Empirical Study of Attention and Diversity for Adaptive Visual Token Pruning in Large Vision-Language Models

Intermediate

Changwoo Baek, Jouwon Song et al.Mar 1arXiv

Big picture: Vision-language models look at hundreds of image pieces (tokens), which makes them slow and sometimes chatty with mistakes called hallucinations.

#visual token pruning#attention-based pruning#diversity-based pruning

Not triaged yet

From Statics to Dynamics: Physics-Aware Image Editing with Latent Transition Priors

Intermediate

Liangbing Zhao, Le Zhuo et al.Feb 25arXiv

The paper turns image editing from a one-step “before → after” trick into a mini physics simulation that follows real-world rules.

#physics-aware image editing#physical state transition#latent transition priors

Not triaged yet

PyVision-RL: Forging Open Agentic Vision Models via RL

Intermediate

Shitian Zhao, Shaoheng Lin et al.Feb 24arXiv

PyVision-RL teaches vision-language models to act like curious agents that think in multiple steps and use Python tools to inspect images and videos.

#agentic multimodal models#reinforcement learning#dynamic tooling

Not triaged yet

DeepVision-103K: A Visually Diverse, Broad-Coverage, and Verifiable Mathematical Dataset for Multimodal Reasoning

Intermediate

Haoxiang Sun, Lizhen Xu et al.Feb 18arXiv

DeepVision-103K is a new 103,000-example picture-and-text math dataset designed to help AI think better using rewards that can be checked automatically.

#DeepVision-103K#multimodal reasoning#RLVR

Not triaged yet

Length-Unbiased Sequence Policy Optimization: Revealing and Controlling Response Length Variation in RLVR

Intermediate

Fanfan Liu, Youyang Yin et al.Feb 5arXiv

The paper discovers that popular RLVR methods for training language and vision-language models secretly prefer certain answer lengths, which can hurt learning.

#LUSPO#RLVR#GRPO

Not triaged yet

Thinking with Comics: Enhancing Multimodal Reasoning through Structured Visual Storytelling

Intermediate

Andong Chen, Wenxin Zhu et al.Feb 2arXiv

This paper shows that comics (multi-panel pictures with words) can help AI think through problems step by step, just like a student explains their work.

#multimodal reasoning#visual storytelling#comics

Not triaged yet

RANKVIDEO: Reasoning Reranking for Text-to-Video Retrieval

Intermediate

Tyler Skow, Alexander Martin et al.Feb 2arXiv

RANKVIDEO is a video-native reasoning reranker that helps search engines find the right videos for a text query by directly looking at the video’s visuals and audio, not just text captions.

#text-to-video retrieval#video-native reranking#multimodal reasoning

Not triaged yet

Mind-Brush: Integrating Agentic Cognitive Search and Reasoning into Image Generation

Intermediate

Jun He, Junyan Ye et al.Feb 2arXiv

Mind-Brush turns image generation from a one-step 'read the prompt and draw' into a multi-step 'think, research, and create' process.

#agentic image generation#multimodal reasoning#retrieval-augmented generation

Not triaged yet

Research on World Models Is Not Merely Injecting World Knowledge into Specific Tasks

Intermediate

Bohan Zeng, Kaixin Zhu et al.Feb 2arXiv

This paper argues that true world models are not just sprinkling facts into single tasks, but building a unified system that can see, think, remember, act, and generate across many situations.

#world models#unified framework#multimodal reasoning

Not triaged yet

MMFineReason: Closing the Multimodal Reasoning Gap via Open Data-Centric Methods

Intermediate

Honglin Lin, Zheng Liu et al.Jan 29arXiv

MMFineReason is a huge, open dataset (1.8 million examples, 5.1 billion solution tokens) that teaches AIs to think step by step about pictures and text together.

#multimodal reasoning#vision-language models#chain-of-thought

Not triaged yet

LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning

Intermediate

Linquan Wu, Tianxiang Jiang et al.Jan 15arXiv

LaViT is a new way to teach smaller vision-language models to look at the right parts of an image before they speak.

#multimodal reasoning#visual attention#knowledge distillation

Not triaged yet

1 2