Phi-4-reasoning-vision-15B is a small, open-weight AI model that understands pictures and text together and is especially good at math, science, and working with computer screens.
MMR-Life is a new benchmark (test) that checks how well AI understands everyday situations by reasoning over several real photos at once.
Big picture: vision-language models break each image into hundreds of pieces (tokens), which makes them slow and prone to making things up, a failure mode called hallucination.
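To make that token count concrete, here is a quick back-of-the-envelope sketch of how a typical vision transformer tokenizes an image; the specific sizes are common defaults chosen for illustration, not figures from the paper.

```python
# How one image becomes hundreds of tokens in a typical vision transformer.
# The sizes below (336x336 image, 14x14 patches) are common defaults used
# purely for illustration; they are not numbers taken from this paper.

image_size = 336   # pixels per side
patch_size = 14    # each square patch becomes one token

tokens_per_side = image_size // patch_size   # 24
num_image_tokens = tokens_per_side ** 2      # 576 tokens for a single image

print(f"{num_image_tokens} image tokens")    # -> 576 image tokens
```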
The paper turns image editing from a one-step “before → after” trick into a mini physics simulation that follows real-world rules.
PyVision-RL teaches vision-language models to act like curious agents that think in multiple steps and use Python tools to inspect images and videos.
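The general pattern here, a model that writes and runs small Python snippets to examine an image before answering, can be sketched as below. This is a minimal generic illustration, not PyVision-RL's actual code, and `ask_model` is a hypothetical stand-in for a real model call.

```python
# Minimal sketch of the "inspect with Python tools" loop: the agent crops and
# zooms into an image region, then answers with the extra evidence in hand.
from PIL import Image

def crop_and_zoom(image: Image.Image, box: tuple, scale: int = 2) -> Image.Image:
    """A tool the agent can call: crop a region and enlarge it for a closer look."""
    region = image.crop(box)
    return region.resize((region.width * scale, region.height * scale))

def ask_model(question: str, images: list) -> str:
    """Hypothetical stand-in for a real vision-language model call."""
    return f"(answer to {question!r} using {len(images)} image views)"

def agent_answer(image_path: str, question: str) -> str:
    image = Image.open(image_path)
    roi = (100, 100, 300, 300)            # the model would propose this region
    zoomed = crop_and_zoom(image, roi)    # run the tool, look closer
    return ask_model(question, [image, zoomed])
```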
DeepVision-103K is a new picture-and-text math dataset with 103,000 examples, designed to help AI reason better by training it with rewards that can be checked automatically.
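"Rewards that can be checked automatically" usually means a rule-based verifier rather than a learned judge. Below is a minimal generic sketch; the `\boxed{...}` answer format is a common convention assumed here, not necessarily the dataset's actual grading code.

```python
# Minimal sketch of a verifiable reward: extract the model's final answer and
# compare it to the ground truth with a rule; no learned judge is needed.
import re

def verifiable_reward(model_output: str, ground_truth: str) -> float:
    match = re.search(r"\\boxed\{([^}]*)\}", model_output)
    if match is None:
        return 0.0  # no parseable answer -> no reward
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

print(verifiable_reward(r"So the area is \boxed{42}.", "42"))  # -> 1.0
```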
The paper fixes a common problem in AI: models can read pictures and text well, but they often get the logic connecting them wrong.
Modern image editors can now follow visual prompts like arrows and scribbles, which opens a new way for attackers to hide harmful instructions inside images.
The paper discovers that popular RLVR (reinforcement learning with verifiable rewards) methods for training language and vision-language models have a hidden bias toward certain answer lengths, which can hurt learning.
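One well-documented way such a bias can creep in (shown here as a toy illustration of the general mechanism, not necessarily this paper's specific finding) is per-response length normalization: dividing each response's summed token loss by its own length shrinks the per-token penalty on long wrong answers, quietly nudging the model toward certain lengths.

```python
# Toy numbers showing how 1/length normalization weights a response's reward.
# Values are made up for demonstration only.

def per_token_signal(reward: float, length: int) -> float:
    """Each token's share of the response-level reward under 1/length scaling."""
    return reward / length

short_wrong = per_token_signal(reward=-1.0, length=10)    # -0.1 per token
long_wrong  = per_token_signal(reward=-1.0, length=500)   # -0.002 per token

# The 500-token wrong answer is penalized 50x less per token than the
# 10-token one, so rambling when unsure is implicitly encouraged.
print(short_wrong, long_wrong)
```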
This paper shows that comics (multi-panel pictures with words) can help AI think through problems step by step, much like a student showing their work.
RANKVIDEO is a video-native reasoning reranker that helps search engines find the right videos for a text query by directly looking at the video’s visuals and audio, not just text captions.
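Whatever the model internals, a reranker's interface is simple: score each candidate against the query and sort. Here is a hedged sketch of that shape, with `score_video` as a hypothetical placeholder for a model that consumes raw frames and audio rather than captions.

```python
# Generic reranking loop: score every candidate video for the query, then
# sort by score. The scorer below is a placeholder, not a real model.

def score_video(query: str, video_path: str) -> float:
    """Hypothetical relevance scorer standing in for a multimodal model."""
    return (hash((query, video_path)) % 1000) / 1000.0  # placeholder score

def rerank(query: str, candidates: list[str]) -> list[tuple[str, float]]:
    scored = [(path, score_video(query, path)) for path in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

print(rerank("how to tie a bowline knot", ["vid_a.mp4", "vid_b.mp4"]))
```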
Mind-Brush turns image generation from a one-step “read the prompt and draw” task into a multi-step “think, research, and create” process.