Phi-4-reasoning-vision-15B is a small, open-weight AI that understands pictures and text together and is especially good at math, science, and understanding what's on a computer screen.
Large multimodal models (LMMs) can look at pictures and read text, but they still miss tricky cases, like tiny chart labels or multi-step math.
OCR is like copying a page exactly as it is; because the correct output is already fixed by the image, the text can be predicted all at once in parallel instead of one word at a time, which makes OCR a great fit for fast, parallel generation.
This paper shows a simple, repeatable way to teach general Vision-Language Models (VLMs) to understand e-commerce items much better without forgetting their general skills.
The paper fixes a common problem in AI: models can read pictures and text well, but they often mess up the logic behind them.
AdaptMMBench is a new test that checks if AI models know when to just look and think, and when to use extra visual tools like zooming or brightening an image.
Kimi K2.5 is a new open-source AI that can read both text and visuals (images and videos) and act like a team of helpers to finish big tasks faster.
AdaReasoner teaches AI to pick the right visual tools, use them in the right order, and stop using them when they aren’t helping.
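For intuition, here is a minimal sketch of that pick-a-tool, try-it, stop-when-it-stops-helping loop. The tool set, the `model.answer` / `model.pick_tool` interface, and the confidence-based stopping rule are all illustrative assumptions, not AdaReasoner's actual design.

```python
# Illustrative sketch of an adaptive visual-tool loop (assumed interface,
# not AdaReasoner's real implementation).
from typing import Callable, Optional

Tool = Callable[[object], object]  # takes an image, returns a transformed image

def adaptive_reason(image, question: str, model, tools: dict[str, Tool],
                    max_steps: int = 3) -> str:
    answer, confidence = model.answer(image, question)
    for _ in range(max_steps):
        # Ask the model which tool (if any) would help next.
        choice: Optional[str] = model.pick_tool(image, question, list(tools))
        if choice is None:          # model decides plain looking is enough
            break
        candidate = tools[choice](image)   # e.g. zoom or brighten
        new_answer, new_conf = model.answer(candidate, question)
        if new_conf <= confidence:  # the tool isn't helping; stop using it
            break
        image, answer, confidence = candidate, new_answer, new_conf
    return answer
```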
Long texts are expensive for AI to read because each extra token costs a lot of compute and memory.
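As a rough worked example of the memory side of that cost, a transformer keeps a key-value (KV) cache entry for every token it has read, so the cache grows linearly with text length. The model dimensions below are illustrative assumptions, not numbers from the paper.

```python
# Back-of-the-envelope KV-cache memory for a long input.
# All model dimensions are illustrative assumptions.

def kv_cache_bytes(seq_len: int,
                   n_layers: int = 32,
                   n_kv_heads: int = 8,
                   head_dim: int = 128,
                   bytes_per_value: int = 2) -> int:
    """Memory needed to cache keys and values for seq_len tokens."""
    # 2x for keys and values, stored at every layer for every token.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * seq_len

for tokens in (4_096, 32_768, 131_072):
    gib = kv_cache_bytes(tokens) / 2**30
    print(f"{tokens:>7} tokens -> {gib:5.1f} GiB of KV cache")
# prints 0.5, 4.0, and 16.0 GiB respectively
```

Compute grows even faster than memory, since attention compares each new token against all the earlier ones.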
This paper teaches a vision-language model to think about images by talking to copies of itself, using only words to plan and decide.
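A minimal sketch of what such a words-only self-dialogue loop might look like, assuming a generic chat-style VLM client; the `vlm.ask` call and the `FINAL:` convention are hypothetical stand-ins, not the paper's actual protocol.

```python
# Sketch of a self-dialogue loop where "copies" of one VLM share only text.
# vlm.ask() is an assumed chat-style interface, not the paper's real API.

def self_dialogue(vlm, image, question: str, max_rounds: int = 3) -> str:
    notes: list[str] = []  # the copies exchange only words, never new pixels
    for _ in range(max_rounds):
        transcript = "\n".join(notes)
        # Each round, a fresh "copy" sees the image plus the notes so far
        # and either adds an observation or commits to an answer.
        reply = vlm.ask(
            image=image,
            prompt=(f"Question: {question}\nNotes so far:\n{transcript}\n"
                    "Add one new observation, or reply 'FINAL: <answer>'."),
        )
        if reply.startswith("FINAL:"):
            return reply.removeprefix("FINAL:").strip()
        notes.append(reply)
    # If no round committed to an answer, force a decision from the notes.
    return vlm.ask(image=image,
                   prompt=f"Question: {question}\nNotes:\n" + "\n".join(notes))
```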