Large multimodal models (LMMs) can look at pictures and read text, but they still stumble on tricky cases like tiny chart labels or multi-step math problems.
DeepVision-103K is a new dataset of 103,000 image-and-text math problems, designed to help models reason better by using rewards that can be checked automatically.
The catch is that these models usually do most of their thinking in words, which limits deep visual reasoning.
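
To make "rewards that can be checked automatically" concrete, here is a minimal sketch of a rule-based verifier: it pulls the model's final answer out of a response and compares it to the ground-truth answer stored with each problem. The function names and the boxed-answer convention are illustrative assumptions, not DeepVision-103K's actual schema.

```python
import re


def extract_final_answer(response: str) -> str:
    """Return the text inside the last \\boxed{...} if present, else the last non-empty line."""
    boxed = re.findall(r"\\boxed\{([^{}]*)\}", response)
    if boxed:
        return boxed[-1].strip()
    lines = [line.strip() for line in response.strip().splitlines() if line.strip()]
    return lines[-1] if lines else ""


def verifiable_reward(response: str, ground_truth: str) -> float:
    """Give reward 1.0 when the extracted answer exactly matches the ground truth, else 0.0."""
    return 1.0 if extract_final_answer(response) == ground_truth.strip() else 0.0


# Example: a chart-reading question whose ground-truth answer is "42".
print(verifiable_reward("The bar for 2021 reads 42, so the answer is \\boxed{42}", "42"))  # 1.0
```

Because the check is a simple string comparison rather than a human judgment, it can score millions of model responses cheaply and consistently, which is what makes this style of reward practical for training.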