GutenOCR turns a general vision-language model into a single, smart OCR front-end that can read, find, and point to text on a page using simple prompts.
STEP3-VL-10B is a small (10-billion-parameter) open multimodal model that sees images and reads text, yet scores on par with much larger models.
This paper builds a tough new test called O3-BENCH to check whether AI can truly think with images, not just spot objects.
Long texts are expensive for AI to read because every additional token adds to the compute and memory the model needs.