EgoActor is a vision-language model that turns everyday instructions like 'Go to the door and say hi' into step-by-step, egocentric actions a humanoid robot can actually do.
This paper shows a new way to predict what a phone screen will look like after you tap or scroll: generate web code (like HTML/CSS/SVG) and then render it to pixels.
This paper teaches AI to copy the hidden idea inside a picture (a visual metaphor) and reuse that idea on a brand‑new subject.
This paper upgrades a small but mighty vision-language model called PaddleOCR-VL-1.5 to read and understand real-world, messy documents better than any model before it.
MemOCR is a new way for AI to remember long histories by turning important notes into a picture with big, bold parts for key facts and tiny parts for details.
GutenOCR turns a general vision-language model into a single, smart OCR front-end that can read, find, and point to text on a page using simple prompts.
LightOnOCR-2-1B is a single, compact AI model that reads PDF pages and scans and turns them into clean, well-ordered text without using fragile multi-step OCR pipelines.
FOFPred is a new AI that reads one or two images plus a short instruction like “move the bottle left to right,” and then predicts how every pixel will move in the next moments.
Molmo2 is a family of vision-language models that can watch videos, understand them, and point to or track things over time using fully open weights, data, and code.
Cities are full of places defined by people, like schools and parks, which are hard to see clearly from space without extra clues.
OpenVoxel is a training-free way to understand 3D scenes by grouping tiny 3D blocks (voxels) into objects and giving each object a clear caption.
SkinFlow is a 7B-parameter vision–language model that diagnoses skin conditions by sending the most useful visual information to the language brain, instead of just getting bigger.