Phi-4-reasoning-vision-15B is a small, open-weight AI that understands pictures and text together and is especially good at math, science, and using computer screens.
FireRed-OCR turns a general vision-language model into a careful document reader that follows strict rules, so its outputs are usable in the real world.
Text-to-image models can make pretty pictures but still miss details in complex prompts, like counts, positions, or exact text.
This paper teaches image generators to place objects in the right spots by building a special teacher called a reward model focused on spatial relationships.
This paper builds a giant, automatically made video library called SVG2 that tells who is in a video, what they look like, and how they interact over time.
MediX-R1 teaches medical AI models to give clear, free-form answers (not just A, B, C, or D) and to explain their thinking.
Modern image generators can still make strange mistakes like extra fingers or melted faces, and today’s vision-language models (VLMs) often miss them.
Mobile-O is a small but smart AI that can both understand pictures and make new images, and it runs right on your phone.
The paper builds a Computer-Using World Model (CUWM) that lets an AI “imagine” what a desktop app (like Word/Excel/PowerPoint) will look like after a click or keystroke—before doing it for real.
EgoActor is a vision-language model that turns everyday instructions like 'Go to the door and say hi' into step-by-step, egocentric actions a humanoid robot can actually do.
This paper shows a new way to predict what a phone screen will look like after you tap or scroll: generate web code (like HTML/CSS/SVG) and then render it to pixels.
This paper teaches AI to copy the hidden idea inside a picture (a visual metaphor) and reuse that idea on a brand‑new subject.