Papers (29)

#vision-language model

Phi-4-reasoning-vision-15B Technical Report

Jyoti Aneja, Michael Harrison et al. · Mar 4 · arXiv

Phi-4-reasoning-vision-15B is a small, open-weight AI that understands pictures and text together and is especially good at math, science, and using computer screens.

#multimodal reasoning #vision-language model #mid-fusion

FireRed-OCR Technical Report

Intermediate

Hao Wu, Haoran Lou et al. · Mar 2 · arXiv

FireRed-OCR turns a general vision-language model into a careful document reader that follows strict rules, so its outputs are usable in the real world.

#FireRed-OCR #structural hallucination #document parsing

RAISE: Requirement-Adaptive Evolutionary Refinement for Training-Free Text-to-Image Alignment

Intermediate

Liyao Jiang, Ruichen Chen et al. · Feb 28 · arXiv

Text-to-image models can make pretty pictures but still miss details in complex prompts, like counts, positions, or exact text.

#text-to-image alignment #adaptive inference #evolutionary refinement

Enhancing Spatial Understanding in Image Generation via Reward Modeling

Intermediate

Zhenyu Tang, Chaoran Feng et al. · Feb 27 · arXiv

This paper teaches image generators to place objects in the right spots by building a special teacher called a reward model focused on spatial relationships.

#spatial reasoning #reward modeling #preference learning

See and Fix the Flaws: Enabling VLMs and Diffusion Models to Comprehend Visual Artifacts via Agentic Data Synthesis

Intermediate

Jaehyun Park, Minyoung Ahn et al. · Feb 24 · arXiv

Modern image generators can still make strange mistakes like extra fingers or melted faces, and today’s vision-language models (VLMs) often miss them.

#visual artifacts #structural artifacts #diffusion transformer

Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device

Intermediate

Abdelrahman Shaker, Ahmed Heakl et al. · Feb 23 · arXiv

Mobile-O is a small but smart AI that can both understand pictures and make new images, and it runs right on your phone.

#Mobile-O #unified multimodal model #on-device AI

Computer-Using World Model

Intermediate

Yiming Guan, Rui Yu et al. · Feb 19 · arXiv

The paper builds a Computer-Using World Model (CUWM) that lets an AI “imagine” what a desktop app (like Word/Excel/PowerPoint) will look like after a click or keystroke—before doing it for real.

#world model #GUI agent #desktop automation

EgoActor: Grounding Task Planning into Spatial-aware Egocentric Actions for Humanoid Robots via Visual-Language Models

Intermediate

Yu Bai, MingMing Yu et al. · Feb 4 · arXiv

EgoActor is a vision-language model that turns everyday instructions like 'Go to the door and say hi' into step-by-step, egocentric actions a humanoid robot can actually do.

#EgoActing #vision-language model #humanoid robot

Generative Visual Code Mobile World Models

Intermediate

Woosung Koh, Sungjun Han et al. · Feb 2 · arXiv

This paper shows a new way to predict what a phone screen will look like after you tap or scroll: generate web code (like HTML/CSS/SVG) and then render it to pixels.

#mobile GUI #world model #vision-language model

Beyond Pixels: Visual Metaphor Transfer via Schema-Driven Agentic Reasoning

Intermediate

Yu Xu, Yuxin Zhang et al. · Feb 1 · arXiv

This paper teaches AI to copy the hidden idea inside a picture (a visual metaphor) and reuse that idea on a brand‑new subject.

#visual metaphor #metaphor transfer #schema grammar

MemOCR: Layout-Aware Visual Memory for Efficient Long-Horizon Reasoning

Intermediate

Yaorui Shi, Shugui Liu et al. · Jan 29 · arXiv

MemOCR is a new way for AI to remember long histories by turning important notes into a picture with big, bold parts for key facts and tiny parts for details.

#MemOCR #visual memory #adaptive information density

GutenOCR: A Grounded Vision-Language Front-End for Documents

Intermediate

Hunter Heidenreich, Ben Elliott et al. · Jan 20 · arXiv

GutenOCR turns a general vision-language model into a single, smart OCR front-end that can read, find, and point to text on a page using simple prompts.

#grounded OCR #vision-language model #document understanding
