How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (33)

#vision-language model

Phi-4-reasoning-vision-15B Technical Report

Intermediate
Jyoti Aneja, Michael Harrison et al. · Mar 4 · arXiv

Phi-4-reasoning-vision-15B is a small, open-weight AI model that understands pictures and text together and is especially good at math, science, and using computer screens.

#multimodal reasoning #vision-language model #mid-fusion

FireRed-OCR Technical Report

Intermediate
Hao Wu, Haoran Lou et al. · Mar 2 · arXiv

FireRed-OCR turns a general vision-language model into a careful document reader that follows strict rules, so its outputs are usable in the real world.

#FireRed-OCR #structural hallucination #document parsing

RAISE: Requirement-Adaptive Evolutionary Refinement for Training-Free Text-to-Image Alignment

Intermediate
Liyao Jiang, Ruichen Chen et al. · Feb 28 · arXiv

Text-to-image models can make pretty pictures but still miss details in complex prompts, like counts, positions, or exact text.

#text-to-image alignment #adaptive inference #evolutionary refinement

Enhancing Spatial Understanding in Image Generation via Reward Modeling

Intermediate
Zhenyu Tang, Chaoran Feng et al. · Feb 27 · arXiv

This paper teaches image generators to place objects in the right spots by building a special teacher called a reward model focused on spatial relationships.

#spatial reasoning #reward modeling #preference learning

Synthetic Visual Genome 2: Extracting Large-scale Spatio-Temporal Scene Graphs from Videos

Beginner
Ziqi Gao, Jieyu Zhang et al. · Feb 26 · arXiv

This paper builds a giant, automatically made video library called SVG2 that tells who is in a video, what they look like, and how they interact over time.

#video scene graph #spatio-temporal reasoning #panoptic segmentation

MediX-R1: Open Ended Medical Reinforcement Learning

Beginner
Sahal Shaji Mullappilly, Mohammed Irfan Kurpath et al. · Feb 26 · arXiv

MediX-R1 teaches medical AI models to give clear, free-form answers (not just A, B, C, or D) and to explain their thinking.

#medical multimodal RL #open-ended reinforcement learning #composite reward

See and Fix the Flaws: Enabling VLMs and Diffusion Models to Comprehend Visual Artifacts via Agentic Data Synthesis

Intermediate
Jaehyun Park, Minyoung Ahn et al. · Feb 24 · arXiv

Modern image generators can still make strange mistakes like extra fingers or melted faces, and today’s vision-language models (VLMs) often miss them.

#visual artifacts #structural artifacts #diffusion transformer

Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device

Intermediate
Abdelrahman Shaker, Ahmed Heakl et al. · Feb 23 · arXiv

Mobile-O is a small but smart AI that can both understand pictures and make new images, and it runs right on your phone.

#Mobile-O #unified multimodal model #on-device AI

Computer-Using World Model

Intermediate
Yiming Guan, Rui Yu et al. · Feb 19 · arXiv

The paper builds a Computer-Using World Model (CUWM) that lets an AI “imagine” what a desktop app (like Word/Excel/PowerPoint) will look like after a click or keystroke, before doing it for real.

#world model #GUI agent #desktop automation

EgoActor: Grounding Task Planning into Spatial-aware Egocentric Actions for Humanoid Robots via Visual-Language Models

Intermediate
Yu Bai, MingMing Yu et al. · Feb 4 · arXiv

EgoActor is a vision-language model that turns everyday instructions like 'Go to the door and say hi' into step-by-step, egocentric actions a humanoid robot can actually do.

#EgoActing #vision-language model #humanoid robot

Generative Visual Code Mobile World Models

Intermediate
Woosung Koh, Sungjun Han et al. · Feb 2 · arXiv

This paper shows a new way to predict what a phone screen will look like after you tap or scroll: generate web code (like HTML/CSS/SVG) and then render it to pixels.

#mobile GUI #world model #vision-language model

Beyond Pixels: Visual Metaphor Transfer via Schema-Driven Agentic Reasoning

Intermediate
Yu Xu, Yuxin Zhang et al. · Feb 1 · arXiv

This paper teaches AI to copy the hidden idea inside a picture (a visual metaphor) and reuse that idea on a brand‑new subject.

#visual metaphor #metaphor transfer #schema grammar