How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (8)

Tag: #vision-language models

From Perception to Action: An Interactive Benchmark for Vision Reasoning

Beginner
Yuhao Wu, Maojia Song et al. · Feb 24 · arXiv

The paper introduces CHAIN, a hands-on 3D playground that tests whether AI can not only see objects but also plan and act under real physics.

#interactive benchmark · #vision-language models · #physical reasoning

DODO: Discrete OCR Diffusion Models

Beginner
Sean Man, Roy Ganz et al. · Feb 18 · arXiv

OCR means reading a page exactly as it is, and that strictness makes it a natural fit for fast, parallel generation with discrete diffusion.

#OCR · #vision-language models · #discrete diffusion

NarraScore: Bridging Visual Narrative and Musical Dynamics via Hierarchical Affective Control

Beginner
Yufan Wen, Zhaocheng Liu et al. · Feb 9 · arXiv

NarraScore turns a video's changing story into a matching soundtrack by using emotion as the bridge.

#video-to-music generation · #affective computing · #valence-arousal

Kimi K2.5: Visual Agentic Intelligence

Beginner
Kimi Team, Tongtong Bai et al. · Feb 2 · arXiv

Kimi K2.5 is a new open-source AI that can read both text and visuals (images and videos) and works like a team of helpers to finish big tasks faster.

#multimodal learning · #vision-language models · #joint optimization

XR: Cross-Modal Agents for Composed Image Retrieval

Beginner
Zhongyu Yang, Wei Pang et al. · Jan 20 · arXiv

XR is a new, training-free team of AI helpers that finds images using both a reference picture and a short text edit (like “same jacket but red”).

#Composed Image Retrieval · #cross-modal reasoning · #multi-agent system

What Users Leave Unsaid: Under-Specified Queries Limit Vision-Language Models

Beginner
Dasol Choi, Guijin Son et al. · Jan 7 · arXiv

Real people often ask vague questions with pictures, and today’s vision-language models (VLMs) struggle with them.

#vision-language models · #under-specified queries · #query explicitation

ThinkRL-Edit: Thinking in Reinforcement Learning for Reasoning-Centric Image Editing

Beginner
Hengjia Li, Liming Jiang et al. · Jan 6 · arXiv

ThinkRL-Edit teaches an image editor to think first and draw second, which makes tricky, reasoning-heavy edits much more accurate.

#reasoning-centric image editing · #reinforcement learning · #chain-of-thought

Beyond Memorization: A Multi-Modal Ordinal Regression Benchmark to Expose Popularity Bias in Vision-Language Models

Beginner
Li-Zhong Szu-Tu, Ting-Lin Wu et al. · Dec 24 · arXiv

The paper builds YearGuessr, a large worldwide photo-and-text dataset of 55,546 buildings with their construction years (1001–2024), GPS coordinates, and popularity (page views).

#YearGuessr · #building age estimation · #ordinal regression