How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (8)

Tag: #vision-language models

From Perception to Action: An Interactive Benchmark for Vision Reasoning

Beginner
Yuhao Wu, Maojia Song et al. · Feb 24 · arXiv

The paper introduces CHAIN, a hands-on 3D playground that tests whether AI can not only see objects but also plan and act under real physics.

#interactive benchmark · #vision-language models · #physical reasoning

DODO: Discrete OCR Diffusion Models

Beginner
Sean Man, Roy Ganz et al. · Feb 18 · arXiv

OCR means reading a page exactly as it is, and that strictness makes it a natural fit for fast, parallel generation with discrete diffusion.

#OCR · #vision-language models · #discrete diffusion

NarraScore: Bridging Visual Narrative and Musical Dynamics via Hierarchical Affective Control

Beginner
Yufan Wen, Zhaocheng Liu et al. · Feb 9 · arXiv

NarraScore turns a video's changing story into a matching soundtrack by using emotion as the bridge.

#video-to-music generation · #affective computing · #valence-arousal

Kimi K2.5: Visual Agentic Intelligence

Beginner
Kimi Team, Tongtong Bai et al. · Feb 2 · arXiv

Kimi K2.5 is a new open-source AI that can read both text and visuals (images and videos) and works like a team of helpers to finish big tasks faster.

#multimodal learning · #vision-language models · #joint optimization

XR: Cross-Modal Agents for Composed Image Retrieval

Beginner
Zhongyu Yang, Wei Pang et al. · Jan 20 · arXiv

XR is a new, training-free team of AI helpers that finds images using both a reference picture and a short text edit (like “same jacket but red”).

#Composed Image Retrieval · #cross-modal reasoning · #multi-agent system

What Users Leave Unsaid: Under-Specified Queries Limit Vision-Language Models

Beginner
Dasol Choi, Guijin Son et al. · Jan 7 · arXiv

Real people often ask vague questions with pictures, and today’s vision-language models (VLMs) struggle with them.

#vision-language models · #under-specified queries · #query explicitation

ThinkRL-Edit: Thinking in Reinforcement Learning for Reasoning-Centric Image Editing

Beginner
Hengjia Li, Liming Jiang et al. · Jan 6 · arXiv

ThinkRL-Edit teaches an image editor to think first and draw second, which makes tricky, reasoning-heavy edits much more accurate.

#reasoning-centric image editing · #reinforcement learning · #chain-of-thought

Beyond Memorization: A Multi-Modal Ordinal Regression Benchmark to Expose Popularity Bias in Vision-Language Models

Beginner
Li-Zhong Szu-Tu, Ting-Lin Wu et al. · Dec 24 · arXiv

The paper builds YearGuessr, a large worldwide photo-and-text dataset of 55,546 buildings with their construction years (1001–2024), GPS coordinates, and popularity (page views).

#YearGuessr · #building age estimation · #ordinal regression