Papers (4)

#Vision-language models

UniG2U-Bench: Do Unified Models Advance Multimodal Understanding?

This paper builds UniG2U-Bench, a big test to find out when making pictures (generation) actually helps models understand pictures and text together.

#Unified multimodal models #Vision-language models #Generation-to-Understanding (G2U)

Forest Before Trees: Latent Superposition for Efficient Visual Reasoning

Intermediate

Yubo Wang, Juntian Zhang et al. · Jan 11 · arXiv

This paper introduces Laser, a new way for vision-language models to think in their hidden space before speaking, so they see the whole “forest” before picking out the “trees.”

#Latent reasoning #Dynamic Windowed Alignment Learning #Dynamic Semantic Windows

V-REX: Benchmarking Exploratory Visual Reasoning via Chain-of-Questions

Intermediate

Chenrui Fan, Yijun Liang et al. · Dec 12 · arXiv

This paper introduces V-REX, a new benchmark that tests how AI systems reason about images through step-by-step exploration, not just final answers.

#V-REX #Chain-of-Questions #Exploratory visual reasoning

From Segments to Scenes: Temporal Understanding in Autonomous Driving via Vision-Language Model

Intermediate

Kevin Cannons, Saeed Ranjbar Alvar et al. · Dec 4 · arXiv

This paper builds TAD, a brand-new test that checks if AI can understand what happens over time in real driving videos.

#Temporal understanding #Autonomous driving #Vision-language models
