The paper asks a simple question: what must a vision model's internal pictures (embeddings) look like for it to recognize new mixes of things it already knows?
Similarity-based image–text models like CLIP can be fooled by "half-truths": adding one plausible but wrong detail to a caption can make it score as more similar to the image, not less.
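To make that failure mode concrete, here is a minimal sketch (not from the paper) that scores one image against a faithful caption and a "half-truth" caption using the Hugging Face CLIP API; the blank stand-in image and both captions are hypothetical, and whether the half-truth actually wins depends on the image.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224), "white")  # stand-in; use a real photo in practice
captions = [
    "a dog running on the beach",                      # faithful caption
    "a dog running on the beach carrying a red ball",  # adds one plausible but wrong detail
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # scaled image-text cosine similarities
print(dict(zip(captions, logits[0].tolist())))
# The "half-truth" failure: the partly wrong caption can come out MORE similar.
```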
WoG (World Guidance) teaches a robot to imagine just the right bits of the near future and use those predictions to pick better actions.
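As a hypothetical PyTorch sketch of that idea, not WoG's actual architecture: a small world model predicts a compact future feature, and the policy conditions on both the current observation and that prediction (all module names and sizes below are assumptions).

```python
import torch
import torch.nn as nn

OBS_DIM, FUT_DIM, ACT_DIM = 128, 32, 7  # assumed sizes

class WorldModel(nn.Module):
    """Predicts a compact feature summarizing the relevant near future."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(OBS_DIM, 256), nn.ReLU(), nn.Linear(256, FUT_DIM))
    def forward(self, obs):
        return self.net(obs)

class FutureConditionedPolicy(nn.Module):
    """Picks actions from the current observation plus the imagined future."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(OBS_DIM + FUT_DIM, 256), nn.ReLU(), nn.Linear(256, ACT_DIM))
    def forward(self, obs, future):
        return self.net(torch.cat([obs, future], dim=-1))

world_model, policy = WorldModel(), FutureConditionedPolicy()
obs = torch.randn(1, OBS_DIM)   # current observation features
future = world_model(obs)       # imagine the relevant bits of the near future
action = policy(obs, future)    # choose an action using that prediction
```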
Robots often act like goldfish with short memories; HiF-VLA fixes this by letting them use motion to remember the past and predict the future.
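A toy illustration of motion-as-memory, assuming nothing about HiF-VLA's real design: frame differences stand in for motion features, a pooled motion summary serves as cheap memory of the past, and a small head extrapolates future motion for the action head to use.

```python
import torch
import torch.nn as nn

H = W = 64  # assumed frame size

frames = torch.randn(5, 3, H, W)           # short history of RGB frames
motion = frames[1:] - frames[:-1]          # crude motion features: frame differences

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * H * W, 64))
past_motion = encoder(motion).mean(dim=0)  # compact memory of how things moved
current = encoder(frames[-1:]).squeeze(0)  # features of the latest frame

predict_future_motion = nn.Linear(64 + 64, 64)  # look-ahead head (hypothetical)
pick_action = nn.Linear(64 + 64 + 64, 7)        # action head: current + past + future

future = predict_future_motion(torch.cat([current, past_motion]))
action = pick_action(torch.cat([current, past_motion, future]))
```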
This paper shows that we can turn big, smart vision features into a small, easy-to-use code for image generation with just one attention layer.
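Here is a minimal sketch of that compression step under assumed names and sizes (the paper's actual layer may differ): a few learned query tokens cross-attend once over frozen vision-encoder features, producing a short, low-dimensional code for a generator to condition on.

```python
import torch
import torch.nn as nn

feat_dim, num_feats = 1024, 256  # e.g., patch features from a large vision encoder
code_dim, code_len = 64, 16      # the small, easy-to-use code

queries = nn.Parameter(torch.randn(code_len, feat_dim))  # learned query tokens
attn = nn.MultiheadAttention(embed_dim=feat_dim, num_heads=8, batch_first=True)
proj = nn.Linear(feat_dim, code_dim)

features = torch.randn(1, num_feats, feat_dim)  # frozen encoder output (one image)
q = queries.unsqueeze(0).expand(1, -1, -1)      # one set of queries per image
compressed, _ = attn(q, features, features)     # the single attention layer
code = proj(compressed)                         # (1, 16, 64) code for generation
print(code.shape)
```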