Papers4

#DINO

Locality-Attending Vision Transformer

Sina Hajimiri, Farzad Beizaee et al.Mar 5arXiv

Vision Transformers (ViTs) are great at recognizing what is in a whole image but often blur the tiny details needed to label each pixel (segmentation).

#Vision Transformer#self-attention#segmentation

Not triaged yet

Compositional Generalization Requires Linear, Orthogonal Representations in Vision Embedding Models

Intermediate

Arnas Uselis, Andrea Dittadi et al.Feb 27arXiv

The paper asks a simple question: what must a vision model’s internal pictures (embeddings) look like if it can recognize new mixes of things it already knows?

#compositional generalization#linear representation hypothesis#orthogonal representations

Not triaged yet

Context Forcing: Consistent Autoregressive Video Generation with Long Context

Intermediate

Shuo Chen, Cong Wei et al.Feb 5arXiv

The paper fixes a big problem in long video generation: models either forget what happened or slowly drift off-topic over time.

#autoregressive video generation#long-context modeling#distribution matching distillation

Not triaged yet

Relational Visual Similarity

Intermediate

Thao Nguyen, Sicheng Mo et al.Dec 8arXiv

Most image-similarity tools only notice how things look (color, shape, class) and miss deeper, human-like connections.

Not triaged yet