Papers (8)

#Vision Transformer

Locality-Attending Vision Transformer

Sina Hajimiri, Farzad Beizaee et al. · Mar 5 · arXiv

Vision Transformers (ViTs) are great at recognizing what is in a whole image but often blur the tiny details needed to label each pixel (segmentation).

#Vision Transformer #self-attention #segmentation

VidEoMT: Your ViT is Secretly Also a Video Segmentation Model

Intermediate

Narges Norouzi, Idil Esen Zulfikar et al. · Feb 19 · arXiv

VidEoMT shows that a single, well-trained Vision Transformer (ViT) can segment and track objects in videos without dedicated tracking modules.

#Video Segmentation #Vision Transformer #Encoder-only

Masked Depth Modeling for Spatial Perception

Intermediate

Bin Tan, Changjiang Sun et al. · Jan 25 · arXiv

The paper turns the 'holes' (missing spots) in depth camera images into helpful training hints instead of treating them as garbage.

#Masked Depth Modeling #RGB-D cameras #Depth completion

OpenVision 3: A Family of Unified Visual Encoder for Both Understanding and Generation

Intermediate

Letian Zhang, Sucheng Ren et al. · Jan 21 · arXiv

OpenVision 3 is a single vision encoder that learns one set of image tokens that work well for both understanding images (like answering questions) and generating images (like making new pictures).

#Unified Visual Encoder #VAE #Vision Transformer

Implicit Neural Representation Facilitates Unified Universal Vision Encoding

Intermediate

Matthew Gwilliam, Xiao Wang et al. · Jan 20 · arXiv

This paper introduces HUVR, a single vision model that can both recognize what’s in an image and reconstruct or generate images from tiny codes.

#Implicit Neural Representation #Hyper-Networks #Vision Transformer

LightOnOCR: A 1B End-to-End Multilingual Vision-Language Model for State-of-the-Art OCR

Intermediate

Said Taghadouini, Adrien Cavaillès et al. · Jan 20 · arXiv

LightOnOCR-2-1B is a single, compact AI model that reads PDF pages and scans and turns them into clean, well-ordered text without using fragile multi-step OCR pipelines.

#LightOnOCR-2-1B #end-to-end OCR #vision-language model

InfiniDepth: Arbitrary-Resolution and Fine-Grained Depth Estimation with Neural Implicit Fields

Intermediate

Hao Yu, Haotong Lin et al. · Jan 6 · arXiv

InfiniDepth is a new way to predict depth that treats the image as a continuous field: depth can be queried at any location, not just at the fixed pixels of a grid.

#monocular depth estimation #neural implicit fields #arbitrary resolution depth

Towards Scalable Pre-training of Visual Tokenizers for Generation

Intermediate

Jingfeng Yao, Yuda Song et al. · Dec 15 · arXiv

The paper tackles a paradox: visual tokenizers that get great pixel reconstructions often make worse images when used for generation.

#visual tokenizer #latent space #Vision Transformer
