🎓How I Study AIHISA
📖Read
📄Papers📰Blogs🎬Courses
💡Learn
🛤️Paths📚Topics💡Concepts🎴Shorts
🎯Practice
📝Daily Log🎯Prompts🧠Review
SearchSettings
How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers8

AllBeginnerIntermediateAdvanced
All SourcesarXiv
#Vision Transformer

Locality-Attending Vision Transformer

Intermediate
Sina Hajimiri, Farzad Beizaee et al.Mar 5arXiv

Vision Transformers (ViTs) are great at recognizing what is in a whole image but often blur the tiny details needed to label each pixel (segmentation).

#Vision Transformer#self-attention#segmentation

VidEoMT: Your ViT is Secretly Also a Video Segmentation Model

Intermediate
Narges Norouzi, Idil Esen Zulfikar et al.Feb 19arXiv

VidEoMT shows that a single, well‑trained Vision Transformer (ViT) can segment and track objects in videos without extra tracking gadgets.

#Video Segmentation#Vision Transformer#Encoder-only

Masked Depth Modeling for Spatial Perception

Intermediate
Bin Tan, Changjiang Sun et al.Jan 25arXiv

The paper turns the 'holes' (missing spots) in depth camera images into helpful training hints instead of treating them as garbage.

#Masked Depth Modeling#RGB-D cameras#Depth completion

OpenVision 3: A Family of Unified Visual Encoder for Both Understanding and Generation

Intermediate
Letian Zhang, Sucheng Ren et al.Jan 21arXiv

OpenVision 3 is a single vision encoder that learns one set of image tokens that work well for both understanding images (like answering questions) and generating images (like making new pictures).

#Unified Visual Encoder#VAE#Vision Transformer

Implicit Neural Representation Facilitates Unified Universal Vision Encoding

Intermediate
Matthew Gwilliam, Xiao Wang et al.Jan 20arXiv

This paper introduces HUVR, a single vision model that can both recognize what’s in an image and reconstruct or generate images from tiny codes.

#Implicit Neural Representation#Hyper-Networks#Vision Transformer

LightOnOCR: A 1B End-to-End Multilingual Vision-Language Model for State-of-the-Art OCR

Intermediate
Said Taghadouini, Adrien Cavaillès et al.Jan 20arXiv

LightOnOCR-2-1B is a single, compact AI model that reads PDF pages and scans and turns them into clean, well-ordered text without using fragile multi-step OCR pipelines.

#LightOnOCR-2-1B#end-to-end OCR#vision-language model

InfiniDepth: Arbitrary-Resolution and Fine-Grained Depth Estimation with Neural Implicit Fields

Intermediate
Hao Yu, Haotong Lin et al.Jan 6arXiv

InfiniDepth is a new way to predict depth that treats every image location as a smooth, continuous place you can ask for depth, not just the fixed pixels of a grid.

#monocular depth estimation#neural implicit fields#arbitrary resolution depth

Towards Scalable Pre-training of Visual Tokenizers for Generation

Intermediate
Jingfeng Yao, Yuda Song et al.Dec 15arXiv

The paper tackles a paradox: visual tokenizers that get great pixel reconstructions often make worse images when used for generation.

#visual tokenizer#latent space#Vision Transformer