How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (6)

#visual tokens

SeeThrough3D: Occlusion Aware 3D Control in Text-to-Image Generation

Vaibhav Agrawal, Rishubh Parihar et al. · Feb 26 · arXiv

SeeThrough3D teaches image generators to understand what should be visible and what should be hidden when objects overlap, just like in real life.

#occlusion-aware generation, #3D layout control, #text-to-image

CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding

Yuling Shi, Chaoxiang Xie et al. · Feb 2 · arXiv

The paper tests a simple but bold idea: show code to AI as pictures instead of plain text, then shrink those pictures to save tokens and time.

#multimodal language models, #code as images, #visual code understanding

LatentLens: Revealing Highly Interpretable Visual Tokens in LLMs

Benno Krojer, Shravan Nayak et al. · Jan 31 · arXiv

LatentLens is a simple, training-free way to translate what a model "sees" in image patches into clear words and phrases.

#LatentLens, #visual tokens, #contextual embeddings

DeepSeek-OCR 2: Visual Causal Flow

Haoran Wei, Yaofeng Sun et al. · Jan 28 · arXiv

DeepSeek-OCR 2 teaches a computer to “read” pictures of documents in a smarter order, more like how people read.

#DeepSeek-OCR 2, #DeepEncoder V2, #visual tokens

AgentOCR: Reimagining Agent History via Optical Self-Compression

Lang Feng, Fuchao Yang et al. · Jan 8 · arXiv

AgentOCR turns an agent’s long text history into pictures so it can remember more using fewer tokens.

#AgentOCR, #optical self-compression, #visual tokens

VTCBench: Can Vision-Language Models Understand Long Context with Vision-Text Compression?

Hongbo Zhao, Meng Wang et al. · Dec 17 · arXiv

Long texts are expensive for AI to read because each extra token costs a lot of compute and memory.

#vision-text compression, #VTCBench, #vision-language models
