How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (6)

Tag: #visual question answering

ExStrucTiny: A Benchmark for Schema-Variable Structured Information Extraction from Document Images

Intermediate
Mathieu Sibue, Andres Muñoz Garza et al. · Feb 12 · arXiv

ExStrucTiny is a new test (benchmark) that checks whether AI can pull many connected facts from all kinds of document images and neatly arrange them into JSON, even when the question style and target schema change.

#structured information extraction · #document understanding · #vision-language models
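To make "schema-variable" concrete, here is a small sketch of what such an extraction task can look like. The schemas and the `vlm_extract` function are made-up illustrations, not the benchmark's actual format.

```python
# Hypothetical illustration of schema-variable structured extraction.
# Neither schema nor `vlm_extract` comes from the paper; they only show
# the task shape: the same pipeline must honor whatever schema arrives.

invoice_schema = {
    "vendor_name": "string",
    "invoice_date": "YYYY-MM-DD",
    "line_items": [{"description": "string", "amount": "number"}],
}

receipt_schema = {  # a different target schema, same extraction pipeline
    "merchant": "string",
    "total": "number",
}

def vlm_extract(document_image: bytes, schema: dict) -> dict:
    """Placeholder for a vision-language model call that must return
    JSON conforming to whatever schema it is handed at query time."""
    raise NotImplementedError("swap in a real VLM client here")
```

The benchmark's point is that the model is scored on how well its JSON output tracks the schema even as the schema changes from query to query.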

Causal-JEPA: Learning World Models through Object-Level Latent Interventions

Beginner
Heejeong Nam, Quentin Le Lidec et al. · Feb 11 · arXiv

This paper introduces Causal-JEPA (C-JEPA), a world model that learns by hiding entire objects in its internal representation and forcing itself to predict them from the other objects.

#C-JEPA · #object-centric world model · #object-level masking
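A rough sketch of the object-level masking idea, assuming a slot-style encoder has already produced per-object latents. The module, shapes, and loss below are illustrative guesses, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

# Illustrative sketch of object-level latent masking: hide one object's
# latent and predict it from the remaining objects (JEPA-style).

class ObjectPredictor(nn.Module):
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(dim))
        self.attn = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True
        )

    def forward(self, slots: torch.Tensor, masked_idx: int) -> torch.Tensor:
        # slots: (batch, num_objects, dim) object-centric latents
        x = slots.clone()
        x[:, masked_idx] = self.mask_token   # hide the chosen object
        return self.attn(x)[:, masked_idx]   # predict it from the rest

batch, n_obj, dim = 8, 5, 64
slots = torch.randn(batch, n_obj, dim)       # stand-in for encoded objects
target = slots[:, 2].detach()                # the hidden object's latent
pred = ObjectPredictor(dim)(slots, masked_idx=2)
loss = nn.functional.mse_loss(pred, target)  # regress in latent space
```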

Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models

Intermediate
Yu Zeng, Wenxuan Huang et al. · Feb 2 · arXiv

The paper introduces VDR-Bench, a new test with 2,000 carefully built questions that truly require both seeing (images) and reading (web text) to find answers.

#multimodal large language model · #visual question answering · #vision deep research

Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models

Intermediate
Wenxuan Huang, Yu Zeng et al. · Jan 29 · arXiv

The paper tackles a real problem: one-shot image or text searches often miss the right evidence (low hit rate), especially in noisy, cluttered pictures.

#multimodal deep research · #visual question answering · #ReAct reasoning
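The "#ReAct reasoning" tag suggests an iterative think-act-observe search loop rather than a single query. Below is a minimal sketch of that general pattern; the `llm_step`, `image_search`, and `text_search` functions are hypothetical placeholders, not the paper's actual method.

```python
# Minimal ReAct-style loop: alternate reasoning with tool calls until the
# model believes it has gathered enough evidence to answer.

def llm_step(history: list[dict]) -> dict:
    """Placeholder: returns {'thought': str, 'action': str, 'arg': str}
    or {'answer': str} from a multimodal LLM given the trajectory."""
    raise NotImplementedError

def image_search(query: str) -> str: ...   # placeholder visual-search tool
def text_search(query: str) -> str: ...    # placeholder web-search tool

TOOLS = {"image_search": image_search, "text_search": text_search}

def deep_research(question: str, max_steps: int = 8) -> str:
    history = [{"role": "user", "content": question}]
    for _ in range(max_steps):               # repeated searches, not one shot
        step = llm_step(history)
        if "answer" in step:                 # model decided evidence suffices
            return step["answer"]
        observation = TOOLS[step["action"]](step["arg"])
        history.append({"role": "assistant", "content": step["thought"]})
        history.append({"role": "tool", "content": observation})
    return "no answer found within budget"
```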

What Users Leave Unsaid: Under-Specified Queries Limit Vision-Language Models

Beginner
Dasol Choi, Guijin Son et al. · Jan 7 · arXiv

Real people often ask vague questions with pictures, and today’s vision-language models (VLMs) struggle with them.

#vision-language models · #under-specified queries · #query explicitation
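The "#query explicitation" tag points at rewriting a vague query into an explicit one before answering. Here is a minimal sketch of that two-stage pattern, assuming a hypothetical `vlm` callable; it is not the paper's actual pipeline.

```python
# Two-stage query explicitation sketch: first rewrite the under-specified
# question using the image as context, then answer the explicit version.

def vlm(image: bytes, prompt: str) -> str:
    raise NotImplementedError("swap in a real vision-language model client")

def answer_with_explicitation(image: bytes, vague_query: str) -> str:
    explicit_query = vlm(
        image,
        "Rewrite this question so it is fully specified, filling in what "
        f"the user likely meant, given the image: {vague_query}",
    )
    return vlm(image, explicit_query)

# e.g. "Is this okay?" with a photo of a meal might become
# "Is this meal nutritionally balanced for one adult?" before answering.
```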

Differences That Matter: Auditing Models for Capability Gap Discovery and Rectification

Intermediate
Qihao Liu, Chengzhi Mao et al. · Dec 18 · arXiv

AuditDM is a friendly 'auditor' model that hunts for cases where vision-language models get things wrong and then generates the right practice examples to fix them.

#AuditDM · #model auditing · #cross-model divergence
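The "#cross-model divergence" tag suggests that capability gaps are surfaced where two models disagree on the same input. Below is a minimal sketch of that auditing signal, with hypothetical model callables; it is not the paper's actual algorithm.

```python
# Sketch of cross-model divergence auditing: ask two models the same
# visual question and flag items where their answers disagree as
# candidate capability gaps worth generating practice data for.

from dataclasses import dataclass

@dataclass
class AuditItem:
    image: bytes
    question: str

def model_a(image: bytes, question: str) -> str:
    raise NotImplementedError  # placeholder for the first VLM

def model_b(image: bytes, question: str) -> str:
    raise NotImplementedError  # placeholder for the second VLM

def find_divergences(items: list[AuditItem]) -> list[AuditItem]:
    flagged = []
    for item in items:
        a = model_a(item.image, item.question).strip().lower()
        b = model_b(item.image, item.question).strip().lower()
        if a != b:                 # disagreement marks a candidate gap
            flagged.append(item)
    return flagged
```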