How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (7)

#visual reasoning

LoopViT: Scaling Visual ARC with Looped Transformers

Intermediate
Wen-Jie Shu, Xuerui Qiu et al. · Feb 2 · arXiv

LoopViT is a vision model that thinks in loops, so it can take more steps on hard puzzles and stop early on easy ones.

#ARC-AGI #visual reasoning #Looped Transformer

AdaReasoner: Dynamic Tool Orchestration for Iterative Visual Reasoning

Intermediate
Mingyang Song, Haoyu Sun et al. · Jan 26 · arXiv

AdaReasoner teaches AI to pick the right visual tools, use them in the right order, and stop using them when they aren’t helping.

#AdaReasoner #dynamic tool orchestration #multimodal large language models

CoF-T2I: Video Models as Pure Visual Reasoners for Text-to-Image Generation

Intermediate
Chengzhuo Tong, Mingkun Chang et al. · Jan 15 · arXiv

This paper turns a video model into a step-by-step visual thinker that makes one final, high-quality picture from a text prompt.

#Chain-of-Frame #visual reasoning #text-to-image

BabyVision: Visual Reasoning Beyond Language

Intermediate
Liang Chen, Weichu Xie et al. · Jan 10 · arXiv

BabyVision is a new test that checks if AI can handle the same basic picture puzzles that young children can do, without leaning on language tricks.

#BabyVision #visual reasoning #multimodal large language models

Latent Implicit Visual Reasoning

Intermediate
Kelvin Li, Chuyi Shang et al. · Dec 24 · arXiv

Large Multimodal Models (LMMs) are great at reading text and looking at pictures, but they usually do most of their thinking in words, which limits deep visual reasoning.

#Latent Implicit Visual Reasoning #latent tokens #bottleneck attention masking

Puzzle Curriculum GRPO for Vision-Centric Reasoning

Intermediate
Ahmadreza Jeddi, Hakki Can Karaimer et al. · Dec 16 · arXiv

This paper teaches vision-language models to reason about pictures using puzzles instead of expensive human labels.

#vision-language models #reinforcement learning #group-relative policy optimization

Thinking with Images via Self-Calling Agent

Intermediate
Wenxi Yang, Yuzhong Zhao et al. · Dec 9 · arXiv

This paper teaches a vision-language model to think about images by talking to copies of itself, using only words to plan and decide.

#Self-Calling Chain-of-Thought #sCoT #interleaved multimodal chain-of-thought