🎓How I Study AIHISA
📖Read
📄Papers📰Blogs🎬Courses
💡Learn
🛤️Paths📚Topics💡Concepts🎴Shorts
🎯Practice
📝Daily Log🎯Prompts🧠Review
SearchSettings
How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers41

AllBeginnerIntermediateAdvanced
All SourcesarXiv
#vision-language models

Utonia: Toward One Encoder for All Point Clouds

Intermediate
Yujia Zhang, Xiaoyang Wu et al.Mar 3arXiv

Utonia is a single brain (encoder) that learns from many kinds of 3D point clouds, like indoor rooms, outdoor streets, tiny toys, and even city maps.

#Utonia#point cloud#self-supervised learning

AgilePruner: An Empirical Study of Attention and Diversity for Adaptive Visual Token Pruning in Large Vision-Language Models

Intermediate
Changwoo Baek, Jouwon Song et al.Mar 1arXiv

Big picture: Vision-language models look at hundreds of image pieces (tokens), which makes them slow and sometimes chatty with mistakes called hallucinations.

#visual token pruning#attention-based pruning#diversity-based pruning

Compositional Generalization Requires Linear, Orthogonal Representations in Vision Embedding Models

Intermediate
Arnas Uselis, Andrea Dittadi et al.Feb 27arXiv

The paper asks a simple question: what must a vision model’s internal pictures (embeddings) look like if it can recognize new mixes of things it already knows?

#compositional generalization#linear representation hypothesis#orthogonal representations

Retrieve and Segment: Are a Few Examples Enough to Bridge the Supervision Gap in Open-Vocabulary Segmentation?

Intermediate
Tilemachos Aravanis, Vladan Stojnić et al.Feb 26arXiv

This paper teaches an AI to segment any object you name (open-vocabulary) much better by adding a few example pictures with pixel labels and smart retrieval.

#open-vocabulary segmentation#vision-language models#retrieval-augmented

From Perception to Action: An Interactive Benchmark for Vision Reasoning

Beginner
Yuhao Wu, Maojia Song et al.Feb 24arXiv

The paper introduces CHAIN, a hands-on 3D playground that tests if AI can not only see objects but also plan and act under real physics.

#interactive benchmark#vision-language models#physical reasoning

TOPReward: Token Probabilities as Hidden Zero-Shot Rewards for Robotics

Intermediate
Shirui Chen, Cole Harrison et al.Feb 22arXiv

Robots learn better when they get small hints at every step instead of only a final thumbs-up or thumbs-down.

#TOPReward#token probabilities#logits

AI Gamestore: Scalable, Open-Ended Evaluation of Machine General Intelligence with Human Games

Intermediate
Lance Ying, Ryan Truong et al.Feb 19arXiv

The paper argues that the fairest way to check how generally smart an AI is, is to see how quickly and well it learns lots of different human-made games, just like a person with the same time and practice.

#general intelligence#evaluation benchmark#game-based testing

DODO: Discrete OCR Diffusion Models

Beginner
Sean Man, Roy Ganz et al.Feb 18arXiv

OCR is like reading a page exactly as it is, and that strictness makes it perfect for fast, parallel generation.

#OCR#vision-language models#discrete diffusion

ExStrucTiny: A Benchmark for Schema-Variable Structured Information Extraction from Document Images

Intermediate
Mathieu Sibue, Andres Muñoz Garza et al.Feb 12arXiv

ExStrucTiny is a new test (benchmark) that checks if AI can pull many connected facts from all kinds of documents and neatly put them into JSON, even when the question style and schema change.

#structured information extraction#document understanding#vision-language models

DeepImageSearch: Benchmarking Multimodal Agents for Context-Aware Image Retrieval in Visual Histories

Intermediate
Chenlong Deng, Mengjie Deng et al.Feb 11arXiv

Most image search systems judge each photo by itself, which fails when clues are split across many photos taken over time.

#context-aware image retrieval#multimodal agents#visual history exploration

NarraScore: Bridging Visual Narrative and Musical Dynamics via Hierarchical Affective Control

Beginner
Yufan Wen, Zhaocheng Liu et al.Feb 9arXiv

NarraScore turns a video's changing story into a matching soundtrack by using emotion as the bridge.

#video-to-music generation#affective computing#valence-arousal

Unveiling Implicit Advantage Symmetry: Why GRPO Struggles with Exploration and Difficulty Adaptation

Intermediate
Zhiqi Yu, Zhangquan Chen et al.Feb 5arXiv

The paper finds a hidden symmetry inside GRPO’s advantage calculation that accidentally stops models from exploring new good answers and from paying the right attention to easy versus hard problems at the right times.

#GRPO#GRAE#A-GRAE
1234