How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (807)


M3DR: Towards Universal Multilingual Multimodal Document Retrieval

Intermediate
Adithya S Kolavi, Vyoman Jain · Dec 3 · arXiv

The paper introduces M3DR, a way for computers to find the right document image no matter which of 22 languages the query or the document uses.

#multilingual retrieval #multimodal retrieval #document image search

SPARK: Stepwise Process-Aware Rewards for Reference-Free Reinforcement Learning

Intermediate
Salman Rahman, Sruthi Gorantla et al. · Dec 2 · arXiv

SPARK teaches AI to grade its own steps without needing the right answers written down anywhere.

#SPARK #Process Reward Model #PRM-CoT

Self-Improving VLM Judges Without Human Annotations

Intermediate
Inna Wanyin Lin, Yushi Hu et al. · Dec 2 · arXiv

The paper shows how a vision-language model (VLM) can train itself to be a fair judge of answers about images without using any human preference labels.

#vision-language model #VLM judge #reward model

Fairy2i: Training Complex LLMs from Real LLMs with All Parameters in $\{\pm 1, \pm i\}$

Intermediate
Feiyu Wang, Xinyu Tan et al. · Dec 2 · arXiv

Fairy2i turns any pre-trained real-valued Transformer layer into an exactly equivalent complex-valued form, so the layer's outputs are unchanged before quantization.

#LLM quantization #complex-valued neural networks #widely-linear transformation

ReVSeg: Incentivizing the Reasoning Chain for Video Segmentation with Reinforcement Learning

Intermediate
Yifan Li, Yingda Yin et al. · Dec 2 · arXiv

ReVSeg teaches an AI to segment objects in videos by thinking step-by-step instead of guessing everything at once.

#Reasoning Video Object Segmentation #Vision-Language Models #Temporal Grounding

PaCo-RL: Advancing Reinforcement Learning for Consistent Image Generation with Pairwise Reward Modeling

Intermediate
Bowen Ping, Chengyou Jia et al. · Dec 2 · arXiv

This paper uses reinforcement learning (RL) to teach image models to keep things consistent across multiple pictures, such as the same character, art style, and story logic.

#consistent image generation #pairwise reward modeling #reinforcement learning

From Imitation to Discrimination: Toward A Generalized Curriculum Advantage Mechanism Enhancing Cross-Domain Reasoning Tasks

Intermediate
Changpeng Yang, Jinyang Wu et al. · Dec 2 · arXiv

This paper teaches AI models to reason better by first copying only good examples and later learning from mistakes too.

#Curriculum Advantage Policy Optimization #advantage-based RL #imitation learning

See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models

Intermediate
Le Thien Phuc Nguyen, Zhuoran Yu et al. · Dec 1 · arXiv

This paper introduces AV-SpeakerBench, a new benchmark that checks whether AI can truly see, hear, and understand who is speaking, what they say, and when they say it in real videos.

#audiovisual reasoning #speaker attribution #temporal grounding

Reinventing Clinical Dialogue: Agentic Paradigms for LLM Enabled Healthcare Communication

Intermediate
Xiaoquan Zhi, Hongke Zhao et al. · Dec 1 · arXiv

Clinical conversations are special because they mix caring feelings with precise medical facts, and old AI systems struggled to do both at once.

#clinical dialogue #agentic AI #large language models

RealGen: Photorealistic Text-to-Image Generation via Detector-Guided Rewards

Intermediate
Junyan Ye, Leiqi Zhu et al. · Nov 29 · arXiv

RealGen is a new way to make computer-made pictures look so real that they can fool expert detectors and even careful judges.

#photorealistic text-to-image #detector-guided rewards #reinforcement learning

Visual Generation Tuning

Intermediate
Jiahao Guo, Sinan Du et al. · Nov 28 · arXiv

Before this work, big vision-language models (VLMs) were great at understanding pictures and words together but not at making new pictures.

#Visual Generation Tuning #VGT-AE #Vision-Language Models

VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction

Intermediate
Sinan Du, Jiahao Guo et al. · Nov 28 · arXiv

VQRAE is a new kind of image tokenizer that lets one model both understand images (continuous features) and generate/reconstruct them (discrete tokens).

#VQRAE #Vector Quantization #Representation Autoencoder