🎓How I Study AIHISA
📖Read
📄Papers📰Blogs🎬Courses
💡Learn
🛤️Paths📚Topics💡Concepts🎴Shorts
🎯Practice
🧩Problems🎯Prompts🧠Review
Search
How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers25

AllBeginnerIntermediateAdvanced
All SourcesarXiv
#benchmarking

COMPASS: A Framework for Evaluating Organization-Specific Policy Alignment in LLMs

Intermediate
Dasol Choi, DongGeon Lee et al.Jan 5arXiv

COMPASS is a new framework that turns a company’s rules into thousands of smart test questions to check if chatbots follow those rules.

#policy alignment#allowlist denylist#enterprise AI safety

Video-Browser: Towards Agentic Open-web Video Browsing

Beginner
Zhengyang Liang, Yan Shu et al.Dec 28arXiv

The paper tackles how AI agents can truly research the open web when the answers are hidden inside long, messy videos, not just text.

#agentic video browsing#pyramidal perception#video understanding

T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation

Beginner
Zhe Cao, Tao Wang et al.Dec 24arXiv

T2AV-Compass is a new, unified test to fairly grade AI systems that turn text into matching video and audio.

#Text-to-Audio-Video generation#multimodal evaluation#cross-modal alignment

InSight-o3: Empowering Multimodal Foundation Models with Generalized Visual Search

Intermediate
Kaican Li, Lewei Yao et al.Dec 21arXiv

This paper builds a tough new test called O3-BENCH to check if AI can truly think with images, not just spot objects.

#multimodal reasoning#generalized visual search#reinforcement learning

SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios

Intermediate
Minh V. T. Thai, Tue Le et al.Dec 20arXiv

SWE-EVO is a new test (benchmark) that checks if AI coding agents can upgrade real software projects over many steps, not just fix one small bug.

#SWE-EVO#software evolution#coding agents

4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation

Intermediate
Chiao-An Yang, Ryo Hachiuma et al.Dec 18arXiv

This paper teaches a video-understanding AI to think in 3D plus time (4D) so it can answer questions about specific objects moving in videos.

#4D perception#multimodal large language models#perceptual distillation

Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning

Intermediate
Yifei Li, Wenzhao Zheng et al.Dec 17arXiv

Skyra is a detective-style AI that spots tiny visual mistakes (artifacts) in videos to tell if they are real or AI-generated, and it explains its decision with times and places in the video.

#AI-generated video detection#artifact reasoning#multimodal large language model

GRAN-TED: Generating Robust, Aligned, and Nuanced Text Embedding for Diffusion Models

Intermediate
Bozhou Li, Sihan Yang et al.Dec 17arXiv

This paper is about making the words you type into a generator turn into the right pictures and videos more reliably.

#diffusion models#text encoder#multimodal large language model

MMGR: Multi-Modal Generative Reasoning

Intermediate
Zefan Cai, Haoyi Qiu et al.Dec 16arXiv

MMGR is a new benchmark that checks whether AI image and video generators follow real-world rules, not just whether their outputs look pretty.

#multi-modal generative reasoning#video generation evaluation#physical commonsense

ShowTable: Unlocking Creative Table Visualization with Collaborative Reflection and Refinement

Intermediate
Zhihang Liu, Xiaoyi Bao et al.Dec 15arXiv

ShowTable is a new way for AI to turn a data table into a beautiful, accurate infographic using a think–make–check–fix loop.

#creative table visualization#multimodal large language model#diffusion model

DentalGPT: Incentivizing Multimodal Complex Reasoning in Dentistry

Intermediate
Zhenyang Cai, Jiaming Zhang et al.Dec 12arXiv

DentalGPT is a special AI that looks at dental images and text together and explains what it sees like a junior dentist.

#DentalGPT#multimodal large language model#dentistry AI

The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality

Beginner
Aileen Cheng, Alon Jacovi et al.Dec 11arXiv

The FACTS Leaderboard is a four-part test that checks how truthful AI models are across images, memory, web search, and document grounding.

#LLM factuality#benchmarking#multimodal evaluation
123