How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers: 7

#multimodal evaluation

AgentIF-OneDay: A Task-level Instruction-Following Benchmark for General AI Agents in Daily Scenarios

Beginner
Kaiyuan Chen, Qimin Wu et al. · Jan 28 · arXiv

This paper builds a new test called AgentIF-OneDay that checks if AI helpers can follow everyday instructions the way people actually give them.

#AgentIF-OneDay #instruction following #AI agents

Everything in Its Place: Benchmarking Spatial Intelligence of Text-to-Image Models

Beginner
Zengbin Wang, Xuecai Hu et al. · Jan 28 · arXiv

Text-to-image models draw pretty pictures, but often put things in the wrong places or miss how objects interact.

#text-to-image #spatial intelligence #occlusion

Rethinking Composed Image Retrieval Evaluation: A Fine-Grained Benchmark from Image Editing

Intermediate
Tingyu Song, Yanzhao Zhang et al. · Jan 22 · arXiv

This paper introduces EDIR, a new and much more detailed test for Composed Image Retrieval (CIR), where you search for a target image using a starting image plus a short text describing the change you want.

#Composed Image Retrieval #EDIR #fine-grained benchmark

Agent-as-a-Judge

Beginner
Runyang You, Hongru Cai et al. · Jan 8 · arXiv

This survey explains how AI judges are changing from single smart readers (LLM-as-a-Judge) into full-on agents that can plan, use tools, remember, and work in teams (Agent-as-a-Judge).

#Agent-as-a-Judge #LLM-as-a-Judge #multi-agent collaboration

T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation

Beginner
Zhe Cao, Tao Wang et al. · Dec 24 · arXiv

T2AV-Compass is a new, unified test to fairly grade AI systems that turn text into matching video and audio.

#Text-to-Audio-Video generation #multimodal evaluation #cross-modal alignment

Video Reality Test: Can AI-Generated ASMR Videos fool VLMs and Humans?

Intermediate
Jiaqi Wang, Weijia Wu et al. · Dec 15 · arXiv

This paper builds a new test called Video Reality Test to see if AI-made ASMR videos can fool both people and vision-language models (VLMs), the AI systems that watch video.

#ASMR #audio-visual coupling #AI-generated video detection

The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality

Beginner
Aileen Cheng, Alon Jacovi et al. · Dec 11 · arXiv

The FACTS Leaderboard is a four-part test that checks how truthful AI models are across images, memory, web search, and document grounding.

#LLM factuality #benchmarking #multimodal evaluation