How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (17)

#benchmark

Learning Situated Awareness in the Real World

Intermediate
Chuhan Li, Ruilin Han et al. · Feb 18 · arXiv

SAW-Bench is a new test that checks if AI can understand the world from a first-person view, like wearing smart glasses.

#situated awareness #egocentric video #observer-centric reasoning

ResearchGym: Evaluating Language Model Agents on Real-World AI Research

Intermediate
Aniketh Garikaparthi, Manasi Patwardhan et al. · Feb 16 · arXiv

ResearchGym is a new "gym" where AI agents are tested on real research projects end to end, not just on toy problems.

#ResearchGym #closed-loop research #objective evaluation

EcoGym: Evaluating LLMs for Long-Horizon Plan-and-Execute in Interactive Economies

Intermediate
Xavier Hu, Jinxiang Xia et al. · Feb 10 · arXiv

EcoGym is a new open test playground where AI agents run small businesses over many days to see if they can plan well for the long term.

#EcoGym #long-horizon planning #LLM agents

CL-bench: A Benchmark for Context Learning

Beginner
Shihan Dou, Ming Zhang et al. · Feb 3 · arXiv

CL-bench is a new test that checks whether AI can truly learn new things from the information you give it right now, not just from what it memorized before.

#context learning #benchmark #rubric-based evaluation

Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models

Intermediate
Yu Zeng, Wenxuan Huang et al. · Feb 2 · arXiv

The paper introduces VDR-Bench, a new test with 2,000 carefully built questions that truly require both seeing (images) and reading (web text) to find answers.

#multimodal large language model #visual question answering #vision deep research

AgentIF-OneDay: A Task-level Instruction-Following Benchmark for General AI Agents in Daily Scenarios

Beginner
Kaiyuan Chen, Qimin Wu et al. · Jan 28 · arXiv

This paper builds a new test called AgentIF-OneDay that checks if AI helpers can follow everyday instructions the way people actually give them.

#AgentIF-OneDay #instruction following #AI agents

WorldVQA: Measuring Atomic World Knowledge in Multimodal Large Language Models

Intermediate
Runjie Zhou, Youbo Shao et al. · Jan 28 · arXiv

WorldVQA is a new test that checks if multimodal AI models can correctly name what they see in pictures without doing extra reasoning.

#WorldVQA #atomic visual knowledge #multimodal large language models

Everything in Its Place: Benchmarking Spatial Intelligence of Text-to-Image Models

Beginner
Zengbin Wang, Xuecai Hu et al. · Jan 28 · arXiv

Text-to-image models draw pretty pictures, but often put things in the wrong places or miss how objects interact.

#text-to-image #spatial intelligence #occlusion

AVMeme Exam: A Multimodal Multilingual Multicultural Benchmark for LLMs' Contextual and Cultural Knowledge and Thinking

Intermediate
Xilin Jiang, Qiaolin Wang et al. · Jan 25 · arXiv

AVMeme Exam is a new test made by humans that checks if AI can understand famous internet audio and video clips the way people do.

#AVMeme Exam #multimodal large language models #audio-visual memes

Rethinking Video Generation Model for the Embodied World

Beginner
Yufan Deng, Zilin Pan et al. · Jan 21 · arXiv

Robots need videos that not only look pretty but also follow real-world physics and finish the task asked of them.

#robot video generation #embodied AI #benchmark

PROGRESSLM: Towards Progress Reasoning in Vision-Language Models

Intermediate
Jianshu Zhang, Chengxuan Qian et al. · Jan 21 · arXiv

This paper asks a new question for vision-language models: not just 'What do you see?' but 'How far along is the task right now?'

#progress reasoning #vision-language models #episodic retrieval

Patient-Similarity Cohort Reasoning in Clinical Text-to-SQL

Intermediate
Yifei Shen, Yilun Zhao et al. · Jan 14 · arXiv

This paper introduces CLINSQL, a 633-task benchmark that turns real clinician-style questions into SQL challenges over the MIMIC-IV v3.1 hospital database.

#clinical text-to-SQL #EHR #MIMIC-IV